*any two-tailed hypothesis is non-scientific because it cannot be falsified*. In this post, I further consider what effect sizes we should expect if the null hypothesis is true, and why that matters for how we interpret the results of a replication attempt.

In the original thought experiment, a study found a significant effect with an effect size of d = 0.5, but a larger replication found an effect size of d = 0.1. Some commenters argued that we should treat the second study as a replication because it produced an effect in the same direction. But, as I noted in my earlier post, even if the null hypothesis were true, a same-direction criterion would count the second study as a successful replication 50% of the time (Uri Simonsohn made the same point in a recent talk). Let's flesh that out a bit, because the problem is even worse than that.

Even when the null hypothesis is true, we should not expect to find d = 0. Let's assume for a moment that the null hypothesis is true—you have two populations whose means and standard deviations actually are identical to infinite precision. Next, let's assume that you sample 15 people at random from each population, for a total sample of 30 participants, and you compare the means of those two samples. Remember, there is no difference between the populations, so the null hypothesis is true. (Note, I'm asking about the absolute value of the effect size—how big an effect would you expect to find, ignoring the direction of the effect?) Before reading further, if you had to guess, how big an effect should you expect to find?

__Answer__: the median absolute effect size in this case is approximately d = 0.25. If the null hypothesis of no difference is actually true, you'll find an effect larger in magnitude than d = 0.25 fifty percent of the time! In fact, you would expect to find an effect size bigger than d = 0.43 about 25% of the time. In other words, you'll find a lot of spurious medium-size effects.

Now, 15 subjects/group is not an unusual study size in psychology, but it is a little on the small side. What if we had 20/group? The median effect size then would be d = 0.21, and 25% of the time you'd expect to find an effect larger than d = 0.36. Again, with a typical psychology study sample size, you should expect to find some sizable effects even when the null hypothesis is true. What if we had a large sample size, say 100 subjects/group for a total of 200 subjects? When the null hypothesis of no difference is true, the median effect size would be d = 0.096, and about 25% of the time the effect size would exceed d = 0.16.
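These numbers can be roughly checked without simulating anything. The sketch below is an approximation, not the calculation behind the figures above. It assumes that, under the null with equal group sizes, d is approximately normal with mean 0 and standard deviation sqrt(2/n), where n is the number of subjects per group. The function name `approx_abs_d_quantile` is just an illustrative label. Because it ignores the extra uncertainty from estimating the standard deviation, it slightly underestimates the values for small samples.

```python
from statistics import NormalDist

def approx_abs_d_quantile(n_per_group, p):
    """Approximate p-th quantile of |d| when the null hypothesis is true.

    Assumes d ~ Normal(0, sqrt(2 / n_per_group)), a large-sample
    approximation for two equal-sized groups.
    """
    se = (2 / n_per_group) ** 0.5
    # P(|d| <= x) = p  is equivalent to  P(d <= x) = (1 + p) / 2
    return NormalDist().inv_cdf((1 + p) / 2) * se

for n in (15, 20, 100):
    median = approx_abs_d_quantile(n, 0.50)
    upper = approx_abs_d_quantile(n, 0.75)
    print(f"n = {n:3d}/group: median |d| ~ {median:.3f}, 75th percentile ~ {upper:.3f}")
```

The approximation lands within about 0.01 of the values quoted above for all three sample sizes.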

Here is a graph illustrating the median absolute effect size (with the 25th and 75th percentiles in red) as a function of sample size when there is no difference between the populations.

In all cases, both groups are drawn from a standard normal distribution with a mean of 0 and standard deviation of 1, so the null hypothesis of no difference is true. (The values in the graph could be derived analytically, but I was playing with simulation code, so I just did it that way.) Note that small sample sizes tend to produce bigger and more variable effect size estimates than do large samples.
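For anyone who wants to try this themselves, here is a minimal sketch of that kind of simulation. It is not the code used to make the graph. The function names, the number of simulated studies, and the random seed are all arbitrary choices for illustration. Each simulated study draws two groups from a standard normal distribution and computes Cohen's d using the pooled standard deviation.

```python
import random
import statistics

def mean_and_variance(xs):
    """Sample mean and unbiased sample variance of a list of numbers."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def cohens_d(a, b):
    """Cohen's d for two equal-sized groups, using the pooled SD."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    pooled_sd = ((var_a + var_b) / 2) ** 0.5
    return (mean_a - mean_b) / pooled_sd

def null_abs_d_quartiles(n_per_group, n_sims=4000, seed=1):
    """Simulate studies with no true effect; return quartiles of |d|."""
    rng = random.Random(seed)
    abs_d = []
    for _ in range(n_sims):
        # Both groups come from the same population, so the null is true.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        abs_d.append(abs(cohens_d(a, b)))
    q25, median, q75 = statistics.quantiles(abs_d, n=4)
    return q25, median, q75

for n in (15, 20, 100):
    q25, median, q75 = null_abs_d_quartiles(n)
    print(f"n = {n:3d}/group: 25th = {q25:.3f}, median = {median:.3f}, 75th = {q75:.3f}")
```

The exact output varies a little with the seed and the number of simulated studies, but the quartiles should come out close to the values quoted above.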

What does this mean? First, and not surprisingly, you need a big sample to reliably find a small effect. Second, if you have a small sample, you're far more likely to find a spuriously large effect. Typical psychology studies lack the power to detect tiny effects, and even with fairly large samples (by psychology standards), you will find a non-zero effect size when the null hypothesis is true. Even with 100 subjects per group, an effect size of about d = 0.1 in magnitude is a typical outcome when there is no real difference. So, for practical purposes in a typical psychology study, an effect size between -0.1 and +0.1 is indiscriminable from an effect size of zero, and we probably should treat it as if it were zero. At most, we should treat it as only suggestive evidence of an effect, and not as confirmation.

__To conclude__: If you care only about the sign of a replication effect, then when there actually is no effect at all, you will mistakenly conclude that a replication supported the original finding 50% of the time. For that reason, I think it's necessary to consider both the sign and size of a replication effect when evaluating whether it supports the conclusions of an original result. (Of course, an even better approach might estimate a confidence interval around the replication effect size to determine whether it includes the original effect size. The bigger the sample size, the smaller the confidence interval. Bayesian estimation of the effect size would be better as well.)
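To make the confidence-interval idea concrete, here is a rough sketch using a common large-sample approximation for the standard error of d. The replication sample size of 100 subjects per group is an assumption chosen only for illustration, and `approx_ci_for_d` is a made-up helper name. The normal-theory interval is also only approximate, so treat the numbers as a back-of-the-envelope check rather than a definitive analysis.

```python
from statistics import NormalDist

def approx_ci_for_d(d, n1, n2, level=0.95):
    """Approximate confidence interval for Cohen's d.

    Uses the large-sample variance approximation
    (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)).
    """
    se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

original_d = 0.5
# Hypothetical replication: d = 0.1 with an assumed 100 subjects per group.
low, high = approx_ci_for_d(0.1, 100, 100)
print(f"Replication 95% CI: [{low:.2f}, {high:.2f}]")
print("Includes original effect size:", low <= original_d <= high)
print("Includes zero:", low <= 0 <= high)
```

Under these assumed numbers, the interval includes zero but not the original d = 0.5, so the replication would not support the original effect size even though the sign matches.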