*any two-tailed hypothesis is non-scientific because it cannot be falsified*. In this post, I further consider what effect sizes we should expect if the null hypothesis is true, and why that matters for how we interpret the results of a replication attempt.

In the original thought experiment, a study found a significant effect with an effect size of d = 0.5, but a larger replication found an effect size of d = 0.1. Some commenters argued that we should treat the second study as a replication because it produced an effect in the same direction. But, as I noted in my earlier post, if the null hypothesis were true, a sign-only criterion would count half of all replication attempts as successes, because a null result lands in the same direction as the original 50% of the time (Uri Simonsohn made the same point in a recent talk). Let's flesh that out a bit, because the problem is even worse than that.

Even when the null hypothesis is true, we should not expect to find d = 0. Let's assume for a moment that the null hypothesis is true—you have two populations whose means and standard deviations actually are identical to infinite precision. Next, let's assume that you sample 15 people at random from each population, for a total sample of 30 participants, and you compare the means of those two samples. Remember, there is no difference between the populations, so the null hypothesis is true. (Note, I'm asking about the absolute value of the effect size—how big an effect would you expect to find, ignoring the direction of the effect?) Before reading further, if you had to guess, how big an effect should you expect to find?

__Answer__: the median effect size in this case is approximately d = 0.25. If the null hypothesis of no difference is actually true, you'll find an effect larger in magnitude than d = 0.25 fifty percent of the time! In fact, you would expect to find an effect size bigger than d = 0.43 about 25% of the time. In other words, you'll find a lot of spurious medium-size effects.
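To make that concrete, here is a minimal simulation sketch of the 15-per-group case. It is not the code behind the numbers above; the function name `simulate_abs_d`, the number of repetitions, and the seed are assumptions chosen for illustration. It draws both groups from the same standard normal population and reports the quartiles of the absolute effect size.

```python
import random
import statistics


def simulate_abs_d(n_per_group, reps=5000, seed=1):
    """Simulate |Cohen's d| for two groups drawn from the same N(0, 1) population."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    abs_d = []
    for _ in range(reps):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        # Pooled SD for equal group sizes: root of the mean of the two sample variances.
        pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
        abs_d.append(abs(statistics.mean(a) - statistics.mean(b)) / pooled_sd)
    return abs_d


abs_d = simulate_abs_d(15)
# quantiles(n=4) returns the three cut points: 25th, 50th, and 75th percentiles.
q1, median, q3 = statistics.quantiles(abs_d, n=4)
print(f"median |d| = {median:.2f}, 75th percentile = {q3:.2f}")
```

Calling `simulate_abs_d(20)` or `simulate_abs_d(100)` gives the corresponding figures for larger groups, up to simulation noise.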

Now, 15 subjects/group is not an unusual study size in psychology, but it is a little on the small side. What if we had 20/group? The median effect size then would be d = 0.21, and 25% of the time you'd expect to find an effect d > 0.36. Again, with a typical psychology study sample size, you should expect to find some sizable effects even when the null hypothesis is true. What if we had a large sample size, say 100 subjects/group for a total of 200 subjects? When the null hypothesis of no difference is true, the median effect size would be d = 0.096, and more than 25% of the time the effect size would exceed d = 0.16.

Here is a graph illustrating the median absolute effect size (with the 25th and 75th percentiles in red) as a function of sample size when there is no difference between the populations.

In all cases, both groups are drawn from a standard normal distribution with a mean of 0 and standard deviation of 1, so the null hypothesis of no difference is true. (The values in the graph could be derived analytically, but I was playing with simulation code, so I just did it that way.) Note that small sample sizes tend to produce bigger and more variable effect size estimates than do large samples.
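For readers who want the analytic route, here is a sketch under the same assumptions as the simulation (two independent groups of equal size n, normal populations, pooled standard deviation). The symbols below are my notation for this sketch, not part of the original analysis. With equal group sizes, Cohen's d is a rescaled t statistic, and under the null hypothesis t follows a central t distribution with 2n - 2 degrees of freedom:

$$
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{2/n}}
\quad\Longrightarrow\quad
d = t \sqrt{\frac{2}{n}}
$$

$$
\operatorname{median}\,|d| = t_{0.75,\,2n-2}\,\sqrt{\frac{2}{n}}, \qquad
Q_{75}\,|d| = t_{0.875,\,2n-2}\,\sqrt{\frac{2}{n}}
$$

Here $t_{p,\,\nu}$ is the p-th quantile of the t distribution with $\nu$ degrees of freedom. For large n the t quantiles approach the normal quantiles 0.674 and 1.150, so with n = 100 the two expressions come out to roughly 0.095 and 0.163, in line with the values quoted above.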

What does this mean? First, and not surprisingly, you need a big sample to reliably find a small effect. Second, if you have a small sample, you're far more likely to find a spuriously large effect. Typical psychology studies lack the power to detect tiny effects, and even with fairly large samples (by psychology standards), you will find a non-zero effect size when the null hypothesis is true. Even with 100 subjects per group, an effect of about d = 0.1 in magnitude is roughly what you should expect to see when there is no real difference. So, for practical purposes in a typical psychology study, an effect size between -0.1 and +0.1 is indiscriminable from an effect size of zero, and we probably should treat it as if it were zero. At most, we should treat it as only suggestive evidence of an effect, and not as confirmation.

__To conclude__: If you care only about the sign of a replication attempt, then when there actually is no effect at all, you will mistakenly conclude that a replication supported the original finding 50% of the time. For that reason, I think it's necessary to consider both the sign and the size of a replication effect when evaluating whether it supports the conclusions of an original result. (Of course, an even better approach might estimate a confidence interval around the replication effect size to determine whether it includes the original effect size. The bigger the sample size, the smaller the confidence interval. Bayesian estimation of the effect size would be better as well.)
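As a rough illustration of the confidence-interval idea, the sketch below uses the common large-sample approximation for the standard error of d. The thought experiment never states the replication's sample size, so the 100 participants per group used here is purely an assumption, as is the helper name `d_confidence_interval`.

```python
import math


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d using the usual large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# Hypothetical replication: d = 0.1 with 100 participants per group (sample size assumed).
low, high = d_confidence_interval(0.1, 100, 100)
original_d = 0.5  # effect size reported by the original study in the thought experiment

print(f"95% CI: [{low:.2f}, {high:.2f}]")
print("includes original effect:", low <= original_d <= high)
print("includes zero:", low <= 0 <= high)
```

Under these assumed numbers, the interval covers zero but not the original d = 0.5, which is the pattern a sign-only criterion would miss.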

This is an interesting thought experiment. I'm not sure it contains the right numbers. What you are looking at is the probability, given that the first experiment was a false positive, that a second experiment of a given sample size will be in the same direction.

I think that typically what we care about is the non-conditional probability: what is the joint probability that the first result was a false positive and the second one is in the same direction? Answering that question involves a lot more guesswork (we need to know the prior probability of false positives, which we don't, and people's guesstimates seem to run pretty much the whole spectrum).

Josh -- I don't think the thought experiment makes any assumptions at all about the truth value of the initial study. Rather, it just asks what the odds of getting an effect size of a given magnitude/sign would be if the null hypothesis were true. The larger thought experiment asks whether a second study can be considered to have replicated the pattern of the first, regardless of whether the first result was true or a false positive. It isn't a conditional probability because I'm not making any assumptions about whether the initial result is real or a false positive. I'm just asking whether a second study replicates the pattern shown in the first one. We don't know the ground truth about the size of the actual effect in this thought experiment (just as we can only estimate it through experimental results in reality).

In a real-world situation, we just don't know whether an initial result is a false positive or even an accurate estimate of an effect. And no one study can tell us whether the original was a false positive. The broader goal, then, is to estimate the true underlying effect size. The best approach, without pre-existing knowledge of the ground truth, is to conduct a meta-analysis across many studies. If the cumulative effect size approaches zero across many studies, then the original likely was a false positive, but we never know that for certain (we're estimating reality).

My point in this thought experiment is to note that the sign of an effect in a replication does not provide strong confirmation of the original effect.