Tuesday, February 12, 2013

What counts as a successful replication?


A recent discussion among cognitive and social psychologists prompted me to think about the nature of replications. What does it mean to call a replication "successful" or a failure? Below I write about a thought experiment that you can try for yourself. Please post your thoughts in the comments!

What counts as a replication and what counts as a failure to replicate? The way those terms have been used in the field, a success is a study that finds a significant effect in the same direction and a failure is one that doesn't. But that is a lousy criterion on which to evaluate the similarity of two findings.

As a thought experiment, take the case of a somewhat underpowered study that finds a significant effect (p=.049) with an effect size of d=.50. Which of the following should count as a failure to replicate?

  1. A study that has the same sample sizes as the original that does not find a significant difference (p > .05) but that produces an effect size of d=.48 in the same direction.
  2. A study with 50% larger sample sizes that significantly rejects the null (p<.05), with an effect size estimate of d=.40.
  3. A study with massive power (10x the original sample size) that does not significantly reject the null and produces an effect size of d=0.
  4. A study with massive power (10x the original sample size) that significantly rejects the null (p<.05), but produces an effect size estimate of d=.10 in the same direction.


(For the sake of this thought experiment, let's assume the studies were conducted well and used the same protocol as the original study. Heck, it's a thought experiment -- let's have them be studies from the same lab run with the same RAs so there's no risk of lab differences).


    By the standard in the literature, study 1 would be treated as a failure to replicate because it was not statistically significant at p<.05. But that makes no sense at all. The effect size estimate is nearly identical. Just because one result is significant and the other isn't does not mean that the two are different from each other. There is no meaningful difference in terms of the effect size estimate for a study that produces p=.049 and one that produces p=.051 (with the same design and sample size). This example reveals the problem with just tallying up significant and non-significant results.
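To make case 1 concrete, here is a small sketch using an assumed n = 32 per group (the post doesn't give a sample size; 32 is chosen so that d=.50 lands just under p=.05). It uses a normal approximation to the two-sample t-test, and it also tests the difference between the two effect size estimates directly — the point being that "significant" and "not significant" estimates can themselves be statistically indistinguishable.

```python
# Sketch of case 1, with an ASSUMED n = 32 per group (not from the post).
# Normal approximation to the two-sample t-test.
import math
from statistics import NormalDist

n = 32  # per group (assumption)

def se_d(d, n):
    """Approximate standard error of Cohen's d with equal group sizes."""
    return math.sqrt(2 / n + d ** 2 / (4 * n))

def p_value(d, n):
    """Two-sided p-value for an observed d against zero (normal approx.)."""
    z = d / se_d(d, n)
    return 2 * (1 - NormalDist().cdf(z))

p1 = p_value(0.50, n)   # original study: just under .05
p2 = p_value(0.48, n)   # replication: just over .05

# Test the *difference* between the two effect size estimates directly
z_diff = (0.50 - 0.48) / math.sqrt(se_d(0.50, n) ** 2 + se_d(0.48, n) ** 2)
p_diff = 2 * (1 - NormalDist().cdf(abs(z_diff)))
print(p1, p2, p_diff)   # ~.049, ~.058, ~.96
```

One study clears the p<.05 bar and the other misses it, yet a direct comparison of the two estimates finds essentially no difference between them.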

    Take case number 2. I think most researchers would agree that such a study would count as a successful replication. It is statistically significant (p<.05) and the effect size is in the same range as the original. It has greater power, so the effect size estimate will be more precise (and accurate). In other words, it is qualitatively similar to the original result. We should expect direct replications to find smaller effects on average due to publication biases, even if the original effect is real and substantial. That's one reason why replications must have greater power than an original study—we should assume that the original study overestimated the size of the effect. There's no individual fault or blame for a slightly reduced effect size estimate in this sort of case—it's a consequence of our field's tendency to publish only statistically significant findings.
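The inflation from publication bias is easy to simulate. The sketch below uses assumed numbers (a true effect of d=.40 and n = 32 per group, neither from the post): if only significant results get published, the average published effect size overshoots the truth.

```python
# Simulation sketch (all numbers ASSUMED, not from the post): selecting
# only significant results inflates the average published effect size.
import math
import random
import statistics

random.seed(1)
true_d, n = 0.40, 32            # assumed true effect and per-group n
se = math.sqrt(2 / n)           # approx. standard error of the d estimate
crit = 1.96 * se                # two-sided p<.05 cutoff for the estimate

estimates = [random.gauss(true_d, se) for _ in range(100_000)]
# Keep only positive significant results (wrong-sign significance is
# negligible here); this mimics the publication filter.
published = [d for d in estimates if d > crit]

print(statistics.mean(estimates))   # close to 0.40: unbiased before selection
print(statistics.mean(published))   # noticeably larger than 0.40
```

Before the significance filter the estimates average out to the true effect; after it, the "published" average is substantially inflated — which is exactly why a well-run replication should be expected to come in a bit lower than the original.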

    Take case number 3. I think all of us would agree that such a study is a failure to replicate by any standard we can come up with. It provides a highly precise estimate of the effect size (zero) that shows no effect in either direction and is not statistically significant. I'm hoping nobody would quibble with this one.

    But what do we make of case #4? With massive power, the effect size estimate is highly precise. The effect size is in the same direction as the original and significantly different from zero. But it is substantially smaller. Here we run into the problem of what it means to be qualitatively the same. Would you treat this result as a replication because it is in the same direction and significantly different from zero? Or should it be treated as a failure to replicate because it does not find the same magnitude of effect? What counts as too different to be treated as qualitatively the same? I don't believe there is a definitively right answer to this question. There is an effect, but it is substantially smaller than the original estimate and requires much greater sample sizes to detect it reliably. How should the field treat such a finding?
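"Requires much greater sample sizes" can be made concrete with a back-of-the-envelope power calculation (normal approximation; the 80% power and two-sided alpha=.05 conventions are my assumptions, not the post's):

```python
# Sketch: per-group n needed for 80% power at two-sided alpha = .05,
# comparing the original d = .50 with the replicated d = .10.
# Normal approximation; conventions (80%, .05) are assumed.
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_b = NormalDist().inv_cdf(power)           # ~0.84
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2)

print(n_per_group(0.50))   # ~63 per group
print(n_per_group(0.10))   # ~1570 per group
```

Because required n scales with 1/d², an effect of d=.10 takes (0.50/0.10)² = 25 times as many participants to detect reliably as one of d=.50 — a real effect, perhaps, but one that most labs could never afford to study.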