Saturday, March 9, 2013

Further thoughts on what counts as a replication

Last month I posted a replication thought experiment that I hoped would provoke an interesting discussion of what counts as a replication. It did. Today I want to flesh out why I think one of the most common interpretations doesn't work. In short, if you follow it to its logical conclusion, any two-tailed hypothesis is inherently unscientific!

I wasn't surprised that opinions were mixed about the crucial case: 
An original underpowered study produces a significant effect (p=.049) with an effect size of d=.50. A replication attempt uses much greater power (10x the original sample size) and significantly rejects the null (p<.05) with a much smaller effect size in the same direction (d=.10).
Commenters fell into three camps:

  1. The new result replicates the original because the effect was in the same direction.
  2. The new result partially replicates the original because it was in the same direction, but it is not the same effect because it is meaningfully smaller. 
  3. The new result does not replicate the original because the new result is meaningfully smaller, and therefore it does not have the same theoretical/practical meaning as the original. 

Although there is no objectively right interpretation of this result, I do think the first interpretation has some theoretical ramifications that even its proponents might not like. When coupled with the logic of null hypothesis significance testing, a not-so-appealing conclusion falls out by necessity: Any conclusion based on a two-tailed hypothesis test is unfalsifiable, and therefore not scientific!

Here's the logic:

  • A two-tailed hypothesis predicts that two groups will differ, but it does not predict the direction of difference, and either direction would be interesting. Two-tailed significance tests are common in psychology.
  • The null hypothesis of no difference is never true. Two groups may appear to produce the same mean, but only because measurement precision is limited. With infinite precision, the groups will differ. That is a property of any continuous distribution: The probability of any exact value on that distribution is 0 (this is a matter of math, not methods). This issue is one reason that many people object to null hypothesis significance testing, but we don't need to consider that debate here.
  • Given that the null is never true when measured with infinite precision, the measured effect will always fall on one side or the other of the null. And, with a large enough sample, that difference will be statistically significant.
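The last bullet can be made concrete with a back-of-the-envelope power calculation. This is just a sketch using the standard normal approximation to a two-sample test at alpha = .05; the true effect size of d=.02 is an arbitrary stand-in for a "tiny but nonzero" difference:

```python
import math

def normal_cdf(x):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_tailed(d, n_per_group, crit_z=1.96):
    """Approximate power of a two-tailed two-sample test to detect
    a true standardized effect d with n observations per group."""
    z = d * math.sqrt(n_per_group / 2.0)  # expected value of the test statistic
    # Probability of landing in either rejection region
    return normal_cdf(z - crit_z) + normal_cdf(-z - crit_z)

print(power_two_tailed(0.02, 100))        # barely above alpha: almost no power
print(power_two_tailed(0.02, 1_000_000))  # essentially certain to reject
```

With 100 participants per group the test almost never rejects, but with a million per group rejection is all but guaranteed: given a nonzero true difference, a large enough sample will always produce a significant result.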
And here's the problem:
  • For a two-tailed hypothesis, any significant deviation from zero counts as a replication.
  • With enough power, the measured effect will always differ from zero.
  • Therefore, no result can falsify the hypothesis.
  • A hypothesis that cannot be falsified is non-scientific.
  • Therefore, two-tailed hypotheses and their accompanying tests are not scientific.
In some cases, that conclusion seems reasonable. For example, proponents of ESP will accept performance that is significantly better or worse than chance as support for the existence of ESP. But, in other cases, two-tailed hypotheses seem reasonable and scientifically grounded. Consequently, we must challenge one of the premises.

Let's assume for now that we accept the logic of null hypothesis significance testing. The best approach, in my view, would be to differentiate between tiny effect sizes and more sizable ones. If we are willing to say that effects near zero are functionally equivalent to no effect and different from large effects, then we can avoid the logical conclusion that two-tailed tests are inherently unscientific. 
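One way to operationalize "functionally equivalent to no effect" is an equivalence band: call an effect functionally zero only if its confidence interval fits entirely inside a band of negligible effect sizes. Here is a minimal sketch, using a common large-sample approximation to the standard error of Cohen's d for two equal-sized groups; the band of ±0.1 is an arbitrary illustration, not a recommended threshold:

```python
import math

def d_within_band(d_hat, n_per_group, band=0.1, z=1.96):
    """Does the 95% CI for Cohen's d lie entirely inside (-band, band)?

    Uses the standard large-sample approximation to the standard
    error of d for two groups of equal size n.
    """
    se = math.sqrt(2.0 / n_per_group + d_hat**2 / (4.0 * n_per_group))
    lower, upper = d_hat - z * se, d_hat + z * se
    return -band < lower and upper < band

print(d_within_band(0.02, 5000))  # tiny effect, big sample: functionally zero
print(d_within_band(0.50, 5000))  # sizable effect: clearly not
```

Note that by this criterion a tiny effect can be both significantly different from zero and functionally equivalent to zero, which is exactly the distinction the argument needs.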

But once we make that assumption, it applies to one-tailed hypotheses as well, effectively ruling out interpretation #1 from our thought experiment. We have to treat near-zero effect sizes as failures to replicate large effect sizes, even if they fall on the same side of zero and are significantly different from zero.

Another reason to make that assumption is that, even when the null hypothesis is true, half of all measured effects will fall in the same direction as the original effect purely by chance. If any effect in the same direction counts as a replication, then we would "replicate" an original false positive 50% of the time. That seems problematic as well.
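That 50% figure is easy to check by simulation. This sketch assumes two groups of 20 drawn from the same normal distribution, so the null is exactly true and any observed difference is noise:

```python
import random
import statistics

random.seed(42)  # for reproducibility

def mean_difference(n_per_group=20):
    """Mean difference between two groups drawn from the SAME distribution."""
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(0, 1) for _ in range(n_per_group)]
    return statistics.mean(a) - statistics.mean(b)

trials = 2000
same_direction = sum(mean_difference() > 0 for _ in range(trials))
print(same_direction / trials)  # hovers around 0.50
```

Roughly half of the simulated "replications" land on the positive side of zero, despite there being no effect at all to replicate.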

If I've made an error in my reasoning or you see a way to salvage the idea that an infinitely small effect in the same direction as an original effect counts as replicating that effect, I would love to hear about it in the comments.