Daniel Simons: Further thoughts on what counts as a replication

Saturday, March 9, 2013

Last month I posted a replication thought experiment that I hoped would provoke an interesting discussion of what counts as a replication. It did. Today I want to flesh out why I think one of the most common interpretations doesn't work. In short, if you follow it to its logical conclusion, any two-tailed hypothesis is inherently unscientific!

I wasn't surprised that opinions were mixed about the crucial case:

An original underpowered study produces a significant effect (p=.049) with an effect size of d=.50. A replication attempt uses much greater power (10x the original sample size) and significantly rejects the null (p<.05) with a much smaller effect size in the same direction (d=.10).

Commenters fell into three camps:

1. The new result replicates the original because the effect was in the same direction.
2. The new result partially replicates the original because it was in the same direction, but it is not the same effect because it is meaningfully smaller.
3. The new result does not replicate the original because the new result is meaningfully smaller, and therefore it does not have the same theoretical/practical meaning as the original.

Although there is no objectively right interpretation of this result, I do think the first interpretation has some theoretical ramifications that even its proponents might not like. When coupled with the logic of null hypothesis significance testing, a not-so-appealing conclusion falls out by necessity: Any conclusion based on a two-tailed hypothesis test is unfalsifiable, and therefore not scientific!

Here's the logic:

A two-tailed hypothesis predicts that two groups will differ, but it does not predict the direction of difference, and either direction would be interesting. Two-tailed significance tests are common in psychology.
The null hypothesis of no difference is never exactly true. Two groups may produce the same mean, but only with a limited level of measurement precision. With infinite precision, the groups will differ. That is a property of any continuous distribution: The probability of any exact value on that distribution is 0 (this is a matter of math, not methods). This issue is one reason that many people object to null hypothesis significance testing, but we don't need to consider that debate here.
Given that the null is never true when measured with infinite precision, the measured effect will always fall on one side or the other of the null. And, with a large enough sample, that difference will be statistically significant.
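
To make the last step concrete, here is a rough sketch in Python. It assumes two equal-sized groups and uses a normal approximation to the two-sample test; the function name two_tailed_p and the effect size of d=.001 are my own illustrative choices, not values from any real study. It shows the two-tailed p-value for the same tiny measured effect at ever larger sample sizes.

```python
from statistics import NormalDist


def two_tailed_p(d, n_per_group):
    """Approximate two-tailed p-value for a measured standardized
    difference d between two equal-sized groups.

    Assumption: normal approximation to the two-sample t-test,
    so the test statistic is z = d * sqrt(n / 2).
    """
    z = abs(d) * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical tiny effect, held fixed while the sample grows.
tiny_d = 0.001
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n per group = {n:>11,}  p = {two_tailed_p(tiny_d, n):.3g}")
```

The p-value shrinks as the sample grows even though the effect itself never changes, which is all the argument needs.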

And here's the problem:

For a two-tailed hypothesis, any significant deviation from zero counts as a replication.
With enough power, the measured effect will always differ from zero.
Therefore, no result can falsify the hypothesis.
A hypothesis that cannot be falsified is non-scientific.
Therefore, two-tailed hypotheses and their accompanying tests are not scientific.

In some cases, that conclusion seems reasonable. For example, proponents of ESP will accept performance that is significantly better or worse than chance as support for the existence of ESP. But, in other cases, two-tailed hypotheses seem reasonable and scientifically grounded. Consequently, we must challenge one of the premises.

Let's assume for now that we accept the logic of null hypothesis significance testing. The best approach, in my view, would be to differentiate between tiny effect sizes and more sizable ones. If we are willing to say that effects near zero are functionally equivalent to no effect and different from large effects, then we can avoid the logical conclusion that two-tailed tests are inherently unscientific.

But once we make that assumption, it applies to one-tailed hypotheses as well, effectively ruling out interpretation #1 from our thought experiment. We have to treat near-zero effect sizes as failures to replicate large effect sizes, even if they fall on the same side of zero and are significantly different from zero.

Another reason to make that assumption is that, even when the null hypothesis is true, the effect in a replication will fall in the same direction as the original effect 50% of the time by chance. So if any effect in the same direction counts as a replication, we would "replicate" an original false positive half the time. That seems problematic as well.
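
The 50% figure is easy to check with a small simulation. This is only a sketch under assumptions of my own: both groups are drawn from the same standard normal distribution (so the null is true by construction), and the function name, group size, and number of simulated study pairs are hypothetical.

```python
import random


def same_direction_rate(n_per_group=20, n_pairs=2000, seed=1):
    """Fraction of simulated original/replication pairs whose mean
    differences share a sign when the null is true."""
    rng = random.Random(seed)

    def mean_difference():
        # Both groups come from the same distribution: no true effect.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        return sum(a) / n_per_group - sum(b) / n_per_group

    same = 0
    for _ in range(n_pairs):
        original = mean_difference()
        replication = mean_difference()
        if original * replication > 0:  # same side of zero
            same += 1
    return same / n_pairs


print(same_direction_rate())
```

The rate should hover around one half, whatever the group size, because with no true effect the sign of each study's difference is a coin flip.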

If I've made an error in my reasoning or you see a way to salvage the idea that an infinitely small effect in the same direction as an original effect counts as replicating that effect, I would love to hear about it in the comments.

12 comments:

Unknown, 3/11/2013 04:01:00 AM
I would say the main problem with that argument is the notion of 'replication' as referring to the result, rather than to the process of replicating. Replication is the process of repeating another experiment to some extent, as closely as possible in the case of direct replication and with variations in the case of conceptual replications. How to interpret the result of the replication depends on how different it is from the original *and* in what ways the replication was different from the original experiment. The problem lies with that last aspect: two experiments are never exactly the same, and whether and in what way the differences are relevant depends on one's theory about what is happening in the experiment. Since one of those theories is the one being tested, there is no independent criterion to judge the adequacy of the replication or the meaning of its result. Falsificationism doesn't work at the frontier of science.

So yes: a very small effect in the same direction may be interpreted as somehow consistent with the original result. The best thing to do is to muddle on and do more experiments. 'Adversarial collaboration' between original researcher and critics is a good model, I think.
Anonymous, 3/11/2013 03:44:00 PM
I agree with everything in this post and wish to add one thought, concerning the statement "We have to treat near-zero effect sizes as failures to replicate..." Any study can have an effect that goes one direction, goes the other direction, or is so close to zero that it might as well be zero. But how close to zero is that? It depends on many things, including theoretical context (e.g., what other effects are competing as explanations for the phenomenon and what is their size?) and practical implications (e.g., how many lives will be saved by the widespread use of a treatment that has this effect size?).
The situation is a bit different when we are talking about replication. Here, the question is whether the new study obtained an effect size large enough to support the existence of the original effect. But how large is that? Here's an interesting possibility:
If the original finding was reported in the context of NHST, then an N was reported along with a critical p-value. What is the smallest effect size that would have attained that critical p-value, given that N? The answer to this question tells us what the original investigator, implicitly, is saying is the minimum effect size that would make the finding worth taking seriously. Because: if it had been smaller than that, the investigator wouldn't have reported it!
If this logic is correct, then it follows that a subsequent study that does not attain at least this effect size cannot count as being "big enough" to confirm the finding, and if the confidence interval around the new effect doesn't include the old effect size, then the original study can be said to have been disconfirmed!
Did I get anything wrong here?
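
One way to put rough numbers on this logic, as a sketch rather than a definitive procedure: the code below assumes a two-group design with equal group sizes, a normal approximation to the t-test, and the usual large-sample standard error for d. The function names and the sample sizes (32 per group for the original, 1,000 per group for the replication) are hypothetical choices for illustration only.

```python
from statistics import NormalDist


def critical_d(n_per_group, alpha=0.05):
    """Smallest standardized difference that reaches two-tailed
    significance at level alpha with n_per_group in each group
    (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit * (2 / n_per_group) ** 0.5


def d_confidence_interval(d, n_per_group, level=0.95):
    """Approximate confidence interval for an observed d, using the
    large-sample variance 2/n + d^2/(4n) for equal group sizes."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = (2 / n_per_group + d ** 2 / (4 * n_per_group)) ** 0.5
    return d - z * se, d + z * se


# Hypothetical original study: 32 participants per group.
threshold = critical_d(32)
# Hypothetical replication: d = .10 with 1,000 participants per group.
low, high = d_confidence_interval(0.10, 1000)

print(f"smallest reportable d in the original: {threshold:.2f}")
print(f"replication 95% CI: [{low:.3f}, {high:.3f}]")
print("replication CI falls below that threshold:", high < threshold)
```

Under these assumptions, a replication whose whole interval sits below the original study's threshold would count as disconfirming by the criterion described above.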
Anonymous, 3/11/2013 05:10:00 PM
Briefly: You're indeed right that I don't favor using p-values to establish the legitimacy of an effect. But that's the common practice. My only small point was that an investigator who plays that game is barring himself/herself from reporting any effect smaller than one sufficient to attain the critical p-level given the N, and therefore implicitly accepting that an effect smaller than that is one that should never see the light of day. Which, as much as anything, shows the flaw in the whole logic of NHST.
David Funder
Unknown, 3/13/2013 09:03:00 AM
Suppose you ran two experiments (the second being an attempted replication of the first) on different subjects and you want to know what their aggregated data tell us. As a Bayesian, you shouldn't care which came first. Data should be undated (with some exceptions). The first experiment is as much a replication of the second as vice versa. Thus the idea of replication is misleading. If you have two experiments that are similar enough to compare effects, you should think of experiments as a random effect, i.e., these experiments are a sample from an infinite population of possible experiments, with subjects nested within experiments. The tools for doing this are readily available in the R packages to do linear mixed-effects models, or programs such as HLM.