Saturday, March 9, 2013

Further thoughts on what counts as a replication

Last month I posted a replication thought experiment that I hoped would provoke an interesting discussion of what counts as a replication. It did. Today I want to flesh out why I think one of the most common interpretations doesn't work. In short, if you follow it to its logical conclusion, any two-tailed hypothesis is inherently unscientific!

I wasn't surprised that opinions were mixed about the crucial case: 
An original underpowered study produces a significant effect (p=.049) with an effect size of d=.50. A replication attempt uses much greater power (10x the original sample size) and significantly rejects the null (p<.05) with a much smaller effect size in the same direction (d=.10).
Commenters fell into three camps:

  1. The new result replicates the original because the effect was in the same direction.
  2. The new result partially replicates the original because it was in the same direction, but it is not the same effect because it is meaningfully smaller. 
  3. The new result does not replicate the original because the new result is meaningfully smaller, and therefore it does not have the same theoretical/practical meaning as the original. 

Although there is no objectively right interpretation of this result, I do think the first interpretation has some theoretical ramifications that even its proponents might not like. When coupled with the logic of null hypothesis significance testing, a not-so-appealing conclusion falls out by necessity: Any conclusion based on a two-tailed hypothesis test is unfalsifiable, and therefore not scientific!

Here's the logic:

  • A two-tailed hypothesis predicts that two groups will differ, but it does not predict the direction of difference, and either direction would be interesting. Two-tailed significance tests are common in psychology.
  • The null-hypothesis of no-difference is never true. Two groups may produce the same mean, but only with a limited level of measurement precision. With infinite precision, the groups will differ. That is a property of any continuous distribution: The probability of any exact value on that distribution is 0 (this is a matter of math, not methods). This issue is one reason that many people object to null-hypothesis significance testing, but we don't need to consider that debate here.
  • Given that the null is never true when measured with infinite precision, the measured effect will always fall on one side or the other of the null. And, with a large enough sample, that difference will be statistically significant.
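As a rough illustration of that last step, the sketch below holds a tiny standardized difference fixed and shows how its two-tailed p-value shrinks as the sample grows. It is only a sketch under stated assumptions: two equal-sized groups, a normal approximation to the test statistic, and an observed effect that happens to equal the true one. The function name `two_tailed_p` and the value d = 0.01 are illustrative choices, not anything from the original thought experiment.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(d, n_per_group):
    """Approximate two-tailed p-value for a standardized mean difference d
    between two equal-sized groups (normal approximation, assumed here)."""
    z = abs(d) * sqrt(n_per_group / 2)  # test statistic grows with sqrt(n)
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical tiny effect; only the sample size changes.
for n in (100, 10_000, 100_000):
    print(f"n per group = {n:>7}: p = {two_tailed_p(0.01, n):.4f}")
```

The same d = 0.01 is nowhere near significance with 100 participants per group, yet drops below .05 once each group has around 100,000.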
And here's the problem:
  • For a two-tailed hypothesis, any significant deviation from zero counts as a replication.
  • With enough power, the measured effect will always differ from zero.
  • Therefore, no result can falsify the hypothesis.
  • A hypothesis that cannot be falsified is non-scientific.
  • Therefore, two-tailed hypotheses and their accompanying tests are not scientific.
In some cases, that conclusion seems reasonable. For example, proponents of ESP will accept performance that is significantly better or worse than chance as support for the existence of ESP. But, in other cases, two-tailed hypotheses seem reasonable and scientifically grounded. Consequently, we must challenge one of the premises.

Let's assume for now that we accept the logic of null hypothesis significance testing. The best approach, in my view, would be to differentiate between tiny effect sizes and more sizable ones. If we are willing to say that effects near zero are functionally equivalent to no effect and different from large effects, then we can avoid the logical conclusion that two-tailed tests are inherently unscientific. 

But once we make that assumption, it applies to one-tailed hypotheses as well, effectively ruling out interpretation #1 from our thought experiment. We have to treat near-zero effect sizes as failures to replicate large effect sizes, even if they fall on the same side of zero and are significantly different from zero.

Another reason to make that assumption is that, even when the null hypothesis is true, a replication's effect will fall in the same direction as the original effect 50% of the time by chance. So if any effect in the same direction counts as a replication, we would "replicate" an original false positive half the time. That seems problematic as well.
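The 50% figure can be checked with a small simulation. The sketch below assumes two groups drawn from the same normal distribution (so the null is true) and an original false positive in which group A scored higher than group B. The function name, group size, and number of simulated replications are all hypothetical.

```python
import random


def same_direction_rate(n_per_group=20, n_sims=10_000, seed=1):
    """Fraction of simulated null replications in which group A's mean
    exceeds group B's, i.e. matches the assumed original direction."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_sims):
        # Both groups come from the same population: there is no true effect.
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        if mean_a > mean_b:
            same += 1
    return same / n_sims


print(same_direction_rate())  # should land close to 0.5
```

Under a "same direction is enough" criterion, roughly half of these null replications would count as successes.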

If I've made an error in my reasoning or you see a way to salvage the idea that an infinitely small effect in the same direction as an original effect counts as replicating that effect, I would love to hear about it in the comments.


  1. I would say the main problem with that argument is the notion of 'replication' as referring to the result, rather than to the process of replicating. Replication is the process of repeating another experiment to some extent, as closely as possible in the case of direct replication and with variations in the case of conceptual replications. How to interpret the result of the replication depends on how different it is from the original *and* in what ways the replication was different from the original experiment. The problem lies with that last aspect: two experiments are never exactly the same, and whether and in what way the differences are relevant depends on one's theory about what is happening in the experiment. Since one of those theories is the one being tested, there is no independent criterion to judge the adequacy of the replication or the meaning of its result. Falsificationism doesn't work at the frontier of science.

    So yes: a very small effect in the same direction may be interpreted as somehow consistent with the original result. The best thing to do is to muddle on and do more experiments. 'Adversarial collaboration' between original researcher and critics is a good model, I think.

    1. I don't really agree with your semantic distinction between the process and result. The word "replication" is used in both ways, and the context typically disambiguates. If I say that I'm going to try to replicate some result, I'm referring to the process of conducting a replication. If I say "that result replicates" I'm referring to the outcome of repeating the study. I don't follow why that necessarily leads to confusion.

      I agree that the interpretation of the results of a replication study must vary as a function of how that replication was conducted. If it was a conceptual replication, the interpretation should be different than for a direct replication. That said, I completely disagree with the claim that falsification doesn't work at the frontier of science. That is precisely where direct replication is needed. If direct replication attempts consistently fail to reproduce the same effect, then we know that the original effect, as described, is either wrong or the description was incomplete. Of course it's always possible that undescribed factors contributed to the original result, but the direct replications still constrain (falsify, in a sense) the generality of that result. I do think it is possible to judge the adequacy of a replication even if you don't know all of the factors that might matter in theory. A direct replication is an attempt to reproduce a result, and that should be possible for any scientific result, regardless of the richness of the theories describing it. After all, theories eventually will be proven wrong or incomplete, but the evidence used to generate them should be robust.

      Adversarial collaborations are great when they work (they're rare), especially when two theories make conflicting predictions. But even in the absence of direct theoretical predictions, direct replication is necessary to verify the strength of the results themselves.

    2. Don't get me wrong: I think replication is important, and direct replication sadly too rare. I just don't think that the result of one replication can falsify the original result. If it is different from the original result, it tells you that something interesting is going on and further replications should be done. I agree that a series of non-replications will get the original result and theory in trouble, to the point where it will be discarded. That point will be reached when there are no more theoretically interesting reasons to uphold the original result in the face of mounting counter-evidence, and it will have to be put down to 'banal' causes such as questionable research practices. But just as one result isn't conclusive until it is replicated (that's why direct replications are so important, after all), one replication is not conclusive either until it is replicated -- until a point is reached where the research community decides that the replications don't produce interesting differences anymore and the original theory is either accepted, or amended, or dropped.

      If you could really judge the adequacy of an experiment regardless of the theories describing it, then why do direct replications at all? Only to rule out fraud and questionable research practices? I agree that is (sadly) an important consideration, but I'm sure it's not the only one.

    3. I completely agree -- the results of one replication cannot falsify an original result. That's one reason I've pushed for multiple replications using a shared protocol for APS. The goal is to get a better handle on the size of the effect. And, we shouldn't take the results of any one study as definitive (unless it has enormous power and an airtight design). I also don't think we necessarily need to assume shoddy research practices if, upon multiple replications, the original result turns out to be wrong. Statistically, we should expect some false positive results. And, with publication bias as it is, we should expect that most published findings overestimate the true effect size. The original study might have followed best research practices and still produced a false positive. In my view, we should treat a single positive result much the same way that we treat a single negative one -- it's suggestive but not definitive evidence for the true size of the effect in the population.

      For what it's worth, I don't think of replication as a questionable-practices-detection mechanism (although it might do that on occasion). Rather, I view it as a way to obtain a better estimate of the true size of an effect in reality. If an effect has no theoretical or empirical importance, then there's not much reason to bother replicating it. But, for results that are theoretically important (even if the theories are only weakly elaborated), direct replications help verify the size of a result.

      In my view, there's too much emphasis on replication as fraud detection and debunking. Replication helps to shore up an original finding by showing that it is robust. My hope and expectation is that most findings in our field will withstand such direct replication (particularly when the original study had adequate power to detect an effect with a reasonable effect size). Of course underpowered studies are more likely to produce false alarms with large effect sizes, so those are the ones that are most in need of direct replication to verify the actual effect size. That doesn't mean they were fraudulent, just that they provided a far less precise estimate of the effect size. Now, if they were produced via p-hacking and other questionable practices, they will be even more likely to be false alarms. I hope such practices are less prevalent than they seem to be, but if not, then direct replication is even more crucial. There's no point in fleshing out sophisticated theories if the results they are based on prove to be insubstantial.

    4. Yes, well put, I agree with everything you wrote. I would only add that if the differences themselves reproduce (team A keeps finding A, team B keeps finding B, for example), then it is definitely time for adversarial collaboration. That might be the situation that is most interesting in terms of theory development.

    5. Thanks Maarten. Yes, that is a fascinating situation. And, I could easily imagine cases in which some set of labs consistently produce an effect and others do not. The key, then, will be in trying to understand what differs. One advantage of using a shared, vetted protocol (as we're doing at Perspectives on Psych Science) is that we can identify as many of the necessary manipulation checks and method details in advance as possible. That helps rule out the simple mistakes, leaving the more interesting ones.

  2. I agree with everything in this post and wish to add one thought, concerning the statement "We have to treat near-zero effect sizes as failures to replicate..." Any study can have an effect that goes one direction, goes the other direction, or is so close to zero that it might as well be zero. But how close to zero is that? It depends on many things, including theoretical context (e.g., what other effects are competing as explanations for the phenomenon and what is their size?) and practical implications (e.g., how many lives will be saved by the widespread use of a treatment that has this effect size?).
    The situation is a bit different when we are talking about replication. Here, the question is whether the new study obtained an effect size large enough to support the existence of the original effect. But how large is that? Here's an interesting possibility:
    If the original finding was reported in the context of NHST, then an N was reported along with a critical p-value. What is the smallest effect size that would have attained that critical p-value, given that N? The answer to this question tells us what the original investigator, implicitly, is saying is the minimum effect size that would make the finding worth taking seriously. Because: if it had been smaller than that, the investigator wouldn't have reported it!
    If this logic is correct, then it follows that a subsequent study that does not attain at least this effect size cannot count as being "big enough" to confirm the finding, and if the confidence interval around the new effect doesn't include the old effect size, then the original study can be said to have been disconfirmed!
    Did I get anything wrong here?

    1. That approach has some merit—basically, you're asking what the criterion effect size would have been if p had been exactly .05. In that case, power would be 50%. So, you're assuming that the study was conducted with 50% power to detect an effect of that size (under NHST). The problem is that we know that published effect size estimates are likely inflated (due to publication bias, etc). So, I would prefer assuming that the effect size actually is smaller than the published effect and then use a large enough sample to find that smaller effect size with at least 80% power in the replication attempt. I think the best approach is to make sure that the replication attempt has enough power to detect a similar sort of effect.

      The one aspect of your logic that I disagree with is your statement that the effect with p of exactly .05 is the "minimum effect size that would make the finding worth taking seriously." I'm not a fan of using p-values in this way for determining the legitimacy of an effect (I know that's probably not what you meant, of course). The measured p-value will vary quite a bit across repeated versions of the same study.

      You're absolutely right that the importance of an effect size varies wildly with the nature of the inference. The effect size of aspirin on heart attacks is about r=.02, and the study required 10,000 subjects. That is a life-saving effect size. Most psychology studies are not that large, and given the amount of variability in psychology measures, most effect sizes that small would not be of practical importance. The real question, as you suggest, is what sorts of effects are we capable of detecting reliably given the sample sizes we use. (Stay tuned -- another post coming soon on that point.)

  3. Briefly: You're indeed right that I don't favor using p-values to establish the legitimacy of an effect. But that's the common practice. My only small point was that an investigator who plays that game is barring himself/herself from reporting any effect smaller than one sufficient to attain the critical p-level given the N, and therefore implicitly accepting that an effect smaller than that is one that should never see the light of day. Which, as much as anything, shows the flaw in the whole logic of NHST.
    David Funder

    1. Ah. Good point. Thanks David. I do like the idea of using the power of the original study as a way to assess an unstated belief about the size of the effect. When an effect is surprising, the lack of power and the necessity of finding a large effect is particularly problematic. For a surprising effect, you shouldn't assume it will be large, right?

  4. Suppose you ran two experiments (the second being an attempted replication of the first) on different subjects and you want to know what their aggregated data tell us. As a Bayesian, you shouldn't care which came first. Data should be undated (with some exceptions). The first experiment is as much a replication of the second as vice versa. Thus the idea of replication is misleading. If you have two experiments that are similar enough to compare effects, you should think of experiments as a random effect, i.e., these experiments are a sample from an infinite population of possible experiments, with subjects nested within experiments. The tools for doing this are readily available in the R packages to do linear mixed-effects models, or programs such as HLM.

    1. Thanks Michael. Absolutely right: From a statistical perspective, you can treat the order of the original study and replication as arbitrary. The case in which you just run an experiment twice with different subjects is an idealized version of that, and order doesn't matter (in principle, assuming the population hasn't changed over time). Combining the two meta-analytically without differentiating which came first (with various tools that allow you to treat them as a random effect as well) makes sense in that case.

      That said, the idealized case in which the two studies are conducted independently isn't always the case in reality (even though they can be treated that way for meta-analytic purposes if they are the same study design with the same population). In most cases, an initial study of an effect has a disproportionate influence on how people think about a finding. Murder is shouted on page 1, but corrections appear on page C26. Moreover, the first study does give us information we can use to adjust our prior beliefs about the size of the effect when conducting a follow-up study. When deciding what to do for the follow-up (e.g., power, etc.), knowledge of the first result can guide decisions about the second. If you update your priors based on the first result, then the two studies aren't really two tests of the same effect using the same prior beliefs. The order is irrelevant if you don't update your priors based on your initial evidence. But, it's relevant if you do.

      Of course, treating them as equivalent and disregarding the order in which they were conducted is a good way to estimate the overall effect and to try to generalize to the larger population of possible experiments of the same sort. If they are direct replications using the same procedure, that's the right way to go (and you can use a common set of prior beliefs to look at the overall effect, etc. in the way Michael suggests).
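For readers who want to see the order-independence point in numbers, here is a minimal sketch of pooling two studies by inverse-variance weighting. It uses a simple fixed-effect combination rather than the random-effects or mixed-model approach Michael describes, because two studies are too few to estimate between-experiment variance well. The sample sizes (20 and 200 per group) are hypothetical, chosen only to mirror the 10x ratio in the thought experiment, and `d_variance` uses the usual large-sample approximation for the variance of Cohen's d.

```python
def d_variance(d, n1, n2):
    """Large-sample approximation to the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def pooled_d(studies):
    """Fixed-effect (inverse-variance weighted) average of effect sizes.

    `studies` is a list of (d, n1, n2) tuples; their order is irrelevant.
    """
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    weighted_sum = sum(w * d for w, (d, _, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


original = (0.50, 20, 20)       # hypothetical small original study
replication = (0.10, 200, 200)  # hypothetical replication with 10x the sample

print(pooled_d([original, replication]))
print(pooled_d([replication, original]))  # same estimate, whichever came first
```

The pooled estimate sits much closer to the larger study's d than to the original's, and swapping the order of the two studies leaves it unchanged.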