Tuesday, February 12, 2013

What counts as a successful replication?

A recent discussion among cognitive and social psychologists prompted me to think about the nature of replications. What does it mean to call a replication "successful" or a failure? Below I write about a thought experiment that you can try for yourself. Please post your thoughts in the comments!

What counts as a replication and what counts as a failure to replicate? The way those terms have been used in the field, a success is a study that finds a significant effect in the same direction and a failure is one that doesn't. But that is a lousy criterion on which to evaluate the similarity of two findings.

As a thought experiment, take the case of a somewhat underpowered study that finds a significant effect (p=.049) with an effect size of d=.50. Which of the following should count as a failure to replicate?

  1. A study that has the same sample sizes as the original that does not find a significant difference (p > .05) but that produces an effect size of d=.48 in the same direction.
  2. A study with 50% larger sample sizes that significantly rejects the null (p<.05), with an effect size estimate of d=.40.
  3. A study with massive power (10x the original sample size) that does not significantly reject the null and produces an effect size of d=0.
  4. A study with massive power (10x the original sample size) that significantly rejects the null (p<.05), but produces an effect size estimate of d=.10 in the same direction.

(For the sake of this thought experiment, let's assume the studies were conducted well and used the same protocol as the original study. Heck, it's a thought experiment -- let's have them be studies from the same lab run with the same RAs so there's no risk of lab differences).

    By the standard in the literature, study 1 would be treated as a failure to replicate because it was not statistically significant at p<.05. But that makes no sense at all. The effect size estimate is nearly identical. Just because one result is significant and the other isn't does not mean that the two are different from each other. There is no meaningful difference in terms of the effect size estimate for a study that produces p=.049 and one that produces p=.051 (with the same design and sample size). This example reveals the problem with just tallying up significant and non-significant results.
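To put a rough number on "not different from each other", here is a minimal sketch that compares the two effect size estimates directly. It assumes a two-group design with 32 participants per group (my assumption, chosen because d = .50 lands near p = .05 at about that size) and uses the standard large-sample approximation to the sampling variance of Cohen's d. The function names are made up for illustration.

```python
import math
from statistics import NormalDist

def d_variance(d, n_per_group):
    """Large-sample approximation to the sampling variance of Cohen's d
    for two independent groups of equal size."""
    n = n_per_group
    return 2.0 / n + d ** 2 / (4.0 * n)

def compare_effect_sizes(d1, n1, d2, n2):
    """z test for the difference between two independent d estimates."""
    se_diff = math.sqrt(d_variance(d1, n1) + d_variance(d2, n2))
    z = (d1 - d2) / se_diff
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Original (d = .50) vs. case 1 (d = .48); 32 per group is an assumed sample size.
z, p = compare_effect_sizes(0.50, 32, 0.48, 32)
print(f"difference between the two estimates: z = {z:.3f}, p = {p:.3f}")
```

Under these assumptions the test of the difference comes nowhere near significance, which is the point: one study landing on each side of .05 says almost nothing about whether the two results disagree.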

    Take case number 2. I think most researchers would agree that such a study would count as a successful replication. It is statistically significant (p<.05) and the effect size is in the same range as the original. It has larger power, so the effect size estimate will be more precise (and accurate). In other words, it is qualitatively similar to the original result. We should expect direct replications to find smaller effects on average due to publication biases, even if the original effect is real and substantial. That's one reason why replications must have greater power than an original study: we should assume that the original study overestimated the size of the effect. There's no individual fault or blame for a slightly reduced effect size estimate in this sort of case. It's a consequence of our field's tendency to publish only statistically significant findings.
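A small simulation can illustrate why published effects tend to be overestimates. This is only a sketch under assumptions I picked for illustration: a true effect of d = .30, 32 participants per group, normally distributed scores, and a "journal" that publishes only studies that are significant in the predicted direction (using t > 2.0 as an approximate cutoff for 62 degrees of freedom).

```python
import math
import random
from statistics import mean, variance

def simulate_study(true_d, n, rng):
    """Simulate one two-group study; return the observed Cohen's d and t."""
    treat = [rng.gauss(true_d, 1.0) for _ in range(n)]
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n)]
    pooled_sd = math.sqrt((variance(treat) + variance(ctrl)) / 2.0)
    d = (mean(treat) - mean(ctrl)) / pooled_sd
    t = d * math.sqrt(n / 2.0)  # equal-n independent-samples t statistic
    return d, t

rng = random.Random(2013)                # fixed seed so the run is repeatable
TRUE_D, N, T_CRIT = 0.30, 32, 2.0        # all three values are assumptions
results = [simulate_study(TRUE_D, N, rng) for _ in range(2000)]

all_d = [d for d, _ in results]
published_d = [d for d, t in results if t > T_CRIT]  # the "published" subset

print(f"true d: {TRUE_D:.2f}")
print(f"mean d across all studies: {mean(all_d):.2f}")
print(f"mean d across published studies: {mean(published_d):.2f}")
print(f"share of studies published: {len(published_d) / len(all_d):.2f}")
```

With this cutoff and sample size, a study cannot be "published" unless its observed d exceeds .50, so the published average necessarily sits well above the true .30.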

    Take case number 3. I think all of us would agree that such a study is a failure to replicate by any standard we can come up with. It provides a highly precise estimate of the effect size (zero) that is not in the same direction as the original and is not statistically significant. I'm hoping nobody would quibble with this one.

    But what do we make of case #4? With massive power, the effect size estimate is highly precise. The effect size is in the same direction as the original and significantly different from zero. But it is substantially smaller. Here we run into the problem of what it means to be qualitatively the same. Would you treat this result as a replication because it is in the same direction and significantly different from 0? Or should it be treated as a failure to replicate because it does not find the same magnitude of effect size? What counts as too different to be treated as qualitatively the same? I don't believe there is a definitively right answer to this question. There is an effect, but it is substantially smaller than the original estimate and requires much greater sample sizes to detect it reliably. How should the field treat such a finding?


    1. I think case 4 is a partially successful replication.

      I do not think that there should be a black/white binary definition. There is a lot of grey.

      PS. The most crucial parameter is to have replications done exactly the same in terms of protocol (they can have more power, but the experimental procedure must be exactly the same).

      The guy that "failed to replicate" Bargh paper on old people priming REFUSED to re-run the experiment with the more exact protocol (personal communication). So while he did two strong deviations from the protocol (30-30 words were primes + calling attention to the walking), it got marked as "failure" and he refuses to redo it correctly!

      1. Okay. Let's push the thought experiment a little further. Imagine the direct replication with a massive sample size produced an effect size of .05 rather than .10. Would that still count as a partial replication? How about .01 or .0000001? Is the only criterion for replication that it's in the same direction?

        I agree that direct replication is key. And, whether or not the study you mentioned was direct enough is a topic for another discussion (interesting as well). My purpose in this post is not to focus on particular controversial findings but to address the broader question of what counts as a replication. For what it's worth, it's inappropriate to cite an anonymous personal correspondence in a discussion of a scientific finding. Without specifying whom the correspondence was from, your statement that the researchers "refused" to re-run an experiment carries no weight and could even be viewed as libelous (damaging to reputations without proof of the veracity of your statement). What evidence do you have that the original authors *refused* to run the experiment as a direct replication? Can you name the person who told you that? If not, please do not post anonymous accusations of this sort, even if you know the source to be someone trustworthy.

      2. 1) I am taking away my statement of replication refusal.

        2) Besides, while not published to my knowledge, here it is suggested that they later re-did the replication "taking into account Bargh's concerns. It still didn't work." If they fixed all problems and got no results with sufficient power, that is more interesting.

      3. Thanks for understanding, Jazi. I agree that the key is large-scale replication using the same protocol. For that particular study, there are other reported replication attempts at psychfiledrawer.org. I'm not sure what the cumulative evidence shows, though -- probably not enough direct replication attempts yet.

      4. I emailed you on where I came from

      5. Just to set the record straight here: It is incorrect to state that we refused to carry out a further replication. What happened is that after we found out that there was indeed one valid concern that our published replication attempt was not as exact as we had meant it to be (specifically, the number of prime words we used), we set out to carry out a further replication that corrected this issue. The full results of this further attempt, as well as all relevant methodological details are available online:


    2. Interesting post. Think I'd agree with Jazi, in that I'd perhaps think of it as a partially successful replication - sign, if not magnitude. As you point out, the question is how to reconcile the differing effect sizes. Effect sizes themselves are point estimates, so perhaps a good place to start would be looking at the confidence intervals around those effect sizes - they'd be much broader for the original, underpowered effect, and may include the effect size estimate for the new, follow-up study. If so, one might be keener to say it replicates the original, because it finds an effect in the same direction and within the same range of magnitudes as were suggested by the original study.

      PS: re the Bargh paper, those two strong deviations don't appear that strong if you compare the orig Bargh paper and the new one; Bargh's lists 28 words as elderly primes, and participants were told the exit was down the hall (says so in the paper, although it was repeatedly claimed otherwise).

      1. Good point about confidence intervals, Matt. I'm a big fan of that approach, especially when there are multiple direct replications using the same protocol. There is a problem, though, that follows from your note that the intervals around the original study's effect size will be much wider if it had a much smaller sample. That means a decision about whether or not the replication attempt "succeeded" will depend on how *underpowered* the original was. The weaker the original study, the wider the confidence interval, and the more difficult it would be to find an effect size that didn't fall within that confidence interval. Let's push this to the extreme. Imagine a replication with millions of subjects that finds an effect size of .02 (just barely above 0). Now, imagine that the original study had an effect size of .5 but with a small sample such that the confidence interval around the effect size ranged from .01 to .99 (not really plausible, but we're in thought-land here). Would that .02 effect count as a replication because it fell within the confidence interval around the original effect? Note that had the original effect been measured with a large sample, the .02 would fall outside the confidence interval. With this approach, the weakness of the original study protects it against failures to replicate. That seems problematic. That's one reason I think it's important to consider both the magnitude and the sign, and ideally to give weight to the relative power as well. I think this reflects a broader problem with dichotomous decisions.
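Here is a minimal sketch of that dependence on the original sample size. It uses a normal-approximation confidence interval around Cohen's d for a two-group design; the sample sizes (32 and 3,200 per group) are assumptions picked for illustration, and the function name is made up.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d, n_per_group, level=0.95):
    """Approximate confidence interval for Cohen's d (two equal groups),
    based on the large-sample standard error of d."""
    n = n_per_group
    se = math.sqrt(2.0 / n + d ** 2 / (4.0 * n))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return d - z * se, d + z * se

REPLICATION_D = 0.02  # the huge replication's estimate from the example above

# Same original estimate (d = .50), measured with a small vs. a large sample.
for n in (32, 3200):
    lo, hi = d_confidence_interval(0.50, n)
    inside = lo <= REPLICATION_D <= hi
    print(f"original n per group = {n:4d}: 95% CI [{lo:.2f}, {hi:.2f}], "
          f"replication d = {REPLICATION_D} inside: {inside}")
```

The same replication result "succeeds" against the small original and "fails" against the large one, even though the original point estimate is identical in both cases.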

      2. I am hardly a priming expert. But I always knew from skimming here and there that priming is done by giving a list of words of which only *part* are elderly-related. This I remember clearly from years ago.

        Now, these guys I understand gave 30 out of 30 words for priming.

        Beyond invalidating the results, it shows a high level of ignorance of the priming literature, which weakens the validity of any experiments.

      3. Very true, hadn't thought of it that way; I totally agree that it highlights a problem with dichotomous thinking, which is one of my bugbears with NHST in general.

        Ultimately whether you'd count case #4 as a successful replication would come down to what your question is. If your question is "does x affect y?" then yes, it replicates it. If your question is "does x affect y by 100+-30 comedy-made-up-units, as in the original paper?", then no it doesn't, but that leads straight back to "does x affect y?", or simply "how much *does* x affect y?". For both of those questions, whether it replicates the effect size is pretty much irrelevant.

        But, to the extent that case #4 directly replicates the original experiment, I think it'd be hard to claim that the two show different effects in a meaningful way. By saying the first experiment showed a different effect, you're implying that there exists a large effect of the same manipulation that, for whatever reason, didn't appear in an experiment that'd be better able to detect a large effect. There'd have to be some unknown moderator. And then you could still say they're both subtypes of the same effect, namely that a and b are both examples of effects of category c, or that effect c expresses itself differently under different conditions. A 2x4 plank and a 2x8 plank are both still planks.

        I think perhaps what we're searching for here is a definition of what constitutes an effect rather than what constitutes a successful replication. Clearly it's desirable that a full definition of an effect would constitute a description of its sign and an estimate of its size. But if I estimated gravity (without knowing that gravity exists) by how heavy I was on Mars compared to Earth I might think there are two completely different effects in operation, when both estimates are the product of the same constant and its moderating factors. So perhaps to really define an effect, we need to know how big it is and what moderates the effect, and to really clearly separate two effects, show they're moderated by different things and/or have a different constant.

      4. I agree that it depends on the question, but I disagree with your inference that effect size is irrelevant. It can be hugely important whether x affects y strongly or minimally. Many theories in psychology are inadequately precise to make strong predictions about effect size, but that doesn't make effect size irrelevant. To build on your analogy: Yes, both are planks, but one can't reach as far as the other, which makes it less useful in construction.

        As for the difference between the two effect sizes, if you assume that both estimates accurately reflect reality (and do so equally well), then there must be some moderator. Note, though, that the thought experiment was designed to rule out that possibility -- the study was done in the same way by the same lab. I'm explicitly asking about the case in which the study was a direct replication using the same methods, but produces a smaller effect with a larger sample size.

        There are, of course, other reasons why the two effect sizes are unequal. The most obvious one is that the smaller initial study overestimated the true size of the effect. That's entirely plausible given the presence of publication bias (positive results are more likely to be published) and the fact that studies with a small sample can only find statistically significant results by happening to get a huge effect. Studies provide estimates of effect sizes, but they are variable and imperfect due to sampling. The larger the sample, the more precise the estimate (and presumably, the more accurate the estimate) of the true underlying effect.
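To make the "small studies need a huge effect" point concrete, here is a rough sketch of the smallest observed effect size that can reach significance at a given sample size in a two-group design. It uses the normal approximation to the critical value, so it slightly understates the threshold for very small samples. The sample sizes are arbitrary illustrations.

```python
import math
from statistics import NormalDist

def smallest_significant_d(n_per_group, alpha=0.05):
    """Smallest observed Cohen's d that reaches two-sided significance
    with two equal groups (normal approximation to the t critical value)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z_crit * math.sqrt(2.0 / n_per_group)

for n in (10, 32, 100, 1000):
    print(f"n per group = {n:4d}: observed d must be at least "
          f"{smallest_significant_d(n):.2f}")
```

If only significant results get published, every published small-sample study has to report an effect at least this large, whatever the true effect is.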

        The safest assumption from the description of case #4 is that the larger study provides a more accurate measure of the underlying effect, one that is substantially smaller than the initial estimate. The question is how to interpret the claims from the initial study in light of this new information.

      5. I wasn't claiming that effect size in general was irrelevant - far from it! It's more that, to my eyes, both of those questions could be answered perfectly well without really caring whether it replicates the answer given by the previous study. Ok, so if your question is whether it replicates the size of the effect in the other study, then perhaps we could put them both in an ANOVA with experiment as a factor. But that's often not really the question being asked by the second study, which is more like: if the effect is there, how big is it?

        My point about the hidden moderator was exactly that this has been explicitly ruled out in this case - I would argue that, in our thought experiment at least, it wouldn't make a lot of sense to say that they had found completely different effects when there is no possibility of a moderator. It's more likely that they've simply provided different estimates of the same effect. I'd even say that in our thought experiment, the two studies are essentially a single study, and interpreting them separately is not such a good idea.

        How should the field interpret the first study in our thought experiment? It found a genuine effect of x on y, but it provided a poor estimate of the size of that effect.

      6. I like the interpretation in your last paragraph. That fits how I'd think of it.

    3. 1) Replications here test whether the effect is real or a fluke. Therefore, testing the effect size is not so relevant. That is a matter of refining studies (how exactly the effect works, etc.).

      2) We want a single probability of the reality of the original effect.

      So it should be a composite probability of the original effect being real given the new results, which obviously should incorporate the replication effect size and the various implied probabilities. It seems theoretically simple, just a formula to be worked out (with a few decisions, of course).

      Then, one would get a single P of true/false for "is there an effect?"

      1. I disagree with your premise in #1. The replication is not testing whether the effect is real or a fluke in case 4. Rather, it shows that the effect is likely much smaller than originally measured. The question, then, is how to interpret that finding. Yes, the initial result overestimated the effect size. But, there is still an effect in the same direction -- just smaller.

        I don't understand your point 2. What we want is an accurate understanding of reality, but the thought experiment was designed to test how we should interpret a replication that found a much smaller effect. The larger study does provide a better estimate of the true effect (at least in the context of this thought experiment).

      2. If we define the question more narrowly, we can get more reliable statistics. Almost by definition, the more outcomes you want from the data, the weaker your results (Bonferroni etc., and wiggle room). So defining an "Is there an effect?" question is bound to give more reliable results.

        Now, if we want just the most accurate number given all experiments, we just need to do Bayesian updating. We have the hidden variable of how many experiments were done, and we have a choice of how to combine results (the very decision to combine results from two studies is itself another degree of freedom, to be paid for in Bonferroni currency).

        [[An additional problem is assigning reliability to each experiment. If the recent high-powered study gives significantly weaker effects than the early one, one may claim that it reduces the probability of the first one being done well, and one has to adjust for this as well.]]

        The thing is that every measure is statistically valid if decided IN ADVANCE. Because you can have many criteria and measures, there is no way to have a reliable stat if decisions are flexible and possibly made after the fact.

    4. Hi Dan,

      Why not take the following approach, based on Matt's suggestion? It is kind of a fast-and-frugal way, though probably not entirely correct. In case 4, you actually have 11 times the same study (1 original + a 10-fold replication). So, you have a set of 11 different effect sizes, which you can use to estimate how "odd" the first effect size was. Of course, a lot will depend on how you split up your 10x replication study, but you can actually do a worst-case and a best-case version of this split-up, or you can even figure out (after some intense computing) the empirical "oddness" of your first effect size given all possible split-ups of your replication study.

      1. Tim -- that would be a variant of what's known as a permutation test. You basically sample from the large study and compute the mean. If you do that repeatedly many times, you can plot the resulting distribution of means to show how likely each mean is given the data you have. The downside is that it treats the sample data as if it were the full population. It's not a bad way to go. But, we already know that the confidence interval around the effect size for the massive study will be small, so as long as the samples you take are large enough, they will be centered on the sample mean and likely will differ from the original study mean. What we would then know is that the large study mean differs from the original mean. But, I could stipulate that in the thought experiment -- let's agree that those effect sizes are robustly different, but the second study is still significantly different from zero. What we would know from your analysis is how different the original mean was from the data in the second study.
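A rough sketch of that resampling idea, done as repeated subsampling without replacement: draw many subsamples of the original study's size from the large study and see how often a subsample produces an effect as big as the original. Everything here is a stand-in. The large-study data are simulated with a true effect of d = .10, and the sample sizes (32 per group originally, 320 in the replication) are assumptions for illustration.

```python
import math
import random
from statistics import mean, variance

def cohens_d(treat, ctrl):
    """Cohen's d for two equal-sized groups, using the pooled SD."""
    pooled_sd = math.sqrt((variance(treat) + variance(ctrl)) / 2.0)
    return (mean(treat) - mean(ctrl)) / pooled_sd

rng = random.Random(4)                    # fixed seed so the run is repeatable
ORIGINAL_D, N_ORIG, N_BIG = 0.50, 32, 320  # assumed values

# Simulated stand-in for the large replication's raw data (true d = 0.10).
treat = [rng.gauss(0.10, 1.0) for _ in range(N_BIG)]
ctrl = [rng.gauss(0.0, 1.0) for _ in range(N_BIG)]

# Repeatedly draw original-sized subsamples (without replacement) and record d.
sub_ds = [
    cohens_d(rng.sample(treat, N_ORIG), rng.sample(ctrl, N_ORIG))
    for _ in range(1000)
]

share_as_large = sum(d >= ORIGINAL_D for d in sub_ds) / len(sub_ds)
print(f"d in the full large study: {cohens_d(treat, ctrl):.2f}")
print(f"share of subsamples with d >= {ORIGINAL_D}: {share_as_large:.3f}")
```

As noted above, this treats the large sample as if it were the population, so it describes how unusual the original estimate is relative to the replication data rather than relative to the true effect.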

        Another approach to the same thing (and possibly a better one) would be to conduct a Bayesian analysis to estimate the parameters that are most consistent with the data. You can then calculate something like a confidence interval that specifies the 95% range for the most plausible values of the effect size based on the large study.

    5. Wow, very interesting discussions regarding how to interpret direct replications.

      I'd say in the case of the massive (x10) **direct** replication with d=.05 or d=.01 (same direction), robustly different (via permutation/bootstrapping test, chi-square, or Bayesian approach) from the imprecise d=.5 in the original, that this should be considered a FAILURE TO REPLICATE the original finding because the massive direct replication found an effect that -- even though it is in the same direction -- is robustly different from the original. My position is based on the fact that as scientists we should care not only about the direction but also about the size of an effect. I think John Tukey (who incidentally contributed significantly to the jackknife procedure, a forerunner of bootstrapping) said it best by saying “amount, as well as direction, is vital” (Tukey, 1969, p. 86). In his own words:

      "The physical sciences have learned much by storing up amounts, not just directions. If, for example, elasticity had been confined to “When you pull on it, it gets longer!” Hooke’s law, the elastic limit, plasticity, and many other important topics could not have appeared (emphasis added) (p. 86)."

      And the reason psychologists haven't heeded such words of wisdom is that virtually all metrics in psychology are arbitrary and hence typically non-interpretable, which is why they are not paid attention to. Some theoretical work has been done proposing how to go about developing meaningful units of measurement for instruments in psychology (LeBel, 2011; http://ir.lib.uwo.ca/etd/174/), but there are formidable challenges in this endeavor.

      1. Steve - glad you're finding the discussions interesting. I am as well. I figured that case 4 would generate disagreement. I lean toward your interpretation, that it is sufficiently different that it suggests the original effect size estimate was not accurate. I'm not a huge fan of characterizing any result as a success or failure to replicate. For me, the broader goal is to understand the size of the effect in reality. I don't really agree with your statement that "virtually all metrics in psychology are arbitrary and hence typically non-interpretable." I find most metrics in psychology to be readily interpretable. They require inference, but that's true of physics measurements as well. Accuracy metrics are straightforward, and response time metrics are reliable and interpretable (even if multiple interpretations are possible). Just my 2 cents.