Daniel Simons: What counts as a successful replication?

Tuesday, February 12, 2013

What counts as a successful replication?

A recent discussion among cognitive and social psychologists prompted me to think about the nature of replications. What does it mean to call a replication "successful" or a failure? Below I write about a thought experiment that you can try for yourself. Please post your thoughts in the comments!

What counts as a replication and what counts as a failure to replicate? The way those terms have been used in the field, a success is a study that finds a significant effect in the same direction and a failure is one that doesn't. But that is a lousy criterion on which to evaluate the similarity of two findings.

As a thought experiment, take the case of a somewhat underpowered study that finds a significant effect (p=.049) with an effect size of d=.50. Which of the following should count as a failure to replicate?

1. A study that has the same sample sizes as the original that does not find a significant difference (p > .05) but that produces an effect size of d=.48 in the same direction.
2. A study with 50% larger sample sizes that significantly rejects the null (p<.05), with an effect size estimate of d=.40.
3. A study with massive power (10x the original sample size) that does not significantly reject the null and produces an effect size of d=0.
4. A study with massive power (10x the original sample size) that significantly rejects the null (p<.05), but produces an effect size estimate of d=.10 in the same direction.

(For the sake of this thought experiment, let's assume the studies were conducted well and used the same protocol as the original study. Heck, it's a thought experiment -- let's have them be studies from the same lab run with the same RAs so there's no risk of lab differences).

By the standard in the literature, study 1 would be treated as a failure to replicate because it was not statistically significant at p<.05. But that makes no sense at all. The effect size estimate is nearly identical. Just because one result is significant and the other isn't does not mean that the two are different from each other. There is no meaningful difference in terms of the effect size estimate for a study that produces p=.049 and one that produces p=.051 (with the same design and sample size). This example reveals the problem with just tallying up significant and non-significant results.
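To make the point about study 1 concrete, here is a minimal sketch using only the Python standard library. It assumes a two-group design with 32 participants per group (my assumption, chosen because it gives p of roughly .049 for d=.50 under a normal approximation) and the usual large-sample formula for the standard error of Cohen's d. The function names are illustrative, not from any package.

```python
from statistics import NormalDist


def d_standard_error(d, n1, n2):
    """Large-sample approximation to the standard error of Cohen's d."""
    return ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5


def two_sided_p(z):
    """Two-sided p-value for a z statistic (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def compare_effects(d1, n1, d2, n2):
    """Return (p for study 1, p for study 2, p for the difference).

    n1 and n2 are per-group sample sizes; equal groups are assumed.
    """
    se1 = d_standard_error(d1, n1, n1)
    se2 = d_standard_error(d2, n2, n2)
    z_diff = (d1 - d2) / (se1 ** 2 + se2 ** 2) ** 0.5
    return two_sided_p(d1 / se1), two_sided_p(d2 / se2), two_sided_p(z_diff)


if __name__ == "__main__":
    # Original (d=.50) versus study 1 (d=.48), assumed 32 per group in both.
    p_orig, p_rep, p_diff = compare_effects(0.50, 32, 0.48, 32)
    print(f"original p = {p_orig:.3f}")
    print(f"replication p = {p_rep:.3f}")
    print(f"p for the difference between the two = {p_diff:.3f}")
```

Under these assumptions one study falls just below .05 and the other just above it, while the test of the difference between the two estimates comes nowhere near significance.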

Take case number 2. I think most researchers would agree that such a study would count as a successful replication. It is statistically significant (p<.05) and the effect size is in the same range as the original. It has larger power, so the effect size estimate will be more precise (and accurate). In other words, it is qualitatively similar to the original result. We should expect direct replications to find smaller effects on average due to publication biases, even if the original effect is real and substantial. That's one reason why replications must have greater power than an original study: we should assume that the original study overestimated the size of the effect. There's no individual fault or blame for a slightly reduced effect size estimate in this sort of case; it's a consequence of our field's tendency to publish only statistically significant findings.
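The claim that publishing only significant results inflates effect sizes can be checked with a small simulation. This is a sketch under assumptions of my own: a true effect of d=.30, 32 participants per group, normally distributed scores, and a critical t of 2.0 (an approximation to the two-sided .05 cutoff with 62 degrees of freedom). None of these numbers come from a real study.

```python
import random
import statistics


def simulate_published_effects(true_d=0.3, n_per_group=32, n_studies=2000,
                               t_crit=2.0, seed=1):
    """Simulate many identical studies and 'publish' only significant ones.

    Returns (mean d over all studies, mean d over published studies,
    number of published studies).
    """
    rng = random.Random(seed)
    all_d, published_d = [], []
    for _ in range(n_studies):
        treat = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        ctrl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Pooled SD for equal group sizes is the root of the mean variance.
        pooled_sd = ((statistics.variance(treat)
                      + statistics.variance(ctrl)) / 2) ** 0.5
        d = (statistics.mean(treat) - statistics.mean(ctrl)) / pooled_sd
        t = d * (n_per_group / 2) ** 0.5  # t statistic for equal groups
        all_d.append(d)
        if abs(t) > t_crit:
            published_d.append(d)
    return (statistics.mean(all_d), statistics.mean(published_d),
            len(published_d))


if __name__ == "__main__":
    mean_all, mean_pub, n_pub = simulate_published_effects()
    print(f"mean d, all studies:       {mean_all:.2f}")
    print(f"mean d, published studies: {mean_pub:.2f} ({n_pub} published)")
```

With 32 per group, a study only reaches significance if its observed d is about .50 or larger in absolute value, so the published subset overstates the true effect whenever the true effect is smaller than that.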

Take case number 3. I think all of us would agree that such a study is a failure to replicate by any standard we can come up with. It provides a highly precise estimate of the effect size (zero) that is not in the same direction as the original and is not statistically significant. I'm hoping nobody would quibble with this one.

But what do we make of case #4? With massive power, the effect size estimate is highly precise. The effect size is in the same direction as the original and significantly different from zero. But it is substantially smaller. Here we run into the problem of what it means to be qualitatively the same. Would you treat this result as a replication because it is in the same direction and significantly different from 0? Or should it be treated as a failure to replicate because it does not find the same magnitude of effect size? What counts as too different to be treated as qualitatively the same? I don't believe there is a definitively right answer to this question. There is an effect, but it is substantially smaller than the original estimate and requires much greater sample sizes to detect it reliably. How should the field treat such a finding?

20 comments:

Jazi Zilber, 2/12/2013 12:35:00 PM
I think case 4 is a partially successful replication.

I do not think that there should be a black/white binary definition. There is a lot of grey.

PS. The most crucial parameter is to have replications done exactly the same in terms of protocol (they can have more power, but the experimental procedure must be exactly the same).

The guy who "failed to replicate" the Bargh paper on old people priming REFUSED to re-run the experiment with the more exact protocol (personal communication). So while he made two strong deviations from the protocol (30-30 words were primes + calling attention to the walking), it got marked as "failure" and he refuses to redo it correctly!
Matt Craddock, 2/12/2013 02:02:00 PM
Interesting post. Think I'd agree with Jazi, in that I'd perhaps think of it as a partially successful replication - sign, if not magnitude. As you point out, the question is how to reconcile the differing effect sizes. Effect sizes themselves are point estimates, so perhaps a good place to start would be looking at the confidence intervals around those effect sizes - they'd be much broader for the original, underpowered effect, and may include the effect size estimate for the new, follow-up study. If so, one might be keener to say it replicates the original, because it finds an effect in the same direction and within the same range of magnitudes as were suggested by the original study.
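The confidence-interval check suggested here can be sketched in a few lines. As before, the 32-per-group sample size is an assumption (the post gives no sample size), the interval uses a normal approximation to the sampling distribution of Cohen's d, and the function name is illustrative.

```python
from statistics import NormalDist


def d_confidence_interval(d, n_per_group, level=0.95):
    """Approximate confidence interval for Cohen's d with equal groups."""
    se = (2 / n_per_group + d ** 2 / (4 * n_per_group)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


if __name__ == "__main__":
    # Interval around the original estimate (d=.50, assumed 32 per group).
    low, high = d_confidence_interval(0.50, 32)
    print(f"original 95% CI: [{low:.2f}, {high:.2f}]")
    # Replication estimates from the four cases in the post.
    for case, d_rep in enumerate([0.48, 0.40, 0.00, 0.10], start=1):
        inside = low <= d_rep <= high
        print(f"case {case}: d = {d_rep:.2f}, inside original CI: {inside}")
```

Because the original study is small, its interval is very wide, which is why a much smaller replication estimate can still fall inside it.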

PS: re the Bargh paper, those two strong deviations don't appear that strong if you compare the original Bargh paper and the new one; Bargh's lists 28 words as elderly primes, and participants were told the exit was down the hall (says so in the paper, although it was repeatedly claimed otherwise).
Jazi Zilber, 2/12/2013 03:53:00 PM
1) Replications here test whether the effect is real or a fluke. Therefore, testing the effect size is not so relevant; that is a matter of refining studies (how exactly the effect works, etc.).

2) We want a single probability of the reality of the original effect.

So it should be a composite probability of the original effect being real given the new results, which obviously should incorporate the replication effect size and the various implied probabilities. It seems theoretically simple, just a formula to be worked out (with a few decisions, of course).

Then, one would get a single P of true/false for "is there an effect?"
Unknown, 2/12/2013 05:02:00 PM
Hi Dan,

Why not take the following approach, based on Matt's suggestion? It is kind of a fast-and-frugal way, though probably not entirely correct. In case 4, you actually have 11 times the same study (1 original + a 10-fold replication). So, you have a set of 11 different effect sizes, which you can use to estimate how "odd" the first effect size was. Of course, a lot will depend on how you split up your 10x replication study, but you can actually do a worst-case and a best-case version of this split-up, or you could even figure out (after some intense computing) the empirical "oddness" of your first effect size given all possible split-ups of your replication study.
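A rough sketch of this split-up idea, using random split-ups rather than all possible ones. Everything here is hypothetical: the replication data are simulated with a true effect of d=.10 and 320 participants per group, and "oddness" is taken to mean the share of original-sized chunks whose effect size is at least as large as the original d=.50.

```python
import random
import statistics


def cohens_d(treat, ctrl):
    """Cohen's d for two equal-sized groups."""
    pooled_sd = ((statistics.variance(treat)
                  + statistics.variance(ctrl)) / 2) ** 0.5
    return (statistics.mean(treat) - statistics.mean(ctrl)) / pooled_sd


def simulate_replication(true_d=0.10, n_per_group=320, seed=0):
    """Hypothetical raw data for the 10x replication."""
    rng = random.Random(seed)
    treat = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return treat, ctrl


def chunk_effect_sizes(treat, ctrl, n_chunks, rng):
    """Randomly split both groups into n_chunks and return d per chunk."""
    treat, ctrl = treat[:], ctrl[:]  # copy so the caller's lists are untouched
    rng.shuffle(treat)
    rng.shuffle(ctrl)
    size = len(treat) // n_chunks
    return [cohens_d(treat[i * size:(i + 1) * size],
                     ctrl[i * size:(i + 1) * size])
            for i in range(n_chunks)]


def oddness_of_original(original_d, treat, ctrl, n_chunks=10,
                        n_splits=500, seed=1):
    """Share of chunk effect sizes at least as large as original_d."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n_splits):
        for d in chunk_effect_sizes(treat, ctrl, n_chunks, rng):
            hits += int(d >= original_d)
            total += 1
    return hits / total


if __name__ == "__main__":
    treat, ctrl = simulate_replication()
    share = oddness_of_original(0.50, treat, ctrl)
    print(f"share of chunks with d >= .50: {share:.3f}")
```

The worst-case and best-case split-ups mentioned above would need a search over partitions; random split-ups only approximate the distribution over all possible ones.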
Etienne LeBel, 3/03/2013 10:00:00 PM
Wow, very interesting discussions regarding how to interpret direct replications.

I'd say in the case of the massive (x10) **direct** replication with d=.05 or d=.01 (same direction), robustly different (via permutation/bootstrapping test, chi-square, or Bayesian approach) from the imprecise d=.5 in the original, that this should be considered a FAILURE TO REPLICATE the original finding because the massive direct replication found an effect that, even though it is in the same direction, is robustly different from the original. My position is based on the fact that as scientists we should care not only about the direction but also about the size of an effect. I think John Tukey (who incidentally contributed significantly to the jackknife procedure, a precursor of bootstrapping) said it best by saying “amount, as well as direction, is vital” (Tukey, 1969, p. 86). In his own words:

"The physical sciences have learned much by storing up amounts, not just directions. If, for example, elasticity had been confined to “When you pull on it, it gets longer!” Hooke’s law, the elastic limit, plasticity, and many other important topics could not have appeared (emphasis added) (p. 86)."

And the reason psychologists haven't heeded such words of wisdom is that virtually all metrics in psychology are arbitrary and hence typically non-interpretable, which is why they are not paid attention to. Some theoretical work has been done proposing how to go about developing meaningful units of measurement for instruments in psychology (LeBel, 2011; http://ir.lib.uwo.ca/etd/174/), but there are formidable challenges in this endeavor.