Monday, June 3, 2013

Direct replication of imperfect studies - what's the right approach?

In a recent blog post, +Rolf Zwaan riffed on the benefits and limits of direct replications of the sort +Alex Holcombe, +Bobbie Spellman, and I are supporting at Perspectives on Psychological Science. He correctly claims that direct replications are focused on assessing the reliability of individual findings rather than testing the validity of the theory they were intended to support. By definition, a direct replication provides no better support for the theory than the original did. And, if the original study has flaws (and what study doesn't?), then that support might be limited in important ways. The goal of direct replication is not to evaluate theory. Rather, it is to determine whether a particular finding is robust. In my view, such measures of reliability have been sorely lacking in the field, whereas we have no shortage of conceptual replications that are better suited to exploring the generality of findings. 

We've encountered cases in which a lab submits a proposal to replicate an original study, and either we as the editors or those submitting the proposal spot a critical flaw in the original design. The problem is that, in hindsight, scientists typically can pick holes in most studies, finding problems that limit their generalizability or validity. For example, Zwaan correctly notes that the original control condition in the verbal overshadowing task does not equate the tasks required of participants across the experimental and control conditions. Generating a description of a face is not the same as generating a list of states/capitals, and a better control might ask participants to describe something other than the perpetrator. Maybe it's the act of writing a description—any description—that reduces memory accuracy. If so, then replicating the original finding would not validate the model of memory in which verbal encoding of the target object interferes with visual memory. True enough. As Zwaan argues, direct replication assesses the reliability of a finding, not the validity of the claims based on that finding. 

One approach we've considered taking is to plug known design holes as best we can in developing the new protocol, provided we can maintain the original intent while conducting a direct replication. The problem is that there are many ways to improve a design, and any change of that scope risks shifting into the mode of conceptual replication rather than direct replication. For the verbal overshadowing studies, I have seen at least half a dozen different suggestions for ways that labs wanted to improve the design. But, by allowing labs to adopt their own improvements, we risk abandoning the primary goal of measuring the reliability of a finding. 

On occasion, we may have to tweak the original design. For example, in the protocol to replicate Schooler's verbal overshadowing effect, we had to change the control condition slightly in order to permit replication in multiple countries. Rather than listing US States and their capitals, participants in all of the replication studies will instead list countries and capitals. That way, we can make the study comparable across all countries. We vetted that change with Schooler, and it does not change the design in a meaningful way, so we can still consider the protocol to represent a direct replication (although it is not an exact replication).

Another approach we might well take in the future would be to design a study that optimally tests a theory, one that builds on the earlier study but tweaks the design to improve it (such studies could arise from an adversarial collaboration). We could then undertake the same multiple-lab process to estimate the population effect size. Some of the protocols might well be a hybrid of this sort, provided that the original authors or their surrogates believe the new design to be an improved version of the original that will optimize the size of the effect.

Zwaan's interesting blog post suggests an approach to solving the validity problem, but it's one that I fear would not help solve the reliability problem:
This protocol might be like a script (of the Schank & Abelson kind) with slots for things like stimuli and subjects. We would then need to specify the criteria for how each slot should be filled. We’d want the slots to be filled slightly differently across studies; this would prevent the effect from being attributable to quirks of the original stimuli and thus enhance the validity of our findings. To stick with verbal overshadowing, across studies we’d want to use different videos. We’d also want to use different line-ups. By specifying the constraints that stimuli and subjects need to meet we would end up with a better understanding of what the theory does and does not claim. 
This approach, allowing each replication to vary provided that it follows a more general script, might not provide a definitive test of the reliability of a finding. Depending on how much each study deviated from the others, the studies could veer into the territory of conceptual replication rather than direct replication. Conceptual replications are great, and they do help to determine the generality of a finding, but they don't test the reliability of a finding. If one such study found a particularly large or small effect, we couldn't tell whether that deviation from the average resulted from the differences in the protocol or just from measurement noise. As +Hal Pashler and others have noted, that's the shortcoming of conceptual replications—they do not assess the reliability of an effect because a failure to find the effect can be attributed either to the unreliability of the original effect or to the changes from the original protocol. 

There is a role for such variation in the field, of course. But in my view, we already have plenty of such conceptual replications in the literature (although we generally never see the conceptual replications that don't work). What we don't have enough of are direct replications that accurately measure the reliability of a finding. In my view, we need to measure both the reliability and validity of our findings, and we can't just focus on one of those.