Tuesday, June 4, 2013

Direct replication and conceptual replication - more thoughts

In response to my response to his post, Rolf Zwaan wrote another interesting post on the limits and value of direct replication.

In his new post, Rolf makes a number of interesting points, and I'd like to highlight and discuss a couple of them:
There is no clear dividing line between direct and conceptual replications but what I am advocating are not conceptual replications.
I agree that the line dividing direct and conceptual replication is not clear, but I do think there is a boundary there. For me, a direct replication sticks as close to the original method as possible. No study is an exact replication, of course: all replications will involve different subjects at a different time, and they typically will involve other minor changes (e.g., more modern computers). Still, a direct replication can be functionally the same if it uses the same materials, tasks, etc. and is designed to generalize across the same variations as the original. There's some gray area in what constitutes the same target generalization. Our approach with Registered Replication Reports is to ask the original authors to specify a range of tolerances on the parameters of the study. The more they restrict those tolerances, the more they are retrospectively limiting the generality and scope of their original claims. So, we encourage the original authors to be as liberal in their definitions as possible in order to maximize the generality of their effect. If the original author insists that the study could only be conducted in 1989 with subjects from a particular university, then a direct replication is impossible. However, we should then treat the original effect as unreliable and irreproducible, and it should not factor into larger theorizing. (The original paper should also have mentioned that limited generality. Of course, if it had, it likely would not have been published.)

In my own work, I've recently started adding a sentence to my discussion specifying the generalization target for the study. For example, in a just-published study of overconfidence in judgments by bridge players, I added the following short section to the end of the method section:
Limits in scope and generalizability
The results from this study should generalize to experienced bridge players at other duplicate bridge clubs, as well as to other domains in which players regularly compete against the same group of players in games of skill, provided that the outcome of any individual match or session is not determined entirely by skill. 
I might be wrong about how my findings will generalize, but at least I've stated what I believe the scope to be. I would love to see more papers (and press releases) state their generalizability target. That might avoid a lot of needless follow-up work, and it might also help journalists to accurately convey the scope of new research.

Rolf also notes that "Most of the studies that have been focused on in discussions on reproducibility are what you might call one-shot between-subjects studies." Danny Kahneman eloquently made a similar point at a session on scientific best practices at the APS meeting a couple of weeks ago. His point was that reliability is a much bigger problem for studies in which each participant contributes just one data point to one condition and all comparisons between conditions are conducted across subjects. Within-subject designs in which each participant contributes many data points to each condition are more powerful and tend to produce more reliable outcomes. With repeated measurement of each condition for each participant, it is possible to obtain a more precise estimate of performance in each condition. Psychophysics research takes this principle to the extreme, testing each subject for hundreds or sometimes thousands of trials in each condition. Such studies can test as few as 2 or 3 participants and still obtain highly reliable results.
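
To make the precision argument concrete, here is a minimal sketch of the arithmetic. It assumes a deliberately simple additive model that is not taken from Rolf's post or from Kahneman's talk: each score is a condition mean, plus a stable subject offset, plus independent trial noise, with no subject-by-condition interaction. The function names, parameter names, and example numbers are all hypothetical.

```python
import math


def se_between(n_per_group, sd_subject, sd_trial):
    """Standard error of the difference between two condition means when
    each participant contributes ONE score to ONE condition.

    Assumed model: score = condition mean + subject offset + trial noise,
    so each score carries both subject variance and trial variance.
    """
    variance_per_score = sd_subject ** 2 + sd_trial ** 2
    return math.sqrt(2 * variance_per_score / n_per_group)


def se_within(n_subjects, trials_per_condition, sd_trial):
    """Standard error of the mean condition difference when every
    participant completes many trials in BOTH conditions.

    Under the assumed additive model the subject offset cancels in each
    participant's difference score, leaving only averaged trial noise.
    """
    variance_per_difference = 2 * sd_trial ** 2 / trials_per_condition
    return math.sqrt(variance_per_difference / n_subjects)


# Hypothetical numbers: 20 participants per group with one score each,
# versus 3 participants with 500 trials per condition.
one_shot = se_between(n_per_group=20, sd_subject=1.0, sd_trial=1.0)
psychophysics_style = se_within(n_subjects=3, trials_per_condition=500, sd_trial=1.0)
print(one_shot, psychophysics_style)
```

With these made-up values, the one-shot between-subjects design has a standard error of roughly 0.45, while the three-participant repeated-measures design has one of roughly 0.04. If participants differ in the size of the effect itself, the additive assumption fails and the within-subject advantage shrinks, so treat this as an illustration rather than a power analysis.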

Rolf correctly notes that many of the findings that have been questioned are between-subject designs with one measurement for each participant. As someone who regularly does such studies myself (e.g., inattentional blindness studies often take that form), I'm aware of the dangers that small samples might cause. And, I'm also aware that my selection of stimuli or tasks might limit the generality of my findings.

Rolf's larger point, though, is that it should be considered a direct replication to vary things in ways that are consistent with the theory that governs the study itself. That is because it is impossible to generalize across items if there is only one item type in a design. Fair enough. That said, I think we still disagree on the contrast between direct and conceptual replication. Direct replication tests whether we can obtain the same results with the same task. Yes, changing some aspect of the design (using a different lineup, for example) would allow us to test the generality of a finding across items. You could construe that as a direct replication, but imagine if it didn't work after changing that element. Does that mean the original result did not replicate? I would argue no. Such a study does not test whether the original result replicates. Rather, it tests whether the original result generalizes to other materials. That's crucial to know, but it's a different question.

In sum, Rolf and I appear to be in complete agreement that we need to know both whether an effect is reliable and whether it applies more broadly. But, before we begin testing whether the effect generalizes to other materials, it seems worthwhile to first make sure the original effect with the original items is robust. Perhaps those approaches could be combined in some cases, adding conditions to test the generalizability question at the same time that we test the reliability question (we're considering that approach with another protocol now, actually).