Wednesday, December 26, 2012

Journals of null results and the goal of replication

Here is my response to the following question that +Gary Marcus forwarded me from one of his readers:
Is there a place in experimental science for a journal dedicated to publishing "failed" experiments? Or would publication in a failed-studies journal be so ignominious for the scientists involved as not to be worthwhile? Does a "failed-studies" journal have any chance of success (no pun intended)?

Over the years, there have been a number of attempts to form "null results" journals. Currently, the field has the Journal of Articles in Support of the Null Hypothesis (there may well be others). As a rule, such journals are not terribly successful. They tend to become an outlet for studies that can't get published anywhere else. And, given that there are many reasons for failed replications, people generally don't devote much attention to them.

Journals like PLoS One have been doing a better job than many others of publishing direct replication attempts. They emphasize methodological soundness over theoretical novelty, which fits the goal of a journal that publishes replication attempts whether or not they work. There are also now websites that compile replication attempts (psychfiledrawer.org); the main goal of that site is to make researchers aware of existing replication attempts.

For me, there's a bigger problem with null results journals and websites: They treat replications as an attempt to make a binary decision about the existence of an effect. The replication either succeeds or fails, and there's no intermediate state of the world. Yet, in my view, the goal of replication should be to provide a more accurate estimate of the true effect, not to decide whether a replication is a failure or success.

Few replication attempts should lead to a binary succeed/fail judgment. Some will show the original finding to be a true false positive with no actual effect, but most will just show that the original study overestimated the size of the effect (I say "most" because publication bias ensures that many reported effects overestimate the true effects). The goal of replication should be to achieve greater and greater confidence in the estimate of the actual effect. Only with repeated replication can we zero in on that estimate. The larger the new study (e.g., the more subjects tested), the better the estimate.

The initiatives I'm pushing behind the scenes (more on those soon) are meant to encourage multiple replications using identical protocols in order to achieve a better estimate of the true effect. One failure to replicate is no more informative than one positive effect -- both results could be false. With repeated replication, though, we get a sense of what the real effect actually is.
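As a minimal sketch of what "zeroing in" looks like in practice (the studies, effect sizes, and sample sizes below are hypothetical, not from any real dataset), several replications run with a common protocol can be pooled with inverse-variance weights, and each added study narrows the confidence interval around the estimated effect:

```python
# A minimal sketch (hypothetical numbers) of fixed-effect, inverse-variance
# pooling: each added replication narrows the confidence interval around the
# pooled effect and pulls it away from an inflated original result.
import numpy as np

# Standardized effect sizes (Cohen's d) and per-group sample sizes for a
# hypothetical original study followed by four direct replications.
d = np.array([0.80, 0.25, 0.30, 0.18, 0.28])  # original study listed first
n = np.array([20, 80, 120, 60, 100])          # participants per group

# Approximate sampling variance of d for an equal-n, two-group design.
var_d = 2.0 / n + d**2 / (4.0 * n)
w = 1.0 / var_d                               # inverse-variance weights

pooled = np.sum(w * d) / np.sum(w)            # weighted mean effect size
se = np.sqrt(1.0 / np.sum(w))                 # standard error of the pooled estimate
low, high = pooled - 1.96 * se, pooled + 1.96 * se

print(f"pooled d = {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only the hypothetical original study, the interval is wide and centered on an inflated effect; adding the replications both shrinks the interval and pulls the estimate toward the smaller, more realistic values.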

4 comments:

  1. In truth, PsychFileDrawer's creators do not subscribe to the view that the outcome of the experiment is best thought of simply in binary terms ("significant" vs. "nonsignificant"), and our FAQs encourage people to use common sense (e.g., don't label a strong trend in the same direction as the original result a "failure to replicate" merely because it doesn't reach significance). But this informality does not provide any bright lines, unfortunately. We agree that the meta-analytic approach of synthesizing effect sizes and confidence intervals is generally a good way to go, but felt it would be inadvisable to require this when many investigators are not easily able to provide this information.

    Of course, if literatures contain pseudo-effects spawned by type-1 errors or p-hacking or fraud, then the synthetic mean may perpetuate confusion, while the more simple-minded conclusion "oops-- nothing there!" may sometimes be more on target.

    The big issue, though, is still incentivizing replications (here, PsychFileDrawer has done more to dramatize the problem than to solve it). Dan's approach has great promise not only to generate a balanced and smart review process for replication attempts, but also to incentivize doing replication attempts in the first place. And that in turn should help disincentivize publishing stuff that won't replicate. A virtuous cycle. Go Dan and Alex!

    Replies
    1. Thanks for the clarification. Yes, a tally of succeed/fail will help in cases when type-1 errors, p-hacking, and fraud produce a truly false positive. In those cases, there truly will be no "there" there. I guess I'm hopeful that true false positives are relatively rare, although there might well be some prominent cases.

      My hope is that, with multiple replications sharing a common protocol, we will achieve an accurate meta-analytic effect size estimate. Moreover, we can plot each effect size and its confidence interval graphically to illustrate how the effects cluster. If it turns out that the original effect really is a false positive (i.e., a true effect size of 0), that will be clear from the plot -- the original result will stand apart from the replication attempts, which should all cluster around an effect size of 0. True, the original false positive would elevate the mean effect size estimate, but the figure could make clear why. I think sites like PsychFileDrawer are the ideal place to publicize and highlight such cumulative results, to help make people aware of effects that others are struggling to replicate. People could also take their cue from PsychFileDrawer when developing their protocol and deciding which studies need a more precise effect size estimate.
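      To illustrate the kind of plot I have in mind, here is a rough sketch (my own, with made-up numbers, not anything PsychFileDrawer provides) that lists each study's effect size with a 95% confidence interval, so an outlying original result is visible against replications clustering near zero:

      ```python
      # Hypothetical effect sizes: an inflated original plus four replications.
      import numpy as np

      studies = ["original", "rep 1", "rep 2", "rep 3", "rep 4"]
      d = np.array([0.75, 0.05, -0.02, 0.08, 0.01])  # standardized effects (Cohen's d)
      n = np.array([25, 90, 110, 70, 95])            # per-group sample sizes

      se = np.sqrt(2.0 / n + d**2 / (4.0 * n))       # approximate SE of d, equal-n design
      low, high = d - 1.96 * se, d + 1.96 * se

      # Crude text "forest plot": one row per study, effect and 95% CI.
      for name, est, lo, hi in zip(studies, d, low, high):
          print(f"{name:>8}: d = {est:+.2f}  [{lo:+.2f}, {hi:+.2f}]")
      ```

      A real figure would draw these as points with error bars, but even the text version shows the original sitting well outside the cluster of replications.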

      I completely agree on the incentive cycle. I'll be posting more about that soon.

  2. Good piece. Two points: (1) I agree that forcing the outcomes of replication attempts into either a "success" or a "failure" bin is not very insightful and even a little stigmatizing. In our recent paper, http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0051382, we (almost completely) avoided using these labels, trying to take a more nuanced approach, which does not always make for effective communication. (2) I like the idea of "multiple replications using identical protocols." With our experiments, this would be extremely easy. We can share our data-collection and data-analysis programs. The experiment will take a couple of days tops to run and will only set you back a couple of hundred dollars. Experimenter effects will be practically nonexistent.

    Replies
    1. Thanks Rolf. It is challenging to avoid the succeed/fail terminology when discussing replications, but I think it's a useful exercise.

      I'm glad you like the idea of multiple replications using identical protocols. I'm hoping, once the details of the initiative I keep hinting about go public, your reaction will be the norm. I've received several other emails from people who hold the same perspective you do, so I'm hoping there will be a groundswell of interest. I would love to see labs make their code and methods public to facilitate replications. Once the initiatives launch (ideally in the next month or so), I'll post the details here and elsewhere. The more labs that lead by example, the better!
