Friday, May 24, 2013

Should media-worthy psychology research be held to a higher standard of evidence?

I've been attending the APS meeting in Washington DC, and this afternoon I sat through three back-to-back sessions on research best practices (5 straight hours without a break!). The sessions included extensive discussion of p-hacking, small samples, and other biases. They focused not just on problems but also on possible solutions, and they ended with a set of brief commentaries from a panel of editors and program directors. In all, it was a fascinating afternoon.

For me, perhaps the most interesting and provocative suggestion came from Danny Kahneman, who argued that studies likely to be of broad public interest should be held to a higher evidentiary standard because they are more likely to influence the public perception of our entire field. He agreed with many of the other panelists that most studies should have much bigger samples (i.e., higher power to measure effects precisely), but argued that media-worthy studies should be held to an even higher standard: journal editors should insist that such studies be highly powered and highly likely to be replicable. He also argued that journal editors and societies must enforce these standards, perhaps establishing guidelines that are phased in over a period of several years, much as government regulations give car manufacturers some years to meet gas mileage standards in their fleets.

Should media-worthy studies be held to a higher evidentiary standard, or should all studies be required to meet the same standards? Would imposing a higher standard for media-worthy studies stifle risky, novel, or innovative research, or would it help to improve such research (or both)? Let's hear your thoughts in the comments.

Update & Clarification: A few people seem to be misunderstanding my question. I'm not asking whether the media should be held to a higher reporting standard. That's a different topic. Rather, I'm asking whether journal editors should hold scientific manuscripts to a higher standard of evidence if the results reported in the manuscript are likely to be media worthy once they are published.

Tuesday, May 21, 2013

Should Grants Count As Research - Part 2

Warning: This post might not be of interest to anyone outside of academia. 

Last week I posted the following thought experiment:
Suppose you have two faculty members whose research productivity, publication rates, citation rates, etc. are identical in every respect. The only difference between them is that Faculty member A has two federally funded grants whereas Faculty member B has no grants. How would you rate their research influence/impact?
The post led to a lot of insightful comments, so you should read those if you haven't. The thought experiment was intended to eliminate any correlation between grants and research productivity in order to explore what should "count" as impact/influence. Some commenters denied the premise of the thought experiment when making their judgments (in a thought experiment, the premise doesn't have to be plausible...), so I'd like to address the premise directly before I give my take on the thought experiment.

Within a given research subfield, faculty with grants are likely more productive than those without. It is not clear how strong that correlation is, though. For some areas of psychology, grants are essential to have any productivity (e.g., fMRI, behavioral neuroscience). I can imagine other areas in which grants are largely superfluous once a lab is established. In either of those research areas, the correlation between grants and productivity likely is low. The most interesting cases are those subfields in which grants are helpful, but not absolutely necessary (e.g., traditional cognitive psychology). It would be interesting to explore the within-subfield correlation between research impact and grant funding in such disciplines. The correlation likely would be positive given that getting grants typically requires prior productivity. I'd hazard a guess that it might not be a strong correlation, though. There must be data—does anyone know the actual numbers?

Now let's return to the issue that motivated the thought experiment. My department, and many others like it, evaluate the contributions of their faculty based on their contributions to research, service, and teaching. Often, the raters are left to their own devices to determine what counts in each category and how to measure it. Some raters use objective metrics, tallying up publications, evaluating citation rates as an index of influence, comparing the relative challenge of publishing in different subfields, etc. Others just grok the overall record and make a subjective assessment. In my discussions with another rater who, like me, prefers an objective approach, we disagreed on how to think about grants. Hence the thought experiment.

As promised, here is my take: 
Provided that a rater has access to a professor's actual research output (publications, citation rates, etc.), grants should be treated as irrelevant when evaluating research impact/influence. So, in the thought experiment, A and B should receive the same research rating. 
In the absence of direct access to the actual research output (and assuming grants are correlated with research impact), grants can serve as a useful, but imperfect proxy for impact. But, when we have access to the things that they proxy for, they add no useful information.

I think an analogy to citation rates is apt. If you lack access to the citation count for an individual paper, you could use the average citation rate of articles in that journal as a proxy for its citation count. It's imperfect, but better than nothing. However, if you have the actual citations for that paper, there's nothing gained by knowing the average citation rate for that journal. Grants are similarly useful as a proxy when you lack other information. For example, for a more junior professor, a grant promises future productivity, making it a good proxy until that professor has enough of a track record to evaluate their research output directly. For senior faculty, we have the track record, so there is no need for a proxy. Grants are just a means to the end of scholarly productivity (i.e., journal articles), so they should not be treated as a research product themselves.

Note that even citation counts are a proxy for impact/influence. We all can identify papers that are cited frequently because of their topic, but are not necessarily influential (or even read by those citing them). Although citations may tell us little about the quality of an article, they do tell us something objective about its influence and impact.

If grants don't count toward research, should we count them at all?

Although I argue that grants should not count toward ratings of research productivity or impact, they should factor into faculty evaluations in a different way:
Grants should be treated as a component of service to the university rather than as an indicator of research productivity or impact.
Universities value grants because they provide a lot of money. Non-academics might not realize that only a subset of the money in a federal grant goes toward research. In fact, more than one third of the grant funds at major universities (in the USA, anyway) go directly to the university to support operating expenses. Consequently, by bringing in grant funding, researchers are doing a service to their university; their grants help the school keep running, so they are highly valued. Even if the researcher produces nothing of research value from the grant, the university still benefits from their service.

A follow-up thought question:

Let's return to the two faculty members in the thought experiment. Posit that they deserve the same research rating and that the grant-funded researcher should receive credit for their service to the university. Now consider that the funded researcher was less efficient than their unfunded colleague: they used federal funding to achieve the same level of productivity as someone who had no funding. In other words, they used a federal resource for something that apparently did not require it.

Funding is a zero-sum game: funding given to one researcher means less funding available for other researchers. If the funding was not needed to conduct the research, should we treat that grant-funded researcher as having done a disservice to their field as a whole? That is, should we credit them with service to their university while simultaneously penalizing their ratings of service to the field or the public because they took needed money away from other researchers? (Note: I'm just being provocative here, but I do think it's worth considering whether more funding should go to those labs that use it most efficiently and productively, those where the increment in funding will lead to the greatest increase in quality research output.)

This further discussion highlights the challenge of evaluating contributions objectively. It's often ambiguous what should count and where it should count. While we're at it,  perhaps we should figure out how to credit academics for all of the other stuff they do, including blogging! Should non-journal or general-audience writing count toward research productivity? Should it count as service to the community? Should it count as non-university teaching? Should it not count at all? I'll be curious to hear your thoughts.

Thursday, May 16, 2013

Evaluating research impact: Should grants count?

Warning: This post likely will not be of interest to anyone outside of academia. 

My department is in the midst of our annual faculty evaluations. Each year, our representative "advisory" committee evaluates all of the faculty in our department based on their research impact/influence, teaching, and service over the past 3 years. The collective ratings are factored into the small merit raises we sometimes receive (depending on state budgets). When I get the privilege of serving on the committee, I use as objective an approach as possible for the research component of the evaluation. I look at publication rates, citation rates, and any other metrics that seem reasonable.

Yesterday, I had a fascinating discussion with a colleague on advisory who similarly looks for objective indices of research impact and influence. The discussion led to an interesting difference of opinion, and I'd be curious to hear what others think. So, here's a thought experiment:

Suppose you have two faculty members whose research productivity, publication rates, citation rates, etc. are identical in every respect. The only difference between them is that Faculty member A has two federally funded grants whereas Faculty member B has no grants. How would you rate their research influence/impact:

Option 1: Faculty member A should receive a higher research score because they have grant funding

Option 2: Faculty member B should receive a higher research score because they do not have grant funding

Option 3:  Faculty members A and B should receive the same research score

Please leave your opinion in the comments. I'll post my own answer and some further discussion soon.

Notes and clarifications about Registered Replication Reports

We've had a great response to our announced protocol to replicate Jonathan Schooler's verbal overshadowing effect, the first approved protocol for the new Registered Replication Reports at Perspectives on Psychological Science. Not surprisingly, we've also gotten some important questions.

I will occasionally post information about Replication Reports both here and on my Google+ page. If you are interested in participating in such projects or just in learning more about them, check back often. Here are a few clarifications:

Sample size
The stated sample size in the protocol is the minimum required. We strongly encourage the use of larger samples if at all possible. Although the reports will not focus on whether individual studies reject the null hypothesis (we're not tallying succeed/fail decisions based on p<.05), greater sample sizes will give a more precise estimate of the true underlying effect. The larger the sample, the smaller the confidence interval around your effect size estimate, and the better the meta-analytic effect size estimate across studies will be. So, please use as large a sample as is practical, and specify your proposed sample size in the Secondary Replication Proposal Form when you submit it.
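To make the precision point concrete, here's a minimal Python sketch of how the 95% confidence interval around a two-group effect size (Cohen's d) narrows as the per-group sample size grows. The effect size and sample sizes below are illustrative placeholders, not numbers from the protocol, and the standard-error formula is a standard large-sample approximation:

```python
import math

def se_of_d(d, n_per_group):
    """Approximate standard error of Cohen's d for a two-group design
    (common large-sample approximation)."""
    return math.sqrt(2 / n_per_group + d**2 / (4 * n_per_group))

d = 0.4  # hypothetical true effect size, for illustration only
for n in (20, 50, 100, 200):
    half_width = 1.96 * se_of_d(d, n)  # half-width of a 95% CI
    print(f"n = {n:3d} per group -> 95% CI roughly d = {d} +/- {half_width:.2f}")
```

The interval half-width shrinks roughly with the square root of the sample size, which is why doubling the minimum sample buys a noticeably tighter estimate and a more useful contribution to the meta-analysis.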

Necessary deviations from the protocol:
Please specify any necessary deviations from the protocol in your submission. The editors will review those deviations to make sure they do not substantively change the protocol. For example, several people have noted that the control task—naming states and capitals—might not work for subjects outside the USA. We have discussed this issue with Jonathan Schooler, and we have agreed that labs located outside the United States could use a countries/capitals alternative if necessary. Note that if a deviation in protocol would mean that the study is not a direct replication, we will not be able to approve it. These must be direct replications, not extensions of the result or conceptual replications that differ in important respects from the original. Please do not justify deviations by noting that the study will show something new and different from the original—that's not the goal.

Could it be done better?
The goal of Registered Replication Reports is different from that of a traditional journal article in that we are focusing on direct replications of an effect. No study is perfect, and any study can be improved. We hope to choose studies for Replication Reports that do not have fundamentally flawed designs even if they have quirks that might not optimally test a theoretical question. We might consider improvements to the measurement of the dependent variable, but not if they change the effect being measured. For example, we would consider computerized presentations for a design that originally was conducted using slides or paper, but only if the presentation did not change the nature of the dependent measure. More precise measurement of the same dependent measure (e.g., computerized timing rather than hand-timing) generally will be fine. Similarly, we would permit computerized presentation using E-Prime even if the original study was conducted using MATLAB. The guiding principle is whether the change fundamentally alters what is being measured. All studies in a Replication Report should be measuring the same outcome.

Tuesday, May 14, 2013

Announcing the first Registered Replication Report protocol!

I am pleased to announce that Perspectives on Psychological Science has today released the first approved protocol for a Registered Replication Report. The protocol is for a replication of the core finding from the following article: 
Schooler, J. W., & Engstler-Schooler, T. Y. (1990). Verbal overshadowing of visual memories: Some things are better left unsaid. Cognitive Psychology, 22, 36-71.
We are now inviting secondary replicator proposals for those interested in contributing to this Registered Replication Report. See below for more information and instructions, and look for further information from APS this morning as well.

The rationale for replicating this study:
Prior to the original finding of verbal overshadowing, most memory research suggested that any rehearsal of to-be-remembered materials would enhance recall of those materials. The original verbal overshadowing result was both theoretically important and surprising because it showed that verbally rehearsing an experienced event impaired memory for visual details from that event. The finding suggested that eyewitness recollection might be impaired by asking witnesses to describe what they saw, a result with both practical and theoretical importance. Over the ensuing years, the study's original author (Jonathan Schooler) has tried to reproduce that finding, and the measured effect sizes were substantially smaller than those of the original study—the effect seems to be more temperamental than initially thought. Despite receiving more than 500 citations since it was first published in 1990, few other laboratories have attempted direct replications of the crucial first study. Schooler attributes the reduction in effect size across his own attempts to an active mechanism, the so-called "decline effect." His writing about the reduced effect size of this result has received extensive coverage in journals and the popular media, including commentaries in Nature and a feature article in the New Yorker. The effect itself appears regularly in cognitive psychology textbooks as well. Given the uncertainty about the size of the effect, direct replication of the original study by multiple laboratories will help determine the robustness of the interfering effects of verbal rehearsal on recognition of visual materials.
About Registered Replication Reports:
Registered Replication Reports are a new article type at Perspectives on Psychological Science designed to better estimate the size of influential effects, especially those for which there is controversy about the effect size or for which there have been few or no direct replications. Alex Holcombe and I are serving as associate editors for these reports, with Bobbie Spellman as editor in chief. 
As part of the review process for these reports, we develop a protocol for a direct replication and then make that protocol public so that multiple laboratories can contribute their own, independent, direct replications. The protocol is pre-registered and vetted for accuracy before any data collection begins. The collected set of replications then will be published together as part of a single article in the pages of Perspectives, with all results published regardless of the outcome. All laboratories contributing a replication will be authors on the final manuscript, with their individual contributions also identified alongside their results. The final reports will be open-access, and all data from each replication attempt will be posted online at the Open Science Framework. The end result of this replication effort will be a meta-analytic assessment, across all the direct replications, of the size of the effect. With all data and results available and open-access, other researchers will be able to aid in understanding the underlying effect by contributing re-analyses or commentaries.
More information and how to contribute a replication:
  • Active protocols for Registered Replication Reports at Perspectives. (This one will be added later today, with more to come soon.)
  • Open Science Framework page for the Schooler & Engstler-Schooler (1990) protocol. The page includes the approved protocol, all experimental materials, and instructions for joining the project. 

Monday, May 13, 2013

Registered Replication Reports - Stay Tuned!

Almost exactly one year ago, Alex Holcombe and I met in Florida (while at the Vision Sciences conference) to discuss our concerns with the state of the scientific literature and the problems with the incentive structure for publishing. Now, we're excited to report that our schemes are about to bear fruit. 

Tomorrow, we will be announcing the first of what we hope will be many Registered Replication Report protocols to be published at APS's journal, Perspectives on Psychological Science. I'll post details here tomorrow morning - stay tuned!

I thought it might be interesting, before that announcement, to give a little of the backstory for how this new initiative came about. 


For good reasons, many journals favor novelty and originality. With adequately powered studies, that preference might be okay, but most psychology studies are woefully underpowered, meaning that the effect size estimates they provide are relatively unstable and noisy. Moreover, the emphasis on null hypothesis significance testing (NHST) means that statistically significant results are treated as robust, even when a study is underpowered. We felt that the real emphasis should be on estimating the actual size of effects in the world, and for that purpose, statistical significance is the wrong metric. In fact, as this video from Geoff Cumming shows, the p-value from an individual study is not particularly diagnostic of the size of the underlying effect (unless the sample size is huge).
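The same point can be shown with a quick simulation (a hedged Python sketch of my own, not Cumming's demo): repeated runs of an identically designed study with a fixed true effect produce wildly different p-values at typical sample sizes.

```python
import random
import statistics
from math import erf, sqrt

random.seed(1)

def two_sample_p(n, true_d):
    """Simulate one two-group study and return an approximate two-sided
    p-value (normal approximation to the t-test; fine for illustration)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(true_d, 1) for _ in range(n)]
    se = sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
    z = (statistics.mean(b) - statistics.mean(a)) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Same true effect (d = 0.5), same n = 30 per group, ten "replications":
for i in range(10):
    print(f"run {i + 1}: p = {two_sample_p(30, 0.5):.3f}")
```

With a modest sample, the p-values bounce from "highly significant" to "nowhere near significant" even though the underlying effect never changes, which is exactly why a single study's p-value tells you little about the effect's true size.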

The problem with relying on statistical significance is amplified by a slew of questionable practices (p-hacking, optional stopping, data peeking, and the like) that have received much attention recently. Although these issues with NHST have been well documented for decades with little change in practices from researchers, Alex and I had both encountered increased interest in confronting these problems and changing how the field does business. The problem was one of incentives.

When dealing with inference from a study to the true size of an effect in reality, direct replications are needed. Only through multiple, direct replications using a shared protocol can we arrive at an accurate meta-analytic effect size estimate that overcomes the inferential shortcomings of any individual study. (The one exception might be studies that are so highly powered that they effectively do not require an inference to generalize to the full population.) Traditional meta-analyses can help, but they are subject to uncontrolled differences in procedures, file drawer problems, etc. Yet, the incentives against publishing direct replications have been enormous.
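The pooled estimate we have in mind can be sketched as a simple fixed-effect, inverse-variance-weighted meta-analysis. The per-lab effect sizes and standard errors below are made-up placeholders, not real replication data:

```python
from math import sqrt

# Hypothetical per-lab results: (effect size estimate, standard error)
labs = [(0.45, 0.20), (0.10, 0.15), (0.30, 0.25), (0.22, 0.18)]

# Each lab is weighted by the inverse of its sampling variance,
# so more precise (larger) studies count for more.
weights = [1 / se**2 for _, se in labs]
pooled = sum(w * d for (d, _), w in zip(labs, weights)) / sum(weights)
pooled_se = sqrt(1 / sum(weights))

print(f"pooled effect = {pooled:.3f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.3f} to {pooled + 1.96 * pooled_se:.3f})")
```

Note that the pooled standard error is smaller than any single lab's, which is the whole appeal of combining multiple direct replications of a shared protocol rather than relying on any one study's estimate.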

Publication is the currency of our field, but direct replications have been undervalued relative to novelty. Anyone who has tried to publish a direct replication that did not find the same effect as the original knows the challenges of overcoming biases in the publication process. There are many reasons a replication might produce a smaller effect, and those reasons often provide a rationale for rejection. 

These factors, taken in combination, have produced a proliferation of exciting, counter-intuitive, new findings with almost no published direct replications to verify the size of these effects in reality and few publications of studies that disconfirm original findings.  Our goal was to establish a new journal, one dedicated to publishing direct replications of important results, that could help to change those incentives. We felt the field needed a journal that focused not on publishing "null results" but on publishing the outcome of direct replications regardless of their outcome. 

We wanted our new journal to adhere to the following principles:

  • Use vetted, pre-registered plans for the study procedure and analysis so that each study would be an accurate, direct replication of an original result
  • Have all studies use the same protocol, with the results published regardless of the outcome
  • Identify studies that have high replication value, those that are theoretically or practically important but that have not been directly replicated
  • Publish the findings in an open-access format so that they could provide a definitive assessment of the size of an effect that would be accessible both to researchers and to the public (and media)
  • Make all data from each study publicly available as well
  • Emphasize estimating the size of an effect in the population rather than the statistical significance (success/failure) of individual replication attempts

The journal would provide an outlet for direct replication, providing an incentive for conducting such studies. If successful, it could induce broader changes in the field. For example, it might help to stem the use of questionable research practices that lead to significant but possibly spurious results. If you used questionable research practices to obtain a spiffy new finding, the prospect of having multiple labs attempt direct replications would give you pause. We hoped that the presence of such a journal might lead researchers to verify their own results before publishing, ideally using a pre-registered design and analysis and greater power. We viewed this journal not as an effort to debunk iffy findings (although that might happen), but as an effort to shore up our science for the future.

APS answers the call

Alex and I did a lot of legwork in developing our idea for the journal. We explored possible publishers and outlets, consulted colleagues, developed a mission statement, guiding principles, procedures for the submission and review process, lists of possible editorial board members, possible journal names, etc. By mid-July of last year, we were about ready to enter the final stages of development. At that point, we decided to seek guidance from experienced editors and leaders of academic societies to explore the factors that might lead to the success or failure of a new journal. I wrote to several publishing luminaries and colleagues to seek their advice, including Alan Kraut and Roddy Roediger at APS. All had excellent suggestions and feedback. 

Unexpectedly, a few weeks later Alan Kraut asked me to attend a small meeting in St. Louis to discuss the state of publishing in our field with some of the publishing board members and others at APS. As it turned out, APS had been actively working on a set of initiatives to improve reporting and publishing practices, with Eric Eich (editor in chief of Psychological Science) taking the lead. I couldn't attend the start of the meeting due to prior obligations, but I arrived at dinner after the rest of the attendees had been meeting all day. It was there that I learned that the group had discussed the ideas that Alex and I had developed, and they thought APS should adopt them. I was happily stunned. That APS wanted to address the issues in publishing head on has given me more cause for optimism about the state of our science than I have had for years. 

As with any large-scale new society initiative, we had to overcome some resistance, but the process was remarkably smooth given the scope of what we are attempting. For some time, it was unclear whether the new registered replication reports would be published independently or incorporated into one of APS's journals. Then, after it became clear that hosting it in an established journal would give the best odds of success, it was unclear which journal would house it. In the end, with Bobbie Spellman's advocacy as editor in chief, we decided that it would be published in Perspectives on Psychological Science. The journal has become the go-to place for discussions of scientific practices in psychology, and it provides a natural home for this experimental initiative.

Once it became clear that Perspectives would host the Registered Replication Reports, Alex and I took on roles as associate editors and refined our materials to better fit their new home. We publicly launched the Registered Replication Reports initiative in March, and in the month or so since then we have received a number of excellent proposals, many of which are wending their way through the process of devising and vetting an accurate protocol for a direct replication. Over the coming months, we will be announcing a number of accepted protocols. And we hope that many researchers in the field will contribute their own direct replications to these projects.

The first approved protocol, developed with the aid of the original author, will be announced tomorrow! Stay tuned.

Wednesday, May 1, 2013

A step too far: What replications do and do not imply

An article that appeared in Nature yesterday focused on a recent paper in PLoS ONE that reported a failure to replicate a classic social priming study by Ap Dijksterhuis. Unfortunately, the headline and thrust of that Nature piece incorrectly linked a failed replication attempt to the noted case of Stapel's fraud. It inappropriately impugned an entire discipline (social psychology) and country (the Netherlands) based on a single paper that was unable to replicate a few related priming effects. That's not just a step, but an entire staircase too far.

Replication failures do NOT indicate fraud. They are a part of the normal scientific process. Even if the original effect is completely true, some proportion of studies will fail to replicate it. And, even if it is false, one negative result does not prove that. Replications can fail for many reasons (and spurious positive results can occur as well). Replications like the PLoS paper add evidence. Perhaps the original effect is a little weaker than previously thought. Or, perhaps it is more sensitive to subtle differences in method. Or, perhaps the replication attempts were not done correctly. Or, perhaps the original study was not done correctly. Only with multiple, independent, direct replications can we better estimate the true underlying effect. A single result, positive or negative, does not provide definitive evidence. It is informative. Just not that informative.

I spoke with the author of the Nature piece for 20 minutes when she was researching it. I didn't comment on the latest replication controversy. Instead, I talked with her about how psychology is taking the lead on addressing replicability and best practices. I talked about the great new initiatives at APS and other journals, and the changing incentives that will promote publication of direct replication attempts. Those changing incentives will improve the state of our science going forward, and they might generalize to other fields as well. Unfortunately, none of those positive comments made it into the piece.

If you want to read more, I'd recommend what  had to say about the controversy.

Update: Fixed some formatting.

Update 2: I fixed the typo in the title of the post (to -> too) almost immediately after publishing, but blogger won't let me fix the link. I guess it's permanent. Blech.