Daniel Simons

Earliest proposal for a new registered report format?

2022-06-23T08:39:00.000-05:00

Earliest proposal for a new registered report format?

Back in March of 2012, about a year before we launched registered replication reports at Perspectives on Psychological Science and Chris Chambers and crew launched registered reports at Cortex, Alex Holcombe and I had been discussing ways to increase the incentives for replication studies. The discussion happened on the now-defunct Google+ platform. Those discussions led to plans for a new journal which soon after morphed into the registered replication reports at Perspectives.

Yesterday, Alex uncovered what I think is my first public post about the idea, dated 14 March 2012. I've copied it in full below, editing only to remove now broken Google+ tags/links.

The post describes the value of what are now called "registered reports," detailing what form such articles would take, how they would be reviewed, and how the new format would improve the publishing process and incentives (at least for replication studies). Although the post was specifically focused on replication studies, most of the elements described in the post are now a standard part of the registered report model today.

If you know of earlier proposals for adoption of registered reports as an article type, let me know. It would be nice to know the full history of this format.

Originally posted to Google+ on 14 March 2012

Outlets to publicize replication attempts

Yesterday, Alex Holcombe started a thread discussing how best to encourage people to post replication attempts to psychfiledrawer.org. PsychFileDrawer.org is great site where you can post the details of your successful and failed replication attempts of other studies. Alex's question: If you had a little bit of money to encourage people to post, how would you use it. Lots of interesting comments, there.

After some discussions of this and related issues with faculty and students at Illinois, I've been wondering whether a new type of journal might be successful. Below is an idea I posted (in slightly modified form) to Alex Holcombe's thread. What do you think?
------
I know that most "null results" journals generally haven't been successful. I wonder, though, whether an open access journal that published both replication successes and failures might be. (Note: Journals like PLoS go part of the way toward what I'm thinking.)

Here's my idea: Researchers submit the intro (extremely short -- no need to review the literature) and method section for review, along with an analysis plan that specifies details like the sample size, assumed effect size, methods for eliminating outliers, a-priori power, etc. They would not submit results. Only the intro, method, and analysis plan would undergo peer review. Once the replication plan passed peer review, the results would be published regardless of outcome. But, in order to be published, the method and results would have to follow the pre-approved plan exactly.

Here are the benefits of this approach:

1) It increase the incentives for people to do replications—it could result in an actual journal article, so it might be worthwhile as an undergraduate thesis project or a grad student project.

2) It would encounter less resistance from the authors of the original publication during the review process (a major problem when publishing failed replications) -- their goal would be to verify that the methods are acceptable to them given what they had done originally. If they thought the method and analysis plan to be acceptable in advance, they wouldn't have grounds to object if the result didn't support them (and they should be excited if it supports them).

3) It would encourage direct replication rather than conceptual replications that differ in both method and analysis from the original.

4) It would lead to more details from original authors about their methods during the review process, avoiding the inevitable complains that surface when trying to publish a failed replication in a traditional journal (the incentive for the original authors in that case is to highlight any method difference and claim it to be the reason for the failure).

5) The submission and review process could be relatively quick as well given that there wouldn't be lengthy reviews, and the original authors could always be reviewers. I would favor all reviews being signed as well so that there can be no objections later.

6) It's possible that such a journal would publish a lot of replication attempts for the same paper, but that's okay -- better cumulative effect size estimate that way.

7) The end result could be posted to sites like PsychFileDrawer as well, making meta-analysis of the size of an effect possible (and more accurate).

What do you think? Any academic society publishers out there think that this might be a viable model.

Happy to be wrong (Covid update)

2021-09-19T09:44:00.002-05:00

Earlier this semester, I thought we likely had out-of-control spread on campus from people with breakthrough cases that were going undetected.

It looks like I was wrong.

The University of Illinois decided not to do surveillance testing this fall semester, instead focusing on regular testing only of unvaccinated people and selective testing of vaccinated people if they were in a place with an outbreak. (I still think that's a mistake - see below). The campus also returned almost all of its students—including a record-size first-year class—and encouraged attendance at several mass gathering events including football games. I expected this combination to be disastrous. And, at first, it looked like I was right.

In the first weeks of the semester, we saw a substantial spike in cases. Despite daily testing numbers that were just a fraction of those from last year, we saw new cases at a rate comparable to the February surge and daily positivity rates as high as during parts of the fall 2020 surge.

Over the past week or so, though, the daily numbers of cases have dropped dramatically—we're currently down to an average of under 7 detected cases per day. What we don't know, because we're not testing everyone, is whether those 7/day constitute all of the cases or if we would detect many more than that if we were testing everyone. That is, would we still see only 7 cases/day if we were conducting 10,000 to 20,000 tests/day necessary for surveillance rather than the 2,000 to 5,000 per day we're currently conducting. Or would the number of cases scale proportionally with the number of tests.

Aside: We wouldn't even need to test everyone to know—we could randomly select a few thousand vaccinated people to test each week and that would give us an estimate. As far as I know, were not doing that, so we remain blind to what's actually happening. I don't know why we're not doing that sort of testing (other than cost).

Here are some possible reasons things might be going better than I had expected they would:

Maybe high vaccination rates coupled with masking requirements were enough to eliminate spread on campus.
Maybe spread from breakthrough cases to people who also are vaccinated is even less likely than thought, so undetected breakthrough cases aren't spreading Covid further (given high vaccination rates on campus).
Maybe those people who are most likely to engage in riskier behavior—the ones who were most responsible for the outbreaks in the fall/spring—have extra immunity because they both had covid already and have been vaccinated. So, even if they're taking risks, they aren't getting Covid and spreading it. And, everyone else is doing what they were all of last year, taking appropriate precautions to avoid getting Covid or giving it to others.
Maybe breakthrough cases acquired from another breakthrough case are more likely to be asymptomatic, in which case we could have spread on campus but it wouldn't be detected because those students wouldn't seek out a test. (worst case that could be detected via the random sampling strategy I noted above.).

Regardless of the reasons, it's great that case numbers are down, and I hope it means that spread on campus is under control.

That said, the pandemic is far from over. Illinois Region 6, the set of counties that include Champaign, is hovering around 8% positivity (not great). We still have many county residents hospitalized, and we've seen increased deaths over the past month. Southern Illinois has zero open ICU beds for a 20-county region. Other Midwestern states (e.g., Ohio) are seeing a huge surge, and neighboring states are not taking the same precautions that Illinois is (no indoor mask mandates).

For those reasons, I'm not thrilled with unrestricted attendance at sporting events (especially the indoor ones). If people who aren't vaccinated attend those events, including those traveling from outside the area, we could see spread from them to the campus or vice versa. I'd like to see UIUC require negative tests or proof of vaccination to attend campus events (just like the United Center will).

It's a relief and great news that campus appears to be doing relatively well right now. I'm glad that my concerns about the fall semester appear to have been overly alarmist. It'd be wonderful to be proven wrong about that. But, that doesn't mean we shouldn't take precautions, especially when the campus and broader community interact.

Covid responses at Duke and UIUC

2021-08-31T08:46:00.004-05:00

According to the Duke University dashboard, they had 349 positive cases among students last week. Duke has about 15000 students (about 2/3 are grad students) and a 98% vaccination rate. You might recall from last year that Duke implemented a robust testing process last year and had far fewer cases than the University of Illinois did. They appear to have continued their approach from last year of testing all of their students. Although their dashboard doesn't break down cases based on vaccination status, most of these positive cases are likely to be breakthrough cases. In other words, they are testing everyone so that they will have some idea of how bad things are on campus. That has led them to implement mitigation (according to Washington Post article): Indoor and outdoor masking, allowing faculty to shift to virtual teaching for two weeks, suspending indoor dining, limits on student activities.

Duke has a higher vaccination rate than the University of Illinois, and when they tested everyone last week, they found 349 positive cases. When the University of Illinois tested primarily the unvaccinated students, we found 165 cases. Yet, unlike Duke, we don't know how many people are actually infected because we're testing only a small subset of our campus population.

Duke is doing surveillance testing to monitoring the covid situation on its campus and they're taking steps to detect cases among the entire campus population and to stop spread. The University of Illinois is not, even though we could be. We have made no effort to detect cases among the vast majority of students who are vaccinated even though breakthrough cases can spread Covid. And, if we don't know when someone is infected, we can't isolate them to break the chain of transmission. If I had to guess based on our vaccination rate and positive tests among the unvaccinated students, I'd estimate that we have 500+ cases on campus right now and that we've detected only about a third of them. And, those cases we haven't detected will be spreading Covid rapidly.

The University of Illinois has regularly touted our "success" last year in preventing community spread (although the data on that are debatable), so they recognize how important that is. Yet, they went ahead and encouraged a full-capacity crowd at a football game with no meaningful mitigations in place. Many people who attend football games come from the surrounding communities that have low vaccination rates and high levels of spread. If preventing community outbreaks is important to the university, maybe that wasn't such a good idea...

Yesterday, Awais Vaid of the Champaign-Urbana Public Health District (who has worked closely with the university SHIELD team) told the News-Gazette that we'll know in a week whether having 41,000 fans at the game will result in more Covid cases for Champaign County, and he encouraged fans who attended to consider getting tested. Perhaps the possibility that the game would be a super spreader event is something that the university should have raised and addressed before the game. Instead, the university offered free tickets to faculty and staff in order to fill seats (thanks for the invite, but I noped as far away from that one as possible).

Unlike Duke, the University of Illinois appears to have adopted a "let's wait and see what happens" strategy, while simultaneously blinding ourselves so that we can't see clearly. We have the capability to test everyone, but the university has chosen not to. Hoping everything will go well isn't a strategy—Covid doesn't care what we hope will happen.

Covid numbers - August 30, 2021

2021-08-30T11:31:00.001-05:00

During the first week of classes at the University of Illinois, campus testing identified 165 cases of Covid on a total of 22,296 tests.

For comparison, in the first week of the spring semester (Jan 25 - 31), we had a total of 128 cases with 63,208 tests. We're finding more cases than at the comparable point last spring despite testing far fewer people. Whether that's due to Delta or to reduced restrictions and more potential super spreader events (e.g., football games with unrestricted attendance, massive new student orientations with lots of yelling, open bars, etc.) is hard to know.

And, as noted in my post earlier today, we're missing the vast majority of breakthrough cases because we are not doing surveillance testing of vaccinated students. If the vaccines have 60% efficacy against infection and the exposure risks are comparable for vaccinated and unvaccinated students (an assumption that might not be justified), we could have up to 3x as many breakthrough cases that have gone undetected. That is, had we tested everyone on campus, we could well have found a much higher number of cases (perhaps hundreds). Any breakthrough cases that that we didn't detect (because we're not looking) potentially could infect others. Even if people with breakthrough cases do not spread covid as easily as unvaccinated people with Covid, that's still a lot infected people who come into close contact with many other people.

We don't definitive evidence of vaccine efficacy against infection in this population, and we lack information from the university about the total numbers of people being tested, the number of each subgroup of people on campus, or the numbers of vaccinated and unvaccinated people in each group being tested. Without that information, we can only rely on guestimates about efficacy as well as infection rates. But, even if we ignore the possibility of large numbers of undetected breakthrough cases, the first week numbers aren't great - they look worse than at the start of the spring semester. If there are a sizable number of undetected breakthrough cases, as seems inevitable, we won't know how much spread we're seeing on campus and we won't be able to disrupt chains of transmission. I hope breakthrough cases aren't as infectious, but if they are even somewhat infectious, we're risking undetected exponential growth.

Covid at Illinois - 30 August 2021

2021-08-30T07:43:00.002-05:00

During the 2020-2021 academic year, I posted regular updates about Covid at the University of Illinois. I tracked the number of positive cases on campus, estimated the percentage of students infected, and computed various metrics of the infection that were more meaningful than the ones provided by the university (e.g., rolling averages of cases, cumulative cases by semester, cases by week, etc.). Those analyses were necessary because the campus dashboard featured meaningless statistics like 7-day average positivity and the all-time number of tests, neither or which provided useful information to gauge the extent of outbreaks. The campus dashboard also gave no breakdown of the infection percentages, the numbers of students being tested each week, the compliance rate for testing, etc.

In addition to those daily updates, I also discussed many of the logistical problems that contributed to the infection of an estimated 20% of the undergrads on campus last year (4935 out of about 24000) as well as some of the misleading claims coming from the administration. In addition to the daily summaries, I posted a detailed summary of the fall semester issues as well as a more in-depth analysis of how delayed testing results likely contributed to the fall 2020 surge.

This year, changes in testing policies (not testing everyone) mean that I won't be able to estimate the infection rates on campus, and it will be much harder to evaluate whether or not we're in the midst of a large outbreak. It'll be more guesswork than analysis. I'll try to do that guesswork and will post updates here on my blog. Occasionally I'll post analyses as well. But, there's not much benefit to the sort of daily tracking I did last semester because there won't be enough information available on the campus dashboard to do it. I fear that we're running blind this semester. Here's where things stand on campus and why we won't have the information we need.

Vaccine "requirement"

This summer, the campus announced that they would be requiring vaccines. That sounded fantastic, but it turned out that the implementation had no teeth and it was actually more of a "nudge" than a requirement. If you weren't vaccinated — and you didn't need to give any reason why not — you had to continue testing.

Last week, Gov. Pritzger announced that all school and university faculty and staff in Illinois must be vaccinated. Shortly after that announcement, the university announced that they will now mandate vaccination for everyone (except those with a medical or religious exemption). I had thought that "requirement" and "mandate" are synonyms, but apparently they mean different things to our administration. I also have no idea how hard it is to get a religious exemption. In my view, you should be required to present a holy text from your faith showing that vaccines are not permitted...). In any case, it's great that we finally have an actual vaccine requirement/mandate coming into effect.

Changes to testing

Last year, the campus tested everyone twice weekly (allegedly - there was a lot of non-compliance). Over the summer, when cases were low and vaccines were available, the university developed their plan for the fall: They dramatically scaled back on testing by only requiring unvaccinated people to test. The number of testing sites was reduced to 4. The used vaccination as a get-out-of-testing incentive. Unfortunately, they didn't radically change their plans for Delta. They did increase the required testing frequency for unvaccinated undergraduates to every other day because modeling showed that without doing that, people would spread Delta before they knew they were positive. Delta becomes infectious faster than the earlier variants.

However, by electing not to test vaccinated people on campus, we won't detect breakthrough cases (unless they are symptomatic and elect to get tested, by which point it's too late to stop spread). We also won't be able to break the chain of transmission by isolating infected students and quarantining or testing their contacts. We're going to be largely blind to the outbreaks on campus, and they could easily get out of control. The campus has the capacity to test everyone, but they chose not to use it. It's a potentially disastrous decision. Here's why.

The consequences of testing only the unvaccinated

There are far more vaccinated students than unvaccinated ones on campus. About 88% of students are vaccinated, and that soon will be closer to 100% now that we have an actual mandate. If we estimate vaccine efficacy against infection to be approximately 60%, that means there will be nearly 3x as many breakthrough cases than there are cases among the unvaccinated!

To see why, imagine that we test all of the unvaccinated students on campus on a single day and that 1% of them test positive. If we assume that the unvaccinated and vaccinated students had the same exposure to covid via close contacts, that means 0.4% of the vaccinated students would have tested positive had we tested them too (60% efficacy against infection). With about 35,000 undergrads on campus, about 30,800 of them are currently vaccinated (88%) and the remaining 4200 are unvaccinated. If 1% of the 4200 tested positive, that would yield 42 cases. If 0.4% of 30,800 vaccinated students tested positive, that'd be 123 cases. So, we should expect about 3x as many cases among the vaccinated population as among the unvaccinated population. There are a lot more vaccinated than unvaccinated students (which is great, of course).

In the extreme, if 100% of students were vaccinated, and we're only testing unvaccinated students, we could have a massive outbreak on campus via breakthrough cases and we'd know nothing about it (until people start showing symptoms and seek testing or treatment. But by the time they're showing symptoms, they've been infectious for a while).

Last year, the University was (somewhat) able to contain outbreaks by quickly identifying positive cases—often before they were infectious—and then isolating them. We saw the surge, implemented mitigations, and brought numbers down at least somewhat. This semester, up to 3/4 of the initial cases could go undetected. Unlike last year, we'll be blind to spread from those cases until it's too late to stop it.

Vaccines are highly effective at preventing severe illness, hospitalization, and death. And, students might be at lower risk than older people as well. But, the vaccines aren't airtight protection, and even with the campus's indoor mask requirement (again, that's great!), there are many opportunities to spread Covid both on and off campus (see the massive unmasked crowds at the football game, at bars, etc.). If the University allows unmitigated spread on campus, it likely will reach the broader community this semester.

As was the case last year, the University of Illinois is in much better shape than many other universities. First, Illinois is in better shape than many other states. We have a governor who is implementing good public health policies (mandating vaccines for all schools and requiring masking indoors) and a local health district that is proactive and communicates well. We have a campus mask mandate and are implementing a vaccine mandate. Due to the amazing test that researchers developed here and the testing capacity that was built up over the past year, we have the potential to know a lot more about how bad the outbreaks on campus are and to end them before they spread too far. We have the capacity to limit the spread and consequences of Delta, but we're not doing everything we need to be doing.

Even if the University prefers not to test everyone due to costly logistics, there are approaches we could take that would help at least determine how much spread we have on campus. For example, the campus could test randomly selected, somewhat large subsets of vaccinated students each week. If those tests revealed a high rate of positives, we might need to adopt more extensive testing and restrictions. The dashboard could also separate results for vaccinated and unvaccinated people. That would show whether we're seeing a spike in vaccinated people seeking tests (a sign that we have spread). Without some form of systematic surveillance testing, though, we won't know how bad things are until people start getting sick.

2020-12-10T11:46:00.005-06:00

Analyses of Covid trends
at the University of Illinois

Since the university reopened in Fall 2020, I have closely monitored the numbers of Covid cases on campus and the reasons for spread. My analyses are intended to be a reliable source of publicly available information. The University has not always been transparent about the situation on campus, so I have provided additional estimates of the information that was not available. I also provide daily updates about hospitalizations and spread in the community surrounding campus.

Here are links to the various summaries and write-ups I've been tweeting about and posting publicly on Facebook.

I post almost daily updates at http://www.dansimons.com/covid.html. Monitor that page for the most recent daily case numbers, 7-day averages, hospitalizations/deaths in the community, etc. The page includes daily updated graphs and explanations of them.

My summary of the entire fall semester is available at http://www.dansimons.com/Covid/fall2020summary.html

That summary includes a discussion of the modeling that was used to justify reopening as well as how it was both right and badly wrong.
It also discusses how the University placed blame for the initial surge in cases on a handful of non-compliant students without taking any responsibility for logistical failures and inaccurate initial assumptions about compliance.

My analyses of delays in test processing and how they contributed to the spread of Covid on campus is at http://www.dansimons.com/Covid/testTiming.html

My essay on how to interpret positivity ratios correctly, including a discussion of why positivity is the wrong metric to use when you are testing everyone in your population of interest repeatedly (as was the case at the University of Illinois this fall) is at http://www.dansimons.com/Covid/positivity.html

I have turned off comments for this post. If you would like to comment, please do so in response to my regular tweets (@profsimons) on this topic. This post is a placeholder so that the links are findable.

A new journal at APS: AMPPS

2017-05-15T17:17:00.004-05:00

I am thrilled to announce the launch of the newest journal published by the Association for Psychological Science (APS):

Advances in Methods and Practices in Psychological Science (AMPPS)

I will been named the founding editor, and I have assembled what I think will be a terrific editorial team. We are now taking submissions, with the first issue slated for publication in early 2018.

The journal's primary mission is to communicate advances in research methods and practices to the broad membership of APS and beyond. We hope to bridge subfields of psychology, bringing advances from within an area to the rest of the field.

AMPPS will publish articles on research practices, tutorials to help researchers develop new research skills, and empirical papers that illustrate innovative methodological approaches. It will be the new home for the Registered Replication Reports that previously were published at the APS journal Perspectives on Psychological Science (my co-editors for those papers, Alex Holcombe and Jennifer Tackett, have joined the new editorial team). AMPPS will also publish registered reports, adversarial collaborations, multi-lab consortium studies, simulations and re-analyses of existing data, meta-science papers, commentaries, and much more.

You can read more about the journal's mission, see the editorial team, and read the submission guidelines on the journal's website.

APS also issued a press release about the new journal last week.

Please help spread the word! If you tweet or post about it, please use the hashtag #APS_AMPPS

Visual effort and inattentional deafness

2015-12-10T11:01:00.006-06:00

Visual Effort and Inattentional Deafness

Earlier this week I was asked for my thoughts on a new Journal of Neuroscience paper:

Molly, K., Griffiths, T. D., Chait, M., & Lavie, N. (2015). Inattentional deafness: Visual load leads to time-specific suppression of auditory evoked responses. Journal of Neuroscience, 35, 16046-16054.doi: 10.1523/JNEUROSCI.2931-15.2015

In part due to a widely circulated press release, the paper has garnered a ton of media coverage, with headlines like:

Focusing On A Task May Leave You Temporarily Deaf: Study

Did You Know Watching Something Makes You Temporarily Deaf?

Study Explains How Screen Time Causes 'Inattentional Deafness'

The main contribution of the paper was a link between activation in auditory cortex and the behavioral finding of reduced detection of a sound (a brief tone) when performing a more difficult visual task.

This brain-behavior link, not the behavioral result, is the new contribution from this paper. Yet, almost all of the media coverage has focused on the behavioral result which isn't particularly novel. That's unsurprising given that most of the stories just followed the lede of the press release, which was titled:

"Why focusing on a visual task will make us deaf to our surroundings: Concentrating attention on a visual task can render you momentarily 'deaf' to sounds at normal levels, reports a new UCL study funded by the Wellcome Trust"

Here are a few points about this paper that have largely been lost or ignored in the media frenzy (and the press release):

1. The study did not show that people were "deaf to their surroundings." In the study (Experiment 2), people performed an easy or hard visual task while also trying to detect a tone that occurred on half of the trials. When performing the easy visual task, they reported the tone accurately on 92% of the trials. When performing the harder visual task, they reported it accurately on 88% of trials. The key behavioral effect was a 4% reduction in accuracy on the secondary, auditory task when the primary visual task was harder. In other words, people correctly reported the tone on the vast majority of trials even with the hard visual task. That's not deafness. It's excellent performance of a secondary task with just a slight reduction when the primary task is harder.

Aside: much of that small effect on accuracy could be due to a difference in response bias between the conditions (Beta of 3.2 compared to 1.3, a difference reported as p = 0.07 with an underpowered study of only 11 subjects).

2. The behavioral effect of visual load on auditory performance is not original to this paper. In fact, it has been reported by the same lab.

3. A number of other studies have demonstrated costs to detection in one sensory modality when focusing attention on another modality. This paper is not the first to show such a cross-modal effect. See, for example, here, here, here, here, here (none of which were cited in the paper). Many other studies have shown that increasing primary task difficulty decreases secondary task performance. Again, the behavioral result touted in the media is not new, something the press release acknowledges in passing.

4. The study doesn't actually involve inattentional deafness; the term is misused. Inattentional deafness or blindness refers to a failure to notice an otherwise obvious but unexpected stimulus when focusing attention on something else. The "unexpected" part is key to ensuring that the critical stimulus actually is unattended (the justification for claiming the failure is due to inattention); people can't allocate attention to something that they don't know will be there.

In this study, tone detection was a secondary task. People were asked to focus mostly on the visual task, but they also were asked to report whether or not a tone occurred. In other words, people were actively trying to detect the tone and they knew it would occur. That's not inattentional deafness. It's just a reduction in detection for an attended stimulus when a primary task is more demanding. And, as I noted above, it's not really a demonstration of deafness either given participants were really good at detecting the tone in both conditions (they were just slightly worse when performing a harder visual task).

Note that the same lab previously published an paper that actually did show an effect of visual load on inattentional deafness.

Conclusion: There's nothing fundamentally wrong with this paper, at least that I can see (I'm not an expert on neuroimaging, though). The link between the behavioral results and brain imaging results is potentially interesting. I would have preferred a larger sample size and ideally measuring the link between brain and behavior in the same participants performing tasks with the same demands, but those issues aren't show stoppers. I can see why it is of interest to specialists (like me). That said, I'm not sure that it makes a contribution of broad interest to the public, and the novelty and importance of the behavioral result has been overplayed.

HI-BAR: A gold standard brain training study?

2015-11-30T09:57:00.002-06:00

A gold-standard brain training study?

Not without some alchemy

A HI-BAR (Had I Been A Reviewer) of:

Corbett, A., Owen, A., Hampshire, A., Grahn, J., Stenton, R., Dajani, S., Burns, A., Howard, R., Williams, N., Williams, G., & Ballard, C. (2015). The effect of an online cognitive training package in healthy older adults: An online randomized controlled trial. JAMDA, 16(11), 990-997.

Edit 12-3-15: The planned sample was ~1 order of magnitude larger than the actual one, not 2. (HT Matthew Hutson in the comments)

A recent large-scale brain training study, published in the Journal of the American Medical Directors Association (JAMDA), has garnered a lot of attention. A press release was picked up by major media outlets, and a blog post by Tom Stafford on the popular Mind Hacks blog called it “a gold-standard study on brain training” and noted that “this kind of research is what ‘brain training’ needs.”*

Tom applied the label “gold standard” because of the study’s design: It was a large, randomized, controlled trial with an active control group and blinding to condition assignment. From the gold-standard monicker, though, people might infer that the research methods and results provide solid evidence for brain training benefits. They do not.

Tom's post identified several limitations of the study, such as differential attrition across conditions and the use of a self-report primary outcome measure. Below I discuss why these and other analysis and reporting problems undermine the claims of brain training benefits.

Problems that undermine interpretability of the study

Differential Attrition
The analysis was based on the 6-month testing point, but the study was missing data from about 70% of the participants due to attrition. To address this problem, the authors carried forward data from the final completed testing session for each participant and treated it as if it were from the 6-month point. Critically, the control group had substantially greater attrition than the intervention groups—more of their scores were carried forward from earlier points in the intervention.

For the control group, only 27% of the data for the primary outcome and 17% of the data for the secondary outcomes came from participants who actually completed their testing at 6 months. For the Reasoning group, those numbers were 42% and 40%. For the General Cognition group, they were 40% and 30%.

The extent of the differential attrition and rates of carrying forward results from earlier sessions were only discoverable by inspecting the Consort diagram. This analysis choice and its implications were not fully discussed, and the paper did not report analyses of participants with comparable durations of training. This analysis approach introduces a major confound that could entirely account for any differential benefits.

Unclear sample sizes and means

Tables 3 and 4 list different control group means next to each training condition. There was only one control group, so it is unclear why the critical baseline means differed for the two training interventions. Without knowing why these means differed (they shouldn't have), the differential improvements in the training groups are uninterpretable.

The Ns listed in the tables also are inconsistent with the information provided in the Consort diagram. In a few cases, the Tables list a larger N than the consort diagram, meaning that there were more subjects in the analysis than in the study.

I emailed the corresponding author (on Nov. 10 and Nov. 23) to ask about the each of these issues, but I received no response. I also emailed the second author. His assistant noted that the corresponding author's team was "was responsible for that part of the study" and said the second author "can be of no help with this." I’m hoping this post will prompt an explanation for the values in the tables.

For me, those reporting and analysis issues are show stoppers, but the paper has other issues.

Other issues

Limitations of the pre-registration

The study was pre-registered, meaning that the recruiting, testing methods, and analysis plans were specified in advance. Such pre-registrations are required for clinical trials, but they have been relatively uncommon in the brain training literature. Have a pre-registered plan is ideal because it eliminates much of the flexibility that otherwise can undermine the interpretability of findings. The use of pre-registration is laudable. But, the registration was underspecified and the reported study deviated from it in important ways.

For example, the protocol called for 75,000 - 100,000 participants, but the reported study recruited fewer than 7000. That’s still a great sample, but it’s ~~2 orders~~ an order of magnitude smaller than the planned sample. Are there more studies resulting from this larger sample that just aren’t mentioned in the pre-registration?
The study also called for a year of testing, but it had to be cut short at 6 months and more than 2/3 of the participants did not undergo even 6 months of training. The pre-registration did not include analysis scripts, and the data from the study do not appear to have been posted publicly.

The pre-registered hypotheses predicted greater improvements in the reasoning training group than the general cognition group and it predicted that the general cognition group would not outperform the control group. The paper reports no tests of this predicted difference.

Underreporting for the primary measure (IADL)
The primary outcome measure consisted of self-reports of performance in daily activities (known as the Instrumental Activities of Daily Living or IADL). As Tom's post noted, such self-reports are subject to demand characteristics — people expect to do better following training, so they report having done better. The study did not test for different expectations across the training and control groups, so the benefits could be due to such demands or to a differential placebo effect (e.g., the control group might have found the study less worthwhile).

The reported benefits for IADLs were small, and the data provide little evidence for any benefit of training. The study reported statistically significant benefits for both training groups relative to the control group, but statistical significance is not the same as evidence. With samples this large, we should expect a substantially lower p-value than .05 when an effect actually is present in the population. If the Ns and means reported in the table were consistent with the method description, it might be possible to compute a Bayes Factor for these analyses. My bet is that the difference between the training groups and the control group would provide weak evidence at best for a meaningful training benefit (relative to the null).

The paper provides no information about baseline scores on the primary outcome measure (IADL). Although the analyses control for baseline scores, training papers must provide the pre-test scores and post-test scores. Without doing so, it is impossible to evaluate whether apparent training benefits resulted in part from baseline differences.

The paper also states that “Data from interim time points also show significant benefit to IADL at 3 months, particularly in the GCT group, although this difference was not significant.” I take this to mean both training groups outperformed the control group at 3 months, but they did not differ significantly from each other. No statistical evidence is provided in support of this claim.

Limited evidence from the secondary measures

Only one of the secondary cognitive outcome measures (a reasoning measure) showed a training benefit. The paper refers to it as “the key secondary measure,” but that designation does not appear in the pre-registration (http://www.isrctn.com/ISRCTN72895114). Moreover, the pre-registration predicts better performance for reasoning training than general cognition training or the control group, but the paper found improvements for both interventions. A few other measures showed significant effects, but given the large sample sizes, the high p-values might well be more consistent with the absence of a training benefit than the presence of one.

Despite providing no statistical evidence of differential benefits for Reasoning training and General Cognition training, the paper claims that “Taken together, these findings indicate that the ReaCT package confers a more generalized cognitive benefit than the GCT at 6 months. That claim appears to come from finding no effect on a digit secondary task in the Reasoning group and a decline in the General cognition group. However, a difference in significance is not a significant difference.

Almost all of the measures showed declining performance from the pre-test to the post-test. That is, participants were not getting better. They just declined less than the control participants. It is unclear why we should see such a pattern of declining performance over a short time window with relatively young participants. Although cognitive performance does decline with age, presumably those declines should be minimal over 1-6 months, and they should be swamped by the benefits of taking the test twice. One explanation might be differential attrition -- those subjects who did worse initially were more likely to drop out early.

* Thanks to Tom Stafford for emailing a copy of the paper. The journal is obscure enough that the University of Illinois library did not have access to it.

Response from Ballesteros et al to my HI-BAR

2014-12-15T10:42:00.002-06:00

Update 12-15-14: I used ~~strikethru~~ to correct a couple of the F test notes below. The crossed out ones were fine.

Update 5-26-15: Frontiers has published a correction from Ballesteros et al that acknowledges the overlap among their papers. It doesn't admit the inappropriateness of that overlap. It mostly echoes their response below, but does not address my questions about that response.

In late November, I posted a HI-BAR review of a paper by Ballesteros et al (2014) that appeared in Frontiers. In addition to a number of other issues I discussed there and in an earlier review of another paper, I raised the concern that the paper republished the training data and some outcome measures from an earlier paper in PLoS One without adequately conveying that those data had been published previously. I also noted this concern in a few comments on the Frontiers website. On my original HI-BAR post, I asked the original authors for clarification about these issues.

I have now received their response as a pdf file. You can read it here.

Below I quote some sections from their reply and comment on them. I encourage you to read the full reply from the original authors as I am addressing only a subset of their comments below. For example, their reply explains the different author lists on the two papers in a satisfactory way, I think. Again, I would be happy to post any responses from the original authors to these comments. As you will see, I don't think this reply addresses the fairly major concerns about duplication of methods and results (among other issues).

Quotes from the authors' reply are indented and italicized, with my responses following each.

As you noted, both papers originated from the same randomized controlled intervention study (clinical trial number: NCT02007616). We never pretended that the articles were to be considered as two different intervention studies. Part of this confusion could have been generated because in the Frontiers article the clinical trial number did not appear on the first page, even though this number was included in the four preliminary versions of the manuscript under revision. We have contacted Frontiers asking them to include the clinical trial number in the final accepted version, if possible.

Although that would help, it's not sufficient. The problem is that the data analyses are republished. It would be good to note, explicitly, both in the method section and in the results section, that this paper is reporting outcome measures from the same intervention. And, it's essential to note when and where analyses are repeated.

If it is not possible at this stage, we asked them to publish a note with this information and to acknowledge in the results section the overlap as well as in Figure 3b, mentioning that the data in the Figure were published previously in PLoS.

This seems necessary regardless of whether or not the link to the clinical trial number is added. Even if it is made clear that the paper is based on the same intervention, it still is not appropriate to republish data and results without explicitly noting that they were published previously. Actually, it would be better not to republish the data and analyses. Period.

You also indicated in your first comments posted in Frontiers that the way we reported our results could give the impression that they come from two different and independent interventions. To avoid this misunderstanding, as you noticed, we introduced two references in our Frontiers ́ article. We inserted the first reference to the PLoS article in the Method section and a second reference in the Discussion section. Two references that you considered were not enough to avoid possible confusions in the reader.

As I discussed in my HI-BAR post, these citations were wholly inadequate. One noted only that the results of the oddball task were reported elsewhere, yet the same results were reported again in Frontiers and results section included no mention of the duplication. The other citation, appearing in the discussion, implied that the earlier paper provided additional evidence for a conclusion about the brain. Nowhere did the Frontiers paper cite the PLoS paper for the details of the intervention, the nature of the outcome measures, etc. It just published those findings as if they were new. The text itself should have noted, both in the method and results sections, whenever procedures or results were published elsewhere. Again, it would have been better not to republish them at all.

In relation to the results section in which we describe the cross-modal oddball attention task results in the Frontiers article, we acknowledge that, perhaps, it would have been better to avoid a detailed presentation of the attentional results that were already published and indicate just that attentional data resulting from the task developed in collaboration with the UBI group were already published. We could have asked the readers to find out for themselves what happened with attention after training. We were considering this possibility but in the end we decided to include the results of the oddball task to facilitate ease of reading.

Acknowledging the repetition explicitly in the text would have helped avoid the interpretation that these were new results. Repeating the statistics from the earlier paper isn't an "ease of reading" issue -- it's duplicate publication. You could easily summarize the findings of the earlier paper, with clear citation, and note that the details were described in that paper. I don't see any justification for republishing the actual statistics and findings.

Regarding the last paragraph of the oddball results section, we said “New analyses were conducted....” As we said above, we tried (perhaps in an unfortunate way) to give in this paper a brief summary of the results obtained in the attention task, so this paragraph refers to the epigraphs “Distraction” and “Alertness” of results section in PLoS publication, where differential variables were calculated and new analyses were performed to obtain measures of gains and losses in the ability to ignore infrequent sounds and to take advantage of the frequent ones. Once again, we apologize if this sentence has led to confusion, and we are in contact with the Journal concerning this.

Yes, that sentence added to the impression these analyses were new to the Frontiers paper. But, the statistical analyses should not have been duplicated in the first place. That's also true for the extensive repetition of all of the training data.

Another comment in your exhaustive review referred to the differences in the samples shown in the Consort diagram between the two publications. This has a simple explanation. The diagram of the PLoS article refers only to the cross-modal oddball task designed to assess alertness and distraction while the Frontier ́s Consort diagram refers to the whole intervention study. In the control group, one of the participants was removed due to the large number of errors in the oddball task, but he was included in the rest of the study in which he reached inclusion criteria. The same occurred in the trained group. As attentional measures were analyzed separately by the UBI group, by the time we sent the results only fifteen participants completed the post-evaluation (we could not contact a participant and the other was travelling that week). A few days later, these two participants completed the post-evaluation, but we decided not to include them in the PLoS sample as the data were already analyzed by the UBI group.

Thank you for these clarifications. I'm curious why you decided not to wait for a few days for those remaining participants if that was part of your intervention design. If their results came in a few days later, why not re-do the analysis in the original paper to include all participants who were in the intervention. Presumably, by publishing the PLoS paper when you did, you deemed the intervention to be complete (i.e., it wasn't flagged as a preliminary result). It seems odd to then add these participants to the next paper. This difference between papers raises two questions. First, would the results for the PLoS paper have been different with those two participants? Second, would the results of the Frontiers paper have differed if they had been excluded? And, if those data were in by the time the Frontiers paper was written, why were these participants not included in the oddball analysis? At least that would have added new information to those duplicated analyses.

This clarified information should have been included in the Frontiers paper to make it clearer that both papers reported the same intervention with the same participants.

We would like to explain the clinical trial registration process. As you pointed out in your comments to Mayas et al. (2014), we registered the clinical trial after the attention manuscript was submitted to PLoS. The journal (PlosOne) specifically required the registration of the intervention study as a clinical trial during the revision process in order to publish the manuscript, and told us about the possibility of post-registration. The Editor of PLoS sent us the link to register the study.

Post-registering a study completely undermines the purpose of registration. I find it disturbing that PLoS would permit that as an option for a clinical trial. Looking at the PLoS guidelines, they do make an exception for post-study registration provided that the reasons for "failing to register before participant recruitment" are spelled out clearly in the article (emphasis from original on PLoS website):

PLOS ONE supports prospective trial registration (i.e. before participant recruitment has begun) as recommended by the ICMJE's clinical trial registration policy.Where trials were not publicly registered before participant recruitment began, authors must:

Register all related clinical trials and confirm they have done so in the Methods section

Explain in the Methods the reason for failing to register before participant recruitment

It's also clear that their policy is for clinical trials to be pre-registered, not post-registered. And, the exception for post-registration requires a justification. Neither the PLoS paper nor the Frontiers paper provided any such justification. The Frontiers paper didn't mention registration at all, and the PLoS one didn't make clear that the registration occurred after submission of the finished study.

The idea of registration is to specify, in advance, the procedures that will be followed in order to avoid introducing investigator degrees of freedom. Registering after the fact does nothing to address that problem. It's just a re-description of an already completed study. It's troubling that a PLoS editor would instruct the authors to post-register a study.

We would like to clarify some questions related to the data analysis. First, multiple tests in all ANOVAs were Bonferroni corrected although it is not made explicit in the text.

As best I can tell, the most critical multiple testing issues, the ones that could have been partly remedied by pre-registration, were not corrected in this paper.

There are at least four distinct multiple testing issues:

There are a large number of outcome measures, and as best I can tell, none of the statistical tests were corrected for the number of tests conducted across tasks.
There are multiple possible tests for a number of the tasks (e.g., wellbeing has a number of sub-scales), and there don't appear to have been corrections for the number of distinct ways in which an outcome could be measured.
A multi-factor ANOVA itself constitutes multiple tests (e.g., a 2x2 ANOVA involves 3 tests: each main effect and the interaction). Few studies correct for that multiple testing problem, and this paper does not appear to have done so.
There is a multiple tests issue with pairwise comparisons conducted to follow-up a significant ANOVA. I assume those are the Bonferroni-corrected tests that the authors referred to above. However, it's impossible to tell if these tests were corrected because the paper did not report the test statistics — it just reported p < xx or p= xx. Only the F tests for the main effects and interactions were reported.

If the authors did correct for the first three types of multiple testing issues, perhaps they can clarify. However, based on the ANOVA results, it does not appear that they did.

A related issue, one mentioned in my HI-BAR on the PLoS paper but not on the Frontiers paper, is that some of the reported significance levels for the F tests are incorrect. Here are some examples from the Frontiers paper in which the reported significance level (p<xx or p=xx) is wrong. For each, I give the correct p value in red (uncorrected for multiple tests). None of these calculations led to less statistical significance:

[F(1, 28) = 3.24, MSE = 1812.22, p = 0.057, η2p = 0.12]. p=.0826
[F(2, 50) = 5.52, MSE = 0.001, p < 0.005, η2p = 0.18]. p=.0068
[F(1, 28) = 4.35, MSE = 0.09, p < 0.001, η2p = 0.89]. p=.0462
~~[F(1, 28) = 17.98, MSE = 176.74, p < 0.001, η2p = 0.39]. p=.0002~~
[F(1, 28) = 13.02, MSE = 61.49, p < 0.01, η2p = 0.32]. p=.0012
[F(1, 28) = 3.42, MSE = 6.47, p = 0.07, η2 = 0.11]. p=.0750

The following two results were reported as statistically significant at p<.05, but actually were not significant with that alpha level:

[F(1, 28) = 3.98, MSE = 0.06, p < 0.01, η2p = 0.12]. p=.0559
[F(1, 28) = 3.40, MSE = 0.15, p = 0.04, η2p = 0.10]. p=.0758

I don't know that any of these errors dramatically changes the interpretation of the results, but they should be corrected.

Second, the RTs filtering was not an arbitrary decision. The lower limit (200 ms) reflects the approximate minimum amount of time necessary to respond to a given stimulus (due to speed limitations of the neurological system). RTs below this limit may reflect responses initiated before the stimulus onset (guessing). The selection of the upper limit (1100 ms) is a more difficult decision. Different criteria have been proposed (statistical, theoretical...) in the literature (see Whelan, 2008). Importantly none of them seem to affect type I errors significantly. In this case, we made the following theoretical assumption: RTs longer than 1100 ms might depend on other cognitive processes than speed of processing.

There is nothing wrong with this choice of cutoffs, and it might well have been principled. Still, it is arbitrary. Any number of cutoffs would have been just as appropriate (e.g., 150ms and 1200ms, 175ms and 3000ms, ± 3SD, etc). My point wasn't to question the choice, but instead to note that it introduces flexibility to the analysis. This is a case in which pre-registering the analysis plan would help -- the choice of cutoffs is reasonable, but flexible. Registration eliminates that flexibility, making it clear to readers that the choice of cutoffs was principled rather than based on knowing the outcome with different cutoffs. Another approach would be to assess the robustness of the results to various choices of cutoff (reporting all results).

HI-BAR: More benefits of Lumosity training for older adults?

2014-11-20T13:56:00.004-06:00

HI-BAR (Had I Been A Reviewer)

For more information about HI-BAR reviews, see my post introducing the acronym

Ballesteros S, Prieto A, Mayas J, Toril P, Pita C, Ponce de León L, Reales JM and Waterworth J (2014) Brain training with non-action video games enhances aspects of cognition in older adults: a randomized controlled trial. Front. Aging Neurosci. 6:277. doi: 10.3389/fnagi.2014.00277

Update: I posted a brief summary of this HI-BAR as a comment on the paper at Frontiers, and Dr. Ballesteros responded to indicate that this was the same training study. Her response quoted the one-sentence citation described below and noted that they did not intend to give the impression that this was a new intervention. Her reply did not address the overlap in presentation of methods and results.

Update 2: Dr. Ballesteros sent me a response to my questions [pdf]. I have just posted a new blog post in which I quote from her response and add new comments.

This paper testing the benefits of Lumosity training for older adults just appeared in Frontiers. The paper is co-authored two of the same people (Ballesteros and Mayas) as a recent PLoS One paper on Lumosity training that I critiqued in a HI-BAR in April (note that the new paper includes six additional authors who were not on the PLoS paper and the PLoS paper includes two who weren't on the Frontiers paper).

I was hopeful that this new paper would address some of the shortcomings of the earlier one. Unfortunately, it didn't. In fact, it couldn't have, because the "new" Frontiers paper is based on the same intervention as the PLoS paper.

And, when I say, "the same intervention," I don't mean that they directly replicated the procedures from their earlier paper. The Frontiers paper reports data from the same intervention with the same participants. It would be hard to know that from reading just the Frontiers paper, though, because it does not mention that these are additional measures from the same intervention.

As my colleagues and I have noted (ironically, in Frontiers), this sort of partial reporting of outcome measures gives the misleading impression that there are more interventions demonstrating transfer of training than actually exist. If a subsequent meta-analysis treats these reports as independent, it will produce a distorted estimate of the size of any intervention benefits. Moreover, whenever a paper does not report all outcome measures, there's no way to appropriately correct for multiple comparisons. At a minimum, all publications derived from the same intervention study should state explicitly that it is the same intervention and identify all outcome measures that will, at some point, be reported.

The Frontiers cites the PLoS paper only twice. The first citation appears in the method section in reference to the oddball outcome task that was the focus of the PLoS paper:

Results from this task have been reported separately (see Mayas et al., 2014).

That is the only indication in the paper that there might be overlap of any kind between the study presented in Frontiers and the one presented in PLoS. It's a fairly minimalist way to indicate that this was actually a report of different outcome measures from the same study.

It's also not entirely true: The results section of the Frontiers paper describes the results for that oddball task in detail as well, and all of the statistics for that task reported in the Frontiers paper were also reported in the PLoS paper. The results section does not acknowledge the overlap. Figure 3b in the Frontiers paper is nearly identical to Figure 2b in the PLoS paper, reporting the same statistics from the oddball task with no mention that the data in the figure were published previously.

After repeating the main results of the oddball task from the earlier paper, the Frontiers paper states:

"New analyses were conducted on distraction (novel vs. standard sound) and alertness (silence vs. standard sound), showing that the ability to ignore relevant sounds (distraction) improved significantly in the experimental group after training (12 ms) but not in the control group. The analyses of alertness showed that the experimental group increased 26 ms in alertness (p < 0.05) but control group did not (p = 0.54)."

I thought "new analysis" might mean that these analyses were new to this paper, but they weren't. Here's the relevant section of the PLoS paper (which reported the actual statistical tests):

"In other words, the ability to ignore irrelevant sounds improved significantly in the experimental group after video game training (12 ms approximately) but not in the control group....Pre- and post-training comparisons showed a 26ms increase of alertness in the experimental group..."

Stating that "results from this task have been reported separately" implies that any analyses of that task reported in Frontiers are new, but they're not.

The Frontiers paper also reports the same training protocol and training performance data without reference to the earlier study. The Frontiers paper put the training data in a figure whereas the PLoS paper put them in a table. Why not just cite the earlier paper for the training data rather than republishing them without citation?

The same CONSORT diagram appears as Figure 1 in both papers. The Frontiers paper CONSORT diagram does report that the experimental condition had data from 17 participants rather than the 15 participants reported in the PLoS paper. From the otherwise identical figures, it appears that those two additional participants were the ones that the PLoS paper noted were lost to followup due to "personal travel" and "no response from participant." I guess both responded after the PLoS paper was completed, although the method sections of the two papers noted that testing took place over the same time period (January through July of 2013). The presence of data from these two additional participants did not change any of the statistics for the oddball task that were reported in both papers.

One other oddity about the subject samples in the consort diagram: The PLoS paper excluded one participant from the control condition analyses due to "Elevated number of errors," leaving 12 participants in that analysis. The Frontiers paper did not exclude that participant. If they made too many errors for the PLoS paper, why were they included in the Frontiers paper?

Other than the one brief note in the method section about the oddball task, the only other reference to the PLoS paper appeared in the discussion, and gives the impression that the PLoS paper provided independent evidence for the contribution of frontal brain regions to alertness and attention filtering:

We also found that the trainees reduced distractibility by improving alertness and attention filtering, functions that decline with age and depend largely on frontal regions (Mayas et al., 2014).

That citation and the one in the method section are the only mentions of the earlier paper in the entire article.

Given that I was already familiar with the PLoS paper, had I been a reviewer of this paper, here's what I would have said:

The paper suffers from all of the same criticisms I raised about the intervention in the PLoS paper (e.g., inadequate control group among many other issues), and in my view, those shortcomings should preclude publication of this paper. See http://blog.dansimons.com/2014/04/hi-bar-benefits-of-lumosity-training.html
The paper should have explained that this was the same intervention study, with the same participants, as the earlier PLoS One paper.
The paper should have noted explicitly which data and results were reported previously and which were not.

It's possible that the authors notified the editor of the overlap between these papers upon submission. If so, it is surprising that the paper was even sent out for review before these issues were addressed. If the editor was not informed of the overlap, I cannot fault the editor and reviewers for missing the fact that the intervention, training results, and some of the outcome results were reported previously. Unless they happened to know about the PLoS One paper, they could be excused for missing the one reference to that paper in the method section. Still, the many other major problems with this intervention study probably should have precluded publication, at least without major corrections and qualifications of the claims.

HI-BAR: Benefits of Lumosity training for older adults?

2014-04-02T13:39:00.001-05:00

HI-BAR (Had I Been A Reviewer)

A post-publication review of Mayas J., Parmentier, F. B. R., Andrés P., & Ballesteros, S. (2014) Plasticity of Attentional Functions in Older Adults after Non-Action Video Game Training: A Randomized Controlled Trial. PLoS ONE 9(3): e92269. doi:10.1371/journal.pone.0092269

For more information about HI-BAR reviews, see my post introducing the acronym.

This paper explored "whether the ability to filter out irrelevant information...can improve in older adults after practicing non-violent video games." In this case, the experimental group played 10 games that are part of the Lumosity program for a total of 20 hours. The control group did not receive any training. Based on post-training improvements on an "oddball" task (a fairly standard attention task, not a measure of quirkiness), the authors claim that training improved the ability to ignore distractions and increased alertness in older adults.

Testing whether commercial brain training packages have any efficacy for cognitive enhancement is a worthwhile goal, especially given the dearth of robust, reliable evidence that such training has any measurable impact on cognitive performance on anything other than the trained tasks. I expect that Lumosity will tout this paper as evidence for the viability of their brain training games as a tool to improve cognition. They probably shouldn't.

Below are the questions and concerns I would have raised had I been a reviewer of this manuscript. If you read my earlier HI-BAR review of Anguera et al (2013), you'll recognize many of the same concerns. Unfortunately, the problems with this paper are worse. A few of these questions could be answered with more information about the study (I hope the authors will provide that information). Unfortunately, many of the shortcomings are more fundamental and undermine the conclusion that training transferred to their outcome measure.

I've separated the comments into two categories: Method/Analysis/Inferential issues and Reporting issues.

Method/Analysis/Inferential Issues

Sample size
The initial sample consisted of 20 older adults in the training group and 20 in the control group. After attrition, the analyzed sample consisted of only 15 people in the training group and 12 in the control group. That's a really small sample, especially when testing older adults who can vary substantially in their performance.

Inadequate control group

The control group for this paper is inadequate to make any causal claims about the efficacy of the training. The experimental group engaged in 20 hours of practice with Lumosity games. The control group "attended meetings with the other members of the study several times along the course of the study;" they got to hang out, chat, and have coffee with the experimenters a few times. This sort of control condition is little better than a no-contact control group (not even the amount of contact was equated). Boot et al (2013) explained how inadequate control conditions like this "limited contact" one provide an inadequate baseline against which to evaluate the effectiveness of an intervention. Here's the critical figure from our paper showing the conclusions that logically follow from interventions with different types of control conditions:

When inferring the causal potency of any treatment, it must be compared to an appropriate baseline. That is why drug studies use a placebo control, ideally one that equates for side effects too, so that participants do not know whether they have received the drug or a placebo. For a control condition to be adequate, it should include all of the elements of the experimental group excepting the critical ingredient of the treatment (including equating for expectation effects). Otherwise, any differences in outcome could be due to other differences between the groups. That subtractive method, first described by Donders more than 150 years ago, is the basis of clinical trial logic. Unfortunately, it commonly is neglected in psychological interventions.

In video game training, even an active control group in which participants play a different game might not control for differential placebo effects on outcome measures. But, the lack of an active control group allows almost no definitive conclusions: It does not equate the experience between the training group and the control group in any substantive way. This limited-contact control group accounts for test-retest effects and the passage of time, and little else.

Any advantage observed for the training group could result from many factors that are not specific to the games involved in the training condition or even to games at all: Any benefits could have resulted from doing something intensive for 20 hours, from greater motivation to perform well on the outcome measures, from greater commitment to the tasks, from differential placebo effects, from greater social contact, etc. Differences between the training group and this limited-contact control group do not justify any causal conclusion about the nature of the training.

Interventions and training studies that lack any control condition other than a no-contact or limited-contact control group should not be published. Period. They are minimally informative at best, and misleading at worst given that they will be touted as evidence for the benefits of training. The literature on game training is more than 10 years old, and there is no justification for publishing studies that claim a causal benefit of training if they lack an adequate baseline condition.

Multiple tests without correction

The initial analysis consisted of two separate 2x2x3 ANOVAs on accuracy and response times on the oddball task, accompanied by follow-up tests. A 3-factor ANOVA consists of 7 separate F-tests (3 main effects, 3 two-way interactions, and 1 three-way interaction). Even if the null hypothesis were true and all of the data were drawn from a single population, we would expect a significant result on at least one of these tests more than 30% of the time on average (1 - .95^7). In other words, each of the ANOVAs has a 30% false positive rate. For a thorough discussion of this issue, see this excellent blog post from Dorothy Bishop.

The critical response time analysis, the only one to show a differential training effect, produced two significant tests out of the 7 conducted in the ANOVA: a main effect of condition as for accuracy (but not with the predicted pattern), and a significant 3-way interaction. The results section does not report any correction for multiple tests, though, and the critical 3-way interaction would not have been significant (it was p=.017 without correction).

Possible speed-accuracy tradeoff

The accuracy ANOVA showed a marginally significant 3-way interaction (p=.077 without correction for multiple tests), but the paper does not report the means or standard deviations for the accuracy results. Is it possible that the effects on accuracy and RT were in opposite directions? If so, the entire difference between training groups could just be a speed-accuracy tradeoff, with no actual performance difference between conditions.

Flexible and arbitrary analytical decisions.
For response times, the analysis included only correct trials and excluded times faster than 200ms and slower than 1100ms. Standard trials after novel trials were discarded as well. These choices seem reasonable, but arbitrary. Would the results hold with different cutoffs and criteria? Were any other cutoffs tried? If so, that introduces additional flexibility (investigator degrees of freedom) that could spuriously inflate the significance of tests. That's one reason why pre-registration of analysis plans is essential. It's all too easy to rationalize any particular approach after the fact if it happens to work.

Unclear outcome measures

The three way interaction for response time was followed up with more tests that separated the oddball task into an alertness measure and a distraction measure analyzed separately for the two groups. It's not clear how these measures were derived from the oddball conditions, but I assume they were based on different combinations of the silent, standard, and novel noise conditions. It would be nice to know what these contrasts were as they provide the only focused tests of differential transfer task improvements between the experimental group and the control group.

A difference in significance is not a significant difference

The primary conclusions about these follow-up outcome measures are based on significant improvements for the training group (reported as p=.05 and p=.04) and the absence of a significant improvement for the control group. Yet, significance in one condition and not in another does not mean that those two conditions differed significantly. No test of the difference in improvements across conditions was provided.

Inappropriate truncation/rounding of p-values

The critical "significant" effect for the training group actually wasn't significant! The authors reported "F(1,25) = 4.00, MSE = 474.28, p = .05, d = 0.43]." Whenever I see a p value reported as exactly .05, I get suspicious. So, I checked. F(1,25) = 4.00 gives p = .0565. Not significant. The authors apparently truncated the p-value.

(The reported p=.04 is actually p=.0451. That finding was rounded, but rounding down is not appropriate either.)

Of the two crucial tests in the paper, one wasn't actually significant and the other was just barely under p=.05. Not strong evidence for improvement, especially given the absence of correction for multiple tests (with which, neither would be significant).

Inappropriate conclusions from correlation analyses

The paper explored the correlation between the alertness and distraction improvements (the two outcome measures) and each of the 10 games that made up the Lumosity training condition. The motivation is to test whether the amount of improvement an individual showed during training correlated with the amount of transfer they showed on the outcome measures. Of course, with N=15, no correlation is stable and any significant correlation is likely to be substantially inflated. The paper included no correction for the 20 tests they conducted, and neither of the two significant correlations (p=.02, p<.01) would survive correction. Moreover, even if these correlations were robust, correlations between training improvement and outcome measure improvement logically provide no evidence for the causal effect of training on the transfer task (See Tidwell et al, 2013).

Inaccurate conclusions

The authors write:

"The results of the present study suggest that training older adults with non-action video games reduces distractibility by improving attention filtering (a function declining with age and largely depending on frontal regions) but also improves alertness."

First, the study provides no evidence at all that any differences resulted from improvements in an attention filtering mechanism. It provided not tests of that theoretical idea.

Furthermore, the study as reported does not show that training differentially reduced distractibility or increased alertness. The improved alertness effect was not statistically significant when the p-value isn't truncated to .05. The effect on the distraction measure was 12ms (p=-.0451 without correction for multiple tests). Neither effect would be statistically significant with correction for multiple tests. But, even if they were significant, without a test of the difference between the training effect and the control group effect, we don't know if there was any significant difference in improvements between the two groups; significance in one condition but not another does not imply a significant difference between conditions.

Reporting Issues

Pre-Registration or Post-Registration

The authors registered their study on ClinicalTrials.gov on December 4, 2013; it's listed and linked under the "trial registration" heading in the published article. That's great. Pre-registration is the best way to eliminate p-hacking and other investigator degrees of freedom.

But, this wasn't a pre-registration: The manuscript was submitted to PLoS three months before it was registered! What is the purpose of registering an already completed study? Should it even count as a registration?

Unreported outcome measures

The only outcome measure mentioned in the published article is the "oddball task," but the registration on ClinicalTrials.gov identifies the following additional measures (none with any detail): neuropsychological testing, Wisconsin task, speed of processing, and spatial working memory. Presumably, these measures were collected as part of the study? After all, the protocol was registered after the paper was submitted. Why were they left out of the paper?

Perhaps the registration was an acknowledgment of the other measures and they are planning to report each outcome measure in a separate journal submission. Dividing a large study across multiple papers can be an acceptable practice, provided that all measures are identified in each paper and readers are informed about all of the relevant outcomes (the papers must cross-reference each other).

Sometimes a large-scale study is preceded by a peer-reviewed "design" paper that lays out the entire protocol in detail in advance. This registration lacks the necessary detail to serve as a roadmap for a series of studies. Moreover, separating measures across papers without mentioning that there were other outcome measures or that some of the measures were (or will be) reported elsewhere is misleading. It gives the false impression that these outcomes came from different, independent experiments. A meta-analysis would treat them as independent evidence for training benefits when they aren't independent.

Unless the results from all measures are published, readers have no way to interpret the significance tests for any one measure. Readers need to know the total number of hypothesis tests to determine the false positive rate. Without that information, the significance tests are largely uninterpretable.

Here's a more troubling possibility: Perhaps the results for these other measures weren't significant, so the authors chose not to report them (or the reviewers/editor told them not to). If true, this underreporting—providing only the outcome measure that showed an effect—constitutes p-hacking, increasing the chances that any significant results in the article were false positives.

Without more information, readers have no way to know which is the case. And, without that information, it is not possible to evaluate the evidence. This problem of incomplete reporting of outcome measures (and neglecting to mention that separate papers came from the same study) has occurred in the game training literature before—see "The Importance of Independent Replication" section in this 2012 paper for some details.

Conflicting documentation of outcome measures

The supplemental materials at PLoS include a protocol that lists only the oddball task and neuropsychological testing. The supplemental materials also include an author-completed Consort Checklist that identifies where all the measures are described in the manuscript. The checklist includes the following items for "outcomes":

"6a. Completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed."

"6b. Any changes to trial outcomes after the trial commenced, with reasons."

For 6a, the authors answered "Methods." Yet, the primary and secondary outcomes noted in the Clinicaltrials.gov registration are not fully reported in the methods of the actual article or in the protocol page. For 6b, they responded "N/A," Implying that the design was carried out as described in the paper.

These forms and responses are inconsistent with the protocol description at clinicaltrials.gov. Either the paper and supplementary materials neglected to mention the other outcome measures or the clincialtrials.gov registration lists outcome measures that weren't actually collected. Given that the ClinicalTrials registration was completed after the paper was submitted, that implies that other outcome measures were collected as part of the study but not reported. If so, the PLoS supplemental materials are inaccurate.

Final post-test or interim stage post-test?

The clinicaltrials.gov registration states that outcome measures will be tested before training, after 12-weeks, and again after 24 weeks. The paper reports only the pre-training and 12-week outcomes and does not mention the 24-week test. Was it conducted? Is this paper an interim report? If so, that should be mentioned in the article. Had the results not been significant at 12 weeks, would they have been submitted for publication? Probably not. And, if not, that could be construed as selective reporting, again biasing the reported p-values in this paper in favor of significance.

Limitations of the limitations section
The paper ends with a limitations section, but the only identified limitations are the small sample size, the lack of a test of real-world outcome measure, the use of only the games in Lumosity, and the lack of evidence for maintenance of the training benefits (presumably foreshadowing a future paper based on the 24-week outcome testing mentioned in the clinicaltrials.gov registration). No mention is made of the inadequacy of the control group for causal claims about the benefits of game training, the fragility of the results to correction for multiple testing, the flexibility of analysis, the possible presence of unreported outcome measures, or any of the other issues I noted above.

Summary

Brain training is now a major industry, and companies capitalize (literally) on results that seem to support their claims. Training and intervention studies are critical if we want to evaluate the effectiveness of psychological interventions. But, intervention studies must include an adequate active control group, one that is matched for expected improvements independently for each outcome measure (to control for differential placebo effects). Without such a control condition, causal claims that a treatment has benefits are inappropriate because it is impossible to distinguish effects of the training task from other differences between the training and control group that could also lead to differential improvement. Far too many published papers make causal claims with inadequate designs, incomplete reporting of outcome measures, and overly flexible analyses.

In this case, the inadequacy of the limited-contact control condition (without acknowledging these limitations) alone would be sufficient grounds for an editor to reject this paper. Reviewers and editors need to step up and begin requiring adequate designs whenever authors make causal claims about brain training. Even those with an adequate design should take care to qualify any causal claims appropriately to avoid misrepresentation in the media.

Even if the control condition in this study had been adequate (it wasn't), the critical interaction testing the difference in improvement across conditions was not reported. Moreover, one of the improvements in the training group was reported to be significant even though it wasn't, and neither of the improvements would have withstood correction for multiple tests. Finally, the apparent underreporting of outcome measures makes all of the significance tests suspect.

More broadly, this paper provides an excellent example of why the field needs true pre-registration of cognitive intervention studies. Such registrations should include more than just the labels for the outcome measures. They should include a complete description of the protocol, tasks, measures, coding, and planned analysis. They should specify any arbitrary cutoffs, identify which analyses are confirmatory, and note when additional analyses will be exploratory. Without pre-registration (or, in the absence if pre-registration, complete reporting of all outcome measures), readers have no way to evaluate the results of an intervention because any statistical tests are effectively uninterpretable.

Note: Updated to fix formatting and typos

How experts recall chess positions

2014-02-21T11:36:00.004-06:00

Originally posted to invisiblegorilla blog on 15 February 2012. I am consolidating some posts from my other blogs onto my personal website where I have been blogging for the past year.

In 2011, a computer (Watson) outplayed two human Jeopardy champions. In 1997, chess computer Deep Blue defeated chess champion Garry Kasparov. In both cases, the computer “solved” the game—found the right questions or good moves—differently than humans do. Defeating humans in these domains took years of research and programming by teams of engineers, but only with huge advantages in speed, efficiency, memory, and precision could computers compete with much more limited humans.

What allows human experts to match wits with custom-designed computers equipped with tremendous processing power? Chess players have a limited ability to evaluate all of the possible moves, the responses to those moves, the responses to the responses, etc. Even if they could evaluate all of the possible alternatives several moves deep, they still would need to remember which moves they had evaluated, which ones led to the best outcomes, and so on. Computers expend no effort remembering possibilities that they had already rejected or revisiting options that proved unfruitful.

This question, how do chess experts evaluate positions to find the best move, has been studied for decades, dating back to the groundbreaking work of Adriaan de Groot and later to work by William Chase and Herbert Simon. de Groot interviewed several chess players as they evaluated positions, and he argued that experts and weaker players tended to “look” about the same number of moves ahead and to evaluate similar numbers of moves with roughly similar speed. The relatively small differences between experts and novices suggested that their advantages came not from brute force calculation ability but from something else: knowledge. According to De Groot, the core of chess expertise is the ability to recognize huge number of chess positions (or parts of positions) and to derive moves from them. In short, their greater efficiency came not from evaluating more outcomes, but from considering only the better options. [Note: Some of the details of de Groot’s claims, which he made before the appropriate statistical tests were in widespread use, did not hold up to later scrutiny—experts do consider somewhat more options, look a bit deeper, and process positions faster than less expert players (Holding, 1992). But de Groot was right about the limited nature of expert search and the importance of knowledge and pattern recognition in expert performance.]

In de Groot’s most famous demonstration, he showed several players images of chess positions for a few seconds and asked the players to reconstruct the positions from memory. The experts made relatively few mistakes even though they had seen the position only briefly. Years later, Chase and Simon replicated de Groot’s finding with another expert (a master-level player) as well as an amateur and a novice. They also added a critical control: The players viewed both real chess positions and scrambled chess positions (that included pieces in implausible and even impossible locations). The expert excelled with the real positions, but performed no better than the amateur and novice for the scrambled positions (later studies showed that experts can perform slightly better than novices for random positions too if given enough time; Gobet & Simon, 1996). The expert advantage apparently comes from familiarity with real chess positions, something that allows more efficient encoding or retrieval of the positions.

Chase and Simon recorded their expert performing the chess reconstruction task, and found that he placed the pieces on the board in spatially contiguous chunks, with pauses of a couple seconds after he reproduced each chunk. This finding has become part of the canon of cognitive psychology: People can increase their working memory capacity by grouping together otherwise discrete items to form a larger unit in memory. In that way, we can encode more information into the same limited number of memory slots.

In 1998, Chris Chabris and I invited two-time US Champion and International Grandmaster Patrick Wolff (a friend of Chris’s) to the lab and asked him to do the chess position reconstruction task. Wolff viewed each position (on a printed index card) for five seconds and then immediately reconstructed it on a chess board. After he was satisfied with his work, we gave him the next card. At the end of the study, after he had recalled five real positions and five scrambled positions, we asked him to describe how he did the task.

The video below shows his performance and his explanations (Chris is the one handing him the cards and holding the stopwatch—I was behind the camera). Like other experts who have been tested, Wolff rarely made mistakes in reconstructing positions, and when he did, the errors were trivial—they did not alter the fundamental meaning or structure of the position. Watch for the interesting comments at the end when Wolff describes why he was focused on some aspects of a position but not others.

HT to Chris Chabris for comments on a draft of this post

Sources cited:

For an extended discussion of chess expertise and the nature of expert memory, see Christopher Chabris’s dissertation: Chabris, C. F. (1999). Cognitive and neuropsychological mechanisms of expertise: Studies with chess masters. Doctoral Dissertation, Harvard University. http://en.scientificcommons.org/43254650

Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology,4, 55-81.

de Groot, A.D. (1946). Het denken van de schaker. [The thought of the chess player.] Amsterdam: North-Holland. (Updated translation published as Thought and choice in chess, Mouton, The Hague, 1965; corrected second edition published in 1978.)

Holding, D.H. (1992). Theories of chess skill. Psychological Research, 54(1), 10–16.

Gobet, F., & Simon, H.A. (1996a). Recall of rapidly presented random chess positions is a function of skill. Psychonomic Bulletin & Review, 3(2), 159–163.

HI-BAR: 10 questions about Inattentional blindness, race, and interpersonal goals

2014-02-14T13:37:00.002-06:00

HI-BAR (Had I Been A Reviewer)

A post-publication review of Brown-Iannuzzi et al (2014). The invisible man: Interpersonal goals moderate inattentional blindness to African Americans. Journal of Experimental Psychology: General, 143, 33-37. [pubmed] [full paper]

For more information about HI-BAR reviews, see this earlier post.

In a paper published this week in the Journal of Experimental Psychology:General, Brown-Iannuzzi and colleagues reported a study in which participants (White women) first were asked to think about their interpersonal goals and then completed an inattentional blindness task in which the unexpected event was either the appearance of a White man or a Black man. For these participants, their idealized interpersonal goals presumably included same-race romantic partners or friends, so the prediction was that priming participants to think about these idealized interpersonal goals would make them less likely to notice an unexpected Black "interloper" than a White one in a basketball counting task similar to our earlier selective attention task.

This approach is interesting and potentially important for several reasons that have nothing to do with race or interpersonal goals. Most studies showing variability in noticing rates as a function of expectations manipulate expectations by varying the task itself (e.g., counting the white shapes rather than the black ones, attending to shape rather than color. See Most et al, 2001, 2005). In this study, Brown-Iannuzzi and colleagues manipulated expectations not by changing the task instructions, but by priming people using an entirely unrelated task. In effect, their priming task was designed to get people to envision a White person without calling attention to race and then used that more activated concept to induce a change in noticing rates as a function of race. If this approach proves robust, it could change how we think about the detection of unexpected objects because it implies that an attention set could be induced (subtly) in a powerful way without actually changing the primary task people are doing.

I don't really have any specific comments about the use of race in particular or the use of interpersonal goals to accomplish this form of priming, but given the broader theoretical importance of this claim, I do have a number of questions about the methods and results in this paper. Most of my questions arise from the relatively sparse reporting of methods and results details in the paper, so I hope that they can be addressed if the authors provide more information. I am concerned that the evidence for the core conclusions is somewhat shaky given the fairly small samples and the flexibility of the analysis choices. Given the potential importance of this claim, I would like to see the finding directly replicated with a pre-registered analysis plan and a larger sample to verify that the effects are robust.

10 Questions and Comments

1) The method section provides almost no information about the test videos. What did the test video look like? How long did it last? Were all the players in the video White? How many passes did each team make? Did the two teams differ in the number of passes they made? Did the two videos differ in any way other than the race of the unexpected person? What color clothes did the players wear? How were the videos presented online to MTurk participants? (i.e., were they presented in Flash or some other media format?) Was there any check that the videos played smoothly on the platform on which they were viewed? Were there any checks on whether participants followed instructions carefully? That can be an issue with MTurk samples, but the only check on attentiveness appears to be the accuracy of counting. Was there any check to make sure people actually did think about their interpersonal goals? No subjects appear to have been excluded due to the failure to do. How long did the study take to complete? All of these factors potentially affect the results, so it's hard to evaluate the study without this information.

2) Were the players in the video wearing black shirts and white shirts as in our original video? If so, which team's passes did people count? Was that counter-balanced, and if so, did it matter? If the players were wearing white/black shirts, the finding that the shirt color worn by the unexpected person didn't matter is really surprising (and somewhat odd given Most et al's findings that similarity to the attended items matters). The task itself induces an attention set by virtue of the demand to focus selectively on one group and not the other, and it would be surprising if the subtle race priming yielded a more potent attention set than the task requirements. We know that the attention demands of the task (what's attended, what's ignored) affect noticing based on the similarity of the unexpected item to the attended and ignored items. That's a pretty big effect. Why wouldn't it operate here too? Shouldn't we also expect some interaction between the priming effect and the color of the attended team.

3) The analyzed sample consisted of 209 MTurk subjects. I have no objection to using MTurk for this sort of study. But, the method section doesn't report enough details about the assignment to conditions to evaluate the nature of these effects. For example, how many participants were in each of the conditions? It appears that the sample was divided across 5 (personal closeness) x 2 (race) x 2 (shirt color) combinations of conditions, for a total of 20 conditions. If so, there were approximately 10 subjects/condition. Did half of the participants attend to each team in the video? If so, that would mean there were approximately 5 subjects/combination of conditions. Depending on how many factors there were in the design, the sample size in each condition becomes pretty small. And, that's important because the core conclusion rests largely on the effect in just one of the priming conditions.

4) The paper does not report the number of participants in each condition or the number of those who noticed (it just reports percentages in a figure, but the figure does not break down the results across each of the combinations of conditions). Perhaps the authors could provide a table with that information as part of supplemental materials? Although the paper reports no effect of factors like shirt color, it's quite possible that such factors interacted with the other ones, but there probably is not adequate power with these sample sizes to test such effects. Still, it would be good to know the exact N in each combination of conditions, along with the number of participants who noticed in each condition.

5) Was there a reason why missing the total count by 4 was used the cutoff for accurate performance? That might well be a reasonable cutoff (it led to 35 exclusions out of the original 244 participants), but the paper doesn't report the total number of passes in the video, so we don't know how big an error that actually is. The analysis excluded participants who were inaccurate (by 4 passes), and footnote 3 reports that the simple comparisons were weaker if those participants were included. Does that mean that the effect in the Friend condition was not significant with these subjects included? Did that effect depend on this particular cutoff for accuracy? What if the study used a cutoff of 3 missed passes rather than 4? Would that change the outcome? How about 5? or 2? What if the study also excluded people familiar with the original basketball-counting video? Would it be reliable then? The flexibility of these sorts of analysis decisions are one reason I strongly favor pre-registered analysis plans for confirmatory studies.

6) It seems problematic to include the 23% of participants who were familiar with inattentional blindness in the analyses, especially if they were familiar with one of the variants in which people count ball passes by one of two teams of players. Although one of my own papers was cited as justification for including rather than excluding these participants, I didn't understand the reasoning. It's true that people can fail to notice an unexpected event even if they are familiar with the phenomenon of inattentional blindness more generally, but that typically depends on them not making the connection between the current test and the previous one. That is, they shouldn't have reason to expect an unexpected event in this particular task/scenario (e.g., people familiar with the "gorilla" video might still fail to detect an unexpected target in Mack & Rock's original paradigm because the two tasks are so different that they have no reason to expect one). When I show another variant of the gorilla task to people who are already familiar with the original gorilla video, they do notice the gorilla (Simons, 2010). They expect an unexpected interloper and look for it. The same likely is true here. The method section reports that excluding these 23% of the participants did not matter for the interaction, but the analyses with those data excluded are not reported. And, given that excluding those subjects would reduce the sample size by 23% and that the critical simple comparison was p=.03 (see below), it seems likely that the exclusion would have some consequences for the statistical significance of the critical comparisons underlying the conclusions. Perhaps the authors could report these analyses more fully in supplemental materials.

7) It is not appropriate to draw strong conclusions from the difference in noticing rates in the control condition for the White and Black unexpected person. The paper suggests that the difference in the no-prime control group results from racial stereotyping: White subjects view a Black person as a greater threat, so he garners more attention and is noticed more by default. But, this comparison is of two different videos that presumably were not identical in every respect other than the race of the unexpected person. It's quite possible that the actor just stood out more in that video due to the choreography of the players. Or, that particular Black actor might have stood out more due to features having nothing to do with his race (e.g., maybe he was taller, moved with a different gait, etc.). It's risky to draw strong comparisons about baseline differences in noticing rates across videos because many things could differ between any two videos like this. It's not justified to assume that any baseline differences must have been due to race, especially with just one example of an unexpected person of each Race.

8) If the control condition is treated as a default for noticing against which the priming conditions are compared (really, the appropriate way to do it given that direct comparison of different stimuli is questionable), it would be nice to have a more robust estimate of that baseline condition with a substantially larger sample. Otherwise, the noisy measurement of the baseline could artificially inflate the estimates of interaction effects. In the analyses, though, the baseline is treated just like the other priming conditions. If anything, the effects might be larger if the baseline were subtracted from each of the other conditions, but I would be hesitant to do that without first increasing the precision of the baseline estimate.

9) The primary conclusions are driven by the condition in which subjects were primed by thinking about an idealized Friend (the interaction effects are much harder to interpret because the priming conditions really are categories rather than an equally-spaced continuum, and even if they were a continuum, the trends are not consistent with one). Ideally, these simple effect analyses would have been relative to a baseline condition to account for default differences in noticing rates. Still, with the direct comparison, the Friend priming condition was the only one to yield significantly greater detection of the White unexpected person (p=.03). The effect for the Romantic Partner condition was not statistically significant at .05, nor were the neighbor or co-worker conditions. I don't see any a-priori reason to expect an effect just for the Friend condition, and with corrections for multiple tests, that effect would not have been significant either. This amounts to a concern about analysis flexibility: The authors could have drawn the same conclusion had there been a difference in the Romantic Partner condition but not the Friend condition. It might even be possible to explain an effect of priming in the more remote interpersonal relationship conditions or in some combination of them. Correcting for multiple tests helps to address this issue, but I would prefer pre-registration of the specific predictions for any confirmatory hypothesis test. Then, any additional analyses could be treated as exploratory. With this level of flexibility, correction for these alternative outcomes seems necessary when interpreting the statistical significance of the findings.

10) Without correcting for multiple tests, the paper effectively treats the secondary analyses as confirmatory hypothesis tests, but any of a number of other outcomes could have also been taken as support for the same hypothesis. Given the relatively small effect in the one condition that was statistically significant (although, again, see my notes about using a baseline), I worry that the effect might not be robust. Would the effect in that condition no longer be significant if one or two people who noticed were shifted to the miss category? My guess is that the significance in the one critical condition hinges on as little as that. A confirmatory replication would be helpful to verify that this effect is reliable.

Conclusions

This study provides suggestive evidence for a new way to moderate noticing of unexpected events. If true, it would have substantial theoretical and practical importance. However, the missing method details make the finding hard to evaluate. And, the flexibility of the analysis choices coupled with a critical finding that only reaches statistical significance without correction for that flexibility make me worried about the robustness of the result. Fortunately, the first of these issues would be easy to address by adding additional information to the supplemental materials. I hope the authors will do that so that readers can more fully evaluate the study. They could also provide more information about the results and the consequences of various analytical decisions. Given the importance of the finding, I hope they will build on this finding by conducting a pre-registered, highly powered, confirmatory test of the effect.

Replication, Retraction, and Responsibility

2014-01-25T10:21:00.004-06:00

Congrats/thanks to Brent Donnellan, Joe Cesario, and Rich Lucas for their tenacity and perseverance. They conducted 9 studies with more than 3000 participants in order to publish a set of direct replications. Their paper challenged an original report (study 1 in Bargh & Shalev, 2012) claiming that loneliness is correlated with preferred shower temperatures. The new, just-accepted paper did not find a reliable correlation. Donnellan describes their findings and the original studies in what may be the most measured and understated blog post I've seen. You should read it.

The original study had fewer than 100 subjects (51 from a Yale undergraduate sample and a replication with 41 from a community sample), underpowered to detect a typical effect size in a social psychology experiment. But there are bigger problems with the original results.

According to the description in Donnellan's post, the data from the Yale sample were completely screwy: 46/51 Yale students reported taking fewer than 1 shower/bath per week! Either Yale students are filthy, or something's wrong with the data. More critical for the primary question, 42/51 Yale students apparently prefer cold (24 students) or lukewarm (18 students) showers. How many people do you know who prefer cold showers to reasonably hot ones? Again, something's out of whack. In a comment on Donnellan's blog post, Rich Lucas noted that the original distribution of preferred temperatures would mirror what Donnellan et al found if the original data were inadvertently reverse coded. Of course, that would mean the correlation reported in the paper was backwards, and the effect was the opposite of what was claimed.

From an earlier Donnellan post, we know that Bargh was aware of these issues back in 2012, but that he prevented Donnellan and his colleagues from discussing the problems until recently (you should read that post too). In a semi-private email discussion among priming researchers and skeptics, Bargh claimed that his prohibition on discussing his data was just a miscommunication, but he didn't get around to correcting that misconception until he was pressed to respond on that email thread. In the same thread, Bargh claimed to have first learned of these errors from Joe Cesario (who initially requested the original data). Although it's odd that he didn't notice the weird distribution in the frequency responses, I can understand how someone might miss something obvious when they were focusing attention elsewhere... Bargh said that he provided an explanation to the editor at Emotion during the review process: He claimed that Yale students misunderstood the bathing frequency item as asking specifically about baths (not showers). According to Joe Cesiaro's response in that same email thread, though, that doesn't accord with the survey wording about showers/baths that Bargh provided.

Still, whenever and however Bargh learned of the problems with the data, he and Shalev had an obligation to retract the original study and issue an erratum (unless they actually believe Yale students prefer rare, cold showers). Even if the subjects misinterpreted the frequency question, the results are bogus. The problems could well have resulted from an honest oversight, a slip up in coding, a misinterpretation of a poorly worded question, or an Excel copy/paste error. Regardless of the reason, authors have a responsibility to own up to mistakes and to correct the record. Posting to a semi-private email list is not sufficient—the public record in the journal must be amended. Authors have an obligation to correct mistakes once they know of them, and the failure to do so in the published record is troubling.

Note that I am not arguing the original study should be retracted just because Donnellan and colleagues didn't replicate it. A failure to replicate is not adequate grounds for questioning the integrity of the original finding. The original effect size estimates could be wrong, but that's just science working properly to correct itself (that's why direct replications are useful and important). Yet, obviously flawed data like those described by Donnellan should not have to await replication, and scholars reading the literature should be informed that they should not place any stock in that first study with Yale students. That finding should be withdrawn until it can be verified so that it doesn't mislead the field.

One thing I find troubling about this story is that Donnellan, Cesario, and Lucas needed to conduct 9 studies with more than 30x the original number of participants in order to get this paper accepted at Emotion. They should be applauded for replicating with enough power to be sure that their effect size estimates are precise, but each of their studies had more than 2.5x the sample size of the original! If their efforts were entirely voluntary and not a consequence of appeasing reviewers, kudos to them for making sure they got it right. I'm glad that this paper was accepted, and our field owes them gratitude for their efforts. I just hope they haven't set an overly high standard and precedent for what's needed to publish a direct replication.

I would encourage Bargh to issue a public explanation (accessible to the whole field, not just an email thread) for the data issues in their original study. The problems could well have been an accidental coding or interpretation problem, and mistakes are excusable even if they do undermine the claims. More importantly, he should retract the original study (not the whole paper, necessarily -- just the study with problematic data) and issue an erratum in the journal. Out of curiosity, I would like to see an explanation for why the study was not retracted immediately upon learning of the problems more than a year ago. Perhaps there is a good reason, but I'm having trouble generating one. I hope he will enlighten us.

New Posts on IonPsych - October 27, 2013

2013-10-27T10:37:00.003-05:00

Jim Monti describes the best way to stave off the cognitive costs of aging.

Luis Flores describes a recent study of depression and argues that people with depression may enjoy activities just as much as those without depression, but they are less willing to work to experience those activities.

Christina Tworek muses on the causes of the large disparity between what scientists know and what the public knows of science.

Emily Hankosky critiques claims in a recent book espousing personal responsibility for overcoming addiction. She makes a case that discounting neurobiological bases of addiction is irresponsible.

Lindsey Hammerslag discusses the importance of teaching about certainty and uncertainty in science.

New Posts on IonPsych - 13 October 2013

2013-10-13T10:18:00.003-05:00

This fall, I am teaching a graduate seminar on speaking/writing for general audiences. As part of the class, students blog at www.ionpsych.com. Each week, I'll provide a short summary of the latest posts.

The latest posts on www.ionpsych.com:

H. A. Logis explores how we mold our behaviors to the people around us.

Brian Metzger demonstrates how perception often isn't subject to free will.

Emily Kim describes alternatives to academia for social psychologists.

Joachim Operskalski evaluates how new developments in neuroscience might improve treatment of psychiatric disorders and why we shouldn't be so quick to dismiss Prozac.

Anna Popova explains how to make statistics interesting.

New Posts on www.ionpsych.com (Oct 6, 2013)

2013-10-06T09:27:00.004-05:00

New Posts on IonPsych

2013-09-30T14:48:00.001-05:00

This fall, I am teaching a graduate seminar on speaking/writing for general audiences. As part of the class, students blog at www.ionpsych.com. Each week, I'll provide a short summary of the latest posts.

The latest posts on www.ionpsych.com:

Carolyn Hughes discusses the neural basis of first impressions.

Luis Flores addresses the difference between worrying and generalized anxiety disorder

Lindsey Hammerslag considers whether scientists are sexist for failing to deal with monkey cramps.

H. A. Logis revisits claims that bullying kills.

Christina Tworek examines why some people have a problem with boys wearing nail polish.

New posts on www.ionpsych.com

2013-09-23T08:45:00.004-05:00

This fall, I am teaching a graduate seminar on speaking/writing for general audiences. As part of the class, students blog at www.ionpsych.com. Each week, I'll provide a short summary of the latest posts.

The latest posts on www.ionpsych.com:

Anna Popova explains why individual and group decisions are treated differently by researchers studying decision making (and why they shouldn't be).

Jim Monti examines how your diet might affect your risk of Alzheimer's disease.

Aldis Sipolins covers how you might improve your video game playing (and learning in general) by applying direct current to your head.

Emily Hankosky describes a breakthrough study that might change how people have children in the not-too-distant future.

Judy Chiu argues that stress isn't inherently bad for you - what matters is how you think about stress.

19 questions about video games, multitasking, and aging (a HI-BAR commentary on a new Nature paper)

2013-09-05T16:20:00.004-05:00

HI-BAR (Had I Been A Reviewer)

**A post-publication review of Anguera et al (2013). Video game training enhances cognitive control in older adults. Nature, 501, 97-101.**

For more information about HI-BAR reviews, see my post from earlier today.

In a paper published this week in Nature, Anguera et al reported a study in which older adults were trained on a driving video game for 12 hours. Approximately 1/3 of the participants engaged in multitasking training (both driving and detecting signs), another 1/3 did the driving or sign tasks separately without having to do both at once, and the final 1/3 was a no-contact control. The key findings in the paper:

After multitasking training, the seniors attained "levels beyond those achieved by untrained 20-year-old participants, with gains persisting for 6 months"
Multitasking training "resulted in performance benefits that extended to untrained cognitive control abilities"
Neural measures of midline-frontal theta power and frontal-parietal theta coherence correlated with these improvements

This is one of many recent papers touting the power of video games to improve cognition, published in a top journal, that receives glowing (almost breathless) media coverage: The NY Times reports "A Multitasking Video Game Makes Old Brains Act Younger." A story in Nature in nature claims "Gaming Improves Multitasking Skills." The Atlantic titles their story, "How to Rebuild an Attention Span." (Here's one exception that notes a few limitations).

In several media quotes, the senior author on the paper (Gazzaley) admirably cautions against over-hyping of these findings (e.g., "Video games shouldn’t now be seen as a guaranteed panacea" in the Nature story). Yet overhyping is exactly what we have in the media coverage (and a bit in the paper as well).

The research is not bad. It's a reasonable, publishable first study that requires a bit more qualification and more limited conclusions: Some of the strongest claims are not justified, the methods and findings have limitations, and none of those are shortcomings are acknowledged or addressed. If you are a regular reader of this blog, you're familiar with the problems that plague most such studies. Unfortunately, it appears that the reviewers, editors, and authors did not address them.

In the spirit of Rolf Zwaan's recent "50 questions" post (although this paper is far stronger than the one he critiqued), here are 19 comments/questions about the paper and supplementary materials (in a somewhat arbitrary order). I hope that the authors can answer many of these questions by providing more information. Some might be harder to address. I would be happy to post their response here if they would like.

19 Questions and Comments

1. The sample size is small given the scope of the claims, averaging about 15 per group. That's worrisome -- it's too small a sample to be confident that random assignment compensates for important unknown differences between the groups.

2. The no-contact control group is of limited value. All it tells us is whether the training group improved more than would be expected from just re-taking the same tests. It's not an adequate control group to draw any conclusions about the effectiveness of training. It does nothing to control for motivation, placebo effects, differences in social contact, differences in computer experience, etc. Critically, the relative improvements due to multitasking training reported in the paper are consistently weaker (and fewer are statistically significant) when the comparison is to the active "single task" control group. According to Supplementary Table 2, out of the 11 reported outcome measures, the multitasking group improved more than the no-contact group on 5 of those measures, and they improved more than the single-task control on only 3.

3. The dual-task element of multitasking is the mechanism that purportedly underlies transfer to the other cognitive tasks, and neither the active nor the no-contact control included that interference component. If neither control group had the active ingredient, why were the effects consistently weaker when the multitasking group was compared to the single task group than when compared to the control group? That suggests the possibility of a differential placebo effect: Regardless of whether or not the condition included the active ingredient, participants might improve because they expected to improve.

4. The active control group is relatively good (compared to those often used in cognitive interventions) - it uses many of the same elements as the multitasking group and is fairly closely matched. But, the study included no checks for differential expectations between the two training groups. If participants expected greater improvements on some outcome measures from multitasking training than from single-task training, then some or all of the benefits for various outcome measures might have been due to expectations rather than to any benefits of dual-task training. For details, see our paper in Perspectives that discusses this pervasive problem with active control groups. If you want the shorter version, see my blog post about it. Just because a control group is active does not mean that it accounts for differential expectations across conditions.

5. The paper reports that improvements in the training task were maintained for 6 months. That's welcome information, but not particularly surprising (see #13 below). The critical question is whether the transfer effects were long-lasting. Were they? The paper doesn't say. If they weren't, then all we know is that subjects retained the skills they had practiced, and we know nothing about the long-term consequences of that learning for other cognitive skills.

6. According to Figure 9 in the supplement, 23% of the participants who enrolled in the study were dropped from the study/analyses (60 enrolled, 46 completed the study). Did drop out or exclusion differentially affect one group? If participants were dropped based on their performance, how were the cutoffs determined? Did the number of subjects excluded for each reason vary across groups? Are the results robust to different cut-offs? What are the implications for the broad use of this type of training if nearly a quarter of elderly adults cannot do the tasks adequately?

7. Supplemental Table 2 reports 3 significant outcome measures out of 11 tasks/measures (when comparing to the active control group). Many of those tasks include multiple measures and could have been analyzed differently. Consider also that each measure could be compared to either control group and that it also would have been noteworthy if the single task group had outperformed the no-contact group. That means there were a really large number of possible statistical tests that, if significant, could have been interpreted as supporting transfer of training. I see no evidence of correction for multiple tests. Only a handful of these many tests were significant, and most were just barely so (interaction of session x group was p=.08, p=.03, and p=.02). For the crucial comparison of the multitasking group to each control group, the table only reports a "yes" or "no" for statistically significant at .05, and they must be close to that boundary. (There also are oddities in the table, like a reported significant effect with d=.67, but a non-significant one with d=.68 for the same group comparison.) With correction for multiple comparisons, I'm guessing that none of these effects would reach statistical significance. A confirmatory replication with a larger sample would be needed to show that the few significant results (with small sample sizes) were not just false positives.

8. The pattern of outcome measures is somewhat haphazard and inconsistent with the hypothesis that dual-task interference is the reason for differential improvements. For example, if the key ingredient in dual-task training is interference, why didn't multitasking training lead to differential improvement on the dual-task outcome measure? That lack of a critical finding is largely ignored. Similarly, why was there a benefit for the working memory task that didn't involve distraction/interference? Why wasn't there a difference in the visual short term memory task both with and without distraction? Why was there a benefit for the working memory task without distraction (basically a visual memory task) but not the visual memory task? The pattern of improvements seems inconsistent with the proposed mechanism for improvement.

9. The study found that practice on a multitasking video game improves performance on that game to level of a college student. Does that mean that the game improved multitasking abilities to the level of a 20 year old? No, although you'd never know that from the media coverage. The actual finding is that after 12 hours of practice on a game, seniors play as well as a 20 year old who is playing the game for the first time. The finding does not show that multitasking training improved multitasking more broadly. In fact, it did not even transfer to a different dual task. Did they improve to the level of 20 year olds on any of the transfer tasks? That seems unlikely, but if they did, that would be bigger news.

10. The paper reports only difference scores and does not report any means or standard deviations. This information is essential to help the reader decide whether improvements were contaminated by regression to the mean or influenced by outliers. Perhaps the authors could upload the full data set to openscienceframework.org or another online repository to make those numbers available?

11. Why are there such large performance decreases in the no-contact group (Figures 3a and 3b of the main paper)? This is a slowing of 100ms, a pretty massive decline for just one month of aging. Most of the other data are presented as z-scores, so it's impossible to tell whether the reported interactions are driven by a performance decrease in one or both of the control groups rather than an improvement in the multitasking group. That's another reason why it's essential to report the performance means and standard deviations for both the pre-test and the post-test.

12. It seems a bit generous to claim (p.99) that, in addition to the significant differences on some outcome measures, there were trends for better performance in other tasks like the UFOV. Supplementary Figure 15 shows no difference in UFOV improvements between the multitasking group and the no-contact control. Moreover, because these figures show Z-scores, it's impossible to tell whether the single-task group is improving less or even showing worse performance. Again, we need the means for pre- and post-testing to evaluate the relative improvements.

13. Two of the core findings of this paper, that multitasking training can improve the performance of elderly subjects to the levels shown by younger subjects and that those improvements last for months, are not novel. In fact, they were demonstrated nearly 15 years ago in a paper that wasn't cited. Kramer et al (1999) found that giving older adults dual-task training led to substantial improvements on the task, reaching the levels of young adults after a small amount of training. Moreover, the benefits of that training lasted for two months. Here are the relevant bits from the Kramer et al abstract:

Young and old adults were presented with rows of digits and were required to indicate whether the number of digits (element number task) or the value of the digits (digit value task) were greater than or less than five. Switch costs were assessed by subtracting the reaction times obtained on non-switch trials from trials following a task switch.... First, large age-related differences in switch costs were found early in practice. Second, and most surprising, after relatively modest amounts of practice old and young adults switch costs were equivalent. Older adults showed large practice effects on switch trials. Third, age-equivalent switch costs were maintained across a two month retention period.

14. While we're on the subject of novelty, the authors state in their abstract: "These findings ... provide the first evidence, to our knowledge, of how a custom-designed video game can be used to assess cognitive abilities across the lifespan, evaluate underlying neural mechanisms, and serve as a powerful tool for cognitive enhancement." Unfortunately, they seem not to have consulted the extensive literature on the effects of training with the custom-made game Space Fortress. That game was designed by cognitive psychologists and neuroscientists in the 1980s to study different forms of training and to measure cognitive performance. It includes components designed to train and test memory, attention, motor control, etc. It has been used with young and old participants, and it has been studied using ERP, fMRI, and other measures. The task has been used to study cognitive training and transfer of training both to laboratory tasks and to real world performance. It has also been used to study different forms of training, some of which involve explicit multitasking and others that involve separating different task components. There are dozens of papers (perhaps more than 100) using that game to study cognitive abilities, training, and aging. Those earlier studies suffered from many of the same problems that most training interventions do, but they do address the same issues studied in this paper. The new game looks much better than Space Fortress, and undoubtably is more fun to play, but it's not novel in the way the authors claim.

15. Were the experimenters who conducted the cognitive testing blind to the condition assignment? That wasn't stated, and if they were not, then experimenter demands could contribute to differential improvements during the post-test.

16. Could the differences between conditions be driven by differences in social contact and computer experience? The extended methods state, "if requested by the participant, a laboratory member would visit the participant in their home to help set up the computer and instruct training." How often was such assistance requested? Did the rates differ across groups? Later, the paper states, "All participants were contacted through email and/or phone calls on a weekly basis to encourage and discuss their training; similarly, in the event of any questions regarding the training procedures, participants were able to contact the research staff through phone and email." Presumably, the authors did not really mean "all participants." What reason would the no-contact group have to contact the experimenters, and why would the experimenters check in on their training progress? As noted earlier, differences like this are one reason why no-contact controls are entirely inadequate for exploring the necessary ingredients of a training improvement.

17. Most of the assessment tasks were computer based. Was there any control for prior computer experience or the amount of additional assistance each group needed? If not, the difference between these small samples might partly be driven by baseline differences in computer skills that were not equally distributed across conditions. The training tasks might also have trained the computer skills of the older participants or increased their comfort with computers. If so, improved computing skills might account for any differences in improvement between the training conditions and the no-contact control.

18. The paper states, "Given that there were no clear differences in sustained attention or working memory demands between MTT and STT, transfer of benefits to these untrained tasks must have resulted from challenges to overlapping cognitive control processes." Why are there no differences? Presumably maintaining both tasks in mind simultaneously in the multitasking condition places some demand on working memory. And, the need to devote attention to both tasks might place a greater demand on attention as well. Perhaps the differences aren't clear, but it seems like an unverified assumption that they tap these processes equally.

19. The paper reports a significant relationship between brain measures and TOVA improvement (p = .04). The “statistical analyses” section reports that one participant was excluded for not showing the expected pattern of results after training (increased midline frontal theta power). Is this a common practice? What is the p value of the correlation when this excluded participant is included? Why aren’t correlations reported for the relationship between the transfer tasks and training performance or brain changes for the single-task control group? If the same relationships exist for that group, then that undermines the claim that multitask training is doing something special. The authors report that these relationships aren't significant, but the ones for the multitasking group are not highly significant either, and the presence of a significant relationship in one case and not in the other does not mean that the effects are reliably different for the two conditions.

Conclusions

Is this a worthwhile study? Sure. Is it fundamentally flawed? Not really. Does it merit extensive media coverage due to it's importance and novelty? Probably not. Should seniors rush out to buy brain training games to overcome their real-world cognitive declines? Not if their decision is based on this study. Should we trust claims that such games might have therapeutic benefits? Not yet.

Even if we accept all of the findings of this paper as correct and replicable, nothing in this study shows that the game training will improve an older person's ability to function outside of the laboratory. Claims of meaningful benefits, either explicit or implied, should be withheld until demonstrations show improvements on real or simulated real-world tasks as well.

This is a good first study of the topic, and it provides a new and potentially useful way to measure and train multitasking, but it doesn't merit quite the exuberance displayed in media coverage of it. If I were to rewrite the abstract to reflect what the study actually showed, it might sound something like this:

In three studies, we validated a new measure of multitasking, an engaging video game, by replicating prior evidence that multitasking declines linearly with age. Consistent with earlier evidence, we find that 12 hours of practice with multitasking leads to substantial performance gains for older participants, bringing their performance to levels comparable to those of 20-year-old subjects performing the task for the first time. And, they remained better at the task even after 6 months. The multitasking improvements were accompanied by changes to theta activity in EEG measures. Furthermore, an exploratory analysis showed that multitasking training led to greater improvements than did an active control condition for a subset of the tasks in a cognitive battery. However, the pattern of improvements on these transfer tasks was not entirely consistent with what we might expect from multitasking training, and the active control condition did not necessarily induce the same expectations for improvement. Future confirmatory research with a larger sample size and checks for differential expectations is needed to confirm that training enhances performance on other tasks before strong claims about the benefits of training are merited. The video game multitasking training we developed may prove to be a more enjoyable way to measure and train multitasking in the elderly.

HI-BAR (Had I Been A Reviewer)

2013-09-05T09:05:00.005-05:00

HI-BAR
Had I Been a Reviewer

If you're a researcher, you undoubtedly have had the experience of reading a new paper in your specialty area and thinking to yourself, "Had I been a reviewer, I would have raised serious concerns about these findings and claims." Or, less charitably, you might ask, "How the hell did that paper survive peer review?"

Each paper is reviewed by only 2 or 3 people, and small samples can lead to flawed conclusions. Given that I can't insert myself into the review process in advance of publication, I will, on occasion, use my blog to post the sorts of comments I would have made Had I Been A Reviewer of the original manuscript. My comments won't always take the same form that they would have if I had reviewed the paper in advance of publication when constructive comments might improve a manuscript. Rather, they will comment on the strengths and shortcomings of the finished product. On occasion, when I have reviewed a manuscript and the paper was published without addressing major concerns, I might post the reviews here as well (I always sign my reviews, so they won't come as a surprise to the authors in such cases).

Not all of these posts will be take-downs or critiques, although some will be. Post-publication review can help to correct mistakes in the literature, and it also identifies controversies that might have been glossed over in a manuscript media coverage of it and it can inspire future research. I hope that more researchers will take up the call and post their own HI-BAR post-publication reviews.

Good resources for science writing/speaking?

2013-08-12T13:46:00.004-05:00

For a psychology graduate class I'm teaching this fall (on speaking/writing for general audiences), I'm trying to create a list of good resources on writing, speaking, and blogging about science. I'm hoping that you can help.

I'm particularly interested in finding good discussions of the value and risks of blogging, suggestions for best practices in writing/speaking, etc. Do you have a favorite go-to source for such advice? Do you know of helpful resources for beginning science writers and speakers? If so, please leave them in the comments (or send them to me directly). I'll compile the full list and will post it here.

Stop the presses

2013-08-07T11:31:00.005-05:00

Yesterday I encountered something I've never seen before: a formal press release from an academic society (SPSP) about a conference presentation of unpublished research:

http://www.eurekalert.org/pub_releases/2013-08/sfpa-vgb080213.php

A friend of mine forwarded it to me because it makes claims about the cognitive benefits of video game training, an area fraught with methodological problems that my colleagues and I have written about extensively (e.g, here's a recent blog post about a recent critique of such interventions). My guess is that the design shortcomings we discussed in that paper undermine the claims that these authors are making. But, I have no way to know. The actual research isn't available.

Why does this work merit a press release now, before the research has been published?

The purpose of a press release is to draw public (and media) attention to a new finding, but in this case, the press release effectively is the finding because nobody can access the actual research. Journalists or science writers covering this study will have no more information than is available in the release itself, so they cannot verify that the research actually shows what the release claims that it does. In other words, the press release encourages churnalism rather than science reporting.

In my view, academic societies should not be encouraging media coverage of research until the actual research is available for popular consumption. Doing so risks misleading the public. For this particular release, if the studies suffer from the problems we discussed in our recent article, then the conclusions might be unjustified and there would be a reasonable chance that the research would not survive the peer review process (I can only hope that reviewers would nix publication if the claims aren't justified). If that happened, then the press release would have hyped vapor-findings, claims that lack any underlying support. How does that benefit the popular appraisal of our field?

Journalists and bloggers are free to discuss research they learn about at conferences, of course. And they typically do a good job in noting when findings are tentative (or giving enough details that others can evaluate the claims). But a formal press release from an academic society about unpublished research that is not available seems to me to be a different beast.

Are there cases in which an academic society should issue a press release based on a conference presentation? Do you think this sort of press release is acceptable? I'd be curious to hear the perspectives of other scientists and science writers. Let me know what you think?

Pop Quiz - What can we learn from an intervention study?

2013-07-09T10:36:00.004-05:00

Pop Quiz

1. Why is a double-blind, placebo-controlled study with random assignment to conditions the gold standard for testing the effectiveness of a treatment?

2. If participants are not blind to their condition and know the nature of their treatment, what problems does that lack of control introduce?

3. Would you use a drug if the only study showing that it was effective used a design in which those people who were given the drug knew that they were taking the treatment and those who were not given the drug knew they were not receiving the treatment? If not, why not?

Stop reading now, and think about your answers.

Most people who have taken a research methods class (or introductory psychology) will be able to answer all three. The gold standard controls for participant and experimenter expectations and helps to control for unwanted variation between the people in each group. If participants know their treatment, then their beliefs and expectations might affect the outcome. I would hope that you wouldn't trust a drug tested without a double-blind design. Without such a design, any improvement by the treatment group need not have resulted from the drug.

In a paper out today in Perspectives on Psychological Science, my colleagues (Walter Boot, Cary Stothart, and Cassie Stutts) and I note that psychology interventions typically cannot blind participants to the nature of the intervention—you know what's in your "pill." If you spend 30 hours playing an action video game, you know which game you're playing. If you are receiving treatment for depression, you know what is involved in your treatment. Such studies almost never confront the issues introduced by the lack of blinding to conditions, and most make claims about the effectiveness of their interventions when the design does not permit that sort of inference. Here is the problem:

If participants know the treatment they are receiving, they may form expectations about how that treatment will affect their performance on the outcome measures. And, participants in the control condition might form different expectations. If so, any difference between the two groups might result from the consequences of those expectations (e.g., arousal, motivation, demand characteristics, etc.) rather than from the treatment itself.

A truly double blind design addresses that problem—if people don't know whether they are receiving the treatment or the placebo, their expectations won't differ. Without a double blind design, researchers have an obligation to use other means to control for differential expectations. If they don't, then a bigger improvement in the treatment group tells you nothing conclusive about the effectiveness of the treatment. Any improvement could be due to the treatment, to different expectations, or to some combination of the two. No causal claims about the effectiveness of the treatment are justified.

If we wouldn't trust the effectiveness of a new drug when the only study testing it lacked a control for placebo effects, why should we believe a psychology intervention if it lacked any controls for differential expectations? Yet, almost all published psychology interventions attribute causal potency to interventions that lack such controls. Authors seem to ignore this known problem, reviewers don't block publication of such papers, and editors don't reject them.

Most psychology interventions have deeper problems than just a lack of controls for differential expectations. Many do not include a control group that is matched to the treatment group on everything other than the hypothesized critical ingredient of the treatment. Without such matching, any difference between the tasks could contribute to the difference performance. Some psychology interventions use completely different control tasks (e.g., crosswords puzzles as a control for working memory training, educational DVDs a control for auditory memory training, etc). Even worse, some do not even use an active control group, instead comparing performance to a "no-contact" control group that just takes a pre-test and a post-test. Worst of all, some studies use a wait-list control group that doesn't even complete the outcome measures before and after the intervention.

In my view, a psychology intervention that uses a waitlist or no-contact control should not be published. Period. Reviewers and editors should reject it without further consideration -- it tells us almost nothing about whether the treatment had any effect, and is just a pilot study (and a weak one at that).

Studies with active control groups that are not matched to the treatment intervention should be viewed as suspect—we have no idea what differences between the treatment and control condition were necessary. Even closely matched control groups do not permit causal claims if the study did nothing to check for differential expectations.

To make it easy to understand these shortcomings, here is a flow chart from our paper that illustrates when causal conclusions are merited and what we can learn from studies with weaker control conditions (short answer -- not much):

Almost no psychology interventions even fall into that lower-right box, but almost all of them make causal claims anyway. That needs to stop.

If you want to read more, check out our OpenScienceFramework Page for this paper/project. It includes an answers to a set of Frequent Questions.

Daniel Simons

Earliest proposal for a new registered report format?

Happy to be wrong (Covid update)

Covid responses at Duke and UIUC

Covid numbers - August 30, 2021

Covid at Illinois - 30 August 2021

Vaccine "requirement"

Changes to testing

The consequences of testing only the unvaccinated

Analyses of Covid trends at the University of Illinois

A new journal at APS: AMPPS

Visual effort and inattentional deafness

HI-BAR: A gold standard brain training study?

A gold-standard brain training study? Not without some alchemy

Response from Ballesteros et al to my HI-BAR

HI-BAR: More benefits of Lumosity training for older adults?

HI-BAR (Had I Been A Reviewer)

HI-BAR: Benefits of Lumosity training for older adults?

How experts recall chess positions

HI-BAR: 10 questions about Inattentional blindness, race, and interpersonal goals

HI-BAR (Had I Been A Reviewer) A post-publication review of Brown-Iannuzzi et al (2014). The invisible man: Interpersonal goals moderate inattentional blindness to African Americans. Journal of Experimental Psychology: General, 143, 33-37. [pubmed] [full paper]

10 Questions and Comments

Conclusions

Replication, Retraction, and Responsibility

New Posts on IonPsych - October 27, 2013

New Posts on IonPsych - 13 October 2013

New Posts on www.ionpsych.com (Oct 6, 2013)

New Posts on IonPsych

New posts on www.ionpsych.com

19 questions about video games, multitasking, and aging (a HI-BAR commentary on a new Nature paper)

HI-BAR (Had I Been A Reviewer) A post-publication review of Anguera et al (2013). Video game training enhances cognitive control in older adults. Nature, 501, 97-101.

19 Questions and Comments

Conclusions

HI-BAR (Had I Been A Reviewer)

Good resources for science writing/speaking?

Stop the presses

Pop Quiz - What can we learn from an intervention study?

Pop Quiz

Analyses of Covid trends
at the University of Illinois

A gold-standard brain training study?

Not without some alchemy

HI-BAR (Had I Been A Reviewer)

A post-publication review of Brown-Iannuzzi et al (2014). The invisible man: Interpersonal goals moderate inattentional blindness to African Americans. Journal of Experimental Psychology: General, 143, 33-37. [pubmed] [full paper]

HI-BAR (Had I Been A Reviewer)

**A post-publication review of Anguera et al (2013). Video game training enhances cognitive control in older adults. Nature, 501, 97-101.**