Thursday, December 10, 2015

Visual Effort and Inattentional Deafness

Earlier this week I was asked for my thoughts on a new Journal of Neuroscience paper: 
Molloy, K., Griffiths, T. D., Chait, M., & Lavie, N. (2015). Inattentional deafness: Visual load leads to time-specific suppression of auditory evoked responses. Journal of Neuroscience, 35, 16046-16054. doi: 10.1523/JNEUROSCI.2931-15.2015
In part due to a widely circulated press release, the paper has garnered a ton of media coverage, with headlines like:
Focusing On A Task May Leave You Temporarily Deaf: Study

Did You Know Watching Something Makes You Temporarily Deaf?

Study Explains How Screen Time Causes 'Inattentional Deafness'

The main contribution of the paper was a link between activation in auditory cortex and the behavioral finding of reduced detection of a sound (a brief tone) when performing a more difficult visual task. 

This brain-behavior link, not the behavioral result, is the new contribution from this paper. Yet, almost all of the media coverage has focused on the behavioral result which isn't particularly novel. That's unsurprising given that most of the stories just followed the lede of the press release, which was titled:
"Why focusing on a visual task will make us deaf to our surroundings: Concentrating attention on a visual task can render you momentarily 'deaf' to sounds at normal levels, reports a new UCL study funded by the Wellcome Trust"

Here are a few points about this paper that have largely been lost or ignored in the media frenzy (and the press release):

1. The study did not show that people were "deaf to their surroundings." In the study (Experiment 2), people performed an easy or hard visual task while also trying to detect a tone that occurred on half of the trials. When performing the easy visual task, they reported the tone accurately on 92% of the trials. When performing the harder visual task, they reported it accurately on 88% of trials. The key behavioral effect was a 4% reduction in accuracy on the secondary, auditory task when the primary visual task was harder.  In other words, people correctly reported the tone on the vast majority of trials even with the hard visual task. That's not deafness. It's excellent performance of a secondary task with just a slight reduction when the primary task is harder. 

Aside: much of that small effect on accuracy could be due to a difference in response bias between the conditions (Beta of 3.2 compared to 1.3, a difference reported as p = 0.07 with an underpowered study of only 11 subjects).
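For readers unfamiliar with the beta statistic, here is a minimal sketch of how response bias is computed in signal detection theory. The false-alarm rates below are hypothetical (only the hit rates appear in the text above); they were chosen simply to produce beta values in the same ballpark as those reported.

```python
import math
from scipy.stats import norm

def sdt_measures(hit_rate, fa_rate):
    """Return sensitivity (d'), criterion (c), and bias (beta) from hit/FA rates."""
    z_h, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_h - z_fa              # sensitivity
    c = -0.5 * (z_h + z_fa)           # criterion location
    beta = math.exp(d_prime * c)      # likelihood-ratio response bias
    return d_prime, c, beta

# Hypothetical rates: a stricter criterion in the hard-task condition yields a
# larger beta even though the hit rate changes only slightly.
print(sdt_measures(0.92, 0.06))   # easy visual task (hypothetical FA rate)
print(sdt_measures(0.88, 0.03))   # hard visual task (hypothetical FA rate)
```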

2. The behavioral effect of visual load on auditory performance is not original to this paper. In fact, it has been reported by the same lab.

3. A number of other studies have demonstrated costs to detection in one sensory modality when focusing attention on another modality. This paper is not the first to show such a cross-modal effect. See, for example, here, here, here, here, and here (none of which were cited in the paper). Many other studies have shown that increasing primary task difficulty decreases secondary task performance. Again, the behavioral result touted in the media is not new, something the press release acknowledges in passing.

4. The study doesn't actually involve inattentional deafness; the term is misused. Inattentional deafness or blindness refers to a failure to notice an otherwise obvious but unexpected stimulus when focusing attention on something else. The "unexpected" part is key to ensuring that the critical stimulus actually is unattended (the justification for claiming the failure is due to inattention); people can't allocate attention to something that they don't know will be there. 

In this study, tone detection was a secondary task. People were asked to focus mostly on the visual task, but they also were asked to report whether or not a tone occurred. In other words, people were actively trying to detect the tone and they knew it would occur. That's not inattentional deafness. It's just a reduction in detection for an attended stimulus when a primary task is more demanding. And, as I noted above, it's not really a demonstration of deafness either given participants were really good at detecting the tone in both conditions (they were just slightly worse when performing a harder visual task). 

Note that the same lab previously published a paper that actually did show an effect of visual load on inattentional deafness.


Conclusion
There's nothing fundamentally wrong with this paper, at least that I can see (I'm not an expert on neuroimaging, though). The link between the behavioral results and brain imaging results is potentially interesting. I would have preferred a larger sample size and ideally measuring the link between brain and behavior in the same participants performing tasks with the same demands, but those issues aren't show stoppers. I can see why it is of interest to specialists (like me). That said, I'm not sure that it makes a contribution of broad interest to the public, and the novelty and importance of the behavioral result have been overplayed.

Monday, November 30, 2015

HI-BAR: A gold standard brain training study?



A gold-standard brain training study?
Not without some alchemy



A HI-BAR (Had I Been A Reviewer) of: 
Corbett, A., Owen, A., Hampshire, A., Grahn, J., Stenton, R., Dajani, S., Burns, A., Howard, R., Williams, N., Williams, G., & Ballard, C. (2015). The effect of an online cognitive training package in healthy older adults: An online randomized controlled trial. JAMDA, 16(11), 990-997.


Edit 12-3-15: The planned sample was ~1 order of magnitude larger than the actual one, not 2. (HT Matthew Hutson in the comments)


A recent large-scale brain training study, published in the Journal of the American Medical Directors Association (JAMDA), has garnered a lot of attention. A press release was picked up by major media outlets, and a blog post by Tom Stafford on the popular Mind Hacks blog called it “a gold-standard study on brain training” and noted that “this kind of research is what ‘brain training’ needs.”*

Tom applied the label “gold standard” because of the study’s design: It was a large, randomized, controlled trial with an active control group and blinding to condition assignment. From the gold-standard moniker, though, people might infer that the research methods and results provide solid evidence for brain training benefits. They do not.

Tom's post identified several limitations of the study, such as differential attrition across conditions and the use of a self-report primary outcome measure. Below I discuss why these and other analysis and reporting problems undermine the claims of brain training benefits. 


Problems that undermine interpretability of the study

Differential Attrition 
The analysis was based on the 6-month testing point, but the study was missing data from about 70% of the participants due to attrition. To address this problem, the authors carried forward data from the final completed testing session for each participant and treated it as if it were from the 6-month point. Critically, the control group had substantially greater attrition than the intervention groups—more of their scores were carried forward from earlier points in the intervention.

For the control group, only 27% of the data for the primary outcome and 17% of the data for the secondary outcomes came from participants who actually completed their testing at 6 months. For the Reasoning group, those numbers were 42% and 40%. For the General Cognition group, they were 40% and 30%.

The extent of the differential attrition and rates of carrying forward results from earlier sessions were only discoverable by inspecting the Consort diagram. This analysis choice and its implications were not fully discussed, and the paper did not report analyses of participants with comparable durations of training. This analysis approach introduces a major confound that could entirely account for any differential benefits.
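To make the analysis choice concrete, here is a minimal sketch of last-observation-carried-forward imputation with a hypothetical long-format dataset (the column names and scores are invented, not taken from the study):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: three participants tested at 0, 3, and 6 months.
df = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "month":       [0, 3, 6, 0, 3, 6, 0, 3, 6],
    "score":       [10, 12, 13, 9, 11, np.nan, 8, np.nan, np.nan],
})

# Carry each participant's last completed score forward into missing sessions.
df["score_locf"] = df.sort_values("month").groupby("participant")["score"].ffill()

# The "6-month" analysis then mixes genuine 6-month scores with carried-forward
# 0- or 3-month scores; if one arm drops out more, that mixture differs by group.
print(df[df["month"] == 6])
```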

Unclear sample sizes and means
Tables 3 and 4 list different control group means next to each training condition. There was only one control group, so it is unclear why the critical baseline means differed for the two training interventions. Without knowing why these means differed (they shouldn't have), the differential improvements in the training groups are uninterpretable.

The Ns listed in the tables also are inconsistent with the information provided in the Consort diagram. In a few cases, the Tables list a larger N than the consort diagram, meaning that there were more subjects in the analysis than in the study.

I emailed the corresponding author (on Nov. 10 and Nov. 23) to ask about each of these issues, but I received no response. I also emailed the second author. His assistant noted that the corresponding author's team "was responsible for that part of the study" and said the second author "can be of no help with this." I’m hoping this post will prompt an explanation for the values in the tables.


For me, those reporting and analysis issues are show stoppers, but the paper has other issues.

Other issues

Limitations of the pre-registration
The study was pre-registered, meaning that the recruiting, testing methods, and analysis plans were specified in advance. Such pre-registrations are required for clinical trials, but they have been relatively uncommon in the brain training literature. Having a pre-registered plan is ideal because it eliminates much of the flexibility that otherwise can undermine the interpretability of findings. The use of pre-registration is laudable. But, the registration was underspecified and the reported study deviated from it in important ways. 

For example, the protocol called for 75,000 - 100,000 participants, but the reported study recruited fewer than 7000. That’s still a great sample, but it’s an order of magnitude smaller than the planned sample. Are there more studies resulting from this larger sample that just aren’t mentioned in the pre-registration?

The study also called for a year of testing, but it had to be cut short at 6 months and more than 2/3 of the participants did not undergo even 6 months of training. The pre-registration did not include analysis scripts, and the data from the study do not appear to have been posted publicly.

The pre-registered hypotheses predicted greater improvements in the reasoning training group than in the general cognition group and predicted that the general cognition group would not outperform the control group. The paper reports no tests of this predicted difference.

Underreporting for the primary measure (IADL)
The primary outcome measure consisted of self-reports of performance in daily activities (known as the Instrumental Activities of Daily Living or IADL). As Tom's post noted, such self-reports are subject to demand characteristics — people expect to do better following training, so they report having done better. The study did not test for different expectations across the training and control groups, so the benefits could be due to such demands or to a differential placebo effect (e.g., the control group might have found the study less worthwhile).

The reported benefits for IADLs were small, and the data provide little evidence for any benefit of training. The study reported statistically significant benefits for both training groups relative to the control group, but statistical significance is not the same as evidence. With samples this large, we should expect a substantially lower p-value than .05 when an effect actually is present in the population. If the Ns and means reported in the table were consistent with the method description, it might be possible to compute a Bayes Factor for these analyses. My bet is that the difference between the training groups and the control group would provide weak evidence at best for a meaningful training benefit (relative to the null).
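To illustrate the kind of analysis I have in mind, here is a rough sketch of the BIC approximation to the Bayes factor for a two-group comparison. All of the group sizes, means, and SDs are invented for illustration; they are not values from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 2000                                        # per group (hypothetical)
training = rng.normal(0.65, 2.0, n)             # hypothetical IADL change scores
control = rng.normal(0.50, 2.0, n)

t, p = stats.ttest_ind(training, control)

y = np.concatenate([training, control])
N = y.size

def gaussian_bic(rss, n_params):
    # BIC up to an additive constant shared by both models
    return N * np.log(rss / N) + n_params * np.log(N)

rss_null = np.sum((y - y.mean()) ** 2)                          # one common mean
rss_alt = (np.sum((training - training.mean()) ** 2)
           + np.sum((control - control.mean()) ** 2))           # separate group means

# BF01 ~ exp((BIC_alt - BIC_null) / 2): evidence for the null over the alternative
bf01 = np.exp((gaussian_bic(rss_alt, 2) - gaussian_bic(rss_null, 1)) / 2)
print(f"p = {p:.3f}, approximate BF01 = {bf01:.1f}")

# Because the BIC penalty grows with log(N), a p-value just under .05 in a sample
# of thousands typically corresponds to a BF01 above 1 -- weak evidence at best
# for the alternative, which is the point made above.
```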


The paper provides no information about baseline scores on the primary outcome measure (IADL). Although the analyses control for baseline scores, training papers must provide the pre-test scores and post-test scores. Without doing so, it is impossible to evaluate whether apparent training benefits resulted in part from baseline differences.

The paper also states that “Data from interim time points also show significant benefit to IADL at 3 months, particularly in the GCT group, although this difference was not significant.” I take this to mean both training groups outperformed the control group at 3 months, but they did not differ significantly from each other. No statistical evidence is provided in support of this claim.


Limited evidence from the secondary measures
Only one of the secondary cognitive outcome measures (a reasoning measure) showed a training benefit. The paper refers to it as “the key secondary measure,” but that designation does not appear in the pre-registration (http://www.isrctn.com/ISRCTN72895114). Moreover, the pre-registration predicts better performance for reasoning training than general cognition training or the control group, but the paper found improvements for both interventions. A few other measures showed significant effects, but given the large sample sizes, the high p-values might well be more consistent with the absence of a training benefit than the presence of one.

Despite providing no statistical evidence of differential benefits for Reasoning training and General Cognition training, the paper claims that “Taken together, these findings indicate that the ReaCT package confers a more generalized cognitive benefit than the GCT at 6 months.” That claim appears to come from finding no effect on a digit secondary task in the Reasoning group and a decline in the General Cognition group. However, a difference in significance is not a significant difference.

Almost all of the measures showed declining performance from the pre-test to the post-test. That is, participants were not getting better. They just declined less than the control participants. It is unclear why we should see such a pattern of declining performance over a short time window with relatively young participants. Although cognitive performance does decline with age, presumably those declines should be minimal over 1-6 months, and they should be swamped by the benefits of taking the test twice. One explanation might be differential attrition -- those subjects who did worse initially were more likely to drop out early. 


* Thanks to Tom Stafford for emailing a copy of the paper. The journal is obscure enough that the University of Illinois library did not have access to it.

Monday, December 15, 2014

Response from Ballesteros et al to my HI-BAR

Update 12-15-14: I used strikethru to correct a couple of the F test notes below. The crossed out ones were fine.

Update 5-26-15: Frontiers has published a correction from Ballesteros et al that acknowledges the overlap among their papers. It doesn't admit the inappropriateness of that overlap. It mostly echoes their response below, but does not address my questions about that response.


In late November, I posted a HI-BAR review of a paper by Ballesteros et al (2014) that appeared in Frontiers. In addition to a number of other issues I discussed there and in an earlier review of another paper, I raised the concern that the paper republished the training data and some outcome measures from an earlier paper in PLoS One without adequately conveying that those data had been published previously. I also noted this concern in a few comments on the Frontiers website. On my original HI-BAR post, I asked the original authors for clarification about these issues. 

I have now received their response as a pdf file. You can read it here.

Below I quote some sections from their reply and comment on them. I encourage you to read the full reply from the original authors as I am addressing only a subset of their comments below. For example, their reply explains the different author lists on the two papers in a satisfactory way, I think. Again, I would be happy to post any responses from the original authors to these comments. As you will see, I don't think this reply addresses the fairly major concerns about duplication of methods and results (among other issues).



Quotes from the authors' reply are indented and italicized, with my responses following each.


As you noted, both papers originated from the same randomized controlled intervention study (clinical trial number: NCT02007616). We never pretended that the articles were to be considered as two different intervention studies. Part of this confusion could have been generated because in the Frontiers article the clinical trial number did not appear on the first page, even though this number was included in the four preliminary versions of the manuscript under revision. We have contacted Frontiers asking them to include the clinical trial number in the final accepted version, if possible. 
Although that would help, it's not sufficient. The problem is that the data analyses are republished. It would be good to note, explicitly, both in the method section and in the results section, that this paper is reporting outcome measures from the same intervention. And, it's essential to note when and where analyses are repeated.
If it is not possible at this stage, we asked them to publish a note with this information and to acknowledge in the results section the overlap as well as in Figure 3b, mentioning that the data in the Figure were published previously in PLoS.
This seems necessary regardless of whether or not the link to the clinical trial number is added. Even if it is made clear that the paper is based on the same intervention, it still is not appropriate to republish data and results without explicitly noting that they were published previously. Actually, it would be better not to republish the data and analyses. Period.
You also indicated in your first comments posted in Frontiers that the way we reported our results could give the impression that they come from two different and independent interventions. To avoid this misunderstanding, as you noticed, we introduced two references in our Frontiers' article. We inserted the first reference to the PLoS article in the Method section and a second reference in the Discussion section. Two references that you considered were not enough to avoid possible confusions in the reader.
As I discussed in my HI-BAR post, these citations were wholly inadequate. One noted only that the results of the oddball task were reported elsewhere, yet the same results were reported again in Frontiers, and the results section included no mention of the duplication. The other citation, appearing in the discussion, implied that the earlier paper provided additional evidence for a conclusion about the brain. Nowhere did the Frontiers paper cite the PLoS paper for the details of the intervention, the nature of the outcome measures, etc. It just published those findings as if they were new. The text itself should have noted, both in the method and results sections, whenever procedures or results were published elsewhere. Again, it would have been better not to republish them at all.
In relation to the results section in which we describe the cross-modal oddball attention task results in the Frontiers article, we acknowledge that, perhaps, it would have been better to avoid a detailed presentation of the attentional results that were already published and indicate just that attentional data resulting from the task developed in collaboration with the UBI group were already published. We could have asked the readers to find out for themselves what happened with attention after training. We were considering this possibility but in the end we decided to include the results of the oddball task to facilitate ease of reading.
Acknowledging the repetition explicitly in the text would have helped avoid the interpretation that these were new results. Repeating the statistics from the earlier paper isn't an "ease of reading" issue -- it's duplicate publication. You could easily summarize the findings of the earlier paper, with clear citation, and note that the details were described in that paper. I don't see any justification for republishing the actual statistics and findings. 
Regarding the last paragraph of the oddball results section, we said “New analyses were conducted....” As we said above, we tried (perhaps in an unfortunate way) to give in this paper a brief summary of the results obtained in the attention task, so this paragraph refers to the epigraphs “Distraction” and “Alertness” of results section in PLoS publication, where differential variables were calculated and new analyses were performed to obtain measures of gains and losses in the ability to ignore infrequent sounds and to take advantage of the frequent ones. Once again, we apologize if this sentence has led to confusion, and we are in contact with the Journal concerning this.
Yes, that sentence added to the impression these analyses were new to the Frontiers paper. But, the statistical analyses should not have been duplicated in the first place. That's also true for the extensive repetition of all of the training data.
Another comment in your exhaustive review referred to the differences in the samples shown in the Consort diagram between the two publications. This has a simple explanation. The diagram of the PLoS article refers only to the cross-modal oddball task designed to assess alertness and distraction while the Frontiers' Consort diagram refers to the whole intervention study. In the control group, one of the participants was removed due to the large number of errors in the oddball task, but he was included in the rest of the study in which he reached inclusion criteria. The same occurred in the trained group. As attentional measures were analyzed separately by the UBI group, by the time we sent the results only fifteen participants completed the post-evaluation (we could not contact a participant and the other was travelling that week). A few days later, these two participants completed the post-evaluation, but we decided not to include them in the PLoS sample as the data were already analyzed by the UBI group.
Thank you for these clarifications. I'm curious why you decided not to wait a few days for those remaining participants if that was part of your intervention design. If their results came in a few days later, why not re-do the analysis in the original paper to include all participants who were in the intervention? Presumably, by publishing the PLoS paper when you did, you deemed the intervention to be complete (i.e., it wasn't flagged as a preliminary result). It seems odd to then add these participants to the next paper. This difference between papers raises two questions. First, would the results for the PLoS paper have been different with those two participants? Second, would the results of the Frontiers paper have differed if they had been excluded? And, if those data were in by the time the Frontiers paper was written, why were these participants not included in the oddball analysis? At least that would have added new information to those duplicated analyses.

This clarified information should have been included in the Frontiers paper to make it clearer that both papers reported the same intervention with the same participants.
We would like to explain the clinical trial registration process. As you pointed out in your comments to Mayas et al. (2014), we registered the clinical trial after the attention manuscript was submitted to PLoS. The journal (PlosOne) specifically required the registration of the intervention study as a clinical trial during the revision process in order to publish the manuscript, and told us about the possibility of post-registration. The Editor of PLoS sent us the link to register the study.
Post-registering a study completely undermines the purpose of registration. I find it disturbing that PLoS would permit that as an option for a clinical trial. Looking at the PLoS guidelines, they do make an exception for post-study registration provided that the reasons for "failing to register before participant recruitment" are spelled out clearly in the article (emphasis from original on PLoS website): 
PLOS ONE supports prospective trial registration (i.e. before participant recruitment has begun) as recommended by the ICMJE's clinical trial registration policy. Where trials were not publicly registered before participant recruitment began, authors must:
  • Register all related clinical trials and confirm they have done so in the Methods section
  • Explain in the Methods the reason for failing to register before participant recruitment
It's also clear that their policy is for clinical trials to be pre-registered, not post-registered. And, the exception for post-registration requires a justification. Neither the PLoS paper nor the Frontiers paper provided any such justification. The Frontiers paper didn't mention registration at all, and the PLoS one didn't make clear that the registration occurred after submission of the finished study. 

The idea of registration is to specify, in advance, the procedures that will be followed in order to avoid introducing investigator degrees of freedom. Registering after the fact does nothing to address that problem. It's just a re-description of an already completed study. It's troubling that a PLoS editor would instruct the authors to post-register a study.
We would like to clarify some questions related to the data analysis. First, multiple tests in all ANOVAs were Bonferroni corrected although it is not made explicit in the text. 
As best I can tell, the most critical multiple testing issues, the ones that could have been partly remedied by pre-registration, were not corrected in this paper. 

There are at least four distinct multiple testing issues:

  1.  There are a large number of outcome measures, and as best I can tell, none of the statistical tests were corrected for the number of tests conducted across tasks. 
  2. There are multiple possible tests for a number of the tasks (e.g., wellbeing has a number of sub-scales), and there don't appear to have been corrections for the number of distinct ways in which an outcome could be measured. 
  3. A multi-factor ANOVA itself constitutes multiple tests (e.g., a 2x2 ANOVA involves 3 tests: each main effect and the interaction). Few studies correct for that multiple testing problem, and this paper does not appear to have done so. 
  4. There is a multiple tests issue with pairwise comparisons conducted to follow-up a significant ANOVA. I assume those are the Bonferroni-corrected tests that the authors referred to above. However, it's impossible to tell if these tests were corrected because the paper did not report the test statistics — it just reported p < xx or p= xx. Only the F tests for the main effects and interactions were reported.
If the authors did correct for the first three types of multiple testing issues, perhaps they can clarify. However, based on the ANOVA results, it does not appear that they did.
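As a concrete illustration of the kind of family-wise correction at issue, here is a hedged sketch using statsmodels. The p-values are placeholders standing in for a family of outcome-measure tests, not values from the paper.

```python
from statsmodels.stats.multitest import multipletests

# Placeholder p-values standing in for a family of outcome-measure tests
pvals = [0.004, 0.012, 0.03, 0.045, 0.20, 0.41, 0.66]

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
for p, padj, r in zip(pvals, p_adj, reject):
    print(f"raw p = {p:.3f}  Bonferroni-adjusted p = {padj:.3f}  significant: {r}")
```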

A related issue, one mentioned in my HI-BAR on the PLoS paper but not on the Frontiers paper, is that some of the reported significance levels for the F tests are incorrect. Here are some examples from the Frontiers paper in which the reported significance level (p<xx or p=xx) is wrong. For each, I give the correct p value in red (uncorrected for multiple tests). None of these calculations led to less statistical significance:

  • [F(1, 28) = 3.24, MSE = 1812.22, p = 0.057, η2p = 0.12]. p=.0826
  • [F(2, 50) = 5.52, MSE = 0.001, p < 0.005, η2p = 0.18]. p=.0068
  • [F(1, 28) = 4.35, MSE = 0.09, p < 0.001, η2p = 0.89]. p=.0462
  • [F(1, 28) = 17.98, MSE = 176.74, p < 0.001, η2p = 0.39]. p=.0002
  • [F(1, 28) = 13.02, MSE = 61.49, p < 0.01, η2p = 0.32]. p=.0012
  • [F(1, 28) = 3.42, MSE = 6.47, p = 0.07, η2 = 0.11]. p=.0750
The following two results were reported as statistically significant at p<.05, but actually were not significant with that alpha level:
  • [F(1, 28) = 3.98, MSE = 0.06, p < 0.01, η2p = 0.12]. p=.0559
  • [F(1, 28) = 3.40, MSE = 0.15, p = 0.04, η2p = 0.10]. p=.0758
I don't know that any of these errors dramatically changes the interpretation of the results, but they should be corrected.
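For anyone who wants to check these recalculations, the reported F statistics and degrees of freedom are all that is needed. A quick sketch with scipy (the tuples come from the reported tests listed above):

```python
from scipy import stats

# (F, numerator df, denominator df) taken from the reported tests above
reported = [
    (3.24, 1, 28),    # reported p = 0.057
    (5.52, 2, 50),    # reported p < 0.005
    (4.35, 1, 28),    # reported p < 0.001
    (3.98, 1, 28),    # reported p < 0.01
    (3.40, 1, 28),    # reported p = 0.04
]
for F, dfn, dfd in reported:
    p = stats.f.sf(F, dfn, dfd)   # upper-tail probability of the F distribution
    print(f"F({dfn}, {dfd}) = {F:.2f} -> p = {p:.4f}")
```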
Second, the RTs filtering was not an arbitrary decision. The lower limit (200 ms) reflects the approximate minimum amount of time necessary to respond to a given stimulus (due to speed limitations of the neurological system). RTs below this limit may reflect responses initiated before the stimulus onset (guessing). The selection of the upper limit (1100 ms) is a more difficult decision. Different criteria have been proposed (statistical, theoretical...) in the literature (see Whelan, 2008). Importantly none of them seem to affect type I errors significantly. In this case, we made the following theoretical assumption: RTs longer than 1100 ms might depend on other cognitive processes than speed of processing.
There is nothing wrong with this choice of cutoffs, and it might well have been principled. Still, it is arbitrary. Any number of cutoffs would have been just as appropriate (e.g., 150ms and 1200ms, 175ms and 3000ms, ± 3SD, etc). My point wasn't to question the choice, but instead to note that it introduces flexibility to the analysis. This is a case in which pre-registering the analysis plan would help -- the choice of cutoffs is reasonable, but flexible. Registration eliminates that flexibility, making it clear to readers that the choice of cutoffs was principled rather than based on knowing the outcome with different cutoffs. Another approach would be to assess the robustness of the results to various choices of cutoff (reporting all results).
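Here is a minimal sketch of the robustness check suggested above, run over simulated RTs (the values and cutoffs are illustrative, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(7)
rts = rng.lognormal(mean=6.3, sigma=0.25, size=500)   # right-skewed RTs in ms

cutoffs = [(200, 1100), (150, 1200), (175, 3000)]
for lo, hi in cutoffs:
    kept = rts[(rts >= lo) & (rts <= hi)]
    print(f"{lo}-{hi} ms: kept {kept.size} trials, mean RT = {kept.mean():.1f} ms")

# A +/-3 SD rule, another common choice
m, s = rts.mean(), rts.std()
kept = rts[np.abs(rts - m) <= 3 * s]
print(f"+/-3 SD: kept {kept.size} trials, mean RT = {kept.mean():.1f} ms")
```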

Thursday, November 20, 2014

HI-BAR: More benefits of Lumosity training for older adults?

HI-BAR (Had I Been A Reviewer)

For more information about HI-BAR reviews, see my post introducing the acronym.


Ballesteros S, Prieto A, Mayas J, Toril P, Pita C, Ponce de León L, Reales JM and Waterworth J (2014) Brain training with non-action video games enhances aspects of cognition in older adults: a randomized controlled trial. Front. Aging Neurosci. 6:277. doi: 10.3389/fnagi.2014.00277


Update: I posted a brief summary of this HI-BAR as a comment on the paper at Frontiers, and Dr. Ballesteros responded to indicate that this was the same training study. Her response quoted the one-sentence citation described below and noted that they did not intend to give the impression that this was a new intervention. Her reply did not address the overlap in presentation of methods and results.

Update 2: Dr. Ballesteros sent me a response to my questions [pdf]. I have just posted a new blog post in which I quote from her response and add new comments.  

This paper testing the benefits of Lumosity training for older adults just appeared in Frontiers. The paper is co-authored by two of the same people (Ballesteros and Mayas) as a recent PLoS One paper on Lumosity training that I critiqued in a HI-BAR in April (note that the new paper includes six additional authors who were not on the PLoS paper, and the PLoS paper includes two who weren't on the Frontiers paper). 

I was hopeful that this new paper would address some of the shortcomings of the earlier one. Unfortunately, it didn't. In fact, it couldn't have, because the "new" Frontiers paper is based on the same intervention as the PLoS paper.

And, when I say, "the same intervention," I don't mean that they directly replicated the procedures from their earlier paper. The Frontiers paper reports data from the same intervention with the same participants. It would be hard to know that from reading just the Frontiers paper, though, because it does not mention that these are additional measures from the same intervention. 

As my colleagues and I have noted (ironically, in Frontiers), this sort of partial reporting of outcome measures gives the misleading impression that there are more interventions demonstrating transfer of training than actually exist. If a subsequent meta-analysis treats these reports as independent, it will produce a distorted estimate of the size of any intervention benefits. Moreover, whenever a paper does not report all outcome measures, there's no way to appropriately correct for multiple comparisons. At a minimum, all publications derived from the same intervention study should state explicitly that it is the same intervention and identify all outcome measures that will, at some point, be reported. 

The Frontiers paper cites the PLoS paper only twice. The first citation appears in the method section in reference to the oddball outcome task that was the focus of the PLoS paper:
Results from this task have been reported separately (see Mayas et al., 2014).
That is the only indication in the paper that there might be overlap of any kind between the study presented in Frontiers and the one presented in PLoS. It's a fairly minimalist way to indicate that this was actually a report of different outcome measures from the same study. 

It's also not entirely true: The results section of the Frontiers paper describes the results for that oddball task in detail as well, and all of the statistics for that task reported in the Frontiers paper were also reported in the PLoS paper. The results section does not acknowledge the overlap. Figure 3b in the Frontiers paper is nearly identical to Figure 2b in the PLoS paper, reporting the same statistics from the oddball task with no mention that the data in the figure were published previously.

After repeating the main results of the oddball task from the earlier paper, the Frontiers paper states:
"New analyses were conducted on distraction (novel vs. standard sound) and alertness (silence vs. standard sound), showing that the ability to ignore relevant sounds (distraction) improved significantly in the experimental group after training (12 ms) but not in the control group. The analyses of alertness showed that the experimental group increased 26 ms in alertness (p < 0.05) but control group did not (p = 0.54)."
I thought "new analysis" might mean that these analyses were new to this paper, but they weren't. Here's the relevant section of the PLoS paper (which reported the actual statistical tests):
"In other words, the ability to ignore irrelevant sounds improved significantly in the experimental group after video game training (12 ms approximately) but not in the control group....Pre- and post-training comparisons showed a 26ms increase of alertness in the experimental group..."
Stating that "results from this task have been reported separately" implies that any analyses of that task reported in Frontiers are new, but they're not.

The Frontiers paper also reports the same training protocol and training performance data without reference to the earlier study. The Frontiers paper put the training data in a figure whereas the PLoS paper put them in a table. Why not just cite the earlier paper for the training data rather than republishing them without citation?

The same CONSORT diagram appears as Figure 1 in both papers. The Frontiers paper CONSORT diagram does report that the experimental condition had data from 17 participants rather than the 15 participants reported in the PLoS paper. From the otherwise identical figures, it appears that those two additional participants were the ones that the PLoS paper noted were lost to followup due to "personal travel" and "no response from participant." I guess both responded after the PLoS paper was completed, although the method sections of the two papers noted that testing took place over the same time period (January through July of 2013). The presence of data from these two additional participants did not change any of the statistics for the oddball task that were reported in both papers. 

One other oddity about the subject samples in the consort diagram: The PLoS paper excluded one participant from the control condition analyses due to "Elevated number of errors," leaving 12 participants in that analysis. The Frontiers paper did not exclude that participant. If they made too many errors for the PLoS paper, why were they included in the Frontiers paper?

Other than the one brief note in the method section about the oddball task, the only other reference to the PLoS paper appeared in the discussion, and it gave the impression that the PLoS paper provided independent evidence for the contribution of frontal brain regions to alertness and attention filtering:
We also found that the trainees reduced distractibility by improving alertness and attention filtering, functions that decline with age and depend largely on frontal regions (Mayas et al., 2014).
That citation and the one in the method section are the only mentions of the earlier paper in the entire article. 


Given that I was already familiar with the PLoS paper, had I been a reviewer of this paper, here's what I would have said:


  1. The paper suffers from all of the same criticisms I raised about the intervention in the PLoS paper (e.g., inadequate control group among many other issues), and in my view, those shortcomings should preclude publication of this paper. See http://blog.dansimons.com/2014/04/hi-bar-benefits-of-lumosity-training.html
  2. The paper should have explained that this was the same intervention study, with the same participants, as the earlier PLoS One paper.
  3. The paper should have noted explicitly which data and results were reported previously and which were not.
It's possible that the authors notified the editor of the overlap between these papers upon submission. If so, it is surprising that the paper was even sent out for review before these issues were addressed. If the editor was not informed of the overlap, I cannot fault the editor and reviewers for missing the fact that the intervention, training results, and some of the outcome results were reported previously. Unless they happened to know about the PLoS One paper, they could be excused for missing the one reference to that paper in the method section. Still, the many other major problems with this intervention study probably should have precluded publication, at least without major corrections and qualifications of the claims.

Wednesday, April 2, 2014

HI-BAR: Benefits of Lumosity training for older adults?

HI-BAR (Had I Been A Reviewer)

A post-publication review of Mayas J., Parmentier, F. B. R., Andrés P., & Ballesteros, S. (2014) Plasticity of Attentional Functions in Older Adults after Non-Action Video Game Training: A Randomized Controlled Trial. PLoS ONE 9(3): e92269. doi:10.1371/journal.pone.0092269

For more information about HI-BAR reviews, see my post introducing the acronym.



This paper explored "whether the ability to filter out irrelevant information...can improve in older adults after practicing non-violent video games." In this case, the experimental group played 10 games that are part of the Lumosity program for a total of 20 hours. The control group did not receive any training. Based on post-training improvements on an "oddball" task (a fairly standard attention task, not a measure of quirkiness), the authors claim that training improved the ability to ignore distractions and increased alertness in older adults. 

Testing whether commercial brain training packages have any efficacy for cognitive enhancement is a worthwhile goal, especially given the dearth of robust, reliable evidence that such training has any measurable impact on cognitive performance on anything other than the trained tasks. I expect that Lumosity will tout this paper as evidence for the viability of their brain training games as a tool to improve cognition. They probably shouldn't.


Below are the questions and concerns I would have raised had I been a reviewer of this manuscript. If you read my earlier HI-BAR review of Anguera et al (2013), you'll recognize many of the same concerns. Unfortunately, the problems with this paper are worse. A few of these questions could be answered with more information about the study (I hope the authors will provide that information). Unfortunately, many of the shortcomings are more fundamental and undermine the conclusion that training transferred to their outcome measure.


I've separated the comments into two categories: Method/Analysis/Inferential issues and Reporting issues.


Method/Analysis/Inferential Issues

Sample size
The initial sample consisted of 20 older adults in the training group and 20 in the control group.  After attrition, the analyzed sample consisted of only 15 people in the training group and 12 in the control group. That's a really small sample, especially when testing older adults who can vary substantially in their performance.

Inadequate control group
The control group for this paper is inadequate to make any causal claims about the efficacy of the training. The experimental group engaged in 20 hours of practice with Lumosity games. The control group "attended meetings with the other members of the study several times along the course of the study;" they got to hang out, chat, and have coffee with the experimenters a few times. This sort of control condition is little better than a no-contact control group (not even the amount of contact was equated). Boot et al (2013) explained how inadequate control conditions like this "limited contact" one provide an inadequate baseline against which to evaluate the effectiveness of an intervention. Here's the critical figure from our paper showing the conclusions that logically follow from interventions with different types of control conditions:




When inferring the causal potency of any treatment, it must be compared to an appropriate baseline. That is why drug studies use a placebo control, ideally one that equates for side effects too, so that participants do not know whether they have received the drug or a placebo. For a control condition to be adequate, it should include all of the elements of the experimental group excepting the critical ingredient of the treatment (including equating for expectation effects). Otherwise, any differences in outcome could be due to other differences between the groups. That subtractive method, first described by Donders more than 150 years ago, is the basis of clinical trial logic. Unfortunately, it commonly is neglected in psychological interventions.

In video game training, even an active control group in which participants play a different game might not control for differential placebo effects on outcome measures. But, the lack of an active control group allows almost no definitive conclusions: It does not equate the experience between the training group and the control group in any substantive way. This limited-contact control group accounts for test-retest effects and the passage of time, and little else. 

Any advantage observed for the training group could result from many factors that are not specific to the games involved in the training condition or even to games at all: Any benefits could have resulted from doing something intensive for 20 hours, from greater motivation to perform well on the outcome measures, from greater commitment to the tasks, from differential placebo effects, from greater social contact, etc. Differences between the training group and this limited-contact control group do not justify any causal conclusion about the nature of the training.

Interventions and training studies that lack any control condition other than a no-contact or limited-contact control group should not be published. Period. They are minimally informative at best, and misleading at worst given that they will be touted as evidence for the benefits of training. The literature on game training is more than 10 years old, and there is no justification for publishing studies that claim a causal benefit of training if they lack an adequate baseline condition.

Multiple tests without correction
The initial analysis consisted of two separate 2x2x3 ANOVAs on accuracy and response times on the oddball task, accompanied by follow-up tests.  A 3-factor ANOVA consists of 7 separate F-tests (3 main effects, 3 two-way interactions, and 1 three-way interaction). Even if the null hypothesis were true and all of the data were drawn from a single population, we would expect a significant result on at least one of these tests more than 30% of the time on average (1 - .95^7). In other words, each of the ANOVAs has a 30% false positive rate. For a thorough discussion of this issue, see this excellent blog post from Dorothy Bishop.
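A small simulation makes the arithmetic concrete; it assumes independent tests, as the 1 - .95^7 approximation above does:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_tests, n_sims = 0.05, 7, 100_000

# Under the null hypothesis, p-values are uniformly distributed on [0, 1].
p = rng.uniform(size=(n_sims, n_tests))
familywise_rate = np.mean((p < alpha).any(axis=1))

print(f"Simulated familywise error rate: {familywise_rate:.3f}")
print(f"Analytic approximation 1 - 0.95**7: {1 - 0.95**7:.3f}")
print(f"Bonferroni per-test threshold: {alpha / n_tests:.4f}")
```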

The critical response time analysis, the only one to show a differential training effect, produced two significant tests out of the 7 conducted in the ANOVA: a main effect of condition (as for accuracy, but not with the predicted pattern) and a significant 3-way interaction. The results section does not report any correction for multiple tests, though, and the critical 3-way interaction would not have survived correction (it was p=.017 uncorrected).

Possible speed-accuracy tradeoff
The accuracy ANOVA showed a marginally significant 3-way interaction (p=.077 without correction for multiple tests), but the paper does not report the means or standard deviations for the accuracy results. Is it possible that the effects on accuracy and RT were in opposite directions? If so, the entire difference between training groups could just be a speed-accuracy tradeoff, with no actual performance difference between conditions.


Flexible and arbitrary analytical decisions
For response times, the analysis included only correct trials and excluded times faster than 200ms and slower than 1100ms. Standard trials after novel trials were discarded as well. These choices seem reasonable, but arbitrary. Would the results hold with different cutoffs and criteria? Were any other cutoffs tried? If so, that introduces additional flexibility (investigator degrees of freedom) that could spuriously inflate the significance of tests. That's one reason why pre-registration of analysis plans is essential. It's all too easy to rationalize any particular approach after the fact if it happens to work.


Unclear outcome measures
The three way interaction for response time was followed up with more tests that separated the oddball task into an alertness measure and a distraction measure analyzed separately for the two groups. It's not clear how these measures were derived from the oddball conditions, but I assume they were based on different combinations of the silent, standard, and novel noise conditions. It would be nice to know what these contrasts were as they provide the only focused tests of differential transfer task improvements between the experimental group and the control group. 
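My best guess at the derivation, based on the contrasts the later Frontiers paper describes (distraction: novel vs. standard sound; alertness: silence vs. standard sound), is sketched below. The condition labels, RT values, and sign conventions are my assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical mean RTs (ms) per participant in each sound condition
rts = {
    "standard": np.array([505.0, 498.0, 512.0]),
    "novel":    np.array([545.0, 540.0, 555.0]),
    "silence":  np.array([525.0, 520.0, 530.0]),
}

# Distraction cost: slowing on novel-sound trials relative to standard-sound trials
distraction = rts["novel"] - rts["standard"]
# Alertness benefit: speeding on standard-sound trials relative to silent trials
alertness = rts["silence"] - rts["standard"]

print("distraction (ms):", distraction.mean())
print("alertness (ms):", alertness.mean())
```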


A difference in significance is not a significant difference
The primary conclusions about these follow-up outcome measures are based on significant improvements for the training group (reported as p=.05 and p=.04) and the absence of a significant improvement for the control group. Yet, significance in one condition and not in another does not mean that those two conditions differed significantly. No test of the difference in improvements across conditions was provided.
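The missing test is straightforward: compare the improvements between groups directly (a gain-score t-test, which is equivalent to the group-by-session interaction in a 2 x 2 mixed ANOVA). A sketch with invented numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical RTs (ms): 15 trained and 12 control participants, pre and post
pre_train = rng.normal(600, 60, 15)
post_train = pre_train - rng.normal(15, 25, 15)   # trained group improves a bit more
pre_ctrl = rng.normal(600, 60, 12)
post_ctrl = pre_ctrl - rng.normal(8, 25, 12)

# Within-group tests (what the paper reports) can be significant in one group
# and not the other without the groups differing from each other.
print(stats.ttest_rel(pre_train, post_train))
print(stats.ttest_rel(pre_ctrl, post_ctrl))

# The test that licenses a claim of differential improvement:
gain_train = pre_train - post_train
gain_ctrl = pre_ctrl - post_ctrl
print(stats.ttest_ind(gain_train, gain_ctrl))
```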

Inappropriate truncation/rounding of p-values
The critical "significant" effect for the training group actually wasn't significant! The authors reported "[F(1,25) = 4.00, MSE = 474.28, p = .05, d = 0.43]." Whenever I see a p value reported as exactly .05, I get suspicious. So, I checked. F(1,25) = 4.00 gives p = .0565. Not significant. The authors apparently truncated the p-value. 

(The reported p=.04 is actually p=.0451. That finding was rounded, but rounding down is not appropriate either.)

Of the two crucial tests in the paper, one wasn't actually significant and the other was just barely under p=.05. Not strong evidence for improvement, especially given the absence of correction for multiple tests (with which, neither would be significant). 

Inappropriate conclusions from correlation analyses
The paper explored the correlation between the alertness and distraction improvements (the two outcome measures) and each of the 10 games that made up the Lumosity training condition. The motivation is to test whether the amount of improvement an individual showed during training correlated with the amount of transfer they showed on the outcome measures. Of course, with N=15, no correlation is stable and any significant correlation is likely to be substantially inflated. The paper included no correction for the 20 tests they conducted, and neither of the two significant correlations (p=.02, p<.01) would survive correction. Moreover, even if these correlations were robust, correlations between training improvement and outcome measure improvement logically provide no evidence for the causal effect of training on the transfer task (See Tidwell et al, 2013).
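To see just how unstable, consider the 95% confidence interval around a correlation of .60 (a hypothetical value) with N = 15, via the Fisher z transformation:

```python
import numpy as np

r, n = 0.60, 15                      # hypothetical correlation and the study's N
z = np.arctanh(r)                    # Fisher z transform
se = 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r:.2f}, n = {n}: 95% CI [{lo:.2f}, {hi:.2f}]")   # roughly [0.13, 0.85]
```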

Inaccurate conclusions
The authors write: 
"The results of the present study suggest that training older adults with non-action video games reduces distractibility by improving attention filtering (a function declining with age and largely depending on frontal regions) but also improves alertness." 
First, the study provides no evidence at all that any differences resulted from improvements in an attention filtering mechanism. It provided no tests of that theoretical idea. 

Furthermore, the study as reported does not show that training differentially reduced distractibility or increased alertness. The improved alertness effect was not statistically significant when the p-value isn't truncated to .05. The effect on the distraction measure was 12ms (p=.0451 without correction for multiple tests). Neither effect would be statistically significant with correction for multiple tests. But, even if they were significant, without a test of the difference between the training effect and the control group effect, we don't know if there was any significant difference in improvements between the two groups; significance in one condition but not another does not imply a significant difference between conditions.



Reporting Issues

Pre-Registration or Post-Registration
The authors registered their study on ClinicalTrials.gov on December 4, 2013; it's listed and linked under the "trial registration" heading in the published article. That's great.  Pre-registration is the best way to eliminate p-hacking and other investigator degrees of freedom. 

But, this wasn't a pre-registration: The manuscript was submitted to PLoS three months before it was registered! What is the purpose of registering an already completed study? Should it even count as a registration?

Unreported outcome measures 
The only outcome measure mentioned in the published article is the "oddball task," but the registration on ClinicalTrials.gov identifies the following additional measures (none with any detail): neuropsychological testing, Wisconsin task, speed of processing, and spatial working memory. Presumably, these measures were collected as part of the study? After all, the protocol was registered after the paper was submitted. Why were they left out of the paper?

Perhaps the registration was an acknowledgment of the other measures and they are planning to report each outcome measure in a separate journal submission. Dividing a large study across multiple papers can be an acceptable practice, provided that all measures are identified in each paper and readers are informed about all of the relevant outcomes (the papers must cross-reference each other). 


Sometimes a large-scale study is preceded by a peer-reviewed "design" paper that lays out the entire protocol in detail in advance. This registration lacks the necessary detail to serve as a roadmap for a series of studies. Moreover, separating measures across papers without mentioning that there were other outcome measures or that some of the measures were (or will be) reported elsewhere is misleading. It gives the false impression that these outcomes came from different, independent experiments. A meta-analysis would treat them as independent evidence for training benefits when they aren't independent.


Unless the results from all measures are published, readers have no way to interpret the significance tests for any one measure. Readers need to know the total number of hypothesis tests to determine the false positive rate. Without that information, the significance tests are largely uninterpretable.


Here's a more troubling possibility: Perhaps the results for these other measures weren't significant, so the authors chose not to report them (or the reviewers/editor told them not to). If true, this underreporting—providing only the outcome measure that showed an effect—constitutes p-hacking, increasing the chances that any significant results in the article were false positives. 

Without more information, readers have no way to know which is the case. And, without that information, it is not possible to evaluate the evidence. This problem of incomplete reporting of outcome measures (and neglecting to mention that separate papers came from the same study) has occurred in the game training literature before—see "The Importance of Independent Replication" section in this 2012 paper for some details. 

Conflicting documentation of outcome measures
The supplemental materials at PLoS include a protocol that lists only the oddball task and neuropsychological testing. The supplemental materials also include an author-completed Consort Checklist that identifies where all the measures are described in the manuscript. The checklist includes the following items for "outcomes": 
"6a. Completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed."  
"6b. Any changes to trial outcomes after the trial commenced, with reasons." 
For 6a, the authors answered "Methods." Yet, the primary and secondary outcomes noted in the ClinicalTrials.gov registration are not fully reported in the methods of the actual article or in the protocol page. For 6b, they responded "N/A," implying that the design was carried out as described in the paper.

These forms and responses are inconsistent with the protocol description at clinicaltrials.gov. Either the paper and supplementary materials neglected to mention the other outcome measures or the clinicaltrials.gov registration lists outcome measures that weren't actually collected. Given that the ClinicalTrials registration was completed after the paper was submitted, that implies that other outcome measures were collected as part of the study but not reported. If so, the PLoS supplemental materials are inaccurate.

Final post-test or interim stage post-test? 
The clinicaltrials.gov registration states that outcome measures will be tested before training, after 12-weeks, and again after 24 weeks. The paper reports only the pre-training and 12-week outcomes and does not mention the 24-week test. Was it conducted? Is this paper an interim report? If so, that should be mentioned in the article. Had the results not been significant at 12 weeks, would they have been submitted for publication? Probably not. And, if not, that could be construed as selective reporting, again biasing the reported p-values in this paper in favor of significance. 


Limitations of the limitations section
The paper ends with a limitations section, but the only identified limitations are the small sample size, the lack of a test of real-world outcome measures, the use of only the games in Lumosity, and the lack of evidence for maintenance of the training benefits (presumably foreshadowing a future paper based on the 24-week outcome testing mentioned in the clinicaltrials.gov registration). No mention is made of the inadequacy of the control group for causal claims about the benefits of game training, the fragility of the results to correction for multiple testing, the flexibility of analysis, the possible presence of unreported outcome measures, or any of the other issues I noted above.


Summary

Brain training is now a major industry, and companies capitalize (literally) on results that seem to support their claims. Training and intervention studies are critical if we want to evaluate the effectiveness of psychological interventions. But, intervention studies must include an adequate active control group, one that is matched for expected improvements independently for each outcome measure (to control for differential placebo effects). Without such a control condition, causal claims that a treatment has benefits are inappropriate because it is impossible to distinguish effects of the training task from other differences between the training and control group that could also lead to differential improvement. Far too many published papers make causal claims with inadequate designs, incomplete reporting of outcome measures, and overly flexible analyses. 

In this case, the inadequacy of the limited-contact control condition (without acknowledging these limitations) alone would be sufficient grounds for an editor to reject this paper. Reviewers and editors need to step up and begin requiring adequate designs whenever authors make causal claims about brain training.  Even those with an adequate design should take care to qualify any causal claims appropriately to avoid misrepresentation in the media.


Even if the control condition in this study had been adequate (it wasn't), the critical interaction testing the difference in improvement across conditions was not reported. Moreover, one of the improvements in the training group was reported to be significant even though it wasn't, and neither of the improvements would have withstood correction for multiple tests. Finally, the apparent underreporting of outcome measures makes all of the significance tests suspect.


More broadly, this paper provides an excellent example of why the field needs true pre-registration of cognitive intervention studies. Such registrations should include more than just the labels for the outcome measures. They should include a complete description of the protocol, tasks, measures, coding, and planned analysis. They should specify any arbitrary cutoffs, identify which analyses are confirmatory, and note when additional analyses will be exploratory. Without pre-registration (or, in the absence of pre-registration, complete reporting of all outcome measures), readers have no way to evaluate the results of an intervention because any statistical tests are effectively uninterpretable. 



Note: Updated to fix formatting and typos