Wednesday, April 2, 2014

HI-BAR: Benefits of Lumosity training for older adults?

HI-BAR (Had I Been A Reviewer)

A post-publication review of Mayas J., Parmentier, F. B. R., Andrés P., & Ballesteros, S. (2014) Plasticity of Attentional Functions in Older Adults after Non-Action Video Game Training: A Randomized Controlled Trial. PLoS ONE 9(3): e92269. doi:10.1371/journal.pone.0092269

For more information about HI-BAR reviews, see my post introducing the acronym.

This paper explored "whether the ability to filter out irrelevant information...can improve in older adults after practicing non-violent video games." In this case, the experimental group played 10 games that are part of the Lumosity program for a total of 20 hours. The control group did not receive any training. Based on post-training improvements on an "oddball" task (a fairly standard attention task, not a measure of quirkiness), the authors claim that training improved the ability to ignore distractions and increased alertness in older adults. 

Testing whether commercial brain training packages have any efficacy for cognitive enhancement is a worthwhile goal, especially given the dearth of robust, reliable evidence that such training has any measurable impact on cognitive performance on anything other than the trained tasks. I expect that Lumosity will tout this paper as evidence for the viability of their brain training games as a tool to improve cognition. They probably shouldn't.

Below are the questions and concerns I would have raised had I been a reviewer of this manuscript. If you read my earlier HI-BAR review of Anguera et al (2013), you'll recognize many of the same concerns. Unfortunately, the problems with this paper are worse. A few of these questions could be answered with more information about the study (I hope the authors will provide that information), but many of the shortcomings are more fundamental and undermine the conclusion that training transferred to their outcome measure.

I've separated the comments into two categories: Method/Analysis/Inferential issues and Reporting issues.

Method/Analysis/Inferential Issues

Sample size
The initial sample consisted of 20 older adults in the training group and 20 in the control group.  After attrition, the analyzed sample consisted of only 15 people in the training group and 12 in the control group. That's a really small sample, especially when testing older adults who can vary substantially in their performance.

Inadequate control group
The control group for this paper is inadequate to make any causal claims about the efficacy of the training. The experimental group engaged in 20 hours of practice with Lumosity games. The control group "attended meetings with the other members of the study several times along the course of the study;" they got to hang out, chat, and have coffee with the experimenters a few times. This sort of control condition is little better than a no-contact control group (not even the amount of contact was equated). Boot et al (2013) explained how inadequate control conditions like this "limited contact" one provide an inadequate baseline against which to evaluate the effectiveness of an intervention. Here's the critical figure from our paper showing the conclusions that logically follow from interventions with different types of control conditions:

When inferring the causal potency of any treatment, it must be compared to an appropriate baseline. That is why drug studies use a placebo control, ideally one that equates for side effects too, so that participants do not know whether they have received the drug or a placebo. For a control condition to be adequate, it should include all of the elements of the experimental group excepting the critical ingredient of the treatment (including equating for expectation effects). Otherwise, any differences in outcome could be due to other differences between the groups. That subtractive method, first described by Donders more than 150 years ago, is the basis of clinical trial logic. Unfortunately, it commonly is neglected in psychological interventions.

In video game training, even an active control group in which participants play a different game might not control for differential placebo effects on outcome measures. But, the lack of an active control group allows almost no definitive conclusions: It does not equate the experience between the training group and the control group in any substantive way. This limited-contact control group accounts for test-retest effects and the passage of time, and little else. 

Any advantage observed for the training group could result from many factors that are not specific to the games involved in the training condition or even to games at all: Any benefits could have resulted from doing something intensive for 20 hours, from greater motivation to perform well on the outcome measures, from greater commitment to the tasks, from differential placebo effects, from greater social contact, etc. Differences between the training group and this limited-contact control group do not justify any causal conclusion about the nature of the training.

Interventions and training studies that lack any control condition other than a no-contact or limited-contact control group should not be published. Period. They are minimally informative at best, and misleading at worst given that they will be touted as evidence for the benefits of training. The literature on game training is more than 10 years old, and there is no justification for publishing studies that claim a causal benefit of training if they lack an adequate baseline condition.

Multiple tests without correction
The initial analysis consisted of two separate 2x2x3 ANOVAs on accuracy and response times on the oddball task, accompanied by follow-up tests.  A 3-factor ANOVA consists of 7 separate F-tests (3 main effects, 3 two-way interactions, and 1 three-way interaction). Even if the null hypothesis were true and all of the data were drawn from a single population, we would expect a significant result on at least one of these tests more than 30% of the time on average (1 - .95^7). In other words, each of the ANOVAs has a 30% false positive rate. For a thorough discussion of this issue, see this excellent blog post from Dorothy Bishop.
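The arithmetic behind that false positive rate is easy to verify. A quick sketch (the 7-test figure comes from the 2x2x3 ANOVA structure described above):

```python
# Family-wise error rate for k independent tests, each at alpha = .05:
# P(at least one false positive) = 1 - (1 - alpha)^k
alpha = 0.05
k = 7  # 3 main effects + 3 two-way interactions + 1 three-way interaction
fwer = 1 - (1 - alpha) ** k
print(f"{fwer:.3f}")  # ~0.302 -- a ~30% chance of at least one spurious "effect"
```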

The critical response time analysis, the only one to show a differential training effect, produced two significant tests out of the 7 conducted in the ANOVA: a main effect of condition, as in the accuracy analysis (but not with the predicted pattern), and a significant 3-way interaction. The results section reports no correction for multiple tests, though, and with correction the critical 3-way interaction would not have been significant (it was p=.017 uncorrected).

Possible speed-accuracy tradeoff
The accuracy ANOVA showed a marginally significant 3-way interaction (p=.077 without correction for multiple tests), but the paper does not report the means or standard deviations for the accuracy results. Is it possible that the effects on accuracy and RT were in opposite directions? If so, the entire difference between training groups could just be a speed-accuracy tradeoff, with no actual performance difference between conditions.
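To see why this matters, consider a toy example with invented numbers (not values from the paper). The inverse efficiency score (IES = mean RT divided by proportion correct) is one common way to combine speed and accuracy; when a group gets faster but less accurate, the apparent RT "gain" can evaporate:

```python
# Hypothetical numbers illustrating a speed-accuracy tradeoff masquerading
# as a training benefit. IES = mean RT (ms) / proportion correct.
def ies(rt_ms, prop_correct):
    return rt_ms / prop_correct

# Faster after "training" (600 -> 560 ms), but less accurate (.95 -> .90)
pre, post = ies(600, 0.95), ies(560, 0.90)
print(round(pre, 1), round(post, 1))  # 631.6 622.2 -- the 40 ms speedup mostly vanishes
```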

Flexible and arbitrary analytical decisions
For response times, the analysis included only correct trials and excluded times faster than 200ms and slower than 1100ms. Standard trials after novel trials were discarded as well. These choices seem reasonable, but arbitrary. Would the results hold with different cutoffs and criteria? Were any other cutoffs tried? If so, that introduces additional flexibility (investigator degrees of freedom) that could spuriously inflate the significance of tests. That's one reason why pre-registration of analysis plans is essential. It's all too easy to rationalize any particular approach after the fact if it happens to work.
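To make the point concrete, here is a toy sketch (invented RTs, not the study's data) of how the choice of trimming cutoffs alone changes the statistic that downstream significance tests depend on:

```python
# Made-up response times (ms) for illustration only
rts = [180, 250, 420, 480, 510, 530, 560, 700, 1050, 1400]

def trimmed_mean(data, lo, hi):
    """Mean of the observations that survive the [lo, hi] cutoffs."""
    kept = [x for x in data if lo <= x <= hi]
    return sum(kept) / len(kept)

print(trimmed_mean(rts, 200, 1100))  # 562.5 -- the paper's cutoffs
print(trimmed_mean(rts, 150, 1500))  # 608.0 -- a different, equally defensible choice
```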

Unclear outcome measures
The three-way interaction for response time was followed up with more tests that separated the oddball task into an alertness measure and a distraction measure analyzed separately for the two groups. It's not clear how these measures were derived from the oddball conditions, but I assume they were based on different combinations of the silent, standard, and novel noise conditions. It would be nice to know what these contrasts were, as they provide the only focused tests of differential transfer task improvements between the experimental group and the control group.

A difference in significance is not a significant difference
The primary conclusions about these follow-up outcome measures are based on significant improvements for the training group (reported as p=.05 and p=.04) and the absence of a significant improvement for the control group. Yet, significance in one condition and not in another does not mean that those two conditions differed significantly. No test of the difference in improvements across conditions was provided.
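A schematic example makes this point (made-up effect estimates, each with a known standard error): one group can clear p=.05 while the other misses it, even though the difference between the groups is nowhere near significant.

```python
from math import erf, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Made-up improvement estimates (arbitrary units), each with SE = 1
b_train, b_control, se = 2.0, 1.0, 1.0
p_train = two_sided_p(b_train / se)      # ~.046: "significant"
p_control = two_sided_p(b_control / se)  # ~.317: "not significant"
# The test that actually matters: the difference between the improvements
p_diff = two_sided_p((b_train - b_control) / sqrt(se**2 + se**2))
print(p_train, p_control, p_diff)  # the difference itself is nowhere near significant (~.48)
```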

Inappropriate truncation/rounding of p-values
The critical "significant" effect for the training group actually wasn't significant! The authors reported "F(1,25) = 4.00, MSE = 474.28, p = .05, d = 0.43]." Whenever I see a p value reported as exactly .05, I get suspicious. So, I checked. F(1,25) = 4.00 gives p = .0565. Not significant. The authors apparently truncated the p-value. 

(The reported p=.04 is actually p=.0451. That finding was rounded, but rounding down is not appropriate either.)
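The check is easy to reproduce. A quick sketch using `scipy.stats` with the reported test statistic and degrees of freedom:

```python
from scipy import stats

# p-value for the reported F(1, 25) = 4.00, via the F survival function
p = stats.f.sf(4.00, dfn=1, dfd=25)
print(round(p, 4))  # ~0.0565 -- above .05, so not significant
```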

Of the two crucial tests in the paper, one wasn't actually significant and the other was just barely under p=.05. That is not strong evidence for improvement, especially given the absence of correction for multiple tests (with correction, neither would be significant).

Inappropriate conclusions from correlation analyses
The paper explored the correlation between the alertness and distraction improvements (the two outcome measures) and each of the 10 games that made up the Lumosity training condition. The motivation is to test whether the amount of improvement an individual showed during training correlated with the amount of transfer they showed on the outcome measures. Of course, with N=15, no correlation is stable and any significant correlation is likely to be substantially inflated. The paper included no correction for the 20 tests they conducted, and neither of the two significant correlations (p=.02, p<.01) would survive correction. Moreover, even if these correlations were robust, correlations between training improvement and outcome measure improvement logically provide no evidence for the causal effect of training on the transfer task (See Tidwell et al, 2013).
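Just how unstable is a correlation at N=15? A 95% confidence interval via the Fisher z-transform shows the problem (the r value here is illustrative, not one from the paper):

```python
from math import atanh, tanh, sqrt

r, n = 0.60, 15
z = atanh(r)               # Fisher z-transform of r
se = 1 / sqrt(n - 3)       # standard error of z
lo, hi = tanh(z - 1.96 * se), tanh(z + 1.96 * se)
print(round(lo, 2), round(hi, 2))  # roughly 0.13 to 0.85 -- almost uninformative
```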

Inaccurate conclusions
The authors write: 
"The results of the present study suggest that training older adults with non-action video games reduces distractibility by improving attention filtering (a function declining with age and largely depending on frontal regions) but also improves alertness." 
First, the study provides no evidence at all that any differences resulted from improvements in an attention filtering mechanism. It provided no tests of that theoretical idea.

Furthermore, the study as reported does not show that training differentially reduced distractibility or increased alertness. The improved alertness effect was not statistically significant once the p-value is not truncated to .05. The effect on the distraction measure was 12ms (p=.0451 without correction for multiple tests). Neither effect would be statistically significant with correction for multiple tests. But, even if they were significant, without a test of the difference between the training effect and the control group effect, we don't know if there was any significant difference in improvements between the two groups; significance in one condition but not another does not imply a significant difference between conditions.

Reporting Issues

Pre-Registration or Post-Registration
The authors registered their study on ClinicalTrials.gov on December 4, 2013; it's listed and linked under the "trial registration" heading in the published article. That's great. Pre-registration is the best way to eliminate p-hacking and other investigator degrees of freedom.

But, this wasn't a pre-registration: The manuscript was submitted to PLoS three months before it was registered! What is the purpose of registering an already completed study? Should it even count as a registration?

Unreported outcome measures 
The only outcome measure mentioned in the published article is the "oddball task," but the registration on ClinicalTrials.gov identifies the following additional measures (none with any detail): neuropsychological testing, Wisconsin task, speed of processing, and spatial working memory. Presumably, these measures were collected as part of the study? After all, the protocol was registered after the paper was submitted. Why were they left out of the paper?

Perhaps the registration was an acknowledgment of the other measures and they are planning to report each outcome measure in a separate journal submission. Dividing a large study across multiple papers can be an acceptable practice, provided that all measures are identified in each paper and readers are informed about all of the relevant outcomes (the papers must cross-reference each other). 

Sometimes a large-scale study is preceded by a peer-reviewed "design" paper that lays out the entire protocol in detail in advance. This registration lacks the necessary detail to serve as a roadmap for a series of studies. Moreover, separating measures across papers without mentioning that there were other outcome measures or that some of the measures were (or will be) reported elsewhere is misleading. It gives the false impression that these outcomes came from different, independent experiments. A meta-analysis would treat them as independent evidence for training benefits when they aren't independent.

Unless the results from all measures are published, readers have no way to interpret the significance tests for any one measure. Readers need to know the total number of hypothesis tests to determine the false positive rate. Without that information, the significance tests are largely uninterpretable.

Here's a more troubling possibility: Perhaps the results for these other measures weren't significant, so the authors chose not to report them (or the reviewers/editor told them not to). If true, this underreporting—providing only the outcome measure that showed an effect—constitutes p-hacking, increasing the chances that any significant results in the article were false positives. 

Without more information, readers have no way to know which is the case. And, without that information, it is not possible to evaluate the evidence. This problem of incomplete reporting of outcome measures (and neglecting to mention that separate papers came from the same study) has occurred in the game training literature before—see "The Importance of Independent Replication" section in this 2012 paper for some details. 

Conflicting documentation of outcome measures
The supplemental materials at PLoS include a protocol that lists only the oddball task and neuropsychological testing. The supplemental materials also include an author-completed Consort Checklist that identifies where all the measures are described in the manuscript. The checklist includes the following items for "outcomes": 
"6a. Completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed."  
"6b. Any changes to trial outcomes after the trial commenced, with reasons." 
For 6a, the authors answered "Methods." Yet, the primary and secondary outcomes noted in the registration are not fully reported in the methods of the actual article or in the protocol page. For 6b, they responded "N/A," implying that the design was carried out as described in the paper.

These forms and responses are inconsistent with the protocol description at ClinicalTrials.gov. Either the paper and supplementary materials neglected to mention the other outcome measures or the registration lists outcome measures that weren't actually collected. Given that the ClinicalTrials registration was completed after the paper was submitted, the more likely reading is that other outcome measures were collected as part of the study but not reported. If so, the PLoS supplemental materials are inaccurate.

Final post-test or interim stage post-test? 
The registration states that outcome measures will be tested before training, after 12 weeks, and again after 24 weeks. The paper reports only the pre-training and 12-week outcomes and does not mention the 24-week test. Was it conducted? Is this paper an interim report? If so, that should be mentioned in the article. Had the results not been significant at 12 weeks, would they have been submitted for publication? Probably not. And, if not, that could be construed as selective reporting, again biasing the reported p-values in this paper in favor of significance.

Limitations of the limitations section
The paper ends with a limitations section, but the only identified limitations are the small sample size, the lack of a test of real-world outcome measures, the use of only the games in Lumosity, and the lack of evidence for maintenance of the training benefits (presumably foreshadowing a future paper based on the 24-week outcome testing mentioned in the registration). No mention is made of the inadequacy of the control group for causal claims about the benefits of game training, the fragility of the results to correction for multiple testing, the flexibility of the analysis, the possible presence of unreported outcome measures, or any of the other issues I noted above.


Brain training is now a major industry, and companies capitalize (literally) on results that seem to support their claims. Training and intervention studies are critical if we want to evaluate the effectiveness of psychological interventions. But, intervention studies must include an adequate active control group, one that is matched for expected improvements independently for each outcome measure (to control for differential placebo effects). Without such a control condition, causal claims that a treatment has benefits are inappropriate because it is impossible to distinguish effects of the training task from other differences between the training and control group that could also lead to differential improvement. Far too many published papers make causal claims with inadequate designs, incomplete reporting of outcome measures, and overly flexible analyses. 

In this case, the inadequacy of the limited-contact control condition (without acknowledging these limitations) alone would be sufficient grounds for an editor to reject this paper. Reviewers and editors need to step up and begin requiring adequate designs whenever authors make causal claims about brain training.  Even those with an adequate design should take care to qualify any causal claims appropriately to avoid misrepresentation in the media.

Even if the control condition in this study had been adequate (it wasn't), the critical interaction testing the difference in improvement across conditions was not reported. Moreover, one of the improvements in the training group was reported to be significant even though it wasn't, and neither of the improvements would have withstood correction for multiple tests. Finally, the apparent underreporting of outcome measures makes all of the significance tests suspect.

More broadly, this paper provides an excellent example of why the field needs true pre-registration of cognitive intervention studies. Such registrations should include more than just the labels for the outcome measures. They should include a complete description of the protocol, tasks, measures, coding, and planned analysis. They should specify any arbitrary cutoffs, identify which analyses are confirmatory, and note when additional analyses will be exploratory. Without pre-registration (or, in the absence of pre-registration, complete reporting of all outcome measures), readers have no way to evaluate the results of an intervention because any statistical tests are effectively uninterpretable.

Note: Updated to fix formatting and typos