Thursday, November 20, 2014

HI-BAR: More benefits of Lumosity training for older adults?

HI-BAR (Had I Been A Reviewer)

For more information about HI-BAR reviews, see my post introducing the acronym.


Ballesteros S, Prieto A, Mayas J, Toril P, Pita C, Ponce de León L, Reales JM and Waterworth J (2014) Brain training with non-action video games enhances aspects of cognition in older adults: a randomized controlled trial. Front. Aging Neurosci. 6:277. doi: 10.3389/fnagi.2014.00277


This paper testing the benefits of Lumosity training for older adults just appeared in Frontiers. The paper is co-authored by two of the same people (Ballesteros and Mayas) as a recent PLoS One paper on Lumosity training that I critiqued in a HI-BAR in April (note that the new paper includes six additional authors who were not on the PLoS paper, and the PLoS paper includes two who weren't on the Frontiers paper).

I was hopeful that this new paper would address some of the shortcomings of the earlier one. Unfortunately, it didn't. In fact, it couldn't have, because the "new" Frontiers paper is based on the same intervention as the PLoS paper.

And, when I say, "the same intervention," I don't mean that they directly replicated the procedures from their earlier paper. The Frontiers paper reports data from the same intervention with the same participants. It would be hard to know that from reading just the Frontiers paper, though, because it does not mention that these are additional measures from the same intervention. 

As my colleagues and I have noted (ironically, in Frontiers), this sort of partial reporting of outcome measures gives the misleading impression that there are more interventions demonstrating transfer of training than actually exist. If a subsequent meta-analysis treats these reports as independent, it will produce a distorted estimate of the size of any intervention benefits. Moreover, whenever a paper does not report all outcome measures, there's no way to appropriately correct for multiple comparisons. At a minimum, all publications derived from the same intervention study should state explicitly that it is the same intervention and identify all outcome measures that will, at some point, be reported. 

The Frontiers paper cites the PLoS paper only twice. The first citation appears in the method section in reference to the oddball outcome task that was the focus of the PLoS paper:
Results from this task have been reported separately (see Mayas et al., 2014).
That is the only indication in the paper that there might be overlap of any kind between the study presented in Frontiers and the one presented in PLoS. It's a fairly minimalist way to indicate that this was actually a report of different outcome measures from the same study. 

It's also not entirely true: The results section of the Frontiers paper describes the results for that oddball task in detail as well, and all of the statistics for that task reported in the Frontiers paper were also reported in the PLoS paper. The results section does not acknowledge the overlap. Figure 3b in the Frontiers paper is nearly identical to Figure 2b in the PLoS paper, reporting the same statistics from the oddball task with no mention that the data in the figure were published previously.

After repeating the main results of the oddball task from the earlier paper, the Frontiers paper states:
"New analyses were conducted on distraction (novel vs. standard sound) and alertness (silence vs. standard sound), showing that the ability to ignore relevant sounds (distraction) improved significantly in the experimental group after training (12 ms) but not in the control group. The analyses of alertness showed that the experimental group increased 26 ms in alertness (p < 0.05) but control group did not (p = 0.54)."
I thought "new analyses" might mean that these analyses were new to this paper, but they weren't. Here's the relevant section of the PLoS paper (which reported the actual statistical tests):
"In other words, the ability to ignore irrelevant sounds improved significantly in the experimental group after video game training (12 ms approximately) but not in the control group....Pre- and post-training comparisons showed a 26ms increase of alertness in the experimental group..."
Stating that "results from this task have been reported separately" implies that any analyses of that task reported in Frontiers are new, but they're not.

The Frontiers paper also reports the same training protocol and training performance data without reference to the earlier study. The Frontiers paper put the training data in a figure whereas the PLoS paper put them in a table. Why not just cite the earlier paper for the training data rather than republishing them without citation?

The same CONSORT diagram appears as Figure 1 in both papers. The Frontiers paper CONSORT diagram does report that the experimental condition had data from 17 participants rather than the 15 participants reported in the PLoS paper. From the otherwise identical figures, it appears that those two additional participants were the ones that the PLoS paper noted were lost to follow-up due to "personal travel" and "no response from participant." I guess both responded after the PLoS paper was completed, although the method sections of the two papers noted that testing took place over the same time period (January through July of 2013). The presence of data from these two additional participants did not change any of the statistics for the oddball task that were reported in both papers.

One other oddity about the subject samples in the CONSORT diagram: The PLoS paper excluded one participant from the control condition analyses due to "Elevated number of errors," leaving 12 participants in that analysis. The Frontiers paper did not exclude that participant. If they made too many errors for the PLoS paper, why were they included in the Frontiers paper?

Other than the one brief note in the method section about the oddball task, the only other reference to the PLoS paper appears in the discussion, and it gives the impression that the PLoS paper provided independent evidence for the contribution of frontal brain regions to alertness and attention filtering:
We also found that the trainees reduced distractibility by improving alertness and attention filtering, functions that decline with age and depend largely on frontal regions (Mayas et al., 2014).
That citation and the one in the method section are the only mentions of the earlier paper in the entire article. 


Given that I was already familiar with the PLoS paper, had I been a reviewer of this paper, here's what I would have said:


  1. The paper suffers from all of the same criticisms I raised about the intervention in the PLoS paper (e.g., inadequate control group among many other issues), and in my view, those shortcomings should preclude publication of this paper. See http://blog.dansimons.com/2014/04/hi-bar-benefits-of-lumosity-training.html
  2. The paper should have explained that this was the same intervention study, with the same participants, as the earlier PLoS One paper.
  3. The paper should have noted explicitly which data and results were reported previously and which were not.
It's possible that the authors notified the editor of the overlap between these papers upon submission. If so, it is surprising that the paper was even sent out for review before these issues were addressed. If the editor was not informed of the overlap, I cannot fault the editor and reviewers for missing the fact that the intervention, training results, and some of the outcome results were reported previously. Unless they happened to know about the PLoS One paper, they could be excused for missing the one reference to that paper in the method section. Still, the many other major problems with this intervention study probably should have precluded publication, at least without major corrections and qualifications of the claims.

Wednesday, April 2, 2014

HI-BAR: Benefits of Lumosity training for older adults?

HI-BAR (Had I Been A Reviewer)

A post-publication review of Mayas J., Parmentier, F. B. R., Andrés P., & Ballesteros, S. (2014) Plasticity of Attentional Functions in Older Adults after Non-Action Video Game Training: A Randomized Controlled Trial. PLoS ONE 9(3): e92269. doi:10.1371/journal.pone.0092269

For more information about HI-BAR reviews, see my post introducing the acronym.



This paper explored "whether the ability to filter out irrelevant information...can improve in older adults after practicing non-violent video games." In this case, the experimental group played 10 games that are part of the Lumosity program for a total of 20 hours. The control group did not receive any training. Based on post-training improvements on an "oddball" task (a fairly standard attention task, not a measure of quirkiness), the authors claim that training improved the ability to ignore distractions and increased alertness in older adults. 

Testing whether commercial brain training packages have any efficacy for cognitive enhancement is a worthwhile goal, especially given the dearth of robust, reliable evidence that such training has any measurable impact on cognitive performance on anything other than the trained tasks. I expect that Lumosity will tout this paper as evidence for the viability of their brain training games as a tool to improve cognition. They probably shouldn't.


Below are the questions and concerns I would have raised had I been a reviewer of this manuscript. If you read my earlier HI-BAR review of Anguera et al (2013), you'll recognize many of the same concerns. Unfortunately, the problems with this paper are worse. A few of these questions could be answered with more information about the study (I hope the authors will provide that information). Unfortunately, many of the shortcomings are more fundamental and undermine the conclusion that training transferred to their outcome measure.


I've separated the comments into two categories: Method/Analysis/Inferential issues and Reporting issues.


Method/Analysis/Inferential Issues

Sample size
The initial sample consisted of 20 older adults in the training group and 20 in the control group.  After attrition, the analyzed sample consisted of only 15 people in the training group and 12 in the control group. That's a really small sample, especially when testing older adults who can vary substantially in their performance.

Inadequate control group
The control group for this paper is inadequate to make any causal claims about the efficacy of the training. The experimental group engaged in 20 hours of practice with Lumosity games. The control group "attended meetings with the other members of the study several times along the course of the study;" they got to hang out, chat, and have coffee with the experimenters a few times. This sort of control condition is little better than a no-contact control group (not even the amount of contact was equated). Boot et al (2013) explained why control conditions like this "limited contact" one provide an inadequate baseline against which to evaluate the effectiveness of an intervention. Here's the critical figure from our paper showing the conclusions that logically follow from interventions with different types of control conditions:




When inferring the causal potency of any treatment, it must be compared to an appropriate baseline. That is why drug studies use a placebo control, ideally one that equates for side effects too, so that participants do not know whether they have received the drug or a placebo. For a control condition to be adequate, it should include all of the elements of the experimental condition except the critical ingredient of the treatment (and it should equate for expectation effects). Otherwise, any differences in outcome could be due to other differences between the groups. That subtractive method, first described by Donders more than 150 years ago, is the basis of clinical trial logic. Unfortunately, it is commonly neglected in psychological interventions.

In video game training, even an active control group in which participants play a different game might not control for differential placebo effects on outcome measures. But, the lack of an active control group allows almost no definitive conclusions: It does not equate the experience between the training group and the control group in any substantive way. This limited-contact control group accounts for test-retest effects and the passage of time, and little else. 

Any advantage observed for the training group could result from many factors that are not specific to the games involved in the training condition or even to games at all: Any benefits could have resulted from doing something intensive for 20 hours, from greater motivation to perform well on the outcome measures, from greater commitment to the tasks, from differential placebo effects, from greater social contact, etc. Differences between the training group and this limited-contact control group do not justify any causal conclusion about the nature of the training.

Interventions and training studies that lack any control condition other than a no-contact or limited-contact control group should not be published. Period. They are minimally informative at best, and misleading at worst given that they will be touted as evidence for the benefits of training. The literature on game training is more than 10 years old, and there is no justification for publishing studies that claim a causal benefit of training if they lack an adequate baseline condition.

Multiple tests without correction
The initial analysis consisted of two separate 2x2x3 ANOVAs on accuracy and response times on the oddball task, accompanied by follow-up tests.  A 3-factor ANOVA consists of 7 separate F-tests (3 main effects, 3 two-way interactions, and 1 three-way interaction). Even if the null hypothesis were true and all of the data were drawn from a single population, we would expect a significant result on at least one of these tests more than 30% of the time on average (1 - .95^7). In other words, each of the ANOVAs has a 30% false positive rate. For a thorough discussion of this issue, see this excellent blog post from Dorothy Bishop.
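
To see where the 30% figure comes from, here's a minimal sketch in Python (my own illustration, not anything from the paper). It computes 1 − .95^7 directly and checks it with a quick simulation; it assumes the seven tests are independent, so correlated tests would give a somewhat lower rate.

```python
# Familywise false-positive rate for the 7 F-tests in one 2x2x3 ANOVA,
# assuming independent tests each conducted at alpha = .05.
import numpy as np

alpha, n_tests = 0.05, 7
print(f"Analytic rate: {1 - (1 - alpha) ** n_tests:.3f}")  # ~0.302

# Monte Carlo check: draw 7 independent null p-values many times and count
# how often at least one falls below alpha.
rng = np.random.default_rng(0)
p = rng.uniform(size=(100_000, n_tests))
print(f"Simulated rate: {(p < alpha).any(axis=1).mean():.3f}")
```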

The critical response time analysis, the only one to show a differential training effect, produced two significant tests out of the 7 conducted in the ANOVA: a main effect of condition (as for accuracy, but not with the predicted pattern) and a significant 3-way interaction. The results section does not report any correction for multiple tests, though, and with correction the critical 3-way interaction would not have been significant (it was p=.017 uncorrected).

Possible speed-accuracy tradeoff
The accuracy ANOVA showed a marginally significant 3-way interaction (p=.077 without correction for multiple tests), but the paper does not report the means or standard deviations for the accuracy results. Is it possible that the effects on accuracy and RT were in opposite directions? If so, the entire difference between training groups could just be a speed-accuracy tradeoff, with no actual performance difference between conditions.


Flexible and arbitrary analytical decisions
For response times, the analysis included only correct trials and excluded times faster than 200ms and slower than 1100ms. Standard trials after novel trials were discarded as well. These choices seem reasonable, but arbitrary. Would the results hold with different cutoffs and criteria? Were any other cutoffs tried? If so, that introduces additional flexibility (investigator degrees of freedom) that could spuriously inflate the significance of tests. That's one reason why pre-registration of analysis plans is essential. It's all too easy to rationalize any particular approach after the fact if it happens to work.
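
To make the concern concrete, here is the kind of sensitivity check I have in mind, sketched on simulated response times. The data, distribution, and cutoff values below are placeholders, not the study's: the idea is simply to rerun the trial-exclusion step under several plausible cutoffs and report whether the critical tests survive each one.

```python
# Sketch of a cutoff sensitivity check on simulated (not real) response times.
import numpy as np

rng = np.random.default_rng(1)
rts = rng.lognormal(mean=6.3, sigma=0.35, size=2000)  # placeholder RTs in ms

for lo, hi in [(150, 1000), (200, 1100), (250, 1200), (300, 1500)]:
    kept = rts[(rts > lo) & (rts < hi)]
    print(f"cutoffs {lo}-{hi} ms: kept {kept.size} trials, mean RT {kept.mean():.0f} ms")
    # In a real check, re-run the critical analyses on `kept` for each cutoff pair
    # and report whether the conclusions change.
```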


Unclear outcome measures
The three-way interaction for response time was followed up with additional tests that separated the oddball task into an alertness measure and a distraction measure, analyzed separately for the two groups. It's not clear how these measures were derived from the oddball conditions, but I assume they were based on different combinations of the silent, standard, and novel noise conditions. It would be nice to know what these contrasts were, as they provide the only focused tests of differential transfer-task improvement between the experimental group and the control group.


A difference in significance is not a significant difference
The primary conclusions about these follow-up outcome measures are based on significant improvements for the training group (reported as p=.05 and p=.04) and the absence of a significant improvement for the control group. Yet, significance in one condition and not in another does not mean that those two conditions differed significantly. No test of the difference in improvements across conditions was provided.
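
The missing test is straightforward to run. Here's a minimal sketch with placeholder gain scores (pre minus post response times, in ms); the made-up numbers are only there to show the comparison of improvements across groups, which is the test the paper needed.

```python
# Compare improvements (pre - post RT, in ms) directly across groups.
# The gain scores below are placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
training_gain = rng.normal(loc=26, scale=40, size=15)  # hypothetical trainees
control_gain = rng.normal(loc=5, scale=40, size=12)    # hypothetical controls

# "Significant in one group but not the other" is not a substitute for this test:
t, p = stats.ttest_ind(training_gain, control_gain)
print(f"Difference in improvement between groups: t = {t:.2f}, p = {p:.3f}")
```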

Inappropriate truncation/rounding of p-values
The critical "significant" effect for the training group actually wasn't significant! The authors reported "F(1,25) = 4.00, MSE = 474.28, p = .05, d = 0.43]." Whenever I see a p value reported as exactly .05, I get suspicious. So, I checked. F(1,25) = 4.00 gives p = .0565. Not significant. The authors apparently truncated the p-value. 

(The reported p=.04 is actually p=.0451. That finding was rounded, but rounding down is not appropriate either.)
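
Anyone can verify these values from the reported F statistics and degrees of freedom; a quick sketch using scipy (the F for the second test isn't reproduced here, but any reported F can be checked the same way):

```python
# Recover the exact p-value from the reported F(1,25) = 4.00.
from scipy import stats

p = stats.f.sf(4.00, dfn=1, dfd=25)  # survival function: P(F > 4.00)
print(f"F(1,25) = 4.00 -> p = {p:.4f}")  # ~0.0565, i.e., not below .05
```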

Of the two crucial tests in the paper, one wasn't actually significant and the other was just barely under p=.05. Not strong evidence for improvement, especially given the absence of correction for multiple tests (with which neither would be significant).

Inappropriate conclusions from correlation analyses
The paper explored the correlation between the alertness and distraction improvements (the two outcome measures) and each of the 10 games that made up the Lumosity training condition. The motivation was to test whether the amount of improvement an individual showed during training correlated with the amount of transfer they showed on the outcome measures. Of course, with N=15, no correlation is stable and any significant correlation is likely to be substantially inflated. The paper included no correction for the 20 tests they conducted, and neither of the two significant correlations (p=.02, p<.01) would survive correction. Moreover, even if these correlations were robust, correlations between training improvement and outcome measure improvement logically provide no evidence for the causal effect of training on the transfer task (see Tidwell et al, 2013).
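
A quick simulation illustrates both problems (instability and inflation) at N = 15. This is a generic sketch with an assumed population correlation of .3, not a reanalysis of the study's data.

```python
# How unstable are correlations at N = 15, and how inflated are the ones
# that happen to reach significance? Assume a true population r of 0.3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rho, n, reps = 0.3, 15, 10_000
cov = [[1.0, rho], [rho, 1.0]]
rs = np.array([stats.pearsonr(*rng.multivariate_normal([0, 0], cov, size=n).T)[0]
               for _ in range(reps)])

sig = rs[np.abs(rs) > 0.514]  # critical |r| for two-tailed p < .05 with df = 13
print(f"Sample r ranges from {rs.min():.2f} to {rs.max():.2f}")
print(f"Mean r among 'significant' samples: {sig.mean():.2f} vs. true r = {rho}")
```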

Inaccurate conclusions
The authors write: 
"The results of the present study suggest that training older adults with non-action video games reduces distractibility by improving attention filtering (a function declining with age and largely depending on frontal regions) but also improves alertness." 
First, the study provides no evidence at all that any differences resulted from improvements in an attention filtering mechanism. It provided no tests of that theoretical idea.

Furthermore, the study as reported does not show that training differentially reduced distractibility or increased alertness. The alertness improvement was not statistically significant once the p-value is no longer truncated to .05. The effect on the distraction measure was 12 ms (p=.0451 without correction for multiple tests). Neither effect would be statistically significant with correction for multiple tests. But, even if they were significant, without a test of the difference between the training effect and the control group effect, we don't know if there was any significant difference in improvements between the two groups; significance in one condition but not another does not imply a significant difference between conditions.



Reporting Issues

Pre-Registration or Post-Registration
The authors registered their study on ClinicalTrials.gov on December 4, 2013; it's listed and linked under the "trial registration" heading in the published article. That's great.  Pre-registration is the best way to eliminate p-hacking and other investigator degrees of freedom. 

But, this wasn't a pre-registration: The manuscript was submitted to PLoS three months before it was registered! What is the purpose of registering an already completed study? Should it even count as a registration?

Unreported outcome measures 
The only outcome measure mentioned in the published article is the "oddball task," but the registration on ClinicalTrials.gov identifies the following additional measures (none with any detail): neuropsychological testing, Wisconsin task, speed of processing, and spatial working memory. Presumably, these measures were collected as part of the study? After all, the protocol was registered after the paper was submitted. Why were they left out of the paper?

Perhaps the registration was an acknowledgment of the other measures and they are planning to report each outcome measure in a separate journal submission. Dividing a large study across multiple papers can be an acceptable practice, provided that all measures are identified in each paper and readers are informed about all of the relevant outcomes (the papers must cross-reference each other). 


Sometimes a large-scale study is preceded by a peer-reviewed "design" paper that lays out the entire protocol in detail in advance. This registration lacks the necessary detail to serve as a roadmap for a series of studies. Moreover, separating measures across papers without mentioning that there were other outcome measures or that some of the measures were (or will be) reported elsewhere is misleading. It gives the false impression that these outcomes came from different, independent experiments. A meta-analysis would treat them as independent evidence for training benefits when they aren't independent.


Unless the results from all measures are published, readers have no way to interpret the significance tests for any one measure. Readers need to know the total number of hypothesis tests to determine the false positive rate. Without that information, the significance tests are largely uninterpretable.
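
The dependence on the total number of tests is easy to quantify. As a generic illustration (the actual number of unreported measures is unknown), here is how quickly the chance of at least one false positive grows with the number of independent tests run at alpha = .05:

```python
# Familywise false-positive rate as a function of the number of outcome
# measures, assuming one independent test per measure at alpha = .05.
for k in (1, 2, 4, 7, 14):
    print(f"{k:2d} tests -> P(at least one false positive) = {1 - 0.95 ** k:.2f}")
```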


Here's a more troubling possibility: Perhaps the results for these other measures weren't significant, so the authors chose not to report them (or the reviewers/editor told them not to). If true, this underreporting—providing only the outcome measure that showed an effect—constitutes p-hacking, increasing the chances that any significant results in the article were false positives. 

Without more information, readers have no way to know which is the case. And, without that information, it is not possible to evaluate the evidence. This problem of incomplete reporting of outcome measures (and neglecting to mention that separate papers came from the same study) has occurred in the game training literature before—see "The Importance of Independent Replication" section in this 2012 paper for some details. 

Conflicting documentation of outcome measures
The supplemental materials at PLoS include a protocol that lists only the oddball task and neuropsychological testing. The supplemental materials also include an author-completed Consort Checklist that identifies where all the measures are described in the manuscript. The checklist includes the following items for "outcomes": 
"6a. Completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed."  
"6b. Any changes to trial outcomes after the trial commenced, with reasons." 
For 6a, the authors answered "Methods." Yet, the primary and secondary outcomes noted in the ClinicalTrials.gov registration are not fully reported in the methods of the actual article or in the protocol page. For 6b, they responded "N/A," implying that the design was carried out as described in the paper.

These forms and responses are inconsistent with the protocol description at ClinicalTrials.gov. Either the paper and supplementary materials neglected to mention the other outcome measures, or the ClinicalTrials.gov registration lists outcome measures that weren't actually collected. Because the registration was completed after the paper was submitted, it presumably reflects what was actually done, which implies that other outcome measures were collected as part of the study but not reported. If so, the PLoS supplemental materials are inaccurate.

Final post-test or interim stage post-test? 
The ClinicalTrials.gov registration states that outcome measures will be tested before training, after 12 weeks, and again after 24 weeks. The paper reports only the pre-training and 12-week outcomes and does not mention the 24-week test. Was it conducted? Is this paper an interim report? If so, that should be mentioned in the article. Had the results not been significant at 12 weeks, would they have been submitted for publication? Probably not. And, if not, that could be construed as selective reporting, again biasing the reported p-values in this paper in favor of significance.


Limitations of the limitations section
The paper ends with a limitations section, but the only identified limitations are the small sample size, the lack of a test of real-world outcome measures, the use of only the games in Lumosity, and the lack of evidence for maintenance of the training benefits (presumably foreshadowing a future paper based on the 24-week outcome testing mentioned in the ClinicalTrials.gov registration). No mention is made of the inadequacy of the control group for causal claims about the benefits of game training, the fragility of the results to correction for multiple testing, the flexibility of the analysis, the possible presence of unreported outcome measures, or any of the other issues I noted above.


Summary

Brain training is now a major industry, and companies capitalize (literally) on results that seem to support their claims. Training and intervention studies are critical if we want to evaluate the effectiveness of psychological interventions. But, intervention studies must include an adequate active control group, one that is matched for expected improvements independently for each outcome measure (to control for differential placebo effects). Without such a control condition, causal claims that a treatment has benefits are inappropriate because it is impossible to distinguish effects of the training task from other differences between the training and control group that could also lead to differential improvement. Far too many published papers make causal claims with inadequate designs, incomplete reporting of outcome measures, and overly flexible analyses. 

In this case, the inadequacy of the limited-contact control condition (without acknowledging these limitations) alone would be sufficient grounds for an editor to reject this paper. Reviewers and editors need to step up and begin requiring adequate designs whenever authors make causal claims about brain training.  Even those with an adequate design should take care to qualify any causal claims appropriately to avoid misrepresentation in the media.


Even if the control condition in this study had been adequate (it wasn't), the critical interaction testing the difference in improvement across conditions was not reported. Moreover, one of the improvements in the training group was reported to be significant even though it wasn't, and neither of the improvements would have withstood correction for multiple tests. Finally, the apparent underreporting of outcome measures makes all of the significance tests suspect.


More broadly, this paper provides an excellent example of why the field needs true pre-registration of cognitive intervention studies. Such registrations should include more than just the labels for the outcome measures. They should include a complete description of the protocol, tasks, measures, coding, and planned analysis. They should specify any arbitrary cutoffs, identify which analyses are confirmatory, and note when additional analyses will be exploratory. Without pre-registration (or, in the absence of pre-registration, complete reporting of all outcome measures), readers have no way to evaluate the results of an intervention because any statistical tests are effectively uninterpretable.



Note: Updated to fix formatting and typos

Friday, February 21, 2014

How experts recall chess positions

Originally posted to invisiblegorilla blog on 15 February 2012. I am consolidating some posts from my other blogs onto my personal website where I have been blogging for the past year.


In 2011, a computer (Watson) outplayed two human Jeopardy champions.  In 1997, chess computer Deep Blue defeated chess champion Garry Kasparov. In both cases, the computer “solved” the game—found the right questions or good moves—differently than humans do.  Defeating humans in these domains took years of research and programming by teams of engineers, but only with huge advantages in speed, efficiency, memory, and precision could computers compete with much more limited humans.
What allows human experts to match wits with custom-designed computers equipped with tremendous processing power?  Chess players have a limited ability to evaluate all of the possible moves, the responses to those moves, the responses to the responses, etc. Even if they could evaluate all of the possible alternatives several moves deep, they still would need to remember which moves they had evaluated, which ones led to the best outcomes, and so on.  Computers expend no effort remembering possibilities that they had already rejected or revisiting options that proved unfruitful.
This question, how do chess experts evaluate positions to find the best move, has been studied for decades, dating back to the groundbreaking work of Adriaan de Groot and later to work by William Chase and Herbert Simon. de Groot interviewed several chess players as they evaluated positions, and he argued that experts and weaker players tended to “look” about the same number of moves ahead and to evaluate similar numbers of moves with roughly similar speed. The relatively small differences between experts and novices suggested that their advantages came not from brute force calculation ability but from something else: knowledge. According to de Groot, the core of chess expertise is the ability to recognize a huge number of chess positions (or parts of positions) and to derive moves from them. In short, experts' greater efficiency came not from evaluating more outcomes, but from considering only the better options. [Note: Some of the details of de Groot’s claims, which he made before the appropriate statistical tests were in widespread use, did not hold up to later scrutiny—experts do consider somewhat more options, look a bit deeper, and process positions faster than less expert players (Holding, 1992). But de Groot was right about the limited nature of expert search and the importance of knowledge and pattern recognition in expert performance.]
In de Groot’s most famous demonstration, he showed several players images of chess positions for a few seconds and asked the players to reconstruct the positions from memory.  The experts made relatively few mistakes even though they had seen the position only briefly.  Years later, Chase and Simon replicated de Groot’s finding with another expert (a master-level player) as well as an amateur and a novice.  They also added a critical control: The players viewed both real chess positions and scrambled chess positions (that included pieces in implausible and even impossible locations). The expert excelled with the real positions, but performed no better than the amateur and novice for the scrambled positions (later studies showed that experts can perform slightly better than novices for random positions too if given enough time; Gobet & Simon, 1996).  The expert advantage apparently comes from familiarity with real chess positions, something that allows more efficient encoding or retrieval of the positions.
Chase and Simon recorded their expert performing the chess reconstruction task, and found that he placed the pieces on the board in spatially contiguous chunks, with pauses of a couple seconds after he reproduced each chunk.  This finding has become part of the canon of cognitive psychology: People can increase their working memory capacity by grouping together otherwise discrete items to form a larger unit in memory.  In that way, we can encode more information into the same limited number of memory slots.
In 1998, Chris Chabris and I invited two-time US Champion and International Grandmaster Patrick Wolff (a friend of Chris’s) to the lab and asked him to do the chess position reconstruction task. Wolff viewed each position (on a printed index card) for five seconds and then immediately reconstructed it on a chess board.  After he was satisfied with his work, we gave him the next card. At the end of the study, after he had recalled five real positions and five scrambled positions, we asked him to describe how he did the task.
The video below shows his performance and his explanations (Chris is the one handing him the cards and holding the stopwatch—I was behind the camera). Like other experts who have been tested, Wolff rarely made mistakes in reconstructing positions, and when he did, the errors were trivial—they did not alter the fundamental meaning or structure of the position. Watch for the interesting comments at the end when Wolff describes why he was focused on some aspects of a position but not others.



HT to Chris Chabris for comments on a draft of this post
Sources cited:
For an extended discussion of chess expertise and the nature of expert memory, see Christopher Chabris’s dissertation:  Chabris, C. F. (1999).  Cognitive and neuropsychological mechanisms of expertise: Studies with chess masters.  Doctoral Dissertation, Harvard University. http://en.scientificcommons.org/43254650
Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology, 4, 55-81.
de Groot, A.D. (1946). Het denken van de schaker. [The thought of the chess player.] Amsterdam: North-Holland. (Updated translation published as Thought and choice in chess, Mouton, The Hague, 1965; corrected second edition published in 1978.)
Holding, D.H. (1992). Theories of chess skill. Psychological Research, 54(1), 10–16.
Gobet, F., & Simon, H.A. (1996a). Recall of rapidly presented random chess positions is a function of skill. Psychonomic Bulletin & Review, 3(2), 159–163.

Friday, February 14, 2014

HI-BAR: 10 questions about Inattentional blindness, race, and interpersonal goals

HI-BAR (Had I Been A Reviewer)

A post-publication review of Brown-Iannuzzi et al (2014). The invisible man: Interpersonal goals moderate inattentional blindness to African Americans. Journal of Experimental Psychology: General, 143, 33-37. [pubmed] [full paper]

For more information about HI-BAR reviews, see this earlier post.


In a paper published this week in the Journal of Experimental Psychology:General, Brown-Iannuzzi and colleagues reported a study in which participants (White women) first were asked to think about their interpersonal goals and then completed an inattentional blindness task in which the unexpected event was either the appearance of a White man or a Black man. For these participants, their idealized interpersonal goals presumably included same-race romantic partners or friends, so the prediction was that priming participants to think about these idealized interpersonal goals would make them less likely to notice an unexpected Black "interloper" than a White one in a basketball counting task similar to our earlier selective attention task.

This approach is interesting and potentially important for several reasons that have nothing to do with race or interpersonal goals. Most studies showing variability in noticing rates as a function of expectations manipulate expectations by varying the task itself (e.g., counting the white shapes rather than the black ones, or attending to shape rather than color; see Most et al., 2001, 2005). In this study, Brown-Iannuzzi and colleagues manipulated expectations not by changing the task instructions, but by priming people using an entirely unrelated task. In effect, their priming task was designed to get people to envision a White person without calling attention to race and then used that more activated concept to induce a change in noticing rates as a function of race. If this approach proves robust, it could change how we think about the detection of unexpected objects because it implies that an attention set could be induced (subtly) in a powerful way without actually changing the primary task people are doing.

I don't really have any specific comments about the use of race in particular or the use of interpersonal goals to accomplish this form of priming, but given the broader theoretical importance of this claim, I do have a number of questions about the methods and results in this paper. Most of my questions arise from the relatively sparse reporting of methods and results details in the paper, so I hope that they can be addressed if the authors provide more information. I am concerned that the evidence for the core conclusions is somewhat shaky given the fairly small samples and the flexibility of the analysis choices. Given the potential importance of this claim, I would like to see the finding directly replicated with a pre-registered analysis plan and a larger sample to verify that the effects are robust. 

10 Questions and Comments 

1) The method section provides almost no information about the test videos. What did the test video look like? How long did it last? Were all the players in the video White? How many passes did each team make? Did the two teams differ in the number of passes they made? Did the two videos differ in any way other than the race of the unexpected person? What color clothes did the players wear? How were the videos presented online to MTurk participants? (i.e., were they presented in Flash or some other media format?) Was there any check that the videos played smoothly on the platform on which they were viewed? Were there any checks on whether participants followed instructions carefully? That can be an issue with MTurk samples, but the only check on attentiveness appears to be the accuracy of counting. Was there any check to make sure people actually did think about their interpersonal goals? No subjects appear to have been excluded for failing to do so. How long did the study take to complete? All of these factors potentially affect the results, so it's hard to evaluate the study without this information.

2) Were the players in the video wearing black shirts and white shirts as in our original video? If so, which team's passes did people count? Was that counter-balanced, and if so, did it matter? If the players were wearing white/black shirts, the finding that the shirt color worn by the unexpected person didn't matter is really surprising (and somewhat odd given Most et al's findings that similarity to the attended items matters). The task itself induces an attention set by virtue of the demand to focus selectively on one group and not the other, and it would be surprising if the subtle race priming yielded a more potent attention set than the task requirements. We know that the attention demands of the task (what's attended, what's ignored) affect noticing based on the similarity of the unexpected item to the attended and ignored items. That's a pretty big effect. Why wouldn't it operate here too? Shouldn't we also expect some interaction between the priming effect and the color of the attended team?

3) The analyzed sample consisted of 209 MTurk subjects. I have no objection to using MTurk for this sort of study. But, the method section doesn't report enough details about the assignment to conditions to evaluate the nature of these effects. For example, how many participants were in each of the conditions? It appears that the sample was divided across 5 (personal closeness) x 2 (race) x 2 (shirt color) combinations of conditions, for a total of 20 conditions. If so, there were approximately 10 subjects/condition. Did half of the participants attend to each team in the video? If so, that would mean there were approximately 5 subjects/combination of conditions. Depending on how many factors there were in the design, the sample size in each condition becomes pretty small. And, that's important because the core conclusion rests largely on the effect in just one of the priming conditions.

4) The paper does not report the number of participants in each condition or the number of those who noticed (it just reports percentages in a figure, but the figure does not break down the results across each of the combinations of conditions). Perhaps the authors could provide a table with that information as part of supplemental materials? Although the paper reports no effect of factors like shirt color, it's quite possible that such factors interacted with the other ones, but there probably is not adequate power with these sample sizes to test such effects. Still, it would be good to know the exact N in each combination of conditions, along with the number of participants who noticed in each condition.

5) Was there a reason why missing the total count by 4 was used as the cutoff for accurate performance? That might well be a reasonable cutoff (it led to 35 exclusions out of the original 244 participants), but the paper doesn't report the total number of passes in the video, so we don't know how big an error that actually is. The analysis excluded participants who were inaccurate (by 4 passes), and footnote 3 reports that the simple comparisons were weaker if those participants were included. Does that mean that the effect in the Friend condition was not significant with these subjects included? Did that effect depend on this particular cutoff for accuracy? What if the study used a cutoff of 3 missed passes rather than 4? Would that change the outcome? How about 5? or 2? What if the study also excluded people familiar with the original basketball-counting video? Would it be reliable then? The flexibility of these sorts of analysis decisions is one reason I strongly favor pre-registered analysis plans for confirmatory studies.

6) It seems problematic to include the 23% of participants who were familiar with inattentional blindness in the analyses, especially if they were familiar with one of the variants in which people count ball passes by one of two teams of players. Although one of my own papers was cited as justification for including rather than excluding these participants, I didn't understand the reasoning. It's true that people can fail to notice an unexpected event even if they are familiar with the phenomenon of inattentional blindness more generally, but that typically depends on them not making the connection between the current test and the previous one. That is, they shouldn't have reason to expect an unexpected event in this particular task/scenario (e.g., people familiar with the "gorilla" video might still fail to detect an unexpected target in Mack & Rock's original paradigm because the two tasks are so different that they have no reason to expect one). When I show another variant of the gorilla task to people who are already familiar with the original gorilla video, they do notice the gorilla (Simons, 2010). They expect an unexpected interloper and look for it. The same likely is true here. The method section reports that excluding these 23% of the participants did not matter for the interaction, but the analyses with those data excluded are not reported. And, given that excluding those subjects would reduce the sample size by 23% and that the critical simple comparison was p=.03 (see below), it seems likely that the exclusion would have some consequences for the statistical significance of the critical comparisons underlying the conclusions. Perhaps the authors could report these analyses more fully in supplemental materials.

7) It is not appropriate to draw strong conclusions from the difference in noticing rates in the control condition for the White and Black unexpected person. The paper suggests that the difference in the no-prime control group results from racial stereotyping: White subjects view a Black person as a greater threat, so he garners more attention and is noticed more by default. But, this comparison is of two different videos that presumably were not identical in every respect other than the race of the unexpected person. It's quite possible that the actor just stood out more in that video due to the choreography of the players. Or, that particular Black actor might have stood out more due to features having nothing to do with his race (e.g., maybe he was taller, moved with a different gait, etc.). It's risky to draw strong comparisons about baseline differences in noticing rates across videos because many things could differ between any two videos like this. It's not justified to assume that any baseline differences must have been due to race, especially with just one example of an unexpected person of each race.

8) If the control condition is treated as a default for noticing against which the priming conditions are compared (really, the appropriate way to do it given that direct comparison of different stimuli is questionable), it would be nice to have a more robust estimate of that baseline condition with a substantially larger sample. Otherwise, the noisy measurement of the baseline could artificially inflate the estimates of interaction effects. In the analyses, though, the baseline is treated just like the other priming conditions. If anything, the effects might be larger if the baseline were subtracted from each of the other conditions, but I would be hesitant to do that without first increasing the precision of the baseline estimate. 

9) The primary conclusions are driven by the condition in which subjects were primed by thinking about an idealized Friend (the interaction effects are much harder to interpret because the priming conditions really are categories rather than an equally-spaced continuum, and even if they were a continuum, the trends are not consistent with one). Ideally, these simple effect analyses would have been relative to a baseline condition to account for default differences in noticing rates. Still, with the direct comparison, the Friend priming condition was the only one to yield significantly greater detection of the White unexpected person (p=.03). The effect for the Romantic Partner condition was not statistically significant at .05, nor were the neighbor or co-worker conditions. I don't see any a-priori reason to expect an effect just for the Friend condition, and with corrections for multiple tests, that effect would not have been significant either. This amounts to a concern about analysis flexibility: The authors could have drawn the same conclusion had there been a difference in the Romantic Partner condition but not the Friend condition. It might even be possible to explain an effect of priming in the more remote interpersonal relationship conditions or in some combination of them. Correcting for multiple tests helps to address this issue, but I would prefer pre-registration of the specific predictions for any confirmatory hypothesis test. Then, any additional analyses could be treated as exploratory. With this level of flexibility, correction for these alternative outcomes seems necessary when interpreting the statistical significance of the findings.

10) Without correcting for multiple tests, the paper effectively treats the secondary analyses as confirmatory hypothesis tests, but any of a number of other outcomes could have also been taken as support for the same hypothesis. Given the relatively small effect in the one condition that was statistically significant (although, again, see my notes about using a baseline), I worry that the effect might not be robust.  Would the effect in that condition no longer be significant if one or two people who noticed were shifted to the miss category? My guess is that the significance in the one critical condition hinges on as little as that. A confirmatory replication would be helpful to verify that this effect is reliable. 

Conclusions

This study provides suggestive evidence for a new way to moderate noticing of unexpected events. If true, it would have substantial theoretical and practical importance. However, the missing method details make the finding hard to evaluate. And, the flexibility of the analysis choices coupled with a critical finding that only reaches statistical significance without correction for that flexibility make me worried about the robustness of the result. Fortunately, the first of these issues would be easy to address by adding additional information to the supplemental materials. I hope the authors will do that so that readers can more fully evaluate the study. They could also provide more information about the results and the consequences of various analytical decisions. Given the importance of the finding, I hope they will build on this finding by conducting a pre-registered, highly powered, confirmatory test of the effect.

Saturday, January 25, 2014

Replication, Retraction, and Responsibility

Congrats/thanks to Brent Donnellan, Joe Cesario, and Rich Lucas for their tenacity and perseverance. They conducted 9 studies with more than 3000 participants in order to publish a set of direct replications. Their paper challenged an original report (study 1 in Bargh & Shalev, 2012) claiming that loneliness is correlated with preferred shower temperatures. The new, just-accepted paper did not find a reliable correlation. Donnellan describes their findings and the original studies in what may be the most measured and understated blog post I've seen. You should read it.

The original study had fewer than 100 subjects (51 from a Yale undergraduate sample and a replication with 41 from a community sample), underpowered to detect a typical effect size in a social psychology experiment. But there are bigger problems with the original results.


According to the description in Donnellan's post, the data from the Yale sample were completely screwy: 46/51 Yale students reported taking fewer than 1 shower/bath per week! Either Yale students are filthy, or something's wrong with the data. More critical for the primary question, 42/51 Yale students apparently prefer cold (24 students) or lukewarm (18 students) showers. How many people do you know who prefer cold showers to reasonably hot ones? Again, something's out of whack. In a comment on Donnellan's blog post, Rich Lucas noted that the original distribution of preferred temperatures would mirror what Donnellan et al found if the original data were inadvertently reverse coded. Of course, that would mean the correlation reported in the paper was backwards, and the effect was the opposite of what was claimed.

From an earlier Donnellan post, we know that Bargh was aware of these issues back in 2012, but that he prevented Donnellan and his colleagues from discussing the problems until recently (you should read that post too). In a semi-private email discussion among priming researchers and skeptics, Bargh claimed that his prohibition on discussing his data was just a miscommunication, but he didn't get around to correcting that misconception until he was pressed to respond on that email thread. In the same thread, Bargh claimed to have first learned of these errors from Joe Cesario (who initially requested the original data). Although it's odd that he didn't notice the weird distribution in the frequency responses, I can understand how someone might miss something obvious when they were focusing attention elsewhere... Bargh said that he provided an explanation to the editor at Emotion during the review process: He claimed that Yale students misunderstood the bathing frequency item as asking specifically about baths (not showers). According to Joe Cesario's response in that same email thread, though, that doesn't accord with the survey wording about showers/baths that Bargh provided.

Still, whenever and however Bargh learned of the problems with the data, he and Shalev had an obligation to retract the original study and issue an erratum (unless they actually believe Yale students prefer rare, cold showers). Even if the subjects misinterpreted the frequency question, the results are bogus. The problems could well have resulted from an honest oversight, a slip up in coding, a misinterpretation of a poorly worded question, or an Excel copy/paste error. Regardless of the reason, authors have a responsibility to own up to mistakes and to correct the record. Posting to a semi-private email list is not sufficient—the public record in the journal must be amended. Authors have an obligation to correct mistakes once they know of them, and the failure to do so in the published record is troubling.

Note that I am not arguing the original study should be retracted just because Donnellan and colleagues didn't replicate it. A failure to replicate is not adequate grounds for questioning the integrity of the original finding. The original effect size estimates could be wrong, but that's just science working properly to correct itself (that's why direct replications are useful and important). Yet, obviously flawed data like those described by Donnellan should not have to await replication, and scholars reading the literature should be informed that they should not place any stock in that first study with Yale students. That finding should be withdrawn until it can be verified so that it doesn't mislead the field.

One thing I find troubling about this story is that Donnellan, Cesario, and Lucas needed to conduct 9 studies with more than 30x the original number of participants in order to get this paper accepted at Emotion. They should be applauded for replicating with enough power to be sure that their effect size estimates are precise, but each of their studies had more than 2.5x the sample size of the original! If their efforts were entirely voluntary and not a consequence of appeasing reviewers, kudos to them for making sure they got it right. I'm glad that this paper was accepted, and our field owes them gratitude for their efforts. I just hope they haven't set an overly high standard and precedent for what's needed to publish a direct replication.

I would encourage Bargh to issue a public explanation (accessible to the whole field, not just an email thread) for the data issues in their original study. The problems could well have been an accidental coding or interpretation problem, and mistakes are excusable even if they do undermine the claims. More importantly, he should retract the original study (not the whole paper, necessarily -- just the study with problematic data) and issue an erratum in the journal. Out of curiosity, I would like to see an explanation for why the study was not retracted immediately upon learning of the problems more than a year ago.  Perhaps there is a good reason, but I'm having trouble generating one. I hope he will enlighten us. 

Sunday, October 27, 2013

New Posts on IonPsych - October 27, 2013

This fall, I am teaching a graduate seminar on speaking/writing for general audiences. As part of the class, students blog at www.ionpsych.com. Each week, I'll provide a short summary of the latest posts.


The latest posts on www.ionpsych.com

Anna Popova discusses the value of negative experiences and imagination in decision making.

Jim Monti describes the best way to stave off the cognitive costs of aging.

Luis Flores describes a recent study of depression and argues that people with depression may enjoy activities just as much as those without depression, but they are less willing to work to experience those activities.

Christina Tworek muses on the causes of the large disparity between what scientists know and what the public knows of science.

Emily Hankosky critiques claims in a recent book espousing personal responsibility for overcoming addiction. She makes a case that discounting neurobiological bases of addiction is irresponsible.

Lindsey Hammerslag discusses the importance of teaching about certainty and uncertainty in science.

Sunday, October 13, 2013

New Posts on IonPsych - 13 October 2013

This fall, I am teaching a graduate seminar on speaking/writing for general audiences. As part of the class, students blog at www.ionpsych.com. Each week, I'll provide a short summary of the latest posts.


The latest posts on www.ionpsych.com

H. A. Logis explores how we mold our behaviors to the people around us.

Brian Metzger demonstrates how perception often isn't subject to free will.

Emily Kim describes alternatives to academia for social psychologists.

Joachim Operskalski evaluates how new developments in neuroscience might improve treatment of psychiatric disorders and why we shouldn't be so quick to dismiss Prozac. 

Anna Popova explains how to make statistics interesting.