Friday, February 21, 2014

How experts recall chess positions

Originally posted to invisiblegorilla blog on 15 February 2012. I am consolidating some posts from my other blogs onto my personal website where I have been blogging for the past year.

In 2011, a computer (Watson) outplayed two human Jeopardy champions.  In 1997, chess computer Deep Blue defeated chess champion Garry Kasparov. In both cases, the computer “solved” the game—found the right questions or good moves—differently than humans do.  Defeating humans in these domains took years of research and programming by teams of engineers, but only with huge advantages in speed, efficiency, memory, and precision could computers compete with much more limited humans.
What allows human experts to match wits with custom-designed computers equipped with tremendous processing power?  Chess players have a limited ability to evaluate all of the possible moves, the responses to those moves, the responses to the responses, etc. Even if they could evaluate all of the possible alternatives several moves deep, they still would need to remember which moves they had evaluated, which ones led to the best outcomes, and so on.  Computers expend no effort remembering possibilities they have already rejected or revisiting options that proved unfruitful.
This question, how do chess experts evaluate positions to find the best move, has been studied for decades, dating back to the groundbreaking work of Adriaan de Groot and later to work by William Chase and Herbert Simon.  De Groot interviewed several chess players as they evaluated positions, and he argued that experts and weaker players tended to “look” about the same number of moves ahead and to evaluate similar numbers of moves with roughly similar speed.  The relatively small differences between experts and novices suggested that their advantages came not from brute force calculation ability but from something else: knowledge.  According to de Groot, the core of chess expertise is the ability to recognize a huge number of chess positions (or parts of positions) and to derive moves from them.  In short, their greater efficiency came not from evaluating more outcomes, but from considering only the better options. [Note: Some of the details of de Groot’s claims, which he made before the appropriate statistical tests were in widespread use, did not hold up to later scrutiny—experts do consider somewhat more options, look a bit deeper, and process positions faster than less expert players (Holding, 1992). But de Groot was right about the limited nature of expert search and the importance of knowledge and pattern recognition in expert performance.]
In de Groot’s most famous demonstration, he showed several players images of chess positions for a few seconds and asked the players to reconstruct the positions from memory.  The experts made relatively few mistakes even though they had seen the position only briefly.  Years later, Chase and Simon replicated de Groot’s finding with another expert (a master-level player) as well as an amateur and a novice.  They also added a critical control: The players viewed both real chess positions and scrambled chess positions (that included pieces in implausible and even impossible locations). The expert excelled with the real positions, but performed no better than the amateur and novice for the scrambled positions (later studies showed that experts can perform slightly better than novices for random positions too if given enough time; Gobet & Simon, 1996).  The expert advantage apparently comes from familiarity with real chess positions, something that allows more efficient encoding or retrieval of the positions.
Chase and Simon recorded their expert performing the chess reconstruction task, and found that he placed the pieces on the board in spatially contiguous chunks, with pauses of a couple of seconds after he reproduced each chunk.  This finding has become part of the canon of cognitive psychology: People can increase their working memory capacity by grouping together otherwise discrete items to form a larger unit in memory.  In that way, we can encode more information into the same limited number of memory slots.
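Chunking is easy to illustrate with a toy example (the digits, the grouping, and the helper function here are mine, not from Chase and Simon's data):

```python
def chunk(items, size):
    """Group a flat sequence into fixed-size chunks, each occupying one memory 'slot'."""
    return [tuple(items[i:i + size]) for i in range(0, len(items), size)]

digits = [1, 4, 9, 2, 1, 7, 7, 6, 2, 0, 0, 1]  # 12 digits: beyond a typical span
years = chunk(digits, 4)                        # [(1,4,9,2), (1,7,7,6), (2,0,0,1)]
print(len(digits), len(years))                  # 12 items vs. 3 chunks
```

Twelve unrelated digits exceed working memory, but a reader who recognizes 1492, 1776, and 2001 stores three familiar units instead. Chess masters do the analogous thing with configurations of pieces: the chunks work because prior knowledge makes them meaningful.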
In 1998, Chris Chabris and I invited two-time US Champion and International Grandmaster Patrick Wolff (a friend of Chris’s) to the lab and asked him to do the chess position reconstruction task. Wolff viewed each position (on a printed index card) for five seconds and then immediately reconstructed it on a chess board.  After he was satisfied with his work, we gave him the next card. At the end of the study, after he had recalled five real positions and five scrambled positions, we asked him to describe how he did the task.
The video below shows his performance and his explanations (Chris is the one handing him the cards and holding the stopwatch—I was behind the camera). Like other experts who have been tested, Wolff rarely made mistakes in reconstructing positions, and when he did, the errors were trivial—they did not alter the fundamental meaning or structure of the position. Watch for the interesting comments at the end when Wolff describes why he was focused on some aspects of a position but not others.

HT to Chris Chabris for comments on a draft of this post
Sources cited:
For an extended discussion of chess expertise and the nature of expert memory, see Christopher Chabris’s dissertation:  Chabris, C. F. (1999).  Cognitive and neuropsychological mechanisms of expertise: Studies with chess masters.  Doctoral Dissertation, Harvard University.
Chase, W. G., & Simon, H. A. (1973).  Perception in chess.  Cognitive Psychology, 4, 55–81.
de Groot, A.D. (1946). Het denken van de schaker. [The thought of the chess player.] Amsterdam: North-Holland. (Updated translation published as Thought and choice in chess, Mouton, The Hague, 1965; corrected second edition published in 1978.)
Holding, D.H. (1992). Theories of chess skill. Psychological Research, 54(1), 10–16.
Gobet, F., & Simon, H. A. (1996). Recall of rapidly presented random chess positions is a function of skill. Psychonomic Bulletin & Review, 3(2), 159–163.

Friday, February 14, 2014

HI-BAR: 10 questions about inattentional blindness, race, and interpersonal goals

HI-BAR (Had I Been A Reviewer)

A post-publication review of Brown-Iannuzzi et al. (2014). The invisible man: Interpersonal goals moderate inattentional blindness to African Americans. Journal of Experimental Psychology: General, 143, 33–37. [pubmed] [full paper]

For more information about HI-BAR reviews, see this earlier post.

In a paper published this week in the Journal of Experimental Psychology: General, Brown-Iannuzzi and colleagues reported a study in which participants (White women) first were asked to think about their interpersonal goals and then completed an inattentional blindness task in which the unexpected event was either the appearance of a White man or a Black man. For these participants, their idealized interpersonal goals presumably included same-race romantic partners or friends, so the prediction was that priming participants to think about these idealized interpersonal goals would make them less likely to notice an unexpected Black "interloper" than a White one in a basketball counting task similar to our earlier selective attention task.

This approach is interesting and potentially important for several reasons that have nothing to do with race or interpersonal goals. Most studies showing variability in noticing rates as a function of expectations manipulate expectations by varying the task itself (e.g., counting the white shapes rather than the black ones, or attending to shape rather than color; see Most et al., 2001, 2005). In this study, Brown-Iannuzzi and colleagues manipulated expectations not by changing the task instructions, but by priming people using an entirely unrelated task. In effect, their priming task was designed to get people to envision a White person without calling attention to race and then used that more activated concept to induce a change in noticing rates as a function of race. If this approach proves robust, it could change how we think about the detection of unexpected objects because it implies that an attention set could be induced (subtly) in a powerful way without actually changing the primary task people are doing.

I don't really have any specific comments about the use of race in particular or the use of interpersonal goals to accomplish this form of priming, but given the broader theoretical importance of this claim, I do have a number of questions about the methods and results in this paper. Most of my questions arise from the relatively sparse reporting of methods and results details in the paper, so I hope that they can be addressed if the authors provide more information. I am concerned that the evidence for the core conclusions is somewhat shaky given the fairly small samples and the flexibility of the analysis choices. Given the potential importance of this claim, I would like to see the finding directly replicated with a pre-registered analysis plan and a larger sample to verify that the effects are robust. 

10 Questions and Comments 

1) The method section provides almost no information about the test videos. What did the test video look like? How long did it last? Were all the players in the video White? How many passes did each team make? Did the two teams differ in the number of passes they made? Did the two videos differ in any way other than the race of the unexpected person? What color clothes did the players wear? How were the videos presented online to MTurk participants (i.e., were they presented in Flash or some other media format)? Was there any check that the videos played smoothly on the platform on which they were viewed? Were there any checks on whether participants followed instructions carefully? That can be an issue with MTurk samples, but the only check on attentiveness appears to be the accuracy of counting. Was there any check to make sure people actually did think about their interpersonal goals? No subjects appear to have been excluded due to failure to do so. How long did the study take to complete? All of these factors potentially affect the results, so it's hard to evaluate the study without this information.

2) Were the players in the video wearing black shirts and white shirts as in our original video? If so, which team's passes did people count? Was that counterbalanced, and if so, did it matter? If the players were wearing white/black shirts, the finding that the shirt color worn by the unexpected person didn't matter is really surprising (and somewhat odd given Most et al.'s findings that similarity to the attended items matters). The task itself induces an attention set by virtue of the demand to focus selectively on one group and not the other, and it would be surprising if the subtle race priming yielded a more potent attention set than the task requirements. We know that the attention demands of the task (what's attended, what's ignored) affect noticing based on the similarity of the unexpected item to the attended and ignored items. That's a pretty big effect. Why wouldn't it operate here too? Shouldn't we also expect some interaction between the priming effect and the color of the attended team?

3) The analyzed sample consisted of 209 MTurk subjects. I have no objection to using MTurk for this sort of study. But, the method section doesn't report enough details about the assignment to conditions to evaluate the nature of these effects. For example, how many participants were in each of the conditions? It appears that the sample was divided across 5 (personal closeness) x 2 (race) x 2 (shirt color) combinations of conditions, for a total of 20 conditions. If so, there were approximately 10 subjects/condition. Did half of the participants attend to each team in the video? If so, that would mean there were approximately 5 subjects/combination of conditions. Depending on how many factors there were in the design, the sample size in each condition becomes pretty small. And, that's important because the core conclusion rests largely on the effect in just one of the priming conditions.

4) The paper does not report the number of participants in each condition or the number of those who noticed (it just reports percentages in a figure, but the figure does not break down the results across each of the combinations of conditions). Perhaps the authors could provide a table with that information as part of supplemental materials? Although the paper reports no effect of factors like shirt color, it's quite possible that such factors interacted with the other ones, but there probably is not adequate power with these sample sizes to test such effects. Still, it would be good to know the exact N in each combination of conditions, along with the number of participants who noticed in each condition.

5) Was there a reason why missing the total count by 4 was used as the cutoff for accurate performance? That might well be a reasonable cutoff (it led to 35 exclusions out of the original 244 participants), but the paper doesn't report the total number of passes in the video, so we don't know how big an error that actually is. The analysis excluded participants who were inaccurate (by 4 passes), and footnote 3 reports that the simple comparisons were weaker if those participants were included. Does that mean that the effect in the Friend condition was not significant with these subjects included? Did that effect depend on this particular cutoff for accuracy? What if the study used a cutoff of 3 missed passes rather than 4? Would that change the outcome? How about 5? Or 2? What if the study also excluded people familiar with the original basketball-counting video? Would it be reliable then? The flexibility of these sorts of analysis decisions is one reason I strongly favor pre-registered analysis plans for confirmatory studies.

6) It seems problematic to include the 23% of participants who were familiar with inattentional blindness in the analyses, especially if they were familiar with one of the variants in which people count ball passes by one of two teams of players. Although one of my own papers was cited as justification for including rather than excluding these participants, I didn't understand the reasoning. It's true that people can fail to notice an unexpected event even if they are familiar with the phenomenon of inattentional blindness more generally, but that typically depends on them not making the connection between the current test and the previous one. That is, they shouldn't have reason to expect an unexpected event in this particular task/scenario (e.g., people familiar with the "gorilla" video might still fail to detect an unexpected target in Mack & Rock's original paradigm because the two tasks are so different that they have no reason to expect one). When I show another variant of the gorilla task to people who are already familiar with the original gorilla video, they do notice the gorilla (Simons, 2010). They expect an unexpected interloper and look for it. The same likely is true here. The method section reports that excluding these 23% of the participants did not matter for the interaction, but the analyses with those data excluded are not reported. And, given that excluding those subjects would reduce the sample size by 23% and that the critical simple comparison was p=.03 (see below), it seems likely that the exclusion would have some consequences for the statistical significance of the critical comparisons underlying the conclusions. Perhaps the authors could report these analyses more fully in supplemental materials.

7) It is not appropriate to draw strong conclusions from the difference in noticing rates in the control condition for the White and Black unexpected person. The paper suggests that the difference in the no-prime control group results from racial stereotyping: White subjects view a Black person as a greater threat, so he garners more attention and is noticed more by default. But, this comparison is of two different videos that presumably were not identical in every respect other than the race of the unexpected person. It's quite possible that the actor just stood out more in that video due to the choreography of the players. Or, that particular Black actor might have stood out more due to features having nothing to do with his race (e.g., maybe he was taller, moved with a different gait, etc.). It's risky to draw strong comparisons about baseline differences in noticing rates across videos because many things could differ between any two videos like this. It's not justified to assume that any baseline differences must have been due to race, especially with just one example of an unexpected person of each race.

8) If the control condition is treated as a default for noticing against which the priming conditions are compared (really, the appropriate way to do it given that direct comparison of different stimuli is questionable), it would be nice to have a more robust estimate of that baseline condition with a substantially larger sample. Otherwise, the noisy measurement of the baseline could artificially inflate the estimates of interaction effects. In the analyses, though, the baseline is treated just like the other priming conditions. If anything, the effects might be larger if the baseline were subtracted from each of the other conditions, but I would be hesitant to do that without first increasing the precision of the baseline estimate. 

9) The primary conclusions are driven by the condition in which subjects were primed by thinking about an idealized Friend (the interaction effects are much harder to interpret because the priming conditions really are categories rather than an equally-spaced continuum, and even if they were a continuum, the trends are not consistent with one). Ideally, these simple effect analyses would have been relative to a baseline condition to account for default differences in noticing rates. Still, with the direct comparison, the Friend priming condition was the only one to yield significantly greater detection of the White unexpected person (p=.03). The effect for the Romantic Partner condition was not statistically significant at .05, nor were the Neighbor or Co-Worker conditions. I don't see any a priori reason to expect an effect just for the Friend condition, and with corrections for multiple tests, that effect would not have been significant either. This amounts to a concern about analysis flexibility: The authors could have drawn the same conclusion had there been a difference in the Romantic Partner condition but not the Friend condition. It might even be possible to explain an effect of priming in the more remote interpersonal relationship conditions or in some combination of them. Correcting for multiple tests helps to address this issue, but I would prefer pre-registration of the specific predictions for any confirmatory hypothesis test. Then, any additional analyses could be treated as exploratory. With this level of flexibility, correction for these alternative outcomes seems necessary when interpreting the statistical significance of the findings.
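The multiple-comparisons point is simple arithmetic. A sketch, assuming four relationship comparisons (Friend, Romantic Partner, Neighbor, Co-Worker) and using a Bonferroni correction (my choice; the paper does not specify one):

```python
# Bonferroni correction applied to the reported Friend-condition p-value.
# The count of four comparisons reflects the four relationship conditions
# described above; other correction methods would give similar answers.
p_friend = 0.03   # reported simple-effect p-value for the Friend condition
n_tests = 4       # Friend, Romantic Partner, Neighbor, Co-Worker
p_corrected = min(1.0, p_friend * n_tests)
print(round(p_corrected, 2))        # 0.12
print(p_corrected < 0.05)           # False: not significant after correction
```

Any standard correction (Bonferroni, Holm) pushes the one significant comparison well past the .05 threshold, which is why the flexibility concern matters.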

10) Without correcting for multiple tests, the paper effectively treats the secondary analyses as confirmatory hypothesis tests, but any of a number of other outcomes could have also been taken as support for the same hypothesis. Given the relatively small effect in the one condition that was statistically significant (although, again, see my notes about using a baseline), I worry that the effect might not be robust.  Would the effect in that condition no longer be significant if one or two people who noticed were shifted to the miss category? My guess is that the significance in the one critical condition hinges on as little as that. A confirmatory replication would be helpful to verify that this effect is reliable. 
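To see how fragile a just-under-.05 result can be with cells this small, here is a sketch using hypothetical counts (invented for illustration, since the paper reports only percentages) and a hand-rolled one-sided Fisher exact test:

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p for the 2x2 table [[a, b], [c, d]]:
    P(X >= a) under the hypergeometric null."""
    n1, k, n = a + b, a + c, a + b + c + d
    return sum(comb(k, x) * comb(n - k, n1 - x)
               for x in range(a, min(n1, k) + 1)) / comb(n, n1)

# Hypothetical cells: 20 subjects per condition, 12/20 notice the White
# interloper vs. 5/20 the Black one.
p = fisher_one_sided(12, 8, 5, 15)
print(round(p, 3))   # about 0.027: "significant"

# Shift a single noticer to the miss category:
p_shifted = fisher_one_sided(11, 9, 6, 14)
print(round(p_shifted, 3))   # about 0.10: no longer significant
```

With these made-up but plausible cell sizes, moving one participant across the notice/miss boundary quadruples the p-value, which is exactly the fragility the question above is getting at.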


This study provides suggestive evidence for a new way to moderate noticing of unexpected events. If true, it would have substantial theoretical and practical importance. However, the missing method details make the finding hard to evaluate. And, the flexibility of the analysis choices, coupled with a critical finding that only reaches statistical significance without correction for that flexibility, makes me worried about the robustness of the result. Fortunately, the first of these issues would be easy to address by adding additional information to the supplemental materials. I hope the authors will do that so that readers can more fully evaluate the study. They could also provide more information about the results and the consequences of various analytical decisions. Given the importance of the finding, I hope they will build on it by conducting a pre-registered, highly powered, confirmatory test of the effect.