Thursday, June 6, 2013

When beliefs and actions are misaligned - the case of distracted driving

Originally posted to the invisiblegorilla blog on 22 December 2010. I am gradually reposting all of my earlier blog posts from other sites onto my personal website where I will be blogging for the foreseeable future. The post is unedited from the 2010 version.

During the summer of 2010, the California Office of Traffic Safety conducted a survey of 1671 drivers at gas stations throughout California. The survey asked drivers about their own driving behavior and perceptions of driving risks. Earlier this year I posted about the apparent contradiction between what we know and what we do—people continue to talk and text while driving despite awareness of the dangers. The California survey results (pdf) reinforce that conclusion.
59.5% of respondents listed talking on a phone (hand-held or hands-free) as the most serious distraction for drivers. In fact, 45.8% of respondents admitted to making a mistake while driving and talking on a phone, and 54.6% claimed to have been hit or almost hit by someone talking on a phone. People are increasingly aware of the dangers. As David Strayer has shown, talking on a phone while driving is roughly comparable to driving under the influence of alcohol (pdf). Yet, people continue to talk on the phone while driving.
Unlike some earlier surveys that only asked general questions about phone use, this one asked how often the respondents talked on a phone in the past 30 days. 14.0% report regularly talking on a hand-held phone (now illegal) and another 29.4% report regularly talking on a hands-free phone. Fewer than 50% report never talking on a hands-free phone while driving (and only 52.8% report never talking on hand-held phones). People know that they are doing something dangerous, but they do it anyway (at least sometimes).
Fewer people report texting while driving than talking while driving: 9.4% do so regularly, 10.4% do so sometimes, and another 10.6% do so rarely. In other words, more than 30% of respondents still text while driving, at least on occasion, even though texting is much more distracting than talking and is substantially worse than driving under the influence.
68% of respondents thought that a hands-free conversation is safer than a hand-held one, a mistaken but unfortunately common belief. The misconception is understandable given that almost all laws regulating cell phones while driving focus on hand-held phones. The research consistently shows little if any benefit from using a hands-free phone—the distraction is in your head, not your hands.
Fortunately, there is hope that education (and perhaps regulation) can help. The extensive education campaigns about mandatory seatbelt use and the dangers of drunk driving have had an effect over the years: 95.8% report always using a seat belt, and only 1% report never wearing a seatbelt. Only 5.9% reported having driven when they thought they had already had too much alcohol to drive safely.
Sources cited:
Strayer, D., Drews, F., & Crouch, D. (2006). A Comparison of the Cell Phone Driver and the Drunk Driver. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2), 381-391. DOI: 10.1518/001872006777724471

Wednesday, June 5, 2013

Continuing the "diablog" with Rolf Zwaan -- still more thoughts

+Rolf Zwaan just continued our "diablog" (love that term) on reliability and replication. (Rolf -- sorry for slightly misrepresenting your conclusion in my last post, and thanks for clarifying.) At this point, I think we're in complete agreement on pretty much everything in our discussion. I thought I'd comment on one suggestion in his post that was first raised by +Etienne LeBel in the comments on Rolf's first post and that Rolf discussed in his most recent post.

The idea of permitting additional between-subjects conditions on registered replication reports is an interesting one. As Rolf notes, that won't work for within-subject designs as the new conditions would potentially affect the measurement of the original conditions. I have several concerns about permitting additional conditions for registered replication reports at Perspectives, but I don't think any of them necessarily precludes additional conditions. It's something the other editors and I will need to discuss more. Here are the primary issues as I see them:

  • The inclusion of additional conditions should not diminish the sample size for the primary conditions. Otherwise, it would lead to a noisier effect size estimate for the crucial conditions, undermining the primary purpose of the replication reports. Given subject pool constraints and our desire to measure the crucial effects with a maximal sample size, that could be a problem, particularly at smaller schools.
  • The additional condition must in no way affect the measurements in the primary condition. That is, subjects in the primary conditions could not be aware of the existence of an additional condition. Some measures would need to be taken to avoid any interactions among subjects. That's already something we account for in most designs, so I don't see this as a major impediment.
  • The additional conditions could not be reported alongside the primary analyses in the printed journal article. The issue here is that we want the final published article to report the same measures and tests for each individual replication attempt. Otherwise, the final report will become unwieldy, with each of the many participating labs reporting different analyses. That would hinder the ability of readers to assess the strength of the primary effect under study.
If we do decide to permit additional between-subjects conditions, analyses of those conditions could be reported on the OSF project pages created for each participating lab. There are no page limits for those project pages, and each lab could discuss their additional conditions more fully. I will make sure the other editors (+Alex Holcombe and +Bobbie Spellman) and I discuss this possibility.
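
To make the sample-size concern in the first bullet concrete, here is a rough sketch of how the precision of an effect size estimate changes when a fixed subject pool is split across more conditions. It uses the standard large-sample approximation for the standard error of Cohen's d. The pool size (120 subjects) and the assumed effect (d = 0.5) are hypothetical numbers chosen only for illustration, not values from any actual protocol.

```python
import math

def se_cohens_d(n_per_group, d):
    """Approximate standard error of Cohen's d for two equal-sized groups."""
    n1 = n2 = n_per_group
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

# Hypothetical lab: 120 available subjects, assumed true effect of d = 0.5
pool, d = 120, 0.5

# All subjects go into the two primary conditions (60 per group)
two_conditions = se_cohens_d(pool // 2, d)

# One third of the pool is diverted to an extra condition (40 per group remain)
three_conditions = se_cohens_d(pool // 3, d)

print(f"SE of d with 60 per group: {two_conditions:.3f}")
print(f"SE of d with 40 per group: {three_conditions:.3f}")
```

Under these assumptions, diverting a third of the pool to an extra condition makes the estimate for the primary comparison noticeably less precise, which is the trade-off a smaller school would face.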

The Value of Pre-Registration - comments on a letter in the Guardian

The Guardian just published a great letter, signed by many of the leaders of our field, calling for pre-registration as a way to improve our science. I imagine they would have had many more signatories if the authors had put out a more public call. I'll add my virtual signature here. If you agree with the letter, please make sure your colleagues see it and add your virtual signature as well.

I think pre-registration is the way forward. I hadn't pre-registered my studies before this past year, but I've started doing that for all of the studies for which I have direct input into the management of the study. I hope more journals will begin to conduct the review process before the data are in, vetting the method and analysis plan and then publishing the results regardless of the outcome. But even if they don't, pre-registration is the one way to demonstrate that the planned analyses weren't p-hacked. My bet is that, as the ratio of pre-registered to not-pre-registered studies in our journals grows, researchers will begin to look askance at studies that were not pre-registered. The incentive to pre-register will increase as a result, and that's a good thing.

Even if journals don't accept studies before data collection, pre-registration helps to certify that the research showed what it claimed to show. And, pre-registration does not preclude exploratory analyses. They can just be flagged as such in the final article, and readers will know to treat these explorations as preliminary and speculative, requiring further verification. I personally favor having two labeled headings in every results section, one for planned analyses and one for exploratory analyses. Even without pre-registration, that's a good approach. But pre-registration certifies the planned ones.

It's easy to pre-register your results and post your data publicly. You can do that with a free account at

update: fixed formatting errors.

Tuesday, June 4, 2013

Direct replication and conceptual replication - more thoughts

In response to my response to his post, +Rolf Zwaan wrote another interesting post on the limits and value of direct replication.

In his new post, Rolf makes a number of interesting points, and I'd like to highlight and discuss a couple of them:
There is no clear dividing line between direct and conceptual replications but what I am advocating are not conceptual replications.
I agree that the line dividing direct and conceptual replication is not clear, but I do think there is a boundary there. For me, a direct replication sticks as close to the original method as possible. No study is an exact replication, of course: all replications will involve different subjects at a different time, and they typically will involve other minor changes (e.g., more modern computers, etc.). Still, a direct replication can be functionally the same if it uses the same materials, tasks, etc. and is designed to generalize across the same variations as the original. There's some gray area in what constitutes the same target generalization. Our approach with Registered Replication Reports is to ask the original authors to specify a range of tolerances on the parameters of the study. The more they restrict those tolerances, the more they are retrospectively limiting the generality and scope of their original claims. So, we encourage the original authors to be as liberal in their definitions as possible in order to maximize the generality of their effect. If the original author insists that the study could only be conducted in 1989 with subjects from a particular university, then a direct replication is impossible. However, we should then treat the original effect as unreliable and unreproducible, and it should not factor into larger theorizing (and the original paper should have mentioned that limited generality. Of course, if it had, it likely would not have been published).

In my own work, I've recently started adding a sentence to my discussion specifying the generalization target for the study. For example, in a just-published study of overconfidence in judgments by bridge players, I added the following short section to the end of the method section:
Limits in scope and generalizability
The results from this study should generalize to experienced bridge players at other duplicate bridge clubs, as well as to other domains in which players regularly compete against the same group of players in games of skill, provided that the outcome of any individual match or session is not determined entirely by skill. 
I might be wrong about how my findings will generalize, but at least I've stated what I believe the scope to be. I would love to see more papers (and press releases) state their generalizability target. That might avoid a lot of needless follow-up work and it might also help journalists to accurately convey the scope of new research.

Rolf also notes that "Most of the studies that have been focused on in discussions on reproducibility are what you might call one-shot between-subjects studies." Danny Kahneman eloquently made a similar point at a session on scientific best practices at the APS meeting a couple weeks ago. His point was that reliability is a much bigger problem for studies in which each participant contributes just one data point to one condition and all comparisons between conditions are conducted across subjects. Within-subject designs in which each participant contributes many data points to each condition are more powerful and tend to produce more reliable outcomes. With repeated measurement of each condition for each participant, it is possible to obtain a more precise estimate of performance in each condition. Psychophysics research takes this principle to the extreme, testing each subject for hundreds or sometimes thousands of trials in each condition. Such studies can test as few as 2 or 3 participants and still obtain highly reliable results.
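
As a rough illustration of why repeated measurement helps, the sketch below computes the standard error of one participant's mean in a single condition as the number of trials grows. It assumes independent trials with equal variance, and the trial-to-trial standard deviation (100, in arbitrary units such as milliseconds) is a hypothetical value, not one taken from any study discussed here.

```python
import math

def se_of_condition_mean(trial_sd, n_trials):
    """Standard error of one participant's mean in one condition,
    assuming independent trials with a common standard deviation."""
    return trial_sd / math.sqrt(n_trials)

trial_sd = 100.0  # hypothetical trial-to-trial SD (e.g., in ms)

# One-shot design (1 trial) versus increasingly psychophysics-like designs
for n_trials in (1, 25, 400):
    se = se_of_condition_mean(trial_sd, n_trials)
    print(f"{n_trials:4d} trials per condition -> SE of the mean = {se:.1f}")
```

Under these assumptions the precision improves with the square root of the number of trials, which is one way to see how a psychophysics study can get by with a handful of participants.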

Rolf correctly notes that many of the findings that have been questioned are between-subject designs with one measurement for each participant. As someone who regularly does such studies myself (e.g., inattentional blindness studies often take that form), I'm aware of the dangers that small samples might cause. And, I'm also aware that my selection of stimuli or tasks might limit the generality of my findings.

Rolf's larger point, though, is that it should be considered a direct replication to vary things in ways that are consistent with the theory that governs the study itself. That is because it is impossible to generalize across items if there is only one item type in a design. Fair enough. That said, I think we still disagree on the contrast between direct and conceptual replication. Direct replication tests whether we can obtain the same results with the same task. Yes, changing some aspect of the design (using a different lineup, for example) would allow us to test the generality of a finding across items. You could construe that as a direct replication, but imagine if it didn't work after changing that element. Does that mean the original result did not replicate? I would argue no. It does not test whether the original result replicates. Rather it tests whether the original result generalizes to other materials. That's crucial to know, but it's a different question.

In sum, Rolf and I appear to be in complete agreement that we need to know both whether an effect is reliable and whether it applies more broadly. But, before we begin testing whether the effect generalizes to other materials, it seems worthwhile to first make sure the original effect with the original items is robust. Perhaps those approaches could be combined in some cases, adding conditions to test the generalizability question at the same time that we test the reliability question (we're considering that approach with another protocol now, actually). 

Monday, June 3, 2013

Direct replication of imperfect studies - what's the right approach?

In a recent blog post, +Rolf Zwaan riffed on the benefits and limits of direct replications of the sort +Alex Holcombe, +Bobbie Spellman, and I are supporting at Perspectives on Psychological Science. He correctly claims that direct replications are focused on assessing the reliability of individual findings rather than testing the validity of the theory they were intended to support. By definition, a direct replication provides no better support for the theory than the original did. And, if the original study has flaws (and what study doesn't), then that support might be limited in important ways. The goal of direct replication is not to evaluate theory. Rather, it is to determine whether a particular finding is robust. In my view, such measures of reliability have been sorely lacking in the field, whereas we have no shortage of conceptual replications that are better suited to exploring the generality of findings.

We've encountered cases in which a lab submits a proposal to replicate an original study, and either we as the editors or those submitting the proposal spot a critical flaw in the original design. The problem is that, in hindsight, scientists typically can pick holes in most studies, finding problems that limit their generalizability or validity. For example, Zwaan correctly notes that the original control condition in the verbal overshadowing task does not equate the tasks required of participants across the experimental and control conditions. Generating a description of a face is not the same as generating a list of states/capitals, and a better control might ask participants to describe something other than the perpetrator. Maybe it's the act of writing a description (any description) that reduces memory accuracy. If so, then replicating the original finding would not validate the model of memory in which verbal encoding of the target object interferes with visual memory. True enough. As Zwaan argues, direct replication assesses the reliability of a finding, not the validity of the claims based on that finding.

One approach we've considered taking is to plug known design holes as best we can in developing the new protocol, provided we can maintain the original intent while conducting a direct replication. The problem is that there are many ways to improve a design, and any change of that scope risks shifting into the mode of conceptual replication rather than direct replication. For the verbal overshadowing studies, I have seen at least half a dozen different suggestions for ways that labs wanted to improve the design. But, by allowing labs to adopt their own improvements, we risk abandoning the primary goal of measuring the reliability of a finding. 

On occasion, we may have to tweak the original design. For example, in the protocol to replicate Schooler's verbal overshadowing effect, we had to change the control condition slightly in order to permit replication in multiple countries. Rather than listing US States and their capitals, participants in all of the replication studies will instead list countries and capitals. That way, we can make the study comparable across all countries. We vetted that change with Schooler, and it does not change the design in a meaningful way, so we can still consider the protocol to represent a direct replication (although it is not an exact replication).

Another approach we might well take in the future would be to design a study that optimally tests a theory, one that builds on the earlier study but tweaks the design to improve it (such studies could arise from an antagonistic collaboration). We could then undertake the same multiple-lab process to estimate the population effect size. Some of the protocols might well be a hybrid of this sort, provided that the original authors or their surrogates believe the new design to be an improved version of the original that will optimize the size of the effect.

Zwaan's interesting blog post suggests an approach to solving the validity problem, but it's one that I fear would not help solve the reliability problem:
This protocol might be like a script (of the Schank & Abelson kind) with slots for things like stimuli and subjects. We would then need to specify the criteria for how each slot should be filled. We’d want the slots to be filled slightly differently across studies; this would prevent the effect from being attributable to quirks of the original stimuli and thus enhance the validity of our findings. To stick with verbal overshadowing, across studies we’d want to use different videos. We’d also want to use different line-ups. By specifying the constraints that stimuli and subjects need to meet we would end up with a better understanding of what the theory does and does not claim. 
This approach, allowing each replication to vary provided that they follow a more general script, might not provide a definitive test of the reliability of a finding. Depending on how much each study deviated from the other studies, the studies could veer into the territory of conceptual replication rather than direct replication. Conceptual replications are great, and they do help to determine the generality of a finding, but they don't test the reliability of a finding. If one such study found a particularly large or small effect, we couldn't tell whether that deviation from the average resulted from the differences in the protocol or just resulted from measurement noise. As +Hal Pashler and others have noted, that's the shortcoming of conceptual replications: they do not assess the reliability of an effect because a failure to find the effect can be attributed either to the unreliability of the original effect or to the changes from the original protocol.

There is a role for such variation in the field, of course. But in my view, we already have plenty of such conceptual replication in the literature (although we generally never see the conceptual replications that don't work). What we don't have enough of are direct replications that accurately measure the reliability of a finding. In my view, we need to measure both the reliability and validity of our findings, and we can't just focus on one of those.