Tuesday, December 18, 2012

The demographics of surveys: Phone vs. Mechanical Turk

Last year, Chris Chabris and I published the results of a national survey of beliefs about memory (in PLoS One). We found that many people agreed with statements about memory and the mind that experts roundly dismiss as wrong. We conducted the survey in 2009 using the polling firm SurveyUSA, with a nominal sample of 1500 respondents representing the population of the United States. SurveyUSA uses random-digit dialing of landline phone numbers, and respondents press keys on their keypad to answer questions read by a recorded voice. Their method is fairly typical of so-called "robotic" polling.

Last summer, we repeated the same survey using a sample drawn from Amazon Mechanical Turk, with the restriction that respondents be from the United States. On Mechanical Turk, workers decide whether they would like to participate and are paid a small amount for a completed survey. Unlike random-digit dialing, the sample on Mechanical Turk is entirely self-selected. The results of that survey and a comparison to our earlier survey just appeared in PLoS One on December 18, 2012.

Just recently I wrote an extended blog post about the nature of survey demographics. To compare these two surveys, we weighted both to a nominal sample of 750 respondents according to the population demographics from the 2010 US Census. Reassuringly, the pattern of results was roughly comparable. In essence, we replicated our generalization to the national population, with many people endorsing mistaken beliefs about memory.

In writing the paper and re-weighting the samples, I discovered something interesting about who responds to these sorts of surveys. Although both could be weighted to match a nationally representative sample, the raw demographics of the two samples were vastly different. They were roughly comparable on most dimensions (e.g., income, education, region of the country), but their age distributions differed dramatically.

[Figure: age distributions of the 2010 US Census, the SurveyUSA phone sample, and the Mechanical Turk sample]

The yellow bars represent data from the 2010 US Census. Note that about half of the adult US population is under 44 and half is over 44. Now look at the blue bars from the SurveyUSA sample of about 1840 people. The first thing to notice is that the phone survey massively oversamples older people and massively undersamples younger ones. In order to generalize to the larger population, each response from a younger respondent is weighted to count many times as much as one from an older respondent. The pattern is almost exactly the opposite for our Mechanical Turk sample (of just under 1000 people): Mechanical Turk respondents were disproportionately young. The age biases in the two samples were roughly comparable in magnitude but opposite in direction, and neither sample was anywhere close to the actual demographics of the US population.

For me, this figure was eye-opening. I wasn't surprised that an online Mechanical Turk sample would be disproportionately younger, and I assumed that phone surveys would oversample the elderly, but I had no idea how extreme that bias would be. What that means is that any national survey conducted by phone is mostly contacting older people. Unless the sample is adequately large, the number of young respondents will be minuscule, meaning that the weight assigned to each of those respondents will be huge. If a small survey happened to get a few oddball younger respondents, it could dramatically alter the total estimate.

As I discussed in this earlier post, pollsters almost never report their raw demographics, essentially hiding how much weighting their survey needed to make it nationally representative. But that information is crucial, especially if the survey compares demographic groups. If you compared young and old respondents in our SurveyUSA study, you would be comparing a small sample to a huge one, and the generalization from the older respondents would be much safer than the one from the younger respondents. Without knowing that, you might assume that each generalization was equally valid.

In our paper, we compared the two surveys by weighting each to match the population, turning each into a representative sample with a nominal size of 750 people (that is, weighting each sample to match what would be expected for a sample of 750 people; see the paper for details, but in short, we dichotomized the age category given the sample sizes). Fortunately, despite these huge deviations from the actual population statistics, each "nationally representative" sample produced comparable results. In a sense, that is the ideal situation: two samples with vastly different demographics produce comparable results when weighted. That means that the different sampling methods and weightings did not dramatically change the pattern of results.
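The basic arithmetic of this kind of weighting can be sketched in a few lines of Python. The numbers below are illustrative placeholders, not the actual distributions from either survey (see the figure and the paper for those), and the age split is dichotomized the way we did it:

```python
# Sketch of post-stratification weighting: each demographic cell's
# respondents must collectively count as (nominal sample size x census
# share of that cell), so each respondent's weight is that target
# divided by the number of respondents actually sampled in the cell.
# All proportions below are made up for illustration.

NOMINAL_N = 750

# Census age distribution, dichotomized (roughly half under 45).
census = {"18-44": 0.50, "45+": 0.50}

def cell_weights(sample_props, sample_n, census_props, nominal_n):
    """Per-respondent weight for each cell: how many respondents of the
    nominal representative sample each actual respondent stands in for."""
    return {
        cell: (nominal_n * census_props[cell]) / (sample_n * sample_props[cell])
        for cell in census_props
    }

# Hypothetical phone sample that skews old: 20% young, 80% older.
phone_w = cell_weights({"18-44": 0.20, "45+": 0.80}, 1840, census, NOMINAL_N)

# With these made-up proportions, each young phone respondent counts
# four times as much as each older one: (0.50/0.20) / (0.50/0.80) = 4.
```

The few young respondents carry large weights, which is exactly why a handful of oddball younger respondents in a small phone sample can swing the overall estimate.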

That finding also has practical implications for anyone interested in conducting surveys. One approach to obtaining a more representative sample would be to combine phone and Mechanical Turk samples, counting on Mechanical Turk for younger respondents and the phone survey for older ones.

The next time you read about a survey, ask yourself: Was the sample truly representative, and if not, was the sample large enough to trust the conclusions about different demographic subgroups?