Tuesday, December 18, 2012

The hidden secrets of polling

Pollsters and survey researchers often report the results of "nationally representative" surveys, but in what sense are such surveys representative? The answer, it turns out, is more complicated than it might seem. And, the way surveys are reported obscures a potentially important secret.

The 2012 elections in the United States likely were the most heavily polled in history. Not surprisingly, polls varied in their accuracy, and different polls of the same race often produced discrepant results. Election polling is particularly tricky because there is no ground truth until after the election. Pollsters are trying to predict what people are going to do, an inherently noisy process since you have no way to know for certain whether they will follow through on what they say they will do. The challenge in polling elections is to adjust for the likelihood that people will do what they say they will, and many polling discrepancies are due to differences in that "likely voter" model. (If you are a political junkie like me, that was why some people thought it was necessary to "unskew" the polls; basically, that meant adjusting the likely voter model.)

If your goal, instead, is just to generalize from your sample to the population as a whole, you just need to know the characteristics of the population, a much easier (but still challenging) problem. Suppose I want to know what percentage of the US population owns a bicycle. If I had infinite resources and could work infinitely fast, I could ask every US citizen and tally their answers to get the percentage (much as a classic census would). Far more efficient and cost-effective, though, would be to sample from the population and estimate the population characteristics from the sample. The bigger the sample, the more likely the sample will be representative of the population as a whole, assuming the sample is random. But there's the rub. In practice, no sampling method provides a truly random sample of the population.

For a random sample, we can assume that any one individual is as likely as any other to be included in our survey. That means, with a large sample, the relative proportions of men and women, old people and young, rich and poor, will match those of the population. Roughly 1% of your sample will be in the top 1% of income earners in the USA and 99% won't be. But, if a sample is not truly random, some people will be sampled more than others. That leaves you two options:
  1. Assume that your sample is representative enough and generalize to the full population anyway. That approach actually forms the basis of most generalization from experimental research. People conduct a study by testing a group of college students (known as a sample of convenience) and assume that their sample is representative enough of a larger population. Whether or not that generalization is appropriate depends on the nature of the question. If you sample a group of college students and find that all of them breathe air, generalizing to all humans would be justified. If you sample a group of college students and find that none of them are married, you wouldn't be justified in generalizing to all Americans and concluding that Americans don't marry. Few published journal articles explicitly address the limits of their generalizability and the assumptions made when generalizing, but they should (more on that soon).
  2. Adjust your estimate to account for known discrepancies between your sample and the full population. That's how polling firms solve the problem. They weight the responses they did get to account for oversampling of some groups or undersampling of others. If 10% of your sample falls in the top 1% of incomes in the USA, you would need to weight those responses less and weight the responses of the respondents reporting lower incomes more heavily. If you know that Democrats outnumber Republicans in a region, but your sample includes more Republicans than Democrats, you need to weight the sample to account for the discrepancy. That's where much of the fighting emerges in political polling: Were the percentages of each party appropriate? Did the pollsters accurately weight for how likely each group was to vote?
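
The second option can be sketched in a few lines of Python. This is only an illustration of the basic idea (weight each group by its population share divided by its sample share), not any particular polling firm's method. The function names and all of the numbers, including the bicycle ownership counts, are made up for the example; only the "10% of the sample in the top 1% of incomes" scenario comes from the text above.

```python
def poststratification_weights(sample_counts, population_shares):
    """Per-respondent weight for each group: population share / sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }


def weighted_proportion(sample_counts, yes_counts, weights):
    """Weighted share of respondents answering 'yes'."""
    numerator = sum(weights[g] * yes_counts[g] for g in sample_counts)
    denominator = sum(weights[g] * sample_counts[g] for g in sample_counts)
    return numerator / denominator


# Hypothetical sample of 1000 in which 10% of respondents are top-1% earners.
sample_counts = {"top_1_percent": 100, "everyone_else": 900}
population_shares = {"top_1_percent": 0.01, "everyone_else": 0.99}
# Hypothetical answers: how many in each group say they own a bicycle.
bicycle_owners = {"top_1_percent": 80, "everyone_else": 360}

weights = poststratification_weights(sample_counts, population_shares)
raw = sum(bicycle_owners.values()) / sum(sample_counts.values())
weighted = weighted_proportion(sample_counts, bicycle_owners, weights)

print("weights:", weights)
print("raw estimate:", round(raw, 3))
print("weighted estimate:", round(weighted, 3))
```

With these invented numbers, each oversampled high earner counts for about a tenth of a respondent and everyone else counts for slightly more than one, which pulls the estimate from 44% down to about 40%.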

National surveys are not truly random samples of the population. Most operate by selecting area codes (to make sure they sample the right geographical region) and dialing random numbers in those area codes. Some call cell phones, but most call landlines from lists of registered voters or other calling registries. If everyone had a phone, answered it with equal probability, and responded to the survey with equal frequency, then a random-digit-dialed survey would be a random sample of the population. None of those conditions hold in reality. Many people no longer have landlines, especially younger people. Relatively few people respond to surveys (often, the response rates are well under 10%), and those who do respond might differ systematically from those who don't.

Given that polls are not random samples, pollsters weight their samples to match their beliefs about the characteristics of the population as a whole. For political polls, that means weighting to match the demographics of registered voters or likely voters. For surveys of other sorts (e.g., owning a bicycle), that means weighting the sample to match the demographics of the broader population of interest. With a large enough sample, weighting allows you to make your sample conform to the demographics of your target population. If your goal is to generalize to the population, you must adjust your sample to match it. If you do that, your sample will be representative of those population demographics. That does not mean the poll will be accurate, though. Perhaps the old people who did respond were unusual or differed from other old people in systematic ways. If so, then your sampled old people might be a poor stand-in for old people in general, and the inference you draw from your poll might be inaccurate.
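
To make that last point concrete, here is a toy calculation with entirely hypothetical numbers. It assumes the weighting reproduces the population's age shares perfectly, but that the older people who answered the survey own bicycles at twice the rate of older people in general. The group labels, shares, and rates are all invented for illustration.

```python
# Hypothetical population shares that the weighting matches exactly.
population_shares = {"young": 0.6, "old": 0.4}

# Hypothetical true bicycle ownership rates in each group.
true_rate = {"young": 0.5, "old": 0.2}

# Hypothetical rates among those who actually responded: the older
# respondents are unusually likely to own a bicycle.
respondent_rate = {"young": 0.5, "old": 0.4}


def weighted_estimate(shares, rates):
    """Combine group rates using the target population shares."""
    return sum(shares[group] * rates[group] for group in shares)


truth = weighted_estimate(population_shares, true_rate)
estimate = weighted_estimate(population_shares, respondent_rate)

print("true ownership rate:", round(truth, 2))
print("weighted poll estimate:", round(estimate, 2))
```

The demographics are matched exactly, yet the estimate (46%) overshoots the truth (38%), because weighting cannot correct for differences within a group between those who respond and those who don't.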

The first secret of poll reporting: The size of the poll is a convenient fiction. When the media reports that a poll surveyed 2000 people, that is misleading. They almost certainly surveyed more than that and then weighted their poll to match what would be representative of the population with a sample of 2000. The reason that they would have to survey more than 2000 is that some groups are so underrepresented in polling that it would take more than 2000 people to get enough respondents in those groups to estimate how that group as a whole would respond. If you cut the demographic categories too finely, you won't have any respondents from some groups (in a sample of 2000, any group constituting less than 1/4000 of the target population would contribute fewer than half a respondent on average, which is to say most likely none). The number of respondents reported is a "nominal" sample size, not an actual sample size. Pollsters decide in advance what nominal sample size they want and then poll until they obtain a large enough sample in each of their critical demographic groups to be able to weight the responses. Polling firms typically do not report how many responses they need from each demographic group, and they rarely report the total number of people sampled to achieve their nominal sample. And that hints at the second problem.
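
The arithmetic behind the small-group problem can be checked directly. The sketch below assumes an idealized simple random sample, which real polls are not, so treat the numbers as a best case. The function names are hypothetical.

```python
def expected_respondents(sample_size, group_share):
    """Average number of respondents from a group in a simple random sample."""
    return sample_size * group_share


def prob_no_respondents(sample_size, group_share):
    """Chance that a simple random sample contains nobody from the group."""
    return (1 - group_share) ** sample_size


sample_size = 2000
group_share = 1 / 4000  # the threshold mentioned in the text

expected = expected_respondents(sample_size, group_share)
p_zero = prob_no_respondents(sample_size, group_share)

print("expected respondents:", expected)
print("probability of sampling none:", round(p_zero, 2))
```

For a group that makes up 1/4000 of the population, a sample of 2000 contains half a respondent on average, and the chance of getting nobody at all from that group is about 61%.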

The second secret of poll reporting: Pollsters almost never report the raw sample demographics that went into the polling estimate. Instead, they report the results as if their sample were representative. They might report cross-tabs, in which they break down the percentages of each group responding in a particular way (e.g., what percentage of women own bicycles), but they don't report how many women were in their sample or how heavily they had to weight individual responses to make those estimates. In some cases, they might generalize to an entire demographic category from the responses of only a handful of people. Critically, the actual demographics of the sample almost never match the demographics of the target population, and in some cases, they can be dramatically different. That means the pollsters must use fairly extreme weights to achieve a representative sample.
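
To get a feel for why the number of people behind a cross-tab matters, here is a rough sketch using the textbook margin of error for a proportion from a simple random sample. That formula is my assumption for illustration, not something pollsters report using, and it is only a crude approximation for very small groups and for weighted data. The function name and the subgroup sizes are hypothetical.

```python
import math


def margin_of_error(proportion, n, z=1.96):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(proportion * (1 - proportion) / n)


# Hypothetical subgroup sizes behind a cross-tab reporting 50% ownership.
for n in (10, 100, 1000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:4d}: plus or minus {100 * moe:.0f} percentage points")
```

A cross-tab built on 10 respondents carries an uncertainty of roughly 31 percentage points under these assumptions, compared with roughly 3 points for 1000 respondents, and heavy weighting generally makes the uncertainty larger rather than smaller.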

In an upcoming post, I will provide an example of how wildly sample demographics can vary even when they produce comparable results. In a sense, that is the most reassuring case: when widely different samples are weighted to a common standard and lead to the same conclusion, we can be more confident that the result wasn't just due to the sample demographics. Still, whenever you see conclusions about a demographic group in a poll, you should ask yourself (and the pollster) how many people likely formed the basis for that generalization. It might well be a smaller number than you think.