3.4 Who to ask

The digital age is making probability sampling in practice harder and is creating new opportunities for non-probability sampling.

In the history of sampling, there have been two competing approaches: probability sampling methods and non-probability sampling methods. Although both approaches were used in the early days of sampling, probability sampling has come to dominate, and many social researchers are taught to view non-probability sampling with great skepticism. However, as I will describe below, changes created by the digital age mean that it is time for researchers to reconsider non-probability sampling. In particular, probability sampling has been getting harder to do in practice, and non-probability sampling has been getting faster, cheaper, and better. Faster and cheaper surveys are not just ends in themselves: they enable new opportunities such as more frequent surveys and larger sample sizes. For example, by using non-probability methods, the Cooperative Congressional Election Study (CCES) is able to have roughly 10 times more participants than earlier studies using probability sampling. This much larger sample enables political researchers to study variation in attitudes and behavior across subgroups and social contexts. Further, all of this added scale came without decreases in the quality of estimates (Ansolabehere and Rivers 2013).

Currently, the dominant approach to sampling for social research is probability sampling. In probability sampling, all members of the target population have a known, nonzero probability of being sampled, and all people who are sampled respond to the survey. When these conditions are met, elegant mathematical results offer provable guarantees about a researcher’s ability to use the sample to make inferences about the target population.

In the real world, however, the conditions underlying these mathematical results are rarely met. For example, there are often coverage errors and nonresponse. Because of these problems, researchers often have to employ a variety of statistical adjustments in order to make inference from their sample to their target population. Thus, it is important to distinguish between probability sampling in theory, which has strong theoretical guarantees, and probability sampling in practice, which offers no such guarantees and depends on a variety of statistical adjustments.

Over time, the differences between probability sampling in theory and probability sampling in practice have been increasing. For example, nonresponse rates have been steadily increasing, even in high-quality, expensive surveys (figure 3.5) (National Research Council 2013; B. D. Meyer, Mok, and Sullivan 2015). Nonresponse rates are much higher in commercial telephone surveys—sometimes even as high as 90% (Kohut et al. 2012). These increases in nonresponse threaten the quality of estimates because the estimates increasingly depend on the statistical models that researchers use to adjust for nonresponse. Further, these decreases in quality have happened despite increasingly expensive efforts by survey researchers to maintain high response rates. Some people fear that these twin trends of decreasing quality and increasing cost threaten the foundation of survey research (National Research Council 2013).

Figure 3.5: Nonresponse has been increasing steadily, even in high-quality, expensive surveys (National Research Council 2013; B. D. Meyer, Mok, and Sullivan 2015). Nonresponse rates are much higher for commercial telephone surveys, sometimes even as high as 90% (Kohut et al. 2012). These long-term trends in nonresponse mean that data collection is more expensive and estimates are less reliable. Adapted from B. D. Meyer, Mok, and Sullivan (2015), figure 1.

At the same time that there have been growing difficulties for probability sampling methods, there have also been exciting developments in non-probability sampling methods. There are a variety of styles of non-probability sampling methods, but the one thing that they have in common is that they cannot easily fit in the mathematical framework of probability sampling (Baker et al. 2013). In other words, in non-probability sampling methods not everyone has a known and nonzero probability of inclusion. Non-probability sampling methods have a terrible reputation among social researchers, and they are associated with some of the most dramatic failures of survey research, such as the Literary Digest fiasco (discussed earlier) and “Dewey Defeats Truman,” the incorrect prediction about the US presidential election of 1948 (figure 3.6).

Figure 3.6: President Harry Truman holding up the headline of a newspaper that had incorrectly announced his defeat. This headline was based in part on estimates from non-probability samples (Mosteller 1949; Bean 1950; Freedman, Pisani, and Purves 2007). Although “Dewey Defeats Truman” happened in 1948, it is still among the reasons that some researchers are skeptical about estimates from non-probability samples. Source: Harry S. Truman Library & Museum.

One form of non-probability sampling that is particularly suited to the digital age is the use of online panels. Researchers using online panels depend on some panel provider—usually a company, government, or university—to construct a large, diverse group of people who agree to serve as respondents for surveys. These panel participants are often recruited using a variety of ad hoc methods such as online banner ads. Then, a researcher can pay the panel provider for access to a sample of respondents with desired characteristics (e.g., nationally representative of adults). These online panels are non-probability methods because not everyone has a known, nonzero probability of inclusion. Although non-probability online panels are already being used by social researchers (e.g., the CCES), there is still some debate about the quality of estimates that come from them (Callegaro et al. 2014).

Despite these debates, I think there are two reasons why the time is right for social researchers to reconsider non-probability sampling. First, in the digital age, there have been many developments in the collection and analysis of non-probability samples. These newer methods are different enough from the methods that caused problems in the past that I think it makes sense to think of them as “non-probability sampling 2.0.” The second reason why researchers should reconsider non-probability sampling is that probability sampling in practice has become increasingly difficult. When there are high rates of non-response—as there are in real surveys now—the actual probabilities of inclusion for respondents are not known, and thus, probability samples and non-probability samples are not as different as many researchers believe.

As I said earlier, non-probability samples are viewed with great skepticism by many social researchers, in part because of their role in some of the most embarrassing failures in the early days of survey research. A clear example of how far we have come with non-probability samples is the research by Wei Wang, David Rothschild, Sharad Goel, and Andrew Gelman (2015) that correctly recovered the outcome of the 2012 US election using a non-probability sample of American Xbox users—a decidedly nonrandom sample of Americans. The researchers recruited respondents from the Xbox gaming system, and as you might expect, the Xbox sample skewed male and skewed young: 18- to 29-year-olds make up 19% of the electorate but 65% of the Xbox sample, and men make up 47% of the electorate but 93% of the Xbox sample (figure 3.7). Because of these strong demographic biases, the raw Xbox data was a poor indicator of election returns. It predicted a strong victory for Mitt Romney over Barack Obama. This is another example of the dangers of raw, unadjusted non-probability samples and is reminiscent of the Literary Digest fiasco.

Figure 3.7: Demographics of respondents in W. Wang et al. (2015). Because respondents were recruited from Xbox, they were more likely to be young and more likely to be male, relative to voters in the 2012 election. Adapted from W. Wang et al. (2015), figure 1.

However, Wang and colleagues were aware of these problems and attempted to adjust for their non-random sampling process when making estimates. In particular, they used post-stratification, a technique that is also widely used to adjust probability samples that have coverage errors and non-response.

The main idea of post-stratification is to use auxiliary information about the target population to help improve the estimate that comes from a sample. When using post-stratification to make estimates from their non-probability sample, Wang and colleagues chopped the population into different groups, estimated the support for Obama in each group, and then took a weighted average of the group estimates to produce an overall estimate. For example, they could have split the population into two groups (men and women), estimated the support for Obama among men and women, and then estimated overall support for Obama by taking a weighted average in order to account for the fact that women make up 53% of the electorate and men 47%. Roughly, post-stratification helps correct for an imbalanced sample by bringing in auxiliary information about the sizes of the groups.
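
To make the two-group example concrete, here is a minimal sketch in Python. The group-level support values are hypothetical and used only for illustration; the population shares (47% men, 53% women) are the ones given above.

```python
# Post-stratification with two groups (men and women), as described above.
# The population shares come from the example in the text; the group-level
# support values are hypothetical.

population_share = {"men": 0.47, "women": 0.53}

# Suppose the (possibly imbalanced) sample yields these group estimates of
# support for Obama:
sample_support = {"men": 0.45, "women": 0.55}  # hypothetical values

# Post-stratified estimate: weight each group's estimate by its share of the
# target population, not by its share of the sample.
estimate = sum(population_share[g] * sample_support[g] for g in population_share)

print(round(estimate, 3))  # 0.503
```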

The key to post-stratification is to form the right groups. If you can chop up the population into homogeneous groups such that the response propensities are the same for everyone in each group, then post-stratification will produce unbiased estimates. In other words, post-stratifying by gender will produce unbiased estimates if all men have the same response propensity and all women have the same response propensity. This assumption is called the homogeneous-response-propensities-within-groups assumption, and I describe it a bit more in the mathematical notes at the end of this chapter.
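
In notation, the estimator being described is the standard post-stratification formula, written here with generic symbols rather than the exact notation of the mathematical notes:

$$\hat{\bar{y}}_{post} = \sum_{h=1}^{H} \frac{N_h}{N} \, \bar{y}_h$$

where $N_h$ is the number of people in group $h$ in the target population, $N$ is the total population size, and $\bar{y}_h$ is the mean response among sampled people in group $h$. If everyone within a group has the same response propensity, then each $\bar{y}_h$ is an unbiased estimate of that group's true mean, and therefore the weighted sum is an unbiased estimate for the whole population.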

Of course, it seems unlikely that the response propensities will be the same for all men and all women. However, the homogeneous-response-propensities-within-groups assumption becomes more plausible as the number of groups increases. Roughly, it becomes easier to chop the population into homogeneous groups if you create more groups. For example, it might seem implausible that all women have the same response propensity, but it might seem more plausible that there is the same response propensity for all women who are aged 18-29, who graduated from college, and who are living in California. Thus, as the number of groups used in post-stratification gets larger, the assumptions needed to support the method become more reasonable. Given this fact, researchers often want to create a huge number of groups for post-stratification. However, as the number of groups increases, researchers run into a different problem: data sparsity. If there are only a small number of people in each group, then the estimates will be more uncertain, and in the extreme case where there is a group that has no respondents, then post-stratification completely breaks down.

There are two ways out of this inherent tension between the plausibility of the homogeneous-response-propensity-within-groups assumption and the demand for reasonable sample sizes in each group. First, researchers can collect a larger, more diverse sample, which helps ensure reasonable sample sizes in each group. Second, they can use a more sophisticated statistical model for making estimates within groups. And, in fact, sometimes researchers do both, as Wang and colleagues did with their study of the election using respondents from Xbox.

Because they were using a non-probability sampling method with computer-administered interviews (I’ll talk more about computer-administered interviews in section 3.5), Wang and colleagues had very inexpensive data collection, which enabled them to collect information from 345,858 unique participants, a huge number by the standards of election polling. This massive sample size enabled them to form a huge number of post-stratification groups. Whereas post-stratification typically involves chopping the population into hundreds of groups, Wang and colleagues divided the population into 176,256 groups defined by gender (2 categories), race (4 categories), age (4 categories), education (4 categories), state (51 categories), party ID (3 categories), ideology (3 categories), and 2008 vote (3 categories). In other words, their huge sample size, which was enabled by low-cost data collection, enabled them to make a more plausible assumption in their estimation process.
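
As a quick check of that arithmetic, the number of post-stratification cells is simply the product of the category counts listed above; a one-line sketch in Python:

```python
# Number of post-stratification cells implied by the categories listed above:
# gender (2) x race (4) x age (4) x education (4) x state (51) x
# party ID (3) x ideology (3) x 2008 vote (3)
cells = 2 * 4 * 4 * 4 * 51 * 3 * 3 * 3
print(cells)  # 176256
```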

Even with 345,858 unique participants, however, there were still many, many groups for which Wang and colleagues had almost no respondents. Therefore, they used a technique called multilevel regression to estimate the support in each group. Essentially, to estimate the support for Obama within a specific group, the multilevel regression pooled information from many closely related groups. For example, imagine trying to estimate the support for Obama among female Hispanics between 18 and 29 years old, who are college graduates, who are registered Democrats, who self-identify as moderates, and who voted for Obama in 2008. This is a very, very specific group, and it is possible that there is nobody in the sample with these characteristics. Therefore, to make estimates about this group, multilevel regression uses a statistical model to pool together estimates from people in very similar groups.
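
The sketch below illustrates the idea of partial pooling that underlies multilevel regression. It is not the actual model used by Wang and colleagues; the cell sizes, support values, and the prior_strength parameter are all hypothetical, and the formula is just a simple shrinkage estimator that pulls sparse cells toward a pooled mean.

```python
# A toy illustration of partial pooling (not the actual multilevel regression
# used by Wang and colleagues). Each cell's estimate is shrunk toward an
# overall mean, and the shrinkage is stronger when the cell has fewer
# respondents.

# Hypothetical cells: (number of respondents, observed support for Obama)
cells = {
    "cell_A": (500, 0.62),   # large cell: little shrinkage
    "cell_B": (12, 0.80),    # small cell: strong shrinkage
    "cell_C": (0, None),     # empty cell: falls back to the pooled mean
}

overall_mean = 0.52      # hypothetical pooled estimate across similar cells
prior_strength = 50      # acts like 50 "pseudo-respondents" at the pooled mean

def pooled_estimate(n, y_bar):
    if n == 0:
        return overall_mean          # no data: rely entirely on the pooled mean
    weight = n / (n + prior_strength)
    return weight * y_bar + (1 - weight) * overall_mean

for name, (n, y_bar) in cells.items():
    print(name, round(pooled_estimate(n, y_bar), 3))
```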

Thus, Wang and colleagues used an approach that combined multilevel regression and post-stratification, so they called their strategy multilevel regression with post-stratification or, more affectionately, “Mr. P.” When Wang and colleagues used Mr. P. to make estimates from the Xbox non-probability sample, they produced estimates very close to the overall support that Obama received in the 2012 election (figure 3.8). In fact, their estimates were more accurate than an aggregate of traditional public opinion polls. Thus, in this case, statistical adjustments—specifically Mr. P.—seem to do a good job of correcting the biases in non-probability data, biases that were clearly visible in the estimates from the unadjusted Xbox data.
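
To show how the two stages fit together, here is a highly simplified sketch. A plain logistic regression from scikit-learn stands in for the multilevel model, and the respondent data and cell population shares are hypothetical; only the overall structure (model the cells, then post-stratify the model's predictions) follows the approach described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1 ("multilevel regression", here just an ordinary logistic regression):
# fit a model that predicts the outcome from cell characteristics.
# Hypothetical respondent-level data; each row encodes cell membership
# (e.g., gender, age group) and y is 1 if the respondent supports Obama.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y = np.array([1, 1, 0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Stage 2 (post-stratification): predict support in every cell, then weight
# each cell's prediction by its (hypothetical) share of the target population.
cells = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
population_share = np.array([0.30, 0.25, 0.25, 0.20])

cell_predictions = model.predict_proba(cells)[:, 1]
mrp_estimate = np.dot(population_share, cell_predictions)
print(round(mrp_estimate, 3))
```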

Figure 3.8: Estimates from W. Wang et al. (2015). The unadjusted Xbox sample produced inaccurate estimates, but the weighted Xbox sample produced estimates that were more accurate than an average of probability-based telephone surveys. Adapted from W. Wang et al. (2015), figures 2 and 3.

There are two main lessons from the study of Wang and colleagues. First, unadjusted non-probability samples can lead to bad estimates; this is a lesson that many researchers have heard before. The second lesson, however, is that non-probability samples, when analyzed properly, can actually produce good estimates; non-probability samples need not automatically lead to something like the Literary Digest fiasco.

Going forward, if you are trying to decide between a probability sampling approach and a non-probability sampling approach, you face a difficult choice. Sometimes researchers want a quick and rigid rule (e.g., always use probability sampling methods), but it is increasingly difficult to offer such a rule. Probability sampling methods in practice are increasingly expensive and far from the theoretical results that justify their use, while non-probability sampling methods are cheaper and faster but less familiar and more varied. One thing that is clear, however, is that if you are forced to work with non-probability samples or nonrepresentative big data sources (think back to Chapter 2), then there is a strong reason to believe that estimates made using post-stratification and related techniques will be better than unadjusted, raw estimates.