3.4.2 Non-probability samples: weighting

With non-probability samples, weights can undo distortions caused by the assumed sampling process.

In the same way that researchers weight responses from probability samples, they can also weight responses from non-probability samples. For example, as an alternative to the CPS, imagine that you placed banner ads on thousands of websites to recruit participants for a survey to estimate the unemployment rate. Naturally, you would be skeptical that the simple mean of your sample would be a good estimate of the unemployment rate. Your skepticism is probably because you think that some people are more likely to complete your survey than others. For example, people who don’t spend a lot of time on the web are less likely to complete your survey.

As we saw in the last section, however, if we know how the sample was selected—as we do with probability samples—then we can undo distortions caused by the sampling process. Unfortunately, when working with non-probability samples, we don’t know how the sample was selected. But, we can make assumptions about the sampling process and then apply weighting in the same way. If these assumptions are correct, then the weighting will undo the distortions caused by the sampling process.

For example, imagine that in response to your banner ads, you recruited 100,000 respondents. However, you don’t believe that these 100,000 respondents are a simple random sample of American adults. In fact, when you compare your respondents to the US population, you find that people from some states (e.g., New York) are over-represented and that people from some states (e.g., Alaska) are under-represented. Thus, the unemployment rate of your sample is likely to be a bad estimate of the unemployment rate in the target population.

One way to undo the distortion that happened in the sampling process is to assign weights to each person: lower weights to people from states that are over-represented in the sample (e.g., New York) and higher weights to people from states that are under-represented in the sample (e.g., Alaska). More specifically, each respondent's weight is the ratio of their group's prevalence in the US population to that group's prevalence in your sample. This weighting procedure is called post-stratification, and the idea of weighting should remind you of the example in Section 3.4.1 where respondents from Rhode Island were given less weight than respondents from California. Post-stratification requires that you know enough to put your respondents into groups and to know the proportion of the target population in each group.
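To make this concrete, here is a minimal sketch in Python of how post-stratification weights could be computed for a state-based example like the one above; the shares, respondents, and unemployment indicators are all made up purely for illustration.

```python
# Minimal sketch of post-stratification weighting (illustrative numbers only).
# Each respondent's weight is the group's population share divided by its sample share.

population_share = {"New York": 0.06, "Alaska": 0.002, "Other": 0.938}  # assumed shares

# Hypothetical sample: (state, unemployed indicator) for each respondent
sample = [("New York", 1), ("New York", 0), ("New York", 0),
          ("Alaska", 0),
          ("Other", 1), ("Other", 0), ("Other", 0), ("Other", 0)]

n = len(sample)
sample_share = {g: sum(1 for s, _ in sample if s == g) / n for g in population_share}

# Post-stratification weight for each group
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Weighted estimate of the unemployment rate
weighted_estimate = sum(weights[s] * y for s, y in sample) / sum(weights[s] for s, _ in sample)
naive_estimate = sum(y for _, y in sample) / n

print(f"naive: {naive_estimate:.3f}, post-stratified: {weighted_estimate:.3f}")
```

In this toy example the over-represented New Yorkers get a weight well below 1 and the under-represented "Other" respondents get a weight well above 1, so the weighted mean shifts toward the groups the sample under-covers.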

Although weighting a probability sample and weighting a non-probability sample are mathematically the same (see technical appendix), they work well in different situations. If the researcher has a perfect probability sample (i.e., no coverage error and no non-response), then weighting will produce unbiased estimates for all traits in all cases. This strong theoretical guarantee is why advocates of probability samples find them so attractive. On the other hand, weighting a non-probability sample will only produce unbiased estimates for all traits if the response propensities are the same for everyone in each group. In other words, thinking back to our example, post-stratification will produce unbiased estimates if everyone in New York has the same probability of participating, everyone in Alaska has the same probability of participating, and so on. This assumption is called the homogeneous-response-propensities-within-groups assumption, and it plays a key role in determining whether post-stratification will work well with non-probability samples.
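In symbols, the post-stratification estimator that both approaches rely on can be written as follows (notation introduced here only for illustration):

$$
\hat{\bar{y}}_{\text{post}} = \sum_{g=1}^{G} \frac{N_g}{N} \, \bar{y}_g
$$

where $N_g$ is the number of people in group $g$ in the target population, $N = \sum_g N_g$, and $\bar{y}_g$ is the mean outcome among the respondents in group $g$. The estimate is unbiased whenever each group's respondent mean is, on average, equal to that group's true mean, which is exactly what the homogeneous-response-propensities-within-groups assumption guarantees.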

Unfortunately, in our example, the homogeneous-response-propensities-within-groups assumption seems unlikely to be true. That is, it seems unlikely that everyone in Alaska has the same probability of being in your survey. But, there are three important points to keep in mind about post-stratification, all of which make it seem more promising.

First, the homogeneous-response-propensities-within-groups assumption becomes more plausible as the number of groups increases. And, researchers are not limited to groups based on a single geographic dimension. For example, we could create groups based on state, age, sex, and level of education. It seems more plausible that response propensities are homogeneous within the group of 18-29 year old female college graduates living in Alaska than within the group of all people living in Alaska. Thus, as the number of groups used for post-stratification increases, the assumptions needed to support it become more reasonable. Given this fact, it seems like researchers would want to create a huge number of groups for post-stratification. But, as the number of groups increases, researchers run into a different problem: data sparsity. If there are only a small number of people in each group, then the estimates will be more uncertain, and in the extreme case where a group has no respondents, post-stratification completely breaks down. There are two ways out of this inherent tension between the plausibility of the homogeneous-response-propensities-within-groups assumption and the demand for reasonable sample sizes in each group. One approach is to move to a more sophisticated statistical model for calculating weights, and the other is to collect a larger, more diverse sample, which helps ensure reasonable sample sizes in each group. And, sometimes researchers do both, as I'll describe in more detail below.
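To get a rough sense of how quickly sparsity bites, here is a small back-of-the-envelope calculation in Python; the category counts come from the state, age, sex, and education example above, and the 100,000 respondents come from the hypothetical banner-ad survey.

```python
# Rough illustration of the tension between more groups and data sparsity.
# Crossing the grouping variables multiplies the number of cells.

state, age, sex, education = 51, 4, 2, 4   # assumed number of categories per variable
n_respondents = 100_000                    # sample size from the banner-ad example

n_groups = state * age * sex * education
print(n_groups)                            # 1632 groups
print(n_respondents / n_groups)            # ~61 respondents per group on average

# The average hides the problem: respondents are not spread evenly, so many
# cells (e.g., small states crossed with rare education levels) will contain
# only a handful of respondents, or none at all, and post-stratification
# breaks down entirely for empty cells.
```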

A second consideration when working with post-stratification from non-probability samples is that the homogeneous-response-propensities-within-groups assumption is already frequently made when analyzing probability samples. The reason that this assumption is needed for probability samples in practice is that probability samples have non-response, and the most common method for adjusting for non-response is post-stratification as described above. Of course, just because many researchers make a certain assumption doesn't mean that you should do it too. But, it does mean that when comparing non-probability samples to probability samples in practice, we must keep in mind that both depend on assumptions and auxiliary information in order to produce estimates. In most realistic settings, there is simply no assumption-free approach to inference.

Finally, if you care about one estimate in particular (in our example, the unemployment rate), then you need a condition weaker than the homogeneous-response-propensities-within-groups assumption. Specifically, you don't need to assume that everyone has the same response propensity; you only need to assume that within each group there is no correlation between response propensity and the unemployment rate. Of course, even this weaker condition will not hold in some situations. For example, imagine estimating the proportion of Americans who do volunteer work. If people who do volunteer work are more likely to agree to be in a survey, then researchers will systematically over-estimate the amount of volunteering, even if they do post-stratification adjustments, a result that has been demonstrated empirically by Abraham, Helms, and Presser (2009).
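One way to see why this weaker condition is enough is a standard approximation for non-response bias within a single group (again, notation introduced only for illustration):

$$
E[\bar{y}_{g,\text{resp}}] - \bar{Y}_g \approx \frac{\operatorname{cov}_g(\phi_i, y_i)}{\bar{\phi}_g}
$$

where $\phi_i$ is person $i$'s response propensity, $y_i$ is their outcome (e.g., unemployed or not), and $\bar{\phi}_g$ is the average propensity in group $g$. If this covariance is zero in every group, then each group's respondent mean is approximately unbiased even when propensities vary from person to person, and the post-stratified estimate of that one outcome inherits this property.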

As I said earlier, non-probability samples are viewed with great skepticism by social scientists, in part because of their role in some of the most embarrassing failures in the early days of survey research. A clear example of how far we have come with non-probability samples is the research of Wei Wang, David Rothschild, Sharad Goel, and Andrew Gelman, which correctly recovered the outcome of the 2012 US election using a non-probability sample of American Xbox users, a decidedly non-random sample of Americans (Wang et al. 2015). The researchers recruited respondents from the Xbox gaming system, and as you might expect, the Xbox sample skewed male and skewed young: 18-29 year olds make up 19% of the electorate but 65% of the Xbox sample, and men make up 47% of the electorate but 93% of the Xbox sample (Figure 3.4). Because of these strong demographic biases, the raw Xbox data was a poor indicator of election returns: it predicted a strong victory for Mitt Romney over Barack Obama. Again, this is another example of the dangers of raw, unadjusted non-probability samples and is reminiscent of the Literary Digest fiasco.

Figure 3.4: Demographics of respondents in Wang et al. (2015). Because respondents were recruited from XBox, they were more likely to be young and more likely to be male, relative to voters in the 2012 election.

However, Wang and colleagues were aware of these problems and attempted to weight the respondents to correct for the sampling process. In particular, they used a more sophisticated form of the post-stratification I told you about. It is worth learning a bit more about their approach because it builds intuition about post-stratification, and the particular version Wang and colleagues used is one of the most exciting approaches to weighting non-probability samples.

In our simple example about estimating unemployment in Section 3.4.1, we divided the population into groups based on state of residence. In contrast, Wang and colleagues divided the population into 176,256 groups defined by: gender (2 categories), race (4 categories), age (4 categories), education (4 categories), state (51 categories), party ID (3 categories), ideology (3 categories), and 2008 vote (3 categories). With more groups, the researchers hoped that it would be increasingly likely that within each group, response propensity was uncorrelated with support for Obama. Next, rather than constructing individual-level weights, as we did in our example, Wang and colleagues used a complex model to estimate the proportion of people in each group that would vote for Obama. Finally, they combined these group estimates of support with the known size of each group to produce an estimated overall level of support. In other words, they chopped up the population into different groups, estimated the support for Obama in each group, and then took a weighted average of the group estimates to produce an overall estimate.
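As a quick check on that group count, and to make the "weighted average of group estimates" idea concrete, here is a short Python sketch; the three-group example at the end uses made-up shares and support levels, not the actual estimates from Wang et al. (2015).

```python
# The number of post-stratification cells is the product of the category counts.
categories = {"gender": 2, "race": 4, "age": 4, "education": 4,
              "state": 51, "party ID": 3, "ideology": 3, "2008 vote": 3}

n_groups = 1
for count in categories.values():
    n_groups *= count
print(n_groups)  # 176256

# The overall estimate is a weighted average of group-level estimates,
# with each group weighted by its known share of the target population.
def poststratified_estimate(group_shares, group_estimates):
    """group_shares[g] is the population share of group g;
    group_estimates[g] is the estimated support for Obama in group g."""
    return sum(group_shares[g] * group_estimates[g] for g in group_shares)

# Tiny hypothetical example with three groups (shares and estimates are made up):
shares = {"A": 0.5, "B": 0.3, "C": 0.2}
estimates = {"A": 0.55, "B": 0.48, "C": 0.40}
print(poststratified_estimate(shares, estimates))  # 0.499
```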

Thus, the big challenge in their approach is to estimate the support for Obama in each of these 176,256 groups. Although their panel included 345,858 unique participants, a huge number by the standards of election polling, there were many, many groups for which Wang and colleagues had almost no respondents. Therefore, to estimate the support in each group they used a technique called multilevel regression with post-stratification, which researchers affectionately call Mr. P. Essentially, to estimate the support for Obama within a specific group, Mr. P. pools information from many closely related groups. For example, consider the challenge of estimating the support for Obama among female Hispanics between 18 and 29 years old who are college graduates, who are registered Democrats, who self-identify as moderates, and who voted for Obama in 2008. This is a very, very specific group, and it is possible that there is nobody in the sample with these characteristics. Therefore, to make estimates about this group, Mr. P. pools together estimates from people in very similar groups.
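Schematically, and with notation introduced only for illustration, Mr. P. swaps the raw group means of simple post-stratification for model-based estimates:

$$
\widehat{\text{support}} = \frac{\sum_{g} N_g \, \hat{\theta}_g}{\sum_{g} N_g}
$$

where $\hat{\theta}_g$ is the multilevel model's estimate of support for Obama in group $g$ and $N_g$ is that group's size in the target population. Because the multilevel model partially pools each group's estimate toward the estimates for related groups (for example, groups that share the same state, age category, or party ID), even cells with few or no respondents receive a sensible $\hat{\theta}_g$.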

Using this analysis strategy, Wang and colleagues were able to use the Xbox non-probability sample to very closely estimate the overall support that Obama received in the 2012 election (Figure 3.5). In fact, their estimates were more accurate than an aggregate of public opinion polls. Thus, in this case, weighting (specifically Mr. P.) seems to do a good job correcting the biases in non-probability data, biases that are visible when you look at the estimates from the unadjusted Xbox data.

Figure 3.5: Estimates from Wang et al. (2015). Unadjusted XBox sample produced inaccurate estimates. But, the weighted XBox sample produced estimates that were more accurate than an average of probability-based telephone surveys.

There are two main lessons from the study of Wang and colleagues. First, unadjusted non-probability samples can lead to bad estimates; this is a lesson that many researchers have heard before. However, the second lesson is that non-probability samples, when weighted properly, can actually produce quite good estimates. In fact, their estimates were more accurate than the estimates from pollster.com, an aggregation of more traditional election polls.

Finally, there are important limitations to what we can learn from this one specific study. Just because post-stratification worked well in this particular case, there is no guarantee that it will work well in other cases. In fact, elections are perhaps one of the easiest settings because pollsters have been studying elections for almost 100 years, there is regular feedback (we can see who wins the elections), and party identification and demographic characteristics are relatively predictive of voting. At this point, we lack solid theory and empirical experience to know when weighting adjustments to non-probability samples will produce sufficiently accurate estimates. One thing that is clear, however, is that if you are forced to work with non-probability samples, then there is strong reason to believe that adjusted estimates will be better than non-adjusted estimates.