3.4.1 Probability sampling: data collection and data analysis

Weights can undo distortions intentionally caused by the sampling process.

Probability samples are those in which every person has a known, non-zero probability of inclusion. The simplest probability sampling design is simple random sampling, where each person has an equal probability of inclusion. When respondents are selected via simple random sampling with perfect execution (e.g., no coverage error and no non-response), estimation is straightforward because the sample will—on average—be a miniature version of the population.
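To make the idea of "a miniature version on average" concrete, here is a small simulation sketch; it is not from the original text, and the population values are invented. It draws repeated simple random samples from a made-up population and shows that the average of the sample means matches the population mean.

```python
import random

random.seed(42)

# Invented population of 100,000 adults: 1 = unemployed, 0 = employed (6% rate)
population = [1] * 6_000 + [0] * 94_000
true_rate = sum(population) / len(population)

# Draw many simple random samples; every person has equal inclusion probability
sample_means = []
for _ in range(2_000):
    sample = random.sample(population, 1_000)  # SRS without replacement
    sample_means.append(sum(sample) / len(sample))

# Averaged over repeated samples, the sample mean recovers the population mean
print(f"population rate:      {true_rate:.4f}")
print(f"mean of sample means: {sum(sample_means) / len(sample_means):.4f}")
```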

In practice, however, simple random sampling is rarely used. Instead, researchers intentionally sample people with unequal probabilities of inclusion in order to reduce costs and increase accuracy. When probabilities of inclusion differ by design, adjustments are needed to undo the distortions caused by the sampling process. In other words, how we generalize from a sample depends on how that sample was selected.

For example, the Current Population Survey (CPS) is used by the US government to estimate the unemployment rate. Each month about 100,000 people are interviewed, either face-to-face or over the telephone, and the results are used to produce the estimated unemployment rate. Because the government wishes to estimate the unemployment rate in each state, it cannot use a simple random sample of adults; that would yield too few respondents from states with small populations (e.g., Rhode Island) and too many from states with large populations (e.g., California). Instead, the CPS samples people in different states at different rates, a process called stratified sampling with unequal probability of selection. For example, if the CPS wanted 2,000 respondents per state, then adults in Rhode Island would have an inclusion probability roughly 37 times higher than adults in California (Rhode Island: 2,000 respondents per 800,000 adults versus California: 2,000 respondents per 30,000,000 adults). As we will see later, this kind of sampling with unequal probability happens with online sources of data too, but unlike the CPS, the sampling mechanism is usually not known or controlled by the researcher.
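The arithmetic behind that ratio is worth making explicit. The following sketch, using the approximate adult-population figures from the parenthetical above, computes each state's per-person inclusion probability and the ratio between them.

```python
# Inclusion probabilities under the stratified design described above,
# using the approximate adult-population figures from the text
respondents_per_state = 2_000
adults = {"Rhode Island": 800_000, "California": 30_000_000}

prob = {state: respondents_per_state / n for state, n in adults.items()}
for state, p in prob.items():
    print(f"{state}: inclusion probability = {p:.6f} (1 in {1 / p:,.0f})")

# An adult in Rhode Island is 37.5 times more likely to be sampled
print(f"ratio: {prob['Rhode Island'] / prob['California']:.1f}")
```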

Given its sampling design, the CPS is not directly representative of the US; it includes too many people from Rhode Island and too few from California. Therefore, it would be unwise to estimate the country's unemployment rate with the raw unemployment rate in the sample. Instead of the sample mean, it is better to take a weighted mean, where the weights account for the fact that people from Rhode Island were more likely to be included than people from California. For example, each person from California would be upweighted—they would count more in the estimate—and each person from Rhode Island would be downweighted—they would count less in the estimate. In essence, you give more voice to the people whom you are less likely to learn about.
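One standard way to implement such a weighted mean is inverse-probability weighting: each respondent receives a weight equal to the inverse of their inclusion probability, and the estimate is the weight-adjusted average (a Horvitz–Thompson/Hájek-style estimator). The sketch below applies it to the two-state toy design; the unemployment outcomes are invented purely for illustration.

```python
# Design-weighted mean with inverse-probability weights.
# The unemployment outcomes below are invented purely for illustration.
samples = [
    # (state, inclusion probability, outcomes: 1 = unemployed, 0 = employed)
    ("Rhode Island", 2_000 / 800_000,    [1] * 100 + [0] * 1_900),  # 5% in sample
    ("California",   2_000 / 30_000_000, [1] * 160 + [0] * 1_840),  # 8% in sample
]

num = 0.0  # running sum of weight * outcome
den = 0.0  # running sum of weights
for state, pi, outcomes in samples:
    weight = 1 / pi  # each respondent stands in for 1/pi people in the population
    for y in outcomes:
        num += weight * y
        den += weight

unweighted = (sum(y for _, _, ys in samples for y in ys)
              / sum(len(ys) for _, _, ys in samples))
print(f"unweighted sample mean: {unweighted:.4f}")  # both states count equally
print(f"weighted estimate:      {num / den:.4f}")   # California dominates, as it should
```

Because each Californian receives a weight of 15,000 while each Rhode Islander receives a weight of 400, the weighted estimate (about 7.9%) sits near California's rate rather than halfway between the two states.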

This toy example illustrates an important but commonly misunderstood point: a sample does not need to be a miniature version of the population in order to produce good estimates. If enough is known about how the data were collected, then that information can be used when making estimates from the sample. The approach I’ve just described—and that I describe mathematically in the technical appendix—falls squarely within the classical probability sampling framework. Now, I’ll show how that same idea can be applied to non-probability samples.