2.4.3 Approximating experiments

We can approximate experiments that we haven’t done or can’t do. Two approaches that especially benefit from big data sources are natural experiments and matching.

Some important scientific and policy questions are causal. For example, what is the effect of a job training program on wages? A researcher attempting to answer this question might compare the earnings of people who signed up for training to those who didn’t. But how much of any difference in wages between these groups is because of the training and how much is because of preexisting differences between the people who sign up and those who don’t? This is a difficult question, and it is one that doesn’t automatically go away with more data. In other words, the concern about possible preexisting differences arises no matter how many workers are in your data.

In many situations, the strongest way to estimate the causal effect of some treatment, such as job training, is to run a randomized controlled experiment where a researcher randomly delivers the treatment to some people and not others. I’ll devote all of chapter 4 to experiments, so here I’m going to focus on two strategies that can be used with non-experimental data. The first strategy depends on looking for something happening in the world that randomly (or nearly randomly) assigns the treatment to some people and not others. The second strategy depends on statistically adjusting non-experimental data in an attempt to account for preexisting differences between those who did and did not receive the treatment.

A skeptic might claim that both of these strategies should be avoided because they require strong assumptions, assumptions that are difficult to assess and that, in practice, are often violated. While I am sympathetic to this claim, I think it goes a bit too far. It is certainly true that it is difficult to reliably make causal estimates from non-experimental data, but I don’t think that means that we should never try. In particular, non-experimental approaches can be helpful if logistical constraints prevent you from conducting an experiment or if ethical constraints mean that you do not want to run an experiment. Further, non-experimental approaches can be helpful if you want to take advantage of data that already exist in order to design a randomized controlled experiment.

Before proceeding, it is also worth noting that making causal estimates is one of the most complex topics in social research, and one that can lead to intense and emotional debate. In what follows, I will provide an optimistic description of each approach in order to build intuition about it, then I will describe some of the challenges that arise when using that approach. Further details about each approach are available in the materials at the end of this chapter. If you plan to use either of these approaches in your own research, I highly recommend reading one of the many excellent books on causal inference (Imbens and Rubin 2015; Pearl 2009; Morgan and Winship 2014).

One approach to making causal estimates from non-experimental data is to look for an event that has randomly assigned a treatment to some people and not to others. These situations are called natural experiments. One of the clearest examples of a natural experiment comes from the research of Joshua Angrist (1990) measuring the effect of military service on earnings. During the war in Vietnam, the United States increased the size of its armed forces through a draft. In order to decide which citizens would be called into service, the US government held a lottery. Every birth date was written on a piece of paper, and, as shown in figure 2.7, these pieces of paper were selected one at a time in order to determine the order in which young men would be called to serve (young women were not subject to the draft). Based on the results, men born on September 14 were called first, men born on April 24 were called second, and so on. Ultimately, in this lottery, men born on 195 different days were drafted, while men born on 171 days were not.

Figure 2.7: Congressman Alexander Pirnie (R-NY) drawing the first capsule for the Selective Service draft on December 1, 1969. Joshua Angrist (1990) combined the draft lottery with earnings data from the Social Security Administration to estimate the effect of military service on earnings. This is an example of research using a natural experiment. Source: U.S. Selective Service System (1969)/Wikimedia Commons.

Although it might not be immediately apparent, a draft lottery has a critical similarity to a randomized controlled experiment: in both situations, participants are randomly assigned to receive a treatment. In order to study the effect of this randomized treatment, Angrist took advantage of an always-on big data system: the US Social Security Administration, which collects information on virtually every American’s earnings from employment. By combining the information about who was randomly selected in the draft lottery with the earnings data that were collected in governmental administrative records, Angrist concluded that the earnings of veterans were about 15% less than the earnings of comparable non-veterans.
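
To make this research design concrete, here is a minimal sketch in Python. The data are simulated and the column names are hypothetical, so it illustrates the structure of the comparison rather than Angrist’s actual analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical example with simulated data (not Angrist's actual records).
rng = np.random.default_rng(0)
n = 100_000
# Was this man's birth date called in the lottery?
draft_eligible = rng.integers(0, 2, size=n).astype(bool)
# Eligible men serve with higher probability; some ineligible men volunteer.
served = np.where(draft_eligible, rng.random(n) < 0.5, rng.random(n) < 0.1)
# Assume, for illustration only, that serving lowers later earnings.
earnings = 30_000 - 3_000 * served + rng.normal(0, 10_000, size=n)

df = pd.DataFrame({"draft_eligible": draft_eligible,
                   "served": served,
                   "earnings": earnings})

# Because eligibility was assigned by lottery, this difference in means is a
# fair comparison: it estimates the effect of *being drafted* (the
# intention-to-treat effect), not yet the effect of *serving*.
itt = (df.loc[df["draft_eligible"], "earnings"].mean()
       - df.loc[~df["draft_eligible"], "earnings"].mean())
print(f"Earnings gap by draft eligibility: {itt:,.0f}")
```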

As this example illustrates, sometimes social, political, or natural forces assign treatments in a way that can be leveraged by researchers, and sometimes the effects of these treatments are captured in always-on big data sources. This research strategy can be summarized as follows:

\[\text{random (or as if random) variation} + \text{always-on data} = \text{natural experiment}\]

To illustrate this strategy in the digital age, let’s consider a study by Alexandre Mas and Enrico Moretti (2009) that tried to estimate the effect of working with productive colleagues on a worker’s productivity. Before seeing the results, it is worth pointing out that there are conflicting expectations that you might have. On the one hand, you might expect that working with productive colleagues would lead a worker to increase her productivity because of peer pressure. Or, on the other hand, you might expect that having hard-working peers might lead a worker to slack off because the work will be done by her peers anyway. The clearest way to study peer effects on productivity would be a randomized controlled experiment where workers are randomly assigned to shifts with workers of different productivity levels and then the resulting productivity is measured for everyone. Researchers, however, do not control the schedule of workers in any real business, and so Mas and Moretti had to rely on a natural experiment involving cashiers at a supermarket.

In this particular supermarket, because of the way that scheduling was done and the way that shifts overlapped, each cashier had different co-workers at different times of day. Further, in this particular supermarket, the assignment of cashiers was unrelated to the productivity of their peers or how busy the store was. In other words, even though the scheduling of cashiers was not determined by a lottery, it was as if workers were sometimes randomly assigned to work with high (or low) productivity peers. Fortunately, this supermarket also had a digital-age checkout system that tracked the items that each cashier was scanning at all times. From this checkout log data, Mas and Moretti were able to create a precise, individual, and always-on measure of productivity: the number of items scanned per second. Combining these two things—the naturally occurring variation in peer productivity and the always-on measure of productivity—Mas and Moretti estimated that if a cashier was assigned co-workers who were 10% more productive than average, her productivity would increase by 1.5%. Further, they used the size and richness of their data to explore two important issues: the heterogeneity of this effect (For which kinds of workers is the effect larger?) and the mechanisms behind the effect (Why does having high-productivity peers lead to higher productivity?). We will return to these two important issues—heterogeneity of treatment effects and mechanisms—in chapter 4 when we discuss experiments in more detail.
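
As a rough sketch of how such an analysis might look (a simplified illustration with hypothetical column names, not Mas and Moretti’s actual specification), one could regress a cashier’s measured productivity on the average productivity of the co-workers present, including worker fixed effects so that the estimate comes from within-worker variation in peers:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical checkout-log data: one row per cashier per ten-minute
# interval, with columns worker_id, store_hour, items_per_second, and
# peer_mean_productivity (the average productivity of the co-workers
# scheduled in the same interval).
df = pd.read_csv("checkout_log.csv")

# Worker fixed effects absorb stable differences between cashiers, so the
# coefficient on peer_mean_productivity comes from within-worker variation
# in which peers happen to be on shift.
model = smf.ols(
    "items_per_second ~ peer_mean_productivity + C(worker_id) + C(store_hour)",
    data=df,
).fit()
print(model.params["peer_mean_productivity"])
```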

Generalizing from these two studies, table 2.3 summarizes other studies that have this same structure: using an always-on data source to measure the effect of some random variation. In practice, researchers use two different strategies for finding natural experiments, both of which can be fruitful. Some researchers start with an always-on data source and look for random events in the world; others start with a random event in the world and look for data sources that capture its impact.

Table 2.3: Examples of Natural Experiments Using Big Data Sources
Substantive focus | Source of natural experiment | Always-on data source | Reference
Peer effects on productivity | Scheduling process | Checkout data | Mas and Moretti (2009)
Friendship formation | Hurricanes | Facebook | Phan and Airoldi (2015)
Spread of emotions | Rain | Facebook | Lorenzo Coviello et al. (2014)
Peer-to-peer economic transfers | Earthquake | Mobile money data | Blumenstock, Fafchamps, and Eagle (2011)
Personal consumption behavior | 2013 US government shutdown | Personal finance data | Baker and Yannelis (2015)
Economic impact of recommender systems | Various | Browsing data at Amazon | Sharma, Hofman, and Watts (2015)
Effect of stress on unborn babies | 2006 Israel–Hezbollah war | Birth records | Torche and Shwed (2015)
Reading behavior on Wikipedia | Snowden revelations | Wikipedia logs | Penney (2016)
Peer effects on exercise | Weather | Fitness trackers | Aral and Nicolaides (2017)

In the discussion so far about natural experiments, I’ve left out an important point: going from what nature has provided to what you want can sometimes be quite tricky. Let’s return to the Vietnam draft example. In this case, Angrist was interested in estimating the effect of military service on earnings. Unfortunately, military service was not randomly assigned; rather, it was being drafted that was randomly assigned. However, not everyone who was drafted served (there were a variety of exemptions), and not everyone who served was drafted (people could volunteer to serve). Because being drafted was randomly assigned, a researcher can estimate the effect of being drafted for all men in the draft. But Angrist didn’t want to know the effect of being drafted; he wanted to know the effect of serving in the military. To make this estimate, however, additional assumptions and complications are required. First, researchers need to assume that the only way that being drafted impacted earnings is through military service, an assumption called the exclusion restriction. This assumption could be wrong if, for example, men who were drafted stayed in school longer in order to avoid serving or if employers were less likely to hire men who were drafted. In general, the exclusion restriction is a critical assumption, and it is usually hard to verify. Even if the exclusion restriction is correct, it is still impossible to estimate the effect of service on all men. Instead, it turns out that researchers can only estimate the effect on a specific subset of men called compliers (men who would serve when drafted, but would not serve when not drafted) (Angrist, Imbens, and Rubin 1996). Compliers, however, were not the original population of interest.

Notice that these problems arise even in the relatively clean case of the draft lottery. A further set of complications arises when the treatment is not assigned by a physical lottery. For example, in Mas and Moretti’s study of cashiers, additional questions arise about the assumption that the assignment of peers is essentially random. If this assumption were strongly violated, it could bias their estimates.

To conclude, natural experiments can be a powerful strategy for making causal estimates from non-experimental data, and big data sources increase our ability to capitalize on natural experiments when they occur. However, it will probably require great care—and sometimes strong assumptions—to go from what nature has provided to the estimate that you want.
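
To make the logic of the draft example concrete, here is the standard instrumental-variables (Wald) estimator for the effect on compliers, taken from the literature cited above (Angrist, Imbens, and Rubin 1996); this is the textbook formula, not a new result specific to Angrist’s data:

\[\widehat{\text{effect of service on compliers}} = \frac{\bar{Y}_{\text{drafted}} - \bar{Y}_{\text{not drafted}}}{\bar{D}_{\text{drafted}} - \bar{D}_{\text{not drafted}}}\]

where \(\bar{Y}\) denotes average earnings and \(\bar{D}\) denotes the share of men who actually served in each lottery group. The numerator is the easy-to-estimate effect of being drafted, and the denominator rescales it by how much the lottery changed the probability of serving; under the exclusion restriction, this ratio recovers the effect of military service for compliers.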

The second strategy I’d like to tell you about for making causal estimates from non-experimental data depends on statistically adjusting non-experimental data in an attempt to account for preexisting differences between those who did and did not receive the treatment. There are many such adjustment approaches, but I’ll focus on one called matching. In matching, the researcher looks through non-experimental data to create pairs of people who are similar except that one has received the treatment and one has not. In the process of matching, researchers are actually also pruning; that is, discarding cases where there is no obvious match. Thus, this method would be more accurately called matching-and-pruning, but I’ll stick with the traditional term: matching.
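
As a minimal sketch of the idea (with hypothetical file and column names; real matching workflows involve many more choices), exact matching-and-pruning might look like this:

```python
import pandas as pd

# Hypothetical non-experimental data: one row per person, with a treatment
# indicator, an outcome, and covariates to match on exactly.
df = pd.read_csv("observational_data.csv")
match_vars = ["age_group", "education", "region"]   # hypothetical covariates

treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]

# Exact matching: join treated cases to control cases with identical values
# on the matching variables. (This pairs each treated case with every exact
# match; one-to-one matching would instead pick a single control per case.)
pairs = treated.merge(control, on=match_vars, suffixes=("_t", "_c"))

# Pruning happens implicitly: cases with no exact match are discarded.
effect = (pairs["outcome_t"] - pairs["outcome_c"]).mean()
print(f"Average within-pair difference in outcomes: {effect:.2f}")
```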

One example of the power of matching strategies with massive non-experimental data sources comes from research on consumer behavior by Liran Einav and colleagues (2015). They were interested in auctions taking place on eBay, and in describing their work, I’ll focus on the effect of auction starting price on auction outcomes, such as the sale price or the probability of a sale.

The most naive way to estimate the effect of starting price on sale price would be to simply calculate the final price for auctions with different starting prices. This approach would be fine if you wanted to predict the sale price given the starting price. But if your question concerns the effect of the starting price, then this approach will not work because it is not based on fair comparisons; the auctions with lower starting prices might be quite different from those with higher starting prices (e.g., they might be for different types of goods or include different types of sellers).

If you are already aware of the problems that can arise when making causal estimates from non-experimental data, you might skip the naive approach and consider running a field experiment where you would sell a specific item—say, a golf club—with a fixed set of auction parameters—say, free shipping and auction open for two weeks—but with randomly assigned starting prices. By comparing the resulting market outcomes, this field experiment would offer a very clear measurement of the effect of starting price on sale price. But this measurement would only apply to one particular product and set of auction parameters. The results might be different, for example, for different types of products. Without a strong theory, it is difficult to extrapolate from this single experiment to the full range of possible experiments that could have been run. Further, field experiments are sufficiently expensive that it would be infeasible to run every variation that you might want to try.

In contrast to the naive and experimental approaches, Einav and colleagues took a third approach: matching. The main trick in their strategy is to discover things similar to field experiments that have already happened on eBay. For example, figure 2.8 shows some of the 31 listings for exactly the same golf club—a Taylormade Burner 09 Driver—being sold by exactly the same seller—“budgetgolfer.” However, these 31 listings have slightly different characteristics, such as different starting prices, end dates, and shipping fees. In other words, it is as if “budgetgolfer” is running experiments for the researchers.

These listings of the Taylormade Burner 09 Driver being sold by “budgetgolfer” are one example of a matched set of listings, where the exact same item is being sold by the exact same seller, but each time with slightly different characteristics. Within the massive logs of eBay there are literally hundreds of thousands of matched sets involving millions of listings. Thus, rather than comparing the final price for all auctions with a given starting price, Einav and colleagues compared within matched sets. In order to combine results from the comparisons within these hundreds of thousands of matched sets, Einav and colleagues re-expressed the starting price and final price in terms of the reference value of each item (e.g., its average sale price). For example, if the Taylormade Burner 09 Driver had a reference value of $100 (based on its sales), then a starting price of $10 would be expressed as 0.1 and a final price of $120 as 1.2.
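
A short sketch of this re-expression step, assuming a hypothetical table of listings already grouped into matched sets, might look like this:

```python
import pandas as pd

# Hypothetical table of listings already grouped into matched sets
# (same item, same seller), one row per listing.
listings = pd.read_csv("ebay_listings.csv")

# Reference value for each item: here, its average sale price within the
# matched set (one simple choice among several possible reference values).
ref = listings.groupby("matched_set_id")["sale_price"].transform("mean")

# Re-express prices relative to the reference value, as in the text: a $10
# starting price on a $100 item becomes 0.1; a $120 final price becomes 1.2.
listings["norm_start"] = listings["start_price"] / ref
listings["norm_final"] = listings["sale_price"] / ref
```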

Figure 2.8: An example of a matched set. This is the exact same golf club (a Taylormade Burner 09 Driver) being sold by the exact same person (“budgetgolfer”), but some of these sales were performed under different conditions (e.g., different starting prices). Reproduced by permission from Einav et al. (2015), figure 1b.

Recall that Einav and colleagues were interested in the effect of start price on auction outcomes. First, they used linear regression to estimate that higher starting prices decrease the probability of a sale, and that higher starting prices increase the final sale price (conditional on a sale occurring). By themselves, these estimates—which describe a linear relationship and are averaged over all products—are not all that interesting. Then, Einav and colleagues used the massive size of their data to create a variety of more subtle estimates. For example, by estimating the effect separately for a variety of different starting prices, they found that the relationship between starting price and sale price is nonlinear (figure 2.9). In particular, for starting prices between 0.05 and 0.85, the starting price has very little impact on sale price, a finding that was completely missed by their first analysis. Further, rather than averaging over all items, Einav and colleagues estimated the impact of starting price for 23 different categories of items (e.g., pet supplies, electronics, and sports memorabilia) (figure 2.10). These estimates show that for more distinctive items—such as memorabilia—starting price has a smaller effect on the probability of a sale and a larger effect on the final sale price. Further, for more commodified items—such as DVDs—the starting price has almost no impact on the final price. In other words, an average that combines results from 23 different categories of items hides important differences between these items.
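
To give a flavor of how this kind of heterogeneity can be explored (a simplified sketch with hypothetical column names, not the authors’ actual estimation, which compares within matched sets), one could estimate the relationship separately within each category of items:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical listings table with the normalized prices from the earlier
# sketch and an item_category column.
listings = pd.read_csv("ebay_listings_normalized.csv")

# Estimate the start-price/final-price relationship separately by category,
# rather than averaging over all items. (Einav and colleagues compare within
# matched sets; for brevity this sketch omits the matched-set adjustment.)
slopes = {}
for category, group in listings.groupby("item_category"):
    fit = smf.ols("norm_final ~ norm_start", data=group).fit()
    slopes[category] = fit.params["norm_start"]

print(pd.Series(slopes).sort_values())
```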

Figure 2.9: Relationship between auction starting price and probability of a sale (a) and sale price (b). There is roughly a linear relationship between starting price and probability of sale, but a nonlinear relationship between starting price and sale price; for starting prices between 0.05 and 0.85, the starting price has very little impact on sale price. In both cases, the relationships are basically independent of item value. Adapted from Einav et al. (2015), figures 4a and 4b.

Figure 2.10: Estimates from each category of items; the solid dot is the estimate for all categories pooled together (Einav et al. 2015). These estimates show that for more distinctive items—such as memorabilia—the starting price has a smaller effect on the probability of a sale (\(x\)-axis) and a larger effect on the final sale price (\(y\)-axis). Adapted from Einav et al. (2015), figure 8.

Even if you are not particularly interested in auctions on eBay, you have to admire the way that figure 2.9 and figure 2.10 offer a richer understanding of eBay than simple estimates that describe a linear relationship and combine many different categories of items. Further, although it would be scientifically possible to generate these more subtle estimates with field experiments, the cost would make such experiments essentially impossible.

As with natural experiments, there are a number of ways that matching can lead to bad estimates. I think the biggest concern with matching estimates is that they can be biased by things that were not used in the matching. For example, in their main results, Einav and colleagues did exact matching on four characteristics: seller ID number, item category, item title, and subtitle. If the items were different in ways that were not used for matching, then this could create an unfair comparison. For example, if “budgetgolfer” lowered prices for the Taylormade Burner 09 Driver in the winter (when golf clubs are less popular), then it could appear that lower starting prices lead to lower final prices, when in fact this would be an artifact of seasonal variation in demand. One approach to addressing this concern is trying many different kinds of matching. For example, Einav and colleagues repeated their analysis while varying the time window used for matching (matched sets included items on sale within one year, within one month, and contemporaneously). Fortunately, they found similar results for all time windows.

A further concern with matching arises from interpretation. Estimates from matching only apply to matched data; they do not apply to the cases that could not be matched. For example, by limiting their research to items that had multiple listings, Einav and colleagues are focusing on professional and semi-professional sellers. Thus, when interpreting these comparisons we must remember that they only apply to this subset of eBay.

Matching is a powerful strategy for finding fair comparisons in non-experimental data. To many social scientists, matching feels like a second-best alternative to experiments, but that is a belief that should be revised, slightly. Matching in massive data might be better than a small number of field experiments when (1) heterogeneity in effects is important and (2) the important variables needed for matching have been measured. Table 2.4 provides some other examples of how matching can be used with big data sources.

Table 2.4: Examples of Studies that Use Matching with Big Data Sources
Substantive focus | Big data source | Reference
Effect of shootings on police violence | Stop-and-frisk records | Legewie (2016)
Effect of September 11, 2001 on families and neighbors | Voting records and donation records | Hersh (2013)
Social contagion | Communication and product adoption data | Aral, Muchnik, and Sundararajan (2009)

In conclusion, estimating causal effects from non-experimental data is difficult, but approaches such as natural experiments and statistical adjustments (e.g., matching) can be used. In some situations, these approaches can go badly wrong, but when deployed carefully, they can be a useful complement to the experimental approach that I describe in chapter 4. Further, these two approaches seem especially likely to benefit from the growth of always-on, big data systems.