Matching creates fair comparisons by pruning away cases.
Fair comparisons can come from either randomized controlled experiments or natural experiments. But, there are many situations where you can’t run the ideal experiment and nature has not provided a natural experiment. In these settings, the best way to create a fair comparison is matching. In matching, the researcher looks through non-experimental data to create pairs of people that are similar except that one has received the treatment and one has not. In the process of matching, researchers are actually also pruning; that is, discarding cases for which there is no obvious comparison. Thus, this method would be more accurately called matching-and-pruning, but I’ll stick with the traditional term: matching.
A beautiful example of the power of matching strategies with massive non-experimental data sources comes from the research on consumer behavior by Liran Einav and colleagues (2015). Einav and colleagues were interested in auctions taking place on eBay, and in describing their work, I’ll focus on one particular aspect: the effect of auction starting price on auction outcomes, such as the sale price or the probability of a sale.
The most naive way to answer the question about the effect of starting price on sale price would be to simply calculate the final price for auctions with different starting prices. This approach would be fine if you simply want to predict the sale price of a given item that had been put on eBay with a given starting price. But, if your question is about the effect of starting price on market outcomes, this approach will not work because it is not based on fair comparisons; the auctions with lower starting prices might be quite different from auctions with higher starting prices (e.g., they might be for different types of goods or include different types of sellers).
If you are already concerned about making fair comparisons, you might skip the naive approach and consider running a field experiment where you would sell a specific item—say, a golf club—with a fixed set of auction parameters—say, free shipping, auction open for two weeks, etc.—but with randomly set starting prices. By comparing the resulting market outcomes, this field experiment would offer a very clear measurement of the effect of starting price on sale price. But, this measurement would only apply to one particular product and set of auction parameters. The results might be different, for example, for different types of products. Without strong theory, it is difficult to extrapolate from this single experiment to the full range of possible experiments that could have been run. Further, field experiments are sufficiently expensive that it would be infeasible to run enough of them to cover the whole parameter space of products and auction types.
In contrast to the naive approach and the experimental approach, Einav and colleagues take a third approach: matching. The main trick of their strategy is to discover things similar to field experiments that have already happened on eBay. For example, Figure 2.6 shows some of the 31 listings for exactly the same golf club—a Taylormade Burner 09 Driver—being sold by exactly the same seller—“budgetgolfer”. However, these listings have slightly different characteristics. Eleven of them offer the driver for a fixed price of $124.99, while the other 20 are auctions with different end dates. Also, the listings have different shipping fees, either $7.99 or $9.99. In other words, it is as if “budgetgolfer” is running experiments for the researchers.
The listings of the Taylormade Burner 09 Driver being sold by “budgetgolfer” are one example of a matched set of listings, where the exact same item is being sold by the exact same seller but each time with slightly different characteristics. Within the massive logs of eBay there are literally hundreds of thousands of matched sets involving millions of listings. Thus, rather than comparing the final price for all auctions with a given starting price, Einav and colleagues make comparisons within matched sets. In order to combine results from the comparisons within these hundreds of thousands of matched sets, Einav and colleagues re-express the starting price and final price in terms of the reference value of each item (e.g., its average sale price). For example, if the Taylormade Burner 09 Driver has a reference value of $100 (based on its sales), then a starting price of $10 would be expressed as 0.1 and a final price of $120 would be expressed as 1.2.
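The re-expression step is simple arithmetic; here is a minimal sketch in Python, assuming the reference value is an item’s average sale price. The function names and toy numbers are illustrative, not taken from the actual study:

```python
# Sketch of re-expressing prices relative to an item's reference value,
# assuming the reference value is the item's average sale price.
# All names and numbers here are illustrative, not from the study.

def reference_value(sale_prices):
    """Average sale price of an item across its matched set."""
    return sum(sale_prices) / len(sale_prices)

def normalize(price, ref):
    """Express a price as a fraction of the item's reference value."""
    return price / ref

# Toy matched set: suppose the driver's observed sale prices average $100.
sales = [95.0, 105.0, 100.0]
ref = reference_value(sales)  # 100.0

print(normalize(10.0, ref))   # starting price $10  -> 0.1
print(normalize(120.0, ref))  # final price   $120 -> 1.2
```

Once every listing is expressed on this common scale, results from matched sets of cheap and expensive items can be pooled in a single analysis.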
Recall that Einav and colleagues were interested in the effect of starting price on auction outcomes. Using linear regression, they estimated that higher starting prices decrease the probability of a sale, and that higher starting prices increase the final sale price, conditional on a sale occurring. By themselves, these estimates—which are averaged over all products and assume a linear relationship between starting price and final outcomes—are not all that interesting. But, Einav and colleagues also use the massive size of their data to estimate a variety of more subtle findings. First, Einav and colleagues made these estimates separately for items of different prices and without using linear regression. They found that while the relationship between starting price and probability of a sale is linear, the relationship between starting price and sale price is clearly non-linear (Figure 2.7). In particular, for starting prices between 0.05 and 0.85, the starting price has very little impact on sale price, a finding that was completely missed in the analysis that had assumed a linear relationship.
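Moving from a linear fit to a non-linear description amounts to estimating the outcome separately within bins of the normalized starting price, rather than fitting a single slope. A minimal sketch, using simulated numbers rather than the actual eBay data:

```python
# Hypothetical sketch: estimating the starting-price/sale-price
# relationship without assuming linearity, by averaging outcomes
# within bins of the normalized starting price.
# The data below is simulated for illustration only.

def binned_means(xs, ys, edges):
    """Mean of ys within each [edges[i], edges[i+1]) bin of xs."""
    means = []
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [y for x, y in zip(xs, ys) if lo <= x < hi]
        means.append(sum(in_bin) / len(in_bin) if in_bin else None)
    return means

# Simulated listings: sale price is flat for mid-range starting prices,
# loosely mimicking the flat region described for 0.05-0.85.
starts = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
finals = [0.90, 1.0, 1.0, 1.0, 1.0, 1.30]

print(binned_means(starts, finals, [0.0, 0.5, 1.0]))
```

With enough data, many narrow bins trace out the shape of the relationship directly, which is exactly what a single regression slope cannot do.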
Second, rather than averaging over all items, Einav and colleagues also use the massive scale of their data to estimate the impact of starting price for 23 different categories of items (e.g., pet supplies, electronics, and sports memorabilia) (Figure 2.8). These estimates show that for more distinctive items—such as memorabilia—starting price has a smaller effect on the probability of a sale and a larger effect on the final sale price. Further, for more commodified items—such as DVDs and videos—the starting price has almost no impact on the final price. In other words, an average that combines results from 23 different categories of items hides important information about the differences between these items.
Even if you are not particularly interested in auctions on eBay, you have to admire the way that Figure 2.7 and Figure 2.8 offer a richer understanding of eBay than simple regression estimates that assume linear relationships and combine many different categories of items. These more subtle estimates illustrate the power of matching in massive data; these estimates would have been impossible without an enormous number of field experiments, which would have been prohibitively expensive.
Of course, we should have less confidence in the results of any particular matching study than we would in the results of a comparable experiment. When assessing the results from any matching study, there are two important concerns. First, we have to remember that we can only ensure fair comparisons on things that were used for matching. In their main results, Einav and colleagues did exact matching on four characteristics: seller ID number, item category, item title, and subtitle. If the items were different in ways that were not used for matching, that could create an unfair comparison. For example, if “budgetgolfer” lowered prices for the Taylormade Burner 09 Driver in the winter (when golf clubs are less popular), then it could appear that lower starting prices lead to lower final prices, when in fact this would be an artifact of seasonal variation in demand. In general, the best approach to this problem seems to be trying many different kinds of matching. For example, Einav and colleagues repeat their analysis where matched sets include items on sale within one year, within one month, and contemporaneously. Making the time window tighter decreases the number of matched sets, but reduces concerns about seasonal variation. Fortunately, they find that results are unchanged by these changes in matching criteria. In the matching literature, this type of concern is usually expressed in terms of observables and unobservables, but the key idea is really that researchers are only creating fair comparisons on the features used in matching.
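Exact matching itself is just grouping listings on the matching characteristics and discarding groups that allow no comparison. A minimal sketch, with hypothetical field names (the study’s actual matching keys were seller ID, item category, item title, and subtitle):

```python
# Minimal sketch of exact matching: group listings into matched sets by
# the characteristics used for matching. Field names here are
# hypothetical, chosen only to mirror the golf-club example.
from collections import defaultdict

def matched_sets(listings, keys):
    """Group listings that agree exactly on every key; prune singletons."""
    groups = defaultdict(list)
    for listing in listings:
        groups[tuple(listing[k] for k in keys)].append(listing)
    # Pruning: a set with only one listing allows no within-set comparison.
    return {k: v for k, v in groups.items() if len(v) > 1}

listings = [
    {"seller": "budgetgolfer", "title": "Taylormade Burner 09 Driver", "start": 0.1},
    {"seller": "budgetgolfer", "title": "Taylormade Burner 09 Driver", "start": 0.5},
    {"seller": "otherseller", "title": "Some Putter", "start": 0.2},
]

sets_ = matched_sets(listings, ["seller", "title"])
print(len(sets_))  # prints 1: the putter listing is pruned
```

Tightening the criteria (e.g., adding a listing-date field to `keys`) works the same way: more keys mean fewer, smaller matched sets, but fairer within-set comparisons.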
The second major concern when interpreting matching results is that they only apply to matched data; they do not apply to the cases that could not be matched. For example, by limiting their research to items that had multiple listings, Einav and colleagues are focusing on professional and semi-professional sellers. Thus, when interpreting these comparisons we must remember that they only apply to this subset of eBay.
Matching is a powerful strategy for finding fair comparisons in large datasets. To many social scientists, matching feels like a second-best to experiments, but that is a belief that should be revised, slightly. Matching in massive data might be better than a small number of field experiments when: 1) heterogeneity in effects is important and 2) there are good observables for matching. Table 2.4 provides some other examples of how matching can be used with big data sources.
| Substantive focus | Big data source | Citation |
|---|---|---|
| Effect of shootings on police violence | Stop-and-frisk records | Legewie (2016) |
| Effect of September 11, 2001 on families and neighbors | Voting records and donation records | Hersh (2013) |
| Social contagion | Communication and product adoption data | Aral, Muchnik, and Sundararajan (2009) |
In conclusion, naive approaches to estimating causal effects from non-experimental data are dangerous. However, strategies for making causal estimates lie along a continuum from strongest to weakest, and researchers can discover fair comparisons within non-experimental data. The growth of always-on, big data systems increases our ability to effectively use two existing methods: natural experiments and matching.