3.6.1 Enriched asking

In enriched asking, survey data builds context around a big data source that contains some important measurements but lacks others.

One way to combine survey data and big data sources is a process that I’ll call enriched asking. In enriched asking, a big data source contains some important measurements but lacks other measurements, so the researcher collects these missing measurements in a survey and then links the two data sources together. One example of enriched asking is the study by Burke and Kraut (2014) about whether interacting on Facebook increases friendship strength, which I described in section 3.2. In that case, Burke and Kraut combined survey data with Facebook log data.

The setting in which Burke and Kraut were working, however, meant that they didn’t have to deal with two big problems that researchers doing enriched asking typically face. First, actually linking together the individual-level datasets, a process called record linkage, can be difficult if there is no unique identifier in both data sources that can be used to ensure that the correct record in one dataset is matched with the correct record in the other. The second main problem with enriched asking is that the quality of the big data source will frequently be difficult for researchers to assess because the process through which the data are created may be proprietary and could be susceptible to many of the problems described in chapter 2. In other words, enriched asking will frequently involve error-prone linking of surveys to black-box data sources of unknown quality. Despite these problems, however, enriched asking can be used to conduct important research, as was demonstrated by Stephen Ansolabehere and Eitan Hersh (2012) in their research on voting patterns in the United States.
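
To make the record-linkage problem concrete, here is a minimal sketch in Python (using pandas and entirely hypothetical column names, not the actual data in any study discussed here) of the easy case, in which both sources happen to share a stable unique identifier; the closing comment notes what changes when, as is typical, no such identifier exists.

```python
import pandas as pd

# Hypothetical survey responses and big data records that happen to share
# a stable, unique identifier ("person_id"). In this easy case, record
# linkage is just an exact join.
survey = pd.DataFrame({
    "person_id": [101, 102, 103],
    "reported_vote": [True, True, False],
    "education": ["BA", "HS", "BA"],
})

big_data = pd.DataFrame({
    "person_id": [101, 102, 104],
    "validated_vote": [True, False, True],
})

# An inner join keeps only the survey respondents who were found in the
# big data source; respondent 103 is dropped because no match exists.
linked = survey.merge(big_data, on="person_id", how="inner")
print(linked)

# In practice the two sources rarely share such an identifier, so linkage
# must rely on imperfect fields (e.g., name, birth year, address), and
# every approximate match is a potential source of error.
```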

Voter turnout has been the subject of extensive research in political science, and, in the past, researchers’ understanding of who votes and why has generally been based on the analysis of survey data. Voting in the United States, however, is an unusual behavior in that the government records whether each citizen has voted (of course, the government does not record who each citizen votes for). For many years, these governmental voting records were available on paper forms, scattered in various local government offices around the country. This made it very difficult, but not impossible, for political scientists to have a complete picture of the electorate and to compare what people say in surveys about voting with their actual voting behavior (Ansolabehere and Hersh 2012).

But these voting records have now been digitized, and a number of private companies have systematically collected and merged them to produce comprehensive master voting files that contain the voting behavior of all Americans. Ansolabehere and Hersh partnered with one of these companies—Catalist, LLC—in order to use its master voting file to help develop a better picture of the electorate. Further, because their study relied on digital records collected and curated by a company that had invested substantial resources in data collection and harmonization, it offered a number of advantages over previous efforts that had been made without the aid of companies and that had used analog records.

Like many of the big data sources in chapter 2, the Catalist master file did not include much of the demographic, attitudinal, and behavioral information that Ansolabehere and Hersh needed. In fact, they were particularly interested in comparing reported voting behavior in surveys with validated voting behavior (i.e., the information in the Catalist database). So Ansolabehere and Hersh collected the data that they wanted as part of a large social survey, the CCES, which was mentioned earlier in this chapter. Then they gave their survey data to Catalist, and Catalist returned a merged data file that included validated voting behavior (from Catalist), self-reported voting behavior (from the CCES), and the demographics and attitudes of respondents (from the CCES) (figure 3.13). In other words, Ansolabehere and Hersh combined the voting records data with survey data in order to do research that was not possible with either data source individually.
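
As a hedged sketch of what such a merged file might look like (the field names below are illustrative, not the actual CCES or Catalist variables), each linked respondent ends up with survey measurements and validated behavior side by side:

```python
from dataclasses import dataclass

@dataclass
class MergedRecord:
    """One row of a hypothetical merged analysis file."""
    # From the survey (self-reports, demographics, attitudes)
    reported_vote: bool        # "Did you vote in the last election?"
    education: str             # e.g., "high school", "bachelor's"
    political_interest: int    # e.g., 1 (low) to 4 (high)
    # From the voting-records master file
    validated_vote: bool       # turnout according to government records

example = MergedRecord(
    reported_vote=True,
    education="bachelor's",
    political_interest=4,
    validated_vote=False,  # a respondent who over-reported voting
)
print(example)
```

A row in which reported_vote and validated_vote disagree is exactly the kind of case that drives the analysis described below.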

Figure 3.13: Schematic of the study by Ansolabehere and Hersh (2012). To create the master datafile, Catalist combines and harmonizes information from many different sources. This process of merging, no matter how careful, will propagate errors in the original data sources and will introduce new errors. A second source of errors is the record linkage between the survey data and the master datafile. If every person had a stable, unique identifier in both data sources, then linkage would be trivial. But Catalist had to do the linkage using imperfect identifiers, in this case name, gender, birth year, and home address. Unfortunately, for many cases this information could be incomplete or inaccurate; a voter named Homer Simpson might appear as Homer Jay Simpson, Homie J Simpson, or even Homer Sampsin. Despite the potential for errors in the Catalist master datafile and errors in the record linkage, Ansolabehere and Hersh were able to build confidence in their estimates through several different types of checks.
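
To illustrate why linkage on such identifiers is error-prone, here is a minimal sketch (using Python's standard-library difflib, and certainly not Catalist's actual matching algorithm) of approximate string matching on the name variants mentioned in the caption:

```python
from difflib import SequenceMatcher

# The same voter can appear under several spellings, so matching on names
# has to be approximate rather than exact.
reference = "homer simpson"
candidates = ["homer jay simpson", "homie j simpson",
              "homer sampsin", "marge simpson"]

for name in candidates:
    score = SequenceMatcher(None, reference, name).ratio()
    print(f"{name:20s} similarity = {score:.2f}")

# Any matching rule then needs a threshold: set it too strictly and true
# matches are missed; set it too loosely and different people are merged.
```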

With their combined data file, Ansolabehere and Hersh came to three important conclusions. First, over-reporting of voting is rampant: almost half of the nonvoters reported voting, and if someone reported voting, there is only an 80% chance that they actually voted. Second, over-reporting is not random: it is more common among high-income, well-educated partisans who are engaged in public affairs. In other words, the people who are most likely to vote are also the most likely to lie about voting. Third, and most critically, because of the systematic nature of over-reporting, the actual differences between voters and nonvoters are smaller than they appear from surveys alone. For example, those with a bachelor’s degree are about 22 percentage points more likely to report voting than those without one, but are only 10 percentage points more likely to actually vote. It turns out, perhaps not surprisingly, that existing resource-based theories of voting are much better at predicting who will report voting (which is the data that researchers have used in the past) than they are at predicting who actually votes. Thus, the empirical findings of Ansolabehere and Hersh (2012) call for new theories to understand and predict voting.
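
To make these quantities concrete, here is a minimal worked example in Python with made-up counts (not the actual CCES/Catalist numbers) showing how over-reporting among nonvoters and the accuracy of a reported "yes" can be computed from a cross-tabulation of reported and validated turnout:

```python
import pandas as pd

# Made-up counts for illustration only; rows are validated (actual)
# behavior, columns are self-reported behavior.
counts = pd.DataFrame(
    {"reported_voted": [800, 200], "reported_not_voted": [50, 450]},
    index=["actually_voted", "actually_not_voted"],
)

# Over-reporting among nonvoters: share of validated nonvoters who
# nonetheless said that they voted.
nonvoters = counts.loc["actually_not_voted"]
print("Share of nonvoters reporting a vote:",
      round(nonvoters["reported_voted"] / nonvoters.sum(), 2))

# Accuracy of a "yes": P(actually voted | reported voting).
reported_yes = counts["reported_voted"]
print("P(actually voted | reported voting):",
      round(reported_yes["actually_voted"] / reported_yes.sum(), 2))

# With individual-level data, the same comparison can be repeated within
# demographic groups (e.g., by education) to contrast reported and
# validated turnout gaps like those described in the text.
```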

But how much should we trust these results? Remember, these results depend on error-prone linking to black-box data with unknown amounts of error. More specifically, the results hinge on two key steps: (1) the ability of Catalist to combine many disparate data sources to produce an accurate master datafile and (2) the ability of Catalist to link the survey data to its master datafile. Each of these steps is difficult, and errors in either could lead researchers to the wrong conclusions. However, both data processing and linking are critical to the continued existence of Catalist as a company, so it can invest resources in solving these problems, often at a scale that no academic researcher can match. In their paper, Ansolabehere and Hersh go through a number of checks on the results of these two steps—even though some of the underlying data and processes are proprietary—and these checks might be helpful for other researchers wishing to link survey data to black-box big data sources.
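
As one generic illustration of the kind of check that can build confidence in a black-box data source (a hedged sketch with hypothetical numbers, not one of Ansolabehere and Hersh's specific checks), aggregate turnout computed from a master voting file can be compared against officially reported turnout:

```python
import pandas as pd

# Hypothetical slice of a master voting file and hypothetical official
# turnout benchmarks; illustrative only.
master_file = pd.DataFrame({
    "state": ["OH", "OH", "OH", "OH", "PA", "PA", "PA", "PA"],
    "voted": [True, False, True, True, True, False, True, False],
})
official_turnout = {"OH": 0.70, "PA": 0.55}

file_turnout = master_file.groupby("state")["voted"].mean()
for state, benchmark in official_turnout.items():
    gap = file_turnout[state] - benchmark
    print(f"{state}: file {file_turnout[state]:.2f}, "
          f"official {benchmark:.2f}, gap {gap:+.2f}")

# Large gaps would be a warning sign about the master file or the
# processing that produced it.
```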

What are the general lessons researchers can draw from this study? First, there is tremendous value both in enriching big data sources with survey data and in enriching survey data with big data sources (you can see this study either way). By combining these two data sources, the researchers were able to do something that was impossible with either individually. The second general lesson is that although aggregated, commercial data sources, such as the data from Catalist, should not be considered “ground truth,” in some cases they can be useful. Skeptics sometimes compare these aggregated, commercial data sources with absolute Truth and point out that they fall short. However, in this case, the skeptics are making the wrong comparison: all data that researchers use fall short of absolute Truth. Instead, it is better to compare aggregated, commercial data sources with other available data sources (e.g., self-reported voting behavior), which invariably have errors as well. Finally, the third general lesson of Ansolabehere and Hersh’s study is that in some situations, researchers can benefit from the huge investments that many private companies are making in collecting and harmonizing complex social datasets.