3.6.2 Enriched asking

You are reading the Open Review Edition of Bit by Bit.

Even though it can be messy, enriched asking can be powerful.

A different approach to dealing with the incompleteness of digital trace data is to enrich it directly with survey data, a process that I’ll call enriched asking. One example of enriched asking is the study of Burke and Kraut (2014), which I described earlier in the chapter (Section 3.2), about whether interacting on Facebook increases friendship strength. In that case, Burke and Kraut combined survey data with Facebook log data.

The setting that Burke and Kraut were working in, however, meant that they didn’t have to deal with two big problems that researchers doing enriched asking face. First, actually linking together the datasets can be difficult and error-prone (we’ll see an example of this problem below). This process is called record linkage: the matching of a record in one dataset with the appropriate record in the other dataset. Second, the quality of the digital traces will frequently be difficult for researchers to assess. For example, sometimes the process through which the traces are collected is proprietary and could be susceptible to many of the problems described in Chapter 2. In other words, enriched asking will frequently involve error-prone linking of surveys to black-box data sources of unknown quality. Despite the concerns that these two problems introduce, it is possible to conduct important research with this strategy, as was demonstrated by Stephen Ansolabehere and Eitan Hersh (2012) in their research on voting patterns in the US. It is worthwhile to go over this study in some detail because many of the strategies that Ansolabehere and Hersh developed will be useful in other applications of enriched asking.
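To make the record linkage problem concrete, here is a minimal sketch in Python. It is not the procedure used in any study discussed in this chapter; the field names (`name`, `birth_year`, `zip`), the helper functions, and all of the records are hypothetical. The sketch normalizes names, builds a simple matching key, and links a survey respondent only when exactly one record in the other file shares that key.

```python
import unicodedata

def normalize(text):
    """Lowercase, strip accents, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())

def link_key(record):
    """Hypothetical matching key: normalized name, birth year, and ZIP code."""
    return (normalize(record["name"]), record["birth_year"], record["zip"])

def link(survey, voter_file):
    """Link each survey respondent to at most one voter-file record."""
    index = {}
    for rec in voter_file:
        index.setdefault(link_key(rec), []).append(rec)

    linked, unlinked = [], []
    for resp in survey:
        candidates = index.get(link_key(resp), [])
        if len(candidates) == 1:
            # Unambiguous match: attach the administrative record's value.
            linked.append({**resp, "validated_vote": candidates[0]["voted"]})
        else:
            # No candidate, or several: leave unlinked rather than guess.
            unlinked.append(resp)
    return linked, unlinked

# Invented records, for illustration only.
survey = [
    {"id": 1, "name": "María  O'Neil", "birth_year": 1970, "zip": "08540",
     "reported_vote": True},
    {"id": 2, "name": "John Smith", "birth_year": 1985, "zip": "10001",
     "reported_vote": True},
    {"id": 3, "name": "Bob Lee", "birth_year": 1990, "zip": "60601",
     "reported_vote": False},
]
voter_file = [
    {"name": "maria oneil", "birth_year": 1970, "zip": "08540", "voted": True},
    {"name": "John Smith", "birth_year": 1985, "zip": "10001", "voted": False},
    {"name": "John Smith", "birth_year": 1985, "zip": "10001", "voted": True},
    {"name": "Robert Lee", "birth_year": 1990, "zip": "60601", "voted": False},
]

linked, unlinked = link(survey, voter_file)
print("linked ids:", [r["id"] for r in linked])
print("unlinked ids:", [r["id"] for r in unlinked])
```

Even in this toy example, only one of the three respondents is linked. The second respondent matches two records, and the third is missed because a nickname defeats exact matching. Both kinds of failure can be systematic rather than random, which is one reason linkage errors can bias later analysis.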

Voter turnout has been the subject of extensive research in political science, and in the past, researchers’ understanding of who votes and why has generally been based on the analysis of survey data. Voting in the US, however, is an unusual behavior in that the government records whether each citizen has voted (of course, the government does not record who each citizen votes for). For many years, these governmental voting records were available on paper forms, scattered in various local government offices around the country. This made it difficult, but not impossible, for political scientists to have a complete picture of the electorate and to compare what people say in surveys about voting to their actual voting behavior (Ansolabehere and Hersh 2012).

But now these voting records have been digitized, and a number of private companies have systematically collected and merged them to produce comprehensive master voting files that record the voting behavior of all Americans. Ansolabehere and Hersh partnered with one of these companies, Catalist LLC, in order to use its master voting file to help develop a better picture of the electorate. Further, because their study relied on digital records collected and curated by a company, it offered a number of advantages over previous efforts by researchers that had been done without the aid of companies and using analog records.

Like many of the digital trace sources in Chapter 2, the Catalist master file did not include much of the demographic, attitudinal, and behavioral information that Ansolabehere and Hersh needed. In addition to this information, Ansolabehere and Hersh were particularly interested in comparing reported voting behavior to validated voting behavior (i.e., the information in the Catalist database). So, the researchers collected the data that they wanted as part of the Cooperative Congressional Election Study (CCES), a large social survey. Next, the researchers gave this data to Catalist, and Catalist gave the researchers back a merged data file that included validated voting behavior (from Catalist), self-reported voting behavior (from CCES), and the demographics and attitudes of respondents (from CCES). In other words, Ansolabehere and Hersh enriched the voting data with survey data, and the resulting merged file enabled them to do something that neither file enabled individually.

By enriching the Catalist master data file with survey data, Ansolabehere and Hersh came to three important conclusions. First, over-reporting of voting is rampant: almost half of the non-voters reported voting. Put another way, if someone reported voting, there is only an 80% chance that they actually voted. Second, over-reporting is not random; it is more common among high-income, well-educated partisans who are engaged in public affairs. In other words, the people who are most likely to vote are also most likely to lie about voting. Third, and most critically, because of the systematic nature of over-reporting, the actual differences between voters and non-voters are smaller than they appear just from surveys. For example, those with a bachelor’s degree are about 22 percentage points more likely to report voting, but are only 10 percentage points more likely to actually vote. Further, existing resource-based theories of voting are much better at predicting who will report voting than who actually votes, an empirical finding that calls for new theories to understand and predict voting.
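The two ways of describing over-reporting above are different conditional proportions, and it can help to see them computed side by side. The following sketch assumes a merged file in which each respondent has a self-reported and a validated voting indicator. The function name, the field names (`reported_vote`, `validated_vote`), and the counts are all invented for illustration; they are not the figures from the study.

```python
from collections import Counter

def overreporting_summary(merged):
    """Summarize over-reporting from (reported, validated) vote indicators.

    Assumes the file contains at least one validated non-voter and at
    least one respondent who reported voting.
    """
    counts = Counter((r["reported_vote"], r["validated_vote"]) for r in merged)
    nonvoters = counts[(True, False)] + counts[(False, False)]
    reporters = counts[(True, True)] + counts[(True, False)]
    return {
        # Among validated non-voters, what share said they voted?
        "share_of_nonvoters_reporting_vote": counts[(True, False)] / nonvoters,
        # Among people who said they voted, what share were validated?
        "share_of_reporters_validated": counts[(True, True)] / reporters,
    }

# Invented counts, for illustration only (not the study's data).
merged = (
    [{"reported_vote": True, "validated_vote": True}] * 6
    + [{"reported_vote": True, "validated_vote": False}] * 2
    + [{"reported_vote": False, "validated_vote": True}] * 1
    + [{"reported_vote": False, "validated_vote": False}] * 2
)

summary = overreporting_summary(merged)
print(summary)
```

With these made-up counts, half of the validated non-voters reported voting, yet three quarters of those who reported voting did vote. The two numbers differ because they condition on different groups, which is why both framings appear in the text.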

But how much should we trust these results? Remember, these results depend on error-prone linking to black-box data with unknown amounts of error. More specifically, the results hinge on two key steps: (1) the ability of Catalist to combine many disparate data sources to produce an accurate master data file and (2) the ability of Catalist to link the survey data to its master data file. Each of these steps is quite difficult, and errors at either step could lead researchers to the wrong conclusions. However, both data processing and matching are critical to the continued existence of Catalist as a company, so it can invest resources in solving these problems, often at a scale that no individual academic researcher or group of researchers can match. In the further reading at the end of the chapter, I describe these problems in more detail and how Ansolabehere and Hersh built confidence in their results. Although these details are specific to this study, issues similar to these will arise for other researchers wishing to link to black-box digital trace data sources.

What are the general lessons researchers can draw from this study? First, there is tremendous value in enriching digital traces with survey data. Second, even though these aggregated, commercial data sources should not be considered “ground truth”, in some cases they can be useful. In fact, it is best not to compare these data sources to absolute Truth (of which they will always fall short). Rather, it is better to compare them to other available data sources, which invariably have errors as well.