Further commentary

This section is designed to be used as a reference, rather than to be read as a narrative.

  • Introduction (Section 3.1)

Many of the themes in this chapter have also been echoed in recent Presidential Addresses at the American Association for Public Opinion Research (AAPOR), such as Dillman (2002), Newport (2011), Santos (2014), and Link (2015).

For more historical background about the development of survey research, see Smith (1976) and Converse (1987). For more on the idea of three eras of survey research, see Groves (2011) and Dillman, Smyth, and Christian (2008) (which breaks up the three eras slightly differently).

A peek inside the transition from the first to the second era in survey research is Groves and Kahn (1979), which does a detailed head-to-head comparison of a face-to-face and a telephone survey. Brick and Tucker (2007) looks back at the historical development of random digit dialing sampling methods.

For more on how survey research has changed in the past in response to changes in society, see Tourangeau (2004), Mitofsky (1989), and Couper (2011).

  • Asking vs. observing (Section 3.2)

Learning about internal states by asking questions can be problematic because sometimes the respondents themselves are not aware of their internal states. For example, Nisbett and Wilson (1977) have a wonderful paper with the evocative title: “Telling more than we can know: Verbal reports on mental processes.” In the paper the authors conclude: “subjects are sometimes (a) unaware of the existence of a stimulus that importantly influenced a response, (b) unaware of the existence of the response, and (c) unaware that the stimulus has affected the response.”

For arguments that researchers should prefer observed behavior to reported behavior or attitudes, see Baumeister, Vohs, and Funder (2007) (psychology) and Jerolmack and Khan (2014) and responses (Maynard 2014; Cerulo 2014; Vaisey 2014; Jerolmack and Khan 2014) (sociology). The difference between asking and observing also arises in economics, where researchers talk about stated and revealed preferences. For example, a researcher could ask respondents whether they prefer eating ice cream or going to the gym (stated preferences), or the researcher could observe how often people eat ice cream and go to the gym (revealed preferences). There is deep skepticism of certain types of stated preferences data in economics (Hausman 2012).

A main theme from these debates is that reported behavior is not always accurate. But automatically recorded behavior may not be accurate, may not be collected on a sample of interest, and may not be accessible to researchers. Thus, in some situations, I think that reported behavior can be useful. A second main theme from these debates is that reports about emotions, knowledge, expectations, and opinions are not always accurate. But if information about these internal states is needed by researchers—either to help explain some behavior or as the thing to be explained—then asking may be appropriate.

  • Total survey error (Section 3.3)

For book-length treatments of total survey error, see Groves et al. (2009) or Weisberg (2005). For a history of the development of total survey error, see Groves and Lyberg (2010).

In terms of representation, a great introduction to the issues of non-response and non-response bias is the National Research Council report Nonresponse in Social Science Surveys: A Research Agenda (2013). Another useful overview is provided by Groves (2006). Also, entire special issues of the Journal of Official Statistics, Public Opinion Quarterly, and The Annals of the American Academy of Political and Social Science have been published on the topic of non-response. Finally, there are many different ways of calculating the response rate; these approaches are described in detail in a report by the American Association for Public Opinion Research (AAPOR 2015).

The 1936 Literary Digest poll has been studied in detail (Bryson 1976; Squire 1988; Cahalan 1989; Lusinchi 2012). It has also been used as a parable to warn against haphazard data collection (Gayo-Avello 2011). In 1936, George Gallup used a more sophisticated form of sampling, and was able to produce more accurate estimates with a much smaller sample. Gallup’s success over the Literary Digest was a milestone in the development of survey research (Converse 1987, Ch 3; Ohmer 2006, Ch 4; Igo 2008, Ch 3).

In terms of measurement, a great first resource for designing questionnaires is Bradburn, Sudman, and Wansink (2004). For a more advanced treatment focused specifically on attitude questions, see Schuman and Presser (1996). More on pre-testing questions is available in Presser and Blair (1994), Presser et al. (2004), and Chapter 8 of Groves et al. (2009).

The classic, book-length treatment of the trade-off between survey costs and survey errors is Groves (2004).

  • Who to ask (Section 3.4)

Classic book-length treatments of standard probability sampling and estimation are Lohr (2009) (more introductory) and Särndal, Swensson, and Wretman (2003) (more advanced). A classic book-length treatment of post-stratification and related methods is Särndal and Lundström (2005). In some digital age settings, researchers know quite a bit about non-respondents, which was not often true in the past. Different forms of non-response adjustment are possible when researchers have information about non-respondents (Kalton and Flores-Cervantes 2003; Smith 2011).

The Xbox study of Wang et al. (2015) uses a technique called multilevel regression and post-stratification (MRP, sometimes called “Mister P”) that allows researchers to estimate cell means even when there are many, many cells. Although there is some debate about the quality of the estimates from this technique, it seems like a promising area to explore. The technique was first used in Park, Gelman, and Bafumi (2004), and there has been subsequent use and debate (Gelman 2007; Lax and Phillips 2009; Pacheco 2011; Buttice and Highton 2013; Toshkov 2015). For more on the connection between individual weights and cell-based weights see Gelman (2007).
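
To make the post-stratification step concrete, here is a minimal sketch in Python (with pandas). It assumes that cell-level estimates (in MRP these would come from a multilevel regression model fit to the survey data) and population cell counts (e.g., from a census) are already available; the cells, numbers, and column names are purely illustrative.

    # A minimal sketch of the post-stratification step: combine cell-level
    # estimates with population cell counts to produce an overall estimate.
    # All values below are hypothetical illustrations.
    import pandas as pd

    # Estimated mean outcome in each cell (in MRP these would come from a
    # multilevel regression model fit to the survey respondents).
    cell_estimates = pd.DataFrame({
        "sex":      ["f", "f", "m", "m"],
        "age":      ["18-34", "35+", "18-34", "35+"],
        "estimate": [0.62, 0.55, 0.48, 0.41],
    })

    # Number of people in each cell in the target population (e.g., from a census).
    cell_counts = pd.DataFrame({
        "sex":   ["f", "f", "m", "m"],
        "age":   ["18-34", "35+", "18-34", "35+"],
        "count": [1500, 2500, 1400, 2600],
    })

    cells = cell_estimates.merge(cell_counts, on=["sex", "age"])

    # Post-stratified estimate: population-weighted average of the cell estimates.
    estimate = (cells["estimate"] * cells["count"]).sum() / cells["count"].sum()
    print(round(estimate, 3))

The advantage of the multilevel regression in the first step is that it partially pools information across cells, which is what makes it possible to produce reasonable cell estimates even when many cells contain few or no respondents.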

For other approaches to weighting web surveys, see Schonlau et al. (2009), Valliant and Dever (2011), and Bethlehem (2010).

Sample matching was proposed by Rivers (2007). Bethlehem (2015) argues that the performance of sample matching will actually be similar to other sampling approaches (e.g., stratified sampling) and other adjustment approaches (e.g., post-stratification). For more on online panels, see Callegaro et al. (2014).

Sometimes researchers have found that probability samples and non-probability samples yield estimates of similar quality (Ansolabehere and Schaffner 2014), but other comparisons have found that non-probability samples do worse (Malhotra and Krosnick 2007; Yeager et al. 2011). One possible reason for these differences is that non-probability samples have improved over time. For a more pessimistic view of non-probability sampling methods, see the report of the AAPOR Task Force on Non-probability Sampling (Baker et al. 2013), and I also recommend reading the commentary that follows the summary report.

For a meta-analysis on the effect of weighting to reduce bias in non-probability samples, see Table 2.4 in Tourangeau, Conrad, and Couper (2013), which leads the authors to conclude “adjustments seem to be useful but fallible corrections . . .”

  • How to ask (Section 3.5)

Conrad and Schober (2008) is an edited volume titled Envisioning the Survey Interview of the Future, and it addresses many of the themes in this section. Couper (2011) addresses similar themes, and Schober et al. (2015) offers a nice example of how data collection methods that are tailored to a new setting can result in higher quality data.

For another interesting example of using Facebook apps for social science surveys, see Bail (2015).

For more advice on making surveys an enjoyable and valuable experience for participants, see work on the Tailored Design Method (Dillman, Smyth, and Christian 2014).

Stone et al. (2007) offers a book-length treatment of ecological momentary assessment and related methods.

  • Surveys linked to other data (Section 3.6)

Judson (2007) describes the process of combining surveys and administrative data as “information integration,” discusses some advantages of this approach, and offers some examples.

Another way that researchers can use digital trace and administrative data is as a sampling frame for people with specific characteristics. However, accessing these records for use as a sampling frame can also raise questions related to privacy (Beskow, Sandler, and Weinberger 2006).

Regarding amplified asking, this approach is not as new as it might appear from how I’ve described it. This approach has deep connections to three large areas in statistics—model-based post-stratification (Little 1993), imputation (Rubin 2004), and small area estimation (Rao and Molina 2015). It is also related to the use of surrogate variables in medical research (Pepe 1992).

In addition to the ethical issues regarding accessing the digital trace data, amplified asking could also be used to infer sensitive traits that people might not choose to reveal in a survey (Kosinski, Stillwell, and Graepel 2013).

The cost and time estimates in Blumenstock, Cadamuro, and On (2015) refer more to variable cost—the cost of one additional survey—and do not include fixed costs such as the cost to clean and process the call data. In general, amplified asking will probably have high fixed costs and low variable costs, similar to digital experiments (see Chapter 4). More details on the data used by Blumenstock, Cadamuro, and On (2015) are in Blumenstock and Eagle (2010) and Blumenstock and Eagle (2012). Approaches from multiple imputation (Rubin 2004) might help capture uncertainty in estimates from amplified asking. If researchers doing amplified asking only care about aggregate counts, rather than individual-level traits, then the approaches in King and Lu (2008) and Hopkins and King (2010) may be useful. For more about the machine learning approaches in Blumenstock, Cadamuro, and On (2015), see James et al. (2013) (more introductory) or Hastie, Tibshirani, and Friedman (2009) (more advanced). Another popular machine learning textbook is Murphy (2012).
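
To illustrate the general shape of amplified asking, here is a minimal sketch in Python using pandas and scikit-learn. It assumes two hypothetical files: one with digital trace features and a survey-measured outcome for the (small) surveyed sample, and one with the same features for the (large) unsurveyed population. The file names, column names, and model choice are assumptions for illustration, not a description of the actual pipeline in Blumenstock, Cadamuro, and On (2015).

    # A minimal sketch of amplified asking: learn the relationship between
    # digital trace features and a survey-measured outcome on the (small)
    # surveyed sample, then predict for the (large) unsurveyed population.
    # File names, column names, and the model are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Survey respondents: trace-derived features plus a survey outcome
    # (e.g., a wealth index).
    survey = pd.read_csv("survey_with_trace_features.csv")
    feature_cols = [c for c in survey.columns
                    if c not in ("person_id", "wealth_index")]

    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(survey[feature_cols], survey["wealth_index"])

    # Everyone with digital trace data (e.g., all phone subscribers),
    # most of whom were never surveyed.
    population = pd.read_csv("population_trace_features.csv")
    population["predicted_wealth"] = model.predict(population[feature_cols])

    # Aggregate the predictions, e.g., by region, for small-area-style estimates.
    print(population.groupby("region")["predicted_wealth"].mean())

Note that the aggregated predictions carry model error as well as sampling error, which is why the uncertainty-quantification approaches mentioned above matter.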

Regarding enriched asking, the results in Ansolabehere and Hersh (2012) hinge on two key steps: 1) the ability of Catalist to combine many disparate data sources to produce an accurate master datafile and 2) the ability of Catalist to link the survey data to its master datafile. Therefore, Ansolabehere and Hersh check each of these steps carefully.

To create the master datafile, Catalist combines and harmonizes information from many different sources, including multiple voting records snapshots from each state, data from the Post Office’s National Change of Address Registry, and data from other unspecified commercial providers. The gory details about how all this cleaning and merging happens are beyond the scope of this book, but this process, no matter how careful, will propagate errors from the original data sources and introduce new errors of its own. Although Catalist was willing to discuss its data processing and provide some of its raw data, it was simply impossible for researchers to review the entire Catalist data pipeline. Rather, the researchers were in a situation where the Catalist data file had some unknown, and perhaps unknowable, amount of error. This is a serious concern because a critic might speculate that the large differences between the survey reports on the CCES and the behavior in the Catalist master data file were caused by errors in the master data file, not by misreporting by respondents.

Ansolabehere and Hersh took two different approaches to addressing the data quality concern. First, in addition to comparing self-reported voting to voting in the Catalist master file, the researchers also compared self-reported party, race, voter registration status (e.g., registered or not registered), and voting method (e.g., in person, absentee ballot, etc.) to those values found in the Catalist databases. For these four variables, the researchers found much higher levels of agreement between survey report and data in the Catalist master file than for voting. Thus, the Catalist master data file appears to have high quality information for traits other than voting, suggesting that it is not of poor overall quality. Second, in part using data from Catalist, Ansolabehere and Hersh developed three different measures of quality of county voting records, and they found that the estimated rate of over-reporting of voting was essentially unrelated to any of these data quality measures, a finding that suggests that the high rates of over-reporting are not being driven by counties with unusually low data quality.

Given the creation of this master voting file, the second source of potential errors is linking the survey records to it. For example, if this linkage is done incorrectly it could lead to an over-estimate of the difference between reported and validated voting behavior (Neter, Maynes, and Ramanathan 1965). If every person had a stable, unique identifier that was in both data sources, then linkage would be trivial. In the US and most other countries, however, there is no universal identifier. Further, even if there were such an identifier people would probably be hesitant to provide it to survey researchers! Thus, Catalist had to do the linkage using imperfect identifiers, in this case four pieces of information about each respondent: name, gender, birth year, and home address. For example, Catalist had to decide if the Homie J Simpson in the CCES was the same person as the Homer Jay Simpson in their master data file. In practice, matching is a difficult and messy process, and, to make matters worse for the researchers, Catalist considered its matching technique to be proprietary.
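
To give a flavor of what linkage with imperfect identifiers involves, here is a minimal sketch in Python. It is not Catalist’s proprietary procedure; it simply scores candidate pairs on approximate name agreement plus exact agreement on gender, birth year, and address, and accepts the best candidate only if it clears a threshold. The records, weights, and threshold are all hypothetical.

    # A minimal sketch of linkage with imperfect identifiers: score candidate
    # record pairs on approximate name agreement plus exact agreement on
    # gender, birth year, and address, then accept pairs above a threshold.
    # This is an illustration, not Catalist's proprietary matching procedure.
    from difflib import SequenceMatcher

    survey_record = {"name": "Homie J Simpson", "gender": "m",
                     "birth_year": 1956, "address": "742 Evergreen Terrace"}

    master_file = [
        {"name": "Homer Jay Simpson", "gender": "m",
         "birth_year": 1956, "address": "742 Evergreen Terrace"},
        {"name": "Homer Simson", "gender": "m",
         "birth_year": 1983, "address": "123 Fake Street"},
    ]

    def match_score(a, b):
        """Weighted agreement score between two records (weights are hypothetical)."""
        name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        score = 0.4 * name_sim
        score += 0.1 * (a["gender"] == b["gender"])
        score += 0.2 * (a["birth_year"] == b["birth_year"])
        score += 0.3 * (a["address"].lower() == b["address"].lower())
        return score

    # Keep the best-scoring candidate if it clears a (hypothetical) threshold.
    best = max(master_file, key=lambda rec: match_score(survey_record, rec))
    if match_score(survey_record, best) >= 0.8:
        print("matched:", best["name"])
    else:
        print("no confident match")

In practice, probabilistic record linkage in the tradition of Fellegi and Sunter (1969), cited below, estimates these agreement weights from the data rather than setting them by hand.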

To validate the matching algorithms, Ansolabehere and Hersh relied on two challenges. First, Catalist participated in a matching competition that was run by an independent third party: the MITRE Corporation. MITRE provided all participants with two noisy data files to be matched, and different teams competed to return to MITRE the best matching. Because MITRE itself knew the correct matching, it was able to score the teams. Of the 40 companies that competed, Catalist came in second place. This kind of independent, third-party evaluation of proprietary technology is quite rare and incredibly valuable; it should give us confidence that Catalist’s matching procedures are essentially state-of-the-art. But is the state-of-the-art good enough? In addition to this matching competition, Ansolabehere and Hersh created their own matching challenge for Catalist. From an earlier project, they had collected voter records from Florida. They provided some of these records to Catalist with some of their fields redacted and then compared Catalist’s reports of these fields to their actual values. Fortunately, Catalist’s reports were close to the withheld values, indicating that Catalist could match partial voter records onto its master data file. These two challenges, one by a third party and one by Ansolabehere and Hersh, give us more confidence in the Catalist matching algorithms, even though we cannot review their exact implementation ourselves.

There have been many previous attempts to validate voting. For an overview of that literature, see Belli et al. (1999), Berent, Krosnick, and Lupia (2011), Ansolabehere and Hersh (2012), and Hanmer, Banks, and White (2014).

It is important to note that although in this case researchers were encouraged by the quality of data from Catalist, other evaluations of commercial vendors have been less enthusiastic. For example, researchers found poor quality when matching data from a survey to a consumer file from Marketing Systems Group (which itself merged together data from three providers: Acxiom, Experian, and InfoUSA) (Pasek et al. 2014). That is, the data file did not match survey responses that researchers expected to be correct, the data file had missing data for a large number of questions, and the pattern of missing data was correlated with the reported survey values (in other words, the missing data were systematic, not random).

For more on record linkage between surveys and administrative data, see Sakshaug and Kreuter (2012) and Schnell (2013). For more on record linkage in general, see Dunn (1946) and Fellegi and Sunter (1969) (historical) and Larsen and Winkler (2014) (modern). Similar approaches have also been developed in computer science under names such as data deduplication, instance identification, name matching, duplicate detection, and duplicate record detection (Elmagarmid, Ipeirotis, and Verykios 2007). There are also privacy-preserving approaches to record linkage that do not require the transmission of personally identifying information (Schnell 2013). Researchers at Facebook developed a procedure to probabilistically link their records to voting behavior (Jones et al. 2013); this linkage was done to evaluate an experiment that I’ll tell you about in Chapter 4 (Bond et al. 2012).

Another example of linking a large-scale social survey to government administrative records comes from the Health and Retirement Survey and the Social Security Administration. For more on that study, including information about the consent procedure, see Olson (1996) and Olson (1999).

The process of combining many sources of administrative records into a master datafile—the process that Catalist employs—is common in the statistical offices of some national governments. Two researchers from Statistics Sweden have written a detailed book on the topic (Wallgren and Wallgren 2007). For an example of this approach in a single county in the United States (Olmsted County, Minnesota; home of the Mayo Clinic), see Sauver et al. (2011). For more on errors that can appear in administrative records, see Groen (2012).