This section is designed to be used as a reference, rather than to be read as a narrative.
One kind of observing that is not included in this chapter is ethnography. For more on ethnography in digital spaces see Boellstorff et al. (2012), and for more on ethnography in mixed digital and physical spaces see Lane (2016).
When you are repurposing data, there are two mental tricks that can help you understand the possible problems that you might encounter. First, you can try to imagine the ideal dataset for your problem and then compare that to the dataset that you are using. How are they similar and how are they different? If you didn’t collect your data yourself, there are likely to be differences between what you want and what you have. But you have to decide whether these differences are minor or major.
Second, remember that someone created and collected your data for some reason. You should try to understand their reasoning. This kind of reverse-engineering can help you identify possible problems and biases in your repurposed data.
There is no single consensus definition of “big data”, but many definitions seem to focus on the 3 Vs: volume, variety, and velocity (e.g., Japec et al. (2015)). Rather than focusing on the characteristics of the data, my definition focuses more on why the data was created.
My inclusion of government administrative data inside the category of big data is a bit unusual. Others who have made this case include Legewie (2015), Connelly et al. (2016), and Einav and Levin (2014). For more about the value of government administrative data for research, see Card et al. (2010), Taskforce (2012), and Grusky, Smeeding, and Snipp (2015).
For a view of administrative research from inside the government statistical system, particularly the US Census Bureau, see Jarmin and O’Hara (2016). For a book-length treatment of the administrative records research at Statistics Sweden, see Wallgren and Wallgren (2007).
In the chapter, I briefly compared a traditional survey such as the General Social Survey (GSS) to a social media data source such as Twitter. For a thorough and careful comparison between traditional surveys and social media data, see Schober et al. (2016).
These 10 characteristics of big data have been described in a variety of different ways by a variety of different authors. Writing that influenced my thinking on these issues includes Lazer et al. (2009), Groves (2011), Howison, Wiggins, and Crowston (2011), boyd and Crawford (2012), Taylor (2013), Mayer-Schönberger and Cukier (2013), Golder and Macy (2014), Ruths and Pfeffer (2014), Tufekci (2014), Sampson and Small (2015), Lewis (2015), Lazer (2015), Horton and Tambe (2015), Japec et al. (2015), and Goldstone and Lupyan (2016).
Throughout this chapter, I’ve used the term digital traces, which I think is relatively neutral. Another popular term for digital traces is digital footprints (Golder and Macy 2014), but as Hal Abelson, Ken Ledeen, and Harry Lewis (2008) point out, a more appropriate term is probably digital fingerprints. When you create footprints, you are aware of what is happening and your footprints cannot generally be traced to you personally. The same is not true for your digital traces. In fact, you are leaving traces all the time about which you have very little knowledge. And, although these traces don’t have your name on them, they can often be linked back to you. In other words, they are more like fingerprints: invisible and personally identifying.
For more on why large datasets render statistical tests problematic, see Lin, Lucas, and Shmueli (2013) and McFarland and McFarland (2015). These issues should lead researchers to focus on practical significance rather than statistical significance.
When considering always-on data, it is important to consider whether you are comparing the exact same people over time or whether you are comparing some changing group of people; see, for example, Diaz et al. (2016).
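To make the distinction between statistical and practical significance concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method taken from the works cited above: the function names, the difference in means of 0.01, and the standard deviation of 1.0 are all hypothetical, and the test assumes two equal-sized groups with a known common standard deviation. The effect size never changes, yet the p-value shrinks as the sample grows.

```python
import math


def two_sample_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, assuming a known common standard deviation (a z-test)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def standardized_effect(mean_diff, sd):
    """Difference in means expressed in standard deviation units."""
    return mean_diff / sd


if __name__ == "__main__":
    # Hypothetical values: a difference of one hundredth of a standard deviation.
    mean_diff, sd = 0.01, 1.0
    for n in (1_000, 100_000, 10_000_000):
        p = two_sample_z_pvalue(mean_diff, sd, n)
        d = standardized_effect(mean_diff, sd)
        print(f"n per group = {n:>10,}  effect = {d:.2f} sd  p-value = {p:.3g}")
```

In this sketch the p-value depends on the sample size, while the standardized effect does not, which is one reason to report and interpret effect sizes when working with very large datasets.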
A classic book on non-reactive measures is Webb et al. (1966). The examples in the book pre-date the digital age, but they are still illuminating. For examples of people changing their behavior because of the presence of mass surveillance, see Penney (2016) and Brayne (2014).
For more on record linkage, see Dunn (1946) and Fellegi and Sunter (1969) (historical) and Larsen and Winkler (2014) (modern). Similar approaches have also been developed in computer science under names such as data deduplication, instance identification, name matching, duplicate detection, and duplicate record detection (Elmagarmid, Ipeirotis, and Verykios 2007). There are also privacy-preserving approaches to record linkage that do not require the transmission of personally identifying information (Schnell 2013). Facebook has also developed a procedure to link its records to voting behavior; this was done to evaluate an experiment that I’ll tell you about in Chapter 4 (Bond et al. 2012; Jones et al. 2013).
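As a rough illustration of the probabilistic approach associated with Fellegi and Sunter (1969), the following Python sketch scores a pair of records by summing log-likelihood-ratio weights across fields. It is a simplified sketch, not a faithful implementation of any of the cited methods: the field names, the m-probabilities (chance that a field agrees when two records refer to the same person), and the u-probabilities (chance that it agrees when they do not) are all hypothetical values chosen for the example, and the fields are assumed to be conditionally independent.

```python
import math

# Hypothetical (m, u) probabilities for each comparison field.
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELD_PROBABILITIES = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip_code": (0.85, 0.10),
}


def match_weight(record_a, record_b, field_probabilities=FIELD_PROBABILITIES):
    """Sum of per-field log2 likelihood ratios for a pair of records.

    Agreement on a field adds log2(m / u); disagreement adds
    log2((1 - m) / (1 - u)). Higher totals suggest a more likely match.
    """
    total = 0.0
    for field, (m, u) in field_probabilities.items():
        if record_a[field] == record_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


if __name__ == "__main__":
    # Hypothetical records from two different data sources.
    survey = {"last_name": "garcia", "birth_year": 1980, "zip_code": "08540"}
    admin_same = {"last_name": "garcia", "birth_year": 1980, "zip_code": "08540"}
    admin_moved = {"last_name": "garcia", "birth_year": 1980, "zip_code": "10027"}
    admin_other = {"last_name": "smith", "birth_year": 1975, "zip_code": "94305"}

    for label, candidate in [("same", admin_same), ("moved", admin_moved), ("other", admin_other)]:
        print(label, round(match_weight(survey, candidate), 2))
```

In practice, pairs with a weight above some upper threshold would be treated as links, pairs below a lower threshold as non-links, and pairs in between would be sent for clerical review; choosing those thresholds and estimating the m- and u-probabilities are the hard parts, and they are not shown here.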
For more on construct validity, see Shadish, Cook, and Campbell (2001), Chapter 3.
For more on the AOL search log debacle, see Ohm (2010). I offer advice about partnering with companies and governments in Chapter 4 when I describe experiments. A number of authors have expressed concerns about research that relies on inaccessible data; see Huberman (2012) and boyd and Crawford (2012).
One good way for university researchers to acquire data access is to work at a company as an intern or visiting researcher. In addition to enabling data access, this process will also help the researcher learn more about how the data was created, which is important for analysis.
Non-representativeness is a major problem for researchers and governments who wish to make statements about an entire population. This is less of a concern for companies, which are typically focused on their users. For more on how Statistics Netherlands considers the issue of non-representativeness of business big data, see Buelens et al. (2014).
In Chapter 3, I’ll describe sampling and estimation in much greater detail. Even if data are non-representative, under certain conditions, they can be weighted to produce good estimates.
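One common way to weight non-representative data is post-stratification: estimate the outcome separately within groups, then combine the group estimates using each group's known share of the target population. The Python sketch below is only an illustration under assumptions. The group labels, the sample values, and the population shares are hypothetical, and the approach works well only if, within each group, the people in the sample resemble the people in the population.

```python
from collections import defaultdict


def poststratified_mean(sample, population_shares):
    """Estimate a population mean from (group, value) pairs.

    Each group's sample mean is weighted by that group's share of the
    target population rather than by its share of the sample.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in sample:
        sums[group] += value
        counts[group] += 1

    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sample observations for groups: {sorted(missing)}")

    return sum(
        share * sums[group] / counts[group]
        for group, share in population_shares.items()
    )


if __name__ == "__main__":
    # Hypothetical sample that over-represents the "young" group.
    sample = [("young", 1)] * 6 + [("young", 0)] * 2 + [("old", 1), ("old", 0)]
    # Hypothetical population shares, assumed known from an external source.
    population_shares = {"young": 0.4, "old": 0.6}

    raw_mean = sum(value for _, value in sample) / len(sample)
    print("unweighted sample mean:", raw_mean)
    print("post-stratified estimate:", poststratified_mean(sample, population_shares))
```

The sketch raises an error when a population group has no observations in the sample, because in that case the data alone cannot support an estimate for that group.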
System drift is very hard to see from the outside. However, the MovieLens project (discussed more in Chapter 4) has been run for more than 15 years by an academic research group. As a result, they have been able to document and share information about the way that the system has evolved over time and how this might impact analysis (Harper and Konstan 2015).
A number of scholars have focused on drift in Twitter: Liu, Kliman-Silver, and Mislove (2014) and Tufekci (2014).
I first heard the term “algorithmically confounded” used by Jon Kleinberg in a talk. The main idea behind performativity is that some social science theories are “engines not cameras” (Mackenzie 2008). That is, they actually shape the world rather than just capture it.
Governmental statistical agencies call data cleaning “statistical data editing.” De Waal, Puts, and Daas (2014) describe statistical data editing techniques developed for survey data and examine the extent to which they are applicable to big data sources, and Puts, Daas, and de Waal (2015) present some of the same ideas for a more general audience.
For some examples of studies focused on spam in Twitter, see Clark et al. (2016) and Chu et al. (2012). Finally, Subrahmanian et al. (2016) describe the results of the DARPA Twitter Bot Challenge.
Ohm (2015) reviews earlier research on the idea of sensitive information and offers a multi-factor test. The four factors he proposes are the magnitude of harm, the probability of harm, the presence of a confidential relationship, and whether the risk reflects majoritarian concerns.
Farber’s study of taxis in New York was based on an earlier study by Camerer et al. (1997) that used three different convenience samples of paper trip sheets—paper forms used by drivers to record trip start time, end time, and fare. This earlier study found that drivers seemed to be target earners: they worked less on days where their wages were higher.
Kossinets and Watts (2009) was focused on the origins of homophily in social networks. See Wimmer and Lewis (2010) for a different approach to the same problem which uses data from Facebook.
In subsequent work, King and colleagues have further explored online censorship in China (King, Pan, and Roberts 2014; King, Pan, and Roberts 2016). For a related approach to measuring online censorship in China, see Bamman, O’Connor, and Smith (2012). For more on statistical methods like the one used in King, Pan, and Roberts (2013) to estimate the sentiment of the 11 million posts, see Hopkins and King (2010). For more on supervised learning, see James et al. (2013) (less technical) and Hastie, Tibshirani, and Friedman (2009) (more technical).
Forecasting is a big part of industrial data science (Mayer-Schönberger and Cukier 2013; Provost and Fawcett 2013). One type of forecasting that is commonly done by social researchers is demographic forecasting; see, for example, Raftery et al. (2012).
Google Flu Trends was not the first project to use search data to nowcast influenza prevalence. In fact, researchers in the United States (Polgreen et al. 2008; Ginsberg et al. 2009) and Sweden (Hulth, Rydevik, and Linde 2009) have found that certain search terms (e.g., “flu”) predicted national public health surveillance data before it was released. Subsequently many, many other projects have tried to use digital trace data for disease surveillance detection; see Althouse et al. (2015) for a review.
In addition to using digital trace data to predict health outcomes, there has also been a huge amount of work using Twitter data to predict election outcomes; for reviews see Gayo-Avello (2011), Gayo-Avello (2013), Jungherr (2015) (Ch. 7), and Huberty (2015).
Using search data to predict influenza prevalence and using Twitter data to predict elections are both examples of using some kind of digital trace to predict some kind of event in the world. There are an enormous number of studies that have this general structure. Table 2.5 includes a few other examples.
|Digital trace||Outcome||Citation|
|Twitter||Box office revenue of movies in the US||Asur and Huberman (2010)|
|Search logs||Sales of movies, music, books, and video games in the US||Goel et al. (2010)|
|Twitter||Dow Jones Industrial Average (US stock market)||Bollen, Mao, and Zeng (2011)|
The journal PS: Political Science & Politics had a symposium on big data, causal inference, and formal theory, and Clark and Golder (2015) summarize each contribution. The journal Proceedings of the National Academy of Sciences of the United States of America had a symposium on causal inference and big data, and Shiffrin (2016) summarizes each contribution.
In terms of natural experiments, Dunning (2012) provides an excellent book-length treatment. For more on using the Vietnam draft lottery as a natural experiment, see Berinsky and Chatfield (2015). For machine learning approaches that attempt to automatically discover natural experiments inside of big data sources, see Jensen et al. (2008) and Sharma, Hofman, and Watts (2015).
In terms of matching, for an optimistic review, see Stuart (2010), and for a pessimistic review see Sekhon (2009). For more on matching as a kind of pruning, see Ho et al. (2007). For books that provide excellent treatments of matching, see Rosenbaum (2002), Rosenbaum (2009), Morgan and Winship (2014), and Imbens and Rubin (2015).