2.2 Big data

Big data are created and collected by companies and governments for purposes other than research. Using these data for research, therefore, requires repurposing.

The first way that many people encounter social research in the digital age is through what is often called big data. Despite the widespread use of this term, there is no consensus about what big data even is. However, one of the most common definitions of big data focuses on the “3 Vs”: Volume, Variety, and Velocity. Roughly, there is a lot of data, in a variety of formats, and it is being created constantly. Some fans of big data also add other “Vs” such as Veracity and Value, whereas some critics add Vs such as Vague and Vacuous. Rather than the 3 “Vs” (or the 5 “Vs” or the 7 “Vs”), for the purposes of social research, I think a better place to start is the 5 “Ws”: Who, What, Where, When, and Why. In fact, I think that many of the challenges and opportunities created by big data sources follow from just one “W”: Why.

In the analog age, most of the data that were used for social research were created for the purpose of doing research. In the digital age, however, a huge amount of data is being created by companies and governments for purposes other than research, such as providing services, generating profit, and administering laws. Creative people, however, have realized that you can repurpose this corporate and government data for research. Thinking back to the art analogy in chapter 1, just as Duchamp repurposed a found object to create art, scientists can now repurpose found data to create research.

While there are undoubtedly huge opportunities for repurposing, using data that were not created for the purposes of research also presents new challenges. Compare, for example, a social media service, such as Twitter, with a traditional public opinion survey, such as the General Social Survey. Twitter’s main goals are to provide a service to its users and to make a profit. The General Social Survey, on the other hand, is focused on creating general-purpose data for social research, particularly for public opinion research. This difference in goals means that the data created by Twitter and that created by the General Social Survey have different properties, even though both can be used for studying public opinion. Twitter operates at a scale and speed that the General Social Survey cannot match, but, unlike the General Social Survey, Twitter does not carefully sample users and does not work hard to maintain comparability over time. Because these two data sources are so different, it does not make sense to say that the General Social Survey is better than Twitter or vice versa. If you want hourly measures of global mood (e.g., Golder and Macy (2011)), Twitter is best. On the other hand, if you want to understand long-term changes in the polarization of attitudes in the United States (e.g., DiMaggio, Evans, and Bryson (1996)), then the General Social Survey is the best choice. More generally, rather than trying to argue that big data sources are better or worse than other types of data, this chapter will try to clarify for which kinds of research questions big data sources have attractive properties and for which kinds of questions they might not be ideal.

When thinking about big data sources, many researchers immediately focus on online data created and collected by companies, such as search engine logs and social media posts. However, this narrow focus leaves out two other important sources of big data. First, corporate big data sources increasingly come from digital devices in the physical world. For example, in this chapter, I’ll tell you about a study that repurposed supermarket check-out data to study how a worker’s productivity is affected by the productivity of her peers (Mas and Moretti 2009). Then, in later chapters, I’ll tell you about researchers who used call records from mobile phones (Blumenstock, Cadamuro, and On 2015) and billing data created by electric utilities (Allcott 2015). As these examples illustrate, corporate big data sources are about more than just online behavior.

The second important source of big data missed by a narrow focus on online behavior is data created by governments. These government data, which researchers call government administrative records, include things such as tax records, school records, and vital statistics records (e.g., registries of births and deaths). Governments have been creating these kinds of data for, in some cases, hundreds of years, and social scientists have been exploiting them for nearly as long as there have been social scientists. What has changed, however, is digitization, which has made it dramatically easier for governments to collect, transmit, store, and analyze data. For example, in this chapter, I’ll tell you about a study that repurposed data from New York City government’s digital taxi meters in order to address a fundamental debate in labor economics (Farber 2015). Then, in later chapters, I’ll tell you about how government-collected voting records were used in a survey (Ansolabehere and Hersh 2012) and an experiment (Bond et al. 2012).

I think the idea of repurposing is fundamental to learning from big data sources, and so, before talking more specifically about the properties of big data sources (section 2.3) and how these can be used in research (section 2.4), I’d like to offer two pieces of general advice about repurposing. First, it can be tempting to think about the contrast that I’ve set up as being between “found” data and “designed” data. That’s close, but it’s not quite right. Even though, from the perspective of researchers, big data sources are “found,” they don’t just fall from the sky. Instead, data sources that are “found” by researchers are designed by someone for some purpose. Because “found” data are designed by someone, I always recommend that you try to understand as much as possible about the people and processes that created your data. Second, when you are repurposing data, it is often extremely helpful to imagine the ideal dataset for your problem and then compare that ideal dataset with the one that you are using. If you didn’t collect your data yourself, there are likely to be important differences between what you want and what you have. Noticing these differences will help clarify what you can and cannot learn from the data you have, and it might suggest new data that you should collect.

In my experience, social scientists and data scientists tend to approach repurposing very differently. Social scientists, who are accustomed to working with data designed for research, are typically quick to point out the problems with repurposed data while ignoring its strengths. On the other hand, data scientists are typically quick to point out the benefits of repurposed data while ignoring its weaknesses. Naturally, the best approach is a hybrid. That is, researchers need to understand the characteristics of big data sources, both good and bad, and then figure out how to learn from them. And that is the plan for the remainder of this chapter. In the next section, I will describe ten common characteristics of big data sources. Then, in the following section, I will describe three research approaches that can work well with such data.