2.2 Big data

You are reading the Open Review Edition of Bit by Bit.

Big data are created and collected by companies and governments for purposes other than research. Using these data for research, therefore, requires repurposing.

An idealized view of social research imagines a scientist having an idea and then collecting data to test that idea. This style of research leads to a tight fit between research question and data, but it is limited because individual researchers often do not have the resources needed to collect the data they need, such as large, rich, and nationally representative data. Therefore, a lot of social research in the past has used large-scale social surveys, such as the General Social Survey (GSS), the American National Election Study (ANES), and the Panel Study of Income Dynamics (PSID). These large-scale surveys are generally run by a team of researchers, and they are designed to create data that can be used by many researchers. Because of the goals of these large-scale surveys, great care is put into designing the data collection and preparing the resulting data for use by researchers. These data are by researchers and for researchers.

Most social research using digital age sources, however, is fundamentally different. Instead of using data collected by researchers and for researchers, it uses data sources that were created and collected by businesses and governments for their own purposes, such as making a profit, providing a service, or administering a law. These business and government data sources have come to be called big data. Doing research with big data is different from doing research with data that were originally created for research. Compare, for example, a social media website, such as Twitter, with a traditional public opinion survey, such as the GSS. Twitter’s main goals are to provide a service to its users and to make a profit. In the process of achieving these goals, Twitter creates data that might be useful for studying certain aspects of public opinion. But, unlike the GSS, Twitter is not primarily focused on social research.

The term big data is frustratingly vague, and it groups together many different things. For the purposes of social research, I think it is helpful to distinguish between two kinds of big data sources: government administrative records and business administrative records. Government administrative records are data that are created by governments as part of their routine activities. These kinds of records have been used by researchers in the past (for example, by demographers studying birth, marriage, and death records), but governments are increasingly collecting and releasing detailed records in analyzable forms. For example, the New York City government installed digital meters inside every taxi in the city. These meters record all kinds of data about each taxi ride, including the driver, the start time and location, the stop time and location, and the fare. In a study that I’ll tell you about later in this chapter, Henry Farber (2015) repurposed these data to address a fundamental debate in labor economics about the relationship between hourly wages and the number of hours worked.
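To make the idea of repurposing a little more concrete, here is a minimal sketch in Python of how trip-level meter records might be turned into the two quantities in that debate: hours worked and hourly earnings. It is not Farber's actual analysis. The field names, the made-up toy records, and the definition of hours worked (the span from a driver's first pickup to last drop-off on a given day) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Trip:
    """One taxi ride. Field names are hypothetical, not the real schema."""
    driver_id: str
    start: datetime
    stop: datetime
    fare: float


def summarize_by_driver_day(trips):
    """Return {(driver_id, date): (hours_worked, hourly_earnings)}.

    Assumption: hours worked is the span from the first pickup to the
    last drop-off on a given day. Real analyses would need a more
    careful definition of a shift.
    """
    grouped = defaultdict(list)
    for trip in trips:
        grouped[(trip.driver_id, trip.start.date())].append(trip)

    summary = {}
    for key, day_trips in grouped.items():
        first_start = min(t.start for t in day_trips)
        last_stop = max(t.stop for t in day_trips)
        hours = (last_stop - first_start).total_seconds() / 3600
        total_fare = sum(t.fare for t in day_trips)
        hourly = total_fare / hours if hours > 0 else 0.0
        summary[key] = (hours, hourly)
    return summary


# Made-up records for illustration only; these are not real taxi data.
toy_trips = [
    Trip("driver_a", datetime(2020, 1, 1, 8, 0), datetime(2020, 1, 1, 8, 30), 20.0),
    Trip("driver_a", datetime(2020, 1, 1, 9, 0), datetime(2020, 1, 1, 10, 0), 40.0),
    Trip("driver_b", datetime(2020, 1, 1, 12, 0), datetime(2020, 1, 1, 13, 0), 15.0),
]
summary = summarize_by_driver_day(toy_trips)
```

Notice that nothing in the meter records was designed to measure labor supply. The researcher has to construct those measures from fields that were recorded for billing and regulation, and each construction step embeds a choice that could be made differently.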

The second main type of big data for social research is business administrative records. These are data that businesses create and collect as part of their routine activities. These business administrative records are often called digital traces, and they include things like search engine query logs, social media posts, and call records from mobile phones. Critically, these business administrative records are not just about online behavior. For example, stores that use check-out scanners are creating real-time measures of worker productivity. In a study that I’ll tell you about later in this chapter, Alexandre Mas and Enrico Moretti (2009) repurposed these supermarket check-out data to study how a worker’s productivity is affected by the productivity of their peers.

As both of these examples illustrate, the idea of repurposing is fundamental to learning from big data. In my experience, social scientists and data scientists approach this repurposing very differently. Social scientists, who are accustomed to working with data designed for research, are quick to point out the problems with repurposed data while ignoring their strengths. On the other hand, data scientists are quick to point out the benefits of repurposed data while ignoring their weaknesses. Naturally, the best approach would be a hybrid. That is, researchers need to understand the characteristics of these new sources of data, both good and bad, and then figure out how to learn from them. And that is the plan for the remainder of this chapter. Next, I will describe ten common characteristics of business and government administrative data. After that, I will describe three research approaches that are well suited to the characteristics of these data.