Big data sources are everywhere, but using them for social research can be tricky. In my experience, there is something like a “no free lunch” rule for data: if you don’t put in a lot of work collecting it, then you are probably going to have to put in a lot of work thinking about it and analyzing it.
The big data sources of today—and likely tomorrow—will tend to have ten characteristics. Three of these are generally (but not always) helpful for research: big, always-on, and nonreactive. Seven are generally (but not always) problematic for research: incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive. Many of these characteristics ultimately arise because big data sources were not created for the purpose of social research.
Based on the ideas in this chapter, I think that there are three main ways that big data sources will be most valuable for social research. First, they can enable researchers to decide between competing theoretical predictions. Examples of this kind of work include Farber (2015) (New York Taxi drivers) and King, Pan, and Roberts (2013) (censorship in China). Second, big data sources can enable improved measurement for policy through nowcasting. An example of this kind of work is Ginsberg et al. (2009) (Google Flu Trends). Finally, big data sources can help researchers make causal estimates without running experiments. Examples of this kind of work are Mas and Moretti (2009) (peer effects on productivity) and Einav et al. (2015) (effect of starting price on auctions at eBay). Each of these approaches, however, tends to require researchers to bring a lot to the data, such as the definition of a quantity that is important to estimate or two theories that make competing predictions. Thus, I think the best way to think about what big data sources can do is that they can help researchers who can ask interesting and important questions.
Before concluding, I think it is worth considering that big data sources may have an important effect on the relationship between data and theory. So far, this chapter has taken the approach of theory-driven empirical research. But big data sources also enable researchers to do empirically driven theorizing. That is, through the careful accumulation of empirical facts, patterns, and puzzles, researchers can build new theories. This alternative, data-first approach to theory is not new, and it was most forcefully articulated by Barney Glaser and Anselm Strauss (1967) with their call for grounded theory. This data-first approach, however, does not imply “the end of theory,” as has been claimed in some of the journalism around research in the digital age (Anderson 2008). Rather, as the data environment changes, we should expect a rebalancing in the relationship between data and theory. In a world where data collection was expensive, it made sense to collect only the data that theories suggested would be the most useful. But, in a world where enormous amounts of data are already available for free, it makes sense to also try a data-first approach (Goldberg 2015).
As I have shown in this chapter, researchers can learn a lot by watching people. In the next three chapters, I’ll describe how we can learn more and different things if we tailor our data collection and interact with people more directly by asking them questions (chapter 3), running experiments (chapter 4), and even involving them in the research process directly (chapter 5).