Measurement is much less likely to change behavior in big data sources.

One challenge of social research is that people can change their behavior when they know that they are being observed by researchers. Social scientists generally call this change in behavior in response to researcher measurement “reactivity” (Webb et al. 1966). One aspect of big data that many researchers find promising is that participants are generally not aware that their data are being captured, or they have become so accustomed to this data collection that it no longer changes their behavior. Because participants are non-reactive, many sources of big data can be used to study behavior that has not previously been amenable to accurate measurement. For example, Stephens-Davidowitz (2014) used the prevalence of racist terms in search engine queries to measure racial animus in different regions of the United States. The non-reactive and big (see previous section) nature of search data enabled measurements that would be difficult using other methods, such as surveys.

Non-reactivity, however, does not ensure that these data are a direct reflection of people’s behavior or attitudes. For example, as one respondent told Newman et al. (2011), “It’s not that I don’t have problems, I’m just not putting them on Facebook.” In other words, even though some big data sources are non-reactive, they are not always free of social desirability bias, the tendency for people to want to present themselves in the best possible way. Further, these data sources are sometimes shaped by the goals of platform owners, a problem called algorithmic confounding (described more below).

Although non-reactivity is advantageous for research, tracking people’s behavior without their consent and awareness raises ethical concerns that are discussed below and, in more detail, in Chapter 6. A public backlash against increased digital surveillance could lead big data systems to become more reactive over time, and strong concern about digital surveillance could even lead some people to attempt to opt out of big data systems completely, increasing concerns about non-representativity (described more below).

These three good properties of big data for social research (big, always-on, and non-reactive) generally arise because these data sources were not created by researchers for research. Now, I’ll turn to the seven properties of big data sources that are bad for research. These properties also tend to arise because these data were not created by researchers for research.