2.3.2 Always-on

Always-on big data enables the study of unexpected events and real-time measurement.

Many big data systems are always-on; they are constantly collecting data. This always-on characteristic provides researchers with longitudinal data (i.e., data over time). Being always-on has two important implications for research.

First, always-on data collection enables researchers to study unexpected events in ways that would not otherwise be possible. For example, researchers interested in studying the Occupy Gezi protests in Turkey in the summer of 2013 would typically focus on the behavior of protesters during the event. Ceren Budak and Duncan Watts (2015) were able to do more by using the always-on nature of Twitter to study protesters who used Twitter before, during, and after the event. They were also able to create a comparison group of nonparticipants before, during, and after the event (figure 2.2). In total, their ex-post panel included the tweets of 30,000 people over two years. By augmenting the commonly used data from the protests with this other information, Budak and Watts were able to learn much more: they could estimate what kinds of people were more likely to participate in the Gezi protests, and they could estimate the changes in attitudes of participants and nonparticipants, both in the short term (comparing pre-Gezi with during Gezi) and in the long term (comparing pre-Gezi with post-Gezi).

Figure 2.2: Design used by Budak and Watts (2015) to study the Occupy Gezi protests in Turkey in the summer of 2013. By using the always-on nature of Twitter, the researchers created what they called an ex-post panel that included about 30,000 people over two years. In contrast to a typical study that focused on participants during the protests, the ex-post panel adds 1) data from participants before and after the event and 2) data from nonparticipants before, during, and after the event. This enriched data structure enabled Budak and Watts to estimate what kinds of people were more likely to participate in the Gezi protests and to estimate the changes in attitudes of participants and non-participants, both in the short term (comparing pre-Gezi with during Gezi) and in the long term (comparing pre-Gezi with post-Gezi).
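
To make the structure of an ex-post panel concrete, here is a minimal sketch of how timestamped records from an always-on source could be sorted into the six cells of the design (participants and nonparticipants, each before, during, and after the event). It is an illustration under stated assumptions, not the procedure Budak and Watts used: the event window dates, the record layout, and the names `period`, `build_ex_post_panel`, and `participants` are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Assumed event window, chosen only for illustration; the source says
# only "the summer of 2013".
EVENT_START = date(2013, 6, 1)
EVENT_END = date(2013, 8, 31)


def period(day):
    """Classify a date as before, during, or after the event window."""
    if day < EVENT_START:
        return "before"
    if day <= EVENT_END:
        return "during"
    return "after"


def build_ex_post_panel(records, participants):
    """Count posts in each (group, period) cell of the panel.

    records: iterable of (user_id, day, text) tuples (hypothetical layout).
    participants: set of user ids treated as event participants.
    """
    panel = defaultdict(int)
    for user_id, day, _text in records:
        group = "participant" if user_id in participants else "nonparticipant"
        panel[(group, period(day))] += 1
    return dict(panel)


# Toy records standing in for an archive that was already being collected
# before anyone knew the event would happen.
records = [
    ("a", date(2012, 9, 1), "..."),
    ("a", date(2013, 6, 15), "..."),
    ("a", date(2014, 1, 10), "..."),
    ("b", date(2012, 10, 5), "..."),
    ("b", date(2013, 7, 1), "..."),
    ("b", date(2014, 3, 3), "..."),
    ("b", date(2014, 4, 3), "..."),
]
participants = {"a"}

panel = build_ex_post_panel(records, participants)
for cell in sorted(panel):
    print(cell, panel[cell])
```

The point of the sketch is that the "before" cells can be filled only because the records already exist when the event occurs. A study that begins collecting data once the protests start would populate at most the "during" and "after" cells.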

A skeptic might point out that some of these estimates could have been made without always-on data sources (e.g., long-term estimates of attitude change), and that is correct, although such a data collection for 30,000 people would have been quite expensive. Even given an unlimited budget, however, I can’t think of any other method that essentially allows researchers to travel back in time and directly observe participants’ behavior in the past. The closest alternative would be to collect retrospective reports of behavior, but these reports would be of limited granularity and questionable accuracy. Table 2.1 provides other examples of studies that use an always-on data source to study an unexpected event.

Table 2.1: Studies of unexpected events using always-on big data sources.
Unexpected event | Always-on data source | Citation
Occupy Gezi movement in Turkey | Twitter | Budak and Watts (2015)
Umbrella protests in Hong Kong | Weibo | Zhang (2016)
Shootings of police in New York City | Stop-and-frisk reports | Legewie (2016)
Person joining ISIS | Twitter | Magdy, Darwish, and Weber (2016)
September 11, 2001 attack | livejournal.com | Cohn, Mehl, and Pennebaker (2004)
September 11, 2001 attack | pager messages | Back, Küfner, and Egloff (2010), Pury (2011), Back, Küfner, and Egloff (2011)

Second, in addition to enabling the study of unexpected events, always-on big data systems enable researchers to produce real-time estimates, which can be important in settings where policy makers, in government or industry, want to respond based on situational awareness. For example, social media data can be used to guide emergency response to natural disasters (Castillo 2016), and a variety of big data sources can be used to produce real-time estimates of economic activity (Choi and Varian 2012).
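
As a rough illustration of what a real-time estimate can look like, the sketch below fits a straight line relating a delayed official statistic to an indicator that is available immediately, then uses the current indicator value to estimate the figure that has not yet been published. This is a deliberately simplified sketch, not the model from the cited studies; the function names `fit_line` and `nowcast` and all of the numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nowcast(indicator_history, official_history, indicator_now):
    """Estimate the not-yet-published official figure for the current period."""
    slope, intercept = fit_line(indicator_history, official_history)
    return intercept + slope * indicator_now


# Made-up past periods in which both series are known.
indicator_history = [10, 20, 30, 40]      # always-on signal, available at once
official_history = [105, 205, 305, 405]   # official statistic, published late

# The current period: only the always-on signal is available so far.
estimate = nowcast(indicator_history, official_history, indicator_now=50)
print(estimate)
```

A fitted relationship like this is only as stable as the system that generates the indicator, so it would need to be re-estimated if the underlying data source changes.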

In conclusion, always-on data systems enable researchers to study unexpected events and provide real-time information to policy makers. I do not, however, think that always-on data systems are well suited for tracking changes over very long periods of time. That is because many big data systems are constantly changing—a process that I’ll call drift later in the chapter (section 2.3.7).