2.3.1.2 Always-on

You are reading the Open Review Edition of Bit by Bit.

Always-on big data enables the study of unexpected events and real-time measurement.

Many big data systems are always-on; they are constantly collecting data. This always-on characteristic provides researchers with longitudinal data (i.e., data over time). Being always-on has two important implications for research.

First, always-on data collection enables researchers to study unexpected events in ways that were not previously possible. For example, researchers interested in studying the Occupy Gezi protests in Turkey in the summer of 2013 would typically focus on the behavior of protesters during the event. Ceren Budak and Duncan Watts (2015) were able to do more by using the always-on nature of Twitter to study Twitter-using protesters before, during, and after the event. They were also able to create a comparison group of non-participants (or participants who did not tweet about the protest) before, during, and after the event (Figure 2.1). In total, their ex-post panel included the tweets of 30,000 people over two years. By augmenting the commonly used data from the protests with this other information, Budak and Watts were able to learn much more: they were able to estimate what kinds of people were more likely to participate in the Gezi protests and to estimate the changes in attitudes of participants and non-participants, both in the short term (comparing pre-Gezi to during Gezi) and in the long term (comparing pre-Gezi to post-Gezi).

Figure 2.1: Design used by Budak and Watts (2015) to study the Occupy Gezi protests in Turkey in the summer of 2013. By using the always-on nature of Twitter, the researchers created what they called an ex-post panel that included about 30,000 people over two years. In contrast to the typical study, which focused on participants during the protests, the ex-post panel adds 1) data from participants before and after the event and 2) data from non-participants before, during, and after the event. This enriched data structure enabled Budak and Watts to estimate what kinds of people were more likely to participate in the Gezi protests and to estimate the changes in attitudes of participants and non-participants, both in the short term (comparing pre-Gezi to during Gezi) and in the long term (comparing pre-Gezi to post-Gezi).
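To make the structure of an ex-post panel concrete, here is a minimal Python sketch. It is not the procedure Budak and Watts used. It only illustrates the general idea of taking an archive of timestamped posts and, after the fact, sorting it into the six cells of the design: participants and non-participants, each observed before, during, and after the event. The event dates, the toy records, the `#protest` marker, and the function names are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event window; placeholder dates, not taken from the study.
EVENT_START = date(2013, 6, 1)
EVENT_END = date(2013, 8, 31)


def period_of(day, start=EVENT_START, end=EVENT_END):
    """Label a date as 'pre', 'during', or 'post' relative to the event window."""
    if day < start:
        return "pre"
    if day <= end:
        return "during"
    return "post"


def build_ex_post_panel(records, is_protest_post):
    """Group archived posts into (group, period) -> {user: post count}.

    records: iterable of (user_id, day, text) tuples.
    is_protest_post: hypothetical predicate that flags protest-related text.
    """
    records = list(records)
    # Assumption: a user is a participant if any post during the event is flagged.
    participants = {
        user
        for user, day, text in records
        if period_of(day) == "during" and is_protest_post(text)
    }
    panel = defaultdict(lambda: defaultdict(int))
    for user, day, text in records:
        group = "participant" if user in participants else "non-participant"
        panel[(group, period_of(day))][user] += 1
    return panel


# Toy archive: invented records, for illustration only.
toy_records = [
    ("a", date(2013, 3, 1), "weekend plans"),
    ("a", date(2013, 6, 15), "at the park #protest"),
    ("a", date(2013, 11, 2), "back to work"),
    ("b", date(2013, 3, 5), "football tonight"),
    ("b", date(2013, 6, 20), "traffic is terrible"),
    ("b", date(2013, 12, 1), "new year plans"),
]

panel = build_ex_post_panel(toy_records, lambda text: "#protest" in text)
for cell in sorted(panel):
    print(cell, dict(panel[cell]))
```

The key point of the sketch is that group membership is decided using information from the event itself, yet the pre-event cells can still be filled, because the always-on source had already recorded those posts.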

It is true that some of these estimates could have been made without always-on data collection sources (e.g., long-term estimates of attitude change), although such data collection for 30,000 people would have been quite expensive. And, even given an unlimited budget, I can’t think of any other method that essentially allows researchers to travel back in time and directly observe participants' behavior in the past. The closest alternative would be to collect retrospective reports of behavior, but these reports would be of limited granularity and questionable accuracy. Table 2.1 provides other examples of studies that use an always-on data source to study an unexpected event.

Table 2.1: Studies of unexpected events using always-on big data sources.
Unexpected event	Always-on data source	Citation
Occupy Gezi movement in Turkey	Twitter	Budak and Watts (2015)
Umbrella protests in Hong Kong	Weibo	Zhang (2016)
Shootings of police in New York City	Stop-and-frisk reports	Legewie (2016)
Person joining ISIS	Twitter	Magdy, Darwish, and Weber (2016)
September 11, 2001 attack	livejournal.com	Cohn, Mehl, and Pennebaker (2004)
September 11, 2001 attack	pager messages	Back, Küfner, and Egloff (2010); Pury (2011); Back, Küfner, and Egloff (2011)

Second, always-on data collection enables researchers to produce real-time measurements, which can be important in settings where policy makers want to not just learn from existing behavior but also respond to it. For example, social media data can be used to guide responses to natural disasters (Castillo 2016).
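As a rough illustration of what real-time measurement can mean in practice, the following Python sketch keeps a running count of events that arrived within a recent time window. It is a generic sliding-window counter, not a method taken from Castillo (2016). The class name, the 60-second window, and the timestamps are hypothetical, and the sketch assumes that events arrive in time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the most recent `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one event; timestamps are assumed to arrive in order."""
        self._timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        """Return the number of events in the window ending at `now`."""
        self._evict(now)
        return len(self._timestamps)

    def _evict(self, now):
        # Drop events that fall at or before the start of the window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()


# Invented timestamps (in seconds) standing in for matching messages.
counter = SlidingWindowCounter(window_seconds=60)
for t in [0, 10, 50, 70, 115]:
    counter.add(t)

print(counter.count(now=120))  # prints 2 (the events at 70 and 115)
```

Because old events are discarded as new ones arrive, the count is always available immediately, which is the property that matters when someone wants to respond to behavior rather than only study it afterward.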

In conclusion, always-on data systems enable researchers to study unexpected events and provide real-time information to policy makers. I did not, however, claim that always-on data systems enable researchers to track changes over long periods of time. That is because many big data systems are constantly changing, a process called drift (Section 2.3.2.4).