Population drift, behavioral drift, and system drift make it hard to use big data sources to study long-term trends.

One of the great advantages of many big data sources is that they collect data over time. Social scientists call this kind of over-time data longitudinal data. And, naturally, longitudinal data are very important for studying change. In order to reliably measure change, however, the measurement system itself must be stable. In the words of sociologist Otis Dudley Duncan, “if you want to measure change, don’t change the measure” (Fischer 2011).

Unfortunately, many big data systems, especially business systems that create and capture digital traces, are changing all the time, a process that I’ll call drift. In particular, these systems change in three main ways: population drift (change in who is using them), behavioral drift (change in how people are using them), and system drift (change in the system itself). The three sources of drift mean that any pattern in digital trace data could be caused by an important change in the world, or it could be caused by some form of drift.

The first source of drift, population drift, is change in who is using the system, and it happens on both long and short time scales. For example, from 2008 to the present the average age of people on social media has increased. In addition to these long-term trends, the people using a system at any moment vary. For example, during the US Presidential election of 2012 the proportion of tweets about politics that were written by women fluctuated from day to day (Diaz et al. 2016). Thus, what might appear to be a change in the mood of the Twitter-verse might actually just be a change in who is talking at any moment.
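To make the arithmetic behind population drift concrete, here is a minimal sketch in Python. The two groups, their mood scores, and their daily shares are made-up numbers chosen for illustration, not data from any study, and `overall_mean` is a hypothetical helper. The point is only that an aggregate can move even when no group changes its behavior.

```python
# Illustrative sketch with made-up numbers (not real data).
# Each group's average "mood" score is held constant across days;
# only the share of posts written by each group changes.

def overall_mean(shares, group_means):
    """Aggregate mean as a share-weighted average of group means."""
    return sum(shares[g] * group_means[g] for g in shares)

# Hypothetical per-group mood scores, identical on both days.
group_means = {"group_a": 0.2, "group_b": 0.6}

# Hypothetical share of posts written by each group on two days.
shares_day1 = {"group_a": 0.5, "group_b": 0.5}
shares_day2 = {"group_a": 0.8, "group_b": 0.2}

mood_day1 = overall_mean(shares_day1, group_means)
mood_day2 = overall_mean(shares_day2, group_means)

# The aggregate falls even though neither group's mood changed.
print(round(mood_day1, 2), round(mood_day2, 2))

# Holding the population mix fixed at a reference composition
# removes the apparent change, because group means did not move.
reference_shares = shares_day1
adjusted_day2 = overall_mean(reference_shares, group_means)
print(round(adjusted_day2, 2))
```

In this toy case the adjustment works only because the groups are observed and their shares are known. With real digital trace data, who is posting is often unobserved, which is part of what makes population drift hard to correct.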

In addition to changes in who is using a system, there are also changes in how the system is used, which I’ll call behavioral drift. For example, during the Occupy Gezi Park protests in Istanbul, Turkey, in 2013, protesters changed their use of hashtags as the protest evolved. Here’s how Zeynep Tufekci (2014) described the drift, which she was able to detect because she was observing behavior on Twitter and on the ground:

“What had happened was that as soon as the protest became the dominant story, large numbers of people . . . stopped using the hashtags except to draw attention to a new phenomenon . . .. While the protests continued, and even intensified, the hashtags died down. Interviews revealed two reasons for this. First, once everyone knew the topic, the hashtag was at once superfluous and wasteful on the character-limited Twitter platform. Second, hashtags were seen only as useful for attracting attention to a particular topic, not for talking about it.”

Thus, researchers who were studying the protests by analyzing tweets with protest-related hashtags would have a distorted sense of what was happening because of this behavioral drift. For example, they might believe that the discussion of the protest decreased long before it actually decreased.
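The distortion can be sketched with a toy example in Python. The tweets, the `#gezi` hashtag string, the keyword list, and the `daily_counts` helper below are all hypothetical and chosen only for illustration. The sketch counts the same stream of posts in two ways: by hashtag and by a rough keyword match.

```python
from collections import Counter

def daily_counts(tweets, predicate):
    """Count, per day, the tweets whose text satisfies the predicate."""
    counts = Counter()
    for day, text in tweets:
        if predicate(text):
            counts[day] += 1
    return counts

# Made-up (day, text) pairs; not real tweets.
tweets = [
    (1, "Police at the park #gezi"),
    (1, "Heading to the square #gezi"),
    (2, "Crowds at the park are growing"),
    (2, "More people joining the protest tonight"),
    (2, "New: water cannons near the park #gezi"),
]

def has_hashtag(text):
    # Hypothetical hashtag filter a researcher might use.
    return "#gezi" in text.lower()

def mentions_topic(text):
    # Crude keyword match; the keyword list is an assumption.
    return any(k in text.lower() for k in ("park", "protest", "square"))

hashtag_counts = daily_counts(tweets, has_hashtag)
topic_counts = daily_counts(tweets, mentions_topic)

# Hashtag use falls from day 1 to day 2 while topical posts rise.
print(dict(hashtag_counts))
print(dict(topic_counts))
```

A keyword filter is not a fix, since vocabulary can drift just as hashtags do. The sketch only shows how a hashtag-based measure and the underlying activity can move in opposite directions.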

The third kind of drift is system drift. In this case, it is not the people changing or their behavior changing, but the system itself changing. For example, over time Facebook has increased the limit on the length of status updates. Thus, any longitudinal study of status updates will be vulnerable to artifacts caused by this change. System drift is closely related to a problem called algorithmic confounding, to which we now turn.