2.3.2.6 Dirty

Big data sources can be loaded with junk and spam.

Some researchers believe that big data sources, especially online sources, are pristine because they are collected automatically. In fact, people who have worked with big data sources know that they are frequently dirty. That is, they frequently include data that do not reflect real actions of interest to researchers. Many social scientists are already familiar with the process of cleaning large-scale social survey data, but cleaning big data sources is more difficult for two reasons: (1) they were not created by researchers for researchers and (2) researchers generally have less understanding of how they were created.

The dangers of dirty digital trace data are illustrated by Back and colleagues’ (2010) study of the emotional response to the attacks of September 11, 2001. Researchers typically study the response to tragic events using retrospective data collected over months or even years. But Back and colleagues found an always-on source of digital traces (the timestamped, automatically recorded messages from 85,000 American pagers), and this enabled them to study emotional response on a much finer timescale. They created a minute-by-minute emotional timeline of September 11th by coding the emotional content of the pager messages as the percentage of words related to (1) sadness (e.g., crying, grief), (2) anxiety (e.g., worried, fearful), and (3) anger (e.g., hate, critical). They found that sadness and anxiety fluctuated throughout the day without a strong pattern, but that anger increased strikingly as the day went on. This research seems to be a wonderful illustration of the power of always-on data sources: using standard methods, it would be impossible to construct such a high-resolution timeline of the immediate response to an unexpected event.
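To make this style of dictionary-based coding concrete, here is a minimal sketch in Python. The word lists, the scoring rule, and the example message are illustrative assumptions for this sketch; they are not the actual lexicon or procedure used by Back and colleagues.

```python
import re

# Illustrative word lists: stand-ins for the kinds of categories described
# above, not the lexicon actually used in the study.
ANGER_WORDS = {"hate", "critical", "furious", "annoyed"}
SADNESS_WORDS = {"crying", "grief", "sad", "mourn"}
ANXIETY_WORDS = {"worried", "fearful", "nervous", "afraid"}

def emotion_share(message, lexicon):
    """Return the fraction of words in `message` that appear in `lexicon`."""
    words = re.findall(r"[a-z']+", message.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in lexicon) / len(words)

# A hypothetical pager message scored against the anger list. It counts as
# "angry" only because it contains the word "critical".
msg = "Reboot NT machine abc in cabinet xyz at hq: CRITICAL"
print(emotion_share(msg, ANGER_WORDS))
```

The final example already hints at the problem described next: a purely mechanical word count cannot tell that “CRITICAL” in a server alert has nothing to do with anger.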

Just one year later, however, Cynthia Pury (2011) looked at the data more carefully. She discovered that a large number of the supposedly angry messages were generated by a single pager and they were all identical. Here’s what those supposedly angry messages said:

“Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time]”

These messages were labeled angry because they included the word “CRITICAL”, which may generally indicate anger but does not in this case. Removing the messages generated by this single automated pager completely eliminates the apparent increase in anger over the course of the day (Figure 2.2). In other words, the main result in Back, Küfner, and Egloff (2010) was an artifact of one pager. As this example illustrates, relatively simple analysis of relatively complex and messy data has the potential to go seriously wrong.
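At its core, Pury’s correction is a simple diagnostic: identify the messages that all come from one automated sender and recompute the trend without them. The sketch below illustrates that logic on a hypothetical table of scored messages; the column names, the toy values, and the rule for flagging a pager as automated are assumptions made for illustration, not Pury’s actual procedure.

```python
import pandas as pd

# Hypothetical data: one row per pager message, with the sender, the minute
# of the day, the dictionary-based anger score, and the raw text.
messages = pd.DataFrame({
    "pager_id": ["p1", "p1", "p2", "p3", "p1"],
    "minute":   [540, 600, 600, 660, 660],
    "anger":    [0.10, 0.10, 0.00, 0.01, 0.10],
    "text":     ["Reboot NT machine ... CRITICAL ...",
                 "Reboot NT machine ... CRITICAL ...",
                 "meet me at the corner",
                 "are you ok?",
                 "Reboot NT machine ... CRITICAL ..."],
})

# Trend computed over all messages.
raw_trend = messages.groupby("minute")["anger"].mean()

# Flag pagers that send the same text repeatedly (a crude, illustrative rule
# for "automated") and recompute the trend without them.
repeats = messages.groupby(["pager_id", "text"]).size()
automated = repeats[repeats > 1].index.get_level_values("pager_id").unique()
clean_trend = (messages[~messages["pager_id"].isin(automated)]
               .groupby("minute")["anger"].mean())

print(raw_trend, clean_trend, sep="\n\n")
```

In this toy example, the single repetitive pager is enough to create an apparent anger signal that disappears once it is removed, which is exactly the kind of fragility a careful researcher should check for.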

Figure 2.2: Estimated trends in anger over the course of September 11, 2001 based on 85,000 American pagers (Back, Küfner, and Egloff 2010; Pury 2011; Back, Küfner, and Egloff 2011). Originally, Back, Küfner, and Egloff (2010) reported a pattern of increasing anger throughout the day. However, most of these apparently angry messages were generated by a single pager that repeatedly sent out the following message: “Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time]”. With this message removed, the apparent increase in anger disappears (Pury 2011; Back, Küfner, and Egloff 2011). This figure is a reproduction of Fig 1B in Pury (2011).

While dirty data that are created unintentionally, such as the messages from one noisy pager, can be detected by a reasonably careful researcher, there are also some online systems that attract intentional spammers. These spammers actively generate fake data and, often motivated by profit, work very hard to keep their spamming concealed. For example, political activity on Twitter seems to include at least some reasonably sophisticated spam, whereby some political causes are intentionally made to look more popular than they actually are (Ratkiewicz et al. 2011). Researchers working with data that may contain intentional spam face the challenge of convincing their audience that they have detected and removed the relevant spam.

Finally, what is considered dirty data can depend in subtle ways on your research questions. For example, many edits to Wikipedia are created by automated bots (Geiger 2014). If you are interested in the ecology of Wikipedia, then these bots are important. But if you are interested in how humans contribute to Wikipedia, the edits made by those bots should be excluded.
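If your edit records carry a label indicating whether the account is a bot (however that label was obtained), the decision becomes a simple filter driven by the research question. The record layout below is hypothetical.

```python
# Hypothetical edit records; in practice the bot label might come from
# Wikipedia's bot policies or from separate bot-detection work.
edits = [
    {"page": "Earth", "editor": "SomeBot", "is_bot": True},
    {"page": "Earth", "editor": "alice",   "is_bot": False},
    {"page": "Moon",  "editor": "bob",     "is_bot": False},
]

def edits_for_question(edits, include_bots):
    """Keep or drop bot edits depending on the research question."""
    if include_bots:
        # Studying the ecology of Wikipedia: bots are part of the system.
        return edits
    # Studying human contribution: bot edits are "dirt" and are dropped.
    return [e for e in edits if not e["is_bot"]]

print(len(edits_for_question(edits, include_bots=True)))   # 3
print(len(edits_for_question(edits, include_bots=False)))  # 2
```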

The best ways to avoid being fooled by dirty data are to understand how your data were created and to perform simple exploratory analysis, such as making simple scatter plots.
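For example, a plot with one point per pager, showing how many messages each pager sent against how angry those messages appear, would make a single automated pager stand out immediately. The sketch below assumes a hypothetical table of scored messages like the one above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-message data: sender and dictionary-based anger score.
messages = pd.DataFrame({
    "pager_id": ["p1"] * 6 + ["p2", "p3", "p4"],
    "anger":    [0.10] * 6 + [0.00, 0.01, 0.00],
})

# One point per pager: message volume versus average anger score.
per_pager = messages.groupby("pager_id").agg(
    n_messages=("anger", "size"),
    mean_anger=("anger", "mean"),
)

# A single automated pager sending many high-scoring messages appears as a
# lone point in the upper-right corner of the plot.
plt.scatter(per_pager["n_messages"], per_pager["mean_anger"])
plt.xlabel("Messages per pager")
plt.ylabel("Mean anger score")
plt.show()
```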