2.3.9 Dirty

Big data sources can be loaded with junk and spam.

Some researchers believe that big data sources, especially online sources, are pristine because they are collected automatically. In fact, people who have worked with big data sources know that they are frequently dirty. That is, they frequently include data that do not reflect real actions of interest to researchers. Most social scientists are already familiar with the process of cleaning large-scale social survey data, but cleaning big data sources seems to be more difficult. I think the ultimate source of this difficulty is that many of these big data sources were never intended to be used for research, and so they are not collected, stored, and documented in a way that facilitates data cleaning.

The dangers of dirty digital trace data are illustrated by Back and colleagues’ (2010) study of the emotional response to the attacks of September 11, 2001, which I briefly mentioned earlier in the chapter. Researchers typically study the response to tragic events using retrospective data collected over months or even years. But Back and colleagues found an always-on source of digital traces—the timestamped, automatically recorded messages from 85,000 American pagers—and this enabled them to study emotional response on a much finer timescale. They created a minute-by-minute emotional timeline of September 11 by coding the emotional content of the pager messages according to the percentage of words related to (1) sadness (e.g., “crying” and “grief”), (2) anxiety (e.g., “worried” and “fearful”), and (3) anger (e.g., “hate” and “critical”). They found that sadness and anxiety fluctuated throughout the day without a strong pattern, but that there was a striking increase in anger over the course of the day. This research seems to be a wonderful illustration of the power of always-on data sources: if traditional data sources had been used, it would have been impossible to obtain such a high-resolution timeline of the immediate response to an unexpected event.
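
To make the coding procedure concrete, here is a minimal sketch of word-list-based emotion coding of the kind described above: for each message, compute the percentage of words that appear in a category word list. The tiny lexicons and the function name are illustrative placeholders, not the actual word lists used by Back, Küfner, and Egloff (2010).

```python
import re

# Illustrative placeholder lexicons; the original study used much larger,
# validated word lists for each emotion category.
EMOTION_WORDS = {
    "sadness": {"crying", "grief"},
    "anxiety": {"worried", "fearful"},
    "anger": {"hate", "critical"},
}

def emotion_scores(message):
    """Percentage of words in `message` that fall into each emotion category."""
    words = re.findall(r"[a-z]+", message.lower())
    if not words:
        return {category: 0.0 for category in EMOTION_WORDS}
    return {
        category: 100 * sum(word in lexicon for word in words) / len(words)
        for category, lexicon in EMOTION_WORDS.items()
    }

# Example: a two-word message with one anxiety word scores 50% on anxiety.
print(emotion_scores("feeling worried"))
```

Applied minute by minute to the messages from 85,000 pagers, this kind of simple dictionary count is what produces the emotional timeline.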

Just one year later, however, Cynthia Pury (2011) looked at the data more carefully. She discovered that a large number of the supposedly angry messages were generated by a single pager and they were all identical. Here’s what those supposedly angry messages said:

“Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time]”

These messages were labeled angry because they included the word “CRITICAL,” which may generally indicate anger but in this case does not. Removing the messages generated by this single automated pager completely eliminates the apparent increase in anger over the course of the day (figure 2.4). In other words, the main result in Back, Küfner, and Egloff (2010) was an artifact of one pager. As this example illustrates, relatively simple analysis of relatively complex and messy data has the potential to go seriously wrong.
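
Pury’s correction also suggests a simple diagnostic that a careful researcher can run before trusting such a trend: tally identical messages per sender and check how much of a category comes from a single source. The sketch below assumes each record has hypothetical `sender_id` and `text` fields; adapt the names to however the data are actually stored.

```python
from collections import Counter

def top_repeated_messages(records, n=5):
    """Most frequently repeated (sender, message) pairs."""
    return Counter((r["sender_id"], r["text"]) for r in records).most_common(n)

def share_from_busiest_sender(records, is_angry):
    """Fraction of 'angry' messages that come from the single busiest sender."""
    angry = [r for r in records if is_angry(r["text"])]
    if not angry:
        return 0.0
    sender_counts = Counter(r["sender_id"] for r in angry)
    return sender_counts.most_common(1)[0][1] / len(angry)

# Example usage, with the emotion_scores() sketch from above:
# share_from_busiest_sender(records, lambda t: emotion_scores(t)["anger"] > 0)
```

If one sender accounts for a large share of a category, as the reboot pager did, the trend should be re-estimated with that sender excluded before it is interpreted.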

Figure 2.4: Estimated trends in anger over the course of September 11, 2001 based on 85,000 American pagers (Back, Küfner, and Egloff 2010, 2011; Pury 2011). Originally, Back, Küfner, and Egloff (2010) reported a pattern of increasing anger throughout the day. However, most of these apparently angry messages were generated by a single pager that repeatedly sent out the following message: “Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time]”. With this message removed, the apparent increase in anger disappears (Pury 2011; Back, Küfner, and Egloff 2011). Adapted from Pury (2011), figure 1b.

While dirty data that is created unintentionally—such as that from one noisy pager—can be detected by a reasonably careful researcher, there are also some online systems that attract intentional spammers. These spammers actively generate fake data, and—often motivated by profit—work very hard to keep their spamming concealed. For example, political activity on Twitter seems to include at least some reasonably sophisticated spam, whereby some political causes are intentionally made to look more popular than they actually are (Ratkiewicz et al. 2011). Unfortunately, removing this intentional spam can be quite difficult.

Of course, what is considered dirty data can depend, in part, on the research question. For example, many edits to Wikipedia are created by automated bots (Geiger 2014). If you are interested in the ecology of Wikipedia, then these bot-created edits are important. But if you are interested in how humans contribute to Wikipedia, then the bot-created edits should be excluded.
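
One way to do that, if you are working with the live MediaWiki API rather than a static dump, is to ask the API to exclude edits made by accounts carrying the bot flag. The sketch below is a minimal example; note that it only drops edits from registered, flagged bot accounts, so automated editing that is not flagged would still slip through.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def recent_edits(exclude_bots=True, limit=50):
    """Fetch recent Wikipedia edits, optionally excluding flagged bot accounts."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit",
        "rcprop": "user|title|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    if exclude_bots:
        params["rcshow"] = "!bot"  # drop edits made under the bot flag
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    return response.json()["query"]["recentchanges"]

human_edits = recent_edits(exclude_bots=True)
all_edits = recent_edits(exclude_bots=False)
```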

There is no single statistical technique or approach that can ensure that you have sufficiently cleaned your dirty data. In the end, I think the best way to avoid being fooled by dirty data is to understand as much as possible about how your data were created.