Distributed data collection is possible, and in the future it will likely involve technology and passive participation.
As eBird demonstrates, distributed data collection can be used for scientific research. Further, PhotoCity shows that problems related to sampling and data quality are potentially solvable. How might distributed data collection work for social research? One example comes from the work of Susan Watkins and her colleagues on the Malawi Journals Project (Watkins and Swidler 2009; Kaler, Watkins, and Angotti 2015). In this project, 22 local residents—called “journalists”—kept “conversational journals” that recorded, in detail, the conversations they overheard about AIDS in the daily lives of ordinary people (at the time the project began, about 15% of adults in Malawi were infected with HIV (Bello, Chipeta, and Aberle-Grasse 2006)). Because of their insider status, these journalists were able to overhear conversations that might have been inaccessible to Watkins and her Western research collaborators (I’ll discuss the ethics of this later in the chapter when I offer advice about designing your own mass collaboration project). The data from the Malawi Journals Project have led to a number of important findings. For example, before the project started, many outsiders believed that there was silence about AIDS in sub-Saharan Africa, but the conversational journals demonstrated that this was clearly not the case: journalists overheard hundreds of discussions of the topic, in locations as diverse as funerals, bars, and churches. Further, the nature of these conversations helped researchers better understand some of the resistance to condom use; the way that condom use was framed in public health messages was inconsistent with the way that it was discussed in everyday life (Tavory and Swidler 2009).
Of course, like the data from eBird, the data from the Malawi Journals Project are not perfect, an issue discussed in detail by Watkins and colleagues. For example, the recorded conversations are not a random sample of all possible conversations. Rather, they are an incomplete census of conversations about AIDS. In terms of data quality, the researchers believed that their journalists were high-quality reporters, as evidenced by the consistency within journals and across journals. That is, because enough journalists were deployed in a small enough setting and focused on a specific topic, it was possible to use redundancy to assess and ensure data quality. For example, a sex worker named “Stella” showed up several times in the journals of four different journalists (Watkins and Swidler 2009). In order to further build your intuition, table 5.3 shows other examples of distributed data collection for social research.
Table 5.3: Examples of distributed data collection for social research

|Data collected|Reference|
|---|---|
|Discussions about HIV/AIDS in Malawi|Watkins and Swidler (2009); Kaler, Watkins, and Angotti (2015)|
|Street begging in London|Purdam (2014)|
|Conflict events in Eastern Congo|Windt and Humphreys (2016)|
|Economic activity in Nigeria and Liberia|Blumenstock, Keleher, and Reisinger (2016)|
|Influenza surveillance|Noort et al. (2015)|
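The redundancy-based quality check described above—using overlap across journalists to corroborate reports—can be sketched in a few lines of code. The journal data and names here are purely illustrative (only "Stella" appears in the source), but the logic is the same: a person or event reported independently by multiple journalists can be cross-checked for consistency.

```python
from collections import Counter

# Hypothetical journal data: journalist -> set of people mentioned in
# their conversational journals. All names except "Stella" are invented.
journals = {
    "journalist_1": {"Stella", "Grace", "Patrick"},
    "journalist_2": {"Stella", "James"},
    "journalist_3": {"Grace", "Stella"},
    "journalist_4": {"Stella", "Mercy"},
}

# Count how many independent journalists mention each person.
mentions = Counter(name for people in journals.values() for name in people)

# People reported by two or more journalists can be cross-checked for
# consistency, as Watkins and Swidler did with "Stella".
corroborated = sorted(name for name, n in mentions.items() if n >= 2)
print(corroborated)  # ['Grace', 'Stella']
```

The same idea scales: the denser the coverage (many journalists in a small setting, focused on one topic), the more overlap there is to exploit for quality assessment.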
All of the examples described in this section have involved active participation: journalists transcribed conversations that they heard; birders uploaded their birding checklists; or players uploaded their photos. But what if the participation was automatic and did not require any specific skill or time to submit? This is the promise offered by “participatory sensing” or “people-centric sensing.” For example, the Pothole Patrol, a project by scientists at MIT, mounted GPS-equipped accelerometers inside seven taxi cabs in the Boston area (Eriksson et al. 2008). Because driving over a pothole leaves a distinct accelerometer signal, these devices, when placed inside moving taxis, can create pothole maps of Boston. Of course, taxis don’t randomly sample roads, but, given enough taxis, there may be sufficient coverage to provide information about large portions of the city. A second benefit of passive systems that rely on technology is that they de-skill the process of contributing data: while it requires skill to contribute to eBird (because you need to be able to reliably identify bird species), it requires no special skills to contribute to Pothole Patrol.
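To make the pothole-detection idea concrete, here is a minimal sketch of how spikes in vertical acceleration, tagged with GPS coordinates, could be flagged as candidate potholes. The readings and the threshold are assumptions for illustration; the actual Pothole Patrol system used a trained, speed-aware detector, not a fixed cutoff.

```python
# Hypothetical accelerometer samples from a moving taxi:
# (latitude, longitude, vertical acceleration in g). Values are invented.
samples = [
    (42.3601, -71.0589, 1.02),
    (42.3605, -71.0592, 0.98),
    (42.361, -71.0595, 2.75),  # sharp spike: candidate pothole
    (42.3614, -71.0599, 1.01),
]

# Assumed cutoff: spikes above this are treated as pothole signatures.
THRESHOLD_G = 2.0

# Flag the GPS locations of readings that exceed the threshold.
pothole_locations = [
    (lat, lon) for lat, lon, accel in samples if abs(accel) > THRESHOLD_G
]
print(pothole_locations)  # [(42.361, -71.0595)]
```

Aggregating such flags across many taxis over many trips is what turns noisy individual readings into a usable pothole map, even though no single taxi samples the road network systematically.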
Going forward, I suspect that many distributed data collection projects will begin to make use of the capabilities of the mobile phones that are already carried by billions of people around the world. These phones already have a large number of sensors important for measurement, such as microphones, cameras, GPS devices, and clocks. Further, they support third-party apps, giving researchers some control over the underlying data collection protocols. Finally, they have Internet connectivity, making it possible for them to off-load the data they collect. There are numerous technical challenges, ranging from inaccurate sensors to limited battery life, but these problems will likely diminish over time as technology develops. Issues related to privacy and ethics, on the other hand, might get more complicated; I’ll return to questions of ethics when I offer advice about designing your own mass collaboration.
In distributed data collection projects, volunteers contribute data about the world. This approach has already been used successfully, and future uses will likely have to address sampling and data quality concerns. Fortunately, existing projects such as PhotoCity and Pothole Patrol suggest solutions to these problems. As more projects take advantage of technology that enables de-skilled and passive participation, distributed data collection projects should dramatically increase in scale, enabling researchers to collect data that were simply off limits in the past.