5.4.3 Conclusion

Distributed data collection is possible, and in the future will likely involve technology and passive participation.

As eBird demonstrates, distributed data collection can be used for scientific research. Further, PhotoCity shows that problems related to sampling and data quality are potentially solvable.

How might distributed data collection work for social research? A wonderful example comes from the work of Susan Watkins and her colleagues on the Malawi Journals Project (Watkins and Swidler 2009; Kaler, Watkins, and Angotti 2015). In this project, 22 local residents, called “journalists,” kept “conversational journals” that recorded, in detail, the conversations they overheard about AIDS in the daily lives of ordinary people (at the time the project began, about 15% of adults in Malawi were infected with HIV; Bello, Chipeta, and Aberle-Grasse 2006). Because of their insider status, these journalists were able to overhear conversations that might have been inaccessible to Susan Watkins and her Western research collaborators (I’ll discuss the ethics of this later in the chapter when I offer advice about designing your own mass collaboration project). The data from the Malawi Journals Project has led to a number of important findings. For example, before the project started, many outsiders believed that there was silence about AIDS in sub-Saharan Africa, but the journals demonstrated that this was clearly not the case: journalists overheard hundreds of conversations on the topic, in locations as diverse as funerals, bars, and churches. Further, the nature of these conversations helped researchers better understand some of the resistance to condom use; the way that condom use was framed in public health messages was inconsistent with the way that it was discussed in everyday life (Tavory and Swidler 2009).

Of course, like the data from eBird, the data from the Malawi Journals Project is not perfect, an issue discussed in detail by Watkins and colleagues. For example, the recorded conversations are not a random sample of all possible conversations. Rather, they are an incomplete census of conversations about AIDS. In terms of data quality, the researchers believe that their journalists were high-quality reporters, as evidenced by the consistency within journals and across journals. Further, when enough journalists are deployed in a small enough setting and reports are focused on a specific topic, redundancy becomes possible, which increases confidence in data quality. For example, a sex worker named “Stella” showed up several times in the journals of four different journalists (Watkins and Swidler 2009). As it was in PhotoCity, the use of redundancy is an important principle for assessing and ensuring data quality in distributed data collection projects. To further build your intuition, Table 5.3 shows other examples of distributed data collection for social research.

Table 5.3: Examples of distributed data collection projects in social research.
Data collected                              Citation
Discussions about HIV/AIDS in Malawi        Watkins and Swidler (2009); Kaler, Watkins, and Angotti (2015)
Street begging in London                    Purdam (2014)
Conflict events in Eastern Congo            Windt and Humphreys (2016)
Economic activity in Nigeria and Liberia    Blumenstock, Keleher, and Reisinger (2016)
Influenza surveillance                      Noort et al. (2015)
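
The redundancy principle discussed before Table 5.3 can be made concrete with a small sketch. It assumes, purely for illustration, that each journal entry has been coded as a (journalist, entity) pair; the function name, the identifiers, and the threshold of two independent sources are all hypothetical and are not taken from the Malawi Journals Project. The idea is simply to count how many different contributors independently report the same thing.

```python
from collections import defaultdict

def corroborated_mentions(mentions, min_sources=2):
    """Return entities reported by at least `min_sources` distinct contributors.

    `mentions` is an iterable of (contributor_id, entity) pairs.
    Repeated mentions by the same contributor count only once.
    """
    sources = defaultdict(set)
    for contributor, entity in mentions:
        sources[entity].add(contributor)
    return {
        entity: len(contributors)
        for entity, contributors in sources.items()
        if len(contributors) >= min_sources
    }

# Made-up example data (hypothetical identifiers, not real journal content).
example = [
    ("journalist_1", "person_A"),
    ("journalist_2", "person_A"),
    ("journalist_1", "person_A"),  # same journalist again: not extra evidence
    ("journalist_3", "person_B"),
]
print(corroborated_mentions(example))
```

Counting distinct contributors, rather than total mentions, matters here: one journalist writing about the same person many times is weaker evidence than several journalists doing so independently.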

All of the examples described in this section have involved active participation: journalists transcribed conversations that they heard; birders uploaded their birding checklists; or players uploaded their photos. But what if the participation was automatic and did not require any specific skill or time to submit? This is the promise offered by “participatory sensing” or “people-centric sensing.” For example, the Pothole Patrol, a project by scientists at MIT, mounted GPS-equipped accelerometers inside seven taxi cabs in the Boston area (Eriksson et al. 2008). Because driving over a pothole leaves a distinct accelerometer signal, these devices, when placed inside moving taxis, can create pothole maps of Boston. Of course, taxis don’t randomly sample roads, but given enough taxis, there may be sufficient coverage to provide information about large portions of the city. A second benefit of passive systems that rely on technology is that they de-skill the process of contributing data: while it requires skill to contribute to eBird (because you need to be able to reliably identify bird species), it requires no special skills to contribute to Pothole Patrol.
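
To build intuition for how passive sensing of this kind might turn raw readings into a map, here is a minimal sketch. It is not the method used by Eriksson et al. (2008); it assumes a much simpler rule in which a location is flagged when several different taxis record a large vertical jolt there. The `Reading` fields, the acceleration threshold, the grid size, and the minimum number of taxis are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    taxi_id: str
    lat: float
    lon: float
    vertical_accel: float  # assumed units: m/s^2, with gravity already removed

def flag_candidate_potholes(readings, threshold=4.0, min_taxis=2, grid=1e-4):
    """Return grid cells where at least `min_taxis` distinct taxis felt a jolt.

    A reading counts as a jolt when |vertical_accel| >= threshold.
    Locations are snapped to a coarse grid so that nearby readings
    from different taxis fall into the same cell.
    """
    taxis_by_cell = {}
    for r in readings:
        if abs(r.vertical_accel) >= threshold:
            cell = (round(r.lat / grid), round(r.lon / grid))
            taxis_by_cell.setdefault(cell, set()).add(r.taxi_id)
    return sorted(
        cell for cell, taxis in taxis_by_cell.items() if len(taxis) >= min_taxis
    )

# Made-up readings for illustration only.
readings = [
    Reading("taxi_A", 42.3601, -71.0589, 6.0),  # jolt
    Reading("taxi_B", 42.3601, -71.0589, 5.5),  # jolt, same place, other taxi
    Reading("taxi_A", 42.3700, -71.0600, 7.0),  # jolt seen by one taxi only
    Reading("taxi_C", 42.3601, -71.0589, 0.3),  # smooth ride
]
print(flag_candidate_potholes(readings))
```

Requiring agreement across taxis is the same redundancy principle that appeared in PhotoCity and the Malawi Journals Project: a single jolt could be a curb or a slammed door, but repeated jolts at the same spot from different vehicles are harder to explain away.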

Going forward, I suspect that many distributed data collection projects will begin to make use of the capabilities of mobile phones that are already carried by billions of people around the world. These phones already have a large number of sensors important for measurement, such as microphones, cameras, GPS devices, and clocks. Further, these mobile phones support third-party apps, giving researchers some control over the underlying data collection protocols. Finally, these phones have Internet connectivity, making it possible for them to off-load the data they collect. There are numerous technical challenges, ranging from inaccurate sensors to limited battery life, but these problems will likely diminish over time as technology develops. Issues related to privacy and ethics, on the other hand, might get more complicated as technology develops; I’ll return to questions of ethics when I offer advice about designing your own mass collaboration.

In distributed data collection projects, volunteers contribute data about the world. This approach has already been used successfully, and future uses will likely have to address sampling and data quality concerns. Fortunately, existing projects such as PhotoCity and Pothole Patrol suggest solutions to these problems. As more projects take advantage of technology that enables de-skilled and passive participation, distributed data collection projects should dramatically increase in scale, enabling researchers to collect data that was simply off limits in the past.