3.6.2 Amplified asking

Amplified asking uses a predictive model to combine survey data from a few people with a big data source from many people.

A different way to combine survey and big data sources is a process that I’ll call amplified asking. In amplified asking, a researcher uses a predictive model to combine a small amount of survey data with a big data source in order to produce estimates at a scale or granularity that would not be possible with either data source individually. An important example of amplified asking comes from the work of Joshua Blumenstock, who wanted to collect data that could help guide development in poor countries. In the past, researchers collecting this kind of data generally had to take one of two approaches: sample surveys or censuses. Sample surveys, where researchers interview a small number of people, can be flexible, timely, and relatively cheap. However, these surveys, because they are based on a sample, are often limited in their resolution. With a sample survey, it is often hard to make estimates about specific geographic regions or for specific demographic groups. Censuses, on the other hand, attempt to interview everyone, and so they can be used to produce estimates for small geographic regions or demographic groups. But censuses are generally expensive, narrow in focus (they only include a small number of questions), and not timely (they happen on a fixed schedule, such as every 10 years) (Kish 1979). Rather than being stuck with sample surveys or censuses, imagine if researchers could combine the best characteristics of both. Imagine if researchers could ask every question to every person every day. Obviously, this ubiquitous, always-on survey is a kind of social science fantasy. But it does appear that we can begin to approximate this by combining survey questions from a small number of people with digital traces from many people.

Blumenstock’s research began when he partnered with the largest mobile phone provider in Rwanda, and the company provided anonymized transaction records from about 1.5 million customers between 2005 and 2009. These records contained information about each call and text message, such as the start time, duration, and approximate geographic location of the caller and receiver. Before I talk about the statistical issues, it is worth pointing out that this first step may be one of the hardest for many researchers. As I described in chapter 2, most big data sources are inaccessible to researchers. Telephone meta-data, in particular, is especially inaccessible because it is basically impossible to anonymize and it almost certainly contains information that participants would consider sensitive (Mayer, Mutchler, and Mitchell 2016; Landau 2016). In this particular case, the researchers were careful to protect the data and their work was overseen by a third party (i.e., their IRB). I’ll return to these ethical issues in more detail in chapter 6.

Blumenstock was interested in measuring wealth and well-being. But these traits do not appear directly in the call records. In other words, the call records are incomplete for this research—a common feature of big data sources that was discussed in detail in chapter 2. However, the call records likely contain information that could indirectly reveal something about wealth and well-being. Given this possibility, Blumenstock asked whether it was possible to train a machine learning model to predict how someone will respond to a survey based on their call records. If this was possible, then Blumenstock could use this model to predict the survey responses of all 1.5 million customers.

In order to build and train such a model, Blumenstock and research assistants from Kigali Institute of Science and Technology called a random sample of about a thousand customers. The researchers explained the goals of the project to the participants, asked for their consent to link the survey responses to the call records, and then asked them a series of questions to measure their wealth and well-being, such as “Do you own a radio?” and “Do you own a bicycle?” (see figure 3.14 for a partial list). All participants in the survey were compensated financially.

Next, Blumenstock used a two-step procedure common in machine learning: feature engineering followed by supervised learning. First, in the feature engineering step, for everyone who was interviewed, Blumenstock converted the call records into a set of characteristics about each person; data scientists might call these characteristics “features” and social scientists would call them “variables.” For example, for each person, Blumenstock calculated the total number of days with activity, the number of distinct people the person had been in contact with, the amount of money spent on airtime, and so on. Critically, good feature engineering requires knowledge of the research setting. For example, if it is important to distinguish between domestic and international calls (we might expect people who call internationally to be wealthier), then this must be done at the feature engineering step. A researcher with little understanding of Rwanda might not include this feature, and then the predictive performance of the model would suffer.
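To make the feature engineering step concrete, here is a minimal sketch in Python of how call-level records might be collapsed into person-level features. This is not Blumenstock's actual code; the DataFrame `calls` and its column names (caller_id, timestamp, duration, cost, recipient_id, is_international) are hypothetical.

```python
import pandas as pd

def build_features(calls: pd.DataFrame) -> pd.DataFrame:
    """Collapse call-level records into one row of features per person."""
    # Extract the calendar date of each call so we can count active days.
    calls = calls.assign(date=pd.to_datetime(calls["timestamp"]).dt.date)
    return (
        calls.groupby("caller_id")
        .agg(
            active_days=("date", "nunique"),              # days with any activity
            num_contacts=("recipient_id", "nunique"),      # distinct people contacted
            airtime_spent=("cost", "sum"),                 # money spent on airtime
            total_call_minutes=("duration", "sum"),
            intl_call_share=("is_international", "mean"),  # requires local knowledge to think of
        )
        .reset_index()
    )
```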

Next, in the supervised learning step, Blumenstock built a model to predict the survey response for each person based on their features. In this case, Blumenstock used logistic regression, but he could have used a variety of other statistical or machine learning approaches.
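Continuing the sketch, the supervised learning step might look like the following. The `features` table comes from the sketch above, and `survey` is a hypothetical DataFrame with a person_id column and a binary owns_radio column; the scikit-learn logistic regression below is just one way to fit such a model.

```python
from sklearn.linear_model import LogisticRegression

# Join features to survey responses for the roughly 1,000 interviewed customers.
train = features.merge(survey, left_on="caller_id", right_on="person_id")
X = train.drop(columns=["caller_id", "person_id", "owns_radio"])
y = train["owns_radio"]

# Fit a logistic regression that predicts the survey answer from the features.
model = LogisticRegression(max_iter=1000).fit(X, y)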

So how well did it work? Was Blumenstock able to predict answers to survey questions like “Do you own a radio?” and “Do you own a bicycle?” using features derived from call records? In order to evaluate the performance of his predictive model, Blumenstock used cross-validation, a technique commonly used in data science but rarely in social science. The goal of cross-validation is to provide a fair assessment of a model’s predictive performance by training it and testing it on different subsets of data. In particular, Blumenstock split his data into 10 chunks of 100 people each. Then, he used nine of the chunks to train his model, and the predictive performance of the trained model was evaluated on the remaining chunk. He repeated this procedure 10 times—with each chunk of data getting one turn as the validation data—and averaged the results.
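In code, this kind of 10-fold cross-validation can be expressed in a few lines. The sketch below reuses the hypothetical X and y from above; scikit-learn handles the splitting, the repeated training and testing, and the averaging.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Split the data into 10 folds; train on 9, evaluate on the held-out fold,
# repeat 10 times, and average the resulting accuracies.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="accuracy")
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```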

The accuracy of the predictions was high for some traits (figure 3.14); for example, Blumenstock could predict with 97.6% accuracy whether someone owned a radio. This might sound impressive, but it is always important to compare a complex prediction method against a simple alternative. In this case, a simple alternative is to predict that everyone will give the most common answer. For example, 97.3% of respondents reported owning a radio, so if Blumenstock had predicted that everyone would report owning a radio, he would have had an accuracy of 97.3%, which is nearly identical to the performance of his more complex procedure (97.6% accuracy). In other words, all the fancy data and modeling increased the accuracy of the prediction from 97.3% to 97.6%. However, for other questions, such as “Do you own a bicycle?”, the predictions improved from 54.4% to 67.6%. More generally, figure 3.15 shows that for some traits Blumenstock did not improve much beyond just making the simple baseline prediction, but that for other traits there was some improvement. Looking just at these results, however, you might not think that this approach is particularly promising.
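One simple way to compute that baseline is a “classifier” that always predicts the most common answer; scikit-learn's DummyClassifier does exactly this. The sketch below, again using the hypothetical X and y from above, evaluates the baseline with the same 10 folds so the two accuracies are directly comparable.

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always predict the most frequent answer, evaluated with 10-fold cross-validation.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
print(f"majority-answer baseline accuracy: {baseline_scores.mean():.3f}")
```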

Figure 3.14: Predictive accuracy for a statistical model trained with call records. Adapted from Blumenstock (2014), table 2.

Figure 3.15: Comparison of predictive accuracy for a statistical model trained with call records to simple baseline prediction. Points are slightly jittered to avoid overlap. Adapted from Blumenstock (2014), table 2.

However, just one year later, Blumenstock and two colleagues—Gabriel Cadamuro and Robert On—published a paper in Science with substantially better results (Blumenstock, Cadamuro, and On 2015). There were two main technical reasons for this improvement: (1) they used more sophisticated methods (i.e., a new approach to feature engineering and a more sophisticated model to predict responses from features) and (2) rather than attempting to infer responses to individual survey questions (e.g., “Do you own a radio?”), they attempted to infer a composite wealth index. These technical improvements meant that they could do a reasonable job of using call records to predict wealth for the people in their sample.

Predicting the wealth of people in the sample, however, was not the ultimate goal of the research. Remember that the ultimate goal was to combine some of the best features of sample surveys and censuses to produce accurate, high-resolution estimates of poverty in developing countries. To assess their ability to achieve this goal, Blumenstock and colleagues used their model and their data to predict the wealth of all 1.5 million people in the call records. And they used the geospatial information embedded in the call records (recall that the data included the location of the nearest cell tower for each call) to estimate the approximate place of residence of each person (figure 3.17). Putting these two estimates together, Blumenstock and colleagues produced an estimate of the geographic distribution of subscriber wealth at extremely fine spatial granularity. For example, they could estimate the average wealth in each of Rwanda’s 2,148 cells (the smallest administrative unit in the country).
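The aggregation step is conceptually simple: attach each person's predicted wealth to their estimated home location and average within each administrative unit. Here is a sketch, assuming a hypothetical DataFrame `predictions` with columns person_id, predicted_wealth, and home_cell (the cell inferred from each person's call locations).

```python
import pandas as pd

# Average predicted wealth within each administrative cell.
cell_wealth = (
    predictions.groupby("home_cell", as_index=False)["predicted_wealth"]
    .mean()
    .rename(columns={"predicted_wealth": "avg_predicted_wealth"})
)
```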

How well did these estimates match up to the actual level of poverty in these regions? Before I answer that question, I want to emphasize that there are many reasons to be skeptical. For example, the individual-level predictions were fairly noisy (figure 3.17). And, perhaps more importantly, people with mobile phones might be systematically different from people without mobile phones. Thus, Blumenstock and colleagues might suffer from the types of coverage errors that biased the 1936 Literary Digest survey that I described earlier.

To get a sense of the quality of their estimates, Blumenstock and colleagues needed to compare them with something else. Fortunately, around the same time as their study, another group of researchers was running a more traditional social survey in Rwanda. This other survey—which was part of the widely respected Demographic and Health Survey program—had a large budget and used high-quality, traditional methods. Therefore, the estimates from the Demographic and Health Survey could reasonably be considered gold-standard estimates. When the two estimates were compared, they were quite similar (figure 3.17). In other words, by combining a small amount of survey data with the call records, Blumenstock and colleagues were able to produce estimates comparable to those from gold-standard approaches.

A skeptic might see these results as a disappointment. After all, one way of viewing them is to say that by using big data and machine learning, Blumenstock and colleagues were able to produce estimates that could be made more reliably by already existing methods. But I don’t think that is the right way to think about this study for two reasons. First, the estimates from Blumenstock and colleagues were about 10 times faster and 50 times cheaper (when cost is measured in terms of variable costs). As I argued earlier in this chapter, researchers ignore cost at their peril. In this case, for example, the dramatic decrease in cost means that rather than being run every few years—as is standard for Demographic and Health Surveys—this kind of survey could be run every month, which would provide numerous advantages for researchers and policy makers.

The second reason not to take the skeptic’s view is that this study provides a basic recipe that can be tailored to many different research situations. This recipe has only two ingredients and two steps. The ingredients are (1) a big data source that is wide but thin (i.e., it has many people but not the information that you need about each person) and (2) a survey that is narrow but thick (i.e., it has only a few people, but it does have the information that you need about those people). These ingredients are then combined in two steps. First, for the people in both data sources, build a machine learning model that uses the big data source to predict survey answers. Next, use that model to impute the survey answers of everyone in the big data source. Thus, if there is some question that you want to ask lots of people, look for a big data source from those people that might be used to predict their answer, even if you don’t care about the big data source. That is, Blumenstock and colleagues didn’t inherently care about call records; they only cared about call records because they could be used to predict survey answers that they cared about. This characteristic—only indirect interest in the big data source—makes amplified asking different from embedded asking, which I described earlier.
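The recipe is compact enough to express in a few lines of code. The sketch below uses hypothetical DataFrames `big_data_features` (wide but thin: many people, a few columns derived from the big data source) and `survey` (narrow but thick: few people, the answer you care about), both keyed by person_id; a logistic regression stands in for whatever predictive model fits the problem.

```python
from sklearn.linear_model import LogisticRegression

# Step 1: on the overlap of the two sources, learn to predict the survey
# answer from big-data features.
overlap = big_data_features.merge(survey, on="person_id")
X_overlap = overlap.drop(columns=["person_id", "answer"])
y_overlap = overlap["answer"]
model = LogisticRegression(max_iter=1000).fit(X_overlap, y_overlap)

# Step 2: impute the answer for everyone in the big data source.
X_all = big_data_features.drop(columns=["person_id"])
big_data_features["imputed_answer"] = model.predict(X_all)
```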

Figure 3.16: Schematic of the study by Blumenstock, Cadamuro, and On (2015). Call records from the phone company were converted to a matrix with one row for each person and one column for each feature (i.e., variable). Next, the researchers built a supervised learning model to predict the survey responses from the person-by-feature matrix. Then, the supervised learning model was used to impute the survey responses for all 1.5 million customers. Also, the researchers estimated the approximate place of residence for all 1.5 million customers based on the locations of their calls. When these two estimates—the estimated wealth and the estimated place of residence—were combined, the results were similar to estimates from the Demographic and Health Survey, a gold-standard traditional survey (figure 3.17).

Figure 3.17: Results from Blumenstock, Cadamuro, and On (2015). At the individual level, the researchers were able to do a reasonable job of predicting someone’s wealth from their call records. The estimates of district-level wealth for Rwanda’s 30 districts—which were based on individual-level estimates of wealth and place of residence—were similar to results from the Demographic and Health Survey, a gold-standard traditional survey. Adapted from Blumenstock, Cadamuro, and On (2015), figures 1a and 3c.

In conclusion, Blumenstock’s amplified asking approach combined survey data with a big data source to produce estimates comparable to those from a gold-standard survey. This particular example also clarifies some of the trade-offs between amplified asking and traditional survey methods. The amplified asking estimates were more timely, substantially cheaper, and more granular. But, on the other hand, there is not yet a strong theoretical basis for this kind of amplified asking. This single example does not show when this approach will work and when it won’t, and researchers using this approach need to be especially concerned about possible biases caused by who is included—and who is not included—in their big data source. Further, the amplified asking approach does not yet have good ways to quantify uncertainty around its estimates. Fortunately, amplified asking has deep connections to three large areas in statistics—small-area estimation (Rao and Molina 2015), imputation (Rubin 2004), and model-based post-stratification (which itself is closely related to Mr. P., the method I described earlier in the chapter) (Little 1993). Because of these deep connections, I expect that many of the methodological foundations of amplified asking will soon be improved.

Finally, comparing Blumenstock’s first and second attempts also illustrates an important lesson about digital-age social research: the beginning is not the end. That is, many times, the first approach will not be the best, but if researchers continue working, things can get better. More generally, when evaluating new approaches to social research in the digital age, it is important to make two distinct evaluations: (1) How well does this work now? and (2) How well will this work in the future as the data landscape changes and as researchers devote more attention to the problem? Although researchers are trained to make the first kind of evaluation, the second is often more important.