3.6.1 Amplified asking

Linking your survey to digital traces can be like asking everyone your questions at all times.

Asking generally comes in two main categories: sample surveys and censuses. Sample surveys, in which data are collected from a small number of people, can be flexible, timely, and relatively cheap. However, because they are based on a sample, they are often limited in their resolution; with a sample survey, it is often hard to make estimates about specific geographic regions or specific demographic groups. Censuses, on the other hand, attempt to interview everyone in the population. They have great resolution, but they are generally expensive, narrow in focus (they include only a small number of questions), and not timely (they happen on a fixed schedule, such as every 10 years) (Kish 1979). Now imagine if researchers could combine the best characteristics of sample surveys and censuses; imagine if researchers could ask every question to everyone every day.

Obviously, this continual, ubiquitous, always-on survey is a kind of social science fantasy. But it appears that we can begin to approximate it by combining survey questions from a small number of people with digital traces from many people. I call this type of combination amplified asking. If done well, it could help us produce estimates that are more local (for smaller geographic areas), more granular (for specific demographic groups), and more timely.

One example of amplified asking comes from the work of Joshua Blumenstock, who wanted to collect data that would help guide development in poor countries. More specifically, Blumenstock wanted to create a system to measure wealth and well-being that combined the completeness of a census with the flexibility and frequency of a survey (Blumenstock 2014; Blumenstock, Cadamuro, and On 2015). In fact, I’ve already described Blumenstock’s work briefly in Chapter 1.

To start, Blumenstock partnered with the largest mobile phone provider in Rwanda. The company provided him with anonymized transaction records from about 1.5 million customers covering behavior between 2005 and 2009. The logs contained information about each call and text message, such as the start time, duration, and approximate geographic location of the caller and receiver. Before we start talking about the statistical issues, it is worth pointing out that this first step may be one of the hardest. As described in Chapter 2, most digital trace data are inaccessible to researchers. And many companies are justifiably hesitant to share their data because the data are private; that is, their customers probably did not expect that their records would be shared—in bulk—with researchers. In this case, the researchers took careful steps to anonymize the data, and their work was overseen by a third party (i.e., their IRB). But, despite these efforts, these data are probably still identifiable and likely contain sensitive information (Mayer, Mutchler, and Mitchell 2016; Landau 2016). I'll return to these ethical questions in Chapter 6.

Recall that Blumenstock was interested in measuring wealth and well-being. But these traits are not directly recorded in the call records. In other words, the call records are incomplete for this research, a common feature of digital traces that was discussed in detail in Chapter 2. It seems likely, however, that the call records contain some information about wealth and well-being. So one way of framing Blumenstock's question is: is it possible to predict how someone will respond to a survey based on their digital trace data? If so, then by surveying a few people we can infer the answers of everyone else.

To assess this empirically, Blumenstock and research assistants from Kigali Institute of Science and Technology called a sample of about a thousand mobile phone customers. The researchers explained the goals of the project to the participants, asked for their consent to link the survey responses to the call records, and then asked them a series of questions to measure their wealth and well-being, such as “Do you own a radio?” and “Do you own a bicycle?” (see Figure 3.11 for a partial list). All participants in the survey were compensated financially.

Next, Blumenstock used a two-step procedure common in data science: feature engineering followed by supervised learning. First, in the feature engineering step, for everyone who was interviewed, Blumenstock converted the call records into a set of characteristics about each person; data scientists might call these characteristics "features" and social scientists would call them "variables." For example, for each person, Blumenstock calculated the total number of days with activity, the number of distinct people the person had been in contact with, the amount of money spent on airtime, and so on. Critically, good feature engineering requires knowledge of the research setting. For example, if it is important to distinguish between domestic and international calls (we might expect people who call internationally to be wealthier), then this must be done at the feature engineering step. A researcher with little understanding of Rwanda might not include this feature, and the predictive performance of the model would suffer.
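To make this concrete, here is a minimal sketch of what the feature engineering step might look like in Python with pandas. The column names (caller_id, receiver_id, timestamp, airtime_cost, is_international) are hypothetical stand-ins for illustration, not Blumenstock's actual schema.

```python
import pandas as pd

# Hypothetical call-record log: one row per call or text message.
# Columns assumed: caller_id, receiver_id, timestamp, duration_sec,
#                  airtime_cost, is_international
calls = pd.read_csv("call_records.csv", parse_dates=["timestamp"])

# Collapse the event-level log into one row of features per person.
features = calls.groupby("caller_id").agg(
    active_days=("timestamp", lambda ts: ts.dt.date.nunique()),  # days with any activity
    n_contacts=("receiver_id", "nunique"),                       # distinct people contacted
    total_airtime_cost=("airtime_cost", "sum"),                  # money spent on airtime
    share_international=("is_international", "mean"),            # domain knowledge: international
)                                                                # calling may signal wealth
```

The last feature is exactly the kind of choice that requires local knowledge: a researcher who did not think international calling mattered in Rwanda would simply never compute it.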

Next, in the supervised learning step, Blumenstock built a statistical model to predict the survey response for each person based on their features. In this case, Blumenstock used logistic regression with 10-fold cross-validation, but he could have used a variety of other statistical or machine learning approaches.
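A sketch of that supervised learning step, using scikit-learn, might look like the following. It assumes the feature matrix from the sketch above and a 0/1 survey answer (e.g., radio ownership) for the roughly 1,000 surveyed respondents; surveyed_ids and the survey DataFrame are hypothetical names, and Blumenstock's actual model specification may differ.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X: feature rows for the ~1,000 surveyed people (aligned with y)
# y: their answers to one question, coded 0/1, e.g. "Do you own a radio?"
X = features.loc[surveyed_ids]        # surveyed_ids: hypothetical index of respondents
y = survey["owns_radio"].values       # survey: hypothetical DataFrame of survey answers

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # 10-fold cross-validation
print(f"mean out-of-fold accuracy: {scores.mean():.3f}")
```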

So how well did it work? Was Blumenstock able to predict answers to survey questions like "Do you own a radio?" and "Do you own a bicycle?" using features derived from call records? Sort of. The accuracy of the predictions was high for some traits (Figure 3.11). But it is always important to compare a complex prediction method against a simple alternative. In this case, a simple alternative is to predict that everyone will give the most common answer. For example, 97.3% of respondents reported owning a radio, so if Blumenstock had predicted that everyone would report owning a radio, he would have had an accuracy of 97.3%, which is surprisingly similar to the performance of his more complex procedure (97.6% accuracy). In other words, all the fancy data and modeling increased the accuracy of the prediction from 97.3% to 97.6%. For other questions, however, such as "Do you own a bicycle?", the predictions improved from 54.4% to 67.6%. More generally, Figure 3.12 shows that for some traits Blumenstock did not improve much beyond the simple baseline prediction, but that for other traits there was some improvement.
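The baseline here is just the majority-class prediction, which scikit-learn expresses directly with DummyClassifier; this is a sketch of the comparison, reusing the hypothetical X and y from above, and the numbers in the comment simply echo the radio example from the text.

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Baseline: always predict the most common survey answer.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
print(f"baseline accuracy: {baseline_scores.mean():.3f}")
# If 97.3% of respondents report owning a radio, this prints roughly 0.973,
# so the model's 0.976 for that question is only a marginal improvement.
```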

Figure 3.11: Predictive accuracy for statistical model trained with call records. Results from Table 2 of Blumenstock (2014).

Figure 3.12: Comparison of predictive accuracy for statistical model trained with call records to simple baseline prediction. Points are slightly jittered to avoid overlap; see Table 2 of Blumenstock (2014) for exact values.

At this point you might be thinking that these results are a bit disappointing, but just one year later, Blumenstock and two colleagues—Gabriel Cadamuro and Robert On—published a paper in Science with substantially better results (Blumenstock, Cadamuro, and On 2015). There were two main technical reasons for the improvement: 1) they used more sophisticated methods (i.e., a new approach to feature engineering and a more sophisticated machine learning model) and 2) rather than attempting to infer responses to individual survey questions (e.g., “Do you own a radio?”), they attempted to infer a composite wealth index.
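A common way to build such a composite wealth index, and one widely used in the Demographic and Health Surveys literature, is to take the first principal component of the asset-ownership questions. The sketch below shows that general idea; the column names are hypothetical, and the exact construction in Blumenstock, Cadamuro, and On (2015) may differ.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# assets: one 0/1 column per asset question; column names are hypothetical
assets = survey[["owns_radio", "owns_bicycle", "owns_phone"]]

# Use the first principal component of the asset indicators as a composite wealth index.
scaled = StandardScaler().fit_transform(assets)
wealth_index = PCA(n_components=1).fit_transform(scaled).ravel()
```

The supervised learning target then becomes this single continuous index rather than a series of separate yes/no answers.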

Blumenstock and colleagues demonstrated the performance of their approach in two ways. First, they found that for the people in their sample, they could do a pretty good job of predicting wealth from call records (Figure 3.14). Second, and even more importantly, Blumenstock and colleagues showed that their procedure could produce high-quality estimates of the geographic distribution of wealth in Rwanda. More specifically, they used their machine learning model, which was trained on their sample of about 1,000 people, to predict the wealth of all 1.5 million people in the call records. Further, with the geospatial data embedded in the call data (recall that the call data include the location of the nearest cell tower for each call), the researchers were able to estimate the approximate place of residence of each person. Putting these two estimates together, the researchers produced an estimate of the geographic distribution of subscriber wealth at extremely fine spatial granularity. For example, they could estimate the average wealth in each of Rwanda's 2,148 cells (the smallest administrative unit in the country). These predicted wealth values were so granular that they were difficult to check. So the researchers aggregated their results to produce estimates of the average wealth of Rwanda's 30 districts. These district-level estimates were strongly related to the estimates from a gold-standard traditional survey, the Rwandan Demographic and Health Survey (Figure 3.14). Although the estimates from the two sources were similar, the estimates from Blumenstock and colleagues were about 50 times cheaper and 10 times faster (when cost is measured in terms of variable costs). This dramatic decrease in cost means that rather than being run every few years—as is standard for Demographic and Health Surveys—this hybrid of a small survey combined with big digital trace data could be run every month.
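Continuing the earlier sketches, here is roughly what the scale-up step might look like: predict the wealth index for all subscribers, attach each subscriber's estimated place of residence, and aggregate. The regression model shown is a stand-in for their more sophisticated approach, and the "district" column and modal-cell-tower residence rule are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import Ridge   # stand-in for their more sophisticated model

# Step 1: fit on the ~1,000 surveyed people (X and wealth_index from the sketches above).
reg = Ridge().fit(X, wealth_index)

# Step 2: impute a wealth value for every subscriber in the call records.
predicted_wealth = pd.Series(reg.predict(features), index=features.index)

# Assign each person a district, here via the district of their modal cell tower
# (an assumed rule; the "district" column is hypothetical).
residence = calls.groupby("caller_id")["district"].agg(lambda d: d.mode().iloc[0])

# Aggregate to district-level averages, which can be compared with DHS estimates.
district_wealth = (pd.DataFrame({"wealth": predicted_wealth, "district": residence})
                     .groupby("district")["wealth"].mean())
```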

Figure 3.13: Schematic of Blumenstock, Cadamuro, and On (2015). Call data from the phone company was converted to a matrix with one row for each person and one column for each feature (i.e., variable). Next, the researchers built a supervised learning model to predict the survey responses from the person-by-feature matrix. Then, the supervised learning model was used to impute the survey responses for everyone. In essence, the researchers used the responses of about one thousand people to impute the wealth of about one million people. Also, the researchers estimated the approximate place of residence for all 1.5 million people based on the locations of their calls. When these two estimates were combined—the estimated wealth and the estimated place of residence—the results were similar to estimates from the Demographic and Health Survey, a gold-standard traditional survey (Figure 3.14).

Figure 3.14: Results from Blumenstock, Cadamuro, and On (2015). At the individual level, the researchers were able to do a reasonable job of predicting someone's wealth from their call records. The estimates of district-level wealth, which were based on the individual-level estimates of wealth and place of residence, were similar to results from the Demographic and Health Survey, a gold-standard traditional survey.

In conclusion, Blumenstock’s amplified asking approach combined survey data with digital trace data to produce estimates comparable with gold-standard survey estimates. This particular example also clarifies some of the trade-offs between amplified asking and traditional survey methods. First, the amplified asking estimates were more timely, substantially cheaper, and more granular. But, on the other hand, at this time, there is not a strong theoretical basis for this kind of amplified asking. That is, this one example does not show when it will work and when it won’t. Further, the amplified asking approach does not yet have good ways to quantify uncertainty around its estimates. However, amplified asking has deep connections to three large areas in statistics—model-based post-stratification (Little 1993), imputation (Rubin 2004), and small-area estimation (Rao and Molina 2015)—and so I expect that progress will be rapid.

Amplified asking follows a basic recipe that can be tailored to your particular situation. There are two ingredients and two steps. The two ingredients are 1) a digital trace dataset that is wide but thin (that is, it has many people but not the information that you need about each person) and 2) a survey that is narrow but thick (that is, it has only a few people, but it has the information that you need about those people). Then, there are two steps. First, for the people in both data sources, build a machine learning model that uses the digital trace data to predict the survey answers. Next, use that machine learning model to impute the survey answers of everyone in the digital trace data. Thus, if there is some question that you want to ask of lots of people, look for digital trace data from those people that might be used to predict their answer.
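In code, this recipe is little more than a fit-then-predict pattern. The skeleton below is a minimal sketch of those two steps, with hypothetical names (trace_features, survey_answers) standing in for whatever your two data sources contain; the particular model is just a placeholder.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Ingredient 1: trace_features  -- wide but thin, one row per person (hypothetical name)
# Ingredient 2: survey_answers  -- narrow but thick, indexed by the surveyed subset (hypothetical)

# Step 1: for the people in both data sources, learn to predict survey answers from traces.
overlap = trace_features.index.intersection(survey_answers.index)
model = GradientBoostingRegressor()
model.fit(trace_features.loc[overlap], survey_answers.loc[overlap])

# Step 2: impute the survey answer for everyone in the digital trace data.
imputed_answers = model.predict(trace_features)
```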

Comparing Blumenstock's first and second attempts at the problem also illustrates an important lesson about the transition from second-era to third-era approaches to survey research: the beginning is not the end. That is, many times, the first approach will not be the best, but if researchers continue working, things can get better. More generally, when evaluating new approaches to social research in the digital age, it is important to make two distinct evaluations: 1) how well does this work now and 2) how well might this work in the future as the data landscape changes and as researchers devote more attention to the problem. Although researchers are trained to make the first kind of evaluation (how good is this particular piece of research), the second is often more important.