2.4.1 Counting things

Simple counting can be interesting if you combine a good question with good data.

Although it is couched in sophisticated-sounding language, lots of social research is really just counting things. In the age of big data, researchers can count more than ever, but that does not mean that they should just start counting haphazardly. Instead, researchers should ask: What things are worth counting? This may seem like an entirely subjective matter, but there are some general patterns.

Often students motivate their counting research by saying: “I’m going to count something that no one has ever counted before.” For example, a student might say that many people have studied migrants and many people have studied twins, but nobody has studied migrant twins. In my experience, this strategy, which I call motivation by absence, does not usually lead to good research. Motivation by absence is kind of like saying that there’s a hole over there, and I’m going to work very hard to fill it up. But not every hole needs to be filled.

Instead of motivating by absence, I think a better strategy is to look for research questions that are important or interesting (or ideally both). Both of these terms are a bit hard to define, but one way to think about important research is that it has some measurable impact or feeds into an important decision by policy makers. For example, measuring the rate of unemployment is important because it is an indicator of the economy that drives policy decisions. Generally, I think that researchers have a pretty good sense of what is important. So, in the rest of this section, I’m going to provide two examples where I think counting is interesting. In each case, the researchers were not counting haphazardly; rather, they were counting in very particular settings that revealed important insights into more general ideas about how social systems work. In other words, a lot of what makes these particular counting exercises interesting comes not from the data itself but from these more general ideas.

One example of the simple power of counting comes from Henry Farber’s (2015) study of the behavior of New York City taxi drivers. Although this group might not sound inherently interesting, it is a strategic research site for testing two competing theories in labor economics. For the purposes of Farber’s research, there are two important features about the work environment of taxi drivers: (1) their hourly wage fluctuates from day to day, based in part on factors like the weather, and (2) the number of hours they work can fluctuate each day based on their decisions. These features lead to an interesting question about the relationship between hourly wages and hours worked. Neoclassical models in economics predict that taxi drivers will work more on days when they have higher hourly wages. Alternatively, models from behavioral economics predict exactly the opposite. If drivers set a particular income target—say $100 per day—and work until that target is met, then drivers will end up working fewer hours on days when they are earning more. For example, if you were a target earner, you might end up working four hours on a good day ($25 per hour) and five hours on a bad day ($20 per hour). So, do drivers work more hours on days with higher hourly wages (as predicted by the neoclassical models) or more hours on days with lower hourly wages (as predicted by the behavioral economic models)?
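
To make the contrast between the two predictions concrete, here is a minimal sketch in Python of the target-earner logic, using the illustrative $100-per-day target from the example above; the neoclassical model predicts the opposite pattern.

```python
# A minimal sketch of the target-earner logic, using the illustrative
# numbers from the text (a hypothetical $100-per-day income target).

INCOME_TARGET = 100  # dollars per day (illustrative)

def hours_for_target_earner(hourly_wage, target=INCOME_TARGET):
    """A target earner stops working once the daily income target is met,
    so hours worked fall as the hourly wage rises."""
    return target / hourly_wage

for wage in (20, 25):
    print(f"${wage}/hour -> {hours_for_target_earner(wage):.1f} hours worked")
# $20/hour -> 5.0 hours worked
# $25/hour -> 4.0 hours worked
# The neoclassical model predicts the reverse: more hours on the $25/hour day.
```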

To answer this question, Farber obtained data on every taxi trip taken by New York City cabs from 2009 to 2013, data that are now publicly available. These data—which were collected by electronic meters that the city requires taxis to use—include information about each trip: start time, start location, end time, end location, fare, and tip (if the tip was paid with a credit card). Using these taxi meter data, Farber found that most drivers work more on days when wages are higher, consistent with the neoclassical theory.
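
To give a sense of the kind of calculation involved, here is a hedged sketch of how trip-level records like these could be aggregated into the two quantities that Farber’s analysis relates, hours worked per day and the average hourly wage. The column names and the tiny example table are placeholders rather than the actual fields in the city’s data, and Farber’s own measure of hours is based on whole shifts rather than summed trip durations.

```python
import pandas as pd

# A sketch of turning trip-level records into driver-day summaries.
# The column names (driver_id, pickup_time, dropoff_time, fare, tip) and the
# three example trips are placeholders, not the actual fields or values in
# the New York City data.
trips = pd.DataFrame({
    "driver_id":    ["A", "A", "B"],
    "pickup_time":  pd.to_datetime(["2013-05-01 08:00", "2013-05-01 09:00", "2013-05-01 20:00"]),
    "dropoff_time": pd.to_datetime(["2013-05-01 08:30", "2013-05-01 09:45", "2013-05-01 20:40"]),
    "fare":         [12.5, 21.0, 18.0],
    "tip":          [2.0, 0.0, 3.5],  # only tips paid by credit card appear in the city data
})

# Hours and earnings per trip; Farber's own measure of hours covers whole
# shifts (including waiting time between trips), so this is a simplification.
trips["hours"] = (trips["dropoff_time"] - trips["pickup_time"]).dt.total_seconds() / 3600
trips["earnings"] = trips["fare"] + trips["tip"]
trips["day"] = trips["pickup_time"].dt.date

# Aggregate to driver-days: total hours worked and average hourly wage.
driver_days = (
    trips.groupby(["driver_id", "day"])
         .agg(hours_worked=("hours", "sum"), earnings=("earnings", "sum"))
)
driver_days["hourly_wage"] = driver_days["earnings"] / driver_days["hours_worked"]
print(driver_days)
```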

In addition to this main finding, Farber was able to use the size of the data to develop a better understanding of heterogeneity and dynamics. He found that, over time, newer drivers gradually learn to work more hours on high-wage days (i.e., they learn to behave as the neoclassical model predicts). And new drivers who behave more like target earners are more likely to quit being taxi drivers. Both of these more subtle findings, which help explain the observed behavior of current drivers, were only possible because of the size of the dataset. They were impossible to detect in earlier studies that used paper trip sheets from a small number of taxi drivers over a short period of time (Camerer et al. 1997).

Farber’s study was close to a best-case scenario for research using a big data source because the data that were collected by the city were pretty close to the data that Farber would have collected (one difference is that Farber would have wanted data on total wages—fares plus tips—but the city data only included tips paid by credit card). However, the data alone were not enough. The key to Farber’s research was bringing an interesting question to the data, a question that has larger implications beyond just this specific setting.

A second example of counting things comes from research by Gary King, Jennifer Pan, and Molly Roberts (2013) on online censorship by the Chinese government. In this case, however, the researchers had to collect their own big data, and they had to deal with the fact that their data were incomplete.

King and colleagues were motivated by the fact that social media posts in China are censored by an enormous state apparatus that is thought to include tens of thousands of people. Researchers and citizens, however, have little sense of how these censors decide what content should be deleted. Scholars of China actually have conflicting expectations about which kinds of posts are most likely to get deleted. Some think that censors focus on posts that are critical of the state, while others think that they focus on posts that encourage collective behavior, such as protests. Figuring out which of these expectations is correct has implications for how researchers understand China and other authoritarian governments that engage in censorship. Therefore, King and colleagues wanted to compare posts that were published and subsequently deleted with posts that were published and never deleted.

Collecting these posts involved the amazing engineering feat of crawling more than 1,000 Chinese social media websites—each with different page layouts—finding relevant posts, and then revisiting these posts to see which were subsequently deleted. In addition to the normal engineering problems associated with large-scale web crawling, this project had the added challenge that it needed to be extremely fast because many censored posts are taken down in less than 24 hours. In other words, a slow crawler would miss lots of posts that were censored. Further, the crawlers had to do all this data collection while evading detection lest the social media websites block access or otherwise change their policies in response to the study.
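
As a rough illustration of the revisit step only (not the authors’ actual crawler), the sketch below checks a list of previously collected post URLs at intervals and records when each one disappears; the example URLs and the assumption that deletion shows up as an HTTP 404 response are both hypothetical.

```python
import time
import requests

# A toy sketch of the revisit step only, not the study's actual crawler.
# The URLs and the assumption that a deleted post returns HTTP 404 are
# illustrative; real sites may signal deletion differently (e.g., with a
# placeholder page), and a real crawler must also avoid detection.
post_urls = ["https://example.com/post/1", "https://example.com/post/2"]
deleted_at = {}  # url -> timestamp when the post was first found missing

for _ in range(24):                  # revisit repeatedly over roughly a day
    for url in post_urls:
        if url in deleted_at:
            continue                 # already recorded as deleted
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 404:
                deleted_at[url] = time.time()
        except requests.RequestException:
            pass                     # network error: try again on the next pass
    time.sleep(60 * 60)              # hourly revisits, since many deletions happen within 24 hours

print(deleted_at)
```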

By the time that this massive engineering task had been completed, King and colleagues had obtained about 11 million posts on 85 different prespecified topics, each with an assumed level of sensitivity. For example, a topic of high sensitivity is Ai Weiwei, the dissident artist; a topic of middle sensitivity is appreciation and devaluation of the Chinese currency; and a topic of low sensitivity is the World Cup. Of these 11 million posts, about 2 million had been censored. Somewhat surprisingly, King and colleagues found that posts on highly sensitive topics were censored only slightly more often than posts on middle- and low-sensitivity topics. In other words, Chinese censors are about as likely to censor a post that mentions Ai Weiwei as a post that mentions the World Cup. These findings do not support the idea that the government censors all posts on sensitive topics.
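
The core comparison here is simply a censorship rate per topic: the share of posts on each topic that were later deleted. Below is a minimal sketch of that tabulation; the posts and topics are invented for illustration and are not the study’s data.

```python
import pandas as pd

# An illustrative tabulation of censorship rates by topic. The posts and
# topic labels below are made up; they are not the counts reported by
# King, Pan, and Roberts.
posts = pd.DataFrame({
    "topic":    ["Ai Weiwei", "Ai Weiwei", "currency", "currency", "World Cup", "World Cup"],
    "censored": [True, False, True, False, True, False],
})

# Share of posts on each topic that were subsequently deleted.
rate_by_topic = posts.groupby("topic")["censored"].mean()
print(rate_by_topic)
```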

This simple calculation of censorship rate by topic could be misleading, however. For example, the government might censor posts that are supportive of Ai Weiwei, but leave posts that are critical of him. In order to distinguish between posts more carefully, the researchers needed to measure the sentiment of each post. Unfortunately, despite much work, fully automated methods of sentiment detection using pre-existing dictionaries are still not very good in many situations (think back to the problems creating an emotional timeline of September 11, 2001 described in section 2.3.9). Therefore, King and colleagues needed a way to label their 11 million social media posts as to whether they were (1) critical of the state, (2) supportive of the state, or (3) irrelevant or factual reports about the events. This sounds like a massive job, but they solved it using a powerful trick that is common in data science but relatively rare in social science: supervised learning (see figure 2.5).

First, in a step typically called preprocessing, the researchers converted the social media posts into a document-term matrix, where there was one row for each document and one column for each word, recording whether the post contained that word (e.g., protest or traffic). Next, a group of research assistants hand-labeled the sentiment of a sample of posts. Then, they used these hand-labeled data to create a machine learning model that could infer the sentiment of a post based on its characteristics. Finally, they used this model to estimate the sentiment of all 11 million posts.
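
Here is a minimal sketch of that four-step workflow using scikit-learn. The example posts, the labels, and the choice of classifier are all assumptions made for illustration; the actual study worked with Chinese-language posts and a more elaborate procedure, described in its appendix B.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1 (preprocess): the posts to be classified. In the real study there
# were about 11 million Chinese-language posts; these English examples are
# invented for illustration.
all_posts = [
    "the officials handled the protest badly",
    "the new traffic rules are working well",
    "match report from yesterday's game",
]

# Step 2 (hand-label): research assistants label a small sample.
labeled_posts = [
    "the officials handled the protest badly",
    "the new traffic rules are working well",
]
labels = ["critical", "supportive"]  # the third class (irrelevant/factual) is omitted here

# Build the document-term matrix from the labeled sample.
vectorizer = CountVectorizer()
X_labeled = vectorizer.fit_transform(labeled_posts)

# Step 3 (train): fit a classifier on the hand-labeled sample. The choice of
# logistic regression is an assumption for this sketch, not the study's model.
model = LogisticRegression()
model.fit(X_labeled, labels)

# Step 4 (predict): estimate the sentiment of every post.
X_all = vectorizer.transform(all_posts)
predicted_sentiment = model.predict(X_all)
print(list(zip(all_posts, predicted_sentiment)))
```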

Thus, rather than manually reading and labeling 11 million posts—which would be logistically impossible—King and colleagues manually labeled a small number of posts and then used supervised learning to estimate the sentiment of all the posts. After completing this analysis, they were able to conclude that, somewhat surprisingly, the probability of a post being deleted was unrelated to whether it was critical of the state or supportive of the state.

Figure 2.5: Simplified schematic of the procedure used by King, Pan, and Roberts (2013) to estimate the sentiment of 11 million Chinese social media posts. First, in a preprocessing step, the researchers converted the social media posts into a document-term matrix (see Grimmer and Stewart (2013) for more information). Second, they hand-coded the sentiments of a small sample of posts. Third, they trained a supervised learning model to classify the sentiment of posts. Fourth, they used the supervised learning model to estimate the sentiment of all the posts. See King, Pan, and Roberts (2013), appendix B for a more detailed description.

In the end, King and colleagues discovered that only three types of posts were regularly censored: pornography, criticism of the censors, and posts with collective action potential (i.e., the possibility of leading to large-scale protests). By observing a huge number of posts that were deleted and posts that were not deleted, King and colleagues were able to learn how the censors work just by watching and counting. Further, foreshadowing a theme that will occur throughout this book, the supervised learning approach that they used—hand-labeling some outcomes and then building a machine learning model to label the rest—turns out to be very common in social research in the digital age. You will see pictures very similar to figure 2.5 in chapters 3 (Asking questions) and 5 (Creating mass collaboration); this is one of the few ideas that appears in multiple chapters.

These examples—the working behavior of taxi drivers in New York and the social media censorship behavior of the Chinese government—show that relatively simple counting of big data sources can, in some situations, lead to interesting and important research. In both cases, however, the researchers had to bring interesting questions to the big data source; the data by themselves were not enough.