No matter how “big” your “big data,” it probably doesn’t have the information you want.

Most big data sources are incomplete, in the sense that they don’t have the information that you will want for your research. This is a common feature of data that were created for purposes other than research. Many social scientists have already had the experience of dealing with this kind of incompleteness, such as an existing survey that didn’t ask the question they wanted. Unfortunately, the problems of incompleteness tend to be more extreme in big data. In my experience, big data sources tend to be missing three types of information useful for social research: demographics, behavior on other platforms, and data to operationalize theoretical constructs.

All three of these forms of incompleteness are illustrated in a study by Gueorgi Kossinets and Duncan Watts (2006) about the evolution of the social network at a university. Kossinets and Watts started with the email logs from the university, which had precise information about who sent emails to whom at what time (the researchers did not have access to the content of the emails). These email records sound like an amazing dataset, but, despite their size and granularity, they are fundamentally incomplete. For example, the email logs do not include data about the demographic characteristics of the students, such as gender and age. Further, the email logs do not include information about communication through other media, such as phone calls, text messages, or face-to-face conversations. Finally, the email logs do not directly include information about relationships, the theoretical construct in many existing theories. Later in the chapter, when I talk about research strategies, you’ll see how Kossinets and Watts solved these problems.

Of the three kinds of incompleteness, the problem of incomplete data to operationalize theoretical constructs is the hardest to solve, and in my experience, it is often inadvertently overlooked by data scientists. Roughly, theoretical constructs are abstract ideas that social scientists study, but, unfortunately, these constructs cannot always be unambiguously defined and measured. For example, let’s imagine trying to empirically test the apparently simple claim that people who are more intelligent earn more money. In order to test this claim you would need to measure “intelligence.” But what is intelligence? For example, Gardner (2011) argued that there are actually eight different forms of intelligence. And are there procedures that could accurately measure any of these forms of intelligence? Despite enormous amounts of work by psychologists, these questions still don’t have unambiguous answers. Thus, even a relatively simple claim—people who are more intelligent earn more money—can be hard to assess empirically because it can be hard to operationalize theoretical constructs in data. Other examples of theoretical constructs that are important but hard to operationalize include “norms,” “social capital,” and “democracy.” Social scientists call the match between theoretical constructs and data construct validity (Cronbach and Meehl 1955). And, as this list of constructs suggests, construct validity is a problem that social scientists have struggled with for a very long time, even when they were working with data that were collected for the purpose of research. When working with data collected for purposes other than research, the problems of construct validity are even more challenging (Lazer 2015).

When you are reading a research paper, one quick and useful way to assess concerns about construct validity is to take the main claim in the paper, which is usually expressed in terms of constructs, and re-express it in terms of the data used. For example, consider two hypothetical studies that claim to show that more intelligent people earn more money:

  • Study 1: people who score well on the Raven Progressive Matrices Test—a well-studied test of analytic intelligence (Carpenter, Just, and Shell 1990)—have higher reported incomes on their tax returns
  • Study 2: people on Twitter who used longer words are more likely to mention luxury brands

In both cases, researchers could assert that they have shown that more intelligent people earn more money. But, in the first study the theoretical constructs are well operationalized by the data, and in the second they are not. Further, as this example illustrates, more data does not automatically solve problems with construct validity. You should doubt the results of Study 2 whether it involved a million tweets, a billion tweets, or a trillion tweets. For researchers not familiar with the idea of construct validity, Table 2.2 provides some examples of studies that have operationalized theoretical constructs using digital trace data.

Table 2.2: Examples of digital traces that are used as measures of more abstract theoretical constructs. Social scientists call this match construct validity, and it is a major challenge when using big data sources for social research (Lazer 2015).

| Digital trace | Theoretical construct | Citation |
|---|---|---|
| email logs from a university (meta-data only) | Social relationships | Kossinets and Watts (2006), Kossinets and Watts (2009), De Choudhury et al. (2010) |
| social media posts on Weibo | Civic engagement | Zhang (2016) |
| email logs from a firm (meta-data and complete text) | Cultural fit in an organization | Goldberg et al. (2015) |

Although the problem of incomplete data for operationalizing theoretical constructs is quite hard to solve, there are three common solutions to the problems of incomplete demographic information and incomplete information on behavior on other platforms. The first is to actually collect the data you need; I’ll describe an example of that in Chapter 3, when I talk about surveys. Unfortunately, this kind of data collection is not always possible. The second main solution is to do what data scientists call user-attribute inference and what social scientists call imputation. In this approach, researchers use the information that they have on some people to infer attributes of other people. The third possible solution, the one used by Kossinets and Watts, is to combine multiple data sources. This process is sometimes called merging or record linkage. My favorite metaphor for this process was proposed in the very first paragraph of the very first paper ever written on record linkage (Dunn 1946):

“Each person in the world creates a Book of Life. This Book starts with birth and ends with death. Its pages are made up of records of the principal events in life. Record linkage is the name given to the process of assembling the pages of this Book into a volume.”

This passage was written in 1946, and at that time, people were thinking that the Book of Life could include major life events like birth, marriage, divorce, and death. However, now that so much information about people is recorded, the Book of Life could be an incredibly detailed portrait, if those different pages (i.e., our digital traces) can be bound together. This Book of Life could be a great resource for researchers. But the Book of Life could also be called a database of ruin (Ohm 2010), which could be used for all kinds of unethical purposes, as I describe later in this chapter when I talk about the sensitive nature of the information collected by big data sources, and again in Chapter 6 (Ethics).
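To make the record-linkage idea concrete, here is a minimal sketch of merging two incomplete data sources on a shared identifier. Everything here is hypothetical and purely illustrative: the datasets, the `email` identifier, and the column names are my inventions, not data from Kossinets and Watts; in real linkage projects, identifiers are often noisy and matching requires far more care.

```python
# A minimal, hypothetical sketch of record linkage: combining two
# incomplete data sources about the same people on a shared identifier.
import pandas as pd

# Source 1: email logs, aggregated per person (no demographics).
email_activity = pd.DataFrame({
    "email": ["a@u.edu", "b@u.edu", "c@u.edu"],
    "messages_sent": [120, 45, 300],
})

# Source 2: registrar records with the demographics the logs lack.
registrar = pd.DataFrame({
    "email": ["a@u.edu", "b@u.edu", "d@u.edu"],
    "gender": ["F", "M", "F"],
    "year": [2, 4, 1],
})

# Bind the "pages of the Book of Life" together on the shared key.
# An inner join keeps only people present in both sources, which is
# itself a new form of incompleteness worth measuring and reporting.
linked = email_activity.merge(registrar, on="email", how="inner")
print(linked)
```

Note the design choice hiding in `how="inner"`: people who appear in only one source (here, `c@u.edu` and `d@u.edu`) silently drop out of the linked dataset, so the linkage rate itself should be checked before analysis.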