2.3.2.3 Non-representative

Two sources of non-representativeness are different populations and different usage patterns.

Big data tend to be systematically biased in two main ways. This need not cause a problem for all kinds of analysis, but for some analyses it can be a critical flaw.

A first source of systematic bias is that the people captured are typically neither a complete universe of all people nor a random sample from any specific population. For example, Americans on Twitter are not a random sample of Americans (Hargittai 2015). A second source of systematic bias is that many big data systems capture actions, and some people contribute many more actions than others. For example, some people on Twitter contribute hundreds of times more tweets than others. Therefore, the events on a specific platform can be even more heavily skewed toward certain subgroups than the platform's user base is.
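To make the second source of bias concrete, here is a minimal sketch in Python. The data are invented purely for illustration: the user IDs, the subgroup labels "A" and "B", and the post counts are all hypothetical, as are the helper names `user_shares` and `event_shares`. The sketch compares each subgroup's share of users with its share of recorded events.

```python
from collections import Counter

# Hypothetical toy data: (user_id, subgroup, number_of_posts).
# None of these values come from a real platform.
users = [
    ("u1", "A", 500),
    ("u2", "A", 300),
    ("u3", "B", 5),
    ("u4", "B", 3),
    ("u5", "B", 2),
    ("u6", "B", 0),
]


def user_shares(records):
    """Share of each subgroup when every user counts once."""
    counts = Counter(group for _, group, _ in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def event_shares(records):
    """Share of each subgroup when every post counts once."""
    counts = Counter()
    for _, group, n_posts in records:
        counts[group] += n_posts
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


if __name__ == "__main__":
    print("share of users: ", user_shares(users))
    print("share of events:", event_shares(users))
```

In this toy example, subgroup A makes up one third of the users but contributes almost all of the posts. An analysis that treats each post as an observation would therefore mostly describe subgroup A, even though most users belong to subgroup B.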

Normally researchers want to know a lot about the data that they have. But, given the non-representative nature of big data, it is helpful to also flip your thinking. You also need to know a lot about the data that you don’t have. This is especially true when the data that you don’t have are systematically different from the data that you do have. For example, if you have the call records from a mobile phone company in a developing country, you should think not just about the people in your dataset, but also about the people who might be too poor to own a mobile phone. Further, in Chapter 3, we’ll learn about how weighting can enable researchers to make better estimates from non-representative data.