2.3 Common characteristics of big data

Big data sources tend to share ten characteristics; some help social research and some hinder it.

If researchers are going to learn from big data that they did not create or collect, then they must understand its general characteristics. Rather than taking a platform-by-platform approach (e.g., here’s what you need to know about Twitter, here’s what you need to know about Google search data, etc.), I’m going to describe ten general characteristics of big data, characteristics that arise because the data was not created for the purpose of social research. By stepping back from the details of each particular system and looking at these general properties, researchers can quickly learn more about existing data sources and have a firm set of ideas to apply to future data sources.

I find it helpful to group the characteristics into two categories:

  • generally good for research: big, always-on, and non-reactive
  • generally bad for research: incomplete, inaccessible, non-representative, drifting, algorithmically confounded, dirty, and sensitive

Broadly speaking, government administrative records tend to be more representative, less algorithmically confounded, and less prone to drift than business administrative records. On the other hand, business administrative records tend to be larger and more always-on.