2.3 Ten common characteristics of big data

Big data sources tend to have a number of characteristics in common; some are generally good for social research and some are generally bad.

Even though each big data source is distinct, certain characteristics tend to occur over and over again. Therefore, rather than taking a platform-by-platform approach (e.g., here’s what you need to know about Twitter, here’s what you need to know about Google search data, etc.), I’m going to describe ten general characteristics of big data sources. Stepping back from the details of each particular system and looking at these general characteristics enables researchers to quickly learn about existing data sources and gives them a firm set of ideas to apply to the data sources that will be created in the future.

Even though the desired characteristics of a data source depend on the research goal, I find it helpful to crudely group the ten characteristics into two broad categories:

  • generally helpful for research: big, always-on, and nonreactive
  • generally problematic for research: incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive

As I describe these characteristics, you’ll notice that they often arise because big data sources were not created for the purpose of research.