Data held by businesses and governments are difficult for researchers to access.

In May 2014, the US National Security Agency opened a data center in rural Utah with an awkward name: the Intelligence Community Comprehensive National Cybersecurity Initiative Data Center. This data center, which has come to be known as the Utah Data Center, is reported to have astounding capabilities. One report alleges that the Utah Data Center is able to store and process all forms of communication including “the complete contents of private emails, cell phone calls, and Google searches, as well as all sorts of personal data trails—parking receipts, travel itineraries, bookstore purchases, and other digital ‘pocket litter’” (Bamford 2012). In addition to raising concerns about the sensitive nature of much of the information captured in big data, which will be described more below, the Utah Data Center is an extreme example of a rich data source that is inaccessible to researchers. More generally, many sources of big data that would be useful to researchers are controlled and restricted by governments (e.g., tax data and educational data) and companies (e.g., queries to search engines and phone call metadata). Therefore, these data will not be immediately available to researchers at universities, and most will not even be available to researchers in the governments or companies.

In my experience, many researchers based at universities misunderstand the source of this inaccessibility. These data are not inaccessible because people at companies and governments are stupid, lazy, or uncaring. Rather, there are serious legal, technical, business, and ethical barriers that prevent data access. For example, some terms-of-service agreements for websites only allow data to be used by employees or to improve the service. So certain forms of data sharing could expose companies to legitimate lawsuits from customers. There are also substantial business risks to companies involved in sharing data. Try to imagine how the public would respond if personal search data accidentally leaked out from Google as part of a university research project. Such a data breach, if extreme, might even be an existential risk for the company. So Google, like most large companies, is very risk-averse about sharing data with researchers.

In fact, almost everyone who is in a position to provide access to large amounts of data knows the story of Abdur Chowdhury. In 2006, when he was the head of AOL research, he intentionally released what he thought were anonymized search queries from 650,000 AOL users to the research community. As far as I can tell, Chowdhury and the researchers at AOL had good intentions and they thought that they had anonymized the data. But they were wrong. It was quickly discovered that the data were not as anonymous as the researchers thought, and reporters from the New York Times were able to identify people in the dataset with ease (Barbaro and Zeller Jr 2006). Once these problems were discovered, Chowdhury removed the data from AOL’s website, but it was too late. The data had been reposted on other websites, and it will probably still be available when you are reading this book. Because of his attempt to share data with the research community, Chowdhury was fired, and AOL’s chief technology officer resigned (Hafner 2006). As this example shows, the benefits to specific individuals inside companies of facilitating data access are pretty small, and the worst-case scenario is terrible.

Researchers can, however, sometimes gain access to data that is inaccessible to the general public. Governments have procedures that researchers can follow to apply for access, and as the examples later in this chapter show, researchers can occasionally gain access to corporate data. For example, Einav et al. (2015) partnered with a researcher at eBay to study the digital traces from online auctions. I’ll talk more about the research that came from this collaboration later in the chapter, but I mention it now because it had all four of the ingredients that I see in successful partnerships: researcher interest, researcher capability, company interest, and company capability. In other words, Einav and colleagues were interested in and capable of studying online auctions, and so was eBay. However, I’ve seen many possible collaborations fail because either the researcher or the company lacked one of these ingredients.

Even if you are able to develop a partnership with a business, however, there are some downsides for you. First, the questions that you can ask with the data will likely be limited; companies are unlikely to allow research that could make them look bad. Second, you will probably not be able to share your data with other researchers, which means that other researchers will not be able to verify and extend your results. Third, these partnerships can create at least the appearance of a conflict of interest, where people might think that your results were influenced by your partnerships. All of these downsides can be addressed, but it is important to be clear that working with data that is not accessible to everyone has both upsides and downsides.

In summary, lots of big data is inaccessible to researchers. There are serious legal, technical, business, and ethical barriers that prevent data access, and these barriers will not go away. National governments generally have established procedures for enabling data access, but the process can be more ad hoc at the state and local levels. Also, in some cases, researchers can partner with companies to obtain data access, but this can create a variety of problems for researchers.