6.6.2 Understanding and managing informational risk

Informational risk is the most common risk in social research; it has increased dramatically; and it is the hardest risk to understand.

The second ethical challenge for digital-age research is informational risk, the potential for harm from the disclosure of information (National Research Council 2014). Informational harms from the disclosure of personal information could be economic (e.g., losing a job), social (e.g., embarrassment), psychological (e.g., depression), or even criminal (e.g., arrest for illegal behavior). Unfortunately, the digital age increases informational risk dramatically—there is just so much more information about our behavior. And informational risk has proven very difficult to understand and manage compared with risks that were concerns in analog-age social research, such as physical risk.

One way that social researchers decrease informational risk is “anonymization” of data. “Anonymization” is the process of removing obvious personal identifiers such as name, address, and telephone number from the data. However, this approach is much less effective than many people realize, and it is, in fact, deeply and fundamentally limited. For that reason, whenever I describe “anonymization,” I’ll use quotation marks to remind you that this process creates the appearance of anonymity but not true anonymity.

A vivid example of the failure of “anonymization” comes from the late 1990s in Massachusetts (Sweeney 2002). The Group Insurance Commission (GIC) was a government agency responsible for purchasing health insurance for all state employees. Through this work, the GIC collected detailed health records about thousands of state employees. In an effort to spur research, the GIC decided to release these records to researchers. However, they did not share all of their data; rather, they “anonymized” these data by removing information such as names and addresses. They did leave in other information that they thought could be useful for researchers, such as demographic information (zip code, birth date, ethnicity, and sex) and medical information (visit data, diagnosis, procedure) (figure 6.4) (Ohm 2010). Unfortunately, this “anonymization” was not sufficient to protect the data.

Figure 6.4: “Anonymization” is the process of removing obviously identifying information. For example, when releasing the medical insurance records of state employees, the Massachusetts Group Insurance Commission (GIC) removed names and addresses from the files. I use the quotation marks around the word “anonymization” because the process provides the appearance of anonymity but not actual anonymity.

To illustrate the shortcomings of the GIC “anonymization,” Latanya Sweeney, then a graduate student at MIT, paid $20 to acquire the voting records from the city of Cambridge, the hometown of Massachusetts governor William Weld. These voting records included information such as name, address, zip code, birth date, and sex. The fact that the medical data file and the voter file shared fields (zip code, birth date, and sex) meant that Sweeney could link them. Sweeney knew that Weld’s birthday was July 31, 1945, and the voting records included only six people in Cambridge with that birthday. Further, of those six people, only three were male. And, of those three men, only one shared Weld’s zip code. Thus, the voting data showed that anyone in the medical data with Weld’s combination of birth date, sex, and zip code was William Weld. In essence, these three pieces of information provided a unique fingerprint for him in the data. Using this fact, Sweeney was able to locate Weld’s medical records, and, to inform him of her feat, she mailed him a copy of his records (Ohm 2010).

Figure 6.5: Re-identification of “anonymized” data. Latanya Sweeney combined the “anonymized” health records with voting records in order to find the medical records of Governor William Weld. Adapted from Sweeney (2002), figure 1.

Sweeney’s work illustrates the basic structure of re-identification attacks—to adopt a term from the computer security community. In these attacks, two data sets, neither of which by itself reveals sensitive information, are linked, and through this linkage, sensitive information is exposed.
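The structure of such a linkage can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: every record, name, and field name below is invented, and the function `link` is a hypothetical helper, not Sweeney's actual procedure. It joins a file without names to a file with names on the shared fields and keeps only the matches where the combination of fields points to exactly one person.

```python
from collections import defaultdict

# Invented "anonymized" medical records: no names, but shared fields remain.
medical = [
    {"zip": "02138", "birth_date": "1950-01-02", "sex": "M", "diagnosis": "A"},
    {"zip": "02138", "birth_date": "1950-01-02", "sex": "F", "diagnosis": "B"},
    {"zip": "02139", "birth_date": "1962-07-15", "sex": "M", "diagnosis": "C"},
]

# Invented voter records: names plus the same shared fields.
voters = [
    {"name": "Person One", "zip": "02138", "birth_date": "1950-01-02", "sex": "M"},
    {"name": "Person Two", "zip": "02138", "birth_date": "1950-01-02", "sex": "F"},
    {"name": "Person Three", "zip": "02139", "birth_date": "1962-07-15", "sex": "M"},
    {"name": "Person Four", "zip": "02139", "birth_date": "1962-07-15", "sex": "M"},
]

# Fields present in both files (an assumption for this sketch).
SHARED_FIELDS = ("zip", "birth_date", "sex")


def link(medical_records, voter_records, fields=SHARED_FIELDS):
    """Return (name, diagnosis) pairs where the shared fields match one voter."""
    names_by_key = defaultdict(list)
    for voter in voter_records:
        key = tuple(voter[f] for f in fields)
        names_by_key[key].append(voter["name"])

    matches = []
    for record in medical_records:
        key = tuple(record[f] for f in fields)
        names = names_by_key.get(key, [])
        if len(names) == 1:  # the combination is a unique fingerprint
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(medical, voters)
print(reidentified)
```

In this toy data, the first two medical records are each linked to a single named voter, while the third is not, because two voters share its zip code, birth date, and sex. Neither file alone connects a name to a diagnosis; the link between them does.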

In response to Sweeney’s work, and other related work, researchers now generally remove much more information—all so-called “personally identifying information” (PII) (Narayanan and Shmatikov 2010)—during the process of “anonymization.” Further, many researchers now realize that certain data—such as medical records, financial records, answers to survey questions about illegal behavior—are probably too sensitive to release even after “anonymization.” However, the examples that I’m about to give suggest that social researchers need to change their thinking. As a first step, it is wise to assume that all data are potentially identifiable and all data are potentially sensitive. In other words, rather than thinking that informational risk applies to a small subset of projects, we should assume that it applies—to some degree—to all projects.

Both aspects of this reorientation are illustrated by the Netflix Prize. As described in chapter 5, Netflix released 100 million movie ratings provided by almost 500,000 members, and had an open call where people from all over the world submitted algorithms that could improve Netflix’s ability to recommend movies. Before releasing the data, Netflix removed any obvious personally identifying information, such as names. They also went an extra step and introduced slight perturbations in some of the records (e.g., changing some ratings from 4 stars to 3 stars). They soon discovered, however, that despite their efforts, the data were still by no means anonymous.

Just two weeks after the data were released, Arvind Narayanan and Vitaly Shmatikov (2008) showed that it was possible to learn about specific people’s movie preferences. The trick to their re-identification attack was similar to Sweeney’s: merge together two information sources, one with potentially sensitive information and no obviously identifying information and one that contains people’s identities. Each of these data sources may be individually safe, but when they are combined, the merged dataset can create informational risk. In the case of the Netflix data, here’s how it could happen. Imagine that I choose to share my thoughts about action and comedy movies with my co-workers, but that I prefer not to share my opinion about religious and political movies. My co-workers could use the information that I’ve shared with them to find my records in the Netflix data; the information that I share could be a unique fingerprint just like William Weld’s birth date, zip code, and sex. Then, if they found my unique fingerprint in the data, they could learn my ratings about all movies, including movies that I choose not to share. In addition to this kind of targeted attack focused on a single person, Narayanan and Shmatikov also showed that it was possible to do a broad attack—one involving many people—by merging the Netflix data with personal and movie rating data that some people have chosen to post on the Internet Movie Database (IMDb). Quite simply, any information that is a unique fingerprint to a specific person—even their set of movie ratings—can be used to identify them.
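To make the idea of a rating fingerprint concrete, here is a minimal sketch in Python. It is not the algorithm of Narayanan and Shmatikov, which uses a more careful statistical scoring; it is a simplified stand-in that I am assuming for illustration. The movie titles, the pseudonymous IDs, and the helper `find_candidates` are all hypothetical. The `tolerance` parameter allows a known rating to differ slightly from the released one, which mimics the small perturbations described above.

```python
def find_candidates(released, known, tolerance=1):
    """Return pseudonymous IDs whose released ratings agree with all known ratings.

    released:  {pseudonymous_id: {movie: stars}}
    known:     {movie: stars} that the target has shared publicly
    tolerance: allowed difference in stars, to absorb small perturbations
    """
    candidates = []
    for user_id, ratings in released.items():
        agrees = all(
            movie in ratings and abs(ratings[movie] - stars) <= tolerance
            for movie, stars in known.items()
        )
        if agrees:
            candidates.append(user_id)
    return candidates


# Invented released data: names removed, ratings kept.
released = {
    "u1": {"Action A": 5, "Comedy B": 2, "Political C": 1},
    "u2": {"Action A": 4, "Comedy B": 5, "Religious D": 5},
    "u3": {"Action A": 1, "Comedy B": 2},
}

# Ratings the target chose to share with co-workers.
known = {"Action A": 5, "Comedy B": 4}

candidates = find_candidates(released, known)

# If exactly one record fits, everything else in it is revealed.
revealed = {}
if len(candidates) == 1:
    record = released[candidates[0]]
    revealed = {m: s for m, s in record.items() if m not in known}

print(candidates, revealed)
```

In this invented example, only one record is consistent with the two shared ratings, so the rating for the movie that the person chose not to discuss is exposed as well.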

Even though the Netflix data can be re-identified in either a targeted or broad attack, it still might appear to be low risk. After all, movie ratings don’t seem very sensitive. While that might be true in general, for some of the 500,000 people in the dataset, movie ratings might be quite sensitive. In fact, in response to the re-identification, a closeted lesbian woman joined a class-action suit against Netflix. Here’s how the problem was expressed in their lawsuit (Singel 2009):

“[M]ovie and rating data contains information of a … highly personal and sensitive nature. The member’s movie data exposes a Netflix member’s personal interest and/or struggles with various highly personal issues, including sexuality, mental illness, recovery from alcoholism, and victimization from incest, physical abuse, domestic violence, adultery, and rape.”

The re-identification of the Netflix Prize data illustrates both that all data are potentially identifiable and that all data are potentially sensitive. At this point, you might think that this only applies to data that purport to be about people. Surprisingly, that is not the case. In response to a Freedom of Information Law request, the New York City government released records of every taxi ride in New York in 2013, including the pickup and drop-off times, locations, and fare amounts (recall from chapter 2 that Farber (2015) used similar data to test important theories in labor economics). These data about taxi trips might seem benign because they do not seem to provide information about people, but Anthony Tockar realized that this taxi dataset actually contained lots of potentially sensitive information about people. To illustrate, he looked at all trips starting at the Hustler Club (a large strip club in New York) between midnight and 6 a.m. and then found their drop-off locations. This search revealed, in essence, a list of addresses of some people who frequented the Hustler Club (Tockar 2014). It is hard to imagine that the city government had this in mind when it released the data. In fact, this same technique could be used to find the home addresses of people who visit any place in the city, such as a medical clinic, a government building, or a religious institution.
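The search described above requires very little technical skill, which is part of what makes it worrying. The following Python sketch shows the general shape of such a query on made-up data. The field names, coordinates, and the helper `late_night_dropoffs` are assumptions for illustration and do not reflect the real dataset's schema or Tockar's code.

```python
from datetime import datetime


def late_night_dropoffs(trips, venue, radius_deg=0.001):
    """Return drop-off coordinates for trips that start near `venue`
    between midnight and 6 a.m.

    venue:      (latitude, longitude) of the place of interest
    radius_deg: half-width of a simple bounding box, in degrees
    """
    venue_lat, venue_lon = venue
    dropoffs = []
    for trip in trips:
        pickup_time = datetime.fromisoformat(trip["pickup_time"])
        starts_nearby = (
            abs(trip["pickup_lat"] - venue_lat) <= radius_deg
            and abs(trip["pickup_lon"] - venue_lon) <= radius_deg
        )
        is_late_night = 0 <= pickup_time.hour < 6
        if starts_nearby and is_late_night:
            dropoffs.append((trip["dropoff_lat"], trip["dropoff_lon"]))
    return dropoffs


# Invented trips and an invented venue location.
venue = (40.0, -74.0)
trips = [
    {"pickup_time": "2013-01-05T01:30:00", "pickup_lat": 40.0002,
     "pickup_lon": -74.0003, "dropoff_lat": 40.1, "dropoff_lon": -73.9},
    {"pickup_time": "2013-01-05T14:00:00", "pickup_lat": 40.0002,
     "pickup_lon": -74.0003, "dropoff_lat": 40.2, "dropoff_lon": -73.8},
    {"pickup_time": "2013-01-05T02:00:00", "pickup_lat": 40.05,
     "pickup_lon": -74.0, "dropoff_lat": 40.3, "dropoff_lon": -73.7},
]

dropoffs = late_night_dropoffs(trips, venue)
print(dropoffs)
```

Only the first invented trip starts near the venue late at night, so only its drop-off location is returned. Nothing in the data names a passenger, yet a drop-off location at a residential address can be enough to point to one.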

These two cases of the Netflix Prize and the New York City taxi data show that relatively skilled people can fail to correctly estimate the informational risk in the data that they release—and these cases are by no means unique (Barbaro and Zeller 2006; Zimmer 2010; Narayanan, Huey, and Felten 2016). Further, in many such cases, the problematic data are still freely available online, indicating the difficulty of ever undoing a data release. Collectively, these examples—as well as research in computer science about privacy—lead to an important conclusion. Researchers should assume that all data are potentially identifiable and all data are potentially sensitive.

Unfortunately, there is no simple solution to the fact that all data are potentially identifiable and potentially sensitive. However, one way to reduce informational risk while you are working with data is to create and follow a data protection plan. This plan will decrease the chance that your data will leak and will decrease the harm if a leak does somehow occur. The specifics of data protection plans, such as which form of encryption to use, will change over time, but the UK Data Service helpfully organizes the elements of a data protection plan into five categories that they call the five safes: safe projects, safe people, safe settings, safe data, and safe outputs (table 6.2) (Desai, Ritchie, and Welpton 2016). None of the five safes individually provides perfect protection. But together they form a powerful set of factors that can decrease informational risk.

Table 6.2: The “Five Safes” are Principles for Designing and Executing a Data Protection Plan (Desai, Ritchie, and Welpton 2016)
Safe	Action
Safe projects	Limits projects with data to those that are ethical
Safe people	Access is restricted to people who can be trusted with data (e.g., people who have undergone ethical training)
Safe data	Data are de-identified and aggregated to the extent possible
Safe settings	Data are stored in computers with appropriate physical (e.g., locked room) and software (e.g., password protection, encrypted) protection
Safe outputs	Research output is reviewed to prevent accidental privacy breaches

In addition to protecting your data while you are using them, one step in the research process where informational risk is particularly salient is data sharing with other researchers. Data sharing among scientists is a core value of the scientific endeavor, and it greatly facilitates the advancement of knowledge. Here’s how the UK House of Commons described the importance of data sharing (Molloy 2011):

“Access to data is fundamental if researchers are to reproduce, verify and build on results that are reported in the literature. The presumption must be that, unless there is a strong reason otherwise, data should be fully disclosed and made publicly available.”

Yet, by sharing your data with another researcher, you may be increasing informational risk to your participants. Thus, it may seem that data sharing creates a fundamental tension between the obligation to share data with other scientists and the obligation to minimize informational risk to participants. Fortunately, this dilemma is not as severe as it appears. Rather, it is better to think about data sharing as falling along a continuum, with each point on that continuum providing a different mix of benefits to society and risk to participants (figure 6.6).

At one extreme, you can share your data with no one, which minimizes risk to participants but also minimizes gains to society. At the other extreme, you can release and forget, where data are “anonymized” and posted for everyone. Relative to not releasing data, release and forget offers both higher benefits to society and higher risk to participants. In between these two extreme cases is a range of hybrids, including what I’ll call a walled garden approach. Under this approach, data are shared with people who meet certain criteria and who agree to be bound by certain rules (e.g., oversight from an IRB and a data protection plan). The walled garden approach provides many of the benefits of release and forget with less risk. Of course, such an approach creates many questions (who should have access, under what conditions, and for how long; who should pay to maintain and police the walled garden; and so on), but these are not insurmountable. In fact, there are already working walled gardens in place that researchers can use right now, such as the data archive of the Inter-university Consortium for Political and Social Research at the University of Michigan.

Figure 6.6: Data release strategies can fall along a continuum. Where you should be on this continuum depends on the specific details of your data, and third-party review may help you decide the appropriate balance of risk and benefit in your case. The exact shape of this curve depends on the specifics of the data and research goals (Goroff 2015).

So, where should the data from your study be on the continuum of no sharing, walled garden, and release and forget? This depends on the details of your data: researchers must balance Respect for Persons, Beneficence, Justice, and Respect for Law and Public Interest. Viewed from this perspective, data sharing is not a distinctive ethical conundrum; it is just one of the many aspects of research in which researchers have to find an appropriate ethical balance.

Some critics are generally opposed to data sharing because, in my opinion, they are focused on its risks—which are undoubtedly real—and are ignoring its benefits. So, in order to encourage focus on both risks and benefits, I’d like to offer an analogy. Every year, cars are responsible for thousands of deaths, but we do not attempt to ban driving. In fact, a call to ban driving would be absurd because driving enables many wonderful things. Rather, society places restrictions on who can drive (e.g., the need to be a certain age and to have passed certain tests) and how they can drive (e.g., under the speed limit). Society also has people tasked with enforcing these rules (e.g., police), and we punish people who are caught violating them. This same kind of balanced thinking that society applies to regulating driving can also be applied to data sharing. That is, rather than making absolutist arguments for or against data sharing, I think we will make the most progress by focusing on how we can decrease the risks and increase the benefits from data sharing.

To conclude, informational risk has increased dramatically, and it is very hard to predict and quantify. Therefore, it is best to assume that all data are potentially identifiable and potentially sensitive. To decrease informational risk while doing research, researchers can create and follow a data protection plan. Further, informational risk does not prevent researchers from sharing data with other scientists.