6.6.2 Understanding and managing informational risk

You are reading the Open Review Edition of Bit by Bit.

Information risk is the most common risk in social research; it has increased dramatically; and it is the hardest risk to understand.

The second ethical challenge for digital age social research is informational risk, the potential for harm from the disclosure of information (Council 2014). Informational harms from the disclosure of personal information could be economic (e.g., losing a job), social (e.g., embarrassment), psychological (e.g., depression), or even criminal (e.g., arrest for illegal behavior). Unfortunately, the digital age increases informational risk dramatically because there is just so much more information about our behavior. And informational risk has proven very difficult to understand and manage compared to risks that were concerns in analog age social research, such as physical risk. To see how the digital age increases informational risk, consider the transition from paper to electronic medical records. Both types of records create risk, but the electronic records create much greater risks because at a massive scale they can be transmitted to an unauthorized party or merged with other records. Social researchers in the digital age have already run into trouble with informational risk, in part because they didn’t fully understand how to quantify and manage it. So, I’m going to offer a helpful way to think about informational risk, and then I’m going to give you some advice for how to manage the informational risk in your research and in releasing data to other researchers.

One way that social researchers decrease informational risk is “anonymization” of data. “Anonymization” is the process of removing obvious personal identifiers such as name, address, and telephone number from the data. However, this approach is much less effective than many people realize, and it is, in fact, deeply and fundamentally limited. For that reason, whenever I describe “anonymization,” I’ll use quotation marks to remind you that this process creates the appearance of anonymity but not true anonymity.

A vivid example of the failure of “anonymization” comes from the late 1990s in Massachusetts (Sweeney 2002). The Group Insurance Commission (GIC) was a government agency responsible for purchasing health insurance for all state employees. Through this work, the GIC collected detailed health records about thousands of state employees. In an effort to spur research about ways to improve health, GIC decided to release these records to researchers. However, they did not share all of their data; rather, they “anonymized” it by removing information such as name and address. However, they left other information that they thought could be useful for researchers such as demographic information (zip code, birth date, ethnicity, and sex) and medical information (visit data, diagnosis, procedure) (Figure 6.4) (Ohm 2010). Unfortunately, this “anonymization” was not sufficient to protect the data.

Figure 6.4: “Anonymization” is the process of removing obviously identifying information. For example, when releasing the medical insurance records of state employees the Massachusetts Group Insurance Commission (GIC) removed name and address from the files. I use quotes around the word “anonymization” because the process provides the appearance of anonymity, but not actual anonymity.

To illustrate the shortcomings of the GIC “anonymization”, Latanya Sweeney—then a graduate student at MIT—paid $20 to acquire the voting records from the city of Cambridge, the hometown of Massachusetts governor William Weld. These voting records included information such as name, address, zip code, birth date, and gender. The fact that the medical data file and the voter file shared fields—zip code, birth date, and sex—meant that Sweeney could link them. Sweeney knew that Weld’s birthday was July 31, 1945, and the voting records included only six people in Cambridge with that birthday. Further, of those six people, only three were male. And, of those three men, only one shared Weld’s zip code. Thus, the voting data showed that anyone in the medical data with Weld’s combination of birth date, gender, and zip code was William Weld. In essence, these three pieces of information provided a unique fingerprint to him in the data. Using this fact, Sweeney was able to locate Weld’s medical records, and to inform him of her feat, she mailed him a copy of his records (Ohm 2010).

Figure 6.5: Re-identification of “anonymized” data. Latanya Sweeney combined the “anonymized” health records with voting records in order to find the medical records of Governor William Weld (Sweeney 2002).
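
The linkage that Sweeney performed can be sketched in a few lines of code. The following Python example is a minimal illustration under stated assumptions, not a reconstruction of her analysis: every record, name, zip code, and diagnosis is made up, and the field names (`zip`, `birth_date`, `sex`) and the helper `link` are hypothetical. It joins an “anonymized” medical file to a voter file on the fields they share and keeps only the matches that point to exactly one voter.

```python
from datetime import date

# Hypothetical "anonymized" medical records: names removed, other fields kept.
medical = [
    {"zip": "00001", "birth_date": date(1945, 7, 31), "sex": "M", "diagnosis": "diagnosis A"},
    {"zip": "00002", "birth_date": date(1945, 7, 31), "sex": "F", "diagnosis": "diagnosis B"},
    {"zip": "00001", "birth_date": date(1962, 3, 14), "sex": "M", "diagnosis": "diagnosis C"},
]

# Hypothetical voter file: names present, no medical information.
voters = [
    {"name": "Voter One", "zip": "00001", "birth_date": date(1945, 7, 31), "sex": "M"},
    {"name": "Voter Two", "zip": "00002", "birth_date": date(1945, 7, 31), "sex": "F"},
    {"name": "Voter Three", "zip": "00002", "birth_date": date(1945, 7, 31), "sex": "M"},
    {"name": "Voter Four", "zip": "00001", "birth_date": date(1962, 3, 14), "sex": "M"},
    {"name": "Voter Five", "zip": "00001", "birth_date": date(1962, 3, 14), "sex": "M"},
]

# Fields that appear in both files (an assumption of this sketch).
SHARED_FIELDS = ("zip", "birth_date", "sex")


def key(record):
    """Build the combination of shared fields used to match records."""
    return tuple(record[field] for field in SHARED_FIELDS)


def link(medical_records, voter_records):
    """Return (name, diagnosis) pairs where the shared fields match exactly one voter."""
    names_by_key = {}
    for voter in voter_records:
        names_by_key.setdefault(key(voter), []).append(voter["name"])

    linked = []
    for record in medical_records:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # the combination is a unique fingerprint
            linked.append((names[0], record["diagnosis"]))
    return linked


reidentified = link(medical, voters)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

In this toy data, two of the three medical records are tied to a single name. The third is shared by two voters and so remains ambiguous, which is why removing or coarsening shared fields reduces, but does not eliminate, the risk.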

Sweeney’s work illustrates the basic structure of de-anonymization attacks—to adopt a term from the computer security community. In these attacks, two data sets, neither of which by itself reveals sensitive information, are linked, and through this linkage, sensitive information is exposed. In some ways this process is similar to the way that baking soda and vinegar, two substances that are by themselves safe, can be combined to produce a nasty outcome.

In response to Sweeney’s work, and other related work, researchers now generally remove much more information, all so-called “Personally Identifying Information” (PII) (Narayanan and Shmatikov 2010), during the process of “anonymization.” Further, many researchers now realize that certain data, such as medical records, financial records, and answers to survey questions about illegal behavior, is probably too sensitive to release even after “anonymization.” However, more recent examples that I’ll describe below indicate that social researchers need to change their thinking. As a first step, it is wise to assume that all data is potentially identifiable and all data is potentially sensitive. In other words, rather than thinking that informational risk applies to a small subset of projects, we should assume that it applies, to some degree, to all projects.
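
One rough way to act on this assumption is to count how many records in a dataset are unique on a given combination of fields. The sketch below uses invented records and a hypothetical helper, `unique_fraction`. It is a simple diagnostic rather than a guarantee of safety, because it says nothing about what outside data someone else might hold.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[f] for f in fields)] == 1)
    return unique / len(records)


# Invented records; none of the values come from a real dataset.
records = [
    {"zip": "11111", "birth_year": 1980, "sex": "F"},
    {"zip": "11111", "birth_year": 1980, "sex": "M"},
    {"zip": "11111", "birth_year": 1975, "sex": "F"},
    {"zip": "22222", "birth_year": 1980, "sex": "F"},
]

# Each added field makes more records unique.
for fields in (["sex"], ["zip", "birth_year"], ["zip", "birth_year", "sex"]):
    print(fields, unique_fraction(records, fields))
```

Even in this tiny example, fields that look harmless one at a time single out every record once they are combined.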

Both aspects of this re-orientation are illustrated by the Netflix Prize. As described in Chapter 5, Netflix released 100 million movie ratings provided by almost 500,000 members, and had an open call where people from all over the world submitted algorithms that could improve Netflix’s ability to recommend movies. Before releasing the data, Netflix removed any obviously personally-identifying information, such as names. Netflix also went an extra step and introduced slight perturbations in some of the records (e.g., changing some ratings from 4 stars to 3 stars). Netflix soon discovered, however, that despite their efforts, the data were by no means anonymous.

Just two weeks after the data were released, Narayanan and Shmatikov (2008) showed that it was possible to learn about specific people’s movie preferences. The trick to their re-identification attack was similar to Sweeney’s: merge together two information sources, one with potentially sensitive information and no obviously identifying information and one that contains the identity of people. Each of these data sources may be individually safe, but when they are combined the merged dataset can create informational risk. In the case of the Netflix data, here’s how it could happen. Imagine that I choose to share my thoughts about action and comedy movies with my co-workers, but that I prefer not to share my opinion about religious and political movies. My co-workers could use the information that I’ve shared with them to find my records in the Netflix data; the information that I share could be a unique fingerprint just like William Weld’s birth date, zip code, and sex. Then, if they find my unique fingerprint in the data, they could learn my ratings about all movies, including movies that I chose not to discuss. In addition to this kind of targeted attack focused on a single person, Narayanan and Shmatikov (2008) also showed that it was possible to do a broad attack, one involving many people, by merging the Netflix data with personal and movie rating data that some people have chosen to post on the Internet Movie Database (IMDb). Any information that is a unique fingerprint to a specific person, even their set of movie ratings, can be used to identify them.
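
To make the targeted attack concrete, here is a deliberately simplified sketch. It is not the algorithm of Narayanan and Shmatikov (2008); it only illustrates the idea of a fingerprint built from a few known ratings. The user ids, movie titles, ratings, and the one-star `tolerance` (meant to absorb the kind of slight perturbation described above) are all assumptions made for illustration.

```python
def matches(known, ratings, tolerance=1):
    """True if every known rating appears in `ratings` within `tolerance` stars."""
    return all(
        movie in ratings and abs(ratings[movie] - stars) <= tolerance
        for movie, stars in known.items()
    )


def candidates(known, released):
    """Ids of released records that are consistent with the known ratings."""
    return [uid for uid, ratings in released.items() if matches(known, ratings)]


# Invented "anonymized" release: ids instead of names.
released = {
    "user_001": {"Action A": 5, "Comedy B": 3, "Action C": 4, "Political D": 1},
    "user_002": {"Action A": 5, "Comedy B": 1, "Action C": 4},
    "user_003": {"Action A": 4, "Comedy B": 4, "Religious E": 5},
}

# What a co-worker might know from casual conversation (also invented).
known = {"Action A": 5, "Comedy B": 4, "Action C": 4}

print(candidates({"Action A": 5}, released))  # one rating: still ambiguous
print(candidates(known, released))            # three ratings: a single match

# Once the match is unique, the ratings that were never shared are exposed.
match = candidates(known, released)
if len(match) == 1:
    revealed = {m: s for m, s in released[match[0]].items() if m not in known}
    print(revealed)
```

A single known rating is consistent with all three records, but three known ratings isolate one of them, at which point the rating for the political movie is no longer private.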

Even though the Netflix data can be re-identified in either a targeted or broad attack, it still might appear to be low risk. After all, movie ratings don’t seem very sensitive. While that might be true in general, for some of the 500,000 people in the dataset, movie ratings might be quite sensitive. In fact, in response to the de-anonymization a closeted lesbian woman joined a class-action suit against Netflix. Here’s how the problem was expressed in their lawsuit (Singel 2009):

“[M]ovie and rating data contains information of a more highly personal and sensitive nature [sic]. The member’s movie data exposes a Netflix member’s personal interest and/or struggles with various highly personal issues, including sexuality, mental illness, recovery from alcoholism, and victimization from incest, physical abuse, domestic violence, adultery, and rape.”

The de-anonymization of the Netflix Prize data illustrates both that all data is potentially identifiable and that all data is potentially sensitive. At this point, you might think that this only applies to data that purports to be about people. Surprisingly, that is not the case. In response to a Freedom of Information Law request, the New York City Government released records of every taxi ride in New York in 2013, including the pickup and drop-off times, locations, and fare amounts (recall from Chapter 2 that Farber (2015) used this data to test important theories in labor economics). Although this data about taxi trips might seem benign because it does not seem to be information about people, Anthony Tockar realized that this taxi dataset actually contained lots of potentially sensitive information about people. To illustrate, he looked at all trips starting at The Hustler Club, a large strip club in New York, between midnight and 6am and then found their drop-off locations. This search revealed, in essence, a list of addresses of some people who frequent The Hustler Club (Tockar 2014). It is hard to imagine that the city government had this in mind when it released the data. In fact, this same technique could be used to find the home addresses of people who visit any place in the city: a medical clinic, a government building, or a religious institution.
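
The kind of search that Tockar describes amounts to a filter followed by a count. The sketch below assumes a made-up schema (`pickup_time`, plus `pickup` and `dropoff` as latitude and longitude pairs), made-up coordinates, and a hypothetical venue location; the layout of the real taxi data is different, and the helpers `near` and `late_night_dropoffs` are illustrative only.

```python
from collections import Counter
from datetime import datetime

# Hypothetical location of some venue of interest (not a real address).
VENUE = (40.0, -74.0)

# Invented trips; a real release would contain millions of rows.
trips = [
    {"pickup_time": datetime(2013, 1, 5, 1, 30), "pickup": (40.0001, -74.0002), "dropoff": (40.05, -73.95)},
    {"pickup_time": datetime(2013, 1, 12, 2, 10), "pickup": (40.0, -74.0), "dropoff": (40.05, -73.95)},
    {"pickup_time": datetime(2013, 1, 5, 14, 0), "pickup": (40.0, -74.0), "dropoff": (40.10, -73.90)},
    {"pickup_time": datetime(2013, 1, 6, 3, 0), "pickup": (40.2, -74.3), "dropoff": (40.05, -73.95)},
    {"pickup_time": datetime(2013, 1, 7, 5, 59), "pickup": (40.0, -74.0), "dropoff": (40.07, -73.97)},
]


def near(point, target, eps=0.0005):
    """True if `point` lies within a small box of half-width `eps` degrees around `target`."""
    return abs(point[0] - target[0]) <= eps and abs(point[1] - target[1]) <= eps


def late_night_dropoffs(trips, venue, start_hour=0, end_hour=6):
    """Count drop-off locations for trips that start near `venue` between the given hours."""
    return Counter(
        trip["dropoff"]
        for trip in trips
        if near(trip["pickup"], venue)
        and start_hour <= trip["pickup_time"].hour < end_hour
    )


dropoffs = late_night_dropoffs(trips, VENUE)
for location, count in dropoffs.most_common():
    print(location, count)
```

A drop-off location that recurs across several nights is a strong hint about where a regular visitor lives, even though no name appears anywhere in the data.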

These two cases, the Netflix Prize and the New York City taxi data, show that relatively skilled people failed to correctly estimate the informational risk in the data that they released, and these cases are by no means unique (Barbaro and Zeller Jr 2006; Zimmer 2010; Narayanan, Huey, and Felten 2016). Further, in many of these cases, the problematic data is still freely available online, indicating the difficulty of ever undoing a data release. Collectively, these examples, as well as research in computer science about privacy, lead to an important conclusion. Researchers should assume that all data is potentially identifiable and all data is potentially sensitive.

Unfortunately, there is no simple solution to the fact that all data is potentially identifiable and all data is potentially sensitive. However, one way to reduce informational risk while you are working with data is to create and follow a data protection plan. This plan will decrease the chance that your data will leak and will decrease the harm if a leak somehow occurs. The specifics of data protection plans, such as which form of encryption to use, will change over time, but the UK Data Service helpfully organizes the elements of a data protection plan into five categories that they call the 5 safes: safe projects, safe people, safe settings, safe data, and safe outputs (Table 6.2) (Desai, Ritchie, and Welpton 2016). None of the five safes individually provides perfect protection. But together they form a powerful set of factors that can decrease informational risk.

Table 6.2: The 5 safes are principles for designing and executing a data protection plan (Desai, Ritchie, and Welpton 2016).
Safe	Action
Safe projects	limits projects with data to those that are ethical
Safe people	access is restricted to people who can be trusted with data (e.g., people have undergone ethical training)
Safe data	data is de-identified and aggregated to the extent possible
Safe settings	data is stored in computers with appropriate physical (e.g., locked room) and software (e.g., password protection, encrypted) protections
Safe outputs	research output is reviewed to prevent accidental privacy breaches

In addition to protecting your data while you are using it, one step in the research process where informational risk is particularly salient is data sharing with other researchers. Data sharing among scientists is a core value of the scientific endeavor, and it greatly facilitates the advancement of knowledge. Here’s how the UK House of Commons described the importance of data sharing:

“Access to data is fundamental if researchers are to reproduce, verify and build on results that are reported in the literature. The presumption must be that, unless there is a strong reason otherwise, data should be fully disclosed and made publicly available. In line with this principle, where possible, data associated with all publicly funded research should be made widely and freely available.” (Molloy 2011)

Yet, by sharing your data with another researcher, you may be increasing informational risk to your participants. Thus, it may seem that researchers who wish to share their data—or are required to share their data—are facing a fundamental tension. On the one hand they have an ethical obligation to share their data with other scientists, especially if the original research is publicly funded. Yet, at the same time, researchers have an ethical obligation to minimize, as much as possible, the information risk to their participants.

Fortunately, this dilemma is not as severe as it appears. It is important to think of data sharing along a continuum from no data sharing to release and forget, where data is “anonymized” and posted for anyone to access (Figure 6.6). Both of these extreme positions have risks and benefits. That is, it is not automatically the most ethical thing to not share your data; such an approach eliminates many potential benefits to society. Returning to Taste, Ties, and Time, an example discussed earlier in the chapter, arguments against data release that focus only on possible harms and that ignore possible benefits are overly one-sided; I’ll describe the problems with this one-sided, overly protective approach in more detail below when I offer advice about making decisions in the face of uncertainty (Section 6.6.4).

Figure 6.6: Data release strategies can fall along a continuum. Where you should be along this continuum depends on the specific details of your data. In this case, third party review may help you decide the appropriate balance of risk and benefit in your case.

Further, in between these two extreme cases is what I’ll call a walled garden approach, where data is shared with people who meet certain criteria and who agree to be bound by certain rules (e.g., oversight from an IRB and a data protection plan). This walled garden approach provides many of the benefits of release and forget with less risk. Of course, a walled garden approach creates many questions (who should have access, under what conditions, for how long, who should pay to maintain and police the walled garden, etc.), but these are not insurmountable. In fact, there are already working walled gardens in place that researchers can use right now, such as the data archive of the Inter-university Consortium for Political and Social Research at the University of Michigan.

So, where should the data from your study be on the continuum of no sharing, walled garden, and release and forget? It depends on the details of your data; researchers must balance Respect for Persons, Beneficence, Justice, and Respect for Law and Public Interest. When assessing the appropriate balance for other decisions, researchers seek the advice and approval of IRBs, and data release can be just another part of that process. In other words, although some people think of data release as a hopeless ethical morass, we already have systems in place to help researchers balance these kinds of ethical dilemmas.

One final way to think about data sharing is by analogy. Every year cars are responsible for thousands of deaths, but we do not attempt to ban driving. In fact, such a call to ban driving would be absurd because driving enables many wonderful things. Rather, society places restrictions on who can drive (e.g., need to be a certain age, need to have passed certain tests) and how they can drive (e.g., under the speed limit). Society also has people tasked with enforcing these rules (e.g., police), and we punish people who are caught violating them. This same kind of balanced thinking that society applies to regulating driving can also be applied to data sharing. That is, rather than making absolutist arguments for or against data sharing, I think the biggest benefits will come from figuring out how we can share more data more safely.

To conclude, informational risk has increased dramatically, and it is very hard to predict and quantify. Therefore, it is best to assume that all data is potentially identifiable and potentially sensitive. To decrease informational risk while doing research, researchers can create and follow a data protection plan. Further, informational risk does not prevent researchers from sharing data with other scientists.