4.6.2 Replace, Refine, and Reduce

Make your experiment more humane by replacing experiments with non-experimental studies, refining the treatments, and reducing the number of participants.

The second piece of advice that I’d like to offer about designing digital experiments concerns ethics. As the Restivo and van de Rijt experiment on barnstars in Wikipedia shows, decreased cost means that ethics will become an increasingly important part of research design. In addition to the ethical frameworks guiding human subjects research that I’ll describe in Chapter 6, researchers designing digital experiments can also draw on ethical ideas from a different source: the ethical principles developed to guide experiments involving animals. In particular, in their landmark book Principles of Humane Experimental Technique, Russell and Burch (1959) proposed three principles that should guide animal research: Replace, Refine, and Reduce. I’d like to propose that these three R’s can also be used, in a slightly modified form, to guide the design of human experiments. In particular,

Replace: Replace experiments with less invasive methods if possible
Refine: Refine the treatment to make it as harmless as possible
Reduce: Reduce the number of participants in your experiment as much as possible

In order to make these three R’s concrete and show how they can potentially lead to better and more humane experimental design, I’ll describe an online field experiment that generated ethical debate. Then I’ll describe how the three R’s suggest concrete and practical changes to the design of the experiment.

One of the most ethically debated digital field experiments is “Emotional Contagion,” which was conducted by Adam Kramer, Jamie Guillory, and Jeffrey Hancock (2014). The experiment took place on Facebook and was motivated by a mix of scientific and practical questions. At the time, the dominant way that users interacted with Facebook was the News Feed, an algorithmically curated set of Facebook status updates from a user’s Facebook friends. Some critics of Facebook had suggested that because the News Feed has mostly positive posts (friends showing off their latest party, for example), it could cause users to feel sad because their lives seem less exciting in comparison. On the other hand, maybe the effect is exactly the opposite; maybe seeing your friend having a good time would make you feel happy? In order to address these competing hypotheses, and to advance our understanding of how a person’s emotions are impacted by her friends’ emotions, Kramer and colleagues ran an experiment. The researchers placed about 700,000 users into four groups for one week: a “negativity reduced” group, for whom posts with negative words (e.g., sad) were randomly blocked from appearing in the News Feed; a “positivity reduced” group, for whom posts with positive words (e.g., happy) were randomly blocked; and two control groups. In the control group for the “negativity reduced” group, posts were randomly blocked at the same rate as in the “negativity reduced” group but without regard to their emotional content. The control group for the “positivity reduced” group was constructed in a parallel fashion. The design of this experiment illustrates that the appropriate control group is not always one with no changes. Rather, sometimes the control group receives a treatment in order to create the precise comparison that a research question requires. In all cases, the posts that were blocked from the News Feed were still available to users through other parts of the Facebook website.

Kramer and colleagues found that for participants in the positivity reduced condition, the percentage of positive words in their status updates decreased and the percentage of negative words increased. On the other hand, for participants in the negativity reduced condition, the percentage of positive words increased and the percentage of negative words decreased (Figure 4.23). However, these effects were quite small: the difference in positive and negative words between treatments and controls was about 1 in 1,000 words.

Figure 4.23: Evidence of emotional contagion (Kramer, Guillory, and Hancock 2014). Percentage of positive words and negative words by experimental condition. Bars represent estimated standard errors.

I’ve put a discussion of the scientific aspects of this experiment in the further reading section at the end of the chapter, but unfortunately, this experiment is best known for generating ethical debate. Just days after this paper was published in Proceedings of the National Academy of Sciences, there was an enormous outcry from both researchers and the press. Outrage around the paper focused on two main points: 1) participants did not provide any consent beyond the standard Facebook terms-of-service for a treatment that some thought might cause harm to participants and 2) the study had not undergone third-party ethical review (Grimmelmann 2015). The ethical questions raised in this debate caused the journal to quickly publish a rare “editorial expression of concern” about the ethics and ethical review process for the research (Verma 2014). In subsequent years, the experiment has continued to be a source of intense debate and disagreement, and this disagreement may have had the unintended effect of driving into the shadows many other experiments that are being performed by companies (Meyer 2014).

Given that background about Emotional Contagion, I would now like to show that the 3 R’s can suggest concrete, practical improvements for real studies (whatever you might personally think about the ethics of this particular experiment). The first R is Replace: researchers should seek to replace experiments with less invasive and risky techniques, if possible. For example, rather than running an experiment, the researchers could have exploited a natural experiment. As described in Chapter 2, natural experiments are situations where something happens in the world that approximates the random assignment of treatments (e.g., a lottery to decide who will be drafted into the military). The advantage of a natural experiment is that the researcher does not have to deliver treatments; the environment does that for you. In other words, with a natural experiment, researchers would not have needed to experimentally manipulate people’s News Feeds.

In fact, almost concurrently with the Emotional Contagion experiment, Coviello et al. (2014) were exploiting what could be called an Emotional Contagion natural experiment. Their approach, which uses a technique called instrumental variables, is a bit complicated if you’ve never seen it before. So, in order to explain why it was needed, let’s build up to it. The first idea that some researchers might have to study emotional contagion would be to compare your posts on days when your News Feed was very positive to your posts on days when your News Feed was very negative. This approach would be fine if the goal was just to predict the emotional content of your posts, but it is problematic if the goal is to study the causal effect of your News Feed on your posts. To see the problem with this design, consider Thanksgiving. In the US, positive posts spike and negative posts plummet on Thanksgiving. Thus, on Thanksgiving, researchers could see that your News Feed was very positive and that you posted positive things as well. But your positive posts could have been caused by Thanksgiving, not by the content of your News Feed. Instead, in order to estimate the causal effect, researchers need something that changes the content of your News Feed without directly changing your emotions. Fortunately, there is something like that happening all the time: the weather.

Coviello and colleagues found that a rainy day in someone’s city will, on average, decrease the proportion of posts that are positive by about 1 percentage point and increase the proportion of posts that are negative by about 1 percentage point. Then, Coviello and colleagues exploited this fact to study emotional contagion without the need to experimentally manipulate anyone’s News Feed. In essence, what they did was measure how your posts were impacted by the weather in the cities where your friends live. To see why this makes sense, imagine that you live in New York City and you have a friend who lives in Seattle. Now imagine that one day it starts raining in Seattle. This rain in Seattle will not directly affect your mood, but it will cause your News Feed to be less positive and more negative because of your friend’s posts. Thus, the rain in Seattle randomly manipulates your News Feed. Turning this intuition into a reliable statistical procedure is complicated (and the exact approach used by Coviello and colleagues is a bit non-standard), so I’ve put a more detailed discussion in the further reading section. The most important thing to remember about Coviello and colleagues’ approach is that it enabled them to study emotional contagion without the need to run an experiment that could potentially harm participants, and it may be the case that in many other settings you can replace experiments with other techniques.
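To make the intuition concrete, here is a minimal simulation sketch of the textbook instrumental-variables idea. It is not the procedure used by Coviello and colleagues, which the text above notes is non-standard. Every number and name in it (the function simulate_feed_effect, the 0.3 true effect, the 30% chance of rain, the size of the shared shock) is a hypothetical assumption chosen for illustration. A shared shock such as Thanksgiving raises both the positivity of your News Feed and your own positivity, so the naive slope overstates the effect; rain in a friend's city moves only the feed, so the ratio of covariances recovers something close to the true effect.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate_feed_effect(n=100_000, true_effect=0.3, seed=0):
    """Return (naive estimate, IV estimate) of the feed's effect on own posts.

    All data are simulated; nothing here comes from Facebook.
    """
    rng = random.Random(seed)
    rain, feed, own = [], [], []
    for _ in range(n):
        # Shared shock (think Thanksgiving) that lifts the feed AND your own mood.
        shock = rng.gauss(0, 1)
        # Instrument: rain in a friend's city. Assumed unrelated to the shock and
        # assumed to affect your posts only through your News Feed.
        r = 1.0 if rng.random() < 0.3 else 0.0
        # Feed positivity falls when friends get rain and rises with the shock.
        f = -1.0 * r + shock + rng.gauss(0, 1)
        # Own positivity: causal effect of the feed plus the confounding shock.
        o = true_effect * f + shock + rng.gauss(0, 1)
        rain.append(r)
        feed.append(f)
        own.append(o)

    # Naive estimate: slope from regressing own posts on the feed (confounded).
    naive = covariance(feed, own) / covariance(feed, feed)
    # IV estimate: effect of rain on own posts divided by effect of rain on feed.
    iv = covariance(rain, own) / covariance(rain, feed)
    return naive, iv


if __name__ == "__main__":
    naive, iv = simulate_feed_effect()
    print(f"true effect: 0.30  naive: {naive:.2f}  IV: {iv:.2f}")
```

The sketch works only because the simulated rain satisfies both assumptions written in the comments. With real data those assumptions have to be argued for, which is part of why evidence from natural experiments is not always as clean as evidence from randomized ones.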

The second R is Refine: researchers should seek to refine their treatments in order to cause the smallest harm possible. For example, rather than blocking content that was either positive or negative, the researchers could have boosted content that was positive or negative. This boosting design would have changed the emotional content of participants’ News Feeds, but it would have addressed one of the concerns that critics expressed: that the experiment could have caused participants to miss important information in their News Feed. With the design used by Kramer and colleagues, a message that is important is as likely to be blocked as one that is not. However, with a boosting design, the messages that would be displaced would be those that are less important.

Finally, the third R is Reduce: researchers should seek to reduce the number of participants in their experiment, if possible. In the past, this reduction happened naturally because the variable cost of analog experiments was high, which encouraged researchers to optimize their design and analysis. However, when the variable cost of data is zero, researchers don’t face a cost constraint on the size of their experiment, and this has the potential to lead to unnecessarily large experiments.

For example, Kramer and colleagues could have used pre-treatment information about their participants, such as pre-treatment posting behavior, to make their analysis more efficient. More specifically, rather than comparing the proportion of positive words in the treatment and control conditions, Kramer and colleagues could have compared the change in the proportion of positive words between conditions, an approach that is often called difference-in-differences and that is closely related to the mixed design that I described earlier in the chapter (Figure 4.5). That is, for each participant, the researchers could have created a change score (post-treatment behavior - pre-treatment behavior) and then compared the change scores of participants in the treatment and control conditions. This difference-in-differences approach is more efficient statistically, which means that researchers can achieve the same statistical confidence using much smaller samples. In other words, by not treating participants like “widgets”, researchers can often get more precise estimates.
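The efficiency gain from change scores can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not a reanalysis of Emotional Contagion: the function names, the 0.1 treatment effect, and the variances are all hypothetical. Each simulated person has a stable baseline tendency that shows up in both pre-treatment and post-treatment behavior, and subtracting the pre-treatment measurement removes that person-level variation. With these particular numbers, the change-score estimator has 40% of the variance of the post-only comparison in theory (4/n versus 10/n for n participants per arm). The gain is not automatic, because change scores help only when pre-treatment and post-treatment behavior are strongly correlated (a correlation above 0.5 when the two have equal variance).

```python
import random
import statistics


def simulate_once(rng, n_per_arm, effect):
    """Return (post-only estimate, change-score estimate) for one experiment."""
    post_diff = 0.0
    change_diff = 0.0
    # Treated arm enters with sign +1, control arm with sign -1.
    for sign, arm_effect in ((1, effect), (-1, 0.0)):
        for _ in range(n_per_arm):
            baseline = rng.gauss(5.0, 2.0)  # stable person-level tendency
            pre = baseline + rng.gauss(0, 1)  # pre-treatment behavior
            post = baseline + arm_effect + rng.gauss(0, 1)  # post-treatment
            post_diff += sign * post
            change_diff += sign * (post - pre)
    # Dividing by n_per_arm turns the signed sums into differences in means.
    return post_diff / n_per_arm, change_diff / n_per_arm


def compare_estimators(n_per_arm=1_000, effect=0.1, n_sims=200, seed=0):
    """Repeat the simulated experiment and summarize both estimators."""
    rng = random.Random(seed)
    results = [simulate_once(rng, n_per_arm, effect) for _ in range(n_sims)]
    post_only = [r[0] for r in results]
    change = [r[1] for r in results]
    return {
        "post_only_mean": statistics.fmean(post_only),
        "post_only_var": statistics.variance(post_only),
        "change_score_mean": statistics.fmean(change),
        "change_score_var": statistics.variance(change),
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.5f}")
```

Both estimators are centered on the same treatment effect; what differs is how much they vary from one simulated experiment to the next.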

Without having the raw data it is difficult to know exactly how much more efficient a difference-in-differences approach would have been in this case. But Deng et al. (2013) reported that in three online experiments on the Bing search engine they were able to reduce the variance of their estimates by about 50%, and similar results have been reported for some online experiments at Netflix (Xie and Aurisset 2016). This 50% variance reduction means that the Emotional Contagion researchers might have been able to cut their sample in half if they had used a slightly different analysis method. In other words, with a tiny change in the analysis, 350,000 people might have been spared participation in the experiment.
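The step from a 50% variance reduction to a halved sample can be written out for a stylized case. The notation below is an assumption introduced for illustration, not taken from the studies cited: a two-arm experiment with n participants per arm, outcome variance \sigma^2, and an adjusted analysis that removes a fraction r of that variance. The actual experiment had four groups, so this is only a rough guide.

```latex
% Variance of a simple difference in means with n participants per arm:
\operatorname{Var}(\hat{\tau}) = \frac{2\sigma^2}{n}

% With an adjusted outcome of variance (1 - r)\sigma^2, matching that
% precision with n' participants per arm requires:
\frac{2(1 - r)\sigma^2}{n'} = \frac{2\sigma^2}{n}
\quad\Longrightarrow\quad
n' = (1 - r)\, n

% With r = 0.5 and about 700{,}000 participants in total:
(1 - 0.5) \times 700{,}000 = 350{,}000
```

Because the required sample size scales linearly with the variance of the outcome, any fractional reduction in variance translates into the same fractional reduction in the number of participants needed for a given level of precision.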

At this point you might be wondering why researchers should care if 350,000 people were in Emotional Contagion unnecessarily. There are two particular features of Emotional Contagion that make concern with excessive size appropriate, and these features are shared by many digital field experiments: 1) there is uncertainty about whether the experiment will cause harm to at least some participants and 2) participation was not voluntary. In experiments with these two characteristics it seems advisable to keep the experiments as small as possible.

In conclusion, the three R’s (Replace, Refine, and Reduce) provide principles that can help researchers build ethics into their experimental designs. Of course, each of these possible changes to Emotional Contagion introduces trade-offs. For example, evidence from natural experiments is not always as clean as evidence from randomized experiments, and boosting might have been more logistically difficult to implement than blocking. So, the purpose of suggesting these changes was not to second-guess the decisions of other researchers. Rather, it was to illustrate how the three R’s could be applied in a realistic situation.