4.6.2 Build ethics into your design: replace, refine, and reduce

Make your experiment more humane by replacing experiments with non-experimental studies, refining the treatments, and reducing the number of participants.

The second piece of advice that I’d like to offer about designing digital experiments concerns ethics. As the Restivo and van de Rijt experiment on barnstars in Wikipedia shows, decreased cost means that ethics will become an increasingly important part of research design. In addition to the ethical frameworks guiding human subjects research that I’ll describe in chapter 6, researchers designing digital experiments can also draw on ethical ideas from a different source: the ethical principles developed to guide experiments involving animals. In particular, in their landmark book Principles of Humane Experimental Technique, Russell and Burch (1959) proposed three principles that should guide animal research: replace, refine, and reduce. I’d like to propose that these three R’s can also be used—in a slightly modified form—to guide the design of human experiments. In particular,

  • Replace: Replace experiments with less invasive methods if possible.
  • Refine: Refine the treatment to make it as harmless as possible.
  • Reduce: Reduce the number of participants in your experiment as much as possible.

In order to make these three R’s concrete and show how they can potentially lead to better and more humane experimental design, I’ll describe an online field experiment that generated ethical debate. Then, I’ll describe how the three R’s suggest concrete and practical changes to the design of the experiment.

One of the most ethically debated digital field experiments was conducted by Adam Kramer, Jamie Guillory, and Jeffrey Hancock (2014) and has come to be called “Emotional Contagion.” The experiment took place on Facebook and was motivated by a mix of scientific and practical questions. At the time, the dominant way that users interacted with Facebook was the News Feed, an algorithmically curated set of Facebook status updates from a user’s Facebook friends. Some critics of Facebook had suggested that because the News Feed has mostly positive posts—friends showing off their latest party—it could cause users to feel sad because their lives seemed less exciting in comparison. On the other hand, maybe the effect is exactly the opposite: maybe seeing your friends having a good time would make you feel happy. In order to address these competing hypotheses—and to advance our understanding of how a person’s emotions are impacted by her friends’ emotions—Kramer and colleagues ran an experiment. They placed about 700,000 users into four groups for one week: a “negativity-reduced” group, for whom posts with negative words (e.g., “sad”) were randomly blocked from appearing in the News Feed; a “positivity-reduced” group, for whom posts with positive words (e.g., “happy”) were randomly blocked; and two control groups. In the control group for the “negativity-reduced” group, posts were randomly blocked at the same rate as in the “negativity-reduced” group but without regard to their emotional content. The control group for the “positivity-reduced” group was constructed in a parallel fashion. The design of this experiment illustrates that the appropriate control group is not always one with no changes; rather, sometimes the control group receives a treatment in order to create the precise comparison that a research question requires. In all cases, the posts that were blocked from the News Feed were still available to users through other parts of the Facebook website.
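
To make the design concrete, here is a minimal sketch in Python of the per-post blocking logic. Everything in it is a hypothetical stand-in: the word lists, the blocking rates, and the function names are illustrative, not Facebook's actual implementation (which, for example, drew on much larger sentiment dictionaries and varied the blocking rate across participants).

```python
import random

# Toy word lists standing in for the sentiment dictionaries the
# researchers used; the real lists were far larger.
NEGATIVE_WORDS = {"sad", "angry", "lonely"}
POSITIVE_WORDS = {"happy", "great", "love"}

def has_words(post, words):
    """True if the post contains any word from the given set."""
    return any(w in post.lower().split() for w in words)

def show_in_feed(post, condition, block_rate=0.5, emotional_share=0.3):
    """Decide whether one post appears in a participant's News Feed.

    Treatment conditions block only emotion-matching posts; each
    control blocks posts at a matched overall rate, without regard
    to content. block_rate and emotional_share are illustrative
    values, not the rates used in the actual experiment.
    """
    if condition == "negativity_reduced":
        if has_words(post, NEGATIVE_WORDS) and random.random() < block_rate:
            return False
    elif condition == "positivity_reduced":
        if has_words(post, POSITIVE_WORDS) and random.random() < block_rate:
            return False
    elif condition in ("control_for_negative", "control_for_positive"):
        # Match the expected share of posts blocked in the paired
        # treatment arm, ignoring emotional content.
        if random.random() < block_rate * emotional_share:
            return False
    return True
```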

Kramer and colleagues found that for participants in the positivity-reduced condition, the percentage of positive words in their status updates decreased and the percentage of negative words increased. On the other hand, for participants in the negativity-reduced condition, the percentage of positive words increased and that of negative words decreased (figure 4.24). However, these effects were quite small: the difference in positive and negative words between treatments and controls was about 1 in 1,000 words.

Figure 4.24: Evidence of emotional contagion (Kramer, Guillory, and Hancock 2014). Participants in the negativity-reduced condition used fewer negative words and more positive words, and participants in the positivity-reduced condition used more negative words and fewer positive words. Bars represent estimated standard errors. Adapted from Kramer, Guillory, and Hancock (2014), figure 1.

Before discussing the ethical issues raised by this experiment, I’d like to describe three scientific issues using some of the ideas from earlier in the chapter. First, it is not clear how the actual details of the experiment connect to the theoretical claims; in other words, there are questions about construct validity. It is not clear that the positive and negative word counts are actually a good indicator of the emotional state of participants because (1) it is not clear that the words that people post are a good indicator of their emotions and (2) it is not clear that the particular sentiment analysis technique that the researchers used is able to reliably infer emotions (Beasley and Mason 2015; Panger 2016). In other words, there might be a bad measure of a biased signal. Second, the design and analysis of the experiment tell us nothing about who was most impacted (i.e., there is no analysis of heterogeneity of treatment effects) or what the mechanism might be. In this case, the researchers had lots of information about the participants, but the participants were essentially treated as widgets in the analysis. Third, the effect size in this experiment was very small; the difference between the treatment and control conditions was about 1 in 1,000 words. In their paper, Kramer and colleagues make the case that an effect of this size is important because hundreds of millions of people access their News Feed each day. In other words, they argue that even if effects are small for each person, they are big in aggregate. Even if you were to accept this argument, it is still not clear whether an effect of this size is important for the more general scientific question about the spread of emotion (Prentice and Miller 1992).
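
To see why the word-count measure raises construct validity concerns, consider a toy version of it. The function below is a hypothetical bag-of-words scorer, not the researchers’ actual pipeline; notice how it mishandles even simple negation.

```python
def emotion_word_percentages(status_update, positive_words, negative_words):
    """Percentage of words in a status update that are positive/negative.

    A bag-of-words measure in the spirit of the word-count approach
    used in the study; it treats each word in isolation.
    """
    tokens = status_update.lower().split()
    if not tokens:
        return 0.0, 0.0
    pos = 100 * sum(t in positive_words for t in tokens) / len(tokens)
    neg = 100 * sum(t in negative_words for t in tokens) / len(tokens)
    return pos, neg

# "not happy" is scored as 20% positive, 0% negative.
print(emotion_word_percentages("i am not happy today", {"happy"}, {"sad"}))
```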

In addition to these scientific questions, just days after this paper was published in Proceedings of the National Academy of Sciences, there was an enormous outcry from both researchers and the press (I’ll describe the arguments in this debate in more detail in chapter 6). The issues raised in this debate caused the journal to publish a rare “editorial expression of concern” about the ethics and the ethical review process for the research (Verma 2014).

Given that background about Emotional Contagion, I would now like to show that the three R’s can suggest concrete, practical improvements for real studies (whatever you might personally think about the ethics of this particular experiment). The first R is replace: researchers should seek to replace experiments with less invasive and risky techniques, if possible. For example, rather than running a randomized controlled experiment, the researchers could have exploited a natural experiment. As described in chapter 2, natural experiments are situations where something happens in the world that approximates the random assignment of treatments (e.g., a lottery to decide who will be drafted into the military). The ethical advantage of a natural experiment is that the researcher does not have to deliver treatments: the environment does that instead. For example, almost concurrently with the Emotional Contagion experiment, Lorenzo Coviello et al. (2014) were exploiting what could be called an Emotional Contagion natural experiment. Coviello and colleagues discovered that people post more negative words and fewer positive words on days when it is raining. Therefore, by using random variation in the weather, they were able to study the effect of changes in the News Feed without the need to intervene at all. It was as if the weather was running their experiment for them. The details of their procedure are a bit complicated, but the most important point for our purposes here is that by using a natural experiment, Coviello and colleagues were able to learn about the spread of emotions without the need to run their own experiment.
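
The core intuition can be illustrated with a much-simplified sketch. Coviello and colleagues’ actual analysis used instrumental variables across cities; the hypothetical calculation below only compares average sentiment on rainy and dry days.

```python
def rainy_day_contrast(daily_records):
    """Difference in mean percentage of positive words, rainy minus dry.

    daily_records: list of (was_raining, pct_positive_words) pairs.
    """
    rainy = [p for raining, p in daily_records if raining]
    dry = [p for raining, p in daily_records if not raining]
    return sum(rainy) / len(rainy) - sum(dry) / len(dry)

# Hypothetical data in which rain coincides with fewer positive words.
records = [(True, 4.8), (True, 4.9), (False, 5.2), (False, 5.3)]
print(rainy_day_contrast(records))  # about -0.4 percentage points
```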

The second of the three R’s is refine: researchers should seek to refine their treatments to make them as harmless as possible. For example, rather than blocking content that was either positive or negative, the researchers could have boosted content that was positive or negative. This boosting design would have changed the emotional content of participants’ News Feeds, but it would have addressed one of the concerns that critics expressed: that the experiment could have caused participants to miss important information in their News Feed. With the design used by Kramer and colleagues, a message that is important is as likely to be blocked as one that is not. However, with a boosting design, the messages that would be displaced would be those that are less important.
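
One way to picture the difference between blocking and boosting: instead of removing posts, a boosting design re-ranks them. The sketch below is hypothetical and assumes that each post arrives with a relevance score, so that any displacement falls on the lowest-relevance posts.

```python
def boost_feed(posts, words_to_boost, boost=1.0):
    """Re-rank a feed by adding a score bonus to posts containing
    target words; no post is removed outright.

    posts: list of (text, relevance_score); higher scores rank first.
    """
    def score(item):
        text, relevance = item
        matched = any(w in text.lower().split() for w in words_to_boost)
        return relevance + (boost if matched else 0.0)
    return sorted(posts, key=score, reverse=True)

feed = [("traffic was terrible", 0.9), ("had a great day", 0.4)]
# With a boost for "great", the positive post moves to the top.
print(boost_feed(feed, {"great"}))
```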

Finally, the third R is reduce: researchers should seek to reduce the number of participants in their experiment to the minimum needed to achieve their scientific objective. In analog experiments, this happened naturally because of the high variable costs of participants. But in digital experiments, particularly those with zero variable cost, researchers don’t face a cost constraint on the size of their experiment, and this has the potential to lead to unnecessarily large experiments.

For example, Kramer and colleagues could have used pre-treatment information about their participants—such as pre-treatment posting behavior—to make their analysis more efficient. More specifically, rather than comparing the proportion of positive words in the treatment and control conditions, Kramer and colleagues could have compared the change in the proportion of positive words between conditions, an approach that is sometimes called a mixed design (figure 4.5) and sometimes called a difference-in-differences estimator. That is, for each participant, the researchers could have created a change score (post-treatment behavior minus pre-treatment behavior) and then compared the change scores of participants in the treatment and control conditions. This difference-in-differences approach is more efficient statistically, which means that researchers can achieve the same statistical confidence using much smaller samples.
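
Here is a minimal sketch of that estimator applied to hypothetical data (the numbers and function are illustrative, not drawn from the actual study):

```python
def diff_in_diff(treatment, control):
    """Difference-in-differences estimate of the treatment effect.

    treatment, control: lists of (pre, post) outcome pairs per
    participant, e.g., percentage of positive words before and
    during the experiment.
    """
    def mean_change(pairs):
        return sum(post - pre for pre, post in pairs) / len(pairs)
    return mean_change(treatment) - mean_change(control)

# Hypothetical data: positive-word share falls slightly in the
# treatment group relative to the control group.
treated = [(5.2, 5.0), (4.8, 4.7)]
controls = [(5.1, 5.1), (4.9, 5.0)]
print(diff_in_diff(treated, controls))  # about -0.2
```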

Without having the raw data, it is difficult to know exactly how much more efficient a difference-in-differences estimator would have been in this case. But we can look at other related experiments for a rough idea. Deng et al. (2013) reported that by using a form of the difference-in-differences estimator, they were able to reduce the variance of their estimates by about 50% in three different online experiments; similar results have been reported by Xie and Aurisset (2016). This 50% variance reduction means that the Emotional Contagion researchers might have been able to cut their sample in half if they had used a slightly different analysis method. In other words, with a tiny change in the analysis, 350,000 people might have been spared participation in the experiment.
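
To see why a 50% reduction in variance translates into half the sample, recall that the standard error of an estimated mean scales as \(\sqrt{\sigma^2/n}\). Holding the standard error (and hence the statistical confidence) fixed, halving the variance allows halving the sample size:

\[
\sqrt{\frac{\sigma^2}{n}} = \sqrt{\frac{\sigma^2/2}{n/2}}
\]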

At this point, you might be wondering why researchers should care if 350,000 people were in Emotional Contagion unnecessarily. There are two particular features of Emotional Contagion that make concern with excessive size appropriate, and these features are shared by many digital field experiments: (1) there is uncertainty about whether the experiment will cause harm to at least some participants and (2) participation was not voluntary. It seems reasonable to try to keep experiments that have these features as small as possible.

To be clear, the desire to reduce the size of your experiment does not mean that you should not run large, zero variable cost experiments. It just means that your experiments should not be any larger than you need to achieve your scientific objective. One important way to make sure that an experiment is appropriately sized is to conduct a power analysis (Cohen 1988). In the analog age, researchers generally did power analysis to make sure that their study was not too small (i.e., under-powered). Now, however, researchers should do power analysis to make sure that their study is not too big (i.e., over-powered).
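
For example, here is a rough power calculation using the standard normal-approximation formula for comparing two means; the effect size and standard deviation are hypothetical numbers, not values from Emotional Contagion.

```python
from math import ceil

def required_n_per_group(effect_size, sd):
    """Approximate n per group for a two-sample comparison of means,
    using n = 2 * (z_alpha + z_power)^2 * (sd / effect_size)^2.

    The z-values below correspond to a two-sided test with
    alpha = 0.05 and power = 0.80.
    """
    z_alpha, z_power = 1.96, 0.84
    return ceil(2 * (z_alpha + z_power) ** 2 * (sd / effect_size) ** 2)

# Detecting a 0.1-percentage-point difference in positive-word share
# with a standard deviation of 5 points requires a large sample,
# but a pre-specified one.
print(required_n_per_group(effect_size=0.1, sd=5))  # 39200
```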

In conclusion, the three R’s—replace, refine, and reduce—provide principles that can help researchers build ethics into their experimental designs. Of course, each of these possible changes to Emotional Contagion introduces trade-offs. For example, evidence from natural experiments is not always as clean as that from randomized experiments, and boosting content might have been logistically more difficult to implement than blocking content. So, the purpose of suggesting these changes was not to second-guess the decisions of other researchers. Rather, it was to illustrate how the three R’s could be applied in a realistic situation. In fact, the issue of trade-offs comes up all the time in research design, and in the digital age, these trade-offs will increasingly involve ethical considerations. Later, in chapter 6, I’ll offer some principles and ethical frameworks that can help researchers understand and discuss these trade-offs.