4.4 Moving beyond simple experiments

Let’s move beyond simple experiments. Three concepts are useful for designing richer experiments: validity, heterogeneity of treatment effects, and mechanisms.

Researchers who are new to experiments often focus on a very specific, narrow question: does this treatment “work”? For example, does a phone call from a volunteer encourage someone to vote? Does changing a website button from blue to green increase click-through rate? Unfortunately, loose phrasing about what “works” obscures the fact that narrowly focused experiments don’t really tell you whether a treatment “works” in a general sense. Rather, narrowly focused experiments answer a much more specific question: what is the average effect of this specific treatment with this specific implementation for this population of participants at this time? I’ll call experiments that focus on this narrow question simple experiments.
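To make that narrow question concrete, the estimand of a simple experiment can be written in standard potential-outcomes notation; this is a sketch of the textbook definition, not notation specific to any particular study:

$$\text{ATE} = \frac{1}{N} \sum_{i=1}^{N} \big[ Y_i(1) - Y_i(0) \big]$$

where $Y_i(1)$ and $Y_i(0)$ are the outcomes that participant $i$ would have with and without the treatment, and the sum runs over the $N$ participants who were actually in the study. Nothing in this definition speaks to other treatments, other implementations, other populations, or other times.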

Simple experiments can provide valuable information, but they fail to answer many questions that are both important and interesting, such as: Are there some people for whom the treatment had a larger or smaller effect? Is there another treatment that would be more effective? And how does this experiment relate to broader social theories?

In order to show the value of moving beyond simple experiments, let’s consider one of my favorite analog field experiments, a study by P. Wesley Schultz and colleagues on the relationship between social norms and energy consumption (Schultz et al. 2007). Schultz and colleagues hung doorhangers on 300 households in San Marcos, California, and these doorhangers delivered different messages designed to encourage energy conservation. Then, Schultz and colleagues measured the effect of these messages on electricity consumption, both after one week and three weeks; see Figure 4.3 for a more detailed description of the experimental design.

Figure 4.3: Schematic of design from Schultz et al. (2007). The field experiment involved visiting about 300 households in San Marcos, California five times over an eight-week period. On each visit the researchers manually took a reading from the house’s power meter. On two of the visits they placed doorhangers on the house providing some information about its energy usage. The research question was how the content of these messages would affect energy use.

The experiment had two conditions. In the first condition, households received general energy-saving tips (e.g., use fans instead of air conditioners) and information about their household’s energy usage compared with the average energy usage in their neighborhood. Schultz and colleagues called this the descriptive normative condition because the information about energy use in their neighborhood provided information about typical behavior (i.e., a descriptive norm). When Schultz and colleagues looked at the resulting energy usage in this group, the treatment appeared to have no effect, either in the short term or the long term; in other words, the treatment didn’t seem to “work” (Figure 4.4).

But, fortunately, Schultz et al. (2007) did not settle for this simplistic analysis. Before the experiment began they reasoned that heavy users of electricity—people above the mean—might reduce their consumption, and that light users of electricity—people below the mean—might actually increase their consumption. When they looked at the data, that’s exactly what they found (Figure 4.4). Thus, what looked like a treatment that was having no effect was actually a treatment that had two offsetting effects. The researchers called this counter-productive increase among the light users a boomerang effect.
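To see how a zero average can hide real effects, it helps to write the overall average treatment effect as a weighted average of subgroup effects; the numbers below are purely illustrative and are not estimates from Schultz et al. (2007):

$$\text{ATE} = \pi_{\text{heavy}} \, \text{ATE}_{\text{heavy}} + \pi_{\text{light}} \, \text{ATE}_{\text{light}}$$

For example, if half the households are heavy users whose consumption falls by 1 kWh per day and half are light users whose consumption rises by 1 kWh per day, then the overall effect is $0.5 \times (-1) + 0.5 \times (+1) = 0$, even though the treatment changed behavior in both groups.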

Figure 4.4: Results from Schultz et al. (2007). The first panel shows that the descriptive norm treatment has an estimated zero average treatment effect. However, the second panel shows that this average treatment effect is actually composed of two offsetting effects. For heavy users, the treatment decreased usage but for light users, the treatment increased usage. Finally, the third panel shows that the second treatment, which used descriptive and injunctive norms, had roughly the same effect on heavy users but mitigated the boomerang effect on light users.

Further, Schultz and colleagues anticipated this possibility, and in the second condition they deployed a slightly different treatment, one explicitly designed to eliminate the boomerang effect. The households in the second condition received the exact same treatment—general energy-saving tips and information about their household’s energy usage compared with their neighborhood—with one tiny addition: for people with below-average consumption, the researchers added a :) and for people with above-average consumption they added a :(. These emoticons were designed to trigger what the researchers called injunctive norms. Injunctive norms refer to perceptions of what is commonly approved (and disapproved), whereas descriptive norms refer to perceptions of what is commonly done (Reno, Cialdini, and Kallgren 1993).

By adding this one tiny emoticon, the researchers dramatically reduced the boomerang effect (Figure 4.4). Thus, by making this one simple change—a change that was motivated by an abstract social psychological theory (Cialdini, Kallgren, and Reno 1991)—the researchers were able to turn a program from one that didn’t seem to work into one that worked, and, simultaneously, they were able to contribute to the general understanding of how social norms affect human behavior.

At this point, however, you might notice that something is a bit different about this experiment. In particular, the experiment of Schultz and colleagues doesn’t really have a control group in the same way that randomized controlled experiments do. The comparison between this design and the design of Restivo and van de Rijt illustrates the differences between two major designs used by researchers. In between-subjects designs, such as that of Restivo and van de Rijt, there is a treatment group and a control group, whereas in within-subjects designs the behavior of participants is compared before and after the treatment (Greenwald 1976; Charness, Gneezy, and Kuhn 2012). In a within-subjects experiment it is as if each participant acts as her own control group. The strength of between-subjects designs is that they provide protection against confounders (as I described earlier), while the strength of within-subjects designs is increased precision in estimates. When each participant acts as her own control, between-participant variation is eliminated (see Technical Appendix). To foreshadow an idea that will come later, when I offer advice about designing digital experiments, there is a final design, called a mixed design, that combines the improved precision of within-subjects designs and the protection against confounding of between-subjects designs.
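To give a feel for the precision argument, here is a minimal simulation sketch in Python; the effect size, variances, and sample sizes are made-up values for illustration, not estimates from any study. It compares the sampling variability of a between-subjects estimator, which uses only post-treatment outcomes, with a mixed-design estimator, which compares pre-to-post changes across the treatment and control groups.

```python
# Minimal sketch: why pre-treatment measurements improve precision.
# All parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n = 1000            # households per arm (assumed)
true_effect = 1.0   # true average treatment effect (assumed)
sd_between = 5.0    # stable household-to-household differences (assumed)
sd_noise = 1.0      # period-to-period measurement noise (assumed)

est_between, est_mixed = [], []
for _ in range(2000):
    # Each household has a stable baseline level of electricity use.
    baseline_t = sd_between * rng.standard_normal(n)   # treatment arm
    baseline_c = sd_between * rng.standard_normal(n)   # control arm

    pre_t = baseline_t + sd_noise * rng.standard_normal(n)
    pre_c = baseline_c + sd_noise * rng.standard_normal(n)
    post_t = baseline_t + true_effect + sd_noise * rng.standard_normal(n)
    post_c = baseline_c + sd_noise * rng.standard_normal(n)

    # Between-subjects estimator: compare post-treatment outcomes only.
    est_between.append(post_t.mean() - post_c.mean())

    # Mixed-design estimator: compare pre-to-post *changes* across arms.
    est_mixed.append((post_t - pre_t).mean() - (post_c - pre_c).mean())

print("between-subjects SE:", np.std(est_between))
print("mixed-design SE:    ", np.std(est_mixed))
```

Because the stable household-to-household differences cancel out of the pre-to-post changes, the mixed-design estimates vary far less from replication to replication than the between-subjects estimates, which is the source of the precision gain described above.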

Figure 4.5: Three experimental designs. Standard randomized controlled experiments use between-subjects designs. An example of a between-subjects design is Restivo and van de Rijt’s (2012) experiment on barnstars and contributions to Wikipedia: researchers randomly divided participants into treatment and control groups, gave participants in the treatment group a barnstar, and compared outcomes for the two groups. A second type of design is a within-subjects design. The two experiments in Schultz and colleagues’ (2007) study on social norms and energy use illustrate a within-subjects design: researchers compared the electricity use of participants before and after receiving the treatment. Within-subjects designs offer improved statistical precision by eliminating between-subject variance (see Technical Appendix), but they are open to possible confounders (e.g., changes in weather between the pre-treatment and treatment periods) (Greenwald 1976; Charness, Gneezy, and Kuhn 2012). Within-subjects designs are also sometimes called repeated measures designs. Finally, mixed designs combine the improved precision of within-subjects designs and the protection against confounding of between-subjects designs. In a mixed design, a researcher compares the change in outcomes for people in the treatment and control groups. When researchers already have pre-treatment information, as is the case in many digital experiments, mixed designs are preferable to between-subjects designs because of the gains in precision (see Technical Appendix).

Overall, the design and results of Schultz et al. (2007) show the value of moving beyond simple experiments. Fortunately, you don’t need to be a genius to create experiments like this. Social scientists have developed three concepts that will guide you toward richer and more creative experiments: 1) validity, 2) heterogeneity of treatment effects, and 3) mechanisms. That is, if you keep these three ideas in mind while you are designing your experiment, you will naturally create more interesting and useful experiments. In order to illustrate these three concepts in action, I’ll describe a number of follow-up, partially digital field experiments that built on the elegant design and exciting results of Schultz et al. (2007). As you will see, through more careful design, implementation, analysis, and interpretation, you too can move beyond simple experiments.