4.4 Moving beyond simple experiments

Let’s move beyond simple experiments. Three concepts are useful for rich experiments: validity, heterogeneity of treatment effects, and mechanisms.

Researchers who are new to experiments often focus on a very specific, narrow question: Does this treatment “work”? For example, does a phone call from a volunteer encourage someone to vote? Does changing a website button from blue to green increase the click-through rate? Unfortunately, loose phrasing about what “works” obscures the fact that narrowly focused experiments don’t really tell you whether a treatment “works” in a general sense. Rather, narrowly focused experiments answer a much more specific question: What is the average effect of this specific treatment with this specific implementation for this population of participants at this time? I’ll call experiments that focus on this narrow question simple experiments.
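To make this narrow question concrete, here is a minimal sketch in Python of how a simple experiment is typically analyzed. The numbers are made up for illustration, not taken from any real study; because participants are randomized, the average treatment effect can be estimated as a simple difference in mean outcomes between the treatment and control groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes for a get-out-the-vote experiment: 1 = voted, 0 = did not.
treated = rng.binomial(1, 0.34, size=500)   # people randomly assigned to receive a call
control = rng.binomial(1, 0.30, size=500)   # people randomly assigned to no call

# With random assignment, the average treatment effect is estimated by a difference in means.
ate_hat = treated.mean() - control.mean()

# A rough standard error for that difference in means.
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))

print(f"estimated average treatment effect: {ate_hat:.3f} (SE ~ {se:.3f})")
```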

Simple experiments can provide valuable information, but they fail to answer many questions that are both important and interesting, such as whether there are some people for whom the treatment had a larger or smaller effect; whether there is another treatment that would be more effective; and whether this experiment relates to broader social theories.

In order to show the value of moving beyond simple experiments, let’s consider an analog field experiment by P. Wesley Schultz and colleagues on the relationship between social norms and energy consumption (Schultz et al. 2007). Schultz and colleagues hung doorhangers on about 300 households in San Marcos, California, and these doorhangers delivered different messages designed to encourage energy conservation. Then, Schultz and colleagues measured the effect of these messages on electricity consumption, both after one week and after three weeks; see figure 4.3 for a more detailed description of the experimental design.

Figure 4.3: Schematic of the experimental design from Schultz et al. (2007). The field experiment involved visiting about 300 households in San Marcos, California five times over an eight-week period. On each visit, the researchers manually took a reading from the house’s power meter. On two of the visits, they placed doorhangers on each house providing some information about the household’s energy usage. The research question was how the content of these messages would impact energy use.

The experiment had two conditions. In the first, households received general energy-saving tips (e.g., use fans instead of air conditioners) and information about their energy usage compared with the average energy usage in their neighborhood. Schultz and colleagues called this the descriptive normative condition because the information about the energy use in the neighborhood provided information about typical behavior (i.e., a descriptive norm). When Schultz and colleagues looked at the resulting energy usage in this group, the treatment appeared to have no effect, in either the short or long term; in other words, the treatment didn’t seem to “work” (figure 4.4).

Fortunately, Schultz and colleagues did not settle for this simplistic analysis. Before the experiment began, they reasoned that heavy users of electricity—people above the mean—might reduce their consumption, and that light users of electricity—people below the mean—might actually increase their consumption. When they looked at the data, that’s exactly what they found (figure 4.4). Thus, what looked like a treatment that was having no effect was actually a treatment that had two offsetting effects. This counterproductive increase among the light users is an example of a boomerang effect, where a treatment can have the opposite effect from what was intended.
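To see how two offsetting effects can hide inside a single average, here is a minimal sketch of this kind of subgroup analysis. The data are simulated for illustration, not Schultz and colleagues’ measurements; the point is only that an overall effect near zero can coexist with substantial, opposite-signed effects for heavy and light users.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Hypothetical households: baseline daily electricity use (kWh) and random assignment.
baseline = rng.normal(30, 8, size=n)
treated = rng.integers(0, 2, size=n).astype(bool)

# Simulate a boomerang pattern: heavy users cut back, light users drift up toward the norm.
heavy = baseline > baseline.mean()
effect = np.where(heavy, -2.0, 2.0)           # opposite-signed effects by subgroup
after = baseline + np.where(treated, effect, 0.0) + rng.normal(0, 3, size=n)

df = pd.DataFrame({"treated": treated, "heavy": heavy, "change": after - baseline})

# Overall average treatment effect: the two subgroups roughly cancel out.
overall = df.loc[df.treated, "change"].mean() - df.loc[~df.treated, "change"].mean()
print(f"overall effect: {overall:+.2f} kWh")

# Conditional average treatment effects, estimated separately within each subgroup.
for is_heavy, g in df.groupby("heavy"):
    cate = g.loc[g.treated, "change"].mean() - g.loc[~g.treated, "change"].mean()
    print(f"{'heavy' if is_heavy else 'light'} users: {cate:+.2f} kWh")
```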

Figure 4.4: Results from Schultz et al. (2007). Panel (a) shows that the descriptive norm treatment has an estimated zero average treatment effect. However, panel (b) shows that this average treatment effect is actually composed of two offsetting effects. For heavy users, the treatment decreased usage but for light users, the treatment increased usage. Finally, panel (c) shows that the second treatment, which used descriptive and injunctive norms, had roughly the same effect on heavy users but mitigated the boomerang effect on light users. Adapted from Schultz et al. (2007).

In parallel with the first condition, Schultz and colleagues also ran a second condition. The households in the second condition received the exact same treatment—general energy-saving tips and information about their household’s energy usage compared with the average for their neighborhood—with one tiny addition: for people with below-average consumption, the researchers added a :), and for people with above-average consumption they added a :(. These emoticons were designed to trigger what the researchers called injunctive norms. Injunctive norms refer to perceptions of what is commonly approved (and disapproved), whereas descriptive norms refer to perceptions of what is commonly done (Reno, Cialdini, and Kallgren 1993).

By adding this one tiny emoticon, the researchers dramatically reduced the boomerang effect (figure 4.4). Thus, by making this one simple change—a change that was motivated by an abstract social psychological theory (Cialdini, Kallgren, and Reno 1991)—the researchers were able to turn a program that didn’t seem to work into one that worked, and, simultaneously, they were able to contribute to the general understanding of how social norms affect human behavior.

At this point, however, you might notice that something is a bit different about this experiment. In particular, the experiment of Schultz and colleagues doesn’t really have a control group in the same way that randomized controlled experiments do. A comparison between this design and that of Restivo and van de Rijt illustrates the differences between two major experimental designs. In between-subjects designs, such as that of Restivo and van de Rijt, there is a treatment group and a control group. In within-subjects designs, on the other hand, the behavior of participants is compared before and after the treatment (Greenwald 1976; Charness, Gneezy, and Kuhn 2012). In a within-subjects experiment, it is as if each participant acts as her own control group. The strength of between-subjects designs is that they provide protection against confounders (as I described earlier), while the strength of within-subjects experiments is increased precision of estimates. Finally, to foreshadow an idea that will come later when I offer advice about designing digital experiments, a mixed design combines the improved precision of within-subjects designs and the protection against confounding of between-subjects designs (figure 4.5).
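As a rough illustration of how these three designs map onto different estimators, here is a sketch with simulated data (not from any of the studies discussed here): a between-subjects analysis compares post-treatment outcomes across groups, a within-subjects analysis compares each treated unit with its own pre-treatment baseline, and a mixed design compares the pre-to-post change across groups.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Hypothetical units: a pre-treatment outcome, random assignment, and a true effect of -1.5.
pre = rng.normal(30, 8, size=n)
treated = rng.integers(0, 2, size=n).astype(bool)
common_shift = rng.normal(1.0, 2.0, size=n)        # e.g., a change in weather affecting everyone
post = pre + common_shift + np.where(treated, -1.5, 0.0)

# Between-subjects: compare post-treatment outcomes across groups (ignores pre-treatment data).
between = post[treated].mean() - post[~treated].mean()

# Within-subjects: compare treated units with themselves (picks up the common shift as well).
within = (post[treated] - pre[treated]).mean()

# Mixed design: compare the pre-to-post change across groups (a difference-in-differences).
change = post - pre
mixed = change[treated].mean() - change[~treated].mean()

print(f"between-subjects estimate: {between:+.2f}")
print(f"within-subjects estimate:  {within:+.2f}")
print(f"mixed-design estimate:     {mixed:+.2f}")
```

In this simulation, a shift that affects everyone (such as a change in the weather) biases the within-subjects estimate, while the mixed-design estimate removes that shift and is also less noisy than the between-subjects comparison because it makes use of the pre-treatment information.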

Figure 4.5: Three experimental designs. Standard randomized controlled experiments use between-subjects designs. An example of a between-subjects design is Restivo and van de Rijt’s (2012) experiment on barnstars and contributions to Wikipedia: the researchers randomly divided participants into treatment and control groups, gave participants in the treatment group a barnstar, and compared outcomes for the two groups. The second type of design is a within-subjects design. The two experiments in Schultz and colleagues’ (2007) study on social norms and energy use illustrate a within-subjects design: the researchers compared the electricity use of participants before and after receiving the treatment. Within-subjects designs offer improved statistical precision, but they are open to possible confounders (e.g., changes in weather between the pre-treatment and treatment periods) (Greenwald 1976; Charness, Gneezy, and Kuhn 2012). Within-subjects designs are also sometimes called repeated measures designs. Finally, mixed designs combine the improved precision of within-subjects designs and the protection against confounding of between-subjects designs. In a mixed design, a researcher compares the change in outcomes for people in the treatment and control groups. When researchers already have pre-treatment information, as is the case in many digital experiments, mixed designs are generally preferable to between-subjects designs because they result in improved precision of estimates.

Overall, the design and results of the study by Schultz and colleagues (2007) show the value of moving beyond simple experiments. Fortunately, you don’t need to be a creative genius to design experiments like this. Social scientists have developed three concepts that will guide you toward richer experiments: (1) validity, (2) heterogeneity of treatment effects, and (3) mechanisms. That is, if you keep these three ideas in mind while you are designing your experiment, you will naturally create a more interesting and useful experiment. In order to illustrate these three concepts in action, I’ll describe a number of follow-up partially digital field experiments that built on the elegant design and exciting results of Schultz and colleagues (2007). As you will see, through more careful design, implementation, analysis, and interpretation, you too can move beyond simple experiments.