4.4.1 Validity

Validity refers to how much the results of an experiment support a more general conclusion.

No experiment is perfect, and researchers have developed an extensive vocabulary to describe possible problems. Validity refers to the extent to which the results of a particular experiment support some more general conclusion. Social scientists have found it helpful to split validity into four main types: statistical conclusion validity, internal validity, construct validity, and external validity (Shadish, Cook, and Campbell 2001, chap. 2). Mastering these concepts will provide you with a mental checklist for critiquing and improving the design and analysis of an experiment, and it will help you communicate with other researchers.

Statistical conclusion validity centers around whether the statistical analysis of the experiment was done correctly. In the context of Schultz et al. (2007), such a question might center on whether they computed their $p$-values correctly. The statistical principles needed to design and analyze experiments are beyond the scope of this book, but they have not fundamentally changed in the digital age. What has changed, however, is that the data environment in digital experiments has created new opportunities, such as using machine learning methods to estimate heterogeneity of treatment effects (Imai and Ratkovic 2013).
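To make the idea of checking a $p$-value concrete, here is a minimal sketch in Python of a permutation test for a difference in means between a treatment group and a control group. The data values and the function names (`diff_in_means`, `permutation_p_value`) are hypothetical and chosen only for illustration. A permutation test is one reasonable choice among many, and it is not presented here as the analysis used by Schultz et al. (2007).

```python
import random
from statistics import mean


def diff_in_means(treated, control):
    """Estimated effect: mean outcome under treatment minus mean under control."""
    return mean(treated) - mean(control)


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the difference in means.

    Group labels are reshuffled many times; the p-value is the share of
    reshuffles whose difference is at least as large in absolute value as
    the observed one (with an add-one correction so it is never exactly 0).
    """
    rng = random.Random(seed)
    observed = abs(diff_in_means(treated, control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = diff_in_means(pooled[:n_treated], pooled[n_treated:])
        if abs(permuted) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical daily electricity use (kWh) for a handful of households.
treated = [18.2, 17.5, 19.1, 16.8, 17.9, 18.4]
control = [19.6, 20.3, 18.9, 21.0, 19.8, 20.5]

print("estimated effect (kWh):", round(diff_in_means(treated, control), 2))
print("permutation p-value:", permutation_p_value(treated, control))
```

Because the test uses only the random assignment itself, it makes few distributional assumptions, which is one reason it is a convenient check on a reported $p$-value.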

Internal validity centers around whether the experimental procedures were performed correctly. Returning to the experiment of Schultz et al. (2007), questions about internal validity could center around randomization, delivery of treatment, and measurement of outcomes. For example, you might be concerned that the research assistants did not read the electric meters reliably. In fact, Schultz and colleagues were worried about this problem, and they had a sample of meters read twice; fortunately, the results were essentially identical. In general, Schultz and colleagues’ experiment appears to have high internal validity, but this is not always the case: complex field and online experiments often run into problems actually delivering the right treatment to the right people and measuring the outcomes for everyone. Fortunately, the digital age can help reduce concerns about internal validity because it is now easier to ensure that the treatment is delivered to those who are supposed to receive it and to measure outcomes for all participants.

Construct validity centers around the match between the data and the theoretical constructs. As discussed in chapter 2, constructs are abstract concepts that social scientists reason about. Unfortunately, these abstract concepts don’t always have clear definitions and measurements. Returning to Schultz et al. (2007), the claim that injunctive social norms can lower electricity use requires researchers to design a treatment that would manipulate “injunctive social norms” (e.g., an emoticon) and to measure “electricity use”. In analog experiments, many researchers designed their own treatments and measured their own outcomes. This approach ensures that, as much as possible, the experiments match the abstract constructs being studied. In digital experiments where researchers partner with companies or governments to deliver treatments and use always-on data systems to measure outcomes, the match between the experiment and the theoretical constructs may be less tight. Thus, I expect that construct validity will tend to be a bigger concern in digital experiments than in analog experiments.

Finally, external validity centers around whether the results of this experiment can be generalized to other situations. Returning to Schultz et al. (2007), one could ask whether this same idea—providing people with information about their energy usage in relationship to their peers and a signal of injunctive norms (e.g., an emoticon)—would reduce energy usage if it were done in a different way in a different setting. For most well-designed and well-run experiments, concerns about external validity are the hardest to address. In the past, these debates about external validity frequently involved nothing more than a group of people sitting in a room trying to imagine what would have happened if the procedures had been done in a different way, or in a different place, or with different participants. Fortunately, the digital age enables researchers to move beyond these data-free speculations and assess external validity empirically.

Because the results from Schultz et al. (2007) were so exciting, a company named Opower partnered with utilities in the United States to deploy the treatment more widely. Based on the design of Schultz et al. (2007), Opower created customized Home Energy Reports that had two main modules: one showing a household’s electricity usage relative to its neighbors with an emoticon and one providing tips for lowering energy usage (figure 4.6). Then, in partnership with researchers, Opower ran randomized controlled experiments to assess the impact of these Home Energy Reports. Even though the treatments in these experiments were typically delivered physically (usually through old-fashioned snail mail), the outcome was measured using digital devices in the physical world (e.g., power meters). Further, rather than manually collecting this information with research assistants visiting each house, the Opower experiments were all done in partnership with power companies, enabling the researchers to access the power readings. Thus, these partially digital field experiments were run at a massive scale at low variable cost.

Figure 4.6: The Home Energy Reports had a Social Comparison Module and an Action Steps Module. Reproduced by permission from Allcott (2011), figures 1 and 2.

In a first set of experiments involving 600,000 households from 10 different sites, Allcott (2011) found that the Home Energy Report lowered electricity consumption. In other words, the results from the much larger, more geographically diverse study were qualitatively similar to the results from Schultz et al. (2007). Further, in subsequent research involving eight million additional households from 101 different sites, Allcott (2015) again found that the Home Energy Report consistently lowered electricity consumption. This much larger set of experiments also revealed an interesting new pattern that would not be visible in any single experiment: the size of the effect declined in the later experiments (figure 4.7). Allcott (2015) speculated that this decline happened because, over time, the treatment was being applied to different types of participants. More specifically, utilities with more environmentally focused customers were more likely to adopt the program earlier, and their customers were more responsive to the treatment. As utilities with less environmentally focused customers adopted the program, its effectiveness appeared to decline. Thus, just as randomization in experiments ensures that the treatment and control group are similar, randomization in research sites ensures that the estimates can be generalized from one group of participants to a more general population (think back to chapter 3 about sampling). If research sites are not sampled randomly, then generalization, even from a perfectly designed and conducted experiment, can be problematic.
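The logic of this site-selection problem can be illustrated with a small, stylized simulation in Python. Everything in it is an assumption made for illustration: the number of sites, the `focus` score standing in for how environmentally focused a site's customers are, and the linear relationship between that score and the treatment effect. None of these numbers come from Allcott (2015). The sketch only shows why averaging over early adopters can overstate the effect in the full population of sites, while a random sample of sites does not.

```python
import random
from statistics import mean


def simulate_sites(n_sites, rng):
    """Create hypothetical sites as (focus, effect) pairs.

    focus:  0 to 1, a made-up score for how environmentally focused
            the site's customers are.
    effect: percent reduction in electricity use caused by the report,
            assumed (for illustration only) to grow with focus.
    """
    sites = []
    for _ in range(n_sites):
        focus = rng.random()
        effect = 1.0 + 2.0 * focus
        sites.append((focus, effect))
    return sites


def earliest_adopters(sites, k):
    """Assume the k sites with the highest focus adopt the program first."""
    return sorted(sites, key=lambda site: site[0], reverse=True)[:k]


def average_effect(sites):
    """Average percent reduction across a collection of sites."""
    return mean(effect for _, effect in sites)


rng = random.Random(42)
all_sites = simulate_sites(1000, rng)

print("all sites:       ", round(average_effect(all_sites), 2))
print("early adopters:  ", round(average_effect(earliest_adopters(all_sites, 100)), 2))
print("random 100 sites:", round(average_effect(rng.sample(all_sites, 100)), 2))
```

In this sketch each site's experiment is assumed to be internally valid, so the gap between the early-adopter average and the all-sites average comes entirely from which sites were studied.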

Figure 4.7: Results of 111 experiments testing the effect of the Home Energy Report on electricity consumption. At sites where the program was adopted later, it tended to have smaller effects. Allcott (2015) argues that a major source of this pattern is that sites with more environmentally focused customers were more likely to adopt the program earlier. Adapted from Allcott (2015), figure 3.

Together, these 111 experiments—10 in Allcott (2011) and 101 in Allcott (2015)—involved about 8.5 million households from all over the United States. They consistently show that Home Energy Reports reduce average electricity consumption, a result that supports the original findings of Schultz and colleagues from 300 homes in California. Beyond just replicating these original results, the follow-up experiments also show that the size of the effect varies by location. This set of experiments also illustrates two more general points about partially digital field experiments. First, researchers will be able to empirically address concerns about external validity when the cost of running experiments is low, and this can occur if the outcome is already being measured by an always-on data system. Therefore, it suggests that researchers should be on the lookout for other interesting and important behaviors that are already being recorded, and then design experiments on top of this existing measuring infrastructure. Second, this set of experiments reminds us that digital field experiments are not just online; increasingly, I expect that they will be everywhere with many outcomes measured by sensors in the built environment.

The four types of validity—statistical conclusion validity, internal validity, construct validity, and external validity—provide a mental checklist to help researchers assess whether the results from a particular experiment support a more general conclusion. Compared with analog-age experiments, in digital-age experiments, it should be easier to address external validity empirically, and it should also be easier to ensure internal validity. On the other hand, issues of construct validity will probably be more challenging in digital-age experiments, especially digital field experiments that involve partnerships with companies.