Questions about causality in social research are often subtle and complex. For a foundational approach to causality based on causal graphs, see Pearl (2009), and for a foundational approach based on potential outcomes, see Imbens and Rubin (2015). For a comparison between these two approaches, see Morgan and Winship (2014). For a formal approach to defining a confounder, see VanderWeele and Shpitser (2013).
In this chapter, I have drawn what may seem like a bright line between our ability to make causal estimates from experimental and from non-experimental data. In reality, however, I think the distinction is more blurred. For example, everyone accepts that smoking causes cancer, even though no randomized controlled experiment forcing people to smoke has ever been done. For excellent book-length treatments on making causal estimates from non-experimental data, see Rosenbaum (2002), (???), Shadish, Cook, and Campbell (2001), and Dunning (2012).
Chapters 1 and 2 of Freedman, Pisani, and Purves (2007) offer a clear introduction to the differences between experiments, controlled experiments, and randomized controlled experiments.
Manzi (2012) provides a fascinating and readable introduction to the philosophical and statistical underpinnings of randomized controlled experiments. It also provides interesting real-world examples of the power of experimentation in business. Issenberg (2012) provides a fascinating introduction to the use of experimentation in political campaigns.
Box, Hunter, and Hunter (2005), Casella (2008), and Athey and Imbens (2016b) provide good introductions to the statistical aspects of experimental design and analysis. Further, there are excellent treatments of the use of experiments in many different fields: economics (Bardsley et al. 2009), sociology (Willer and Walker 2007; Jackson and Cox 2013), psychology (Aronson et al. 1989), political science (Morton and Williams 2010), and social policy (Glennerster and Takavarasha 2013).
The importance of participant recruitment (e.g., sampling) is often under-appreciated in experimental research. However, if the effect of the treatment is heterogeneous in the population, then sampling is critical. Longford (1999) makes this point clearly when he advocates that researchers think of an experiment as a population survey with haphazard sampling.
I have suggested that there is a continuum between lab and field experiments, and other researchers have proposed more detailed typologies, in particular ones that separate the various forms of field experiments (Harrison and List 2004; Charness, Gneezy, and Kuhn 2013).
A number of papers have compared lab and field experiments in the abstract (Falk and Heckman 2009; Cialdini 2009) and in terms of outcomes of specific experiments in political science (Coppock and Green 2015), economics (Levitt and List 2007a, 2007b; Camerer 2011; Al-Ubaydli and List 2013), and psychology (Mitchell 2012). Jerit, Barabas, and Clifford (2013) offer a nice research design for comparing results from lab and field experiments. Parigi, Santana, and Cook (2017) describes how online field experiments can combine some of the characteristics of lab and field experiments.
Concerns about participants changing their behavior because they know they are being closely observed are sometimes called demand effects, and they have been studied in psychology (Orne 1962) and economics (Zizzo 2010). Although mostly associated with lab experiments, these same issues can cause problems for field experiments as well. In fact, demand effects are also sometimes called Hawthorne effects, a term that derives from the famous illumination experiments that began in 1924 at the Hawthorne Works of the Western Electric Company (Adair 1984; Levitt and List 2011). Both demand effects and Hawthorne effects are closely related to the idea of reactive measurement discussed in chapter 2 (see also Webb et al. (1966)).
Field experiments have a long history in economics (Levitt and List 2009), political science (Green and Gerber 2003; Druckman et al. 2006; Druckman and Lupia 2012), psychology (Shadish 2002), and public policy (Shadish and Cook 2009). One area of social science where field experiments quickly became prominent is international development. For a positive review of that work within economics see Banerjee and Duflo (2009), and for a critical assessment see Deaton (2010). For a review of this work in political science see Humphreys and Weinstein (2009). Finally, the ethical challenges arising from field experiments have been explored in the context of political science (Humphreys 2015; Desposato 2016b) and development economics (Baele 2013).
In this section, I suggested that pre-treatment information can be used to improve the precision of estimated treatment effects, but there is some debate about this approach; see Freedman (2008), W. Lin (2013), Berk et al. (2013), and Bloniarz et al. (2016) for more information.
Finally, there are two other types of experiments performed by social scientists that don’t fit neatly along the lab-field dimension: survey experiments and social experiments. Survey experiments are experiments that use the infrastructure of existing surveys and compare responses to alternative versions of the same questions (some survey experiments are presented in chapter 3); for more on survey experiments, see Mutz (2011). Social experiments are experiments where the treatment is some social policy that can only be implemented by a government. Social experiments are closely related to program evaluation. For more on policy experiments, see Heckman and Smith (1995), Orr (1998), and Glennerster and Takavarasha (2013).
I’ve chosen to focus on three concepts: validity, heterogeneity of treatment effects, and mechanisms. These concepts have different names in different fields. For example, psychologists tend to move beyond simple experiments by focusing on mediators and moderators (Baron and Kenny 1986). The idea of mediators is captured by what I call mechanisms, and the idea of moderators is captured by what I call external validity (e.g., would the results of the experiment be different if it were run in different situations) and heterogeneity of treatment effects (e.g., are the effects larger for some people than for others).
The experiment by Schultz et al. (2007) shows how social theories can be used to design effective interventions. For a more general argument about the role of theory in designing effective interventions, see Walton (2014).
The concepts of internal and external validity were first introduced by Campbell (1957). See Shadish, Cook, and Campbell (2001) for a more detailed history and a careful elaboration of statistical conclusion validity, internal validity, construct validity, and external validity.
For an overview of issues related to statistical conclusion validity in experiments see Gerber and Green (2012) (from a social science perspective) and Imbens and Rubin (2015) (from a statistical perspective). Some issues of statistical conclusion validity that arise specifically in online field experiments include issues such as computationally efficient methods for creating confidence intervals with dependent data (Bakshy and Eckles 2013).
Internal validity can be difficult to ensure in complex field experiments. See, for example, Gerber and Green (2000), Imai (2005), and Gerber and Green (2005) for debate about the implementation of a complex field experiment about voting. Kohavi et al. (2012) and Kohavi et al. (2013) provide an introduction to the challenges of internal validity in online field experiments.
One major threat to internal validity is the possibility of failed randomization. One potential way to detect problems with the randomization is to compare the treatment and control groups on observable traits. This kind of comparison is called a balance check. See Hansen and Bowers (2008) for a statistical approach to balance checks and Mutz and Pemantle (2015) for concerns about balance checks. For example, using a balance check, Allcott (2011) found some evidence that randomization was not implemented correctly in three of the Opower experiments (see table 2; sites 2, 6, and 8). For other approaches, see chapter 21 of Imbens and Rubin (2015).
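To make the idea of a balance check concrete, here is a minimal sketch (the covariate, data, and function names are all illustrative, not drawn from any of the studies cited above): it computes the standardized difference in means of one pre-treatment covariate between treatment and control groups, a quantity that should be close to zero when randomization has been implemented correctly.

```python
# Hypothetical balance check: compare one pre-treatment covariate
# (e.g., age) across treatment and control groups.
import random

def balance_check(treatment_cov, control_cov):
    """Return the standardized difference in means for one covariate."""
    n_t, n_c = len(treatment_cov), len(control_cov)
    mean_t = sum(treatment_cov) / n_t
    mean_c = sum(control_cov) / n_c
    var_t = sum((x - mean_t) ** 2 for x in treatment_cov) / (n_t - 1)
    var_c = sum((x - mean_c) ** 2 for x in control_cov) / (n_c - 1)
    pooled_sd = ((var_t + var_c) / 2) ** 0.5
    return (mean_t - mean_c) / pooled_sd

# Simulate a correctly implemented randomization and check balance.
random.seed(0)
ages = [random.gauss(40, 10) for _ in range(2000)]
random.shuffle(ages)                      # random assignment
treated, control = ages[:1000], ages[1000:]
print(round(abs(balance_check(treated, control)), 3))  # typically below 0.1
```

A common rule of thumb treats absolute standardized differences below about 0.1 as consistent with successful randomization, although this simple comparison is no substitute for the more principled statistical approaches discussed above.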
Other major concerns related to internal validity are: (1) one-sided noncompliance, where not everyone in the treatment group actually receives the treatment; (2) two-sided noncompliance, where not everyone in the treatment group receives the treatment and some people in the control group receive the treatment; (3) attrition, where outcomes are not measured for some participants; and (4) interference, where the treatment spills over from people in the treatment condition to people in the control condition. See chapters 5, 6, 7, and 8 of Gerber and Green (2012) for more on each of these issues.
For more on construct validity, see Westen and Rosenthal (2003), and for more on construct validity in big data sources, Lazer (2015) and chapter 2 of this book.
One aspect of external validity is the setting in which an intervention is tested. Allcott (2015) provides a careful theoretical and empirical treatment of site selection bias. This issue is also discussed by Deaton (2010). Another aspect of external validity is whether alternative operationalizations of the same intervention will have similar effects. In this case, a comparison between Schultz et al. (2007) and Allcott (2011) shows that the Opower experiments had a smaller estimated treatment effect than the original experiments by Schultz and colleagues (1.7% versus 5%). Allcott (2011) speculated that the follow-up experiments had a smaller effect because of the ways in which the treatment differed: a handwritten emoticon as part of a study sponsored by a university, compared with a printed emoticon as part of a mass-produced report from a power company.
For an excellent overview of heterogeneity of treatment effects in field experiments, see chapter 12 of Gerber and Green (2012). For introductions to heterogeneity of treatment effects in medical trials, see Kent and Hayward (2007), Longford (1999), and Kravitz, Duan, and Braslow (2004). Considerations of heterogeneity of treatment effects generally focus on differences based on pre-treatment characteristics. If you are interested in heterogeneity based on post-treatment outcomes, then more complex approaches are needed, such as principal stratification (Frangakis and Rubin 2002); see Page et al. (2015) for a review.
Many researchers estimate the heterogeneity of treatment effects using linear regression, but newer methods rely on machine learning; see, for example, Green and Kern (2012), Imai and Ratkovic (2013), Taddy et al. (2016), and Athey and Imbens (2016a).
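As a toy illustration of what heterogeneity of treatment effects means (a simulation with hypothetical variable names and effect sizes, not a method from any of the papers cited above), consider estimating the treatment effect separately within subgroups defined by a pre-treatment characteristic:

```python
# Simulated experiment where the true treatment effect differs between
# "heavy users" and "light users" (hypothetical subgroups and numbers).
import random

random.seed(1)

def simulate_unit():
    heavy_user = random.random() < 0.5      # pre-treatment covariate
    treated = random.random() < 0.5         # randomized assignment
    effect = 2.0 if heavy_user else 0.5     # heterogeneous true effect
    outcome = 10 + effect * treated + random.gauss(0, 1)
    return heavy_user, treated, outcome

units = [simulate_unit() for _ in range(20000)]

def subgroup_ate(units, heavy):
    """Difference-in-means treatment effect within one subgroup."""
    treated = [y for h, t, y in units if h == heavy and t]
    control = [y for h, t, y in units if h == heavy and not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

print(round(subgroup_ate(units, True), 2))   # close to the true 2.0
print(round(subgroup_ate(units, False), 2))  # close to the true 0.5
```

A pooled analysis of these data would report a single average effect of about 1.25 and miss the fact that the treatment works four times better for one subgroup than the other.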
There is some skepticism about findings of heterogeneity of effects because of multiple comparison problems and “fishing.” There are a variety of statistical approaches that can help address concerns about multiple comparison (Fink, McConnell, and Vollmer 2014; List, Shaikh, and Xu 2016). One approach to concerns about “fishing” is pre-registration, which is becoming increasingly common in psychology (Nosek and Lakens 2014), political science (Humphreys, Sierra, and Windt 2013; Monogan 2013; Anderson 2013; Gelman 2013; Laitin 2013), and economics (Olken 2015).
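The arithmetic behind the multiple comparison problem is easy to sketch: if each of k independent tests is run at level 0.05, the chance of at least one false positive grows quickly with k. A Bonferroni correction is one simple, conservative remedy (the papers cited above discuss more refined approaches):

```python
# Familywise error rate for k independent tests at alpha = 0.05, and the
# Bonferroni-corrected per-test threshold. Purely illustrative arithmetic.
alpha = 0.05
for k in (1, 5, 20):
    familywise = 1 - (1 - alpha) ** k   # P(at least one false positive)
    bonferroni = alpha / k              # corrected per-test threshold
    print(f"k={k}: familywise error = {familywise:.3f}, "
          f"Bonferroni threshold = {bonferroni:.4f}")
```

With 20 subgroup tests, the chance of at least one spurious "significant" finding is about 64%, which is why uncorrected reports of heterogeneous effects are met with skepticism.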
In the study by Costa and Kahn (2013) only about half of the households in the experiment could be linked to the demographic information. Readers interested in these details should refer to the original paper.
Mechanisms are incredibly important, but they turn out to be very difficult to study. Research about mechanisms is closely related to the study of mediators in psychology (but see also VanderWeele (2009) for a precise comparison between the two ideas). Statistical approaches to finding mechanisms, such as the approach developed in Baron and Kenny (1986), are quite common. Unfortunately, it turns out that those procedures depend on some strong assumptions (Bullock, Green, and Ha 2010) and suffer when there are multiple mechanisms, as one might expect in many situations (Imai and Yamamoto 2013; VanderWeele and Vansteelandt 2014). Imai et al. (2011) and Imai and Yamamoto (2013) offer some improved statistical methods. Further, VanderWeele (2015) offers a book-length treatment with a number of important results, including a comprehensive approach to sensitivity analysis.
A separate approach focuses on experiments that attempt to manipulate the mechanism directly (e.g., giving sailors vitamin C). Unfortunately, in many social science settings, there are often multiple mechanisms and it is hard to design treatments that change one without changing the others. Some approaches to experimentally altering mechanisms are described by Imai, Tingley, and Yamamoto (2013), Ludwig, Kling, and Mullainathan (2011), and Pirlott and MacKinnon (2016).
Researchers running fully factorial experiments will need to be concerned about multiple hypothesis testing; see Fink, McConnell, and Vollmer (2014) and List, Shaikh, and Xu (2016) for more information.
Finally, mechanisms also have a long history in the philosophy of science as described by Hedström and Ylikoski (2010).
For more on the use of correspondence studies and audit studies to measure discrimination, see Pager (2007).
The most common way to recruit participants to experiments that you build is Amazon Mechanical Turk (MTurk). Because MTurk mimics aspects of traditional lab experiments—paying people to complete tasks that they would not do for free—many researchers have already begun using Turkers (the workers on MTurk) as experimental participants, resulting in faster and cheaper data collection than can be achieved in traditional on-campus laboratory experiments (Paolacci, Chandler, and Ipeirotis 2010; Horton, Rand, and Zeckhauser 2011; Mason and Suri 2012; Rand 2012; Berinsky, Huber, and Lenz 2012).
Generally, the biggest advantages of using participants recruited from MTurk are logistical. Whereas lab experiments can take weeks to run and field experiments can take months to set up, experiments with participants recruited from MTurk can be run in days. For example, Berinsky, Huber, and Lenz (2012) were able to recruit 400 subjects in a single day to participate in an 8-minute experiment. Further, these participants can be recruited for virtually any purpose (including surveys and mass collaboration, as discussed in chapters 3 and 5). This ease of recruitment means that researchers can run sequences of related experiments in rapid succession.
Before recruiting participants from MTurk for your own experiments, there are four important things that you need to know. First, many researchers have a nonspecific skepticism of experiments involving Turkers. Because this skepticism is not specific, it is hard to counter with evidence. However, after several years of studies using Turkers, we can now conclude that this skepticism is not particularly justified. There have been many studies comparing the demographics of Turkers with those of other populations and many studies comparing the results of experiments with Turkers with those from other populations. Given all this work, I think that the best way for you to think about it is that Turkers are a reasonable convenience sample, much like students but slightly more diverse (Berinsky, Huber, and Lenz 2012). Thus, just as students are a reasonable population for some, but not all, research, Turkers are a reasonable population for some, but not all, research. If you are going to work with Turkers, then it makes sense to read many of these comparative studies and understand their nuances.
Second, researchers have developed best practices for increasing the internal validity of MTurk experiments, and you should learn about and follow these best practices (Horton, Rand, and Zeckhauser 2011; Mason and Suri 2012). For example, researchers using Turkers are encouraged to use screeners to remove inattentive participants (Berinsky, Margolis, and Sances 2014, 2016) (but see also D. J. Hauser and Schwarz (2015b) and D. J. Hauser and Schwarz (2015a)). If you don’t remove inattentive participants, then any effect of the treatment can be washed out by the noise that they introduce, and in practice the number of inattentive participants can be substantial. In the experiment by Huber and colleagues (2012), about 30% of participants failed basic attention screeners. Other problems that commonly arise when Turkers are used are non-naive participants (Chandler et al. 2015) and attrition (Zhou and Fishbach 2016).
Third, relative to some other forms of digital experiments, MTurk experiments cannot scale; Stewart et al. (2015) estimate that at any given time there are only about 7,000 people on MTurk.
Finally, you should know that MTurk is a community with its own rules and norms (Mason and Suri 2012). In the same way that you would try to find out about the culture of a country where you were going to run your experiments, you should try to find out more about the culture and norms of Turkers (Salehi et al. 2015). And you should know that the Turkers will be talking about your experiment if you do something inappropriate or unethical (Gray et al. 2016).
MTurk is an incredibly convenient way to recruit participants to your experiments, whether they are lab-like, such as that of Huber, Hill, and Lenz (2012), or more field-like, such as those of Mason and Watts (2009), Goldstein, McAfee, and Suri (2013), Goldstein et al. (2014), Horton and Zeckhauser (2016), and Mao et al. (2016).
If you are thinking of trying to create your own product, I recommend that you read the advice offered by the MovieLens group in Harper and Konstan (2015). A key insight from their experience is that for each successful project there are many, many failures. For example, the MovieLens group launched other products, such as GopherAnswers, that were complete failures (Harper and Konstan 2015). Another example of a researcher failing while attempting to build a product is Edward Castronova’s attempt to build an online game called Arden. Despite $250,000 in funding, the project was a flop (Baker 2008). Projects like GopherAnswers and Arden are unfortunately much more common than projects like MovieLens.
I’ve heard the idea of Pasteur’s Quadrant discussed frequently at tech companies, and it helps organize research efforts at Google (Spector, Norvig, and Petrov 2012).
Bond and colleagues’ study (2012) also attempts to detect the effect of these treatments on the friends of those who received them. Because of the design of the experiment, these spillovers are difficult to detect cleanly; interested readers should see Bond et al. (2012) for a more thorough discussion. Jones and colleagues (2017) also conducted a very similar experiment during the 2012 election. These experiments are part of a long tradition of experiments in political science on efforts to encourage voting (Green and Gerber 2015). These get-out-the-vote experiments are common, in part because they are in Pasteur’s Quadrant. That is, many people are motivated to increase voting, and voting can be an interesting behavior with which to test more general theories about behavior change and social influence.
For advice about running field experiments with partner organizations such as political parties, NGOs, and businesses, see Loewen, Rubenson, and Wantchekon (2010), J. A. List (2011), and Gueron (2002). For thoughts about how partnerships with organizations can impact research designs, see King et al. (2007) and Green, Calfano, and Aronow (2014). Partnership can also lead to ethical questions, as discussed by Humphreys (2015) and Nickerson and Hyde (2016).
If you are going to create an analysis plan before running your experiment, I suggest that you start by reading reporting guidelines. The CONSORT (Consolidated Standards of Reporting Trials) guidelines were developed in medicine (Schulz et al. 2010) and modified for social research (Mayo-Wilson et al. 2013). A related set of guidelines has been developed by the editors of the Journal of Experimental Political Science (Gerber et al. 2014) (see also Mutz and Pemantle (2015) and Gerber et al. (2015)). Finally, reporting guidelines have been developed in psychology (APA Working Group 2008), and see also Simmons, Nelson, and Simonsohn (2011).
If you create an analysis plan, you should consider pre-registering it because pre-registration will increase the confidence that others have in your results. Further, if you are working with a partner, it will limit your partner’s ability to change the analysis after seeing the results. Pre-registration is becoming increasingly common in psychology (Nosek and Lakens 2014), political science (Humphreys, Sierra, and Windt 2013; Monogan 2013; Anderson 2013; Gelman 2013; Laitin 2013), and economics (Olken 2015).
Design advice specifically for online field experiments is also presented in Konstan and Chen (2007) and Chen and Konstan (2015).
What I’ve called the armada strategy is sometimes called programmatic research; see Wilson, Aronson, and Carlsmith (2010).
For more on the MusicLab experiments, see Salganik, Dodds, and Watts (2006), Salganik and Watts (2008), Salganik and Watts (2009b), Salganik and Watts (2009a), and Salganik (2007). For more on winner-take-all markets, see Frank and Cook (1996). For more on untangling luck and skill more generally, see Mauboussin (2012), Watts (2012), and Frank (2016).
There is another approach to eliminating participant payments that researchers should use with caution: conscription. In many online field experiments participants are basically drafted into experiments and never compensated. Examples of this approach include Restivo and van de Rijt’s (2012) experiment on rewards in Wikipedia and Bond and colleagues’ (2012) experiment on encouraging people to vote. These experiments don’t really have zero variable cost—rather, they have zero variable cost to researchers. In such experiments, even if the cost to each participant is extremely small, the aggregate cost can be quite large. Researchers running massive online experiments often justify the importance of small estimated treatment effects by saying that these small effects can become important when applied to many people. The exact same thinking applies to costs that researchers impose on participants. If your experiment causes one million people to waste one minute, the experiment is not very harmful to any particular person, but in aggregate it has wasted almost two years of time.
Another approach to creating zero variable cost payment to participants is to use a lottery, an approach that has also been used in survey research (Halpern et al. 2011). For more about designing enjoyable user experiences, see Toomim et al. (2011). For more about using bots to create zero variable cost experiments see (???).
The three R’s as originally proposed by Russell and Burch (1959) are as follows:
“Replacement means the substitution for conscious living higher animals of insentient material. Reduction means reduction in the numbers of animals used to obtain information of a given amount and precision. Refinement means any decrease in the incidence or severity of inhumane procedures applied to those animals which still have to be used.”
The three R’s that I propose don’t override the ethical principles described in chapter 6. Rather, they are a more elaborate version of one of those principles—beneficence—specifically in the setting of human experiments.
In terms of the first R (“replacement”), comparing the emotional contagion experiment (Kramer, Guillory, and Hancock 2014) and the emotional contagion natural experiment (Lorenzo Coviello et al. 2014) offers some general lessons about the trade-offs involved in moving from experiments to natural experiments (and other approaches like matching that attempt to approximate experiments in non-experimental data; see chapter 2). In addition to the ethical benefits, switching from experimental to non-experimental studies also enables researchers to study treatments that they are logistically unable to deploy. These ethical and logistical benefits come at a cost, however. With natural experiments researchers have less control over things like recruitment of participants, randomization, and the nature of the treatment. For example, one limitation of rainfall as a treatment is that it both increases positivity and decreases negativity. In the experimental study, however, Kramer and colleagues were able to adjust positivity and negativity independently. The particular approach used by Lorenzo Coviello et al. (2014) was further elaborated by L. Coviello, Fowler, and Franceschetti (2014). For an introduction to instrumental variables, which is the approach used by Lorenzo Coviello et al. (2014), see Angrist and Pischke (2009) (less formal) or Angrist, Imbens, and Rubin (1996) (more formal). For a skeptical appraisal of instrumental variables, see Deaton (2010), and for an introduction to instrumental variables with weak instruments (rain is a weak instrument), see Murray (2006). More generally, a good introduction to natural experiments is given by Dunning (2012), while Rosenbaum (2002), (???), and Shadish, Cook, and Campbell (2001) offer good ideas about estimating causal effects without experiments.
In terms of the second R (“refinement”), there are scientific and logistical trade-offs when considering changing the design of Emotional Contagion from blocking posts to boosting posts. For example, it may be the case that the technical implementation of the News Feed makes it substantially easier to do an experiment in which posts are blocked rather than one in which they are boosted (note that an experiment involving blocking of posts could be implemented as a layer on top of the News Feed system without any need for alterations of the underlying system). Scientifically, however, the theory addressed by the experiment did not clearly suggest one design over the other. Unfortunately, I am not aware of substantial prior research about the relative merits of blocking and boosting content in the News Feed. Also, I have not seen much research about refining treatments to make them less harmful; one exception is B. Jones and Feamster (2015), which considers the case of measurement of Internet censorship (a topic I discuss in chapter 6 in relationship to the Encore study (Burnett and Feamster 2015; Narayanan and Zevenbergen 2015)).
In terms of the third R (“reduction”), good introductions to traditional power analysis are given by Cohen (1988) (book) and Cohen (1992) (article), while Gelman and Carlin (2014) offer a slightly different perspective. Pre-treatment covariates can be included in the design and analysis stage of experiments; chapter 4 of Gerber and Green (2012) provides a good introduction to both approaches, and Casella (2008) provides a more in-depth treatment. Techniques that use this pre-treatment information in the randomization are typically called either blocked experimental designs or stratified experimental designs (the terminology is not used consistently across communities); these techniques are closely related to the stratified sampling techniques discussed in chapter 3. See Higgins, Sävje, and Sekhon (2016) for more on using these designs in massive experiments. Pre-treatment covariates can also be included in the analysis stage. McKenzie (2012) explores the difference-in-differences approach to analyzing field experiments in greater detail. See Carneiro, Lee, and Wilhelm (2016) for more on the trade-offs between different approaches to increase precision in estimates of treatment effects. Finally, when deciding whether to try to include pre-treatment covariates at the design or analysis stage (or both), there are a few factors to consider. In a setting where researchers want to show that they are not “fishing” (Humphreys, Sierra, and Windt 2013), using pre-treatment covariates in the design stage can be helpful (Higgins, Sävje, and Sekhon 2016). In situations where participants arrive sequentially, especially online field experiments, using pre-treatment information in the design stage may be difficult logistically; see, for example, Xie and Aurisset (2016).
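As a rough sketch of the kind of calculation involved in a traditional power analysis (a standard normal-approximation formula in the spirit of Cohen (1988), not code from any cited source), the sample size per arm needed to detect a given standardized effect at a given significance level and power is:

```python
# Sample size per arm for a two-sample comparison of means, using the
# standard normal approximation. effect_size is in standard-deviation
# units (Cohen's d). Illustrative sketch, not a full design tool.
import math
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.8):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

print(n_per_arm(0.5))   # "medium" effect: 63 per arm
print(n_per_arm(0.02))  # tiny effect typical online: 39245 per arm
```

The second line shows why reducing required sample sizes matters so much in the digital age: the small standardized effects common in online experiments demand samples hundreds of times larger than those needed for the effects typical of lab studies.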
It is worth adding a bit of intuition about why a difference-in-differences approach can be so much more effective than a difference-in-means one. Many online outcomes have very high variance (see, e.g., R. A. Lewis and Rao (2015) and Lamb et al. (2015)) and are relatively stable over time. In this case, the change score will have substantially smaller variance, increasing the power of the statistical test. One reason this approach is not used more often is that prior to the digital age, it was not common to have pre-treatment outcomes. A more concrete way to think about this is to imagine an experiment to measure whether a specific exercise routine causes weight loss. If you adopt a difference-in-means approach, your estimate will have variability arising from the variability in weights in the population. If you do a difference-in-differences approach, however, that naturally occurring variation in weights is removed, and you can more easily detect a difference caused by the treatment.
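This intuition can be checked with a small simulation (entirely hypothetical numbers, in the spirit of the weight-loss example): when outcomes are stable within a person but highly variable across people, change scores have far less variance than raw outcomes.

```python
# Hypothetical weight-loss experiment: weights vary a lot across people
# (sd 15 kg) but are stable within a person over time (sd 2 kg).
import random
from statistics import stdev

random.seed(2)
true_effect = -1.0                       # treatment causes 1 kg loss
n = 5000

def person():
    baseline = random.gauss(80, 15)      # large between-person variance
    treated = random.random() < 0.5
    followup = baseline + true_effect * treated + random.gauss(0, 2)
    return treated, baseline, followup

people = [person() for _ in range(n)]
raw = [followup for _, _, followup in people]
changes = [followup - baseline for _, baseline, followup in people]

print(round(stdev(raw), 1))      # about 15: dominated by between-person variance
print(round(stdev(changes), 1))  # about 2: between-person variance removed
```

Because the standard error of an estimated treatment effect scales with the outcome's standard deviation, the change score here shrinks the standard error by roughly a factor of seven, which is why the difference-in-differences estimator can detect much smaller effects with the same sample size.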
Finally, I considered adding a fourth R: “repurpose”. That is, if researchers find themselves with more experimental data than they need to address their original research question, they should repurpose the data to ask new questions. For example, imagine that Kramer and colleagues had used a difference-in-differences estimator and found themselves with more data than they needed to address their research question. Rather than not using the data to the fullest extent, they could have studied the size of the effect as a function of pre-treatment emotional expression. Just as Schultz et al. (2007) found that the effect of the treatment was different for light and heavy users, perhaps the effects of the News Feed were different for people who already tended to post happy (or sad) messages. Repurposing could lead to “fishing” (Humphreys, Sierra, and Windt 2013) and “p-hacking” (Simmons, Nelson, and Simonsohn 2011), but these are largely addressable with a combination of honest reporting (Simmons, Nelson, and Simonsohn 2011), pre-registration (Humphreys, Sierra, and Windt 2013), and machine learning methods that attempt to avoid over-fitting.