Further commentary

This section is designed to be used as a reference, rather than to be read as a narrative.

  • Introduction (Section 4.1)

Questions about causality in social research are often complex and intricate. For a foundational approach to causality based on causal graphs, see Pearl (2009), and for a foundational approach based on potential outcomes, see Imbens and Rubin (2015) (and the technical appendix in this chapter). For a comparison between these two approaches, see Morgan and Winship (2014). For a formal approach to defining a confounder, see VanderWeele and Shpitser (2013).

In the chapter, I created what seemed like a bright line between our ability to make causal estimates from experimental and non-experimental data. In reality, I think that the distinction is blurrier. For example, everyone accepts that smoking causes cancer even though we have never done a randomized controlled experiment that forces people to smoke. For excellent book-length treatments of making causal estimates from non-experimental data, see Rosenbaum (2002), Rosenbaum (2009), Shadish, Cook, and Campbell (2001), and Dunning (2012).

Chapters 1 and 2 of Freedman, Pisani, and Purves (2007) offer a clear introduction to the differences between experiments, controlled experiments, and randomized controlled experiments.

Manzi (2012) provides a fascinating and readable introduction to the philosophical and statistical underpinnings of randomized controlled experiments. It also provides interesting real-world examples of the power of experimentation in business.

  • What are experiments? (Section 4.2)

Casella (2008), Box, Hunter, and Hunter (2005), and Athey and Imbens (2016b) provide good introductions to the statistical aspects of experimental design and analysis. Further, there are excellent treatments of the use of experiments in many different fields: economics (Bardsley et al. 2009), sociology (Willer and Walker 2007; Jackson and Cox 2013), psychology (Aronson et al. 1989), political science (Morton and Williams 2010), and social policy (Glennerster and Takavarasha 2013).

The importance of participant recruitment (e.g., sampling) is often under-appreciated in experimental research. However, if the effect of the treatment is heterogeneous in the population, then sampling is critical. Longford (1999) makes this point clearly when he advocates that researchers think of experiments as population surveys with haphazard sampling.

  • Two dimensions of experiments: lab-field and analog-digital (Section 4.3)

The dichotomy that I presented between lab and field experiments is a bit simplified. In fact, other researchers have proposed more detailed typologies, in particular ones that separate the various forms of field experiments (Harrison and List 2004; Charness, Gneezy, and Kuhn 2013). Further, there are two other types of experiments performed by social scientists that don’t fit neatly into the lab and field dichotomy: survey experiments and social experiments. Survey experiments are experiments that use the infrastructure of existing surveys to compare responses to alternative versions of the same questions (some survey experiments are presented in Chapter 3); for more on survey experiments, see Mutz (2011). Social experiments are experiments where the treatment is some social policy that can only be implemented by a government. Social experiments are closely related to program evaluation. For more on policy experiments, see Orr (1998), Glennerster and Takavarasha (2013), and Heckman and Smith (1995).

A number of papers have compared lab and field experiments in the abstract (Falk and Heckman 2009; Cialdini 2009) and in terms of outcomes of specific experiments in political science (Coppock and Green 2015), economics (Levitt and List 2007a; Levitt and List 2007b; Camerer 2011; Al-Ubaydli and List 2013), and psychology (Mitchell 2012). Jerit, Barabas, and Clifford (2013) offer a nice research design for comparing results from lab and field experiments.

Concerns about participants changing their behavior because they know they are being closely observed are sometimes called demand effects, and they have been studied in psychology (Orne 1962) and economics (Zizzo 2009). Although mostly associated with lab experiments, these same issues can cause problems for field experiments as well. In fact, demand effects are also sometimes called Hawthorne effects, a term that derives from a field experiment, specifically the famous illumination experiments that began in 1924 at the Hawthorne Works of the Western Electric Company (Adair 1984; Levitt and List 2011). Both demand effects and Hawthorne effects are closely related to the idea of reactive measurement discussed in Chapter 2 (see also Webb et al. (1966)).

The history of field experiments has been described in economics (Levitt and List 2009), political science (Green and Gerber 2003; Druckman et al. 2006; Druckman and Lupia 2012), psychology (Shadish 2002), and public policy (Shadish and Cook 2009). One area of social science where field experiments quickly became prominent is international development. For a positive review of that work within economics see Banerjee and Duflo (2009), and for a critical assessment see Deaton (2010). For a review of this work in political science see Humphreys and Weinstein (2009). Finally, the ethical challenges involved with field experiments have been explored in political science (Humphreys 2015; Desposato 2016b) and development economics (Baele 2013).

In the chapter, I suggested that pre-treatment information can be used to improve the precision of estimated treatment effects, but there is some debate about this approach: Freedman (2008), Lin (2013), and Berk et al. (2013); see Bloniarz et al. (2016) for more information.

  • Moving beyond simple experiments (Section 4.4)

I’ve chosen to focus on three concepts: validity, heterogeneity of treatment effects, and mechanisms. These concepts have different names in different fields. For example, psychologists tend to move beyond simple experiments by focusing on mediators and moderators (Baron and Kenny 1986). The idea of mediators is captured by what I call mechanisms, and the idea of moderators is captured by what I call external validity (e.g., would the results of the experiment be different if it were run in different situations?) and heterogeneity of treatment effects (e.g., are the effects larger for some people than for others?).

The experiment of Schultz et al. (2007) shows how social theories can be used to design effective interventions. For a more general argument about the role of theory in designing effective interventions, see Walton (2014).

  • Validity (Section 4.4.1)

The concepts of internal and external validity were first introduced in Campbell (1957). See Shadish, Cook, and Campbell (2001) for a more detailed history and a careful elaboration of statistical conclusion validity, internal validity, construct validity, and external validity.

For an overview of issues related to statistical conclusion validity in experiments, see Gerber and Green (2012) (for a social science perspective) and Imbens and Rubin (2015) (for a statistical perspective). One issue of statistical conclusion validity that arises specifically in online field experiments is the need for computationally efficient methods for creating confidence intervals with dependent data (Bakshy and Eckles 2013).

Internal validity can be difficult to ensure in complex field experiments. See, for example, Gerber and Green (2000), Imai (2005), and Gerber and Green (2005) for debate about the implementation of a complex field experiment about voting. Kohavi et al. (2012) and Kohavi et al. (2013) provide an introduction to the challenges of internal validity in online field experiments.

One major concern with internal validity is problems with randomization. One way to potentially detect such problems is to compare the treatment and control groups on observable traits. This kind of comparison is called a balance check. See Hansen and Bowers (2008) for a statistical approach to balance checks, and see Mutz and Pemantle (2015) for concerns about balance checks. For example, using a balance check, Allcott (2011) found some evidence that the randomization was not implemented correctly in three of the Opower experiments (see Table 2; sites 2, 6, and 8). For other approaches, see Chapter 21 of Imbens and Rubin (2015).
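
Because a balance check is easy to script, here is a minimal sketch on simulated data; the covariate, sample size, and rule-of-thumb threshold are illustrative assumptions, not details from Allcott’s study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-treatment covariate (e.g., baseline energy use, in kWh)
# for 1,000 households, with treatment assigned by simple randomization.
baseline = rng.normal(loc=50.0, scale=10.0, size=1000)
treated = rng.permutation(np.repeat([False, True], 500))

# Balance check: under a correctly implemented randomization, the treatment
# and control groups should look similar on observable pre-treatment traits.
diff = baseline[treated].mean() - baseline[~treated].mean()
standardized_diff = diff / baseline.std(ddof=1)
print(f"standardized difference: {standardized_diff:.3f}")
# A large standardized difference (a common rule of thumb is |d| > 0.1)
# would be a warning sign that the randomization was broken.
```

In practice one would run this check for every available pre-treatment covariate, keeping in mind the concerns about such checks raised by Mutz and Pemantle (2015).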

Other major concerns related to internal validity are: 1) one-sided non-compliance, where not everyone in the treatment group actually receives the treatment, 2) two-sided non-compliance, where not everyone in the treatment group receives the treatment and some people in the control group receive the treatment, 3) attrition, where outcomes are not measured for some participants, and 4) interference, where the treatment spills over from people in the treatment condition to people in the control condition. See Chapters 5, 6, 7, and 8 of Gerber and Green (2012) for more on each of these issues.

For more on construct validity, see Westen and Rosenthal (2003); for more on construct validity in big data sources, see Lazer (2015) and Chapter 2 of this book.

One aspect of external validity is the setting where an intervention is tested. Allcott (2015) provides a careful theoretical and empirical treatment of site selection bias. This issue is also discussed in Deaton (2010). In addition to being replicated in many sites, the Home Energy Report intervention has also been independently studied by multiple research groups (e.g., Ayres, Raseman, and Shih (2013)).

  • Heterogeneity of treatment effects (Section 4.4.2)

For an excellent overview of heterogeneity of treatment effects in field experiments, see Chapter 12 of Gerber and Green (2012). For introductions to heterogeneity of treatment effects in medical trials, see Kent and Hayward (2007), Longford (1999), and Kravitz, Duan, and Braslow (2004). Analyses of heterogeneity of treatment effects generally focus on differences based on pre-treatment characteristics. If you are interested in heterogeneity based on post-treatment outcomes, then more complex approaches are needed, such as principal stratification (Frangakis and Rubin 2002); see Page et al. (2015) for a review.

Many researchers estimate the heterogeneity of treatment effects using linear regression, but newer methods rely on machine learning, for example Green and Kern (2012), Imai and Ratkovic (2013), Taddy et al. (2016), and Athey and Imbens (2016a).

There is some skepticism about findings of heterogeneity of effects because of multiple comparison problems and “fishing.” There are a variety of statistical approaches that can help address concerns about multiple comparison (Fink, McConnell, and Vollmer 2014; List, Shaikh, and Xu 2016). One approach to concerns about “fishing” is pre-registration, which is becoming increasingly common in psychology (Nosek and Lakens 2014), political science (Humphreys, Sierra, and Windt 2013; Monogan 2013; Anderson 2013; Gelman 2013; Laitin 2013), and economics (Olken 2015).
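
For readers who want a concrete starting point, here is a minimal sketch of one standard correction for multiple comparisons, the Holm step-down adjustment; the p-values below are made up for illustration:

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjustment, which controls the family-wise error
    rate when testing treatment effects in many subgroups."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), enforcing
        # monotonicity so adjusted p-values never decrease.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Hypothetical raw p-values from five subgroup analyses.
print(holm_adjust([0.004, 0.020, 0.300, 0.011, 0.650]))
```

Corrections like this and pre-registration are complements, not substitutes: the correction handles the tests you report, while pre-registration constrains which tests you run in the first place.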

In the study of Costa and Kahn (2013), only about half of the households in the experiment could be linked to the demographic information. Readers interested in the details and possible problems with this analysis should refer to the original paper.

  • Mechanisms (Section 4.4.3)

Mechanisms are incredibly important, but they turn out to be very difficult to study. Research about mechanisms is closely related to the study of mediators in psychology (but see also VanderWeele (2009) for a precise comparison between the two ideas). Statistical approaches to finding mechanisms, such as the approach developed in Baron and Kenny (1986), are quite common. Unfortunately, it turns out that those procedures depend on some strong assumptions (Bullock, Green, and Ha 2010) and suffer when there are multiple mechanisms, as one might expect in many situations (Imai and Yamamoto 2013; VanderWeele and Vansteelandt 2014). Imai et al. (2011) and Imai and Yamamoto (2013) offer some improved statistical methods. Further, VanderWeele (2015) offers a book-length treatment with a number of important results, including a comprehensive approach to sensitivity analysis.

A separate approach focuses on experiments that attempt to manipulate the mechanism directly (e.g., giving sailors vitamin C). Unfortunately, in many social science settings there are often multiple mechanisms and it is hard to design treatments that change one without changing the others. Some approaches to experimentally altering mechanisms are described in Imai, Tingley, and Yamamoto (2013), Ludwig, Kling, and Mullainathan (2011), and Pirlott and MacKinnon (2016).

Finally, mechanisms also have a long history in the philosophy of science as described by Hedström and Ylikoski (2010).

  • Using existing environments (Section

For more on the use of correspondence studies and audit studies to measure discrimination see Pager (2007).

  • Build your own experiment (Section

The most common way to recruit participants to experiments that you build is Amazon Mechanical Turk (MTurk). Because MTurk mimics aspects of traditional lab experiments—paying people to complete tasks that they would not do for free—many researchers have already begun using Turkers (the workers on MTurk) as participants in human subjects experiments resulting in faster and cheaper data collection than traditional on-campus laboratory experiments (Paolacci, Chandler, and Ipeirotis 2010; Horton, Rand, and Zeckhauser 2011; Mason and Suri 2012; Rand 2012; Berinsky, Huber, and Lenz 2012).

The biggest strengths of experiments with participants recruited from MTurk are logistical: they allow researchers to recruit participants quickly and as needed. Whereas lab experiments can take weeks to run and field experiments can take months to set up, experiments with participants recruited from MTurk can be run in days. For example, Berinsky, Huber, and Lenz (2012) were able to recruit 400 subjects in a single day to participate in an eight-minute experiment. Further, these participants can be recruited for virtually any purpose (including surveys and mass collaboration, as discussed in Chapters 3 and 5). This ease of recruitment means that researchers can run sequences of related experiments in rapid succession.

Before recruiting participants from MTurk for your own experiments, there are four important things to know. First, many researchers have a non-specific skepticism of experiments involving Turkers. Because this skepticism is not specific, it is hard to counter with evidence. However, after several years of studies using Turkers, we can now conclude that this skepticism is not especially warranted. There have been many studies comparing the demographics of Turkers to those of other populations and many studies comparing the results of experiments with Turkers to the results from other populations. Given all this work, I think that the best way to think about it is that Turkers are a reasonable convenience sample, much like students but slightly more diverse (Berinsky, Huber, and Lenz 2012). Thus, just as students are a reasonable population for some but not all experimental research, Turkers are a reasonable population for some but not all research. If you are going to work with Turkers, it makes sense to read many of these comparative studies and understand their nuances.

Second, researchers have developed best practices for increasing the internal validity of MTurk experiments, and you should learn about and follow them (Horton, Rand, and Zeckhauser 2011; Mason and Suri 2012). For example, researchers using Turkers are encouraged to use screeners to remove inattentive participants (Berinsky, Margolis, and Sances 2014; Berinsky, Margolis, and Sances 2016) (but see also D. J. Hauser and Schwarz (2015b) and D. J. Hauser and Schwarz (2015a)). If you don’t remove inattentive participants, then any effect of the treatment can be washed out by the noise they introduce, and in practice the number of inattentive participants can be substantial. In the experiment of Huber and colleagues (2012), about 30% of participants failed basic attention screeners. Another common problem with Turkers is non-naive participants (Chandler et al. 2015).

Third, relative to some other forms of digital experiments, MTurk experiments cannot scale; Stewart et al. (2015) estimate that at any given time there are only about 7,000 people on MTurk.

Finally, you should know that MTurk is a community with its own rules and norms (Mason and Suri 2012). In the same way that you would try to find out about the culture of a country where you were going to run your experiments, you should try to find out more about the culture and norms of Turkers (Salehi et al. 2015). And, you should know that the Turkers will be talking about your experiment if you do something inappropriate or unethical (Gray et al. 2016).

MTurk is an incredibly convenient way to recruit participants to your experiments, whether they are lab-like, such as Huber, Hill, and Lenz (2012), or more field-like, such as Mason and Watts (2009), Goldstein, McAfee, and Suri (2013), Goldstein et al. (2014), Horton and Zeckhauser (2016), and Mao et al. (2016).

  • Build your own product (Section

If you are thinking of trying to create your own product, I recommend that you read the advice offered by the MovieLens group in Harper and Konstan (2015). A key insight from their experience is that for each successful project there are many, many failures. For example, the MovieLens group launched other products, such as GopherAnswers, that were complete failures (Harper and Konstan 2015). Another example of a researcher failing while attempting to build a product is Edward Castronova’s attempt to build an online game called Arden. Despite $250,000 in funding, the project was a flop (Baker 2008). Projects like GopherAnswers and Arden are unfortunately much more common than projects like MovieLens. Finally, when I said that I did not know of any other researchers who had successfully built products for repeated experimentation, here are my criteria: 1) participants use the product because of what it provides them (e.g., they are not paid and they are not volunteers helping science) and 2) the product has been used for more than one distinct experiment (i.e., not the same experiment multiple times with different participant pools). If you know of other examples, please let me know.

  • Partner with the powerful (Section 4.5.2)

I’ve heard the idea of Pasteur’s Quadrant discussed frequently at tech companies, and it helps organize research efforts at Google (Spector, Norvig, and Petrov 2012).

Bond and colleagues’ study (2012) also attempts to detect the effect of these treatments on the friends of those who received them. Because of the design of the experiment, these spillovers are difficult to detect cleanly; interested readers should see Bond et al. (2012) for a more thorough discussion. This experiment is part of a long tradition of experiments in political science on efforts to encourage voting (Green and Gerber 2015). These get-out-the-vote experiments are common in part because they are in Pasteur’s Quadrant. That is, there are many people who are motivated to increase voting and voting can be an interesting behavior to test more general theories about behavior change and social influence.

Other researchers have provided advice about running field experiments with partner organizations such as political parties, NGOs, and businesses (Loewen, Rubenson, and Wantchekon 2010; List 2011; Gueron 2002). Others have offered advice about how partnerships with organizations can impact research designs (Green, Calfano, and Aronow 2014; King et al. 2007). Partnership can also lead to ethical questions (Humphreys 2015; Nickerson and Hyde 2016).

  • Design advice (Section 4.6)

If you are going to create an analysis plan before running your experiment, I suggest that you start by reading reporting guidelines. The CONSORT (Consolidated Standards of Reporting Trials) guidelines were developed in medicine (Schulz et al. 2010) and modified for social research (Mayo-Wilson et al. 2013). A related set of guidelines has been developed by the editors of the Journal of Experimental Political Science (Gerber et al. 2014) (see also Mutz and Pemantle (2015) and Gerber et al. (2015)). Finally, reporting guidelines have been developed in psychology (Group 2008); see also Simmons, Nelson, and Simonsohn (2011).

If you create an analysis plan you should consider pre-registering it because pre-registration will increase the confidence that others have in your results. Further, if you are working with a partner, it will limit your partner’s ability to change the analysis after seeing the results. Pre-registration is becoming increasingly common in psychology (Nosek and Lakens 2014), political science (Humphreys, Sierra, and Windt 2013; Monogan 2013; Anderson 2013; Gelman 2013; Laitin 2013), and economics (Olken 2015).

While creating your pre-analysis plan you should be aware that some researchers also use regression and related approaches to improve the precision of the estimated treatment effect, and there is some debate about this approach: Freedman (2008), Lin (2013), and Berk et al. (2013); see Bloniarz et al. (2016) for more information.

Design advice specifically for online field experiments is also presented in Konstan and Chen (2007) and Chen and Konstan (2015).

  • Create zero variable cost data (Section 4.6.1)

For more on the MusicLab experiments, see Salganik, Dodds, and Watts (2006), Salganik and Watts (2008), Salganik and Watts (2009b), Salganik and Watts (2009a), and Salganik (2007). For more on winner-take-all markets, see Frank and Cook (1996). For more on untangling luck and skill more generally, see Mauboussin (2012), Watts (2012), and Frank (2016).

There is another approach to eliminating participant payments that researchers should use with caution: conscription. In many online field experiments participants are basically drafted into experiments and never compensated. Examples of this approach include Restivo and van de Rijt’s (2012) experiment on rewards in Wikipedia and Bond and colleagues’ (2012) experiment on encouraging people to vote. These experiments don’t really have zero variable cost; rather, they have zero variable cost to researchers. Even though the cost of many of these experiments is extremely small for each participant, small costs imposed on an enormous number of participants can add up quickly. Researchers running massive online experiments often justify the importance of small estimated treatment effects by saying that these small effects can become important when applied to many people. The exact same thinking applies to costs that researchers impose on participants. If your experiment causes one million people to waste one minute, the experiment is not very harmful to any particular person, but in aggregate it has wasted almost two years of time.

Another approach to reducing the cost of paying participants is to use a lottery, an approach that has also been used in survey research (Halpern et al. 2011). Finally, for more about designing enjoyable user experiences, see Toomim et al. (2011).

  • Replace, Refine, and Reduce (Section 4.6.2)

Here are the original definitions of the three R’s, from Russell and Burch (1959):

“Replacement means the substitution for conscious living higher animals of insentient material. Reduction means reduction in the numbers of animals used to obtain information of a given amount and precision. Refinement means any decrease in the incidence or severity of inhumane procedures applied to those animals which still have to be used.”

The three R’s that I propose don’t override the ethical principles described in Chapter 6. Rather, they are a more elaborated version of one of those principles—beneficence—specifically for the setting of human experiments.

When interpreting the Emotional Contagion experiment, there are three non-ethical issues to keep in mind. First, it is not clear how the actual details of the experiment connect to the theoretical claims; in other words, there are questions about construct validity. It is not clear that the positive and negative word counts are actually a good indicator of the emotional state of participants because 1) it is not clear that the words that people post are a good indicator of their emotions and 2) it is not clear that the particular sentiment analysis technique that the researchers used is able to reliably infer emotions (Beasley and Mason 2015; Panger 2016). In other words, there might be a bad measure of a biased signal. Second, the design and analysis of the experiment tell us nothing about who was most impacted (i.e., there is no analysis of heterogeneity of treatment effects) or what the mechanism might be. In this case, the researchers had lots of information about the participants, but the participants were essentially treated as widgets in the analysis. Third, the effect size in this experiment was very small; the difference between the treatment and control conditions is about 1 in 1,000 words. In their paper, Kramer and colleagues make the case that an effect of this size is important because hundreds of millions of people access their News Feed each day. In other words, they argue that even effects that are small for each person can be large in aggregate. Even if you were to accept this argument, it is still not clear whether an effect of this size is important for the more general scientific question about emotional contagion. For more on the situations where small effects are important, see Prentice and Miller (1992).

In terms of the first R (Replacement), comparing the Emotional Contagion experiment (Kramer, Guillory, and Hancock 2014) and the emotional contagion natural experiment (Coviello et al. 2014) offers some general lessons about the trade-offs involved with moving from experiments to natural experiments (and other approaches like matching that attempt to approximate experiments in non-experimental data, see Chapter 2). In addition to the ethical benefits, switching from experimental to non-experimental studies also enables researchers to study treatments that they are logistically unable to deploy. These ethical and logistical benefits come at a cost, however. With natural experiments researchers have less control over things like recruitment of participants, randomization, and the nature of the treatment. For example, one limitation of rainfall as a treatment is that it both increases positivity and decreases negativity. In the experimental study, however, Kramer and colleagues were able to adjust positivity and negativity independently.

The particular approach used by Coviello et al. (2014) was further elaborated in Coviello, Fowler, and Franceschetti (2014). For an introduction to instrumental variables see Angrist and Pischke (2009) (less formal) or Angrist, Imbens, and Rubin (1996) (more formal). For a skeptical appraisal of instrumental variables see Deaton (2010), and for an introduction to instrumental variables with weak instruments (rain is a weak instrument), see Murray (2006).

More generally, a good introduction to natural experiments is Dunning (2012), and Rosenbaum (2002), Rosenbaum (2009), and Shadish, Cook, and Campbell (2001) offer good ideas about estimating causal effects without experiments.

In terms of the second R (Refinement), there are scientific and logistical trade-offs when considering changing the design of Emotional Contagion from blocking posts to boosting posts. For example, it may be the case that the technical implementation of the News Feed makes it substantially easier to do an experiment with blocking posts rather than an experiment with boosting posts (note that an experiment with blocking posts could be implemented as a layer on top of the News Feed system without any need for alterations of the underlying system). Scientifically, however, the theory addressed by the experiment did not clearly suggest one design over the other.

Unfortunately, I am not aware of substantial prior research about the relative merits of blocking and boosting content in the News Feed. Also, I have not seen much research about refining treatments to make them less harmful; one exception is Jones and Feamster (2015), which considers the case of measurement of Internet censorship (a topic I discuss in Chapter 6 in relationship to the Encore study (Burnett and Feamster 2015; Narayanan and Zevenbergen 2015)).

In terms of the third R (Reduction), a good introduction to traditional power analysis is Cohen (1988). Pre-treatment covariates can be included in the design stage and the analysis stage of experiments; Chapter 4 of Gerber and Green (2012) provides a good introduction to both approaches, and Casella (2008) provides a more in-depth treatment. Techniques that use this pre-treatment information in the randomization are typically called either blocked experimental designs or stratified experimental designs (the terminology is not used consistently across communities); these techniques are deeply related to the stratified sampling techniques discussed in Chapter 3. See Higgins, Sävje, and Sekhon (2016) for more on using these designs in massive experiments. Pre-treatment covariates can also be included in the analysis stage. McKenzie (2012) explores the difference-in-differences approach to analyzing field experiments in greater detail. See Carneiro, Lee, and Wilhelm (2016) for more on the trade-offs between different approaches to increasing precision in estimates of treatment effects. Finally, when deciding whether to include pre-treatment covariates at the design or analysis stage (or both), there are a few factors to consider. In a setting where researchers want to show that they are not “fishing” (Humphreys, Sierra, and Windt 2013), using pre-treatment covariates in the design stage can be helpful (Higgins, Sävje, and Sekhon 2016). In situations where participants arrive sequentially, especially online field experiments, using pre-treatment information in the design stage may be difficult logistically; see, for example, Xie and Aurisset (2016).
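
As a concrete illustration of the design-stage approach, here is a minimal sketch of a blocked randomization; the participant data and blocking variable are hypothetical:

```python
import random

def block_randomize(participants, block_key):
    """Assign treatment within blocks defined by a pre-treatment
    covariate, so that covariate is balanced by construction."""
    blocks = {}
    for person in participants:
        blocks.setdefault(block_key(person), []).append(person)
    assignment = {}
    for members in blocks.values():
        # Randomize within each block, splitting it roughly in half.
        random.shuffle(members)
        half = len(members) // 2
        for i, person in enumerate(members):
            assignment[person["id"]] = "treatment" if i < half else "control"
    return assignment

# Hypothetical participants blocked on prior energy use (high vs. low).
people = [{"id": i, "usage": "high" if i % 2 else "low"} for i in range(20)]
assignment = block_randomize(people, lambda person: person["usage"])
```

With sequentially arriving participants, as noted above, this is harder because the blocks cannot be formed in advance.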

It is worth adding a bit of intuition about why difference-in-differences can be so much more effective than difference-in-means. Many online outcomes have very high variance (see, e.g., Lewis and Rao (2015) and Lamb et al. (2015)) and are relatively stable over time. In this case, the change score will have substantially smaller variance, increasing the power of the statistical test. One reason this approach is not used more often is that prior to the digital age it was not common to have pre-treatment outcomes. A more concrete way to think about it is to imagine an experiment to measure whether a specific exercise routine causes weight loss. If you take a difference-in-means approach, your estimate will have variability that comes from the variability in weights in the population. If you take a difference-in-differences approach, however, that naturally occurring variation in weights is removed, and you can more easily detect a difference caused by the treatment.
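
This intuition can be checked with a small simulation in the spirit of the weight-loss example; all numbers below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Simulated weights (kg): high variance across people, stable over time.
pre = rng.normal(80.0, 15.0, size=n)
treated = rng.permutation(np.repeat([0.0, 1.0], n // 2))
true_effect = -1.0  # the exercise routine removes 1 kg, on average
post = pre + rng.normal(0.0, 2.0, size=n) + true_effect * treated

treated_mask = treated == 1.0
# Difference-in-means uses only the post-treatment outcomes.
dim = post[treated_mask].mean() - post[~treated_mask].mean()
# Difference-in-differences uses the change score, which removes the
# stable person-level variation in weight.
change = post - pre
did = change[treated_mask].mean() - change[~treated_mask].mean()

print(f"std of post-treatment weights: {post.std():.1f}")
print(f"std of change scores:          {change.std():.1f}")
print(f"difference-in-differences estimate: {did:.2f}")
```

Because the change scores have far less variance than the raw outcomes, the difference-in-differences estimate has a much smaller standard error for the same number of participants.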

One important way to reduce the number of participants in your experiment is to conduct a power analysis, which Kramer and colleagues could have done based on the effect sizes observed from the natural experiment by Coviello et al. (2014) or earlier non-experimental research by Kramer (2012) (in fact these are activities at the end of this chapter). Notice that this use of power analysis is a bit different than typical. In the analog age, researchers generally did power analysis to make sure that their study was not too small (i.e., under-powered). Now, however, researchers should do power analysis to make sure that their study is not too big (i.e., over-powered).
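
Here is a minimal sketch of such a power calculation for a two-sample comparison of means, using the usual normal approximation; the effect size and standard deviation are illustrative placeholders, not numbers from the actual studies:

```python
import math

def required_n_per_arm(effect_size, sd):
    """Sample size per arm for a two-sample comparison of means with
    two-sided alpha = 0.05 and power = 0.80 (normal approximation)."""
    z_alpha = 1.959964  # z for two-sided alpha = 0.05
    z_beta = 0.841621   # z for power = 0.80
    return math.ceil(2 * ((z_alpha + z_beta) * sd / effect_size) ** 2)

# Hypothetical inputs: a tiny expected effect relative to its spread.
print(required_n_per_arm(effect_size=0.001, sd=0.05))
```

Running the calculation in both directions makes the point in the text: a tiny anticipated effect demands an enormous sample, while a large anticipated effect means a small experiment suffices, so power analysis can guard against over-powered studies as well as under-powered ones.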

Finally, I considered adding a fourth R: Repurpose. That is, if researchers find themselves with more experimental data than they need to address their original research question, they should repurpose the data to ask new questions. For example, imagine that Kramer and colleagues had used a difference-in-differences estimator and found themselves with more data than needed to address their research question. Rather than not using the data to the fullest extent, they could have studied the size of the effect as a function of pre-treatment emotional expression. Just as Schultz et al. (2007) found that the effect of the treatment was different for light and heavy users, perhaps the effects of the News Feed were different for people who already tended to post happy (or sad) messages. Repurposing could lead to “fishing” (Humphreys, Sierra, and Windt 2013) and “p-hacking” (Simmons, Nelson, and Simonsohn 2011), but these concerns are largely addressable with a combination of honest reporting (Simmons, Nelson, and Simonsohn 2011), pre-registration (Humphreys, Sierra, and Windt 2013), and machine learning methods that attempt to avoid over-fitting.