4.3 Two dimensions of experiments: lab-field and analog-digital

Lab experiments offer control, field experiments offer realism, and digital field experiments combine control and realism at scale.

Experiments come in many different shapes and sizes. In the past, researchers have found it helpful to organize experiments along a continuum between lab experiments and field experiments. Now, however, researchers should also organize experiments along a second continuum between analog experiments and digital experiments. This two-dimensional design space will help you understand the strengths and weaknesses of different approaches and highlight the areas of greatest opportunity (figure 4.1).

Figure 4.1: Schematic of design space for experiments. In the past, experiments varied along the lab-field dimension. Now, they also vary on the analog-digital dimension. This two-dimensional design space is illustrated by four experiments that I describe in this chapter. In my opinion, the area of greatest opportunity is digital field experiments.

One dimension along which experiments can be organized is the lab-field dimension. Many experiments in the social sciences are lab experiments where undergraduate students perform strange tasks in a lab for course credit. This type of experiment dominates research in psychology because it enables researchers to create highly controlled settings to precisely isolate and test specific theories about social behavior. For certain problems, however, something feels a bit strange about drawing strong conclusions about human behavior from such unusual people performing such unusual tasks in such an unusual setting. These concerns have led to a movement toward field experiments. Field experiments combine the strong design of randomized controlled experiments with more representative groups of participants performing more common tasks in more natural settings.

Although some people think of lab and field experiments as competing methods, it is best to think of them as complementary, with different strengths and weaknesses. For example, Correll, Benard, and Paik (2007) used both a lab experiment and a field experiment in an attempt to find the sources of the “motherhood penalty.” In the United States, mothers earn less money than childless women, even when comparing women with similar skills working in similar jobs. There are many possible explanations for this pattern, one of which is that employers are biased against mothers. (Interestingly, the opposite seems to be true for fathers: they tend to earn more than comparable childless men.) In order to assess possible bias against mothers, Correll and colleagues ran two experiments: one in the lab and one in the field.

First, in a lab experiment they told participants, who were college undergraduates, that a company was conducting an employment search for a person to lead its new East Coast marketing department. The students were told that the company wanted their help in the hiring process, and they were asked to review resumes of several potential candidates and to rate the candidates on a number of dimensions, such as their intelligence, warmth, and commitment to work. Further, the students were asked if they would recommend hiring the applicant and what they would recommend as a starting salary. Unbeknownst to the students, however, the resumes were specifically constructed to be similar except for one thing: some of them signaled motherhood (by listing involvement in a parent-teacher association) and some did not. Correll and colleagues found that the students were less likely to recommend hiring the mothers and that they offered them a lower starting salary. Further, through a statistical analysis of both the ratings and the hiring-related decisions, Correll and colleagues found that mothers’ disadvantages were largely explained by the fact that they were rated lower in terms of competence and commitment. Thus, this lab experiment allowed Correll and colleagues to measure a causal effect and provide a possible explanation for that effect.

Of course, one might be skeptical about drawing conclusions about the entire US labor market based on the decisions of a few hundred undergraduates who have probably never had a full-time job, let alone hired someone. Therefore, Correll and colleagues also conducted a complementary field experiment. They responded to hundreds of advertised job openings with fake cover letters and resumes. Similar to the materials shown to the undergraduates, some resumes signaled motherhood and some did not. Correll and colleagues found that mothers were less likely to get called back for interviews than equally qualified childless women. In other words, real employers making consequential decisions in a natural setting behaved much like the undergraduates. Did they make similar decisions for the same reason? Unfortunately, we don’t know. The researchers were not able to ask the employers to rate the candidates or explain their decisions.

This pair of experiments reveals a lot about lab and field experiments in general. Lab experiments offer researchers near-total control of the environment in which participants are making decisions. So, for example, in the lab experiment, Correll and colleagues were able to ensure that all the resumes were read in a quiet setting; in the field experiment, some of the resumes might not even have been read. Further, because participants in the lab setting know that they are being studied, researchers are often able to collect additional data that can help explain why participants are making their decisions. For example, Correll and colleagues asked participants in the lab experiment to rate the candidates on different dimensions. This kind of process data could help researchers understand the mechanisms behind differences in how participants treat the resumes.

On the other hand, these exact same characteristics that I have just described as advantages are also sometimes considered disadvantages. Researchers who prefer field experiments argue that participants in lab experiments could act very differently because they know that they are being studied. For example, in the lab experiment, participants might have guessed the goal of the research and altered their behavior so as not to appear biased. Further, researchers who prefer field experiments might argue that small differences in resumes can only stand out in a very clean, sterile lab environment, and thus the lab experiment will overestimate the effect of motherhood on real hiring decisions. Finally, many proponents of field experiments criticize lab experiments’ reliance on WEIRD participants: mainly students from Western, Educated, Industrialized, Rich, and Democratic countries (Henrich, Heine, and Norenzayan 2010a). The experiments by Correll and colleagues (2007) illustrate the two extremes on the lab-field continuum. In between these two extremes there are also a variety of hybrid designs, including approaches such as bringing non-students into a lab or going into the field but still having participants perform an unusual task.

In addition to the lab-field dimension that has existed in the past, the digital age means that researchers now have a second major dimension along which experiments can vary: analog-digital. Just as there are pure lab experiments, pure field experiments, and a variety of hybrids in between, there are pure analog experiments, pure digital experiments, and a variety of hybrids. It is tricky to offer a formal definition of this dimension, but a useful working definition is that fully digital experiments are experiments that make use of digital infrastructure to recruit participants, randomize, deliver treatments, and measure outcomes. For example, Restivo and van de Rijt’s (2012) study of barnstars and Wikipedia was a fully digital experiment because it used digital systems for all four of these steps. Likewise, fully analog experiments do not make use of digital infrastructure for any of these four steps. Many of the classic experiments in psychology are fully analog experiments. In between these two extremes, there are partially digital experiments that use a combination of analog and digital systems.
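
For concreteness, here is a minimal sketch, in Python, of the randomization step that sits at the core of a fully digital experiment. The participant IDs and the two-arm design are invented for illustration; this is not drawn from any of the studies discussed in this chapter.

```python
import random

def assign_treatment(participant_ids, seed=42):
    """Randomly assign each participant to 'treatment' or 'control'.

    Complete randomization: shuffle the IDs, then split the list in
    half so the two arms end up the same size.
    """
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {pid: ("treatment" if i < half else "control")
            for i, pid in enumerate(ids)}

# Invented participant IDs, purely for illustration
print(assign_treatment(["u001", "u002", "u003", "u004"]))
```

In a fully digital experiment, the other three steps (recruiting, delivering treatments, and measuring outcomes) would likewise be handled by digital systems wrapped around this kind of assignment logic.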

When some people think of digital experiments, they immediately think of online experiments. This is unfortunate because the opportunities to run digital experiments are not just online. Researchers can run partially digital experiments by using digital devices in the physical world in order to deliver treatments or measure outcomes. For example, researchers could use smartphones to deliver treatments or sensors in the built environment to measure outcomes. In fact, as we will see later in this chapter, researchers have already used home power meters to measure outcomes in experiments about energy consumption involving 8.5 million households (Allcott 2015). As digital devices become increasingly integrated into people’s lives and sensors become integrated into the built environment, these opportunities to run partially digital experiments in the physical world will increase dramatically. In other words, digital experiments are not just online experiments.

Digital systems create new possibilities for experiments everywhere along the lab-field continuum. In pure lab experiments, for example, researchers can use digital systems for finer measurement of participants’ behavior; one example of this type of improved measurement is eye-tracking equipment that provides precise and continuous measures of gaze location. The digital age also creates the possibility of running lab-like experiments online. For example, researchers have rapidly adopted Amazon Mechanical Turk (MTurk) to recruit participants for online experiments (figure 4.2). MTurk matches “employers” who have tasks that need to be completed with “workers” who wish to complete those tasks for money. Unlike traditional labor markets, however, the tasks involved usually require only a few minutes to complete, and the entire interaction between employer and worker is online. Because MTurk mimics aspects of traditional lab experiments—paying people to complete tasks that they would not do for free—it is naturally suited for certain types of experiments. Essentially, MTurk has created the infrastructure for managing a pool of participants—recruiting and paying people—and researchers have taken advantage of that infrastructure to tap into an always-available pool of participants.
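
As a rough illustration of how the recruiting step can be automated, the sketch below posts a task (a "HIT") to MTurk's requester sandbox using the boto3 AWS SDK for Python. It assumes AWS credentials are already configured, and the task title, reward amount, sample size, and experiment URL are all placeholders, not values from any real study.

```python
import boto3

# Point at MTurk's sandbox so no real money changes hands while testing.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An ExternalQuestion embeds your own experiment page inside the MTurk task.
# The URL below is a placeholder for wherever the experiment is hosted.
question_xml = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/my-experiment</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="Short decision-making study (about 5 minutes)",
    Description="Answer a few questions about hypothetical scenarios.",
    Reward="0.50",                    # payment per assignment, in US dollars
    MaxAssignments=100,               # number of participants to recruit
    LifetimeInSeconds=60 * 60 * 24,   # HIT stays visible for one day
    AssignmentDurationInSeconds=60 * 10,
    Question=question_xml,
)
print("HIT created:", hit["HIT"]["HITId"])
```

Once workers accept the HIT, they are sent to the researcher's own experiment page, where randomization, treatment delivery, and outcome measurement can all happen digitally.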

Figure 4.2: Papers published using data from Amazon Mechanical Turk (MTurk). MTurk and other online labor markets offer researchers a convenient way to recruit participants for experiments. Adapted from Bohannon (2016).

Digital systems create even more possibilities for field-like experiments. In particular, they enable researchers to combine the tight control and process data that are associated with lab experiments with the more diverse participants and more natural settings that are associated with field experiments. In addition, digital field experiments also offer three opportunities that tended to be difficult in analog experiments.

First, whereas most analog lab and field experiments have hundreds of participants, digital field experiments can have millions of participants. This change in scale is possible because some digital experiments can produce data at zero variable cost. That is, once researchers have created an experimental infrastructure, increasing the number of participants typically does not increase the cost. Increasing the number of participants by a factor of 100 or more is not just a quantitative change; it is a qualitative change, because it enables researchers to learn different things from experiments (e.g., heterogeneity of treatment effects) and to run entirely different experimental designs (e.g., large-group experiments). This point is so important that I’ll return to it toward the end of the chapter when I offer advice about creating digital experiments.
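
To see why scale is a qualitative change, consider the hypothetical simulation below: the treatment effect differs across two subgroups, and with a million participants the subgroup-level estimates are precise enough to reveal that heterogeneity, which a few hundred participants would leave buried in noise. All of the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000  # digital-scale sample, feasible at near-zero variable cost

df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),   # a pre-treatment subgroup
    "treated": rng.integers(0, 2, size=n),     # random assignment to arms
})
# Invented ground truth: treatment helps group A (+0.5) more than group B (+0.1).
true_effect = np.where(df["group"] == "A", 0.5, 0.1)
df["outcome"] = true_effect * df["treated"] + rng.normal(0, 2, size=n)

# Difference-in-means estimate of the treatment effect within each subgroup
for g, sub in df.groupby("group"):
    ate = (sub.loc[sub["treated"] == 1, "outcome"].mean()
           - sub.loc[sub["treated"] == 0, "outcome"].mean())
    print(f"group {g}: estimated effect = {ate:.3f}")
```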

Second, whereas most analog lab and field experiments treat participants as indistinguishable widgets, digital field experiments often use background information about participants in the design and analysis stages of the research. This background information, which is called pre-treatment information, is often available in digital experiments because they are run on top of always-on measurement systems (see chapter 2). For example, a researcher at Facebook has much more pre-treatment information about people in her digital field experiment than a university researcher has about the people in her analog field experiment. This pre-treatment information enables more efficient experimental designs—such as blocking (Higgins, Sävje, and Sekhon 2016) and targeted recruitment of participants (Eckles, Kizilcec, and Bakshy 2016)—and more insightful analysis—such as estimation of heterogeneity of treatment effects (Athey and Imbens 2016a) and covariate adjustment for improved precision (Bloniarz et al. 2016).
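
As one concrete example of these designs, here is a minimal sketch of blocked randomization: participants are grouped by a pre-treatment covariate and assignment is randomized within each block, which balances the two arms on that covariate by construction. The covariate and participant IDs are invented for the example.

```python
import random
from collections import defaultdict

def blocked_assignment(participants, seed=0):
    """Randomize within blocks defined by a pre-treatment covariate.

    `participants` maps participant ID -> block label (e.g., prior
    activity level). Within each block, half go to treatment and half
    to control, so the arms are balanced on the blocking covariate.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for pid, block in participants.items():
        blocks[block].append(pid)

    assignment = {}
    for block, ids in blocks.items():
        rng.shuffle(ids)
        half = len(ids) // 2
        for i, pid in enumerate(ids):
            assignment[pid] = "treatment" if i < half else "control"
    return assignment

# Invented example: block on users' prior activity level
participants = {"u1": "high", "u2": "high", "u3": "low", "u4": "low"}
print(blocked_assignment(participants))
```

Because the randomization happens separately within each block, a lucky draw can no longer put, say, all the highly active users into the treatment arm, which is the source of the efficiency gain.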

Third, whereas many analog lab and field experiments deliver treatments and measure outcomes in a relatively compressed amount of time, some digital field experiments happen over much longer timescales. For example, Restivo and van de Rijt’s experiment had the outcome measured daily for 90 days, and one of the experiments I’ll tell you about later in the chapter (Ferraro, Miranda, and Price 2011) tracked outcomes over three years at basically no cost. These three opportunities—size, pre-treatment information, and longitudinal treatment and outcome data—arise most commonly when experiments are run on top of always-on measurement systems (see chapter 2 for more on always-on measurement systems).

While digital field experiments offer many possibilities, they also share some weaknesses with both analog lab and analog field experiments. For example, experiments cannot be used to study the past, and they can only estimate the effects of treatments that can be manipulated. Also, although experiments are undoubtedly useful to guide policy, the exact guidance they can offer is somewhat limited because of complications such as environmental dependence, compliance problems, and equilibrium effects (Banerjee and Duflo 2009; Deaton 2010). Digital field experiments also magnify the ethical concerns created by field experiments—a topic I’ll address later in this chapter and in chapter 6.