Activities

degree of difficulty: easy, medium, hard, very hard
requires math ($requires math$)
requires coding
data collection
my favorites

[, ] Berinsky and colleagues (2012) evaluated MTurk in part by replicating three classic experiments. Replicate the classic Asian Disease framing experiment by Tversky and Kahneman (1981). Do your results match Tversky and Kahneman’s? Do your results match those of Berinsky and colleagues? What, if anything, does this teach us about using MTurk for survey experiments?
[, ] In a somewhat tongue-in-cheek paper titled “We Have to Break Up,” the social psychologist Robert Cialdini, one of the authors of Schultz et al. (2007), wrote that he was retiring early from his job as a professor, in part because of the challenges he faced doing field experiments in a discipline (psychology) that mainly conducts lab experiments (Cialdini 2009). Read Cialdini’s paper, and write him an email urging him to reconsider his break-up in light of the possibilities of digital experiments. Use specific examples of research that address his concerns.
[] In order to determine whether small initial successes lock in or fade away, van de Rijt and colleagues (2014) intervened in four different systems, bestowing success on randomly selected participants, and then measured the long-term impacts of this arbitrary success. Can you think of other systems in which you could run similar experiments? Evaluate these systems in terms of issues of scientific value, algorithmic confounding (see chapter 2), and ethics.
[, ] The results of an experiment can depend on the participants. Create an experiment and then run it on MTurk using two different recruitment strategies. Try to pick the experiment and recruitment strategies so that the results will be as different as possible. For example, your recruitment strategies could be to recruit participants in the morning and the evening or to compensate participants with high and low pay. These kinds of differences in recruitment strategy could lead to different pools of participants and different experimental outcomes. How different did your results turn out? What does that reveal about running experiments on MTurk?
[, $requires math$ , ] Imagine that you were planning the Emotional Contagion experiment (Kramer, Guillory, and Hancock 2014). Use the results from an earlier observational study by Kramer (2012) to decide the number of participants in each condition. These two studies don’t match perfectly so be sure to explicitly list all the assumptions that you make:
1. Run a simulation to determine how many participants would have been needed to detect an effect as large as the effect in Kramer (2012) with $\alpha = 0.05$ and $1 - \beta = 0.8$.
2. Do the same calculation analytically.
3. Given the results from Kramer (2012), was Emotional Contagion (Kramer, Guillory, and Hancock 2014) over-powered (i.e., did it have more participants than needed)?
4. Of the assumptions that you made, which have the biggest effect on your calculation?
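A minimal sketch of how parts 1 and 2 might be set up, assuming the outcome is compared across two equal-sized groups with a two-sided z-test and that the effect is expressed as a standardized mean difference `d`. The value `d = 0.2` is a placeholder, not the effect reported in Kramer (2012), and the function names `n_per_group` and `simulated_power` are illustrative.

```python
import math
import random
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Analytical per-group sample size for a two-sided, two-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

def simulated_power(n, d, alpha=0.05, reps=1000, seed=1):
    """Share of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(2 / n)  # standard error of the difference when sd = 1
    rejections = 0
    for _ in range(reps):
        control = sum(rng.gauss(0, 1) for _ in range(n)) / n
        treated = sum(rng.gauss(d, 1) for _ in range(n)) / n
        if abs(treated - control) / se > crit:
            rejections += 1
    return rejections / reps

# Placeholder effect size; replace with the value you derive from Kramer (2012).
print(n_per_group(0.2))
```

To answer part 1 by simulation, you could call `simulated_power` over a grid of `n` values and look for the smallest one at which the estimated power reaches 0.8, then compare that with the analytical result.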
[, $requires math$ , ] Answer the previous question again, but this time, rather than using the earlier observational study by Kramer (2012), use the results from an earlier natural experiment by Coviello et al. (2014).
[] Both Margetts et al. (2011) and van de Rijt et al. (2014) performed experiments studying the process of people signing a petition. Compare and contrast the designs and findings of these studies.
[] Dwyer, Maki, and Rothman (2015) conducted two field experiments on the relationship between social norms and pro-environmental behavior. Here’s the abstract of their paper:

“How might psychological science be utilized to encourage proenvironmental behavior? In two studies, interventions aimed at promoting energy conservation behavior in public bathrooms examined the influences of descriptive norms and personal responsibility. In Study 1, the light status (i.e., on or off) was manipulated before someone entered an unoccupied public bathroom, signaling the descriptive norm for that setting. Participants were significantly more likely to turn the lights off if they were off when they entered. In Study 2, an additional condition was included in which the norm of turning off the light was demonstrated by a confederate, but participants were not themselves responsible for turning it on. Personal responsibility moderated the influence of social norms on behavior; when participants were not responsible for turning on the light, the influence of the norm was diminished. These results indicate how descriptive norms and personal responsibility may regulate the effectiveness of proenvironmental interventions.”

Read their paper and design a replication of Study 1.
[, ] Building on the previous question, now carry out your design.
1. How do the results compare?
2. What might explain these differences?
[] There has been substantial debate about experiments using participants recruited from MTurk. In parallel, there has also been substantial debate about experiments using participants recruited from undergraduate student populations. Write a two-page memo comparing and contrasting Turkers and undergraduates as research participants. Your comparison should include a discussion of both scientific and logistical issues.
[] Jim Manzi’s book Uncontrolled (2012) is a wonderful introduction to the power of experimentation in business. In the book he relayed the following story:

“I was once in a meeting with a true business genius, a self-made billionaire who had a deep, intuitive understanding of the power of experiments. His company spent significant resources trying to create great store window displays that would attract consumers and increase sales, as conventional wisdom said they should. Experts carefully tested design after design, and in individual test review sessions over a period of years kept showing no significant causal effect of each new display design on sales. Senior marketing and merchandising executives met with the CEO to review these historical test results in toto. After presenting all of the experimental data, they concluded that the conventional wisdom was wrong, that window displays don’t drive sales. Their recommended action was to reduce costs and effort in this area. This dramatically demonstrated the ability of experimentation to overturn conventional wisdom. The CEO’s response was simple: ‘My conclusion is that your designers aren’t very good.’ His solution was to increase effort in store display design, and to get new people to do it.” (Manzi 2012, 158–9)

Which type of validity is the concern of the CEO?
[] Building on the previous question, imagine that you were at the meeting where the results of the experiments were discussed. What are four questions that you could ask—one for each type of validity (statistical, construct, internal, and external)?
[] Bernedo, Ferraro, and Price (2014) studied the seven-year effect of the water-saving intervention described in Ferraro, Miranda, and Price (2011) (see figure 4.11). In this paper, Bernedo and colleagues also sought to understand the mechanism behind the effect by comparing the behavior of households that have and have not moved after the treatment was delivered. That is, roughly, they tried to see whether the treatment impacted the home or the homeowner.
1. Read the paper, describe their design, and summarize their findings.
2. Do their findings impact how you should assess the cost-effectiveness of similar interventions? If so, why? If not, why not?
[] In a follow-up to Schultz et al. (2007), Schultz and colleagues performed a series of three experiments on the effect of descriptive and injunctive norms on a different environmental behavior (towel reuse) in two contexts (a hotel and a timeshare condominium) (Schultz, Khazian, and Zaleski 2008).
1. Summarize the design and findings of these three experiments.
2. How, if at all, do they change your interpretation of Schultz et al. (2007)?
[] In response to Schultz et al. (2007), Canfield, Bruin, and Wong-Parodi (2016) ran a series of lab-like experiments to study the design of electric bills. Here’s how they describe it in the abstract:

“In a survey-based experiment, each participant saw a hypothetical electricity bill for a family with relatively high electricity use, covering information about (a) historical use, (b) comparisons to neighbors, and (c) historical use with appliance breakdown. Participants saw all information types in one of three formats including (a) tables, (b) bar graphs, and (c) icon graphs. We report on three main findings. First, consumers understood each type of electricity-use information the most when it was presented in a table, perhaps because tables facilitate simple point reading. Second, preferences and intentions to save electricity were the strongest for the historical use information, independent of format. Third, individuals with lower energy literacy understood all information less.”

Unlike other follow-up studies, the main outcome of interest in Canfield, Bruin, and Wong-Parodi (2016) is reported behavior, not actual behavior. What are the strengths and weaknesses of this type of study in a broader research program promoting energy savings?
[, ] Smith and Pell (2003) presented a satirical meta-analysis of studies demonstrating the effectiveness of parachutes. They concluded:

“As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.”

Write an op-ed suitable for a general-readership newspaper, such as the New York Times, arguing against the fetishization of experimental evidence. Provide specific, concrete examples. Hint: See also Deaton (2010) and Bothwell et al. (2016).
[, , ] Difference-in-differences estimators of a treatment effect can be more precise than difference-of-means estimators. Write a memo to an engineer in charge of A/B testing at a start-up social media company explaining the value of the difference-in-differences approach for running an online experiment. The memo should include a statement of the problem, some intuition about the conditions under which the difference-in-differences estimator will outperform the difference-of-means estimator, and a simple simulation study.
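The simulation study requested in the memo could start from something like the following sketch. It assumes that each person has a pre-treatment and a post-treatment outcome with unit variance and correlation `rho`, that treatment adds a constant `tau` to the post-treatment outcome, and that the two groups are the same size. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import pstdev

def draw_group(rng, n, rho, effect):
    """Return (mean post-treatment outcome, mean pre-to-post change) for n people."""
    post_total = change_total = 0.0
    for _ in range(n):
        pre = rng.gauss(0, 1)
        # post has unit variance and correlation rho with pre
        post = rho * pre + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) + effect
        post_total += post
        change_total += post - pre
    return post_total / n, change_total / n

def estimator_sds(rho, n=200, tau=0.1, reps=500, seed=0):
    """Standard deviation of each estimator across repeated simulated experiments."""
    rng = random.Random(seed)
    dm, did = [], []
    for _ in range(reps):
        t_post, t_change = draw_group(rng, n, rho, tau)
        c_post, c_change = draw_group(rng, n, rho, 0.0)
        dm.append(t_post - c_post)        # difference-of-means estimate
        did.append(t_change - c_change)   # difference-in-differences estimate
    return pstdev(dm), pstdev(did)

for rho in (0.2, 0.9):
    sd_dm, sd_did = estimator_sds(rho)
    print(f"rho={rho}: sd(DM)={sd_dm:.3f}, sd(DiD)={sd_did:.3f}")
```

Running the loop over a finer grid of `rho` values, and plotting the two standard deviations against `rho`, is one way to give the engineer a picture of when each estimator is preferable.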
[, ] Gary Loveman was a professor at Harvard Business School before becoming the CEO of Harrah’s, one of the largest casino companies in the world. When he moved to Harrah’s, Loveman transformed the company with a frequent-flier-like loyalty program that collected tremendous amounts of data about customer behavior. On top of this always-on measurement system, the company began running experiments. For example, they might run an experiment to evaluate the effect of a coupon for a free hotel night for customers with a specific gambling pattern. Here’s how Loveman described the importance of experimentation to Harrah’s everyday business practices:

“It’s like you don’t harass women, you don’t steal, and you’ve got to have a control group. This is one of the things that you can lose your job for at Harrah’s—not running a control group.” (Manzi 2012, 146)

Write an email to a new employee explaining why Loveman thinks it is so important to have a control group. You should try to include an example—either real or made up—to illustrate your point.
[, $requires math$ ] A new experiment aims to estimate the effect of receiving text message reminders on vaccination uptake. One hundred and fifty clinics, each with 600 eligible patients, are willing to participate. There is a fixed cost of $100 for each clinic you want to work with, and it costs $1 for each text message that you want to send. Further, any clinics that you are working with will measure the outcome (whether someone received a vaccination) for free. Assume that you have a budget of $1,000.
1. Under what conditions might it be better to focus your resources on a small number of clinics and under what conditions might it be better to spread them more widely?
2. What factors would determine the smallest effect size that you will be able to reliably detect with your budget?
3. Write a memo explaining these trade-offs to a potential funder.
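Before weighing the statistical trade-offs, it can help to list which designs the budget allows at all. The sketch below enumerates them using the costs stated in the activity; the constant and function names are illustrative, and the code says nothing about statistical power, which depends on assumptions you will need to make yourself (for example, how much vaccination rates vary between clinics).

```python
BUDGET = 1_000             # total budget in dollars
CLINIC_FEE = 100           # fixed cost per participating clinic
MESSAGE_COST = 1           # cost per text message
PATIENTS_PER_CLINIC = 600  # eligible patients in each clinic
MAX_CLINICS = 150          # clinics willing to participate

def feasible_designs():
    """List (clinics, max messages) pairs that fit within the budget."""
    designs = []
    for clinics in range(1, MAX_CLINICS + 1):
        remaining = BUDGET - clinics * CLINIC_FEE
        if remaining <= 0:
            break  # nothing left to pay for any messages
        messages = min(remaining // MESSAGE_COST, clinics * PATIENTS_PER_CLINIC)
        designs.append((clinics, messages))
    return designs

designs = feasible_designs()
for clinics, messages in designs:
    unmessaged = clinics * PATIENTS_PER_CLINIC - messages
    print(f"{clinics} clinics: up to {messages} messages, {unmessaged} patients unmessaged")
```

Because outcomes are measured for free in any participating clinic, patients who are not sent a message could serve as a comparison group if you randomize within clinics; whether that design is appropriate is one of the trade-offs the memo should discuss.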
[, $requires math$ ] A major problem with online courses is attrition: many students who start courses end up dropping out. Imagine that you are working at an online learning platform, and a designer at the platform has created a visual progress bar that she thinks will help prevent students from dropping out of the course. You want to test the effect of the progress bar on students in a large computational social science course. After addressing any ethical issues that might arise in the experiment, you and your colleagues get worried that the course might not have enough students to reliably detect the effects of the progress bar. In the following calculations, you can assume that half of the students will receive the progress bar and half not. Further, you can assume that there is no interference. In other words, you can assume that participants are only affected by whether they received the treatment or control; they are not affected by whether other people received the treatment or control (for a more formal definition, see chapter 8 of Gerber and Green (2012)). Keep track of any additional assumptions that you make.
1. Suppose the progress bar is expected to increase the proportion of students who finish the class by 1 percentage point; what is the sample size needed to reliably detect the effect?
2. Suppose the progress bar is expected to increase the proportion of students who finish the class by 10 percentage points; what is the sample size needed to reliably detect the effect?
3. Now imagine that you have run the experiment, and students who have completed all the course materials have taken a final exam. When you compare the final exam scores of students who received the progress bar with the scores of those who didn’t, you find, much to your surprise, that students who did not receive the progress bar actually scored higher. Does this mean that the progress bar caused students to learn less? What can you learn from this outcome data? (Hint: See chapter 7 of Gerber and Green (2012))
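For parts 1 and 2, one common starting point is the normal-approximation formula for comparing two proportions, $n = (z_{1-\alpha/2} + z_{1-\beta})^2 [p_0(1-p_0) + p_1(1-p_1)] / (p_1 - p_0)^2$ per group. The sketch below applies it. The baseline completion rate of 10% is an assumption (the activity does not give one), as are the choices of $\alpha = 0.05$ and power of 0.8; the function name is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group_proportions(p0, p1, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect p1 versus p0 (two-sided test)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance_sum = p0 * (1 - p0) + p1 * (1 - p1)
    return math.ceil(z_total ** 2 * variance_sum / (p1 - p0) ** 2)

BASELINE = 0.10  # assumed completion rate without the progress bar

n_small_effect = n_per_group_proportions(BASELINE, BASELINE + 0.01)
n_large_effect = n_per_group_proportions(BASELINE, BASELINE + 0.10)
print(n_small_effect, n_large_effect)
```

Trying several values of `BASELINE` shows how sensitive the answer is to this assumption, which is worth recording alongside your other assumptions.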
[, , ] Imagine that you are working as a data scientist at a tech company. Someone from the marketing department asks for your help in evaluating an experiment that they are planning in order to measure the return on investment (ROI) for a new online ad campaign. ROI is defined as the net profit from the campaign divided by the cost of the campaign. For example, a campaign that had no effect on sales would have an ROI of -100%; a campaign where profits generated were equal to costs would have an ROI of 0%; and a campaign where profits generated were double the cost would have an ROI of 100%.

Before launching the experiment, the marketing department provides you with the following information based on their earlier research (in fact, these values are typical of the real online ad campaigns reported in Lewis and Rao (2015)):
- Sales per customer follow a log-normal distribution with a mean of $7 and a standard deviation of $75.
- The campaign is expected to increase sales by $0.35 per customer, which corresponds to an increase in profit of $0.175 per customer.
- The planned size of the experiment is 200,000 people: half in the treatment group and half in the control group.
- The cost of the campaign is $0.14 per participant.
- The expected ROI for the campaign is 25% [ $(0.175 - 0.14)/0.14$ ]. In other words, the marketing department believes that for each $100 spent on marketing, the company will earn an additional $25 in profit.
Write a memo evaluating this proposed experiment. Your memo should use evidence from a simulation that you create, and it should address two major issues: (1) Would you recommend launching this experiment as planned? If so, why? If not, why not? Be sure to be clear about the criteria that you are using to make this decision. (2) What sample size would you recommend for this experiment? Again please be sure to be clear about the criteria that you are using to make this decision.

A good memo will address this specific case; a better memo will generalize from this case in one way (e.g., show how the decision changes as a function of the size of the effect of the campaign); and a great memo will present a fully generalized result. Your memo should use graphs to help illustrate your results.

Here are two hints. First, the marketing department might have provided you with some unnecessary information, and they might have failed to provide you with some necessary information. Second, if you are using R, be aware that the rlnorm() function does not work the way that many people expect.
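To illustrate the second hint: R's `rlnorm()` takes `meanlog` and `sdlog`, the mean and standard deviation of the logarithm of the variable, not of the variable itself. Python's `random.lognormvariate(mu, sigma)` is parameterized the same way. The sketch below converts a desired mean and standard deviation on the dollar scale to log-scale parameters using the standard log-normal moment formulas; the helper name `lognormal_params` is illustrative.

```python
import math
import random

def lognormal_params(mean, sd):
    """Convert a mean and sd on the original scale to (mu, sigma) on the log scale."""
    sigma2 = math.log(1 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2
    return mu, math.sqrt(sigma2)

mu, sigma = lognormal_params(7, 75)

# Check the conversion against the closed-form moments of a log-normal.
implied_mean = math.exp(mu + sigma ** 2 / 2)
implied_sd = implied_mean * math.sqrt(math.exp(sigma ** 2) - 1)
print(round(implied_mean, 2), round(implied_sd, 2))

# Draw simulated sales per customer with the converted parameters.
rng = random.Random(42)
sales = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]
```

Because this distribution is heavily right-skewed, the sample mean and standard deviation of `sales` can be far from 7 and 75 even with many draws, which is itself relevant to the memo.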

This activity will give you practice with power analysis, creating simulations, and communicating your results with words and graphs. It should help you conduct power analysis for any kind of experiment, not just experiments designed to estimate ROI. This activity assumes that you have some experience with statistical testing and power analysis. If you are not familiar with power analysis, I recommend that you read “A Power Primer” by Cohen (1992).

This activity was inspired by a lovely paper by Lewis and Rao (2015), which vividly illustrates a fundamental statistical limitation of even massive experiments. Their paper, which originally had the provocative title “On the Near-Impossibility of Measuring the Returns to Advertising,” shows how difficult it is to measure the return on investment of online ads, even with digital experiments involving millions of customers. More generally, Lewis and Rao (2015) illustrate a fundamental statistical fact that is particularly important for digital-age experiments: it is hard to estimate small treatment effects amidst noisy outcome data.
[, $requires math$ ] Do the same as the previous question, but, rather than simulation, you should use analytical results.
[, $requires math$ , ] Do the same as the previous question, but use both simulation and analytical results.
[, $requires math$ , ] Imagine that you have written the memo described above, and someone from the marketing department provides one piece of new information: they expect a 0.4 correlation between sales before and after the experiment. How does this change the recommendations in your memo? (Hint: see section 4.6.2 for more on the difference-of-means estimator and the difference-in-differences estimator.)
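A sketch of the intuition, under simplifying assumptions that are not stated in the activity: pre-treatment and post-treatment sales have the same variance $\sigma^2$ and correlation $\rho$, and each group has $n$ people. Writing $\widehat{\tau}_{DM}$ for the difference-of-means estimator and $\widehat{\tau}_{DiD}$ for the difference-in-differences estimator (notation introduced here for illustration):

$$
\operatorname{Var}(\widehat{\tau}_{DM}) = \frac{2\sigma^2}{n},
\qquad
\operatorname{Var}(\widehat{\tau}_{DiD}) = \frac{2 \cdot 2\sigma^2 (1 - \rho)}{n} = \frac{4\sigma^2 (1 - \rho)}{n}
$$

Under these assumptions, the difference-in-differences estimator has the smaller variance only when $\rho > 1/2$, so it is worth checking on which side of that threshold a correlation of 0.4 falls before revising your memo.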
[, $requires math$ ] In order to evaluate the effectiveness of a new web-based employment-assistance program, a university conducted a randomized controlled trial among 10,000 students entering their final year of school. A free subscription with unique log-in information was sent through an exclusive email invitation to 5,000 randomly selected students, while the other 5,000 students were in the control group and did not have a subscription. Twelve months later, a follow-up survey (with no nonresponse) showed that in both the treatment and control groups, 70% of the students had secured full-time employment in their chosen field (table 4.6). Thus, it seemed that the web-based service had no effect.

However, a clever data scientist at the university looked at the data a bit more closely and found that only 20% of the students in the treatment group ever logged into the account after receiving the email. Further, and somewhat surprisingly, among those who did log into the website, only 60% had secured full-time employment in their chosen field, which was lower than the rate for people who didn’t log in and lower than the rate for people in the control condition (table 4.7).
1. Provide an explanation for what might have happened.
2. What are two different ways to calculate the effect of the treatment in this experiment?
3. Given this result, should the university provide this service to all students? Just to be clear, this is not a question with a simple answer.
4. What should they do next?
Hint: This question goes beyond the material covered in this chapter, but addresses issues common in experiments. This type of experimental design is sometimes called an encouragement design because participants are encouraged to engage in the treatment. This problem is an example of what is called one-sided noncompliance (see chapter 5 of Gerber and Green (2012)).
[] After further examination, the experiment described in the previous question turned out to be even more complicated: 10% of the people in the control group paid for access to the service, and they ended up with an employment rate of 65% (table 4.8).
1. Write an email summarizing what you think is happening and recommend a course of action.
Hint: This question goes beyond the material covered in this chapter, but addresses issues common in experiments. This problem is an example of what is called two-sided noncompliance (see chapter 6 of Gerber and Green (2012)).

Table 4.6: Simple View of Data from the Career Services Experiment
Group	Size	Employment rate
Granted access to website	5,000	70%
Not granted access to website	5,000	70%

Table 4.7: More Complete View of Data from the Career Services Experiment
Group	Size	Employment rate
Granted access to website and logged in	1,000	60%
Granted access to website and never logged in	4,000	72.5%
Not granted access to website	5,000	70%

Table 4.8: Full View of Data from the Career Services Experiment
Group	Size	Employment rate
Granted access to website and logged in	1,000	60%
Granted access to website and never logged in	4,000	72.5%
Not granted access to website and paid for it	500	65%
Not granted access to website and did not pay for it	4,500	70.56%
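As a starting point for the two noncompliance questions, the sketch below pools the subgroup rates in tables 4.7 and 4.8 and computes two quantities discussed in Gerber and Green (2012): the intent-to-treat effect and the ratio of that effect to the difference in take-up between groups. Treating "logged in" and "paid for access" as the same kind of take-up is an assumption, as are the exclusion and monotonicity conditions that the ratio needs in order to have a causal interpretation. The function names are illustrative.

```python
def weighted_rate(groups):
    """Pooled employment rate from (size, rate) pairs."""
    total = sum(size for size, _ in groups)
    return sum(size * rate for size, rate in groups) / total

def wald_estimate(y_treat, y_control, d_treat, d_control):
    """Intent-to-treat effect on the outcome divided by the effect on take-up."""
    return (y_treat - y_control) / (d_treat - d_control)

# (size, employment rate) pairs from tables 4.7 and 4.8
treatment = [(1000, 0.60), (4000, 0.725)]
control = [(500, 0.65), (4500, 0.7056)]

y_t = weighted_rate(treatment)
y_c = weighted_rate(control)
itt = y_t - y_c  # intent-to-treat effect on employment

# One-sided noncompliance (table 4.7): 20% take-up in treatment, none in control.
cace_one_sided = wald_estimate(y_t, 0.70, 1000 / 5000, 0.0)

# Two-sided noncompliance (table 4.8): 10% of the control group also used the service.
cace_two_sided = wald_estimate(y_t, y_c, 1000 / 5000, 500 / 5000)

print(f"ITT: {itt:.4f}")
print(f"One-sided: {cace_one_sided:.4f}, two-sided: {cace_two_sided:.4f}")
```

These numbers are only summaries; the substance of your answers lies in explaining why students who used the service had lower employment rates than those who did not, and in judging whether the assumptions behind each estimate are plausible here.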