## Activities

Key:

• degree of difficulty: easy, medium, hard, very hard
• requires math
• requires coding
• data collection
• my favorites
1. [, ] Berinsky and colleagues (2012) evaluated Mechanical Turk in part by replicating three classic experiments. Replicate the classic Asian Disease framing experiment by Tversky and Kahneman (1981). Do your results match Tversky and Kahneman’s? Do your results match those of Berinsky and colleagues? What, if anything, does this teach us about using Mechanical Turk for survey experiments?

2. [, ] In a somewhat tongue-in-cheek paper titled “We Have to Break Up,” the social psychologist Robert Cialdini, one of the authors of Schultz et al. (2007), wrote that he was retiring early from his job as a professor, in part because of the challenges he faced doing field experiments in a discipline (psychology) that mainly conducts lab experiments (Cialdini 2009). Read Cialdini’s paper, and write him an email urging him to reconsider his break-up in light of the possibilities of digital experiments. Use specific examples of research that address his concerns.

3. [] In order to determine whether small initial successes lock in or fade away, van de Rijt and colleagues (2014) intervened in four different systems, bestowing success on randomly selected participants, and then measured the long-term impacts of this arbitrary success. Can you think of other systems in which you could run similar experiments? Evaluate these systems in terms of scientific value, algorithmic confounding (see Chapter 2), and ethics.

4. [, ] The results of an experiment can depend on the participants. Create an experiment and then run it on Amazon Mechanical Turk (MTurk) using two different recruitment strategies. Try to pick the experiment and recruitment strategies so that the results will be as different as possible. For example, your recruitment strategies could be to recruit participants in the morning and the evening or to compensate participants with high and low pay. These kinds of differences in recruitment strategy could lead to different pools of participants and different experimental outcomes. How different did your results turn out? What does that reveal about running experiments on MTurk?

5. [, , , ] Imagine that you were planning the Emotional Contagion study (Kramer, Guillory, and Hancock 2014). Use the results from an earlier observational study by Kramer (2012) to decide the number of participants in each condition. These two studies don’t match perfectly, so be sure to explicitly list all the assumptions that you make:

1. Run a simulation that will decide how many participants would have been needed to detect an effect as large as the effect in Kramer (2012) with $$\alpha = 0.05$$ and $$1 - \beta = 0.8$$.
2. Do the same calculation analytically.
3. Given the results from Kramer (2012), was Emotional Contagion (Kramer, Guillory, and Hancock 2014) over-powered (i.e., did it have more participants than needed)?
4. Of the assumptions that you made, which have the biggest effect on your calculation?
6. [, , , ] Answer the question above, but rather than using the earlier observational study by Kramer (2012) use the results from an earlier natural experiment by Coviello et al. (2014).
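
The simulation step in the two questions above can be sketched as follows. This is a minimal illustration under stated assumptions, not a solution: outcomes are assumed to be normally distributed with equal variance in both conditions, and the standardized effect of 0.2 used in the usage note is a hypothetical placeholder rather than a value taken from Kramer (2012) or Coviello et al. (2014). The function names are likewise illustrative.

```python
import random
from statistics import NormalDist


def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def simulated_power(n_per_group, effect, sd=1.0, alpha=0.05, n_sims=500, seed=0):
    """Share of simulated experiments in which a two-sided z-test rejects the null.

    `effect` is the assumed true difference in means (hypothetical input).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        m_c, v_c = mean_and_var(control)
        m_t, v_t = mean_and_var(treated)
        se = (v_c / n_per_group + v_t / n_per_group) ** 0.5
        if abs((m_t - m_c) / se) > z_crit:
            rejections += 1
    return rejections / n_sims


def smallest_n(effect, target_power=0.8, start=50, **kwargs):
    """Coarse search: double n until simulated power reaches the target."""
    n = start
    while simulated_power(n, effect, **kwargs) < target_power:
        n *= 2
    return n
```

Calling `smallest_n(effect=0.2)` returns a coarse upper bound on the per-condition sample size for the placeholder effect; a finer search between the last two values tried would tighten it. Replacing the normal outcome model with one closer to the real outcome is one of the assumptions worth listing explicitly.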

7. [] Both van de Rijt et al. (2014) and Margetts et al. (2011) perform experiments that study the process of people signing a petition. Compare and contrast the design and findings of these studies.

8. [] Dwyer, Maki, and Rothman (2015) conducted two field experiments on the relationship between social norms and proenvironmental behavior. Here’s the abstract of their paper:

“How might psychological science be utilized to encourage proenvironmental behavior? In two studies, interventions aimed at promoting energy conservation behavior in public bathrooms examined the influences of descriptive norms and personal responsibility. In Study 1, the light status (i.e., on or off) was manipulated before someone entered an unoccupied public bathroom, signaling the descriptive norm for that setting. Participants were significantly more likely to turn the lights off if they were off when they entered. In Study 2, an additional condition was included in which the norm of turning off the light was demonstrated by a confederate, but participants were not themselves responsible for turning it on. Personal responsibility moderated the influence of social norms on behavior; when participants were not responsible for turning on the light, the influence of the norm was diminished. These results indicate how descriptive norms and personal responsibility may regulate the effectiveness of proenvironmental interventions.”

Read their paper and design a replication of study 1.

9. [, ] Building on the previous question, now carry out your design.

1. How do the results compare?
2. What might explain these differences?
10. [] There has been substantial debate about experiments using participants recruited from Amazon Mechanical Turk. In parallel, there has also been substantial debate about experiments using participants recruited from undergraduate student populations. Write a two-page memo comparing and contrasting Turkers and undergraduates as research participants. Your comparison should include a discussion of both scientific and logistical issues.

11. [] Jim Manzi’s book Uncontrolled (2012) is a wonderful introduction into the power of experimentation in business. In the book he relayed this story:

“I was once in a meeting with a true business genius, a self-made billionaire who had a deep, intuitive understanding of the power of experiments. His company spent significant resources trying to create great store window displays that would attract consumers and increase sales, as conventional wisdom said they should. Experts carefully tested design after design, and in individual test review sessions over a period of years kept showing no significant causal effect of each new display design on sales. Senior marketing and merchandising executives met with the CEO to review these historical test results in toto. After presenting all of the experimental data, they concluded that the conventional wisdom was wrong—that window displays don’t drive sales. Their recommended action was to reduce costs and effort in this area. This dramatically demonstrated the ability of experimentation to overturn conventional wisdom. The CEO’s response was simple: ‘My conclusion is that your designers aren’t very good.’ His solution was to increase effort in store display design, and to get new people to do it.” (Manzi 2012, 158–9)

Which type of validity is the concern of the CEO?

12. [] Building on the previous question, imagine that you were at the meeting where the results of the experiments were discussed. What are four questions that you could ask, one for each type of validity (statistical, construct, internal, and external)?

13. [] Bernedo, Ferraro, and Price (2014) study the seven-year effect of the water-saving intervention described in Ferraro, Miranda, and Price (2011) (see Figure 4.10). In this paper, Bernedo and colleagues also seek to understand the mechanism behind the effect by comparing the behavior of households that have and have not moved after the treatment was delivered. That is, roughly, they try to see whether the treatment impacted the home or the homeowner.

1. Read the paper, describe their design, and summarize their findings.
2. Do their findings impact how you should assess the cost-effectiveness of similar interventions? If so, why? If not, why not?
14. [] In a follow-up to Schultz et al. (2007), Schultz and colleagues perform a series of three experiments on the effect of descriptive and injunctive norms on a different environmental behavior (towel reuse) in two contexts (a hotel and a timeshare condominium) (Schultz, Khazian, and Zaleski 2008).

1. Summarize the design and findings of these three experiments.
2. How, if at all, do they change your interpretation of Schultz et al. (2007)?
15. [] In response to Schultz et al. (2007), Canfield, Bruin, and Wong-Parodi (2016) ran a series of lab-like experiments to study the design of electric bills. Here’s how they describe it in the abstract:

“In a survey-based experiment, each participant saw a hypothetical electricity bill for a family with relatively high electricity use, covering information about (a) historical use, (b) comparisons to neighbors, and (c) historical use with appliance breakdown. Participants saw all information types in one of three formats including (a) tables, (b) bar graphs, and (c) icon graphs. We report on three main findings. First, consumers understood each type of electricity-use information the most when it was presented in a table, perhaps because tables facilitate simple point reading. Second, preferences and intentions to save electricity were the strongest for the historical use information, independent of format. Third, individuals with lower energy literacy understood all information less.”

Unlike other follow-up studies, the main outcome of interest in Canfield, Bruin, and Wong-Parodi (2016) is reported behavior, not actual behavior. What are the strengths and weaknesses of this type of study in a broader research program promoting energy savings?

16. [, ] Smith and Pell (2003) is a satirical meta-analysis of studies demonstrating the effectiveness of parachutes. They conclude:

“As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.”

Write an op-ed suitable for a general readership newspaper, such as The New York Times, arguing against the fetishization of experimental evidence. Provide specific, concrete examples. Hint: See also Bothwell et al. (2016) and Deaton (2010).

17. [, , ] Difference-in-differences estimators of a treatment effect can be more precise than difference-in-means estimators. Write a memo to an engineer in charge of A/B testing at a start-up social media company explaining the value of the difference-in-differences approach for running an online experiment. The memo should include a statement of the problem, some intuition about the conditions under which the difference-in-differences estimator will outperform the difference-in-means estimator, and a simple simulation study.
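
One possible skeleton for the simulation study is sketched below. It assumes, purely for illustration, that pre-treatment and post-treatment outcomes are standard normal with correlation `rho` and that the treatment adds a constant `effect`; none of these choices comes from the text, and the function names are hypothetical. Under these assumptions the variance of the change score is $$2\sigma^2(1-\rho)$$, so the difference-in-differences estimator is more precise only when $$\rho > 0.5$$.

```python
import random


def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulate_estimators(n_per_group, rho, effect=0.0, n_sims=400, seed=1):
    """Return (variance of difference-in-means, variance of
    difference-in-differences) across simulated experiments."""
    rng = random.Random(seed)
    dim_estimates, did_estimates = [], []
    for _ in range(n_sims):
        group_means = {}
        for treated in (0, 1):
            post_sum = 0.0
            change_sum = 0.0
            for _ in range(n_per_group):
                pre = rng.gauss(0.0, 1.0)
                noise = rng.gauss(0.0, 1.0)
                # post has unit variance and correlation rho with pre
                post = rho * pre + (1 - rho ** 2) ** 0.5 * noise + effect * treated
                post_sum += post
                change_sum += post - pre
            group_means[treated] = (post_sum / n_per_group, change_sum / n_per_group)
        dim_estimates.append(group_means[1][0] - group_means[0][0])
        did_estimates.append(group_means[1][1] - group_means[0][1])
    return variance(dim_estimates), variance(did_estimates)
```

Running `simulate_estimators(100, rho)` over a grid of `rho` values and plotting the two variances against `rho` gives the kind of figure the memo could use to show where each estimator is preferable.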

18. [, ] Gary Loveman was a professor at Harvard Business School before becoming the CEO of Harrah’s, one of the largest casino companies in the world. When he moved to Harrah’s, Loveman transformed the company with a frequent flier-like loyalty program that collected tremendous amounts of data about customer behavior. On top of this always-on measurement system, the company began running experiments. For example, they might run an experiment to evaluate the effect of a coupon for a free hotel night for customers with a specific gambling pattern. Here’s how Loveman described the importance of experimentation to Harrah’s everyday business practices:

“It’s like you don’t harass women, you don’t steal, and you’ve got to have a control group. This is one of the things that you can lose your job for at Harrah’s—not running a control group.” (Manzi 2012, 146)

Write an email to a new employee explaining why Loveman thinks it is so important to have a control group. You should try to include an example—either real or made up—to illustrate your point.

19. [, ] A new experiment aims to estimate the effect of receiving text message reminders on vaccination uptake. 150 clinics, each with 600 eligible patients, are willing to participate. There is a fixed cost of 100 dollars for each clinic you want to work with, and it costs 1 dollar for each text message that you want to send. Further, any clinics that you are working with will measure the outcome (whether someone received a vaccination) for free. Assume that you have a budget of 1000 dollars.

1. Under what conditions might it be better to focus your resources on a small number of clinics and under what conditions might it be better to spread them more widely?
2. What factors would determine the smallest effect size that you will be able to reliably detect with your budget?
3. Write a memo explaining these trade-offs to a potential funder.
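
As a starting point for the trade-off, the budget constraint can be enumerated directly. The sketch below assumes that every participating clinic incurs the fixed cost (even if it serves only as a control) and that patients who are not texted act as free controls; both are readings of the question, not facts stated in it, and the names are illustrative.

```python
BUDGET = 1000               # dollars, from the question
CLINIC_FIXED_COST = 100     # dollars per participating clinic
COST_PER_MESSAGE = 1        # dollars per text message
PATIENTS_PER_CLINIC = 600
MAX_CLINICS = 150


def feasible_designs():
    """List (number of clinics, maximum number of messages) pairs within budget."""
    designs = []
    for clinics in range(1, MAX_CLINICS + 1):
        remaining = BUDGET - clinics * CLINIC_FIXED_COST
        if remaining <= 0:
            break  # no money left for any messages
        # Cannot text more patients than the chosen clinics have
        messages = min(remaining // COST_PER_MESSAGE, clinics * PATIENTS_PER_CLINIC)
        designs.append((clinics, messages))
    return designs


for clinics, messages in feasible_designs():
    print(f"{clinics} clinic(s): up to {messages} messages")
```

The enumeration makes the tension concrete: each additional clinic removes 100 messages from the budget, so the question becomes whether between-clinic variation or within-clinic sample size matters more for precision.
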
20. [, ] A major problem with online courses is attrition: many students who start courses end up dropping out. Imagine that you are working at an online learning platform, and a designer at the platform has created a visual progress bar that she thinks will help prevent students from dropping out of the course. You want to test the effect of the progress bar on students in a large computational social science course. After addressing any ethical issues that might arise in the experiment, you and your colleagues get worried that the course might not have enough students to reliably detect the effects of the progress bar. In the calculations below you can assume that half of the students will receive the progress bar and half will not. Further, you can assume that there is no interference. In other words, you can assume that participants are only affected by whether they received the treatment or control; they are not affected by whether other people received the treatment or control (for a more formal definition, see Gerber and Green (2012), Ch. 8). Please keep track of any additional assumptions that you make.

1. Suppose the progress bar is expected to increase the proportion of students who finish the class by 1 percentage point. What sample size is needed to reliably detect the effect?
2. Suppose the progress bar is expected to increase the proportion of students who finish the class by 10 percentage points. What sample size is needed to reliably detect the effect?
3. Now imagine that you have run the experiment and students who have completed all the course materials have taken a final exam. When you compare the final exam scores of students who received the progress bar to those who didn’t, you find, much to your surprise, that students who did not receive the progress bar actually scored higher. Does this mean that the progress bar caused students to learn less? What can you learn from this outcome data? (Hint: See Gerber and Green (2012), Ch. 7)
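
For the two sample-size parts, one common analytic approximation is the normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test with $$\alpha = 0.05$$ and power of 0.8, and it requires a baseline completion rate that the question does not provide; the 10% baseline in the usage note is a hypothetical value that you would need to replace and justify.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control, lift, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect `lift` (in proportion units)
    over a baseline completion rate `p_control` (an assumed input)."""
    p_treat = p_control + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances in the two groups
    var_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * var_sum / lift ** 2)
```

For example, `n_per_group(0.10, 0.01)` and `n_per_group(0.10, 0.10)` give the per-group sizes under the hypothetical 10% baseline. Because the required size scales with the inverse square of the lift, a tenfold smaller effect needs on the order of a hundred times more students.
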
21. [, ] In a lovely paper, Lewis and Rao (2015) vividly illustrate a fundamental statistical limitation of even massive experiments. The paper, which originally had the provocative title “On the Near-impossibility of Measuring the Returns to Advertising,” shows how difficult it is to measure the return on investment of online ads, even with digital experiments involving millions of customers. More generally, the paper clearly shows that it is hard to estimate small treatment effects amidst noisy outcome data. Or stated differently, the paper shows that estimated treatment effects will have large confidence intervals when the impact-to-standard-deviation ratio ($$\frac{\delta \bar{y}}{\sigma}$$) is small. The important general lesson from this paper is that results from experiments with a small impact-to-standard-deviation ratio (e.g., ROI of ad campaigns) will be unsatisfying. Your challenge will be to write a memo to someone in the marketing department of your company evaluating a planned experiment to measure the ROI of an ad campaign. Your memo should be supported with graphs of the results of computer simulations.

Here’s some background information that you might need. All of these numerical values are typical of the real experiments reported in Lewis and Rao (2015):

• ROI, a key metric for online ad campaigns, is defined to be the net profit from the campaign (gross profit from campaign minus cost of campaign) divided by the cost of the campaign. For example, a campaign that had no effect on sales would have an ROI of -100%, and a campaign where profits generated were equal to costs would have an ROI of 0%.

• the mean sales per customer is \$7 with a standard deviation of \$75.

• the campaign is expected to increase sales by \$0.35 per customer, which corresponds to an increase in profit of \$0.175 per customer. In other words, the gross margin is 50%.

• the planned size of the experiment is 200,000 people, half in the treatment group and half in the control group.

• the cost of the campaign is \$0.14 per participant.

Write a memo evaluating this experiment. Would you recommend launching this experiment as planned? If so, why? If not, what changes would you recommend?

A good memo will address this specific case; a better memo will generalize from this case in one way (e.g., show how the decision changes as a function of the impact-to-standard-deviation ratio); and a great memo will present a fully generalized result.
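
A simulation for this specific case might start from the sketch below. It uses the numbers in the background list but makes two simplifying assumptions that are not in the text: group means are drawn directly from their approximate normal sampling distribution (rather than simulating 200,000 individual, likely skewed, sales figures), and the \$0.14 cost is treated as a cost per treated customer. Names such as `simulate_roi_estimates` are illustrative.

```python
import random

MEAN_SALES, SD_SALES = 7.0, 75.0   # dollars per customer, from the background list
SALES_LIFT, MARGIN = 0.35, 0.5     # expected lift in sales and gross margin
COST_PER_PERSON = 0.14             # assumed to apply per treated customer
N_PER_GROUP = 100_000


def simulate_roi_estimates(n_per_group=N_PER_GROUP, n_sims=2000, seed=2):
    """Sorted ROI estimates from repeated simulated experiments."""
    rng = random.Random(seed)
    se_mean = SD_SALES / n_per_group ** 0.5  # standard error of one group mean
    estimates = []
    for _ in range(n_sims):
        control_mean = rng.gauss(MEAN_SALES, se_mean)
        treated_mean = rng.gauss(MEAN_SALES + SALES_LIFT, se_mean)
        profit_lift = MARGIN * (treated_mean - control_mean)
        estimates.append((profit_lift - COST_PER_PERSON) / COST_PER_PERSON)
    return sorted(estimates)


estimates = simulate_roi_estimates()
lower = estimates[int(0.025 * len(estimates))]
upper = estimates[int(0.975 * len(estimates)) - 1]
true_roi = (MARGIN * SALES_LIFT - COST_PER_PERSON) / COST_PER_PERSON
print(f"true ROI: {true_roi:.0%}; central 95% of estimates: {lower:.0%} to {upper:.0%}")
```

Comparing the spread of the simulated estimates with the true ROI, and repeating the exercise for other values of `n_per_group` or `SD_SALES`, is one way to move from the specific case toward the more general result the question asks for.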

22. [, ] Do the same as the previous question, but rather than simulation you should use analytic results.

23. [, , ] Do the same as the previous question, but use both simulation and analytic results.

24. [, , ] Imagine that you have written the memo described above (using simulation, analytic results, or both) and someone from the marketing department recommends using a difference-in-differences estimator rather than a difference-in-means estimator (see Section 4.6.2). Write a new, shorter memo explaining how a 0.4 correlation between sales before the experiment and sales after the experiment would alter your conclusion.

25. [, ] In order to evaluate the effectiveness of a new web-based career service, a university career services office conducted a randomized controlled trial among 10,000 students entering their final year of school. A free subscription with unique log-in information was sent through an exclusive email invitation to 5,000 randomly selected students, while the other 5,000 students were in the control group and did not have a subscription. Twelve months later, a follow-up survey (with no non-response) showed that in both the treatment and control groups, 70% of the students had secured full-time employment in their chosen field (Table 4.5). Thus, it seems that the web-based service had no effect.

However, a clever data scientist at the university looked at the data a bit more closely and found that only 20% of the students in the treatment group ever logged into the account after receiving the email. Further, and somewhat surprisingly, among those who logged into the website only 60% had secured full-time employment in their chosen field, which was lower than the rate for people who didn’t log in and lower than the rate for people in the control condition (Table 4.6).

1. Provide an explanation for what might have happened.
2. What are two different ways to calculate the effect of the treatment in this experiment?
3. Given this result, should the university career service provide this web-based career service to all students? Just to be clear, this is not a question with a simple answer.
4. What should they do next?

Hint: This question goes beyond the material covered in this chapter, but addresses issues common in experiments. This type of experimental design is sometimes called an encouragement design because participants are encouraged to engage in the treatment. This problem is an example of what is called one-sided non-compliance (see Gerber and Green (2012), Ch. 5).
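
Two quantities that often appear in analyses of encouragement designs are the intent-to-treat effect (ITT) and the complier average causal effect (CACE). The sketch below computes both from the rates given in the question text. The CACE formula shown, ITT divided by the share of compliers, is only valid under additional assumptions (for example, that the email invitation affects employment solely through use of the service), which you would need to state and defend. The function names are illustrative.

```python
def intent_to_treat(rate_assigned_treatment, rate_assigned_control):
    """Difference in outcomes by assignment, ignoring who actually logged in."""
    return rate_assigned_treatment - rate_assigned_control


def complier_average_causal_effect(itt, share_compliers):
    """ITT rescaled by the share of people who took up the treatment.
    Assumes one-sided non-compliance and the exclusion restriction."""
    return itt / share_compliers


def rate_among_non_users(overall_rate, share_users, rate_among_users):
    """Back out the outcome rate for people in the treatment group who never logged in."""
    return (overall_rate - share_users * rate_among_users) / (1 - share_users)


itt = intent_to_treat(0.70, 0.70)                       # rates from the question
cace = complier_average_causal_effect(itt, 0.20)        # 20% logged in
non_user_rate = rate_among_non_users(0.70, 0.20, 0.60)  # 60% among those who logged in
print(itt, cace, round(non_user_rate, 3))
```

Note that neither estimate compares students who logged in with students who did not; that naive comparison mixes the effect of the service with differences between the kinds of students who choose to log in.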

26. [] After further examination, it turns out that the experiment described in the previous question was even more complicated. It turns out that 10% of the people in the control group paid for access to the service, and they ended up with an employment rate of 65% (Table 4.7).

1. Write an email summarizing what you think is happening and recommend a course of action.

Hint: This question goes beyond the material covered in this chapter, but addresses issues common in experiments. This problem is an example of what is called two-sided non-compliance (see Gerber and Green (2012), Ch. 6).

Table 4.5: Simple view of data from the career services experiment.
Group Size Employment rate