• degree of difficulty: easy, medium, hard, very hard
  • requires math
  • requires coding
  • data collection
  • my favorites
  1. [hard, requires math] In the chapter, I was very positive about post-stratification. However, this does not always improve the quality of estimates. Construct a situation where post-stratification can decrease the quality of estimates. (For a hint, see Thomsen (1973).)

  2. [hard, data collection, requires coding] Design and conduct a non-probability survey on Amazon Mechanical Turk to ask about gun ownership and attitudes toward gun control. So that you can compare your estimates to those derived from a probability sample, please copy the question text and response options directly from a high-quality survey such as those run by the Pew Research Center.

    1. How long does your survey take? How much does it cost? How do the demographics of your sample compare with the demographics of the US population?
    2. What is the raw estimate of gun ownership using your sample?
    3. Correct for the nonrepresentativeness of your sample using post-stratification or some other technique. Now what is the estimate of gun ownership?
    4. How do your estimates compare with the latest estimate from a probability-based sample? What do you think explains the discrepancies, if there are any?
    5. Repeat parts 2-4 for attitudes toward gun control. How do your findings differ?
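As a starting point for parts 2 and 3, here is a minimal sketch of a raw estimate and a simple post-stratified estimate. The group names, the responses, and the population shares are all hypothetical placeholders, not real survey or census figures; in practice the shares would come from a source such as a census table, and the groups would be defined on more than one variable.

```python
from collections import defaultdict


def raw_estimate(responses):
    """Unweighted share of respondents who answered 1 (e.g., owns a gun)."""
    return sum(answer for _, answer in responses) / len(responses)


def post_stratified_estimate(responses, population_shares):
    """Weight each group's sample mean by that group's population share."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for group, answer in responses:
        totals[group] += answer
        counts[group] += 1
    empty = [g for g in population_shares if counts[g] == 0]
    if empty:
        raise ValueError(f"no respondents in groups: {empty}")
    return sum(share * totals[g] / counts[g]
               for g, share in population_shares.items())


# Hypothetical sample: (group, answer) pairs, where 1 means "owns a gun".
responses = (
    [("under_40", 1)] * 12 + [("under_40", 0)] * 48
    + [("40_plus", 1)] * 20 + [("40_plus", 0)] * 20
)

# Hypothetical population shares; they must sum to 1.
population_shares = {"under_40": 0.3, "40_plus": 0.7}

raw = raw_estimate(responses)
adjusted = post_stratified_estimate(responses, population_shares)
print(round(raw, 3), round(adjusted, 3))
```

In this made-up sample the younger group is overrepresented and reports lower ownership, so the adjusted estimate is higher than the raw one.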
  3. [very hard, data collection, requires coding] Goel and colleagues (2016) administered 49 multiple-choice attitudinal questions drawn from the General Social Survey (GSS) and select surveys by the Pew Research Center to a non-probability sample of respondents drawn from Amazon Mechanical Turk. They then adjusted for the non-representativeness of the data using model-based post-stratification and compared their adjusted estimates with those from the probability-based GSS and Pew surveys. Conduct the same survey on Amazon Mechanical Turk and try to replicate figure 2a and figure 2b by comparing your adjusted estimates with the estimates from the most recent rounds of the GSS and Pew surveys. (See appendix table A2 for the list of 49 questions.)

    1. Compare and contrast your results with those from Pew and GSS.
    2. Compare and contrast your results with those from the Mechanical Turk survey in Goel, Obeng, and Rothschild (2016).
  4. [medium, data collection, requires coding] Many studies use self-reported measures of mobile phone use. This is an interesting setting in which researchers can compare self-reported behavior with logged behavior (see, e.g., Boase and Ling (2013)). Two common behaviors to ask about are calling and texting, and two common time frames are “yesterday” and “in the past week.”

    1. Before collecting any data, which of the self-report measures do you think is more accurate? Why?
    2. Recruit five of your friends to be in your survey. Please briefly summarize how these five friends were sampled. Might this sampling procedure induce specific biases in your estimates?
    3. Ask them the following microsurvey questions:
    • “How many times did you use your mobile phone to call others yesterday?”
    • “How many text messages did you send yesterday?”
    • “How many times did you use your mobile phone to call others in the last seven days?”
    • “How many times did you use your mobile phone to send or receive text messages/SMS in the last seven days?”
    4. Once this microsurvey has been completed, ask to check their usage data as logged by their phone or service provider. How does self-reported usage compare to log data? Which measure is most accurate, and which is least accurate?
    5. Now combine the data that you have collected with the data from other people in your class (if you are doing this activity for a class). With this larger dataset, repeat part 4.
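One possible way to compare self-reports with logs is to compute, for each question, the mean signed error (which shows over- or under-reporting) and the mean absolute error (which shows overall inaccuracy). The sketch below uses made-up numbers for five respondents; the question keys and all values are hypothetical.

```python
def error_summary(self_reported, logged):
    """Summarize how far self-reports fall from logged counts."""
    if len(self_reported) != len(logged):
        raise ValueError("need one self-report and one log value per respondent")
    errors = [s - l for s, l in zip(self_reported, logged)]
    n = len(errors)
    return {
        # Positive means over-reporting on average, negative means under-reporting.
        "mean_error": sum(errors) / n,
        # Average size of the discrepancy, ignoring its direction.
        "mean_abs_error": sum(abs(e) for e in errors) / n,
    }


# Hypothetical data: (self-reported counts, logged counts) for five respondents.
survey = {
    "calls_yesterday": ([2, 0, 5, 1, 3], [2, 1, 4, 1, 3]),
    "texts_last_7_days": ([50, 100, 20, 70, 200], [64, 150, 25, 90, 310]),
}

summaries = {q: error_summary(sr, lg) for q, (sr, lg) in survey.items()}
for question, summary in summaries.items():
    print(question, summary)
```

Because weekly counts are much larger than daily counts, a relative error (error divided by the logged value) may be a fairer basis for comparing the two time frames.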
  5. [medium, data collection] Schuman and Presser (1996) argue that question orders would matter for two types of questions: part-part questions where two questions are at the same level of specificity (e.g., ratings of two presidential candidates); and part-whole questions where a general question follows a more specific question (e.g., asking “How satisfied are you with your work?” followed by “How satisfied are you with your life?”).

    They further characterize two types of question order effect: consistency effects occur when responses to a later question are brought closer (than they would otherwise be) to those given to an earlier question; contrast effects occur when there are greater differences between responses to two questions.

    1. Create a pair of part-part questions that you think will have a large question order effect; a pair of part-whole questions that you think will have a large order effect; and a pair of questions whose order you think would not matter. Run a survey experiment on Amazon Mechanical Turk to test your questions.
    2. How large a part-part effect were you able to create? Was it a consistency or contrast effect?
    3. How large a part-whole effect were you able to create? Was it a consistency or contrast effect?
    4. Was there a question order effect in your pair where you did not think the order would matter?
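A simple way to quantify a question order effect is the difference in mean responses to the same question between the two order conditions, together with a standard error. The sketch below assumes respondents were randomly assigned to one of two orders; the ratings are invented for illustration, and the unequal-variance standard error is one reasonable choice among several.

```python
from math import sqrt
from statistics import mean, variance


def order_effect(condition_a, condition_b):
    """Difference in means between two order conditions, with its standard error."""
    diff = mean(condition_a) - mean(condition_b)
    se = sqrt(variance(condition_a) / len(condition_a)
              + variance(condition_b) / len(condition_b))
    return diff, se


# Hypothetical 1-10 life-satisfaction ratings under the two question orders.
life_asked_first = [7, 8, 6, 7, 8, 7]
life_asked_after_work = [6, 6, 5, 7, 6, 6]

diff, se = order_effect(life_asked_first, life_asked_after_work)
# Rough 95% interval using a normal approximation (crude for samples this small).
low, high = diff - 1.96 * se, diff + 1.96 * se
print(round(diff, 2), round(se, 2), (round(low, 2), round(high, 2)))
```

Whether a nonzero difference is a consistency or a contrast effect depends on how responses to the second question move relative to responses to the first, so the responses to both questions need to be examined.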
  6. [medium, data collection] Building on the work of Schuman and Presser, Moore (2002) describes a separate dimension of question order effect: additive and subtractive effects. While contrast and consistency effects are produced as a consequence of respondents’ evaluations of the two items in relation to each other, additive and subtractive effects are produced when respondents are made more sensitive to the larger framework within which the questions are posed. Read Moore (2002), then design and run a survey experiment on MTurk to demonstrate additive or subtractive effects.

  7. [hard, data collection] Christopher Antoun and colleagues (2015) conducted a study comparing the convenience samples obtained from four different online recruiting sources: MTurk, Craigslist, Google AdWords and Facebook. Design a simple survey and recruit participants through at least two different online recruiting sources (these sources can be different from the four sources used in Antoun et al. (2015)).

    1. Compare the cost per recruit—in terms of money and time—between different sources.
    2. Compare the composition of the samples obtained from different sources.
    3. Compare the quality of data between the samples. For ideas about how to measure data quality from respondents, see Schober et al. (2015).
    4. What is your preferred source? Why?
  8. [medium] In an effort to predict the results of the 2016 EU Referendum (i.e., Brexit), YouGov—an Internet-based market research firm—conducted online polls of a panel of about 800,000 respondents in the United Kingdom.

    YouGov has published a detailed description of its statistical model. Roughly speaking, YouGov partitioned voters into types based on 2015 general election vote choice, age, qualifications, gender, and date of interview, as well as the constituency in which they lived. First, they used data collected from the YouGov panelists to estimate, among those who voted, the proportion of people of each voter type who intended to vote Leave. They estimated the turnout of each voter type by using the 2015 British Election Study (BES), a post-election face-to-face survey, which validated turnout from the electoral rolls. Finally, they estimated how many people there were of each voter type in the electorate, based on the latest Census and Annual Population Survey (with some additional information from other data sources).

    Three days before the vote, YouGov showed a two-point lead for Leave. On the eve of voting, the poll indicated that the result was too close to call (49/51 Remain). The final on-the-day study predicted 48/52 in favor of Remain. In fact, this estimate missed the final result (52/48 Leave) by four percentage points.

    1. Use the total survey error framework discussed in this chapter to assess what could have gone wrong.
    2. YouGov’s response after the election explained: “This seems in a large part due to turnout—something that we have said all along would be crucial to the outcome of such a finely balanced race. Our turnout model was based, in part, on whether respondents had voted at the last general election and a turnout level above that of general elections upset the model, particularly in the North.” Does this change your answer to part 1?
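The three-part estimate described above (vote intention by type, turnout by type, and the size of each type) can be combined into a single predicted Leave share. The sketch below is only an illustration of that arithmetic, not YouGov's actual model: the two voter types and every number in them are hypothetical. It shows how an error in the turnout estimates alone can move the prediction across 50%.

```python
def leave_share(voter_types):
    """Predicted Leave share among votes cast, summing over voter types."""
    leave_votes = sum(t["electorate"] * t["turnout"] * t["leave"]
                      for t in voter_types)
    votes_cast = sum(t["electorate"] * t["turnout"] for t in voter_types)
    return leave_votes / votes_cast


# Hypothetical voter types with turnout taken from a previous election.
modeled = [
    {"electorate": 1000, "turnout": 0.8, "leave": 0.4},
    {"electorate": 1000, "turnout": 0.5, "leave": 0.6},
]

# Same types, but the second type turns out far more than the model assumed.
actual = [
    {"electorate": 1000, "turnout": 0.8, "leave": 0.4},
    {"electorate": 1000, "turnout": 0.9, "leave": 0.6},
]

predicted = leave_share(modeled)
realized = leave_share(actual)
print(round(predicted, 3), round(realized, 3))
```

In this toy example the vote intentions within each type are estimated perfectly, yet the predicted winner is wrong because turnout in one type was underestimated.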
  9. [medium, requires coding] Write a simulation to illustrate each of the representation errors in figure 3.2.

    1. Create a situation where these errors actually cancel out.
    2. Create a situation where the errors compound each other.
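One possible skeleton for such a simulation is sketched below. It assumes that the representation errors in question are coverage error, sampling error, and nonresponse error, and it models a single binary outcome; the function name, parameter names, and all probabilities are hypothetical choices for illustration. Each error is measured as the change in the mean when moving from one stage to the next: target population, frame population, sample, respondents.

```python
import random


def simulate(cover_p=(0.9, 0.6), respond_p=(0.5, 0.8),
             n_pop=100_000, n_sample=5_000, prevalence=0.4, seed=0):
    """Return coverage, sampling, nonresponse, and total error for one draw.

    cover_p[y] and respond_p[y] are the probabilities that a person with
    outcome y (0 or 1) is in the sampling frame and responds, respectively.
    """
    rng = random.Random(seed)
    target = [1 if rng.random() < prevalence else 0 for _ in range(n_pop)]
    frame = [y for y in target if rng.random() < cover_p[y]]
    sample = rng.sample(frame, n_sample)
    respondents = [y for y in sample if rng.random() < respond_p[y]]

    def avg(values):
        return sum(values) / len(values)

    m_target, m_frame, m_sample, m_resp = map(
        avg, (target, frame, sample, respondents))
    return {
        "coverage_error": m_frame - m_target,
        "sampling_error": m_sample - m_frame,
        "nonresponse_error": m_resp - m_sample,
        "total_error": m_resp - m_target,
    }


# People with y = 1 are under-covered but respond more often: errors offset.
cancel = simulate(respond_p=(0.5, 0.8))
# People with y = 1 are under-covered and also respond less: errors compound.
compound = simulate(respond_p=(0.8, 0.5))

for name, result in (("cancel", cancel), ("compound", compound)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Running the simulation over many seeds, rather than the single draw shown here, would separate the systematic parts of each error from sampling noise.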
  10. [very hard, requires coding] The research of Blumenstock and colleagues (2015) involved building a machine learning model that could use digital trace data to predict survey responses. Now, you are going to try the same thing with a different dataset. Kosinski, Stillwell, and Graepel (2013) found that Facebook likes can predict individual traits and attributes. Surprisingly, these predictions can be even more accurate than those of friends and colleagues (Youyou, Kosinski, and Stillwell 2015).

    1. Read Kosinski, Stillwell, and Graepel (2013), and replicate figure 2. Their data are available at
    2. Now, replicate figure 3.
    3. Finally, try their model on your own Facebook data: How well does it work for you?
  11. [medium] Toole et al. (2015) used call detail records (CDRs) from mobile phones to predict aggregate unemployment trends.

    1. Compare and contrast the study design of Toole et al. (2015) with that of Blumenstock, Cadamuro, and On (2015).
    2. Do you think CDRs should replace traditional surveys, complement them or not be used at all for government policymakers to track unemployment? Why?
    3. What evidence would convince you that CDRs can completely replace traditional measures of the unemployment rate?