• degree of difficulty: easy, medium, hard, very hard
  • requires math
  • requires coding
  • data collection
  • my favorites
  1. [hard, requires math] In the chapter, I was very positive about post-stratification. However, it does not always improve the quality of estimates. Construct a situation where post-stratification can decrease the quality of estimates. (For a hint, see Thomsen (1973).)
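Part of the intuition behind Thomsen (1973) can be seen in a toy simulation: when the stratifying variable is unrelated to the outcome and the cells are small, post-stratification only adds noise. A minimal sketch in Python, with all numbers invented for illustration:

```python
import random
import statistics

random.seed(0)

POP_MEAN = 0.0                         # the outcome is unrelated to the strata
N_STRATA = 10
POP_SHARE = [1 / N_STRATA] * N_STRATA  # equal population shares
N_SAMPLE = 30                          # small sample -> tiny, noisy cells

def one_draw():
    # each respondent gets a random stratum and an outcome independent of it
    strata = [random.randrange(N_STRATA) for _ in range(N_SAMPLE)]
    ys = [random.gauss(POP_MEAN, 1.0) for _ in range(N_SAMPLE)]
    raw = statistics.mean(ys)
    # post-stratified estimate: weight each cell mean by its population share,
    # renormalizing over the strata that actually appear in the sample
    cells = {}
    for s, y in zip(strata, ys):
        cells.setdefault(s, []).append(y)
    seen_share = sum(POP_SHARE[s] for s in cells)
    post = sum(POP_SHARE[s] * statistics.mean(v)
               for s, v in cells.items()) / seen_share
    return raw, post

draws = [one_draw() for _ in range(2000)]
mse_raw = statistics.mean((r - POP_MEAN) ** 2 for r, _ in draws)
mse_post = statistics.mean((p - POP_MEAN) ** 2 for _, p in draws)
print(f"MSE of raw mean:     {mse_raw:.4f}")
print(f"MSE post-stratified: {mse_post:.4f}")
```

Here the post-stratified estimator has the larger mean squared error: because the strata carry no information about the outcome, re-weighting tiny cells only adds variance.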

  2. [hard, data collection, requires coding] Design and conduct a non-probability survey on Amazon MTurk to ask about gun ownership (“Do you, or does anyone in your household, own a gun, rifle or pistol? Is that you or someone else in your household?”) and attitudes towards gun control (“What do you think is more important–to protect the right of Americans to own guns, or to control gun ownership?”).

    1. How long does your survey take? How much does it cost? How do the demographics of your sample compare to the demographics of the U.S. population?
    2. What is the raw estimate of gun ownership using your sample?
    3. Correct for the non-representativeness of your sample using post-stratification or some other technique. Now what is the estimate of gun ownership?
    4. How do your estimates compare to the latest estimate from the Pew Research Center? What do you think explains the discrepancies, if there are any?
    5. Repeat parts 2–4 for attitudes toward gun control. How do your findings differ?
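The post-stratification correction in part 3 can be sketched in a few lines. All cell counts, ownership rates, and population shares below are invented for illustration; in the actual activity they would come from your MTurk sample and Census-style population figures:

```python
# hypothetical population shares by age group (illustrative only)
pop_share = {"18-29": 0.21, "30-49": 0.34, "50-64": 0.25, "65+": 0.20}

# invented sample data: (number of respondents, gun-ownership rate) per cell;
# MTurk samples typically over-represent younger respondents
sample = {
    "18-29": (300, 0.15),
    "30-49": (450, 0.25),
    "50-64": (180, 0.35),
    "65+":   (70,  0.40),
}

n_total = sum(n for n, _ in sample.values())
raw = sum(n * rate for n, rate in sample.values()) / n_total
# post-stratified estimate: weight each cell's rate by its population share
post = sum(pop_share[g] * rate for g, (n, rate) in sample.items())
print(f"raw estimate:             {raw:.3f}")
print(f"post-stratified estimate: {post:.3f}")
```

With these made-up numbers, re-weighting toward the (older) population raises the estimate; with your real data the direction and size of the correction are an empirical question.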
  3. [very hard, data collection, requires coding] Goel and colleagues (2016) administered a non-probability survey on Amazon MTurk consisting of 49 multiple-choice attitudinal questions drawn from the General Social Survey (GSS) and selected surveys by the Pew Research Center. They then adjusted for the non-representativeness of the data using model-based post-stratification (Mr. P.) and compared the adjusted estimates with those from the probability-based GSS/Pew surveys. Conduct the same survey on MTurk and try to replicate their Figure 2a and Figure 2b by comparing your adjusted estimates with the estimates from the most recent rounds of the GSS/Pew surveys (see their Appendix Table A2 for the list of 49 questions).

    1. Compare and contrast your results to the results from Pew and GSS.
    2. Compare and contrast your results to the results from the MTurk survey in Goel, Obeng, and Rothschild (2016).
  4. [medium, data collection, requires coding] Many studies use self-report measures of mobile phone activity data. This is an interesting setting where researchers can compare self-reported behavior with logged behavior (see e.g., Boase and Ling (2013)). Two common behaviors to ask about are calling and texting, and two common time frames are “yesterday” and “in the past week.”

    1. Before collecting any data, which of the self-report measures do you think is more accurate? Why?
    2. Recruit 5 of your friends to be in your survey. Please briefly summarize how these 5 friends were sampled. Might this sampling procedure induce specific biases in your estimates?
    3. Please ask them the following micro-survey:
    • “How many times did you use your mobile phone to call others yesterday?”
    • “How many text messages did you send yesterday?”
    • “How many times did you use your mobile phone to call others in the last seven days?”
    • “How many times did you use your mobile phone to send or receive text messages/SMS in the last seven days?”

    Once the survey is complete, ask them to check their usage data as logged by their phone or service provider.
    4. How does self-reported usage compare to the logged data? Which self-report measure is most accurate, and which is least accurate?
    5. Now combine the data that you have collected with the data from other people in your class (if you are doing this activity for a class). With this larger dataset, repeat part 4.
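One simple way to compare self-reports against logs is to compute the mean absolute error (typical size of the discrepancy) and the mean signed error (systematic over- or under-reporting) per measure. The numbers below are invented stand-ins for five friends' answers:

```python
# invented data: five friends' self-reports vs. their logged counts
self_report = {"calls_yesterday": [2, 0, 5, 1, 3],
               "texts_yesterday": [10, 3, 25, 8, 12]}
logged      = {"calls_yesterday": [2, 1, 4, 1, 2],
               "texts_yesterday": [14, 5, 40, 9, 20]}

results = {}
for measure in self_report:
    diffs = [s - l for s, l in zip(self_report[measure], logged[measure])]
    mae = sum(abs(d) for d in diffs) / len(diffs)   # average size of error
    bias = sum(diffs) / len(diffs)                  # signed over/under-report
    results[measure] = (mae, bias)
    print(f"{measure}: MAE={mae:.1f}, bias={bias:+.1f}")
```

With these made-up numbers, calls yesterday are reported almost exactly, while texting is substantially under-reported, which is the kind of pattern Boase and Ling (2013) discuss.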
  5. [medium, data collection] Schuman and Presser (1996) argue that question order matters for two types of relations between questions: part-part questions, where two questions are at the same level of specificity (e.g., ratings of two presidential candidates); and part-whole questions, where a general question follows a more specific question (e.g., asking “How satisfied are you with your work?” followed by “How satisfied are you with your life?”).

    They further characterize two types of question order effects: consistency effects occur when responses to a later question are brought closer (than they would otherwise be) to those given to an earlier question; contrast effects occur when responses to the two questions are pushed further apart than they would otherwise be.

    1. Create a pair of part-part questions that you think will have a large question order effect, a pair of part-whole questions that you think will have a large order effect, and another pair of questions whose order you think would not matter. Run a survey experiment on MTurk to test your questions.
    2. How large a part-part effect were you able to create? Was it a consistency or a contrast effect?
    3. How large a part-whole effect were you able to create? Was it a consistency or a contrast effect?
    4. Was there a question order effect in your pair where you did not think the order would matter?
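The analysis of such a survey experiment is a simple difference in means between the two randomized orderings. A sketch on invented data, where a consistency effect is simulated by shifting answers to question B up by one point when question A came first:

```python
import random

random.seed(1)

respondents = list(range(200))
random.shuffle(respondents)
a_first = set(respondents[:100])   # these respondents saw question A first

# invented responses to question B on a 1-7 scale, plus some noise;
# the +1 shift for the A-first group simulates a consistency effect
resp_b = {r: (5 if r in a_first else 4) + random.choice([-1, 0, 1])
          for r in respondents}

mean_a_first = sum(resp_b[r] for r in a_first) / len(a_first)
mean_b_first = sum(resp_b[r] for r in respondents if r not in a_first) / 100
order_effect = mean_a_first - mean_b_first
print(f"estimated order effect on question B: {order_effect:+.2f}")
```

On MTurk, the random assignment to orderings plays the role of the `random.shuffle` step; the difference in means (with a standard error or t-test) is the size of the order effect.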
  6. [medium, data collection] Building on the work of Schuman and Presser, Moore (2002) describes a separate dimension of question order effect: additive and subtractive. While contrast and consistency effects are produced as a consequence of respondents’ evaluations of the two items in relation to each other, additive and subtractive effects are produced when respondents are made more sensitive to the larger framework within which the questions are posed. Read Moore (2002), then design and run a survey experiment on MTurk to demonstrate additive or subtractive effects.

  7. [hard, data collection] Christopher Antoun and colleagues (2015) conducted a study comparing the convenience samples obtained from four different online recruiting sources: MTurk, Craigslist, Google AdWords and Facebook. Design a simple survey and recruit participants through at least two different online recruiting sources (they can be different sources from the four sources used in Antoun et al. (2015)).

    1. Compare the cost per recruit, in terms of money and time, between different sources.
    2. Compare the composition of the samples obtained from different sources.
    3. Compare the quality of data between the samples. For ideas about how to measure data quality from respondents, see Schober et al. (2015).
    4. What is your preferred source? Why?
  8. [medium] YouGov, an internet-based market research firm, conducted online polls of a panel of about 800,000 respondents in the UK and used Mr. P. to predict the result of the EU Referendum (i.e., Brexit), in which UK voters voted either to remain in or to leave the European Union.

    A detailed description of YouGov’s statistical model is available online. Roughly speaking, YouGov partitioned voters into types based on 2015 general election vote choice, age, qualifications, gender, date of interview, and the constituency they live in. First, they used data collected from YouGov panelists to estimate, among those who vote, the proportion of each voter type who intended to vote Leave. Second, they estimated the turnout of each voter type using the 2015 British Election Study (BES) post-election face-to-face survey, which validated turnout against the electoral rolls. Finally, they estimated how many people of each voter type there are in the electorate, based on the latest Census and Annual Population Survey (with some additional information from the BES, YouGov survey data from around the general election, and information on how many people voted for each party in each constituency).

    Three days before the vote, YouGov showed a two-point lead for Leave. On the eve of voting, their poll showed the race as too close to call (49–51 Remain). Their final on-the-day poll predicted 48–52 in favor of Remain. In fact, this estimate missed the final result (52–48 Leave) by four percentage points.

    1. Use the total survey error framework discussed in this chapter to assess what could have gone wrong.
    2. YouGov’s response after the election explained: “This seems in a large part due to turnout – something that we have said all along would be crucial to the outcome of such a finely balanced race. Our turnout model was based, in part, on whether respondents had voted at the last general election and a turnout level above that of general elections upset the model, particularly in the North.” Does this change your answer to part 1?
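YouGov's explanation can be made concrete with a toy calculation (all numbers invented): if one Leave-leaning voter type turns out at a higher rate than the turnout model assumed, the projection can flip even when every type's Leave share is estimated correctly.

```python
# per voter type: (population share, Leave share among voters,
#                  modeled turnout, actual turnout) -- all numbers invented
types = {
    "young_grad":   (0.30, 0.30, 0.70, 0.70),
    "older_nograd": (0.40, 0.65, 0.60, 0.75),  # turned out more than modeled
    "other":        (0.30, 0.50, 0.65, 0.65),
}

def leave_share(turnout_col):
    # turnout_col: index 2 uses the modeled turnout, index 3 the actual turnout
    voters = sum(row[0] * row[turnout_col] for row in types.values())
    leave = sum(row[0] * row[turnout_col] * row[1] for row in types.values())
    return leave / voters

modeled = leave_share(2)
actual = leave_share(3)
print(f"projected Leave share (modeled turnout): {modeled:.3f}")
print(f"Leave share with actual turnout:         {actual:.3f}")
```

With these made-up numbers the modeled turnout yields a narrow Remain projection while the actual turnout yields a narrow Leave win, mirroring the kind of turnout error YouGov described.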
  9. [medium, requires coding] Write a simulation to illustrate each of the representation errors in Figure 3.1.

    1. Create a situation where these errors actually cancel out.
    2. Create a situation where the errors compound each other.
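A starting point for this simulation, under an assumed setup: coverage error arises because the sampling frame contains only the "online" part of the population, sampling error from a random draw, and non-response error from response probabilities that depend on the outcome. Tilting non-response in the opposite direction from the coverage bias makes the errors roughly cancel; tilting it in the same direction makes them compound.

```python
import random
import statistics

random.seed(42)

# population of 100,000: the outcome y is higher for people who are online
population = [(online, random.gauss(1.0 if online else 0.0, 1.0))
              for online in (random.random() < 0.8 for _ in range(100_000))]
pop_mean = statistics.mean(y for _, y in population)

# coverage error: the frame contains only the online population (mean ~1.0,
# versus a population mean of ~0.8)
frame = [y for online, y in population if online]

def estimate(respond_prob):
    sample = random.sample(frame, 5_000)          # sampling error (random)
    resp = [y for y in sample if random.random() < respond_prob(y)]
    return statistics.mean(resp)                  # non-response error

# (a) errors roughly cancel: low-y frame members respond more often,
#     dragging the respondents' mean back down toward the population mean
cancel = estimate(lambda y: 0.62 if y < 1 else 0.38)
# (b) errors compound: high-y frame members are also *more* likely to respond
compound = estimate(lambda y: 0.38 if y < 1 else 0.62)
print(f"truth={pop_mean:.2f}  cancel={cancel:.2f}  compound={compound:.2f}")
```

The specific response probabilities here were tuned by hand so the canceling scenario works; part of the exercise is seeing how fragile such cancellation is.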
  10. [very hard, requires coding] The research of Blumenstock and colleagues (2015) involved building a machine learning model that could use digital trace data to predict survey responses. Now, you are going to try the same thing with a different dataset. Kosinski, Stillwell, and Graepel (2013) found that Facebook likes can predict individual traits and attributes. Surprisingly, these predictions can be even more accurate than those of friends and colleagues (Youyou, Kosinski, and Stillwell 2015).

    1. Read Kosinski, Stillwell, and Graepel (2013), and replicate their Figure 2; their data are available online.
    2. Now, replicate Figure 3.
    3. Finally, try their model on your own Facebook data: How well does it work for you?
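Before working with the real Likes data, the prediction pipeline can be caricatured on synthetic data. This is a much-simplified stand-in for the actual method in Kosinski, Stillwell, and Graepel (2013), who reduce the user-Like matrix with SVD and then fit regressions; here a naive per-Like score classifies an invented binary trait:

```python
import random

random.seed(7)

N_LIKES = 50

# synthetic users: a binary trait, and each Like slightly more probable
# for one trait group than the other
def make_user(trait):
    return trait, [int(random.random() < (0.6 if (i % 2 == trait) else 0.4))
                   for i in range(N_LIKES)]

train = [make_user(t) for t in [random.randrange(2) for _ in range(400)]]
test = [make_user(t) for t in [random.randrange(2) for _ in range(200)]]

# per-Like score: like rate among trait=1 users minus rate among trait=0 users
def like_rates(group):
    users = [likes for t, likes in train if t == group]
    return [sum(col) / len(users) for col in zip(*users)]

score = [b - a for a, b in zip(like_rates(0), like_rates(1))]

def predict(likes):
    # positive total score -> predict trait 1
    return int(sum(s * l for s, l in zip(score, likes)) > 0)

acc = sum(predict(likes) == t for t, likes in test) / len(test)
print(f"test accuracy on synthetic data: {acc:.2f}")
```

With the real data, the same train/score/evaluate loop applies, but with their dimensionality-reduction and regression steps in place of the naive scoring, and with AUC rather than raw accuracy as the headline metric.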
  11. [medium] Toole et al. (2015) use call detail records (CDRs) from mobile phones to predict aggregate unemployment trends.

    1. Compare and contrast the design of Toole et al. (2015) with Blumenstock, Cadamuro, and On (2015).
    2. Do you think CDRs should replace traditional surveys, complement them or not be used at all for government policymakers to track unemployment? Why?
    3. What evidence would convince you that CDRs can completely replace traditional measures of the unemployment rate?