• degree of difficulty: easy, medium, hard, very hard
  • requires math
  • requires coding
  • data collection
  • my favorites
  1. [very hard, requires coding, data collection, my favorite] One of the most exciting claims from Benoit and colleagues’ (2016) research on crowd-coding of political manifestos is that the results are reproducible. Merz, Regel, and Lewandowski (2016) provide access to the Manifesto Corpus. Try to reproduce figure 2 from Benoit et al. (2016) using workers from Amazon Mechanical Turk. How similar were your results?
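If you attempt this, each sentence will be coded by several workers and the redundant labels must be aggregated. A minimal sketch of majority-vote aggregation, with hypothetical label values (the actual coding categories come from the Benoit et al. task):

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label among redundant worker codings."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical codings of one manifesto sentence by five MTurk workers
codings = ["economic", "economic", "social", "economic", "social"]
print(majority_label(codings))  # economic
```

More sophisticated aggregation schemes (e.g., weighting workers by their accuracy on gold-standard items) are also possible and are discussed in the crowd-coding literature.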

  2. [medium] In the InfluenzaNet project, a volunteer panel of people reports the incidence, prevalence, and health-seeking behavior related to influenza-like illness (Tilston et al. 2010; Noort et al. 2015).

    1. Compare and contrast the design, costs, and likely errors in InfluenzaNet, Google Flu Trends, and traditional influenza tracking systems.
    2. Consider an unsettled time, such as an outbreak of a novel form of influenza. Describe the possible errors in each system.
  3. [hard, requires coding, data collection] The Economist is a weekly news magazine. Create a human computation project to see if the ratio of women to men on the cover has changed over time.

    1. The magazine can have different covers in eight different regions (Africa, Asia Pacific, Europe, European Union, Latin America, Middle East, North America, and United Kingdom) and they can all be downloaded from the magazine’s website. Pick one of these regions and perform the analysis. Be sure to describe your procedures with enough detail that they could be replicated by someone else.

    This question was inspired by a similar project by Justin Tenuto, a data scientist at the crowdsourcing company CrowdFlower: see “Time Magazine Really Likes Dudes.”
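A first step in such a project is assembling the cover images. A minimal download sketch — the URL list is a placeholder that you would fill in after locating the actual cover-image addresses on the magazine’s website:

```python
import os
import urllib.request

# Placeholder list: the real cover-image URLs must be found on the
# magazine's website; none are supplied here.
cover_urls = []

os.makedirs("covers", exist_ok=True)
for url in cover_urls:
    # Save each image under its original filename
    filename = os.path.join("covers", url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, filename)
```

Recording the exact URLs and download dates you used is part of describing your procedure in enough detail for replication.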

  4. [very hard, requires coding, data collection] Building on the previous question, now perform the analysis for all eight regions.

    1. What differences did you find across regions?
    2. How much extra time and money did it take to scale up your analysis to all eight of the regions?
    3. Imagine that the Economist has 100 different covers each week. Estimate how much extra time and money it would take to scale up your analysis to 100 covers per week.
  5. [hard, requires coding] There are several websites that host open call projects, such as Kaggle. Participate in one of those projects, and describe what you learn about that particular project and about open calls in general.

  6. [medium] Look through a recent issue of a journal in your field. Are there any papers that could have been reformulated as open call projects? Why or why not?

  7. [easy] Purdam (2014) describes a distributed data collection about begging in London. Summarize the strengths and weaknesses of this research design.

  8. [medium] Redundancy is an important way to assess the quality of distributed data collection. Van der Windt and Humphreys (2016) developed and tested a system to collect reports of conflict events from people in Eastern Congo. Read the paper.

    1. How does their design ensure redundancy?
    2. They offered several approaches to validate the data collected from their project. Summarize them. Which was most convincing to you?
    3. Propose a new way that the data could be validated. Suggestions should try to increase the confidence that you would have in the data in a way that is cost-effective and ethical.
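One simple way to quantify redundancy in such a system is pairwise agreement among the redundant reports of a single event. A minimal sketch, with hypothetical report values (this is an illustrative measure, not the validation approach used in the paper):

```python
from itertools import combinations

def pairwise_agreement(reports):
    """Share of report pairs for the same event that match exactly."""
    pairs = list(combinations(reports, 2))
    if not pairs:
        return 1.0  # a single report trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical redundant reports of one conflict event
reports = ["attack", "attack", "looting"]
print(pairwise_agreement(reports))  # 1 of 3 pairs agree -> 0.333...
```

Low agreement flags events whose reports deserve closer scrutiny, which connects to the validation question in part 2.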
  9. [medium] Karim Lakhani and colleagues (2013) created an open call to solicit new algorithms to solve a problem in computational biology. They received more than 600 submissions containing 89 novel computational approaches. Of the submissions, 30 exceeded the performance of the US National Institutes of Health’s MegaBLAST, and the best submission achieved both greater accuracy and speed (1,000 times faster).

    1. Read their paper, and then propose a social research problem that could use the same kind of open contest. In particular, this kind of open contest is focused on speeding up and improving the performance of an existing algorithm. If you can’t think of a problem like this in your field, try to explain why not.
  10. [medium, my favorite] Many human computation projects rely on participants from Amazon Mechanical Turk. Sign up to become a worker on Amazon Mechanical Turk. Spend one hour working there. How does this impact your thoughts about the design, quality, and ethics of human computation projects?