Activities

You are reading the Open Review Edition of Bit by Bit.

Key:

degree of difficulty: easy, medium, hard, very hard
requires math ()
requires coding ()
data collection ()
my favorites ()

[, ] Algorithmic confounding was a problem with Google Flu Trends. Read the paper by Lazer et al. (2014), and write a short, clear email to an engineer at Google explaining the problem and suggesting how to fix it.
[] Bollen, Mao, and Zeng (2011) claim that data from Twitter can be used to predict the stock market. This finding led to the creation of a hedge fund, Derwent Capital Markets, to invest in the stock market based on data collected from Twitter (Jordan 2010). What evidence would you want to see before putting your money in that fund?
[] While some public health advocates hail e-cigarettes as an effective aid for smoking cessation, others warn about the potential risks, such as the high levels of nicotine. Imagine that a researcher decides to study public opinion toward e-cigarettes by collecting e-cigarette-related Twitter posts and conducting sentiment analysis.
1. What are the three possible biases that you are most worried about in this study?
2. Clark et al. (2016) ran just such a study. First, they collected 850,000 tweets that used e-cigarette-related keywords from January 2012 through December 2014. Upon closer inspection, they realized that many of these tweets were automated (i.e., not produced by humans) and that many of these automated tweets were essentially commercials. They developed a Human Detection Algorithm to separate automated tweets from organic tweets. Using this Human Detection Algorithm, they found that 80% of tweets were automated. Does this finding change your answer to part 1?
3. When they compared the sentiment in organic and automated tweets, they found that the automated tweets were more positive than the organic tweets (6.17 versus 5.84). Does this finding change your answer to part 2?
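To see why automated tweets matter for parts 2 and 3, it can help to work through the arithmetic. The sketch below is illustrative only: it assumes that the 80% automated share and the two average sentiment scores (6.17 and 5.84) describe the same pool of tweets, which you should check against Clark et al. (2016). The function name `pooled_mean` is made up for this example.

```python
def pooled_mean(share_automated, mean_automated, mean_organic):
    """Mean sentiment of the pooled sample, as a share-weighted average."""
    if not 0.0 <= share_automated <= 1.0:
        raise ValueError("share_automated must be between 0 and 1")
    return share_automated * mean_automated + (1.0 - share_automated) * mean_organic


# Numbers quoted in the activity; treating them as one pool is an assumption.
pooled = pooled_mean(0.80, 6.17, 5.84)
bias = pooled - 5.84  # how far the pooled mean sits above the organic mean

print(f"pooled mean: {pooled:.3f}")               # pooled mean: 6.104
print(f"gap relative to organic tweets: {bias:.3f}")  # gap relative to organic tweets: 0.264
```

Under these assumptions, a researcher who did not filter automated tweets would report a sentiment score close to that of the commercial accounts rather than that of human users.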
[] In November 2009, Twitter changed the question in the tweet box from “What are you doing?” to “What’s happening?” (https://blog.twitter.com/2009/whats-happening).
1. How do you think the change of prompts will affect who tweets and/or what they tweet?
2. Name one research project for which you would prefer the prompt “What are you doing?” Explain why.
3. Name one research project for which you would prefer the prompt “What’s happening?” Explain why.
[] Kwak et al. (2010) analyzed 41.7 million user profiles, 1.47 billion social relations, 4262 trending topics, and 106 million tweets between June 6th and June 31st, 2009. Based on this analysis they concluded that Twitter serves more as a new medium of information sharing than a social network.
1. Considering Kwak et al.’s finding, what type of research would you do with Twitter data? What type of research would you not do with Twitter data? Why?
2. In 2010, Twitter added a Who To Follow service that makes tailored suggestions to users. Three recommendations are shown at a time on the main page. Recommendations are often drawn from one’s “friends-of-friends,” and mutual contacts are also displayed in the recommendation. Users can refresh to see a new set of recommendations or visit a page with a longer list of recommendations. Do you think this new feature would change your answer to part 1? Why or why not?
3. Su, Sharma, and Goel (2016) evaluated the effect of the Who To Follow service and found that while users across the popularity spectrum benefited from the recommendations, the most popular users profited substantially more than average. Does this finding change your answer to part 2? Why or why not?
[] “Retweets” are often used to measure influence and spread of influence on Twitter. Initially, users had to copy and paste the tweet they liked, tag the original author with his/her handle, and manually type “RT” before the tweet to indicate that it’s a retweet. Then, in 2009 Twitter added a “retweet” button. In June 2016, Twitter made it possible for users to retweet their own tweets (https://twitter.com/twitter/status/742749353689780224). Do you think these changes should affect how you use “retweets” in your research? Why or why not?
[, , ] Michel et al. (2011) constructed a corpus emerging from Google’s effort to digitize books. Using the first version of the corpus, which was published in 2009 and contained over 5 million digitized books, the authors analyzed word usage frequency to investigate linguistic changes and cultural trends. Soon the Google Books Corpus became a popular data source for researchers, and a 2nd version of the database was released in 2012.

However, Pechenick, Danforth, and Dodds (2015) warned that researchers need to fully characterize the sampling process of the corpus before using it for drawing broad conclusions. The main issue is that the corpus is library-like, containing one of each book. As a result, an individual, prolific author is able to noticeably insert new phrases into the Google Books lexicon. Moreover, scientific texts constitute an increasingly substantive portion of the corpus throughout the 1900s. In addition, by comparing two versions of the English Fiction datasets, Pechenick et al. found evidence that insufficient filtering was used in producing the first version. All of the data needed for this activity is available here: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
1. In their original paper, Michel et al. (2011) used the 1st version of the English data set, plotted the frequency of usage of the years “1880”, “1912” and “1973”, and concluded that “we are forgetting our past faster with each passing year” (Fig. 3A, Michel et al.). Replicate the same plot using the 1st version of the corpus, English dataset (same as Fig. 3A, Michel et al.).
2. Now replicate the same plot with the 1st version, English fiction dataset.
3. Now replicate the same plot with the 2nd version of the corpus, English dataset.
4. Finally, replicate the same plot with the 2nd version, English fiction dataset.
5. Describe the differences and similarities between these four plots. Do you agree with Michel et al.’s original interpretation of the observed trend? (Hint: the plots from parts 3 and 4 should be the same as Figure 16 in Pechenick et al.)
6. Now that you have replicated this one finding using different Google Books corpora, choose another linguistic change or cultural phenomenon presented in Michel et al.’s original paper. Do you agree with their interpretation in light of the limitations presented in Pechenick et al.? To make your argument stronger, try to replicate the same graph using different versions of the data set as above.
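For the replication plots, the core computation is a relative frequency: the number of times a 1-gram such as “1880” appears in books published in a given year, divided by the total number of 1-grams in that year. The following is a minimal sketch, not the authors’ code. It assumes tab-separated rows whose first three columns are the n-gram, the publication year, and the match count, and it assumes that you have already loaded per-year totals into a dictionary; check both assumptions against the documentation on the dataset page. The rows and totals shown are made up for illustration.

```python
from collections import defaultdict


def yearly_counts(lines, target):
    """Sum match counts per publication year for one exact 1-gram.

    Assumes tab-separated rows whose first three fields are:
    ngram, year, match_count (any further columns are ignored).
    """
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != target:
            continue
        counts[int(fields[1])] += int(fields[2])
    return dict(counts)


def relative_frequency(counts, totals):
    """Divide each year's count by the total number of 1-grams that year."""
    return {
        year: counts[year] / totals[year]
        for year in sorted(counts)
        if totals.get(year, 0) > 0
    }


# Tiny made-up rows and totals, for illustration only (not real corpus values).
sample_rows = [
    "1880\t1900\t50\t10\n",
    "1880\t1950\t20\t8\n",
    "1912\t1950\t40\t12\n",
]
sample_totals = {1900: 1000, 1950: 2000}

freq_1880 = relative_frequency(yearly_counts(sample_rows, "1880"), sample_totals)
print(freq_1880)  # {1900: 0.05, 1950: 0.01}
```

Running the same two functions on each of the four corpora, for each of the three years, gives the series you need to plot and compare.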
[, , , ] Penney (2016) explores whether the widespread publicity about NSA/PRISM surveillance (i.e., the Snowden revelations) in June 2013 is associated with a sharp and sudden decrease in traffic to Wikipedia articles on topics that raise privacy concerns. If so, this change in behavior would be consistent with a chilling effect resulting from mass surveillance. The approach of Penney (2016) is sometimes called an interrupted time series design and is related to the approaches in the chapter about approximating experiments from observational data (Section 2.4.3).

To choose the topic keywords, Penney referred to the list used by the U.S. Department of Homeland Security for tracking and monitoring social media. The DHS list categorizes certain search terms into a range of issues, e.g., “Health Concern,” “Infrastructure Security,” and “Terrorism.” For the study group, Penney used the forty-eight keywords related to “Terrorism” (see Appendix Table 8). He then aggregated Wikipedia article view counts on a monthly basis for the corresponding forty-eight Wikipedia articles over a thirty-two month period, from the beginning of January 2012 to the end of August 2014. To strengthen his argument, he also created several comparison groups by tracking article views on other topics.

Now, you are going to replicate and extend Penney (2016). All the raw data that you will need for this activity is available from Wikipedia (https://dumps.wikimedia.org/other/pagecounts-raw/). Alternatively, you can get it from the R package wikipediatrend (Meissner and Team 2016). When you write up your responses, please note which data source you used. (Note: This same activity also appears in Chapter 6.)
1. Read Penney (2016) and replicate Figure 2, which shows the page views for “Terrorism”-related pages before and after the Snowden revelations. Interpret the findings.
2. Next, replicate Fig 4A, which compares the study group (“Terrorism”-related articles) with a comparator group using keywords categorized under “DHS & Other Agencies” from the DHS list (see Appendix Table 10). Interpret the findings.
3. In part 2 you compared the study group to one comparator group. Penney also compared to two other comparator groups: “Infrastructure Security”-related articles (Appendix Table 11) and popular Wikipedia pages (Appendix Table 12). Come up with an alternative comparator group, and test whether the findings from part 2 are sensitive to your choice of comparator group. Which choice of comparator group makes the most sense? Why?
4. The author stated that keywords relating to “Terrorism” were used to select the Wikipedia articles because the U.S. government cited terrorism as a key justification for its online surveillance practices. As a check of these 48 “Terrorism”-related keywords, Penney (2016) also conducted a survey on MTurk asking respondents to rate each of the keywords in terms of Government Trouble, Privacy-Sensitive, and Avoidance (Appendix Tables 7 and 8). Replicate the survey on MTurk and compare your results.
5. Based on the results in part 4 and your reading of the article, do you agree with the author’s choice of topic keywords in the study group? Why or why not? If not, what would you suggest instead?
[] Efrati (2016) reports, based on confidential information, that “total sharing” on Facebook had declined by about 5.5% year over year while “original broadcast sharing” was down 21% year over year. This decline was particularly acute among Facebook users under 30 years of age. The report attributed the decline to two factors. One is the growth in the number of “friends” people have on Facebook. The other is that some sharing activity has shifted to messaging and to competitors such as SnapChat. The report also revealed several tactics Facebook had tried to boost sharing, including News Feed algorithm tweaks that make original posts more prominent, as well as periodic “On This Day” reminders of original posts that users made several years ago. What implications, if any, do these findings have for researchers who want to use Facebook as a data source?
[] Tumasjan et al. (2010) reported that the proportion of tweets mentioning a political party matched the proportion of votes that party received in the German parliamentary election in 2009 (Figure 2.9). In other words, it appeared that you could use Twitter to predict the election. At the time this study was published it was considered extremely exciting because it seemed to suggest a valuable use for a common source of big data.

Given the bad features of big data, however, you should immediately be skeptical of this result. Germans on Twitter in 2009 were quite a non-representative group, and supporters of one party might tweet about politics more often. Thus, it seems surprising that all the possible biases that you could imagine would somehow cancel out. In fact, the results in Tumasjan et al. (2010) turned out to be too good to be true. In their paper, Tumasjan et al. (2010) considered six political parties: Christian Democrats (CDU), Christian Social Union (CSU), SPD, Liberals (FDP), The Left (Die Linke), and the Green Party (Grüne). However, the most mentioned German political party on Twitter at that time was the Pirate Party (Piraten), a party that fights government regulation of the Internet. When the Pirate Party was included in the analysis, Twitter mentions became a terrible predictor of election results (Figure 2.9) (Jungherr, Jürgens, and Schoen 2012).

Figure 2.9: Twitter mentions appear to predict the results of the 2009 German election (Tumasjan et al. 2010), but this result turns out to depend on some arbitrary and unjustified choices (Jungherr, Jürgens, and Schoen 2012).

Subsequently, other researchers around the world have used fancier methods, such as using sentiment analysis to distinguish between positive and negative mentions of the parties, in order to improve the ability of Twitter data to predict a variety of different types of elections (Gayo-Avello 2013; Jungherr 2015, Ch. 7). Here’s how Huberty (2015) summarized the results of these attempts to predict elections:

“All known forecasting methods based on social media have failed when subjected to the demands of true forward-looking electoral forecasting. These failures appear to be due to fundamental properties of social media, rather than to methodological or algorithmic difficulties. In short, social media do not, and probably never will, offer a stable, unbiased, representative picture of the electorate; and convenience samples of social media lack sufficient data to fix these problems post hoc.”

Read some of the research that led Huberty (2015) to that conclusion, and write a one-page memo to a political candidate describing if and how Twitter should be used to forecast elections.
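The sensitivity of the original result to which parties are included can be illustrated with a small calculation. The sketch below uses entirely hypothetical parties and counts (they are not the German data); it converts mention counts and vote counts into shares over a chosen set of parties and reports the mean absolute error between the two. The function names are made up for this example.

```python
def shares(counts, parties):
    """Convert raw counts into proportions over the chosen parties only."""
    total = sum(counts[p] for p in parties)
    return {p: counts[p] / total for p in parties}


def mean_absolute_error(mentions, votes, parties):
    """Average absolute gap between mention shares and vote shares."""
    mention_share = shares(mentions, parties)
    vote_share = shares(votes, parties)
    return sum(abs(mention_share[p] - vote_share[p]) for p in parties) / len(parties)


# Hypothetical counts, for illustration only: party D is heavily mentioned
# online but receives few votes.
mentions = {"A": 300, "B": 250, "C": 150, "D": 400}
votes = {"A": 450, "B": 350, "C": 180, "D": 20}

error_without_d = mean_absolute_error(mentions, votes, ["A", "B", "C"])
error_with_d = mean_absolute_error(mentions, votes, ["A", "B", "C", "D"])

print(f"error excluding D: {error_without_d:.3f}")
print(f"error including D: {error_with_d:.3f}")
```

In this made-up example the fit looks good only when the heavily mentioned party is left out, which is the kind of analyst choice that Jungherr, Jürgens, and Schoen (2012) criticized.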
[] What is the difference between a sociologist and a historian? According to Goldthorpe (1991), the main difference between a sociologist and a historian is control over data collection. Historians are forced to use relics whereas sociologists can tailor their data collection to specific purposes. Read Goldthorpe (1991). How is the difference between sociology and history related to the idea of Custommades and Readymades?
[] Building on the previous question, Goldthorpe (1991) drew a number of critical responses, including one from Nicky Hart (1994) that challenged Goldthorpe’s devotion to tailor-made data. To clarify the potential limitations of tailor-made data, Hart described the Affluent Worker Project, a large survey to measure the relationship between social class and voting that was conducted by Goldthorpe and colleagues in the mid-1960s. As one might expect from a scholar who favored designed data over found data, the Affluent Worker Project collected data that was tailored to address a recently proposed theory about the future of social class in an era of increasing living standards. But Goldthorpe and colleagues somehow “forgot” to collect information about the voting behavior of women. Here’s how Nicky Hart (1994) summarizes the whole episode:

“. . . it [is] difficult to avoid the conclusion that women were omitted because this ‘tailor made’ dataset was confined by a paradigmatic logic which excluded female experience. Driven by a theoretical vision of class consciousness and action as male preoccupations . . . , Goldthorpe and his colleagues constructed a set of empirical proofs which fed and nurtured their own theoretical assumptions instead of exposing them to a valid test of adequacy.”

Hart continued:

“The empirical findings of the Affluent Worker Project tell us more about the masculinist values of mid-century sociology than they inform the processes of stratification, politics and material life.”

Can you think of other examples where tailor-made data collection has the biases of the data collector built into it? How does this compare to algorithmic confounding? What implications might this have for when researchers should use Readymades and when they should use Custommades?
[] In this chapter, I contrasted data collected by researchers for researchers with administrative records created by companies and governments. Some people call these administrative records “found data,” which they contrast with “designed data.” It is true that administrative records are found by researchers, but they are also highly designed. For example, modern tech companies spend enormous amounts of time and resources to collect and curate their data. Thus, these administrative records are both found and designed; it just depends on your perspective (Figure 2.10).

Figure 2.10: The picture is both a duck and a rabbit; what you see depends on your perspective. Government and business administrative records are both found and designed; what you see depends on your perspective. For example, the call data records collected by a cell phone company are found data from the perspective of a researcher. But these exact same records are designed data from the perspective of someone working in the billing department of the phone company. Source: Wikimedia Commons

Provide an example of a data source where seeing it as both found and designed is helpful when using that data source for research.
[] In a thoughtful essay, Christian Sandvig and Eszter Hargittai (2015) describe two kinds of digital research, where the digital system is “instrument” or “object of study.” An example of the first kind of study is where Bengtsson and colleagues (2011) used mobile phone data to track migration after the earthquake in Haiti in 2010. An example of the second kind is where Jensen (2007) studies how the introduction of mobile phones throughout Kerala, India impacted the functioning of the market for fish. I find this helpful because it clarifies that studies using digital data sources can have quite different goals even if they are using the same kind of data source. In order to further clarify this distinction, describe four studies that you’ve seen: two that use a digital system as an instrument and two that use a digital system as an object of study. You can use examples from this chapter if you want.