5.2.1 Galaxy Zoo

Galaxy Zoo combined the efforts of many non-expert volunteers to classify a million galaxies.

Galaxy Zoo grew out of a problem faced by Kevin Schawinski, a graduate student in Astronomy at the University of Oxford in 2007. Simplifying quite a bit, Schawinski was interested in galaxies, and galaxies can be classified by their morphology—elliptical or spiral—and by their color—blue or red. At the time, the conventional wisdom among astronomers was that spiral galaxies, like our Milky Way, were blue in color (indicating youth) and elliptical galaxies were red (indicating old age). Schawinski doubted this conventional wisdom. He suspected that while this pattern might be true in general, there were probably a sizable number of exceptions, and that by studying lots of these unusual galaxies—the ones that did not fit the expected pattern—he could learn something about the process through which galaxies formed.

Thus, what Schawinski needed in order to overturn conventional wisdom was a large set of morphologically classified galaxies; that is, galaxies that had been classified as either spiral or elliptical. The problem, however, was that existing algorithmic methods for classification were not yet good enough to be used for scientific research; in other words, classifying galaxies was, at that time, a problem that was hard for computers. Therefore, what was needed was a large number of human-classified galaxies. Schawinski undertook this classification problem with the enthusiasm of a graduate student. In a marathon session of seven 12-hour days, he was able to classify 50,000 galaxies. While 50,000 galaxies may sound like a lot, it is actually only about 5% of the almost one million galaxies that had been photographed in the Sloan Digital Sky Survey. Schawinski realized that he needed a more scalable approach.

Fortunately, it turns out that the task of classifying galaxies does not require advanced training in astronomy; you can teach someone to do it pretty quickly. In other words, even though classifying galaxies is a task that was hard for computers, it was pretty easy for humans. So, while sitting in a pub in Oxford, Schawinski and fellow astronomer Chris Lintott dreamed up a website where volunteers would classify images of galaxies. A few months later, Galaxy Zoo was born.

At the Galaxy Zoo website, volunteers would undergo a few minutes of training; for example, learning the difference between a spiral and elliptical galaxy (figure 5.2). After this training, each volunteer had to pass a relatively easy quiz—correctly classifying 11 of 15 galaxies with known classifications—and then would begin real classification of unknown galaxies through a simple web-based interface (figure 5.3). The transition from volunteer to astronomer would take place in less than 10 minutes and only required passing the lowest of hurdles, a simple quiz.

Figure 5.2: Examples of the two main types of galaxies: spiral and elliptical. The Galaxy Zoo project used more than 100,000 volunteers to categorize more than 900,000 images. Reproduced by permission from http://www.GalaxyZoo.org and Sloan Digital Sky Survey.

Figure 5.3: Input screen where volunteers were asked to classify a single image. Reproduced by permission from Chris Lintott based on an image from the Sloan Digital Sky Survey.

Galaxy Zoo attracted its initial volunteers after the project was featured in a news article, and in about six months the project grew to involve more than 100,000 citizen scientists, people who participated because they enjoyed the task and they wanted to help advance astronomy. Together, these 100,000 volunteers contributed a total of more than 40 million classifications, with the majority of the classifications coming from a relatively small, core group of participants (Lintott et al. 2008).

Researchers who have experience hiring undergraduate research assistants might immediately be skeptical about data quality. While this skepticism is reasonable, Galaxy Zoo shows that when volunteer contributions are correctly cleaned, debiased, and aggregated, they can produce high-quality results (Lintott et al. 2008). An important trick for getting the crowd to create professional-quality data is redundancy, that is, having the same task performed by many different people. In Galaxy Zoo, there were about 40 classifications per galaxy; researchers using undergraduate research assistants could never afford this level of redundancy and therefore would need to be much more concerned with the quality of each individual classification. What the volunteers lacked in training, they made up for with redundancy.

Even with multiple classifications per galaxy, however, combining the set of volunteer classifications to produce a consensus classification was tricky. Because very similar challenges arise in most human computation projects, it is helpful to briefly review the three steps that the Galaxy Zoo researchers used to produce their consensus classifications. First, the researchers “cleaned” the data by removing bogus classifications. For example, people who repeatedly classified the same galaxy—something that would happen if they were trying to manipulate the results—had all their classifications discarded. This and other similar cleaning removed about 4% of all classifications.
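To make the cleaning step concrete, here is a minimal sketch in Python. It is not the Galaxy Zoo team's code: the record format, the function name `clean_classifications`, and the `max_repeats` threshold are all assumptions made for illustration. It implements only the one rule described above, discarding every classification from any volunteer who classified the same galaxy repeatedly.

```python
from collections import Counter


def clean_classifications(records, max_repeats=1):
    """Drop all records from volunteers who classified one galaxy too often.

    records: list of (volunteer_id, galaxy_id, label) tuples (assumed format).
    max_repeats: hypothetical threshold; more repeats than this flags a volunteer.
    """
    # Count how many times each volunteer classified each galaxy.
    pair_counts = Counter((volunteer, galaxy) for volunteer, galaxy, _ in records)
    # Flag a volunteer if any single galaxy appears more than max_repeats times.
    flagged = {
        volunteer
        for (volunteer, _), count in pair_counts.items()
        if count > max_repeats
    }
    # Discard every classification made by a flagged volunteer.
    return [record for record in records if record[0] not in flagged]


# Made-up records: "bob" classifies galaxy g1 three times.
records = [
    ("ann", "g1", "spiral"),
    ("ann", "g2", "elliptical"),
    ("bob", "g1", "elliptical"),
    ("bob", "g1", "elliptical"),
    ("bob", "g1", "elliptical"),
    ("bob", "g2", "spiral"),
    ("cat", "g1", "spiral"),
]
cleaned = clean_classifications(records)
```

Note that the rule removes all of a flagged volunteer's work, including classifications of galaxies that the volunteer saw only once.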

Second, after cleaning, the researchers needed to remove systematic biases in classifications. Through a series of bias detection studies embedded within the original project (for example, showing some volunteers the galaxy in monochrome instead of color), the researchers discovered several systematic biases, such as a tendency to classify faraway spiral galaxies as elliptical galaxies (Bamford et al. 2009). Adjusting for these systematic biases is extremely important because redundancy does not automatically remove systematic bias; it only helps remove random error.
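One way to see why redundancy cannot fix systematic bias is with a deliberately simple model. This is an illustration, not the model used by the Galaxy Zoo team, and all of its symbols are assumptions introduced here. Suppose each of the n classifications of a galaxy is a numeric score x_i, equal to the true value θ, plus a bias b shared by all volunteers, plus independent noise ε_i with mean zero and variance σ². Averaging the n scores then gives:

```latex
\begin{align}
x_i &= \theta + b + \varepsilon_i, \qquad i = 1, \dots, n \\
\bar{x} &= \frac{1}{n} \sum_{i=1}^{n} x_i = \theta + b + \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \\
\mathbb{E}[\bar{x}] &= \theta + b \\
\operatorname{Var}(\bar{x}) &= \frac{\sigma^2}{n}
\end{align}
```

With n = 40, the standard deviation of the average falls to σ/√40, or about 0.16σ, but the bias b is exactly as large as it was for a single classification.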

Finally, after debiasing, the researchers needed a method to combine the individual classifications to produce a consensus classification. The simplest way to combine classifications for each galaxy would have been to choose the most common classification. However, this approach would have given each volunteer equal weight, and the researchers suspected that some volunteers were better at classification than others. Therefore, the researchers developed a more complex iterative weighting procedure that attempted to detect the best classifiers and give them more weight.
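The difference between a simple majority vote and an iterative weighting scheme can be sketched in a few lines of Python. This is not the procedure that the Galaxy Zoo researchers used. It is a generic stand-in that repeatedly (1) takes a weighted vote for each galaxy and (2) resets each volunteer's weight to the fraction of their votes that agree with the current consensus. The function name, the data format, and the toy votes are all assumptions.

```python
from collections import defaultdict


def weighted_consensus(votes, n_iter=5):
    """Iteratively reweight volunteers by agreement with the consensus.

    votes: list of (volunteer, galaxy, label) tuples (assumed format).
    Returns (consensus label per galaxy, final weight per volunteer).
    """
    # Start with every volunteer weighted equally.
    weights = {volunteer: 1.0 for volunteer, _, _ in votes}
    consensus = {}
    for _ in range(n_iter):
        # Weighted tally of labels for each galaxy.
        tallies = defaultdict(lambda: defaultdict(float))
        for volunteer, galaxy, label in votes:
            tallies[galaxy][label] += weights[volunteer]
        consensus = {g: max(t, key=t.get) for g, t in tallies.items()}
        # New weight = share of a volunteer's votes matching the consensus.
        agree = defaultdict(int)
        total = defaultdict(int)
        for volunteer, galaxy, label in votes:
            total[volunteer] += 1
            agree[volunteer] += int(label == consensus[galaxy])
        weights = {v: agree[v] / total[v] for v in total}
    return consensus, weights


# Made-up votes: a and b are consistent; c, d, and e often disagree with them.
votes = [
    ("a", "g1", "spiral"), ("b", "g1", "spiral"), ("c", "g1", "elliptical"),
    ("a", "g2", "elliptical"), ("b", "g2", "elliptical"), ("d", "g2", "spiral"),
    ("a", "g3", "spiral"), ("b", "g3", "spiral"), ("e", "g3", "elliptical"),
    ("a", "g4", "elliptical"), ("b", "g4", "elliptical"), ("c", "g4", "spiral"),
    ("a", "g5", "spiral"), ("b", "g5", "spiral"),
    ("c", "g5", "elliptical"), ("d", "g5", "elliptical"), ("e", "g5", "elliptical"),
]

# A single pass uses equal weights, so it is just a majority vote.
majority, _ = weighted_consensus(votes, n_iter=1)
consensus, weights = weighted_consensus(votes)
```

On this toy data, the equal-weight vote labels galaxy g5 as elliptical (three votes to two), while the iterated version labels it spiral, because volunteers c, d, and e lose weight after being outvoted on the other galaxies. A scheme like this rewards agreement with the crowd, so it is only as good as the crowd's starting consensus.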

Thus, after a three-step process—cleaning, debiasing, and weighting—the Galaxy Zoo research team had converted 40 million volunteer classifications into a set of consensus morphological classifications. When these Galaxy Zoo classifications were compared with three previous smaller-scale attempts by professional astronomers, including the classification by Schawinski that helped to inspire Galaxy Zoo, there was strong agreement. Thus, the volunteers, in aggregate, were able to provide high-quality classifications and at a scale that the researchers could not match (Lintott et al. 2008). In fact, by having human classifications for such a large number of galaxies, Schawinski, Lintott, and others were able to show that only about 80% of galaxies follow the expected pattern—blue spirals and red ellipticals—and numerous papers have been written about this discovery (Fortson et al. 2011).

Given this background, you can now see how Galaxy Zoo follows the split-apply-combine recipe, the same recipe that is used for most human computation projects. First, a big problem is split into chunks. In this case, the problem of classifying a million galaxies was split into a million problems of classifying one galaxy. Next, an operation is applied to each chunk independently. In this case, volunteers classified each galaxy as either spiral or elliptical. Finally, the results are combined to produce a consensus result. In this case, the combine step included the cleaning, debiasing, and weighting to produce a consensus classification for each galaxy. Even though most projects use this general recipe, each step needs to be customized to the specific problem being addressed. For example, in the human computation project described below, the same recipe will be followed, but the apply and combine steps will be quite different.
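The recipe itself can be written as three small functions. The sketch below is only a skeleton under stated assumptions: in a real human computation project the apply step is done by people, so here it simply looks up votes that have already been recorded, and the combine step is reduced to picking the most common label, leaving out cleaning, debiasing, and weighting. All names and data are hypothetical.

```python
from collections import Counter


def split_step(galaxy_ids):
    """Split: turn one big problem into one independent task per galaxy."""
    return [{"galaxy": galaxy_id} for galaxy_id in galaxy_ids]


def apply_step(task, recorded_votes):
    """Apply: stand-in for human work; returns votes already collected."""
    galaxy_id = task["galaxy"]
    return galaxy_id, recorded_votes[galaxy_id]


def combine_step(results):
    """Combine: choose the most common label for each galaxy."""
    return {
        galaxy_id: Counter(labels).most_common(1)[0][0]
        for galaxy_id, labels in results
    }


# Made-up volunteer votes for two galaxies.
recorded_votes = {
    "g1": ["spiral", "spiral", "elliptical"],
    "g2": ["elliptical", "elliptical", "elliptical"],
}

tasks = split_step(recorded_votes.keys())
results = [apply_step(task, recorded_votes) for task in tasks]
consensus = combine_step(results)
```

Keeping the three steps separate is what lets each one be customized: a different project can swap in a different apply or combine step without touching the others.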

For the Galaxy Zoo team, this first project was just the beginning. Very quickly they realized that even though they were able to classify close to a million galaxies, this scale is not enough to work with newer digital sky surveys, which can produce images of about 10 billion galaxies (Kuminski et al. 2014). To handle an increase from 1 million to 10 billion—a factor of 10,000—Galaxy Zoo would need to recruit roughly 10,000 times more participants. Even though the number of volunteers on the Internet is large, it is not infinite. Therefore, the researchers realized that if they were going to handle ever-growing amounts of data, a new, even more scalable, approach was needed.
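The back-of-the-envelope arithmetic behind this scaling argument can be written out as follows. It uses the round figures quoted in this section (about 10^6 galaxies, about 10^5 volunteers, and about 4 × 10^7 classifications) and assumes, purely for illustration, that the redundancy per galaxy and the effort per volunteer stay the same.

```latex
\begin{align}
\text{scale factor} &= \frac{10^{10}\ \text{galaxies}}{10^{6}\ \text{galaxies}} = 10^{4} \\
\text{volunteers needed} &\approx 10^{5} \times 10^{4} = 10^{9} \\
\text{classifications needed} &\approx \left(4 \times 10^{7}\right) \times 10^{4} = 4 \times 10^{11}
\end{align}
```

Under those assumptions, the project would need on the order of a billion volunteers.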

Therefore, Manda Banerji—working with Schawinski, Lintott, and other members of the Galaxy Zoo team (2010)—started teaching computers to classify galaxies. More specifically, using the human classifications created by Galaxy Zoo, Banerji built a machine learning model that could predict the human classification of a galaxy based on the characteristics of the image. If this model could reproduce the human classifications with high accuracy, then it could be used by Galaxy Zoo researchers to classify an essentially infinite number of galaxies.

The core of Banerji and colleagues’ approach is actually pretty similar to techniques commonly used in social research, although that similarity might not be clear at first glance. First, Banerji and colleagues converted each image into a set of numerical features that summarized its properties. For example, for images of galaxies, there could be three features: the amount of blue in the image, the variance in the brightness of the pixels, and the proportion of non-white pixels. The selection of the correct features is an important part of the problem, and it generally requires subject-area expertise. This first step, commonly called feature engineering, results in a data matrix with one row per image and then three columns describing that image. Given the data matrix and the desired output (e.g., whether the image was classified by a human as an elliptical galaxy), the researcher creates a statistical or machine learning model—for example, logistic regression—that predicts the human classification based on the features of the image. Finally, the researcher uses the parameters in this statistical model to produce estimated classifications of new galaxies (figure 5.4). In machine learning, this approach—using labeled examples to create a model that can then label new data—is called supervised learning.
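The toy version of this workflow can be sketched in plain Python. Everything in the sketch is an assumption made for illustration: the feature values are made up, the three features are the toy ones named above rather than anything computed from real images, and the logistic regression is fit by a basic gradient descent loop instead of a statistics library. It is not the model that Banerji and colleagues used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, row):
    """Estimated probability that a galaxy is elliptical."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))


def fit_logistic(X, y, lr=0.5, n_steps=2000):
    """Fit logistic regression by batch gradient descent.

    Returns [intercept, weight_1, weight_2, ...].
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_steps):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            err = predict_proba(w, row) - target
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        # Step against the average gradient of the log loss.
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Data matrix: one row per image, three columns
# (amount of blue, variance in brightness, proportion of non-white pixels).
X = [
    [0.80, 0.60, 0.30],
    [0.70, 0.55, 0.35],
    [0.75, 0.65, 0.25],
    [0.20, 0.20, 0.40],
    [0.25, 0.15, 0.45],
    [0.15, 0.25, 0.35],
]
# Human labels: 1 = classified as elliptical, 0 = classified as spiral.
y = [0, 0, 0, 1, 1, 1]

w = fit_logistic(X, y)

# Features of two new, unlabeled galaxies (also made up).
new_blue_galaxy = [0.75, 0.575, 0.325]
new_red_galaxy = [0.225, 0.175, 0.425]
labels = [
    "elliptical" if predict_proba(w, row) > 0.5 else "spiral"
    for row in (new_blue_galaxy, new_red_galaxy)
]
```

The fitted weights in `w` play the role of the parameters of the statistical model: once they are estimated from the labeled rows, classifying a new galaxy only requires computing its features.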

Figure 5.4: Simplified description of how Banerji et al. (2010) used the Galaxy Zoo classifications to train a machine learning model to do galaxy classification. Images of galaxies were converted into a matrix of features. In this simplified example, there are three features (the amount of blue in the image, the variance in the brightness of the pixels, and the proportion of non-white pixels). Then, for a subset of the images, the Galaxy Zoo labels are used to train a machine learning model. Finally, the machine learning model is used to estimate classifications for the remaining galaxies. I call this a computer-assisted human computation project because, rather than having humans solve a problem, it has humans build a dataset that can be used to train a computer to solve the problem. The advantage of this computer-assisted human computation system is that it enables you to handle essentially infinite amounts of data using only a finite amount of human effort. Images of galaxies reproduced by permission from Sloan Digital Sky Survey.

The features in Banerji and colleagues’ machine learning model were more complex than those in my toy example (for example, she used features like “de Vaucouleurs fit axial ratio”), and her model was not logistic regression but an artificial neural network. Using her features, her model, and the consensus Galaxy Zoo classifications, she was able to create weights on each feature and then use these weights to make predictions about the classification of galaxies. For example, her analysis found that images with low “de Vaucouleurs fit axial ratio” were more likely to be spiral galaxies. Given these weights, she was able to predict the human classification of a galaxy with reasonable accuracy.

The work of Banerji and colleagues turned Galaxy Zoo into what I would call a computer-assisted human computation system. The best way to think about these hybrid systems is that rather than having humans solve a problem, they have humans build a dataset that can be used to train a computer to solve the problem. Sometimes, training a computer to solve the problem can require lots of examples, and the only way to produce a sufficient number of examples is a mass collaboration. The advantage of this computer-assisted approach is that it enables you to handle essentially infinite amounts of data using only a finite amount of human effort. For example, a researcher with a million human-classified galaxies can build a predictive model that can then be used to classify a billion or even a trillion galaxies. If there are enormous numbers of galaxies, then this kind of human-computer hybrid is really the only possible solution. This infinite scalability is not free, however. Building a machine learning model that can correctly reproduce the human classifications is itself a hard problem, but fortunately there are already excellent books dedicated to this topic (Hastie, Tibshirani, and Friedman 2009; Murphy 2012; James et al. 2013).

Galaxy Zoo is a good illustration of how many human computation projects evolve. First, a researcher attempts the project by herself or with a small team of research assistants (e.g., Schawinski’s initial classification effort). If this approach does not scale well, the researcher can move to a human computation project with many participants. But, for a certain volume of data, pure human effort will not be enough. At that point, researchers need to build a computer-assisted human computation system in which human classifications are used to train a machine learning model that can then be applied to virtually unlimited amounts of data.