5.2.1 Galaxy Zoo

Galaxy Zoo combines the efforts of many non-expert volunteers to classify a million galaxies.

Galaxy Zoo grew out of a problem faced by Kevin Schawinski, a graduate student in Astronomy at the University of Oxford in 2007. Simplifying quite a bit, Schawinski was interested in galaxies, and galaxies can be classified by their morphology—elliptical or spiral—and by their color—blue or red. At the time, conventional wisdom among astronomers was that spiral galaxies, like our Milky Way, were blue in color (indicating youth) and that elliptical galaxies were red in color (indicating old age). Schawinski doubted this conventional wisdom. He suspected that while this pattern might be true in general, there were probably a sizable number of exceptions, and that by studying lots of these unusual galaxies—the ones that did not fit the expected pattern—he could learn something about the process through which galaxies formed.

Thus, what Schawinski needed in order to overturn conventional wisdom was a large set of morphologically classified galaxies; that is, galaxies that had been classified as either spiral or elliptical. The problem, however, was that existing algorithmic methods for classification were not yet good enough to be used for scientific research; in other words, classifying galaxies was, at that time, a problem that was hard for computers. Therefore, what was needed was a large number of human-classified galaxies. Schawinski undertook this classification problem with the enthusiasm of a graduate student. In a marathon session of seven 12-hour days, he was able to classify 50,000 galaxies. While 50,000 galaxies may sound like a lot, it is actually only about 5% of the almost one million galaxies that had been photographed in the Sloan Digital Sky Survey. Schawinski realized that he needed a more scalable approach.

Fortunately, it turns out that the task of classifying galaxies does not require advanced training in astronomy; you can teach someone to do it pretty quickly. In other words, even though classifying galaxies is a task that was hard for computers, it was pretty easy for humans. So, while sitting in a pub in Oxford, Schawinski and fellow astronomer Chris Lintott dreamed up a website where volunteers would classify images of galaxies. A few months later, Galaxy Zoo was born.

At the Galaxy Zoo website, volunteers would undergo a few minutes of training; for example, learning the difference between a spiral and elliptical galaxy (Figure 5.2). After this training, the volunteers had to pass a relatively easy quiz—correctly classifying 11 of 15 galaxies with known classifications—and then the volunteer would begin real classification of unknown galaxies through a simple web-based interface (Figure 5.3). The transition from volunteer to astronomer would take place in less than 10 minutes and only required passing the lowest of hurdles, a simple quiz.

Figure 5.2: Examples of the two main types of galaxies: spiral and elliptical. The Galaxy Zoo project used more than 100,000 volunteers to categorize more than 900,000 images. Source: www.galaxyzoo.org.

Figure 5.3: Input screen where volunteers were asked to classify a single image. Source: www.galaxyzoo.org.

Galaxy Zoo attracted its initial volunteers after the project was featured in a news article, and in about six months the project grew to involve more than 100,000 citizen scientists, people who participated because they enjoyed the task and they wanted to help advance astronomy. Together, these 100,000 volunteers contributed a total of more than 40 million classifications, with the majority of the classifications coming from a relatively small, core group of participants (Lintott et al. 2008).

Researchers who have experience hiring undergraduate research assistants might immediately be skeptical about data quality. While this skepticism is reasonable, Galaxy Zoo shows that when volunteer contributions are correctly cleaned, debiased, and aggregated, they can produce high-quality results (Lintott et al. 2008). An important trick for getting the crowd to create professional-quality data is redundancy; that is, having the same task performed by many different people. In Galaxy Zoo, there were about 40 classifications per galaxy; researchers using undergraduate research assistants could never afford this level of redundancy and therefore would need to be much more concerned with the quality of each individual classification. What the volunteers lacked in training, they made up for with redundancy.

Even with multiple classifications per galaxy, however, combining the set of volunteer classifications to produce a consensus classification is tricky. Because very similar challenges arise in most human computation projects, it is helpful to briefly review the three steps that the Galaxy Zoo researchers used to produce their consensus classifications. First, the researchers “cleaned” the data by removing bogus classifications. For example, people who repeatedly classified the same galaxy—something that would happen if they were trying to manipulate the results—had all their classifications discarded. This and other similar cleaning removed about 4% of all classifications.
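To make the cleaning step concrete, here is a minimal sketch in Python. The table layout, column names, and toy data are my own invention, not the Galaxy Zoo pipeline: it flags volunteers who submitted more than one classification for the same galaxy and then discards all of their classifications, not just the duplicates.

```python
import pandas as pd

# Hypothetical classification log: one row per (volunteer, galaxy, label).
df = pd.DataFrame({
    "volunteer_id": ["v1", "v2", "v2", "v3", "v2"],
    "galaxy_id":    ["g1", "g1", "g1", "g2", "g2"],
    "label":        ["spiral", "elliptical", "elliptical", "spiral", "spiral"],
})

# Find volunteers who classified the same galaxy more than once.
counts = df.groupby(["volunteer_id", "galaxy_id"]).size()
flagged = counts[counts > 1].index.get_level_values("volunteer_id").unique()

# Drop every classification from a flagged volunteer, not just the duplicates.
clean = df[~df["volunteer_id"].isin(flagged)]
print(clean)  # all of v2's rows are gone
```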

Second, after cleaning, the researchers needed to remove systematic biases in classifications. Through a series of bias detection studies embedded within the original project—for example, showing some volunteers the galaxy in monochrome instead of color—the researchers discovered several systematic biases, such as a tendency to classify faraway spiral galaxies as elliptical galaxies (Bamford et al. 2009). Adjusting for these systematic biases is extremely important because averaging many contributions does not remove systematic bias; it only removes random error.
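A quick simulation makes this last point concrete. In the sketch below, all the numbers are invented: each volunteer's judgment is modeled as the truth plus a systematic bias plus random noise, and averaging 10,000 judgments eliminates the noise while leaving the bias fully intact.

```python
import random

random.seed(42)
true_value = 0.30   # hypothetical true quantity being estimated
bias = -0.10        # hypothetical systematic bias shared by all volunteers
n_volunteers = 10_000

# Each judgment = truth + systematic bias + random noise.
judgments = [true_value + bias + random.gauss(0, 0.2)
             for _ in range(n_volunteers)]

avg = sum(judgments) / n_volunteers
print(f"average of {n_volunteers} judgments: {avg:.3f}")  # ~0.20, not 0.30
# The random noise averages away, but the estimate is still off by the
# bias, which is why an explicit bias adjustment is needed.
```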

Finally, after debiasing, the researchers needed a method to combine the individual classifications to produce a consensus classification. The simplest way to combine classifications for each galaxy would be to choose the most common classification. However, this approach would give each volunteer equal weight, and the researchers suspected that some volunteers were better at classification than others. Therefore, the researchers developed a more complex iterative weighting procedure that attempts to automatically detect the best classifiers and give them more weight.
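The exact procedure is described in Lintott et al. (2008); the sketch below, with made-up votes, illustrates only the general idea of one such iterative scheme: alternate between computing a weighted consensus for each galaxy and re-weighting each volunteer by how often they agree with that consensus.

```python
from collections import defaultdict

# Hypothetical votes: votes[galaxy] = list of (volunteer, label) pairs.
votes = {
    "g1": [("v1", "spiral"), ("v2", "spiral"), ("v3", "elliptical")],
    "g2": [("v1", "elliptical"), ("v2", "elliptical"), ("v3", "spiral")],
    "g3": [("v1", "spiral"), ("v2", "spiral"), ("v3", "spiral")],
}

weights = defaultdict(lambda: 1.0)  # start by weighting everyone equally

for _ in range(10):  # a few rounds suffice for this toy example
    # Step 1: weighted consensus per galaxy.
    consensus = {}
    for galaxy, pairs in votes.items():
        tally = defaultdict(float)
        for volunteer, label in pairs:
            tally[label] += weights[volunteer]
        consensus[galaxy] = max(tally, key=tally.get)

    # Step 2: re-weight each volunteer by agreement with the consensus.
    agreement = defaultdict(list)
    for galaxy, pairs in votes.items():
        for volunteer, label in pairs:
            agreement[volunteer].append(label == consensus[galaxy])
    for volunteer, hits in agreement.items():
        weights[volunteer] = sum(hits) / len(hits)

print(dict(weights))  # v3, who often disagrees, ends up down-weighted
print(consensus)
```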

Thus, after a three-step process—cleaning, debiasing, and weighting—the Galaxy Zoo research team had converted 40 million volunteer classifications into a set of consensus morphological classifications. When these Galaxy Zoo classifications were compared with three previous smaller-scale attempts by professional astronomers, including the classification by Schawinski that helped to inspire Galaxy Zoo, there was strong agreement. Thus, the volunteers, in aggregate, were able to provide high-quality classifications at a scale that the researchers could not match (Lintott et al. 2008). In fact, by having human classifications for such a large number of galaxies, Schawinski, Lintott, and others were able to show that only about 80% of galaxies follow the expected pattern—blue spirals and red ellipticals—and numerous papers have been written about this discovery (Fortson et al. 2011).

Given this background, we can now see how Galaxy Zoo follows the split-apply-combine recipe, the same recipe that is used for most human computation projects. First, a big problem is split into chunks. In this case, the problem of classifying a million galaxies is split into a million problems of classifying one galaxy. Next, an operation is applied to each chunk independently. In this case, a volunteer would classify each galaxy as either spiral or elliptical. Finally, the results are combined to produce a consensus result. In this case, the combine step included the cleaning, debiasing, and weighting to produce a consensus classification for each galaxy. Even though most projects use this general recipe, each of the steps needs to be customized to the specific problem being addressed. For example, in the human computation project described below, the same recipe will be followed, but the apply and combine steps will be quite different.
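In code, the skeleton of the recipe might look like the following sketch, where the functions, the toy labels, and the assumed 80% volunteer accuracy are all illustrative placeholders of my own, not Galaxy Zoo's actual code:

```python
import random
from collections import Counter

random.seed(0)
TRUE_LABELS = {"g1": "spiral", "g2": "elliptical", "g3": "spiral"}  # toy truth

def split(survey):
    # Split: one big problem becomes many one-galaxy chunks.
    return list(survey)

def apply_one(galaxy):
    # Apply: a single volunteer's noisy classification (80% accurate here).
    truth = TRUE_LABELS[galaxy]
    other = "elliptical" if truth == "spiral" else "spiral"
    return truth if random.random() < 0.8 else other

def combine(labels):
    # Combine: here just a majority vote; Galaxy Zoo also cleaned,
    # debiased, and weighted before combining.
    return Counter(labels).most_common(1)[0][0]

chunks = split(TRUE_LABELS)
# Redundancy: each chunk is classified ~40 times before combining.
consensus = {g: combine([apply_one(g) for _ in range(41)]) for g in chunks}
print(consensus)
```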

For the Galaxy Zoo team, this first project was just the beginning. Very quickly they realized that even though they were able to classify close to a million galaxies, this scale is not enough to work with newer digital sky surveys, which could produce images of about 10 billion galaxies (Kuminski et al. 2014). To handle an increase from 1 million to 10 billion—a factor of 10,000—Galaxy Zoo would need to recruit roughly 10,000 times more participants. Even though the number of volunteers on the Internet is large, it is not infinite. Therefore, the researchers realized that if they were going to handle ever-growing amounts of data, a new, even more scalable, approach was needed.

Therefore, Manda Banerji—working with Kevin Schawinski, Chris Lintott, and other members of the Galaxy Zoo team—started teaching computers to classify galaxies. More specifically, using the human classifications created by Galaxy Zoo, Banerji et al. (2010) built a machine learning model that could predict the human classification of a galaxy based on the characteristics of the image. If this machine learning model could reproduce the human classifications with high accuracy, then it could be used by Galaxy Zoo researchers to classify an essentially infinite number of galaxies.

The core of Banerji and colleagues’ approach is actually pretty similar to techniques commonly used in social research, although that similarity might not be clear at first glance. First, Banerji and colleagues converted each image into a set of numeric features that summarize its properties. For example, for images of galaxies there could be three features: the amount of blue in the image, the variance in the brightness of the pixels, and the proportion of non-white pixels. The selection of the correct features is an important part of the problem, and it generally requires subject-area expertise. This first step, commonly called feature engineering, results in a data matrix with one row per image and, in this example, three columns describing that image. Given the data matrix and the desired output (e.g., whether the image was classified by a human as an elliptical galaxy), the researcher estimates the parameters of a statistical model—for example, something like a logistic regression—that predicts the human classification based on the features of the image. Finally, the researcher uses the parameters in this statistical model to produce estimated classifications of new galaxies (Figure 5.4). To think of a social analog, imagine that you had demographic information about a million students, and you know whether they graduated from college or not. You could fit a logistic regression to this data, and then you could use the resulting model parameters to predict whether new students are going to graduate from college. In machine learning, this approach—using labeled examples to create a statistical model that can then label new data—is called supervised learning (Hastie, Tibshirani, and Friedman 2009).
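As a toy illustration of this pipeline, the sketch below fits a logistic regression (via scikit-learn) to the three made-up features from the example above; every feature value and label here is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per image; columns are the three toy features from the text:
# (amount of blue, variance of pixel brightness, proportion of non-white pixels).
X_train = np.array([
    [0.80, 0.10, 0.65],  # images volunteers labeled as spiral
    [0.75, 0.15, 0.60],
    [0.20, 0.40, 0.30],  # images volunteers labeled as elliptical
    [0.15, 0.35, 0.25],
])
y_train = np.array([1, 1, 0, 0])  # 1 = spiral, 0 = elliptical (human labels)

# Estimate the model parameters from the human-labeled examples...
model = LogisticRegression().fit(X_train, y_train)

# ...then use the fitted model to classify new, unlabeled galaxies.
X_new = np.array([[0.70, 0.12, 0.55]])
print(model.predict(X_new))        # predicted class, e.g. [1] (spiral)
print(model.predict_proba(X_new))  # estimated class probabilities
```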

Figure 5.4: Simplified description of how Banerji et al. (2010) used the Galaxy Zoo classifications to train a machine learning model to do galaxy classification. Images of galaxies were converted into a matrix of features. In this simplified example there are three features (the amount of blue in the image, the variance in the brightness of the pixels, and the proportion of non-white pixels). Then, for a subset of the images, the Galaxy Zoo labels are used to train a machine learning model. Finally, the machine learning model is used to estimate classifications for the remaining galaxies. I call this kind of project a second-generation human computation project because, rather than having humans solve a problem, they have humans build a dataset that can be used to train a computer to solve the problem. The advantage of this computer-assisted approach is that it enables you to handle essentially infinite amounts of data using only a finite amount of human effort.

The features in Banerji et al.’s (2010) machine learning model were more complex than those in my toy example—for example, she used features like “de Vaucouleurs fit axial ratio”—and her model was not a logistic regression but an artificial neural network. Using her features, her model, and the consensus Galaxy Zoo classifications, she was able to create weights on each feature, and then use these weights to make predictions about the classification of galaxies. For example, her analysis found that images with low “de Vaucouleurs fit axial ratio” were more likely to be spiral galaxies. Given these weights, she was able to predict the human classification of a galaxy with reasonable accuracy.
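To show how small the change from my toy example is, here is the previous sketch with the logistic regression swapped for a small artificial neural network (scikit-learn's MLPClassifier). Banerji et al.'s (2010) actual network architecture and features were different, so this is only an illustrative sketch on the same invented data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Same toy feature matrix and human labels as in the previous sketch.
X_train = np.array([
    [0.80, 0.10, 0.65],
    [0.75, 0.15, 0.60],
    [0.20, 0.40, 0.30],
    [0.15, 0.35, 0.25],
])
y_train = np.array([1, 1, 0, 0])  # 1 = spiral, 0 = elliptical

# A small neural network; its weights are learned from the human labels.
nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
nn.fit(X_train, y_train)

print(nn.predict(np.array([[0.70, 0.12, 0.55]])))  # predicted class for a new galaxy
```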

The work of Banerji et al. (2010) turned Galaxy Zoo into what I would call a second-generation human computation system. The best way to think about these second-generation systems is that rather than having humans solve a problem, they have humans build a dataset that can be used to train a computer to solve the problem. The amount of data needed to train the computer can be so large that it requires a human mass collaboration to create. In the case of Galaxy Zoo, the neural networks used by Banerji et al. (2010) required a very large number of human-labeled examples in order to build a model that was able to reliably reproduce the human classification.

The advantage of this computer-assisted approach is that it enables you to handle essentially infinite amounts of data using only a finite amount of human effort. For example, a researcher with a million human-classified galaxies can build a predictive model that can then be used to classify a billion or even a trillion galaxies. If there are enormous numbers of galaxies, then this kind of human-computer hybrid is really the only possible solution. This infinite scalability is not free, however. Building a machine learning model that can correctly reproduce the human classifications is itself a hard problem, but fortunately there are already excellent books dedicated to this topic (Hastie, Tibshirani, and Friedman 2009; Murphy 2012; James et al. 2013).

Galaxy Zoo shows the evolution of many human computation projects. First, a researcher attempts the project by herself or with a small team of research assistants (e.g., Schawinski’s initial classification effort). If this approach does not scale well, the researcher can move to a human computation project where many people contribute classifications. But, for a certain volume of data, pure human effort will not be enough. At that point, researchers need to build second-generation systems where human classifications are used to train a machine learning model that can then be applied to virtually unlimited amounts of data.