This book began in 2005 in a basement at Columbia University. At the time, I was a graduate student, and I was running an online experiment that would eventually become my dissertation. I’ll tell you all about the scientific parts of that experiment in chapter 4, but now I’m going to tell you about something that’s not in my dissertation or in any of my papers. And it’s something that fundamentally changed how I think about research. One morning, when I came into my basement office, I discovered that overnight about 100 people from Brazil had participated in my experiment. This simple experience had a profound effect on me. At that time, I had friends who were running traditional lab experiments, and I knew how hard they had to work to recruit, supervise, and pay people to participate in these experiments; if they could run 10 people in a single day, that was good progress. However, with my online experiment, 100 people participated while I was sleeping. Doing your research while you are sleeping might sound too good to be true, but it isn’t. Changes in technology—specifically the transition from the analog age to the digital age—mean that we can now collect and analyze social data in new ways. This book is about doing social research in these new ways.
This book is for social scientists who want to do more data science, data scientists who want to do more social science, and anyone interested in the hybrid of these two fields. Given who this book is for, it should go without saying that it is not just for students and professors. Although I currently work at a university (Princeton), I’ve also worked in government (at the US Census Bureau) and in the tech industry (at Microsoft Research), so I know that there is a lot of exciting research happening outside of universities. If you think of what you are doing as social research, then this book is for you, no matter where you work or what kind of techniques you currently use.
As you might have noticed already, the tone of this book is a bit different from that of many other academic books. That’s intentional. This book emerged from a graduate seminar on computational social science that I have taught at Princeton in the Department of Sociology since 2007, and I’d like it to capture some of the energy and excitement from that seminar. In particular, I want this book to have three characteristics: I want it to be helpful, future-oriented, and optimistic.
Helpful: My goal is to write a book that is helpful for you. Therefore, I’m going to write in an open, informal, and example-driven style. That’s because the most important thing that I want to convey is a certain way of thinking about social research. And, my experience suggests that the best way to convey this way of thinking is informally and with lots of examples. Also, at the end of each chapter, I have a section called “What to read next” that will help you transition into more detailed and technical readings on many of the topics that I introduce. In the end, I hope this book will help you both do research and evaluate the research of others.
Future-oriented: This book will help you to do social research using the digital systems that exist today and those that will be created in the future. I started doing this kind of research in 2004, and since then I’ve seen many changes, and I’m sure that over the course of your career you will see many changes too. The trick to staying relevant in the face of change is abstraction. For example, this is not going to be a book that teaches you exactly how to use the Twitter API as it exists today; instead, it is going to teach you how to learn from big data sources (chapter 2). This is not going to be a book that gives you step-by-step instructions for running experiments on Amazon Mechanical Turk; instead, it is going to teach you how to design and interpret experiments that rely on digital age infrastructure (chapter 4). Through the use of abstraction, I hope this will be a timeless book on a timely topic.
Optimistic: The two communities that this book engages—social scientists and data scientists—have very different backgrounds and interests. In addition to these science-related differences, which I talk about in the book, I’ve also noticed that these two communities have different styles. Data scientists are generally excited; they tend to see the glass as half full. Social scientists, on the other hand, are generally more critical; they tend to see the glass as half empty. In this book, I’m going to adopt the optimistic tone of a data scientist. So, when I present examples, I’m going to tell you what I love about these examples. And, when I do point out problems with the examples—and I will do that because no research is perfect—I’m going to try to point out these problems in a way that is positive and optimistic. I’m not going to be critical for the sake of being critical—I’m going to be critical so that I can help you create better research.
We are still in the early days of social research in the digital age, but I’ve seen some misunderstandings that are so common that it makes sense for me to address them here, in the preface. From data scientists, I’ve seen two common misunderstandings. The first is thinking that more data automatically solves problems. However, for social research, that has not been my experience. In fact, for social research, better data—as opposed to more data—seem to be more helpful. The second misunderstanding that I’ve seen from data scientists is thinking that social science is just a bunch of fancy talk wrapped around common sense. Of course, as a social scientist—more specifically as a sociologist—I don’t agree with that. Smart people have been working hard to understand human behavior for a long time, and it seems unwise to ignore the wisdom that has accumulated from this effort. My hope is that this book will offer you some of that wisdom in a way that is easy to understand.
From social scientists, I’ve also seen two common misunderstandings. First, I’ve seen some people write off the entire idea of social research using the tools of the digital age because of a few bad papers. If you’re reading this book, you’ve probably already read a bunch of papers that use social media data in ways that are banal or wrong (or both). I have too. However, it would be a serious mistake to conclude from these examples that all digital-age social research is bad. In fact, you’ve probably also read a bunch of papers that use survey data in ways that are banal or wrong, but you don’t write off all research using surveys. That’s because you know that there is great research done with survey data, and in this book I’m going to show you that there is also great research done with the tools of the digital age.
The second common misunderstanding that I’ve seen from social scientists is to confuse the present with the future. When we assess social research in the digital age—the research that I’m going to describe—it’s important that we ask two distinct questions: “How well does this style of research work right now?” and “How well will this style of research work in the future?” Researchers are trained to answer the first question, but for this book I think the second question is more important. That is, even though social research in the digital age has not yet produced massive, paradigm-changing intellectual contributions, the rate of improvement of digital-age research is incredibly rapid. It is this rate of change—more than the current level—that makes digital-age research so exciting to me.
Even though that last paragraph might seem to offer you potential riches at some unspecified time in the future, my goal is not to sell you on any particular type of research. I don’t personally own shares in Twitter, Facebook, Google, Microsoft, Apple, or any other tech company (although, for the sake of full disclosure, I should mention that I have worked at, or received research funding from, Microsoft, Google, and Facebook). Throughout the book, therefore, my goal is to remain a credible narrator, telling you about all the exciting new stuff that is possible, while guiding you away from a few traps that I’ve seen others fall into (and occasionally fallen into myself).
The intersection of social science and data science is sometimes called computational social science. Some consider this to be a technical field, but this will not be a technical book in the traditional sense. For example, there are no equations in the main text. I chose to write the book this way because I wanted to provide a comprehensive view of social research in the digital age, including big data sources, surveys, experiments, mass collaboration, and ethics. It turned out to be impossible to cover all these topics and provide technical details about each one. Instead, pointers to more technical material are given in the “What to read next” section at the end of each chapter. In other words, this book is not designed to teach you how to do any specific calculation; rather, it is designed to change the way that you think about social research.
How to use this book in a course
As I said earlier, this book emerged in part from a graduate seminar on computational social science that I’ve been teaching since 2007 at Princeton. Since you might be thinking about using this book to teach a course, I thought that it might be helpful for me to explain how it grew out of my course and how I imagine it being used in other courses.
For several years, I taught my course without a book; I’d just assign a collection of articles. While students were able to learn from these articles, the articles alone were not leading to the conceptual changes that I was hoping to create. So I would spend most of the time in class providing perspective, context, and advice in order to help the students see the big picture. This book is my attempt to write down all that perspective, context, and advice in a way that has no prerequisites—in terms of either social science or data science.
In a semester-long course, I would recommend pairing this book with a variety of additional readings. For example, such a course might spend two weeks on experiments, and you could pair chapter 4 with readings on topics such as the role of pre-treatment information in the design and analysis of experiments; statistical and computational issues raised by large-scale A/B tests at companies; design of experiments specifically focused on mechanisms; and practical, scientific, and ethical issues related to using participants from online labor markets, such as Amazon Mechanical Turk. It could also be paired with readings and activities related to programming. The appropriate choice between these many possible pairings depends on the students in your course (e.g., undergraduate, master’s, or PhD), their backgrounds, and their goals.
A semester-length course could also include weekly problem sets. Each chapter has a variety of activities that are labeled by degree of difficulty: easy, medium, hard, and very hard. Also, I’ve labeled each problem by the skills that it requires: math, coding, and data collection. Finally, I’ve labeled a few of the activities that are my personal favorites. I hope that within this diverse collection of activities, you’ll find some that are appropriate for your students.
In order to help people using this book in courses, I’ve started a collection of teaching materials such as syllabuses, slides, recommended pairings for each chapter, and solutions to some activities. You can find these materials—and contribute to them—at http://www.bitbybitbook.com.