No course at Harvard does a better job of increasing your odds of getting the future — the internship, the job, the graduate school, the career — that you want.
Data matters. Learning to think critically about data is a fundamental skill. How much money is donated to political campaigns? Does exposure to Spanish-speakers affect attitudes toward immigration? What characteristics are associated with voting Republican? We need data to answer these questions – to describe, to predict, and to infer.
This course, an introduction to data science, will teach you how to think with data, how to gather information from a variety of sources, how to import that information into a project, how to tidy and transform the variables and observations, how to visualize, how to model relationships, how to assess uncertainty, and how to communicate your findings. Each student will complete a final project, the first entry in their professional portfolio. Our main focus is data associated with political science, but we will also use examples from education, economics, public health, sociology, sports, finance, climate and any other topic which students find interesting.
Prerequisites: None. You must have a laptop with R, RStudio and Git installed.
Logistics: Class meets from 12:00 PM to 1:15 PM on T/TH. Depending on enrollment, we will probably also have a session from 7:30 AM to 8:45 AM on T/TH for students living in Europe/Asia. I will teach both sessions.
The central metaphor for this class is Ulysses and the Sirens. You are Ulysses. Ithaca is where you hope to arrive after graduation. The Sirens are the many distractions of the modern world. I am the rope.
Preceptor David Kane; firstname.lastname@example.org; CGIS South 310; 646-644-3626; office hours Thursday from 9:00 to 11:00. Please address me as “Preceptor,” not “David,” nor “Preceptor Kane,” nor “Brah,” nor “Professor Kane,” nor “Mr. Kane,” nor, worst of all, “Dr. Kane.” I respond to e-mail within 24 hours. If I don’t, e-mail me again.
Head Teaching Fellow:
Teaching Fellows: Tyler Simko (email@example.com), Juan Dodyk (firstname.lastname@example.org), Dan Baissa (email@example.com), Alla Baranovsky (firstname.lastname@example.org) and Mitchell Kilborn (email@example.com).
Head Course Assistant: Claire Fridkin (firstname.lastname@example.org).
No Lectures: The worst method for transmitting information from my head to yours is for me to lecture you. There are no lectures. We work on problems together during class. You learn soccer with the ball at your feet. You learn about data with your hands on the keyboard.
Bayesian: The philosophy of this class is unapologetically Bayesian.
R Everyday: Learning a new programming language is like learning a new human language: You will practice (almost) every day.
Community: You will probably learn the names of more students in this course than in all your other courses combined. Awkwardness in the pursuit of community is no vice.
Professionalism: We use professional tools. Your workflow will be very similar to the workflow involved in paid employment. Your problem sets and final project will be public, the better to impress others with your abilities. High quality work will be shared with your classmates. We will learn the “full cycle” of how to draw inferences from data and communicate those inferences to others.
Cold Calling: I call on students during class. This keeps every student involved, makes for a more lively discussion and helps to prepare students for the real world, in which you can’t hide in the back row. Want to be left alone? Don’t take this course.
Recitations: We do not have normal sections. Instead, you will spend 60 minutes each week meeting with your assigned TF in a small group in the first half of the semester. Recall what I just said about (not) being left alone in this course. In the second half of the semester, you will have one-on-one meetings with your TF for 20 minutes each week, focused on your final project.
Millism: Political disputes are not the focus of this class but, when such topics arise, I will insist that we follow John Stuart Mill’s advice: “He who knows only his own side of the case, knows little of that. His reasons may be good, and no one may have been able to refute them. But if he is equally unable to refute the reasons on the opposite side; if he does not so much as know what they are, he has no ground for preferring either opinion.”
Teaching to Learn: My main goal is not to teach you how to do X. That is easy! More importantly, in a few months, I won’t be around to teach you Y. My goal is to teach you how to teach yourself, how to figure out X and Y and Z on your own. That is harder! Much of the pedagogy of the course, especially my insistence that you work on topics not covered in lecture, is driven by this goal. You will find it frustrating.
Workload: The course should take about 10 hours per week, outside of meetings, exams and the final project. This is an expected average across the class as a whole. It is not a maximum. Some students will end up spending much less time. Others will spend much, much more.
Late Days: Assignments (Tutorials, Problem Sets and Final Project Milestones) are always due at 11:55 PM, unless specified otherwise. An assignment is a day late if it is turned in any time after it was due (even 5 minutes after) but within 24 hours. After that, it is two days late, and so on. You have 5 late days in total. Late days may be used for any assignment, except the four exams and the final project Demo Day. You should save your late days. If you use them early in the semester for no particularly good reason and then, later in the semester, have an actual emergency, we will not be sympathetic. We will not give you extra late days in such a situation. (That isn’t fair to your classmates, and we are all about fairness.) We will just, mentally, move the late days you wasted so that they cover your actual emergency. You will now be penalized for being late earlier in the semester, when you did not have a good reason for tardiness. You may only use one late day on a given assignment. Hand it in after more than 24 hours and you get a zero on that assignment. But you still must hand it in! Everything must be completed. Late days accrue until you do. Each day late (beyond the five allowed) subtracts 1 point from your final score. This decrement is a point penalty, not a percentage penalty. In other words, each additional late day used outside of the allotted five will drop your final class average by one point out of 100.
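The late-day arithmetic above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the staff's actual bookkeeping: the function names and the example dates are hypothetical, and the sketch ignores the separate rule that an assignment more than 24 hours late earns a zero.

```python
from datetime import datetime
import math

FREE_LATE_DAYS = 5          # allowance stated in the policy
SECONDS_PER_DAY = 24 * 60 * 60


def days_late(due: datetime, submitted: datetime) -> int:
    """Count late days: any time past the deadline rounds up to a full day."""
    seconds_over = (submitted - due).total_seconds()
    if seconds_over <= 0:
        return 0  # on time
    return math.ceil(seconds_over / SECONDS_PER_DAY)


def final_score_penalty(total_late_days: int) -> int:
    """Points (out of 100) lost for late days beyond the free allowance."""
    return max(0, total_late_days - FREE_LATE_DAYS)


# Hypothetical example: a problem set due Wednesday at 11:55 PM.
due = datetime(2020, 9, 23, 23, 55)
five_minutes_over = days_late(due, datetime(2020, 9, 24, 0, 0))
just_past_one_day = days_late(due, datetime(2020, 9, 25, 0, 0))
```

Under these assumptions, a submission five minutes past the deadline already costs one late day, and a student who accumulates seven late days over the semester loses two points from the final score.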
Submissions: All tutorials, problem sets, milestones and exams are turned in via Canvas. Late days are assigned on the basis of the official Canvas submission time.
Missing Class: You expect me to be present for lecture. I expect the same of you. There is nothing more embarrassing, for both of us, than for me to call your name and have you not be there to answer. But, at the same time, conflicts arise. It is never a problem to miss class if, for example, you are out of town or have a health issue. Simply put an X by your name in the Google absence sheet and send me and your TF an e-mail. Failure to do so will decrease your participation points, as will missing too many classes, even with notification. There is no need to put a reason in the sheet. An X is enough.
Major Emergencies: We are not monsters. If you are hit with a major emergency — the sort of thing that necessitates the involvement of your Resident Dean — we will be sympathetic. Speak with your TF.
Monologues: I give brief monologues, designed to explain specific topics that have confused students in the past. I hope to never talk for more than 5 minutes straight.
No Cost: Every reading/tool we use is free. You don’t have to spend any money on this class. Some activities, like Shinyapps and GitHub, have paid options which provide more services, but you never have to use them. Don’t give anyone your credit card number.
Remind Me: In conversations outside of class, a student will often ask an important question or raise an issue of general interest. These topics should be brought to the attention of other students. I will ask you to “Remind me” about it. This means that, in the next class, you must raise your hand when I ask for reminders and then remind me! Couldn’t I just write it down in my notes? Perhaps. But learning how to raise your hand/voice in a big class is a useful skill. This is your opportunity to practice.
Role of Teaching Fellows: The TFs run their Recitations, approve all grades, keep track of late days, deal with emergencies and so on. Go to them first with any problems. You will be assigned to work closely with a specific TF — your “assigned” TF — but you may ask other TFs for help as well.
Role of Course Assistants: The CAs run Study Halls. They can make no commitments about how the TFs will assign the final grade on a problem set, milestone or exam. Never ask a CA a question about grading. Instead, ask on Slack and a TF will respond, or come to a TF privately with your question.
Exceptions: There may be a reason why you can’t adhere to class policies. For example, severe social anxiety may make being cold-called problematic. A learning disability may make take-home tests unfair. Whatever the situation, please seek me out for conversation. I am sure we can work out something! I will do whatever it takes to allow every Harvard student to thrive in this class.
Use your Harvard e-mail: Please use your official Harvard e-mail address for all aspects of this class, especially things like signing up for services like shinyapps, GitHub, and so on. Doing so makes it much easier for us to figure out who is doing what. This may not be easy if you already connect with these services but, even in that case, you should be able to add your Harvard e-mail address to your account.
Slack: All general questions — those not of a personal nature — should be posted to Slack so that all students can benefit from both the question and the answer(s).
Plagiarism: If you plagiarize, you will fail the course. See the Harvard College Handbook for Students for details.
Working with Others: Students are free (and encouraged) to discuss problem sets and their final projects with one another. However, you must hand in your own unique code and written work in all cases. Any copy/paste of another’s work is plagiarism. In other words, you can work with your friend, sitting side-by-side and going through the problem set question-by-question, but you must each type your own code. Your answers may be similar (obviously) but they must not be identical, or even identical’ish.
Git and GitHub: Analyzing data without using source control is like writing an essay without using a word processor — possible but not professional. We will do all our work using Git/GitHub.
Readings: Assignments in a given week cover (approximately) the material that we will use that week. I will not hesitate to cold-call students with questions about the readings. Do them.
Optional Activities: The syllabus includes background readings, videos and materials which students may find interesting. You do not have to do them.
Computer Emergencies: We are not sympathetic about computer emergencies. You should keep all your work on GitHub, so it won’t matter if your computer explodes. If it does explode, you will lose only the work after your last push. You can then restart your work on a public computer (the basement of CGIS Knafel has machines with R/RStudio installed) or on your roommate’s computer.
GitHub Classroom: We use GitHub Classroom to distribute problem sets and exams. You will receive an e-mail with a link. Click on that link and a repo, with instructions, will be created. Do this as soon as you receive the e-mail. We don’t want GitHub problems to arise the night before the assignment is due.
Tardiness: We begin on time and end on time.
Credit: Gov 50 fulfills the QRD requirement. You may also get concentration credit. This is true, obviously, for Government. It is also true in Statistics, Psychology, Sociology, and Social Studies. I am happy to support students who want to petition other departments.
Announcements: You are responsible for any assignment/exam/deadline updates/changes which are either announced in class or promulgated via the course Canvas e-mail list. The official Preceptor’s Notes on Slack are important, but we will e-mail them to you. You are not responsible for every other random post on Slack. In fact, you can ignore Slack completely, if you want.
Solo Participation: 5 points. This category relates to things you do alone in class. Missing class (without notifying us) or missing too many classes will cost you points, as will a failure to participate in class activities. Note that I do not care if you know the answer when I cold-call you. This plays no part in your grade.
Group Participation: 5 points. This category relates to activities you do with other students. Helping your fellow students, especially on Slack, is the best form of group participation, as is volunteering for a class role like scribe. Be a good class citizen. Help your classmates during Study Halls. Do not shirk on group projects.
Tutorials: 5 points. Grades are pass/fail only. Given the level of the questions and the hints provided, it is essentially impossible not to get full credit as long as you make an honest effort.
Problem Sets: 25 points. The first problem set is worth 1 point. The remaining 8 are worth 3 points each. Problem sets after the first are distributed on Thursday and then due the following Wednesday at 11:55 PM. You are welcome to work on them with your friends but, first, you must personally type in every character in the work you submit and, second, you must list all the people you worked with. We define “work with” very broadly, to include minor interactions. You would certainly list anyone you sat nearby during Study Hall, for example.
Exams: 35 points total. The four exams are take-home and unhackable. The first is worth 5 points and the others are each worth 10 points. They are open-book and open-web. Because students have different schedules, you can complete the exam any time within a four-day window starting after exam distribution. Late exams earn zero points. You may not seek or receive help on the exam from a person, e.g., asking a roommate or posting at RStudio Community. You may use any written materials from the class, including problem set answers. If you have a question, ask on Slack. Teaching staff (not other students) will answer it.
Final Project: 25 points. Students will present their projects publicly at the end of the semester. They will then have the opportunity to incorporate feedback before submitting the final version. There are eight milestones for the projects, each worth one point. Demo Day (which includes a review of your code) is worth 7 points. The final project submission is worth 10 points. Follow this advice.
Calculation: Each problem set, milestone and exam is graded out of a maximum score of 20, regardless of its weight in the final grade calculation. For example, both Exam 2 and Milestone 2 are graded out of 20, but the former is worth ten times as much to your final grade.
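As a rough illustration of how a score out of 20 could translate into final-grade points, here is a short Python sketch. It assumes simple linear scaling (raw score divided by 20, times the item's weight), which the syllabus does not spell out; the function name and the example scores are hypothetical. The category weights are the ones listed above.

```python
MAX_RAW_SCORE = 20  # every graded item is scored out of 20

# Category weights in final-grade points, as listed in the grading breakdown.
CATEGORY_WEIGHTS = {
    "solo_participation": 5,
    "group_participation": 5,
    "tutorials": 5,
    "problem_sets": 25,   # 1 + 8 * 3
    "exams": 35,          # 5 + 3 * 10
    "final_project": 25,  # 8 milestones + 7 (Demo Day) + 10 (submission)
}


def contribution(raw_score: float, weight: float) -> float:
    """Final-grade points earned by one item, assuming linear scaling."""
    return raw_score / MAX_RAW_SCORE * weight


# Hypothetical example: the same raw score of 18 on two different items.
exam_2 = contribution(18, weight=10)      # Exam 2 is worth 10 points
milestone_2 = contribution(18, weight=1)  # Milestone 2 is worth 1 point
```

With these assumed inputs, the same raw score contributes ten times as many final-grade points on Exam 2 as on Milestone 2, and the category weights add up to 100.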
Do you love soccer or wine or NYC politics? The final project provides you with an opportunity to study that topic in depth. Your final project will be, for most of you, the first item in your professional portfolio, something so impressive that you will be eager to show it to graduate schools or potential employers. You must show this work publicly, both on the web (viewable by all) and in person at our Demo Day. You will host your final project using Shiny Apps. Make use of free statistical consulting from the Harvard Statistics Department and from IQSS. Read this advice if you are working with data larger than 100 megabytes. Consider scheduling an interview with Hugh Truslow (email@example.com), Head, Social Sciences and Visualization, Harvard University. No one at Harvard knows more about potential data sources. Visualization Specialist Jessica Cohen-Tanugi (firstname.lastname@example.org) is a great person to talk to about your graphics. Explore the final projects from past semesters.
Most students will gather some data, estimate some models, and create a Shiny App. Good stuff! But there are other possible approaches:
Students interested in a topic about which there is no publicly available data are welcome to collect their own data. This must be something much more substantive than just asking 100 students outside Annenberg about their favorite salad. Two categories of data work best. First, pick a topic which you truly care about. Second, pick something Harvard-specific. This Crimson article and these class projects — spring 2019 and fall 2019 — are great examples of the latter.
You are welcome to use data from your other classes in the creation of your final project. This includes thesis work. You automatically have permission from us to do this, but you must also obtain permission from the instructor of the other class.
Interested in doing a project which seems different from what we describe above? Come talk to me! The best projects involve topics which students are passionate about. If you really care about X, then we are eager to help you create a final project about X. Examples: participation in the NFL Big Data Bowl, submitting Numerai forecasts or entering a Kaggle competition.
If you had tried to complete a data analysis project before taking this class, you would have done X well. Now that you have taken the class – now that you know how to describe, predict and infer – you will do Y well. The success (or failure) of the class can be measured by comparing Y with X.
Everything — Tutorials (Mondays), Problem Sets (Wednesdays), Milestones (Fridays) and Exams (Sundays) — is due at 11:55 PM, unless otherwise specified.
The class follows a steady weekly rhythm:
Monday 11:55 PM. Tutorials are due.
Tuesday 12:00 PM – 1:15 PM. Class.
Wednesday 11:55 PM. Problem sets are due.
Thursday 9:00 – 11:00 AM. Office Hours with Preceptor.
Thursday 12:00 PM – 1:15 PM. Class.
Thursday evening. Problem set due next week will be distributed.
Friday 11:55 PM. Final project milestones are due.
Sunday 11:55 PM. Exams, if distributed, are due.
Tutorial #1 due Monday, September 7.
Tutorial #2 due Monday, September 14. First TF group Recitation during the week of September 14.
Problem Set #1 due Wednesday, September 16. Completed together in class on the 15th.
Milestone #1 due Friday, September 18.
Tutorial #3 due Monday, September 21.
Problem Set #2 due Wednesday, September 23.
Milestone #2 due Friday, September 25.
Tutorial #4 due Monday, September 28.
Problem Set #3 due Wednesday, September 30.
Exam #1 distributed on Thursday, October 1 and due Sunday, October 4.
Tutorial #5 due Monday, October 5.
Milestone #3 due Friday, October 9.
Tutorial #6 due Monday, October 12.
Problem Set #4 due Wednesday, October 14.
Milestone #4 due Friday, October 16.
Tutorial #7 due Monday, October 19.
Problem Set #5 due Wednesday, October 21.
Milestone #5 due Friday, October 23.
Tutorial #8 due Monday, October 26.
Problem Set #6 due Wednesday, October 28.
Exam #2 distributed Thursday, October 29 and due Sunday, November 1.
Tutorial #9 due Monday, November 2.
Milestone #6 due Friday, November 6.
Tutorial #10 due Monday, November 9.
Problem Set #7 due Wednesday, November 11.
Milestone #7 due Friday, November 13.
Tutorial #11 due Monday, November 16.
Problem Set #8 due Wednesday, November 18.
Exam #3 distributed Thursday, November 19 and due Sunday, November 22.
Thanksgiving, November 26.
Milestone #8 due Monday, November 30. Last Day of class is Thursday, December 3.
Final project due Friday, December 11.
Exam #4 distributed Saturday, December 12 and due Sunday, December 20.
Data science involves both inputs and outputs. We bring in data from somewhere to analyze and, once we have some answers, distribute our results. During Part 1, we will bring in data from R packages, downloaded text files and files which reside on the web. We will distribute our results as html files to the course staff, requests for help (from strangers) using reproducible examples and animated graphics posted to the web.
Workflow: you should have one. – Jenny Bryan
You are Ulysses. I am the rope.
Readings: Chapter 1 Visualization and Appendix on Tools
Assignment: Tutorial 1 due Monday at 11:55 PM
You can never look at your data too much. – Mark Engerman
We will learn how to create an R project in RStudio. The first problem set will be distributed on Tuesday, via GitHub Classroom, and completed during class. We will also learn how to recover from Git mistakes. You will have your first Recitation with your Teaching Fellow this week.
Assignment: Tutorial 2 due Monday at 11:55 PM
Readings: Chapter 2 Tidyverse, Getting Help and Rpubs
Problem Set #1 due September 16 at 11:55 PM. We will complete and submit this problem set in class on Tuesday. Its purpose is to ensure that everyone has a working computer, understands Git/GitHub and can compile an R Markdown document.
Final Project Milestone #1 due Friday, September 18. Speak with a Teaching Fellow about your final project. Bring your laptop. Submit information via Canvas. No need to prepare. But it is important to start thinking about what you want to do. Google Dataset Search is a good way to find data. See also these resources.
Optional: RStudio Essentials Videos. Most relevant for us are “Writing code in RStudio”, “Projects in RStudio” and “Github and RStudio”. Again, these are optional! But they are very useful for students who find traditional lectures to be a helpful supplement to classroom practice. See also GitHub Classroom Guide for Students.
The best data science superpower is knowing how to ask a question. – Mara Averick
We will introduce the “potential outcomes” framework and review the fundamental problem of causal inference. We will discuss the slogan “no causation without manipulation.” We will learn how to produce a reproducible example — a “reprex” — in order to help strangers to help us. We send a thank-you note.
Assignment: Tutorial 3 due Monday at 11:55 PM
Readings: Rubin Causal Model, Getting Help and Shiny
Problem Set #2 due Wednesday, September 23.
Final Project Milestone #2 due Friday, September 25. GitHub repo with Rmd (and knitted html) which discusses the pros and cons of two projects from past years. At least one project should be one which did extensive data gathering/cleaning. You should not select the same projects for commentary as your friends have. Students generally write about a paragraph for each project. The Rmd/html file should include the url for your repo. The only thing you are submitting is the html, via Canvas.
Optional: RStudio Webinar on Reprex. Again, these are optional! But they are very useful for students who find traditional lectures to be a helpful supplement to classroom practice. Causality, Chapter 2 of Quantitative Social Science by Kosuke Imai, especially pages 46 – 63.
I stopped teaching frequentist methods when I decided they could not be learned. – Donald Berry
THERE IS NO SUCH THING AS PROBABILITY. — Bruno de Finetti
Milestone #3: due Friday, October 9. Create a new GitHub repo. This is the first version of your final project. (There will be many more to come.) Write an Rmd which provides a draft of your About page. (Naming it about.Rmd is wise.) Knit that Rmd into an html and submit the html via Canvas. The Rmd should include the url to your repo, should we want to examine it. Discuss all your data sources in the Rmd. (If you are gathering Harvard data, you should have a draft of your survey questions.) With luck, you will have gathered all your data and placed it in the repo. (This will generally be done with a different Rmd, like gather.Rmd, in your repo which contains the code which actually downloads your data.) You should have processed your data. (It is OK if you have not gotten quite this far as long as you discuss your progress and your plan in the About page.) Remember: You must gather data from two or more different sources. Learning how to source, clean and combine data is one of the goals of the project. On almost any topic, there are useful tables of information on Wikipedia. See here and here for advice.
Optional: Statistical Rethinking: A Bayesian Course with Examples in R and Stan (pdf) by Richard McElreath. Chapter 1.
Lot of points were taken off for small errors that I did not see as pedagogically important. – Gov 50 student
Problem Set #4 due Wednesday, October 14.
Final Project Milestone #4 due Friday, October 16. Create a rough GitHub repo, with at least some of your raw data or code which shows you working with data that is stored elsewhere or details of your plan to get the data. The repo must also include a reproducible html which provides a brief description of the data: where you got it, what you have done with it so far and what you plan to do. You may change your project completely, all the way until Demo Day. But you are still responsible for meeting these milestones, even if you know you are going to pivot. Your data cannot be from a single source. Typing library(fivethirtyeight) is not enough!
Comment as a service to the dumbest possible version of your future self. – Alex Albright
Readings: One Parameter and “Causal effect of intergroup contact on exclusionary attitudes” by Ryan Enos. PNAS March 11, 2014 111 (10) 3699-3704.
Problem Set #5 due Wednesday, October 21.
Final Project Milestone #5 due Friday, October 23. Create a beautiful graphic, using ggplot2 or another package of your choice, which uses some of your data.
Amateurs test. Professionals summarize.
Readings: Two Parameters and Animation.
Problem Set #6 due October 28.
Exam #2 distributed Thursday morning, October 29 and due Sunday November 1.
Optional: How to Start Shiny video tutorial.
Fitting is easy. Prediction is hard. – Richard McElreath
Readings: N Parameters
Final Project Milestone #6 due Friday, November 6. You must have a working Shiny app, just to demonstrate that you can get something up and running. It can be a mess, but it should have at least one graphic with your data.
Optional: Shiny tutorials.
Put your work on the web. – David Sparks
Thanksgiving week. No class Thursday.
Readings: Chapter 12
Final Project Milestone #8 due Monday, November 30. Working rough draft of your final project. You must have a fairly complete version of your current project: a Shiny app with your About page, your data and your model. Write a four-sentence elevator pitch for your project and e-mail it to your TF. This pitch is how you will begin each presentation during Demo Day.
A public portfolio of high quality work is better than a Harvard degree.
Last day of classes. Make memes, provide course feedback, discuss final projects and have fun!
Optional: Mastering Shiny by Hadley Wickham.
Important: Check your grades on Canvas, including your calculated late days. Any questions/complaints must be made before the last day of classes. After that, no changes will be made.
There are several ways to earn group participation points in class.
Imperator: Each Group will have a class Imperator, someone who helps to organize a study group, coordinate activities with other Groups, and so on. Imperators make everyone feel welcome, first by learning everyone’s name and, then, by introducing classmates to each other. Imperators must be able to get to class 10 minutes early. The role of the Imperator is to ensure that everyone in a Group knows everyone else.
Scribe: We need note-takers, four students for each day. The four scribes meet for a meal at least one week prior to their class, take a selfie and post it to Slack. During class, scribes take notes separately, but each is still partnered with someone so they can participate in coding. After class, the four scribes get together and create one unified set of notes, which must be posted to Slack before 11:55 PM that evening.
Slack: Answering your classmates’ questions on Slack is the best way to earn participation points. Be a good class citizen! If you find a (meaningful!) typo in a problem set or exam, please post it to Slack. The first student to do so earns many participation points.
Course assistants are available for one-on-one or small group meetings outside of their regularly scheduled Study Halls. These meetings may not be used to work on the next problem set. That is what regular Study Halls are for. The most common purpose of these meetings is to review the questions/answers from past problem sets and exams, the better to set a solid foundation for students moving forward. A second purpose is to provide help for the final projects. Process:
This course is inspired by STAT 545, created by the legendary Jenny Bryan. The pedagogical goals follow Don Rubin’s vision. Some of the classroom exercises come from Statistical Inference via Data Science: A moderndive into R and the tidyverse by Chester Ismay and Albert Y. Kim. Slides were created via the R package xaringan by Yihui Xie. Many thanks to all the folks responsible for R, RStudio, Git and GitHub. This course would not be possible without their amazing contributions.