Data matters. Learning to think critically about data is a fundamental skill. How much money is donated to political campaigns? How do polls help us forecast elections? Does exposure to Spanish-speakers affect attitudes toward immigration? We need data to answer these questions – to describe, to predict, and to infer.

This course, an introduction to data science, will teach you how to think with data, how to gather information from a variety of sources, how to import that information into a project, how to tidy and transform the variables and observations, how to visualize, how to model relationships, how to assess uncertainty, and how to communicate your findings. Each student will complete a final project, the first entry in their professional portfolio. Our main focus is data associated with political science, but we will also use examples from education, economics, public health, sociology, sports, finance, climate and any other topic which students find interesting.

We use the R programming language, RStudio, Git, GitHub and DataCamp.

Prerequisites: None. You must have a laptop with R, RStudio and Git installed.

Logistics: Class meets in Tsai Auditorium from 12:00 to 1:15 on T/TH during Fall Semester 2019.

Ulysses and the Sirens, 1891, by John William Waterhouse. Homer’s Odyssey recounts the decade-long journey home of Odysseus (known as Ulysses in Roman myths) after the Trojan War. Although Ulysses’s ultimate goal is his kingdom of Ithaca, he does not shy away from adventure along the way. The Sirens use their enchanting voices to lure unwary sailors to their deaths. Ulysses wanted to hear their songs. He instructed his men to fill their ears with beeswax and to tie him to the mast.

Course Metaphor

The central metaphor for this class is Ulysses and the Sirens. You are Ulysses. Ithaca is the future you want. The Sirens are the many distractions of the modern world. I am the rope.

Course Staff

Preceptor David Kane; CGIS South 310; 646-644-3626; office hours Thursday from 1:30 to 4:00, generally held in Fisher Commons. Please address me as “Preceptor,” not “David,” nor “Preceptor Kane,” nor “Professor Kane,” nor “Mr. Kane,” nor, worst of all, “Dr. Kane.”

Teaching Fellows: Georgie Evans, Sascha Riaz and Alice Xu. Georgie is your first stop for DataCamp questions. Sascha handles all Google Sheets issues. Ask Alice about all your Piazza and Git/GitHub problems.

Course Assistants: Shivani Aggarwal, Enxhi Buxheli, Claire Fridkin, Jack Luby, Seeam Noor, Kodi Obika, Dillon Smith, and Celine Vendler.

Course Philosophy

No Lectures: The worst method for transmitting information from my head to yours is for me to lecture you. There are no lectures. We work on problems together during class. You learn soccer with the ball at your feet. You learn about data with your hands on the keyboard.

R Everyday: Learning a new programming language is like learning a new human language: You will practice (almost) every day.

Cold Calling: I call on students during class. This keeps every student involved, makes for a more lively discussion and helps to prepare students for the real world, in which you can’t hide in the back row.

Community: You will meet more Harvard students than you would in a normal course. Awkwardness in the pursuit of community is no vice. You will probably learn the names of more students in this course than in all your other courses combined.

Organized by House: We use geography to create a community. During class, you will sit with students from your house, grouped with other houses near yours. You will work with different peers each class.

Bayesian: The philosophy of this class is unapologetically Bayesian.

Professionalism: We use professional tools. Your workflow will be very similar to the workflow involved in paid employment. Your problem sets and final project will be public, the better to impress others with your abilities. We will learn the “full cycle” of how to draw inferences from data and communicate those inferences to others.

Monologues: I give brief monologues, designed to explain specific topics that have confused students in the past. I hope to never talk for more than 5 minutes straight.

Speakers: Data scientists, from both industry and academia, will speak with us. If there is someone you would like to meet, talk to me about it and we can invite them!

Millism: Political disputes are not the focus of this class but, when such topics arise, I will insist that we follow John Stuart Mill’s advice: “He who knows only his own side of the case, knows little of that. His reasons may be good, and no one may have been able to refute them. But if he is equally unable to refute the reasons on the opposite side; if he does not so much as know what they are, he has no ground for preferring either opinion.”

No Cost: Every reading/tool we use is free. You don’t have to spend any money on this class. Some activities, like DataCamp and GitHub, have paid options which provide more services, but you never have to use them. Don’t give anyone your credit card number.

Workload: The course should take about 10 to 15 hours a week, outside of class meetings, exams and the final project. This is an expected average across the class as a whole. It is not a maximum. Some students will end up spending much less time. Others will spend much, much more.

Course Policies

Late Days: Assignments (DataCamp, Problem Sets and Final Project Milestones) are always due at 11:55 PM, unless specified otherwise. An assignment is a day late if it is turned in any time after it was due (even 5 minutes after) but within 24 hours. After that, it is two days late, and so on. You have 5 late days in total. These may be used for any assignment, except for the four exams and final project Demo Day. You should save your late days. If you use them early in the semester for no particularly good reason and then, later in the semester, have an actual emergency, we will not be sympathetic. We will not give you extra late days in such a situation. (That isn’t fair to your classmates, and we are all about fairness.) We will just, mentally, move the late days you wasted so that they cover your actual emergency. You will now be penalized for being late earlier in the semester, when you did not have a good reason for tardiness. See below for more details.

Missing Class: You expect me to be present for lecture. I expect the same of you. There is nothing more embarrassing, for both of us, than for me to call your name and have you not be there to answer. But, at the same time, conflicts arise. It is never a problem to miss class if, for example, you are out of town or have a health issue. Simply put an X by your name in the Google attendance sheet. Failure to do so will decrease your participation points, as will missing too many classes, even with notification.

Major Emergencies: We are not monsters. If you are hit with a major emergency — the sort of thing that necessitates the involvement of your Resident Dean — we will be sympathetic. We require a signed letter (not an e-mail) from your Resident Dean as documentation.

Role of Teaching Fellows: The TFs run Study Halls, grade all assignments, keep track of late days, deal with emergencies and so on. Go to them first with any problems.

Role of Course Assistants: The CAs only run Study Halls. They are not involved in grading assignments and can make no commitments about how the TFs will grade. Never ask a CA a question about grading. Instead, ask on Piazza and a TF will respond, or come to a TF privately with your question.

Exceptions: There may be a reason why you can’t adhere to class policies. For example, severe social anxiety may make being cold-called problematic. A learning disability may make take-home tests unfair. Whatever the situation, please seek me out for conversation. I am sure we can work out something! I will do whatever it takes to allow every Harvard student to participate (and thrive!) in this class.

Use your Harvard e-mail: Please use your official Harvard e-mail address for all aspects of this class, especially things like signing up for services like DataCamp, GitHub, and so on. Doing so makes it much easier for us to figure out who is doing what. This may not be easy if you already connect with these services but, even in that case, you should be able to add your Harvard e-mail address to your account.

Piazza: All general questions — those not of a personal nature — should be posted to Piazza so that all students can benefit from both the question and the answer(s).

Plagiarism: If you plagiarize, you will fail the course. See the Harvard College Handbook for Students for details.

Working with Others: Students are free (and encouraged) to discuss problem sets and their final projects with one another. However, you must hand in your own unique code and written work in all cases. Any copy/paste of another’s work is plagiarism. In other words, you can work with your friend, sitting side-by-side and going through the problem set question-by-question, but you must each type your own code. Your answers may be similar (obviously) but they must not be identical, or even identical’ish.

Git and GitHub: Analyzing data without using source control is like writing an essay without using a word processor — possible but not professional. We will do all our work using Git/GitHub.

DataCamp: We make extensive use of lessons from DataCamp. All DataCamp courses are graded pass/fail. Each week’s course(s) are due by Monday at 11:55 PM.

Readings: Assignments in a given week cover (approximately) the material that we will use that week, although DataCamp is a more precise guide to our in-class activities. Some students prefer to do such readings ahead of time, the better to prepare for class. Some students prefer to do the readings after those classes, the better to reinforce the material. Some students prefer to never do the readings. No matter what path you select, know that, when constructing/grading the problem sets, exams and final projects, we will assume that you understand all assigned material.

Optional Activities: The syllabus includes background readings and DataCamp assignments which students may find interesting. You do not have to do them.

Waite Rule: We don’t wear hats in the classroom. (Obviously, this prohibition does not apply to headgear of a religious nature.)

Computer Emergencies: We are not sympathetic about computer emergencies. You should keep all your work on GitHub, so it won’t matter if your computer explodes. If it does explode, you will lose only the work after your last push. You can then restart your work on a public computer (the basement of CGIS Knafel has machines with R/RStudio installed) or on your roommate’s computer.

GitHub Classroom: We use GitHub Classroom to distribute problem sets and exams. You will receive an e-mail with a link. Click on that link and a repo, with instructions, will be created. Do this as soon as you receive the e-mail. We don’t want GitHub problems to arise the night before the assignment is due.

Speakers: We follow a No Laptop Rule during speaker presentations. Close your laptops. Put down your phones. If you want to take notes, use a pen. We do this because we respect the speakers, want to give them our full attention, and are thankful that they have taken the time to talk with us.

Tardiness: We begin on time and end on time. Do not start gathering your belongings until class is over, especially when we have a speaker.

Credit: Gov 1005 fulfills the QRD requirement. You may also get concentration credit. This is true, obviously, for Government. It is also true in Statistics and in Psychology. I am happy to support students who want to petition other departments.

Announcements: You are responsible for any assignment/exam/deadline updates/changes which are either announced in class or promulgated via the course Canvas e-mail list. You are not responsible for every random post on Piazza.


Solo Participation: 5 points. This category relates to things you do alone in class. Missing class (without notifying us) or missing too many classes will cost you points, as will a failure to participate in class activities. We keep track of this via Google sheets, so be sure to fill them out when requested. Note that I do not care if you know the answer when I cold-call you. This plays no part in your grade.

Group Participation: 5 points. This category relates to activities you do with other students. Helping your fellow students, especially on Piazza, is the best form of group participation, as is volunteering for a class role. Be a good class citizen. Help your classmates during Study Halls. Do not shirk on group projects.

DataCamp Lessons: 5 points. Grades are pass/fail only. Given the level of the questions and the hints provided, it is essentially impossible not to get full credit as long as you make an honest effort. There are 8 weeks of DataCamp, so each week’s assignments (whether one course or more) count for 5/8 of a point.

Problem Sets: 25 points. The first problem set is worth 1 point. The remaining 8 are worth 3 points each. Problem sets after the first are distributed on Thursday and then due the following Wednesday at 11:55 PM. You are welcome to work on them with your friends but, first, you must personally type in every character in the work you submit and, second, you must list all the people you worked with. We define “work with” very broadly, to include minor interactions. You would certainly list anyone you sat nearby during Study Hall, for example.

Exams: 35 points total. The four exams are take-home. The first is worth 5 points and the others are each worth 10 points. They are open-book and open-web. Because students have different schedules, you can complete the exam any time within a four-day window starting after exam distribution. Late exams earn zero points.

Final Project: 25 points. Students will present their projects publicly at the end of the semester. They will then have the opportunity to incorporate feedback before submitting the final version. There are eight milestones for the projects, worth either 1 or 2 points, depending on difficulty.

Late Days: You may only use one late day on a given assignment. Hand it in after more than 24 hours and you get a zero on that assignment. But you still must hand it in! Everything must be completed. Late days accrue until you do. Each day late (beyond the five allowed) results in -1 point to your final score. This decrement is a point penalty, not a percentage penalty. In other words, each additional late day used outside of the allotted five will drop your final class grade by one grade point.
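
As a rough illustration of the arithmetic, the penalty rule can be sketched in R. The function name late_day_penalty and its arguments are made up for this example. It covers only the one-point-per-extra-day rule, not the zero for an assignment handed in more than 24 hours late, and the staff's actual bookkeeping may differ.

```r
# Hypothetical helper, not the staff's actual bookkeeping.
# Returns the points deducted from the final score for a given
# number of late days used across the whole semester.
late_day_penalty <- function(late_days_used, allowed = 5) {
  # Only days beyond the allotment cost anything, one point each
  max(0, late_days_used - allowed)
}

late_day_penalty(3)  # within the five allowed days
late_day_penalty(7)  # two days over the allotment
```

Under this sketch, a student who uses 7 late days over the semester would lose 2 points from the final score.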


The texts for the class are R for Data Science (R4DS) by Garrett Grolemund and Hadley Wickham, Statistical Inference via Data Science: A moderndive into R and the tidyverse (MD) by Chester Ismay and Albert Y. Kim, and Data Visualization: A practical introduction (DV) by Kieran Healy. These resources may also be helpful. All are free.

Final Project

Do you love soccer or wine or NYC politics? The final project provides you with an opportunity to study that topic in depth. Your final project will be, for most of you, the first item in your professional portfolio, something so impressive that you will be eager to show it to graduate schools or potential employers. You must show this work publicly, both on the web (viewable by all) and in person at our Demo Day. You will host your final project using Shiny Apps, a free service provided by RStudio. Make use of free statistical consulting from the Harvard Statistics Department and from IQSS.

You may combine this project with a research paper or other assignment from a different class. You automatically have my permission. But you must get explicit permission from the instructor for the other class as well.

It is not enough to simply use an already-assembled data set. Instead, you must combine data from a variety of different sources. Looking at your data-munging code will confirm for us that you have made an actual contribution. Imagine that your roommate also cares about soccer/wine/politics/whatever. You are building something that would interest her, something that will make her say, “That is cool! Let’s spend 30 minutes poking around with your data.” Projects without at least 10,000 data points are unlikely to be interesting enough, but feel free to convince us otherwise.

Projects must feature some statistical modelling. (modelDown is a useful tool.) Regression and Other Stories provides several examples of how to create, and document your creation of, such a model, e.g., section 13.5 (gun control) and section X (wells in Bangladesh).

The typical Shiny App will include three tabs. The “About” tab will provide background information about you and your data. The second tab will display your final model, and allow the user to change some of your assumptions and see the results. The third tab will be a detailed tour of the modeling choices you made and an explanation of why you made them.

All projects must include a two-minute video in which you explain what you have found and a three-page PDF which you must submit to this competition.

Possible Approaches

Most students will gather some data, estimate some models, and create a Shiny App. Good stuff! But there are other possible approaches:

Paper Replication

Read “Publication, Publication” (pdf) by Gary King. PS: Political Science and Politics, Vol. 39, No. 1 (Jan., 2006), pp. 119-125. King describes how to replicate the results of a published academic paper. See more details here. You will not be doing all of that! (Take Gov 1006 or Gov 2001 for that experience.) Instead, you will be creating a Shiny App which reproduces at least some of the key results of the paper and demonstrates what happens when changes are made in the modelling approach. How “robust” – to use Leamer’s terminology — are the results?

Read “Making the Most of Statistical Analyses: Improving Interpretation and Presentation” (pdf) by Gary King, Michael Tomz and Jason Wittenberg. American Journal of Political Science, Vol. 44, No. 2 (April, 2000), pp. 347-361. This is one of the most cited articles in political science in the last 20 years. Just redoing the analysis/graphics of a published article by making use of these techniques would make for an outstanding final project.

Original Data Collection

Students interested in a topic about which there is no publicly available data are welcome to collect their own data. This must be something much more substantive than just asking 100 students outside Annenberg about their favorite salad. Two categories of data work best. First, pick a topic which you truly care about. Second, pick something Harvard-specific. This Crimson article and this spring 2019 project are great examples of the latter.

Work with Other Classes

You are welcome to use data from other classes/projects in the creation of your final project. This includes thesis work. You automatically have permission from us to do this, but you must also obtain permission from the instructor of the other class.


Interested in doing a project which seems different from what we describe above? Come talk to us! The best projects involve topics which students are passionate about. If you really care about X, then we are eager to help you create a final project about X.

Prior Projects

Consider all the final projects from past semesters. Click on the project title to explore the Shiny App. Click on the student’s name to explore their Github repo. Note that, in prior years, the course had less of a statistical focus. So, these projects do not feature as much statistical modeling as yours will. Highlights:

Shivani Aggarwal: How Couples Meet. Visualizing the ways in which different kinds of U.S. couples meet and enter into relationships.

Neil Khurana: Harvard Dining. Archiving Harvard menus and exploring variations and repetition in meal choices.

Dasha Metropolitansky: First-Year Blocking Group Project. Harvard says it fosters a diverse community; trends in students’ housing indicate otherwise. This was a group project. The other group members were: Adiya Abdilkhay, Ilkin Bayramli, April Chen, Alistair Gluck, Christopher Milne, Neil Schrage and Stephanie Yao. Read more about the project here and here.

Christopher Onesti: Course Enrollment Statistics. This project presents an inside look and trend visualization regarding fall and spring undergraduate course enrollment data at Harvard.

Margaret Sun: Beyond The Stage. Various insights into the music group BTS.

Ruoqi Zhang: Settling the Dust: Censorship & Environmental Activism in China, 2012. What does social media data tell us about environmental awareness and censorship in China, 2012?

Maclaine Fields: Harvard Volleyball. I analyzed setting, serving, receiving, digging, and attacking results and created plots that show the setting tendencies and serving trajectories of Harvard Volleyball and its opponents.

Kemi Akenzua: Death Row Last Words. A closer look at the final words of people executed in Texas.


If you had tried to complete a data analysis project before taking this class, you would have done X well. Now that you have taken the class – now that you know how to describe, predict and infer – you will do Y well. The success (or failure) of the class can be measured by comparing Y with X.


Everything — DataCamp (Mondays), Problem Sets (Wednesdays), Milestones (Fridays) and Exams (Sundays) — is due at 11:55 PM, unless otherwise specified.

Rhythm of the Class

The class follows a steady weekly rhythm:

Sunday, 2:00 – 5:00 PM, Study Hall with Claire Fridkin, Dunster Dining Hall.
Sunday, 7:00 PM – 10:00 PM. Study Hall with Kodi Obika, Currier Dining Hall.
Sunday, 7:00 PM – 10:00 PM. Study Hall with Dillon Smith, Smith Center.
Monday 4:30 PM – 7:30 PM. Study Hall with Sascha Riaz, K108 in Knafel CGIS.
Monday 7:00 PM – 10:00 PM. Study Hall with Shivani Aggarwal, Science Center.
Monday 11:55 PM. DataCamp exercises due.
Tuesday 12:00 PM – 1:15 PM. Class.
Tuesday 5:00 PM – 8:00 PM. Study Hall with Alice Xu, Fisher Commons.
Tuesday 6:00 PM – 9:00 PM. Study Hall with Celine Vendler, Smith Center.
Tuesday 7:00 PM – 10:00 PM. Study Hall with Enxhi Buxheli, Lowell Dining Hall.
Tuesday 8:00 PM – 11:00 PM. Study Hall with Jack Luby, Winthrop Dining Hall.
Wednesday 2:00 PM – 5:00 PM. Study Hall with Georgina Evans, Fisher Commons.
Wednesday 7:00 PM – 10:00 PM. Study Hall with Seeam Noor, Smith Center.
Wednesday 11:55 PM. Problem set due.
Thursday 12:00 PM – 1:15 PM. Class.
Thursday 1:30 PM – 4:00 PM. Office Hours with Preceptor, Fisher Commons.
Thursday evening. Problem set due next week will be distributed.
Friday 11:55 PM. Final project milestones are due.
Sunday 11:55 PM. Exams, if distributed, are due.

Key Dates

Part 1: Tools and Framework

Problem Set #1 due Wednesday, September 11.
Final Project Milestone #1 due Friday, September 13.
Problem Set #2 due Wednesday, September 18.
Final Project Milestone #2 due Friday, September 20.
Problem Set #3 due Wednesday, September 25.
Exam #1 distributed on Wednesday, September 25 and due Sunday, September 29.

Part 2: Sampling and Inference

Final Project Milestone #3 due Friday, October 4.
Problem Set #4 due Wednesday, October 9.
Final Project Milestone #4 due Friday, October 11.
Problem Set #5 due Wednesday, October 16.
Final Project Milestone #5 due Friday, October 18.
Problem Set #6 due Wednesday, October 23.
Exam #2 distributed Wednesday, October 23 and due Sunday, October 27.

Part 3: Models

Final Project Milestone #6 due Friday, November 1.
Problem Set #7 due Wednesday, November 6.
Final Project Milestone #7 due Friday, November 8.
Problem Set #8 due Wednesday, November 13.
Final Project Milestone #8 due Friday, November 15.
Problem Set #9 due Wednesday, November 20.
Exam #3 distributed Wednesday, November 20 and due Sunday, November 24.

Part 4: Projects

Thanksgiving is Thursday, November 28.

Tuesday, December 3 is last day of classes.

Possible Demo Days: Tuesday, November 26; Tuesday, December 3; Wednesday, December 4; Monday, December 9.

Final project due Friday, December 13. Students must participate in this competition. The pdf which you submit must also be available from your Shiny App. Place the url for that PDF in the Google sheet.

Exam #4 distributed Wednesday, December 4 and due Sunday, December 15.


Part 1: Tools and Framework

Data science involves both inputs and outputs. We bring in data from somewhere to analyze and, once we have some answers, distribute our results. During Part 1, we will bring in data from R packages, downloaded text files and text files on the web. We will distribute our results as html files to the course staff, requests for help (from strangers) using reproducible examples and animated graphics posted to the web.

Week 1: September 2

Shopping Week

You are Ulysses. I am the rope.

Install R, RStudio and Git on your laptop. Start on the DataCamp assignments. They are due on Monday, September 9 at 11:55 PM. Sign up for a meeting with a member of the Course Staff. This will fulfill the first milestone, due September 13, for the final project. We will use RStudio Cloud on Tuesday and individual laptops on Thursday. Although it is not officially due till Monday, please try to do Introduction to the Tidyverse for Thursday’s class.


R4DS: Chapters 1, 2, 3, 4, 6 and 8.
DV: Chapters 1 and 2.

Week 2: September 9


You can never look at your data too much. – Mark Engerman

We will review some basic R operations including constructing vectors with c() and subsetting elements with []. We will mention useful functions, like slice() and pull(), which are not covered in the DataCamp assignments. We will learn how to create an R project in RStudio. The first problem set will be distributed on Tuesday, via Github Classroom, and completed during class. We will also learn how to recover from git mistakes. We will introduce the “potential outcomes” framework and review the fundamental problem of causal inference.
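
For those who want to try these operations before class, here is a minimal sketch. The numbers and the small tibble named polls are invented for illustration, and the second half assumes the dplyr package is installed.

```r
# Build a vector with c() and subset it with []
turnout <- c(0.52, 0.61, 0.48, 0.70)  # made-up numbers

turnout[2]              # the second element
turnout[c(1, 4)]        # the first and fourth elements
turnout[turnout > 0.5]  # only the elements above 0.5

# slice() and pull() come from dplyr (assumed to be installed)
library(dplyr)

polls <- tibble(
  state   = c("MA", "NH", "ME", "VT"),
  turnout = turnout
)

# slice() keeps rows by position and returns a tibble
middle_rows <- polls %>% slice(2:3)

# pull() extracts a single column as a plain vector
turnout_vec <- polls %>% pull(turnout)
```

The distinction worth noticing is the return type: slice() hands back a smaller tibble, while pull() hands back a vector that can be passed to functions like mean().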


R4DS: Chapters 5 and 7.
DV: Chapter 3.


Remember: DataCamp assignments are due Monday at 11:55 PM.

Introduction to the Tidyverse
Introduction to Shell for Data Science only first chapter, “Manipulating files and directories”
Introduction to Git for Data Science only first chapter, “Basic workflow”
Visualization Best Practices in R
Communicating with Data in the Tidyverse, only third chapter, “Introduction to RMarkdown”


Problem Set #1 due September 11 at 11:55 PM. We will complete and submit this problem set in class on Tuesday. Its purpose is to ensure that everyone has a working computer, understands Git/GitHub and can compile an R Markdown document.

Final Project Milestone #1 due Friday, September 13. Speak with any member of the course staff (Course Assistant or Teaching Fellow) about your final project. Bring your laptop. Most staff study halls are Sunday through Wednesday. Do not wait until Friday morning. Record the name of the person you met with in the Final Project Google Sheet. This is the first of three required meetings. No need to prepare for this meeting. But it is important to start thinking about what you want to do. This also provides an opportunity to meet some of the course staff. Sign-ups will be distributed via Piazza. Consider scheduling an interview with Hugh Truslow, Head, Social Sciences and Visualization, Harvard University. No one at Harvard knows more about potential data sources.

Optional: RStudio Essentials Videos. Most relevant for us are “Writing code in RStudio”, “Projects in RStudio” and “Github and RStudio”. Again, these are optional! But they are very useful for students who find traditional lectures to be a helpful supplement to classroom practice. See also GitHub Classroom Guide for Students.

Week 3: September 16

Seeking Help

The best data science superpower is knowing how to ask a question. – Mara Averick

We will learn how to produce a reproducible example — a “reprex” — in order to help strangers to help us. We will discuss the slogan “no causation without manipulation” and implement a permutation test.
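
To give a flavor of what a permutation test involves, here is a small sketch in base R. The data are invented, the names (treated, control, diff_means) are placeholders, and the version we build in class may look different.

```r
set.seed(1005)  # arbitrary seed so the shuffles are repeatable

# Hypothetical outcomes for a treated group and a control group
treated <- c(12, 15, 14, 18, 16)
control <- c(10, 11, 13, 9, 12)

outcome <- c(treated, control)
group   <- rep(c("treated", "control"),
               times = c(length(treated), length(control)))

# Difference in mean outcome between the two groups
diff_means <- function(y, g) {
  mean(y[g == "treated"]) - mean(y[g == "control"])
}

observed <- diff_means(outcome, group)

# Shuffle the group labels many times. If treatment had no effect,
# the labels are arbitrary, so each shuffle gives a difference we
# could have seen by chance alone.
n_perm   <- 5000
permuted <- replicate(n_perm, diff_means(outcome, sample(group)))

# Two-sided p-value: the share of shuffles with a difference at
# least as large, in absolute value, as the one we observed
p_value <- mean(abs(permuted) >= abs(observed))
```

The logic is that shuffling the labels breaks any link between treatment and outcome, so the shuffled differences show what chance alone can produce. A small p_value means the observed difference would be unusual in that world.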


R4DS: Chapters 9, 10, 11.
DV: Chapter 4.


Problem Set #2 due Wednesday, September 18.

Final Project Milestone #2 due Friday, September 20. GitHub repo with Rmd (and knitted html) which discusses the pros and cons of two projects from past years. At least one project should be one which did extensive data gathering/cleaning. You should not select the same projects for commentary as your friends have. Students generally write about a paragraph for each project.

Optional: Causality, Chapter 2 of Quantitative Social Science by Kosuke Imai, especially pages 46 – 63.

Speaker: September 19: Mara Averick, RStudio. Lunch to follow.

Week 4: September 23


Workflow: you should have one. – Jenny Bryan

We will learn how to make engaging animations. We will discuss the meaning of “average effect” and related terms.


The Cognitive Style of Powerpoint by Edward Tufte.

MD: Chapters 1 through 5.
R4DS: Chapters 12, 13, 14, 15 and 16.
DV: Chapter 5.


Joining Data in R with dplyr
String Manipulation in R with stringr. Chapter 1, String basics
String Manipulation in R with stringr. Chapter 2, Introduction to stringr


Problem Set #3 due Wednesday, September 25.
Exam #1 due Sunday, September 29.

Optional: The Unix Workbench, chapters 1 – 6.

Speaker: September 26: Will Kurt, Data Science Manager at Wayfair. Lunch to follow.

Part 2: Sampling and Inference

Week 5: September 30


Lot of points were taken off for small errors that I did not see as pedagogically important. – Gov 1005 student


MD: Chapter 8 Sampling.
R4DS: Chapters 17, 18, 19, 20 and 21.


Because of the exam, DataCamps are not due till Wednesday at 11:55 PM.

Introduction to Function Writing in R
Foundations of Functional Programming with purrr


Final Project Milestone #3 due Friday, October 4. Speak with any member of the course staff (CA or TF) about your final project. This is the second of three required meetings. In addition to this meeting, you must create a rough GitHub repo, with at least some of your raw data (or details of your plan to get the data), and a reproducible html which provides a brief description of the data: where you got it, what you have done with it so far and what you plan to do. You may change your project completely, all the way until Demo Day. But you are still responsible for meeting these milestones, even if you know you are going to pivot. Your data cannot be from a single source. Typing library(fivethirtyeight) is not enough!

Optional: Visual and Statistical Thinking: Displays of Evidence for Making Decisions by Edward Tufte.

Speaker: October 3: Alex Albright, Harvard University. Lunch to follow.

Week 6: October 7

Confidence Intervals

Comment as a service to the dumbest possible version of your future self. – Alex Albright



Problem Set #4 due Wednesday, October 9.

Final Project Milestone #4 due Friday, October 11. Create a beautiful ggplot2 graphic which uses some of your data.

Optional: “Rich State, Poor State, Red State, Blue State: What’s the Matter with Connecticut?” by Gelman et al.

Speaker: October 10: Angela Bassa, iRobot. Lunch to follow.

Week 7: October 14


I stopped teaching frequentist methods when I decided they could not be learned. – Donald Berry

How many twins are there at Harvard? Does a subgroup of voters from a larger sample provide the best estimate for the entire subgroup population?


Chapter 1 in Think Bayes (pdf) by Allen Downey.
Chapter 2 in Doing Bayesian Data Analysis (pdf) by John Kruschke.


Beginning Bayes in R. Note that there are a handful of exercises which use simple linear regression, which we don’t cover for two weeks. If you are unfamiliar with this topic, feel free to just “Show Answer” on those questions.


Problem Set #5 due Wednesday, October 16.

Final Project Milestone #5 due Friday, October 18. Create an Rmd/html which provides a draft of your About page. All of your data processing should be complete. Remember: You must gather data from two or more different sources. Learning how to source, clean and combine data is one of the goals of the project. On almost any topic, there are useful tables of information on Wikipedia. See here and here for advice.

Optional: Statistical Rethinking: A Bayesian Course with Examples in R and Stan (pdf) by Richard McElreath. Chapter 1.

Speaker: October 17: Stefanie Costa Leabo, Chief Data Officer, City of Boston. Lunch to follow.

Week 8: October 21



“Let’s Take the Con Out of Econometrics,” by Edward E. Leamer. The American Economic Review, Vol. 73, No. 1 (March, 1983), pp. 31-43. link


Problem Set #6 due Wednesday, October 23.
Exam #2 due Sunday, October 27.

Optional: “A Balanced Perspective on Prediction and Inference for Data Science in Industry” by Nathan Sanders; DV: Chapter 7; and Introduction to Mapping with sf.

Speaker: October 24: Nathan Sanders, Chief Scientist at Warner Media Applied Analytics. Lunch to follow.

Part 3: Models

Week 9: October 28


Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works. – Andrew Gelman


MD: Chapter 6 Basic Regression.
DV: Chapter 6.


Note that, although the DataCamps this week focus on univariate regression, they both also use examples from multivariate regression which we, officially, do not cover until next week. You may find it useful to look ahead at MD Chapter 7. Because of the exam, DataCamps are not due till Wednesday at 11:55 PM.

Modeling with Data in the Tidyverse
Bayesian Regression Modeling with rstanarm


Final Project Milestone #6 due Friday, November 1. Speak with any member of the course staff (CA or TF) about your final project. This is the third of three required meetings. By 11:55 PM, you must have a working Shiny App, just to demonstrate that you can get something up and running. This does not have to be working for your meeting. Add the name of the person you met with and the url of your Shiny App to the appropriate Google sheet.

Optional: Statistical Rethinking: A Bayesian Course with Examples in R and Stan (pdf) by Richard McElreath. Chapter 2.

Week 10: November 4

Multivariate Regression

Amateurs test. Professionals summarize.


MD: Chapter 7 Multiple Regression.
DV: Chapter 8.


Problem Set #7 due Wednesday, November 6.

Final Project Milestone #7 due Friday, November 8. Cleaned-up GitHub repo. This is your chance to make use of the feedback you received during last week’s meeting.

Optional: “The Bayesian New Statistics” by John K. Kruschke and Torrin M. Liddell.

Week 11: November 11




Problem Set #8 due Wednesday, November 13.

Final Project Milestone #8 due Friday, November 15. Working rough draft of your final project. Demo Day is still two weeks away, and you can completely pivot if you want, but you must have a fairly complete version of your current project.

Optional: A Guide to Bayesian Statistics

Speaker: November 14: Andrew Therriault, formerly of Facebook and the City of Boston. Lunch to follow.

Week 12: November 18

Machine Learning

Put your work on the web. – David Sparks


Chapters 28, 29, 30 and 32 from Introduction to Data Science by Rafael A. Irizarry.


Problem Set #9 due Wednesday, November 20.
Exam #3 due Sunday, November 24.

Optional: Chapter 7 (pdf) from The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition.

Speaker: November 21: David Sparks, Director of Basketball Analytics for the Boston Celtics. Lunch to follow.

Part 4: Projects

The main focus of the last two class meetings is the final projects. Note that we only have one meeting (on Tuesday) during each of the last two weeks.

Week 13: November 25


Fitting is easy. Prediction is hard. – Richard McElreath

It is tough to get much done on the Tuesday of Thanksgiving week. Main focus will be on Shiny Apps.


R4DS: Chapters 26, 27, 28, 29 and 30.
Shiny tutorials

Optional: Shiny Apps User Guide

Week 14: December 2

Demo Day

A public portfolio of high quality work is better than a Harvard degree.

Last day of classes. Make memes, provide course feedback, discuss final projects and have fun!

Classroom Seating

Record the name of your Partner in the Google sheet for the day. Each person does this, even though doing so leads to duplication.

Assignment Details


There are several ways to earn group participation points in class.

Imperator: Each Group will have a class Imperator, someone who helps to organize a study group, coordinate activities with other Groups, and so on. Imperators make everyone feel welcome, first by learning everyone’s name and, then, by introducing classmates to each other.

Actarium: We need note-takers, two students for each day. They work separately, but will still be partnered with someone so they can participate in coding. After class, the two actarii get together and create one unified set of notes, which must be posted to Piazza before 11:55 PM that evening.

Welcome Committee: We organize a Welcome Committee of five students for each speaker. See below for the duties associated with this job.

Piazza: Answering your classmates’ questions on Piazza is the best way to earn participation points. Be a good class citizen! If you find a (meaningful!) typo in a problem set or exam, please post it to Piazza. The first student to do so earns many participation points.

Final Project Milestones

Final project milestones are always due at 11:55 PM of the designated date, which is always a Friday. You may use late days, except for Demo Day and the final due date. All submissions are made via a Google sheet, the url of which will be distributed on Piazza. There are no milestones due during exam periods. The milestones which occur in the week after an exam (Oct 4th and Nov 1st) are major milestones, requiring more work and, therefore, worth 2 points each. Other milestones are worth 1 point each. All 8 milestones together count for 10 points. Demo Day counts for 10 points. The final project submission is worth 15 points and is due on December 13. Fill out the Google sheets correctly!

Study Halls

Study Halls (SH) are run by Course Assistants (CAs), undergraduates who have taken the class in the past. They are one of the most popular parts of the course. Teaching Fellows (TFs) also run Study Halls, although these will often have more of an office hours flavor. Students who make the most use of these resources do better in class, and enjoy it more, than students who do not. Course Staff (CS) is a term which incorporates both course assistants and teaching fellows.


At every SH, the CS will ensure that everyone knows everyone else’s name. This class is a community and community begins with names. The process starts with the first student arriving and sitting at the table. They and the CS chat. (It is always nice for the student to take the initiative and introduce themselves to the CS. Remembering all your names is hard!) A second person arrives and sits at the same table, followed by introductions. Persons 3 and 4 arrive. More introductions. Help your CS by introducing yourself, even if you are 75% sure they remember your name. Be friendly!

At this point, the table is filled. Another person arrives. Instead of that person starting a new table, CS gives the new student their spot and moves their belongings to a new table. No student ever sits alone. The CS hovers around the table until more students arrive and start filling out table #2. And so on. At each stage, students are responsible for, at a minimum, introducing themselves to the CS and, even better, to the other students. Best is when students who are already present shower newly arriving students with welcomes and introductions.

Help Us Help You

CS will, to the greatest extent possible, never just give you the answer. Something like “Use annotate()” might solve your immediate problem, but it does not set you up for success during the exams — when we won’t be around to serve as your personal oRacles — much less for the rest of your life.

Instead, we will take the time to show you how to find the answer yourself. This starts with how to search for help, especially when you are not sure what you are looking for. This is more art than science, but adding certain strings (like “R”, “tidyverse”, or “ggplot”) to the search often helps. Then, we provide advice about which locations are the highest quality (anything to do with RStudio or tidyverse), which locations are less good than they initially appear, and which are difficult to use (Stack Overflow). We then explain the best way to make use of what you find.

We also point you directly to the best resources, especially to R for Data Science by Garrett Grolemund and Hadley Wickham and to Data Visualization: A practical introduction by Kieran Healy. We won’t say: “Just use starts_with().” Instead, we will ask, “Have you read Section 5.4 of R4DS, involving the use of select()?” Yes, this will require an extra five minutes of your time. But every extra minute you spend reading a high quality reference is a minute well spent.

We also help you learn how to seek help from others. There is a good way to ask for help on Piazza or Stack Overflow, generally involving the use of a reproducible example which highlights your precise problem, and a bad way.
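As a rough illustration of what a reproducible example might look like, here is a minimal sketch in base R. The data frame `scores` and its columns are made up for this illustration; they are not taken from any problem set. The idea is that a classmate can paste the whole block into a fresh R session and see exactly the behavior you are asking about.

```r
# Small, made-up data that reproduces the issue. No external files or
# packages are needed, so anyone can run this as-is.
scores <- data.frame(
  student = c("a", "b", "c"),
  points  = c(10, NA, 7)
)

# The problem: I expected a number here, but the result is NA.
mean(scores$points)

# What I have tried so far: dropping missing values gives a number.
mean(scores$points, na.rm = TRUE)
```

A post built around a block like this, plus one sentence on what you expected to happen, is usually much easier to answer than a screenshot or a description in words.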

Only if none of this works will we just tell you the answer.

Social Events

Socializing with students outside of class is fun. Joining/inviting me is optional and has no influence on your grade in the class, i.e., it earns you no participation points. The three main options are:

Restaurant Lunches

My wife and I host students in groups of 4 for lunch throughout the semester, sometimes via the Harvard Classroom to Table program and sometimes on our own dime. We organize this by House at the start and then open up spots to everyone later. Invitations to come. Dress is casual. Please be on time. The reservation will be under “Kane.” Just go straight to the table.

House Lunches

I enjoy having lunch with you after class, either in the CGIS cafe, Annenberg or at your House.

Faculty Dinners

I enjoy attending faculty dinners, so feel free to invite me to yours. My only request is that you also invite the other students in the class who live in your House. It is often fun to take over a table with a group of 4 or 5 or . . .


This course is inspired by STAT 545, created by the legendary Jenny Bryan. The pedagogical goals follow Don Rubin’s vision. Some of the slides and exercises come from Data Science in a Box, by Mine Çetinkaya-Rundel. Some of the in-class exercises are from Teaching Statistics: A Bag of Tricks by Andrew Gelman and Deborah Nolan. Kudos to authors like Garrett Grolemund and Hadley Wickham (R for Data Science), Kieran Healy (Data Visualization: A practical introduction), Chester Ismay and Albert Y. Kim (Statistical Inference via Data Science: A moderndive into R and the tidyverse) for making their books freely available. Thanks to Kosuke Imai for open sourcing several of the datasets from Quantitative Social Science: An Introduction. Slides were created via the R package xaringan by Yihui Xie. Many thanks to all the folks responsible for R, RStudio, Git, GitHub and DataCamp. This course would not be possible without their amazing contributions.