Data matters. Learning to think critically about data is a fundamental skill. How much money is donated to political campaigns? How do polls help us forecast elections? Does exposure to Spanish-speakers affect attitudes toward immigration? We need data to answer these questions – to describe, predict and infer.

This course, an introduction to data science, will teach you how to think with data, how to gather information from a variety of sources, how to import that information into a project, how to tidy and transform the variables and observations, how to visualize, how to model relationships, how to assess uncertainty, and how to communicate your findings in a sophisticated fashion. Each student will complete a final project, the first entry in their professional portfolio. Our main focus is data associated with political science, but we will also use examples from education, economics, public health, sociology, sports, finance, climate and any other topic which students find interesting.

We use the R programming language, RStudio, Git, GitHub and DataCamp. Although we will learn how to write code, this is not a course in computer science. Although we will learn to think with data, this is not a course in statistics. We focus on practice, not theory. We make stuff.

Prerequisites: None. You must have a laptop with R, RStudio and Git installed.

Logistics: Class meets in Tsai Auditorium in the basement of CGIS South from 12:00 to 1:15 on T/TH.

Ulysses and the Sirens, 1891, by John William Waterhouse. Homer’s Odyssey recounts the decade-long journey home of Odysseus (known as Ulysses in Roman myths) after the Trojan War. Although Ulysses’s ultimate goal is his home of Ithaca, he does not shy away from adventure along the way. The Sirens use their enchanting voices to lure unwary sailors to their deaths. Ulysses wanted to hear their songs. He instructed his men to fill their ears with wax and tie him to the mast.


Course Metaphor

The central metaphor for this class is Ulysses and the Sirens. You are Ulysses. Ithaca is the future you want. The Sirens are the many distractions of the modern world. I am the rope.

Course Staff

Preceptor David Kane; CGIS South 310; 646-644-3626; office hours Thursday from 1:30 to 4:00, generally held in Fisher Commons. Please address me as “Preceptor,” not “David,” nor “Preceptor Kane,” nor “Professor Kane,” nor “Mr. Kane,” nor, worst of all, “Dr. Kane.”

Teaching Fellows: Georgie Evans and Sascha Riaz.

Course Assistants: Claire Fridkin and Dillon Smith.

Course Philosophy

No Lectures: The worst method for transmitting information from my head to yours is for me to lecture you. There are no lectures. We work on problems together during class. You learn soccer with the ball at your feet. You learn about data with your hands on the keyboard.

R Everyday: Learning a new programming language is like learning a new human language: You should practice every day. In this class, you will.

Cold Calling: I call on students during class. This keeps every student involved, makes for a more lively discussion and helps to prepare students for the real world, in which you can’t hide in the back row.

Community: You will meet and work with many more of your fellow Harvard students than you would in a normal course. Awkwardness in the pursuit of community is no vice. You will probably learn the names of more students in this course than you will in all your other courses put together.

Organized by House: We use geography to create a community. During class, you will sit with students from your house, grouped with other houses near yours. If you live in Adams, for example, you will sit with other Adams students, near the students from Lowell and Quincy. Within your house, you will work with different peers each class. Don’t want to meet a score or more of Harvard students? Don’t take this class.

Professionalism: We use professional tools in a professional fashion. Your workflow will be very similar to the workflow involved in paid employment. Your problem sets and final project will be public, the better to impress others with your abilities.

Monologues: I give brief monologues, designed to explain specific topics that have confused students in the past. I hope to never talk for more than 5 minutes straight.

Speakers: We will have a variety of visitors to class, people performing professional data analysis, both inside and outside of academia, often using exactly the same tools that we use. If there is someone you would like to meet, talk to me about it and we can invite them!

Millism: Political disputes are not the focus of this class but, when such topics arise, I will insist that we follow John Stuart Mill’s advice: “He who knows only his own side of the case, knows little of that. His reasons may be good, and no one may have been able to refute them. But if he is equally unable to refute the reasons on the opposite side; if he does not so much as know what they are, he has no ground for preferring either opinion.”

No Cost: Every reading/tool we use is free. You don’t have to spend any money on this class. Some activities, like DataCamp and GitHub, have paid options which provide more services, but you never have to use them. Don’t give anyone your credit card number.

Workload: The course should take about 10 to 15 hours a week, outside of class meetings, exams and the final project. This is an expected average across the class as a whole. It is not a maximum. Some students will end up spending much less time. Others will spend much, much more.

Course Policies

Late Days: Assignments are always due at midnight, unless specified otherwise. An assignment is a day late if it is turned in any time after it was due (even 5 minutes after) but within 24 hours. After that, it is two days late, and so on. You have 5 late days which may be used for any assignment, except for the four exams and the final project Demo Day. You should save your late days. If you use them early in the semester for no particularly good reason and then, later in the semester, have an actual emergency, we will not be sympathetic. We will not give you extra late days in such a situation. (That isn’t fair to your classmates, and we are all about fairness.) We will just, mentally, move the late days you wasted so that they cover your actual emergency. You will now be penalized for being late earlier in the semester, when you did not have a good reason for tardiness.
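The counting rule above can be sketched in a few lines of code. The example is written in Python purely for illustration (the course itself uses R). The function name `days_late` and the sample deadline are hypothetical and are not part of any course tooling.

```python
import math
from datetime import datetime, timedelta


def days_late(due, submitted):
    """Count late days: any delay, even 5 minutes, counts as a full day."""
    if submitted <= due:
        return 0
    delay = submitted - due
    # Every started 24-hour period after the deadline is one late day.
    return math.ceil(delay / timedelta(days=1))


# Arbitrary example deadline, chosen only for illustration.
due = datetime(2019, 9, 12, 0, 0)

print(days_late(due, due - timedelta(hours=1)))     # on time
print(days_late(due, due + timedelta(minutes=5)))   # one day late
print(days_late(due, due + timedelta(hours=25)))    # two days late
```

Rounding up is the key design choice: a submission 5 minutes after the deadline and one 23 hours after it both cost a single late day.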

Missing Class: You expect me to be present for lecture. I expect the same of you. There is nothing more embarrassing, for both of us, than for me to call your name and have you not be there to answer. But, at the same time, conflicts arise. It is never a problem to miss class if, for example, you are out of town or have a health issue. Simply put an X by your name in the Google attendance sheet. Failure to do so will decrease your participation points, as will missing too many classes, even with notification.

Major Emergencies: We are not monsters. If you are hit with a major emergency — the sort of thing that necessitates the involvement of your Resident Dean — we will be sympathetic. We require a signed letter (not an e-mail) from your Resident Dean as documentation.

Role of Teaching Fellows: The TFs are responsible for grading, keeping track of late days, dealing with emergencies and so on. Go to them first with any problems.

Role of Course Assistants: The CAs run Study Halls. They are not involved in grading assignments and can make no commitments about how the TFs will grade. Never ask a CA a question about grading. Instead, ask on Piazza and a TF will respond, or come to a TF privately with your question.

Use your Harvard e-mail: Please use your official Harvard e-mail address for all aspects of this class, especially things like signing up for services like DataCamp, GitHub, and so on. Doing so makes it much easier for us to figure out who is doing what. This may not be easy if you already connect with these services but, even in that case, you should be able to add your Harvard e-mail address to your account.

Piazza: All general questions — those not of a personal nature — should be posted to Piazza so that all students can benefit from both the question and the answer(s).

Plagiarism: If you plagiarize, you will fail the course. See the Harvard College Handbook for Students for details.

Working with Others: Students are free (and encouraged) to discuss problem sets and their final projects with one another. However, you must hand in your own unique code and written work in all cases. Any copy/paste of another’s work is plagiarism. In other words, you can work with your friend, sitting side-by-side and going through the problem set question-by-question, but you must each type your own code. Your answers may be similar (obviously) but they must not be identical, or even identical’ish.

R: You must use R and RStudio for this class. You are responsible for installing both on your laptop.

Git and GitHub: Analyzing data without using source control is like writing an essay without using a word processor — possible but not professional. We will do all our work using Git/GitHub. If Git is not already installed on your computer, please install it.

DataCamp: We make extensive use of lessons from DataCamp. All DataCamp courses are graded pass/fail. Each week’s course(s) are due by Monday at midnight. Class on Tuesday will assume the completion of this work.

Readings: Assignments in a given week cover material that we will use that week. Some students prefer to do such readings ahead of time, the better to prepare for class. Some students prefer to do the readings after those classes, the better to reinforce the material. Some students prefer to never do the readings. No matter what path you select, know that, when constructing/grading the problem sets, exams and final projects, we will assume that you understand all assigned material.

Optional Activities: The syllabus includes background readings and DataCamp assignments which students may find interesting.

Computer Problems: If you are having problems with your computer, follow these steps. First, post the problem on Piazza, with details and screenshots. With luck, a fellow student will be able to solve it. (And students who help their peers with technical issues are guaranteed full participation points for grading purposes.) Second, if I and/or the TFs can’t solve it, we will direct you toward the IQSS IT Client Support Services, located in the basement of CGIS Knafel. They are excellent! E-mail them with the details of your problem, mentioning your enrollment in this class. Provide them with a link to your Piazza post. Although I and the teaching fellows want to be helpful, we are not experts in troubleshooting computer problems. Third, once your problem is solved, tell us all the solution by responding to your own post on Piazza.

Waite Rule: We don’t wear hats in the classroom. (Obviously, this prohibition does not apply to headgear of a religious nature.)

Computer Emergencies: We are very unsympathetic to computer emergencies. You should keep all your work on GitHub, so it won’t matter if your computer explodes. If it does explode, you will lose only the work after your last push. You can then restart your work on a public computer (the basement of CGIS Knafel has machines with R/RStudio installed) or on your roommate’s computer.

GitHub Classroom: We use GitHub Classroom to distribute problem sets and exams. You will receive an e-mail with a link. Click on that link and a repo, with instructions, will be created. Do this as soon as you receive the e-mail. We don’t want GitHub problems to arise the night before the assignment is due.

Speakers: We follow a No Laptop Rule during speaker presentations. Close your laptops. Put down your phones. If you want to take notes, use a pen. We do this because we respect the speakers, want to give them our full attention, and are thankful that they have taken the time to talk with us. I will still need to look at my phone, but only to ensure that the class ends exactly on time. Do not start gathering your belongings until class ends.

Credit: You may get concentration credit for Gov 1005. This is true, obviously, for Government. It is also true in Psychology. I am happy to support students who want to petition other departments.

Announcements: You are responsible for any assignment/exam/deadline updates/changes which are either announced in class or promulgated via the course Canvas e-mail list. You are not responsible for every random post on Piazza.


Solo Participation: 5 points. This category relates to things you do alone in class. Missing class (without notifying us) or missing too many classes will cost you points, as will a failure to participate in class activities. We keep track of this via Google sheets, so be sure to fill them out when requested.

Group Participation: 5 points. This category relates to activities you do with other students. Helping your fellow students, especially on Piazza, is the best form of group participation, as is volunteering for a class role. Be a good class citizen. Help your classmates during Study Halls. Do not shirk on group projects.

DataCamp Lessons: 5 points. Grades are pass/fail only. These are free points! Given the level of the questions and the hints provided, it is essentially impossible not to get full credit as long as you make an honest effort. Each day late (beyond the five allowed) results in -1 point from the 5 points total allocated for DataCamp assignments. If you use up more than 5 points, further days late will make a negative contribution to your final grade. Note that these are point, not percentage, penalties. This means that each additional late day used outside of the allotted five will drop your final class grade by one grade point.

Problem Sets: 25 points. The first problem set is worth 1 point. The remaining 8 are worth 3 points each. Problem sets are distributed on Thursday and then due the following Wednesday at midnight. You are welcome to work on them with your friends but, first, you must personally type in every character in the work you submit and, second, you must list all the people you worked with. We define “work with” very broadly, to include minor interactions. You would certainly list anyone you sat near during Study Hall, for example. You may only use one late day for a given problem set since, after one day, we will distribute the answers. If you do not submit your problem set within 24 hours of its due date, you receive a zero for that assignment. You will (also!) still be “charged” with the late day.

Exams: 35 points total. The four exams are take-home. The first is worth 5 points and the others are worth 10 points each. They are open-book and open-web. Because students have different schedules, you can complete the exam any time within a four-day window starting after exam distribution. Late exams earn zero points.

Final Project: 25 points. Students will present their projects publicly at the end of the semester. They will then have the opportunity to incorporate feedback before submitting the final version. There are several milestones for the projects. You may use your late days for them, just as you might for DataCamp assignments or Problem Sets. But, as with DataCamp (described above), these milestones must be met. Negative points will accrue until the milestone is completed. If you use up more than 5 points, further days late on a milestone will make a negative contribution to your final grade. Note that these are point not percentage penalties. This means that each additional late day used outside of the allocated five will drop your final class grade by one grade point.
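As a quick check on the arithmetic, the point allocations listed above can be added up in code. This is an illustrative sketch in Python (the course itself uses R). The dictionary keys and the `late_penalty` helper are hypothetical names, and the penalty function simplifies the policy by ignoring the separate rules for problem sets and exams.

```python
# Point allocations as listed in the categories above.
POINTS = {
    "solo_participation": 5,
    "group_participation": 5,
    "datacamp": 5,
    "problem_sets": 1 + 8 * 3,  # first set is 1 point, remaining 8 are 3 each
    "exams": 5 + 3 * 10,        # first exam is 5 points, other three are 10 each
    "final_project": 25,
}

FREE_LATE_DAYS = 5


def late_penalty(total_late_days):
    """Points lost: one point per late day beyond the five free ones."""
    return max(0, total_late_days - FREE_LATE_DAYS)


total = sum(POINTS.values())
print(total)            # sum of all categories
print(late_penalty(4))  # still within the five free late days
print(late_penalty(7))  # two days beyond the allowance
```

Because the penalty is in points rather than percentages, each extra late day comes straight off the course total, whichever assignment it was used on.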

See below for details on schedule and grading standards.


The texts for the class are R for Data Science (R4DS) by Garrett Grolemund and Hadley Wickham, Statistical Inference via Data Science: A moderndive into R and the tidyverse (MD) by Chester Ismay and Albert Y. Kim, and Data Visualization: A practical introduction (DV) by Kieran Healy. The primary resources below are also useful, but are not required reading. The secondary resources may also be helpful. All are free.


Happy Git and GitHub for the useR by Jenny Bryan
What They Forgot to Teach You About R by Jennifer Bryan and Jim Hester
R Graphics Cookbook, 2nd edition by Winston Chang


The Unix Workbench by Sean Kross
R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund
Pro Git by Scott Chacon and Ben Straub

Final Project

Do you love soccer or wine or NYC politics? The final project provides you with an opportunity to study that topic in depth. Your final project will be, for most of you, the first item in your professional portfolio, something so impressive that you will be eager to show it to potential employers. You must show this work publicly, both on the web (viewable by all) and in person at our Demo Day. You will host your final project using Shiny Apps, a free service provided by RStudio. Make use of free statistical consulting from the Harvard Statistics Department and from IQSS.

You may combine this project with a research paper or other assignment from a different class. You automatically have my permission. But you must get explicit permission from the instructor for the other class as well.

Possible Approaches

Data Exploration

Your goal is to gather data and present it in an engaging fashion. We are not necessarily investigating specific hypotheses or trying to fit a statistical model. Instead, imagine that your roommate also cares about soccer/wine/politics/whatever. You are building something that would interest her, something that will make her say, “That is cool! Let’s spend 30 minutes poking around with your data.” Projects without at least 10,000 data points are unlikely to be interesting enough.

It is not enough to simply use an already-assembled data set. Instead, you must combine data from a variety of different sources. Looking at your data-munging code will confirm for us that you have made an actual contribution. In past years, this was the only type of project that students were allowed to do.

Paper Replication

Read “Publication, Publication” (pdf) by Gary King. PS: Political Science and Politics, Vol. 39, No. 1 (Jan., 2006), pp. 119-125. King describes how to replicate the results of a published academic paper. See more details here. You will not be doing all of that! (Take Gov 1006 or Gov 2001 for that experience.) Instead, you will be creating a Shiny App which reproduces at least some of the key results of the paper and demonstrates what happens when changes are made in the modelling approach. How “robust” – to use Leamer’s terminology — are the results?

Read “Making the Most of Statistical Analyses: Improving Interpretation and Presentation” (pdf) by Gary King, Michael Tomz and Jason Wittenberg. American Journal of Political Science, Vol. 44, No. 2 (April, 2000), pp. 347-361. This is one of the most cited articles in political science in the last 20 years. Just redoing the analysis/graphics of a published article by making use of these techniques would make for an outstanding final project.

Data Collection

Students interested in a topic about which there is no publicly available data are welcome to collect their own data. This must be something much more substantive than just asking 100 students outside Annenberg about their favorite lunch entrée. Two kinds of topics work best: those which you truly care about and those which are Harvard-specific.

Statistical Model

Creating your own statistical model can be a part of any of the three final project categories listed above, but it is not a requirement. Students may also create a final project in which modeling is the main focus. Take a clean data set and create a model which improves our understanding of the world. (modelDown is a useful tool.) Regression and Other Stories provides several examples of how to create, and document your creation of, such a model.

The typical Shiny App for this type of project will include three tabs. The “About” tab — just like the About tab for the other final projects — will provide background information about you and your data. The second tab will display your final model, and allow the user to change some of your assumptions and see the results. The third tab will be a detailed tour of the modeling choices you made and an explanation of why you made them.

Statistical Demonstration

Build a Shiny App which explains statistical concepts. See here for some inspiration.


Interested in doing a project which does not fall into one of these categories? Come talk to us! The best projects involve topics which students are passionate about. If you really care about X, then we are eager to help you create a final project about X.

Prior Projects

Consider all the final projects from past semesters. Click on the project title to explore the Shiny App. Click on the student’s name to explore their Github repo. Highlights:

Shivani Aggarwal: How Couples Meet. Visualizing the ways in which different kinds of U.S. couples meet and enter into relationships.

Neil Khurana: Harvard Dining. Archiving Harvard menus and exploring variations and repetition in meal choices.

Dasha Metropolitansky: First-Year Blocking Group Project. Harvard says it fosters a diverse community; trends in students’ housing indicate otherwise. This was a group project. The other group members were: Adiya Abdilkhay, Ilkin Bayramli, April Chen, Alistair Gluck, Christopher Milne, Neil Schrage and Stephanie Yao.

Christopher Onesti: Course Enrollment Statistics. This project presents an inside look and trend visualization regarding fall and spring undergraduate course enrollment data at Harvard.

Margaret Sun: Beyond The Stage. Various insights into the music group BTS.

Ruoqi Zhang: Settling the Dust: Censorship & Environmental Activism in China, 2012. What does social media data tell us about environmental awareness and censorship in China, 2012?

Maclaine Fields: Harvard Volleyball. I analyzed setting, serving, receiving, digging, and attacking results and created plots that show the setting tendencies and serving trajectories of Harvard Volleyball and its opponents.

Kemi Akenzua: Death Row Last Words. A closer look at the final words of people executed in Texas.


If you had tried to complete a data analysis project before taking this class, you would have done X well. Now that you have taken the class – now that you know how to describe, predict and infer – you will do Y well. The success (or failure) of the class can be measured by comparing Y with X.


Everything (DataCamp, Problem Sets, Exams, Milestones) is due at midnight, unless otherwise specified.

Rhythm of the Class

The class follows a steady weekly rhythm:

Sunday 2:00 PM – 5:00 PM. Study Hall with Claire Fridkin, Dunster Dining Hall.
Sunday 7:00 PM – 10:00 PM. Study Hall with Dillon Smith, Smith Center.
Monday 4:00 PM – 7:00 PM. Study Hall with Sascha Riaz, Fisher Commons.
Monday midnight. DataCamp exercises due.
Tuesday 12:00 PM – 1:15 PM. Class.
Wednesday 2:00 PM – 5:00 PM. Study Hall with Georgina Evans, Fisher Commons.
Wednesday midnight. Problem set due.
Thursday 12:00 PM – 1:15 PM. Class.
Thursday 1:30 PM – 4:00 PM. Office Hours with Preceptor, Fisher Commons.
Thursday evening. Problem set due next week will be distributed.
Friday midnight. Final project milestones are due.
Sunday midnight. Exams, if distributed, are due.

Key Dates

Part 1: Tools and Framework

Problem Set #1 due Wednesday, September 11.
Final Project Milestone #1 due Friday, September 13.
Problem Set #2 due Wednesday, September 18.
Final Project Milestone #2 due Friday, September 20.
Problem Set #3 due Wednesday, September 25.
Exam #1 distributed on Wednesday, September 25 and due Sunday, September 29.

Part 2: Sampling and Inference

Final Project Milestone #3 due Friday, October 4.
Problem Set #4 due Wednesday, October 9.
Final Project Milestone #4 due Friday, October 11.
Problem Set #5 due Wednesday, October 16.
Final Project Milestone #5 due Friday, October 18.
Problem Set #6 due Wednesday, October 23.
Exam #2 distributed Wednesday, October 23 and due Sunday, October 27.

Part 3: Models

Final Project Milestone #6 due Friday, November 1.
Problem Set #7 due Wednesday, November 6.
Final Project Milestone #7 due Friday, November 8.
Problem Set #8 due Wednesday, November 13.
Final Project Milestone #8 due Friday, November 15.
Problem Set #9 due Wednesday, November 20.
Exam #3 distributed Wednesday, November 20 and due Sunday, November 24.

Part 4: Projects

Thanksgiving is Thursday, November 28.

Tuesday, December 3 is last day of classes.

Possible Demo Days: Tuesday, November 26; Tuesday, December 3; Wednesday, December 4; Monday, December 9.

Final project due Friday, December 13.

Exam #4 distributed Wednesday, December 4 and due Sunday, December 15.


Part 1: Tools and Framework

Data science involves both inputs and outputs. We bring in data from somewhere to analyze and, once we have some answers, distribute our results. During Part 1, we will bring in data from R packages, downloaded text files and text files on the web. We will distribute our results in three forms: as html files sent to the course staff, as requests for help (from strangers) built around reproducible examples, and as animated graphics posted to the web.

Week 1: September 2: Shopping Week

You are Ulysses. I am the rope. – Preceptor

Install R, RStudio and Git on your laptop. Start on the DataCamp assignments. They are due on Monday, September 9 at midnight. Sign up for a meeting with a member of the Course Staff. This will fulfill the first milestone, due September 13, for the final project.


R4DS: Chapters 1, 2, 3, 4, 6 and 8.
DV: Chapters 1 and 2.

Week 2: September 9. Visualization

You can never look at your data too much. – Mark Engerman

We will review some basic R operations including constructing vectors with c() and subsetting elements with []. The first problem set will be distributed on Tuesday, via GitHub Classroom, and completed during class. We will also learn how to recover from Git mistakes.


R4DS: Chapters 5 and 7.
DV: Chapter 3.


Remember: DataCamp assignments are due Monday at midnight.


Problem Set #1 due September 11 at midnight. We will complete and submit this problem set in class on Tuesday. Its purpose is to ensure that everyone has a working computer, understands Git/GitHub and can compile an R Markdown document.

Final Project Milestone #1 due Friday, September 13. The only requirement is to meet with a member of the course staff. Most staff study halls are Sunday through Wednesday. Do not wait until Friday morning. Record the name of the person you met with in the Final Project Google Sheet.


Potential Outcomes
Fundamental Problem of Causal Inference

Week 3: September 16. Seeking Help

The best data science superpower is knowing how to ask a question. – Mara Averick

We will learn how to produce a reproducible example — a “reprex” — in order to help strangers to help us.


R4DS: Chapters 9, 10, 11.
DV: Chapter 4.


Problem Set #2 due Wednesday, September 18.
Final Project Milestone #2 due Friday, September 20.


No Causation Without Manipulation
Permutation Tests


The Unix Workbench, chapters 1 – 6.

Week 4: September 23. Animation

Workflow: you should have one. – Jenny Bryan


MD: Chapters 1 through 5.
R4DS: Chapters 12, 13, 14, 15 and 16.
DV: Chapter 5.


Problem Set #3 due Wednesday, September 25.
Exam #1 due Sunday, September 29.


Average Effect


Causality, Chapter 2 of Quantitative Social Science by Kosuke Imai.

Part 2: Sampling and Inference

Week 5: September 30. Sampling

Lot of points were taken off for small errors that I did not see as pedagogically important. – Gov 1005 student


MD: Chapter 8 Sampling.
R4DS: Chapters 17, 18, 19, 20 and 21.


Final Project Milestone #3 due Friday, October 4.


The Cognitive Style of PowerPoint by Edward Tufte.

Week 6: October 7. Confidence Intervals

Comment as a service to the dumbest possible version of your future self. – Alex Albright



Problem Set #4 due Wednesday, October 9.
Final Project Milestone #4 due Friday, October 11.

Week 7: October 14. Bayes


Chapter 1 in Think Bayes (pdf) by Allen Downey.
Chapter 2 (pdf) in Doing Bayesian Data Analysis by John Kruschke.


Statistical Rethinking: A Bayesian Course with Examples in R and Stan (pdf) by Richard McElreath. Chapter 1.


Problem Set #5 due Wednesday, October 16.
Final Project Milestone #5 due Friday, October 18.

Week 8: October 21. Maps


“Let’s Take the Con Out of Econometrics,” by Edward E. Leamer. The American Economic Review, Vol. 73, No. 1 (March, 1983), pp. 31-43. link


Problem Set #6 due Wednesday, October 23.
Exam #2 due Sunday, October 27.


DV: Chapter 7
Introduction to Mapping with sf

Part 3: Models

Week 9: October 28. Regression

Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works. – Andrew Gelman


MD: Chapter 6 Basic Regression.
R4DS: Chapters 22, 23, 24 and 25.
DV: Chapter 6.


Final Project Milestone #6 due Friday, November 1.


Statistical Rethinking: A Bayesian Course with Examples in R and Stan (pdf) by Richard McElreath. Chapter 2.

Week 10: November 4. Multiple Regression


MD: Chapter 7 Multiple Regression.
DV: Chapter 8.


Problem Set #7 due Wednesday, November 6.
Final Project Milestone #7 due Friday, November 8.


“The Bayesian New Statistics” by John K. Kruschke and Torrin M. Liddell.

Week 11: November 11. Model Inference

Amateurs test. Professionals summarize. – Preceptor



Problem Set #8 due Wednesday, November 13.
Final Project Milestone #8 due Friday, November 15.

Week 12: November 18. Machine Learning


Chapters 28, 29, 30 and 32 from Introduction to Data Science by Rafael A. Irizarry.


Problem Set #9 due Wednesday, November 20.
Exam #3 due Sunday, November 24.


Introduction to Machine Learning
Machine Learning Toolbox
Chapter 7 (pdf) from The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition.

Part 4: Projects

The main focus of the last two class meetings is the final projects. Note that we only have one meeting (on Tuesday) during each of the last two weeks.

Week 13: November 25. Shiny

It is tough to get much done on the Tuesday of Thanksgiving week. Main focus will be on Shiny Apps.


R4DS: Chapters 26, 27, 28, 29 and 30.
Shiny tutorials

Week 14: December 2. Memes

Put your work on the web. – David Sparks

Last day of classes. Make memes, provide course feedback, discuss final projects and have fun!

Classroom Seating

Record the name of your Partner in the Google sheet for the day. Each person does this, even though doing so leads to duplication.

Assignment Details


There are several ways to earn group participation points in class.

Imperator: Each Group will have a class Imperator, someone who helps to organize a study group, coordinate activities with other Groups, and so on. Imperators make everyone feel welcome, first by learning everyone’s name and, then, by introducing classmates to each other.

Magicum: The class will have several technical wizards, students who have volunteered to help their peers with computer problems either in person or on Piazza. The most difficult of these questions will involve Git/GitHub, so only volunteer for this role if you are comfortable with those tools.

Actarium: We need note-takers, ideally two students for each day. They work separately, but will still be partnered with someone so they can participate in coding. After class, the two actarii get together and create one unified set of notes, which must be posted to Piazza before midnight that evening.

Welcome Committee: We organize a Welcome Committee of four students for each speaker. See below for the duties associated with this job.

Piazza: Answering your classmates’ questions on Piazza is the best way to earn participation points. Be a good class citizen! If you find a (meaningful!) typo in a problem set or exam, please post it to Piazza. The first student to do so earns many participation points.

Problem Sets and Exams

All problem set and exam solutions are submitted via GitHub Classroom. For problem sets, you will create at least two new files: ps_N.Rmd and ps_N.html, where N is replaced by the number of the problem set. You must use exactly these names. Replace “ps” with “exam” for exam submissions. Your projects may also contain other files, either distributed by us as part of the assignment or added by you.

  • The two documents you are submitting are very different.
    • The Rmd file is a technical document, an accurate record of your work which allows you (and us!) to reproduce your html easily. It should be well-organized, nicely formatted and clean. Non-technical readers will not understand it, but that is OK.
    • The html file is a presentation document, designed for non-technical readers. No R code or weird warnings or obscure messages mar its pleasing appearance. It is a simple list of the answers to the questions.
  • It must be possible for us to replicate your work. That is, we will clone your repo and open your ps_N.Rmd file. When we knit it, we should produce your .html. If we can’t, we will take off points. (The Course Assistants will be happy to test your work. Visit them during Study Hall!)
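
A minimal sketch of one way to keep R code, messages and warnings out of the knitted html, assuming you set knitr chunk options in a setup chunk at the top of your Rmd. This is a common convention rather than a course requirement, and you may prefer to set options chunk by chunk.

```r
library(knitr)

# Apply these defaults to every chunk that follows in the document.
# echo = FALSE hides the code; message and warning = FALSE hide console noise.
opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

Setting the defaults once keeps the Rmd tidy, and any single chunk can still override them when you want to show something.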

Question Types

There are only three types of questions on the problem sets: tables, graphics and Mad Libs. Outside of these, you do not write any prose. Some exam questions, on the other hand, require a paragraph or so of explanation.

A Mad Libs style question provides a sentence with an X which you must replace with the correct answer. For example, the problem set might state:

The state with the most rows is X. (format state like Massachusetts, not MA)

You copy/paste that sentence as your answer, but replace the X with inline R code that determines the correct replacement for X dynamically. Do not include the words in the parentheses. They are there for explanation. Do not simply copy/paste the correct answer. In your Rmd, you might write:

The state with the most rows is `r x %>% group_by(state) %>% count() %>% ungroup() %>% arrange(desc(n)) %>% slice(1) %>% pull(state)`.

When you knit your Rmd file, this will turn into:

The state with the most rows is Massachusetts.

This is (you hope!) the answer that we are looking for.

Obviously, x needs to be a tibble which you have already created and which has state as a variable name. Sometimes, so much code is needed to answer the Mad Lib that it is placed in its own code chunk, with the answer saved as an object.

y <- x %>% 
  group_by(state) %>% 
  count() %>% 
  # count() keeps the grouping, so drop it before taking the top row
  ungroup() %>% 
  arrange(desc(n)) %>% 
  slice(1) %>% 
  pull(state)

But that object is still placed in the inline code:

The state with the most rows is `r y`.
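
To see the whole Mad Lib workflow run from start to finish, here is a minimal sketch. The tibble below is made-up toy data standing in for x, not data from any actual problem set. The sketch uses count(state) in place of group_by() followed by count(), because count(state) returns an ungrouped tibble, so slice(1) keeps exactly one row.

```r
library(dplyr)

# Hypothetical toy data standing in for the tibble x from a problem set
x <- tibble(state = c("Massachusetts", "Massachusetts", "Vermont",
                      "Massachusetts", "Maine", "Vermont"))

y <- x %>% 
  count(state) %>%        # one row per state, with the row count in n
  arrange(desc(n)) %>%    # largest count first
  slice(1) %>%            # keep only the top row
  pull(state)             # extract the state name as a character vector

y
```

Here y is the single string "Massachusetts", which is what the inline code then drops into the sentence.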

Late Days

You may use your late days on the problem sets, with a maximum of 1 late day per problem set. When using GitHub, there is no “submission” button. Rather, we download the latest commit you’ve pushed as of midnight on Wednesday and grade that. If you want to use a late day for a problem set, email a teaching fellow before the due date. Otherwise, we will grade your latest commit as of the deadline.

Late days may not be used on exams. We grade whatever is in your GitHub repo as of the deadline.


Always list, at the very end, the names of any students with whom you worked on the problem set. If there were none, write None. We define “worked with” very broadly. It would certainly include anyone you sat next to or across from at a Study Hall, even if you only exchanged a few words.

Final Project Milestones

Final project milestones are always due at midnight of the designated date, which is always a Friday. You may use late days, except for Demo Day and the final due date. All submissions are made via a Google spreadsheet, the url of which will be distributed on Piazza. There are no milestones due during exam periods. The milestones which occur in the week after an exam (Oct 4th and Nov 1st) are major milestones, requiring more work and, therefore, being worth two points. Other milestones are worth 1 point. All 8 milestones together count for 10 points. Demo Day counts for 10 points. The final project submission is worth 15 points.

  1. September 13: Speak with any member of the course staff (CA/TF/Preceptor) about your final project. This is the first of three required meetings. No need to prepare for this meeting. But it is important to start thinking about what you want to do. This also provides an opportunity to meet some of the course staff. Sign ups will be distributed via Piazza. Consider scheduling an interview with Hugh Truslow, Head, Social Sciences and Visualization, Harvard University. No one at Harvard knows more about potential data sources.
  2. September 20: GitHub repo with Rmd (and knitted html) which discusses the pros and cons of two projects from past years. At least one project should be one which did extensive data gathering/cleaning. You should not select the same projects for commentary as your friends have. Students generally write about a paragraph for each project.
  3. October 4: Speak with any member of the course staff (CA/TF/Preceptor) about your final project. This is the second of three required meetings. In addition to this meeting, you must create a rough GitHub repo, with at least some of your raw data (or details of your plan to get the data), and a reproducible Rmd document which provides a brief description of the data: where you got it, what you have done with it so far and what you plan to do. You may change your project completely, all the way until Demo Day. But you are still responsible for meeting these milestones, even if you know you are going to pivot.
  4. October 11: Add a beautiful ggplot2 graphic using some of your data.
  5. October 18: Rmd/html which provides a draft of your About page.
  6. November 1: Speak with any member of the course staff (CA/TF/Preceptor) about your final project. This is the third of three required meetings. By midnight, you must have a working Shiny App, just to demonstrate that you can get something up and running. This does not have to be working for your meeting.
  7. November 8: Cleaned up GitHub repo.
  8. November 15: Working rough draft of your final project. Demo Day is still two weeks away, and you can completely pivot if you want, but you must have a fairly complete version of your current project.
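
The point totals are easy to check with a little arithmetic. A quick sketch, assuming the milestones are numbered as in the list, so that the two major milestones are #3 (October 4) and #6 (November 1); the variable names are ours, chosen only for illustration:

```r
# One entry per milestone, in list order; the two major milestones are worth 2.
milestone_points <- c(1, 1, 2, 1, 1, 2, 1, 1)

# All 8 milestones together
milestone_total <- sum(milestone_points)

# Milestones + Demo Day (10) + final project submission (15)
project_total <- milestone_total + 10 + 15

milestone_total
project_total
```

The milestones sum to 10 points, and the final project as a whole (milestones, Demo Day and final submission) comes to 35 points.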

Demo Day at the end of the semester is worth 10 points. Details TBD.

December 13: Final Project due. Fill out Google spreadsheet correctly! 15 points.

Study Halls

Study Halls (SH) are run by Course Assistants (CAs), undergraduates who have taken the class in the past. They are one of the most popular parts of the course. Teaching Fellows (TFs) also run Study Halls, although these will often have more of an office hours flavor. Students who make the most use of these resources do better in class, and enjoy it more, than students who do not. Course Staff (CS) is a term which incorporates course assistants, teaching fellows and Preceptor.


At every SH, the CS will ensure that everyone knows everyone else’s name. This class is a community and community begins with names. The process starts with the first student arriving and sitting at the table. They and the CS chat. (It is always nice for the student to take the initiative and introduce themselves to the CS. Remembering all your names is hard!) A second person arrives and sits at the same table, followed by introductions. Persons 3 and 4 arrive. More introductions. Help your CS by introducing yourself, even if you are 75% sure they remember your name. Be friendly!

At this point, the table is filled. Another person arrives. Instead of that person starting a new table, CS gives the new student their spot and moves their belongings to a new table. No student ever sits alone. The CS hovers around the table until more students arrive and start filling out table #2. And so on. At each stage, students are responsible for, at a minimum, introducing themselves to the CS and, even better, to the other students. Best is when students who are already present shower newly arriving students with welcomes and introductions.

I realize that this is not how things work in (m)any other course(s). But awkwardness in the pursuit of class community is no vice.

Help Us Help You

CS will, to the greatest extent possible, never just give you the answer. Something like “Use annotate()” might solve your immediate problem, but it does not set you up for success during the exams — when we won’t be around to serve as your personal oRacles — much less for the rest of your life.

Instead, we will take the time to show you how to find the answer yourself. This starts with how to search for help, especially when you are not sure what you are looking for. This is more art than science, but adding certain strings (like “R”, “tidyverse”, or “ggplot”) to the search often helps. Then, we provide advice about which locations are the highest quality (anything to do with RStudio or tidyverse), which locations are less good than they initially appear, and which are difficult to use (Stack Overflow). We then explain the best way to make use of what you find.

We also point you directly to the best resources, especially to R for Data Science by Garrett Grolemund and Hadley Wickham and to Data Visualization: A practical introduction by Kieran Healy. We won’t say: “Just use starts_with().” Instead, we will ask, “Have you read Section 5.4 of R4DS, involving the use of select()?” Yes, this will require an extra five minutes of your time. But every extra minute you spend reading a high quality reference is a minute well spent.

We also help you learn how to seek help from others. There is a good way to ask for help on Piazza or Stack Overflow, generally involving the use of a reproducible example which highlights your precise problem, and a bad way.
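
As a rough illustration of what a reproducible example can look like, here is a minimal sketch. The data and variable names are made up for this purpose. The key ideas are that the data is defined inline, so anyone can run the code as posted, and that the code is trimmed down to the one step which misbehaves.

```r
library(dplyr)

# Small made-up data, defined inline so that anyone can run the example
polls <- tibble(state = c("MA", "MA", "VT"),
                support = c(0.52, NA, 0.47))

# The precise problem: the average for MA comes back as NA
result <- polls %>% 
  group_by(state) %>% 
  summarize(avg_support = mean(support))

result
```

A post built around an example like this, along with a sentence about what you expected to see instead, is much easier to answer than a screenshot of an error. If you have the reprex package installed, reprex::reprex() can format such an example for posting; that suggestion is ours and is optional.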

Only if none of this works will we just tell you the answer.

Social Events

Socializing with students outside of class is fun. Joining/inviting me is optional and has no influence on your grade in the class, i.e., it earns you no participation points. The three main options are:

Restaurant Lunches

My wife and I host students in groups of 4 for lunch throughout the semester, sometimes via the Harvard Class-Room-to-Table program and sometimes on our own dime. We organize this by House at the start and then open up spots to everyone later. Invitations to come. Dress is casual. Please be on time. The reservation will be under “Kane.” Just go straight to the table.

House Lunches

I enjoy having lunch with you after class, either in the CGIS cafe, Annenberg or at your House.

Faculty Dinners

I enjoy attending faculty dinners, so feel free to invite me to yours. My only request is that you also invite the other students in the class who live in your House. It is often fun to take over a table with a group of 4 or 5 or . . .


This course is inspired by STAT 545, created by the legendary Jenny Bryan. The pedagogical goals follow Don Rubin’s vision. Some of the slides and exercises come from Data Science in a Box, by Mine Çetinkaya-Rundel. Some of the in-class exercises are from Teaching Statistics: A Bag of Tricks by Andrew Gelman and Deborah Nolan. Kudos to authors like Garrett Grolemund and Hadley Wickham (R for Data Science), Kieran Healy (Data Visualization: A practical introduction), Chester Ismay and Albert Y. Kim (Statistical Inference via Data Science: A moderndive into R and the tidyverse) for making their books freely available. Thanks to Kosuke Imai for open sourcing several of the datasets from Quantitative Social Science: An Introduction. Lecture slides were created via the R package xaringan by Yihui Xie. Many thanks to all the folks responsible for R, RStudio, Git, GitHub and DataCamp. This course would not be possible without their amazing contributions.