Description

Data matters. How much of an advantage do incumbents have in winning re-election? How much money is spent on political campaigns? How do polls help us forecast election results? We need data to answer these questions.

This course, an introduction to data science, will teach you how to work with data, how to gather information from a variety of sources and in various formats, how to import that information into a project, how to tidy and transform the variables and observations, how to visualize the data, and how to communicate your findings in a sophisticated fashion. Each student will complete a final project, the first entry in their professional portfolio. Our main focus is data associated with political science, but we will also use examples from education, economics, public health, sociology, sports, finance, climate and any other topic which students find interesting.

We use the R programming language, RStudio, Git, GitHub and DataCamp. Although we will learn how to write code, this is not a course in computer science. Although we will learn how to work with data, this is not a course in statistics. We focus on practice, not theory. We make stuff.

Prerequisites: None. You must have a laptop with R, RStudio and Git installed.

Logistics: Class meets in Tsai Auditorium in the basement of CGIS South from 1:30 to 2:45 on T/TH.

Ulysses and the Sirens, 1891, by John William Waterhouse. “This dramatic painting illustrates an episode from the journeys of the Greek hero Odysseus (in Latin, Ulysses) told in the poet Homer’s Odyssey in which the infamous Sirens lured unwary sailors towards perilous rocks and their doom by singing in the most enchanting manner. Odysseus wished to hear the Siren’s song and ordered his crew to lash him to a mast and block their ears in order to ensure their safe passage. Waterhouse has depicted each Siren with the body of a bird and the head of a beautiful woman, which came as a surprise to Victorian audiences, who were more used to seeing these mythic creatures portrayed as comely mermaid-like nymphs. He borrowed the motif from an ancient Greek vase that he studied in the British Museum.” The next stop on Odysseus’s journey was Thrinacia.

Course Metaphor

The central metaphor for this class is Ulysses and the Sirens. You are Ulysses. Thrinacia is the internship/job/degree/career you want. The Sirens are the many distractions of the modern world. I am the rope.

Course Staff

Course Philosophy

No Lectures: The worst method for transmitting information from my head to yours is for me to lecture you. There are no lectures. We work on problems together during class.

R Every Day: Learning a new programming language is like learning a new human language: You should practice every day.

No Math: Our focus is on practical skills for working with data. To make time for topics like Git, we need to cut out material that might typically be included in a course like this. The biggest impact relates to (the lack of) math, probability and statistical theory.

Cold Calling: I call on students during class. This keeps every student involved, makes for a more lively discussion and helps to prepare students for the real world, in which you can’t hide in the back row.

Visitors: We will have a variety of visitors to class, people performing professional data analysis, both inside and outside of academia, often using exactly the same tools that we use. If there is someone you would like to meet, talk to me about it and we can invite them!

Class Activities: Awkwardness in the pursuit of learning is no vice. We will do a variety of class activities that will sometimes take you out of your comfort zones. You will meet and work with many more of your fellow Harvard students than you would in a normal class.

Millism: Political disputes are not the focus of this class but, when such topics arise, I will insist that we follow John Stuart Mill’s advice: “He who knows only his own side of the case, knows little of that. His reasons may be good, and no one may have been able to refute them. But if he is equally unable to refute the reasons on the opposite side; if he does not so much as know what they are, he has no ground for preferring either opinion.”

Organized by House: We use geography to create a community. During class, you will sit with students from your house, grouped with other houses near yours. If you live in Adams, for example, you will sit with other Adams students, and near the students from Lowell and Quincy. Within your house, you will work with different peers each class. Don’t want to meet a score or more of Harvard students? Don’t take this class.

Professionalism: We use professional tools in a professional fashion. Your workflow will be very similar to the workflow involved in paid employment. Your problem sets and final project will be public, the better to interest employers in your abilities.

Monologues: On occasion, I give brief monologues, designed to explain specific topics that have confused students in the past. I hope to never talk for more than 5 minutes straight.

No Cost: Every tool we use and reading I assign is available for free. You don’t have to spend any money on this class. Some activities, like DataCamp and GitHub, have paid options which provide more services, but you never have to use them. Don’t give anyone your credit card number.

Course Policies

Workload: The course should take about 10 to 15 hours a week, outside of class meetings, exams and the final project. This is an expected average across the class as a whole. It is not a maximum. Some students will end up spending much less time. Others will spend much, much more.

Use your Harvard e-mail: Please use your official Harvard e-mail address for all aspects of this class, especially when signing up for services like DataCamp and GitHub. Doing so makes it much easier for us to figure out who is doing what. This may not be easy if you already have accounts with these services but, even in that case, you should be able to add your Harvard e-mail address to your account.

Piazza: All general questions — those not of a personal nature — should be posted to Piazza so that all students can benefit from both the question and the answer(s).

Plagiarism: If you plagiarize, you will fail the course. See the Harvard College Handbook for Students for details.

Working with Others: Students are free (and encouraged) to discuss problem sets and their final projects with one another. However, you must hand in your own unique code and written work in all cases. Any copy/paste of another’s work is plagiarism. In other words, you can work with your friend, sitting side-by-side and going through the problem set question-by-question, but you must each type your own code. Your answers may be similar (obviously) but they must not be identical, or even identical-ish.

R: You must use R and RStudio for this class. You are responsible for installing both on your laptop.

Git and GitHub: Analyzing data without using source control is like writing an essay without using a word processor — possible but not professional. Starting in week 3, we will do all our work using Git/GitHub. If Git is not already installed on your computer, please install it.

DataCamp: We make extensive use of lessons from DataCamp. All DataCamp courses are graded pass/fail. Each week’s course(s) are due by Monday at 10:05 AM (except in the cases of holidays or exams). Class on Tuesday will assume the completion of this work.

R for Data Science (R4DS): Reading assignments from R4DS in a given week cover material that we will use that week. Some students prefer to do such readings ahead of time, the better to prepare for class. Some students prefer to do the readings after those classes, the better to reinforce the material. Some students prefer to never do the readings. No matter what path you select, know that, when constructing/grading the problem sets, exams and final projects, we will assume that you understand the material in R4DS. If you are struggling in the class, the best advice we can offer is to read R4DS cover-to-cover.

Optional Activities: The syllabus includes background readings and DataCamp assignments which students may find interesting.

Computer Problems: If you are having problems with your computer, follow these steps. First, post the problem on Piazza, with details and screenshots. With luck, a fellow student will be able to solve it. (And students who help their peers with technical issues are guaranteed full participation points for grading purposes.) Second, if I and/or the TFs can’t solve it, we will direct you toward the IQSS IT Client Support Services, located in the basement of CGIS Knafel. They are excellent! E-mail them with the details of your problem, mentioning your enrollment in this class, at help@iq.harvard.edu. Although I and the teaching fellows want to be helpful, we are not experts in troubleshooting computer problems. Third, once your problem is solved, tell us all the solution by responding to your own post on Piazza.

Pass/Fail: You may not take this class Pass/Fail.

Missing Class: You expect me to be present for class. I expect the same of you. There is nothing more embarrassing, for both of us, than for me to call your name and have you not be there to answer. But, at the same time, conflicts arise. It is never a problem to miss class if, for example, you are out of town or have a health issue. Simply fill out the Google sheet indicating you will be absent. Failure to do so will decrease your participation points, as will missing too many classes, even with notification.

Late Days: An assignment is a day late if it is turned in anytime after it was due (even 5 minutes after) but within 24 hours. After that, it is two days late, and so on. You have 5 Late Days which may be used for any assignment, except for the two midterms and final project Demo Day. You should save your late days. If you use them early in the semester for no particularly good reason and then, later in the semester, have an actual emergency, we will not be sympathetic. We will not give you extra late days in such a situation. (That isn’t fair to your classmates, and we are all about fairness.) We will just, mentally, move the Late Days you wasted so that they cover your actual emergency. You will now be penalized for being late earlier in the semester, when you did not have a good reason for tardiness.
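
The counting rule above can be sketched in a few lines of base R. This is only an illustration: `late_days_used()` is a hypothetical helper, not part of any course tooling, and the deadline used below is an example value chosen for the demonstration.

```r
# Count the Late Days consumed by one submission, following the rule above:
# any lateness up to 24 hours is one day, up to 48 hours is two, and so on.
# `late_days_used()` is a hypothetical name used only for this sketch.
late_days_used <- function(due, submitted) {
  # Hours between the deadline and the submission (negative if early)
  hours_late <- as.numeric(difftime(submitted, due, units = "hours"))

  # On-time or early work uses no Late Days
  if (hours_late <= 0) {
    return(0)
  }

  # Any partial day counts as a full Late Day
  ceiling(hours_late / 24)
}

# Example deadline (an assumed value, for illustration only)
due <- as.POSIXct("2019-02-13 10:05:00", tz = "UTC")

late_days_used(due, due - 5 * 60)              # submitted 5 minutes early: 0
late_days_used(due, due + 5 * 60)              # 5 minutes late: 1
late_days_used(due, due + 24 * 3600 + 5 * 60)  # just over 24 hours late: 2
```

Rounding up with `ceiling()` is what makes a submission that is five minutes late cost a full day.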

Major Emergencies: We are not monsters. If you are hit with a major emergency — the sort of thing that necessitates the involvement of your Resident Dean — we will be sympathetic. We require a signed letter (not an e-mail) from your Resident Dean as documentation.

Role of Teaching Fellows: The TFs run all aspects of grading for the course, keeping track of late days, dealing with emergencies and so on. Go to them first with any problems. (Feel free to cc the Preceptor if you want to keep me in the loop but I am very respectful of TF authority on these matters.)

Computer Emergencies: We are very unsympathetic to computer emergencies. You should keep all your work on GitHub, so it won’t matter if your computer explodes. If it does explode, you will lose only the work after your last push. You can then restart your work on a public computer (the basement of CGIS Knafel has machines with R/RStudio installed) or on your roommate’s computer.

GitHub Classroom: We use GitHub Classroom to distribute problem sets and midterms. You will receive an e-mail with a link. Click on that link and a repo, with instructions, will be created. Do this as soon as you receive the e-mail. We don’t want GitHub problems to arise the night before the assignment is due.

Grading

Participation: 10 points. I expect you to participate, both in class and online. Helping your fellow students, especially on Piazza, is the best form of participation, as is volunteering for a class role. Be a good class citizen. Missing class (without notifying us) or missing too many classes will cost you points.

DataCamp Lessons: 5 points. Grades are pass/fail only. These are free points! Given the level of the questions and the hints provided, it is essentially impossible not to get full credit as long as you make an honest effort. Each day late (beyond the five allowed) results in -1 point from the 5 points total allocated for DataCamp assignments. If you use up more than 5 points, further days late will make a negative contribution to your final grade.

Problem Sets: 20 points. Problem sets are distributed on Wednesday and then due the following Wednesday at 10:05 AM. You are welcome to work on them with your friends but, first, you must personally type in every character in the work you submit and, second, you must list all the people you worked with. You may only use two Late Days for a given problem set since, after two days, we will distribute the answers. If you do not submit your problem set within 48 hours of its due date, you receive a zero for that assignment. You will (also!) still be “charged” with the two late days.

Midterms: 25 points each. The two midterms are take-home. They are open-book and open-web. Because students have different schedules, you can complete the midterm any time within a four-day window starting after midterm distribution. Late midterms will not be accepted.

Final Project: 20 points. Students will present their projects publicly during Reading Period. They will then have the opportunity to incorporate feedback before submitting the final version. There are several milestones for the projects. You may use your Late Days for them, just as you might for DataCamp assignments or the Problem Sets. But, as with DataCamp, these milestones must be met. Negative points will accrue until the milestone is completed.

The end of this syllabus provides details on schedule and grading standards.

Resources

The text for the class is R for Data Science (R4DS) by Garrett Grolemund and Hadley Wickham. The primary resources below are also useful, but are not required reading. The secondary resources may also be helpful. All are free.

Primary

Data Visualization: A practical introduction by Kieran Healy
Happy Git and GitHub for the useR by Jenny Bryan
The Unix Workbench by Sean Kross
R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

Secondary

Pro Git by Scott Chacon and Ben Straub
ModernDive: An Introduction to Statistical and Data Sciences via R by Chester Ismay and Albert Y. Kim
Introduction to Data Science by Rafael A. Irizarry
Handling Strings with R by Gaston Sanchez
Text Mining with R: A Tidy Approach by Julia Silge and David Robinson

Conclusion

If you had tried to complete a data analysis project before taking this class, you would have done X well. Now that you have taken the class – now that you have learned how to gather information in various formats, how to import that information into a project, how to tidy and transform the variables and observations, how to visualize and model the data for both analysis and prediction, and how to communicate your findings in a sophisticated fashion – you will do Y well. The success (or failure) of the class can be measured by comparing Y with X.

Schedule

Rhythm of the Class

The class follows a steady weekly rhythm:

  • Monday 10:05 AM. DataCamp exercises due, except for extensions because of holidays or midterms.
  • Monday 2:00 PM – 5:00 PM. Study Hall in Fisher with Albert.
  • Tuesday 9:00 AM to 12:00 PM. Study Hall in Fisher with Z.
  • Tuesday 1:30 PM – 2:45 PM. Class. Main focus of class will be interactive R session using material from DataCamp exercises you have just completed.
  • Tuesday 3:30 PM to 6:30 PM. Study Hall in Fisher with Z.
  • Tuesday 6:30 PM to 9:30 PM. Study Hall in Fisher with Z.
  • Wednesday 10:05 AM. Problem set (distributed last week) due.
  • Wednesday 4:00 PM. Problem set (or take-home midterm) distributed.
  • Thursday 8:30 AM to 11:30 AM. Study Hall in Fisher with Preceptor.
  • Thursday 1:30 PM – 2:45 PM. Class. In addition to continuing with the new R commands from Tuesday’s class, we will review material from the previous problem set (or exam, if one was given last week). By “previous problem set,” I mean the one you turned in last week, not the one you turned in on Wednesday. Other students, who have taken two late days, might still be working on that most recent problem set.
  • Friday 10:05 PM. Final project milestones are due.
  • Sunday 10:05 PM. Midterm exams, if distributed on Wednesday, are due.

Week 1: January 28: Shopping Week

Install R, RStudio and Git on your machine. Start on the DataCamp assignments. They are due on Monday, February 4 at 10:05 AM.

Readings

R4DS: Chapters 1, 2 and 3

R Packages, Commands and Arguments

packages: tidyverse, ggplot2, dplyr
commands: library, install.packages, ggplot, aes, facet_wrap, facet_grid, filter, arrange, select
arguments: data, mapping, x, y, color, shape, group, geom_point, geom_line, geom_smooth, geom_bar, geom_freqpoly, geom_density

Week 2: February 4.

Remember: DataCamp assignments are due Monday at 10:05 AM. If you have already decided to take the class, then these assignments are due Monday, February 4th. Extensions until Friday, February 8th at 10:05 AM will be granted to students who joined the course late.

Readings

R4DS: Chapters 4, 6, 8, 26 and 27.

R Packages, Commands and Arguments

packages: tidyverse, ggplot2, dplyr, janitor
commands: mutate, group_by, summarize, transmute, near
arguments: starts_with, ends_with, contains, scale_{x,y}_discrete
misc: janitor::clean_names, janitor::adorn_pct_formatting

Assignments

Problem Set #1 due February 6 at 10:05 AM. We will do this problem set in class. Its purpose is to ensure that everyone has a working computer set up. Students will hand this assignment in via Canvas.

Optional

Week 3: February 11.

Readings

R4DS: Chapters 5 and 7.

Speaker

Assignments

Problem Set #2 due February 13 at 10:05 AM. Students will hand in this assignment via Canvas. Please work with other students in your House. You are all in this together!

Week 4: February 18.

Monday, February 18 is a holiday so DataCamp is due on Tuesday at 10:05 AM.

Assignments

Problem Set #3 due February 20 at 10:05 AM. This problem set will be distributed, collected and graded using GitHub Classroom.

Week 5: February 25.

Readings

R4DS: Chapters 9, 10, 11, 12, 13 and 14.

Assignments

Problem Set #4 due February 27 at 10:05 AM.

Speaker

R Packages, Commands and Arguments

packages: dplyr, tibble, readr, readxl, haven, tidyr, stringr
commands: tibble, as_tibble, print, View, .$, .[[""]], read_csv, read_excel, write_csv, parse_{logical,integer,date,number,character,factor}, col_*, problems, guess_encoding, {write,read}_rds, gather, spread, separate, unite, pull, left_join, bind_rows
arguments: `` (back ticks), n, width, Inf, skip, comment, col_names, na, col_types, n_max, locale, levels, format, %b, %y, %Y, %*

Week 6: March 4.

First midterm distributed March 6 and due Sunday March 10 at 10:05 PM. Focus will be everything in R4DS through chapter 14.

Readings

R4DS: Chapters 15 and 16.

Assignments

Problem Set #5 due Wednesday March 6 at 10:05 AM.

R Packages, Commands and Arguments

packages: forcats, lubridate
commands: factor, parse_factor, parse_date, levels, count, fct_reorder, fct_reorder2, fct_relevel, fct_infreq, fct_recode, fct_collapse, fct_lump, today, ymd, make_date, as_date, year, month, mday, wday, {round,floor,ceiling}_date, update, days, months, weeks, years
arguments: levels, format

Week 7: March 11.

Readings

R4DS: Chapters 17 and 18.

DataCamp

Choose your own adventure! Pick two of the DataCamp courses from Week 6 and complete them. Provide the TFs with your DataCamp course certificates to confirm. You may also choose a different DataCamp class, if you like, but you must confirm your choice with us ahead of time.

Optional

Week of March 18 is Spring Break.

Week 8: March 25.

Because of Spring Break, DataCamp assignments are not due until Wednesday.

Readings

Optional

Week 9: April 1.

Readings

R4DS: Chapters 19, 20 and 21.

Assignments

Problem Set #6 due Wednesday April 3 at 10:05 AM.

packages: purrr
commands: function, if, else, all, any, identical, near, switch, cut, stop, stopifnot, return, typeof, length, as.*, is_*, is_*_scalar, set_names, [], [[]], $, list, str, attributes, vector, seq_along, flatten_*, while, map, map_*, split, safely, possibly, quietly, invoke_map, keep, discard, detect, {head,tail}_while, reduce
arguments: x, na.rm, ..., L, NA, NaN, Inf, -Inf, NA_{integer,real,character}, type, length, .x, .f, .$, ~, .
tricks: Cmd/Ctrl-Shift-R, list(...), lazyeval, use [[]] in all loops

Week 10: April 8.

Readings

R4DS: Chapters 22 and 23.

Assignments

Problem Set #7 due Wednesday April 10 at 10:05 AM.

Second midterm distributed April 10 and due Sunday April 14 at 10:05 PM. This midterm will be cumulative.

Week 11: April 15.

During the last three weeks of class, our focus shifts to Shiny. DataCamp is due Wednesday because of the midterm.

Readings

R4DS: Chapters 24 and 25.

DataCamp

Optional

Week 12: April 22.

Assignments

Problem Set #8 due Wednesday April 24 at 10:05 AM.

Optional

Week 13: April 29.

Readings

No problem set or DataCamp during the week of final projects.

Only Tuesday class. I will seek 3 volunteers to present their final projects during class. This is a good option for those who have a conflict with the Demo Day schedule.

Demo Days to be scheduled, probably Thursday/Friday of this week.

Contacts

A variety of data science professionals have kindly volunteered to be available to talk with students, both about data science in general and about data availability for final projects:

Assignment Details

Participation

There are several ways to earn participation points in class.

Imperator: Each House will have a class Imperator, someone who helps to organize a House study group, coordinate activities with other Houses and so on.

Magicum: The class will have several technical wizards, students who have volunteered to help their peers with computer problems either in person or on Piazza. The most difficult of these questions will involve Git/GitHub, so only volunteer for this role if you are comfortable with those tools.

Welcome Committee: We organize a Welcome Committee of four students for each speaker. See below for the duties associated with this job.

Piazza Participation: Answering your classmates’ questions on Piazza is the best way to earn participation points. Be a good class citizen!

Problem Sets

Of the 20 points allocated to problem sets, the first two problem sets count for 1/2 point each. The remaining 6, which are much more time consuming, count for 3 points each.

Grading Rubrics

  • Ensure that your repo is clean.
  • At least 5 commits with sensible commit messages, i.e., not “stuff” or “update.”
  • Once we download your repo, can we replicate your work easily? (It is OK if you use a library which we need to download, but your Rmd must include all the necessary library() commands.)
  • List the colleagues you worked with, if any.
  • Make your code readable. Formatting matters.
  • Include comments in your code. Rough guideline: You should have as many lines of comments as you have lines of code.
  • Make your comments meaningful. They should not be a simple description of what your code does.
  • Name your R code chunks.
  • Follow the Tidyverse Style Guide.
  • Spelling and punctuation matter.
  • Use captions, titles, axis labels and so on to make it clear what your tables and graphics mean.
  • Provide clear axis labels.
  • Create a title and/or subtitle that describe the key result of your graphics.
  • Use your best judgment. For example, sometimes axis labels are unnecessary. Data Visualization: A practical introduction by Kieran Healy is an excellent (and free!) guide to making high quality graphics with R.
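
As an illustration of the labeling rubrics above, here is a minimal ggplot2 sketch. It assumes the `mpg` data set that ships with ggplot2 stands in for your own data, and the wording of the title, subtitle and labels is only an example, not a required template.

```r
library(ggplot2)

# mpg ships with ggplot2; it is a stand-in for whatever data you are using.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    # The title states the key result rather than just naming the variables
    title = "Cars with larger engines get fewer highway miles per gallon",
    subtitle = "38 popular car models, 1999 and 2008",
    # Axis labels use plain language and units instead of raw column names
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (miles per gallon)",
    caption = "Source: the mpg data set included with ggplot2"
  )

# Printing the object draws the plot
p
```

All the text lives in one `labs()` call, which makes it easy to review titles, axis labels and captions in one place.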

Final Project

Do you love soccer or wine or NYC politics? The final project provides you with an opportunity to study that topic in depth. Your goal is to gather data and present it in an engaging fashion. We are not necessarily investigating specific hypotheses or trying to fit a statistical model, although you can do those things if you want. Instead, imagine that your roommate also cares about soccer/wine/politics/whatever. You are building something that would interest her, something that will make her say, “That is cool! Let’s spend 30 minutes poking around with your data.” Projects without at least 10,000 data points are unlikely to be interesting enough.

Your final project will be, for most of you, the first item in your professional portfolio, something so impressive that you will be eager to show it to potential employers. You must show this work publicly, both on the web (viewable by all) and in person at our Demo Day. You will host your final project using ShinyApps, a free service provided by RStudio. Make use of free statistical consulting from the Harvard Statistics Department.

Here are some final projects from last year:

Maclaine Fields: Harvard Volleyball
Kemi Akenzua: Death Row Last Words
Cayanne Chachati: Syrian Civil War
Charlie Olmert: Harvard Mens Lacrosse
Richard Qui: Vaccines

Milestones

Final project milestones are always due at 10:05 PM of the designated date. You may use Late Days, except for Demo Day itself.

  • March 15: URL for (or short description of) your data. Submit via Google form. You may change your project completely, all the way until Demo Day. But you are still responsible for meeting these milestones, even if you know you are going to pivot. 1 point.
  • March 29: GitHub repo with Rmd (and knitted html) which discusses pros and cons of at least two projects from last year. At least one project should be one which did extensive data gathering/cleaning. You should not select the same projects for commentaries as your buddies. Submit repo URL via Google form. 1 point.
  • April 5: Rough GitHub repo, with all necessary data, and reproducible Rmd document. The Rmd must contain two items. First, a brief description of the data, where you got it, what you have done with it so far and what you plan to do. Second, a beautiful ggplot2 graphic using some of the data. Submit repo URL via Google form. 1 point.
  • April 19: Working Shiny App, just to demonstrate that you can get something up and running. Submit app URL via Google form. 1 point.
  • April 26: Must have a working rough draft of your Shiny App. Submit repo URL via Google form. 1 point.
  • May 2/3: Demo Days! Details TBD. 5 points.
  • May 10: Final Project due. Fill out Google form correctly! 15 points.

Grading Rubrics

The milestones are pass/fail. Keep the following in mind for Demo Day and for the submission of your final project.

  • All the rubrics for problem sets apply here as well.
  • Your repo must be public.
  • We will look at (and grade) your code in conjunction with the Demo Day evaluation.
  • Give your repo and Shiny App a descriptive name. “syrian_civil_war” or “Vaccine-Explorer” is good. “Gov_1005_Final_Project” or “project_test” is not.
  • Some students work with messy data which requires a great deal of cleaning. Good stuff! Those students can create a very “vanilla” Shiny App and still receive full credit for the final project. Other students just use a ready-made data set from someplace like 538. Good stuff! But, in that case, they need to do something special with the analysis and/or display.
  • Apps should all have an “Info” or “About” tab which includes your name, contact information, GitHub repo and data source information. Include other background information as you see fit.
  • Apps should “open” on an interesting tab, which will usually not be the “About” tab.
  • Apps should have at least one tab in which the user can select something and see a change.
  • Apps often have “story” tabs which, although they do not allow for user selections, do highlight specific aspects of the data which are interesting, and which users are unlikely to find by themselves.
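
The tab-related rubrics above might translate into an app skeleton like the following. This is a hedged sketch, not a required structure: the app title, tab names, the input ID `cyl`, the output ID `mpg_plot` and the use of the built-in `mtcars` data are all placeholders chosen for illustration.

```r
library(shiny)

# Hypothetical skeleton: names, IDs and data are placeholders.
ui <- navbarPage(
  title = "Fuel Economy Explorer",
  # The first tabPanel is the tab the app opens on, so it holds the
  # interactive content rather than the "About" text.
  tabPanel(
    "Explore",
    sidebarLayout(
      sidebarPanel(
        selectInput(
          inputId = "cyl",
          label = "Number of cylinders:",
          choices = sort(unique(mtcars$cyl))
        )
      ),
      mainPanel(plotOutput("mpg_plot"))
    )
  ),
  tabPanel(
    "About",
    p("Your name and contact information."),
    p("A link to your GitHub repo."),
    p("Where the data came from.")
  )
)

server <- function(input, output) {
  output$mpg_plot <- renderPlot({
    # selectInput() returns a character string, so convert before comparing
    selected <- mtcars[mtcars$cyl == as.numeric(input$cyl), ]
    plot(
      selected$wt, selected$mpg,
      main = paste("Weight vs. fuel economy for", input$cyl, "cylinder cars"),
      xlab = "Weight (1000 lbs)",
      ylab = "Miles per gallon"
    )
  })
}

# Build the app object; runApp(app) launches it locally.
app <- shinyApp(ui = ui, server = server)
```

The user's selection on the "Explore" tab changes the plot, and the "About" tab comes second so the app does not open on it.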

Social Events

Socializing with students outside of class is fun (for me). Joining/inviting me is optional and has no influence on your grade in the class, i.e., it earns you no participation points. The three main options are:

Restaurant Lunches

My wife and I host students in groups of 4 for lunch throughout the semester, sometimes via the Harvard Class-Room-to-Table program and sometimes on our own dime. We organize this by House at the start and then open up spots to everyone later. Invitations to come. Dress is casual. Please be on time. The reservation will be under “Kane.” Just go straight to the table.

House Lunches

I enjoy having lunch with you before class, either in Annenberg or in your House. I will leave this to the Imperators to organize.

Faculty Dinners

I enjoy attending faculty dinners, so feel free to invite me to yours. My only request is that you also invite the other students in the class who also live in your House. It is often fun to take over a table with a group of 4 or 5 or . . .

Technical Advice

Follow this advice.

R

  • When using download.file(), make sure to set mode = "wb". This ensures that the download will work on all platforms.
  • janitor::adorn_pct_formatting() is often a better choice than scales::percent() for use in a table.
  • When loading libraries at the start of an Rmd file, load dplyr last. This decreases the chance of confusing name conflicts, like getting count() from the plyr library rather than from the dplyr library, which is almost certainly the version you want.
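
A small sketch of the `download.file()` advice above. To keep the example runnable without a network connection, it downloads from a local `file://` URL built from a temporary file. That URL is an assumption made purely for the demonstration (and its exact form may need adjusting on Windows); in real use it would be the web address of your data.

```r
# Create a small binary file locally so the example needs no network access.
source_file <- tempfile(fileext = ".rds")
saveRDS(data.frame(x = 1:3), source_file)

# Stand-in for a real http(s) address (assumed for illustration only)
url <- paste0("file://", normalizePath(source_file, winslash = "/"))

destination <- tempfile(fileext = ".rds")

# mode = "wb" writes the bytes exactly as received. Without it, binary files
# (Excel workbooks, zip archives, .rds files) can be corrupted on Windows.
download.file(url, destfile = destination, mode = "wb", quiet = TRUE)

# The downloaded copy reads back as the original data frame
downloaded <- readRDS(destination)
downloaded
```

Setting `mode = "wb"` is harmless on macOS and Linux, so using it every time keeps a script portable across classmates' machines.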

Git

  • If you have git problems, your first stop is Happy Git and GitHub for the useR by Jenny Bryan.
  • Always check in your .gitignore file.
  • Never check in your .Rproj file.

RStudio

  • Under Tools -> Global Options -> General, set the “Save workspace to .RData on exit:” to “Never”.
  • Under Tools -> Global Options -> Code -> Saving, set the “Default text encoding:” to “UTF-8”. This is especially important for Windows users from non-English locales.

Testimonials

None of these were solicited.

On Friday I received an offer for a data analyst role at the sports betting start up I have been interviewing at for much of this semester. Without the skills I learned in 1005 it is very unlikely I would have been able to even begin the technical assignment, and I definitely believe submitting an Rmarkdown rather than a spreadsheet or a word document made a big difference as you said it would! One of the main reasons I took this class was to help my job search and there is no doubt that the class’s claims were more than delivered upon in that respect. — student from Fall 2018

Within a week of finishing Gov 1005, I started my winter internship in the district office of U.S. Representative Gottheimer (NJ-05). On the first day, I was tasked with analyzing interviewers’ reports of applicants to the US service academies, a project that other interns had been working on for a few weeks. With the R coding skills I learned in Gov 1005, I was able to complete the task in a couple of hours, efficiently analyzing thousands of interviews which allows the Congressman to determine the applicants who will receive his nomination to the academies. My supervisor was very impressed! — student from Fall 2018

David Kane is a whackjob. The guy we just hired. I think they [the Government Department] are still mad at me for recommending him. — Harvard Statistics professor

[T]his guy is a loose cannon. I think we should take a “wait and see” attitude toward his course. — Harvard Government professor

gov1005 is the cringiest harvard course I have see this far. — Harvard student who did not take the course but who was unimpressed with the honesty of the syllabus

Acknowledgements

This course is inspired by STAT 545, created by the legendary Jenny Bryan. Some of the slides and exercises come from Data Science in a Box, by Mine Çetinkaya-Rundel. Some of the in-class exercises are from Teaching Statistics: A Bag of Tricks by Andrew Gelman and Deborah Nolan. Kudos to authors like Garrett Grolemund and Hadley Wickham (R for Data Science) and to Chester Ismay and Albert Y. Kim (ModernDive: An Introduction to Statistical and Data Sciences via R) for making their books freely available. Thanks to Kosuke Imai for open sourcing several of the datasets from Quantitative Social Science: An Introduction and to Matt Blackwell and Xiang Zhou for sharing the data from their courses. Lecture slides were created via the R package xaringan by Yihui Xie. Many thanks to all the folks responsible for R, RStudio, Git and GitHub. This course would not be possible without their amazing contributions.