R Versus Python

My brilliant student Hemanth Bharatha Chakravarthy writes:

I am enjoying my summer internship at JPAL: I’m currently cleaning and doing text analyses with calls data to the government women’s helpline. Super fun stuff!

Anyway, found this interesting read. Thought I’d ask you for your thoughts? Especially on the R vs Tidy stuff. I presume that when curating the 1005 syllabus, you’d have made decisions on why to use dplyr over data.table and tibbles and Tidy over base R? If you could share some insights on that and explain why we used Tidy primarily, I’d be super interested in reading your ideas!

  1. Very happy to see a first-year student like Hemanth land a cool summer internship at a first-class organization like J-PAL. Hemanth is talented enough that they probably would have gotten the job without taking Gov 1005 (and it is always tough to know what would have happened in the counterfactual world), but I still like to believe that the causal effect of taking 1005 was positive.

  2. I am lucky that several reasons all point to R as the best choice for me in Gov 1005.

  • All the upper level courses in Gov use R.

  • R is (universally?) regarded as the best choice for a first data science class with no prerequisites. That last bit is key. The (excellent) Data Science 1 course at Harvard uses Python, but they require at least one prior programming class.

  • To paraphrase (I think) Olin’s Allen Downey: “If you use R, then you only have to teach R and statistics. If you use Python, then you must teach programming, Python and statistics.” We did very little (any?) programming in 1005. We wrote scripts. That is much easier. (The rise of Python notebooks will, someday, lead to convergence on this dimension, I think.)

  • I want every student to finish the class with the ability to do everything on their own laptop. This is possible with R, because RStudio is so easy to set up. Installing Python (which version?) on 70 laptops for non-technical students? I don’t know anyone who does this. I think that approaches that set up fake cloud environments (which then disappear at the end of the course) shortchange students.

  3. I think data.table is powerful, but dplyr is easier and orders of magnitude more widely used. The Tidysphere (tibbles, dplyr, etc.) is where R is going, and you always want your students to have one foot in the future.
David Kane
Preceptor in Statistical Methods and Mathematics