A Second Chance

“A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks” (pdf) by Miguel A. Hernán, John Hsu & Brian Healy is a lovely article that helped me rethink how I should teach inference in Gov 1005. Key figure: their classification of data science tasks into description, prediction, and counterfactual prediction.

I prefer to replace “causal inference” with “control,” even though the former is more precise. I think that students will find it easier to remember the mantra of describe, predict, control. I am fully on board with the claim that this approach helps in “integrating all scientific questions, including causal ones, in a principled data analysis framework.”

Hernán et al. write:

A Classification of Data Science Tasks

Data scientists often define their work as “gaining insights” or “extracting meaning” from data. These definitions are too vague to characterize the scientific uses of data science. Only by precisely classifying the “insights” and “meaning” that data can provide will we be able to think systematically about the types of data, assumptions, and analytics that are needed. The scientific contributions of data science can be organized into three classes of tasks: description, prediction, and counterfactual prediction.

Description is using data to provide a quantitative summary of certain features of the world. Descriptive tasks include, for example, computing the proportion of individuals with diabetes in a large healthcare database and representing social networks in a community. The analytics employed for description range from elementary calculations (a mean or a proportion) to sophisticated techniques such as unsupervised learning algorithms (cluster analysis) and clever data visualizations.

Prediction is using data to map some features of the world (the inputs) to other features of the world (the outputs). Prediction often starts with simple tasks (quantifying the association between albumin levels at admission and death within one week among patients in the intensive care unit) and then progresses to more complex ones (using hundreds of variables measured at admission to predict which patients are more likely to die within one week). The analytics employed for prediction range from elementary calculations (a correlation coefficient or a risk difference) to sophisticated pattern recognition methods and supervised learning algorithms that can be used as classifiers (random forests, neural networks) or predict the joint distribution of multiple variables.

Counterfactual prediction is using data to predict certain features of the world as if the world had been different, which is required in causal inference applications. An example of causal inference is the estimation of the mortality rate that would have been observed if all individuals in a study population had received screening for colorectal cancer vs. if they had not received screening. The analytics employed for causal inference range from elementary calculations in randomized experiments with no loss to follow-up and perfect adherence (the difference in mortality rates between the screened and the unscreened) to complex implementations of g-methods in observational studies with treatment-confounder feedback (the plug-in g-formula).

Exactly right. My plan is to make describe, predict, control a central theme of Gov 1005, to ensure that students remember that every data science task falls into one of these three buckets. Perhaps the initials DPC will be helpful?
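To make the three buckets concrete for students, here is a minimal Python sketch (mine, not Hernán et al.'s) that runs all three tasks on one simulated cohort. The albumin scale, death rates, and effect sizes are made up for illustration.

```python
# A minimal sketch of describe, predict, control on a simulated cohort.
# All numbers (albumin scale, death rates, effect sizes) are invented.

import random

random.seed(5)

def simulate_person():
    albumin = random.gauss(3.5, 0.6)        # albumin at admission, g/dL
    screened = random.randint(0, 1)         # randomized screening assignment
    # Lower albumin and no screening raise the (hypothetical) risk of death.
    risk = 0.30 - 0.05 * (albumin - 3.5) - 0.10 * screened
    died = 1 if random.random() < min(max(risk, 0.0), 1.0) else 0
    return {"albumin": albumin, "screened": screened, "died": died}

cohort = [simulate_person() for _ in range(10_000)]

def rate(group):
    group = list(group)
    return sum(p["died"] for p in group) / len(group)

# 1. Describe: a quantitative summary of a feature of the world,
#    e.g. the proportion who died within one week.
print(f"Describe: overall one-week mortality = {rate(cohort):.3f}")

# 2. Predict: map an input (albumin at admission) to an output (death),
#    here an elementary risk difference between low- and high-albumin groups.
low = rate(p for p in cohort if p["albumin"] < 3.5)
high = rate(p for p in cohort if p["albumin"] >= 3.5)
print(f"Predict: risk(low albumin) - risk(high albumin) = {low - high:+.3f}")

# 3. Control (counterfactual prediction): because screening was randomized,
#    with full adherence and no loss to follow-up assumed, the difference in
#    mortality between screened and unscreened estimates the effect of screening.
screened = rate(p for p in cohort if p["screened"] == 1)
unscreened = rate(p for p in cohort if p["screened"] == 0)
print(f"Control: risk(screened) - risk(unscreened) = {screened - unscreened:+.3f}")
```

The third number earns the name “control” only because screening is randomized here; with observational data we would need the kind of adjustment the article goes on to mention.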

But note:

Some methodologists have referred to the causal inference task as “explanation,” but this is a somewhat misleading term because causal effects may be quantified while remaining unexplained (randomized trials identify causal effects even if the causal mechanisms that explain them are unknown).

Hmm. Perhaps “explain” is better than “control”?
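Whatever we call the third bucket, the observational case the excerpt mentions is worth a sketch too. Below is the plug-in g-formula in its simplest form, standardization over a single baseline confounder for a point treatment, on another made-up simulation (not the treatment-confounder-feedback setting Hernán et al. have in mind).

```python
# A minimal sketch of the plug-in g-formula in its simplest form:
# standardization over one baseline confounder L for a point treatment A,
# on an invented observational simulation.

import random

random.seed(5)

def simulate_person():
    L = 1 if random.random() < 0.4 else 0                   # baseline illness
    A = 1 if random.random() < (0.7 if L else 0.3) else 0   # screening, confounded by L
    p_death = 0.15 + 0.25 * L - 0.10 * A                    # true effect of A is -0.10
    Y = 1 if random.random() < p_death else 0
    return L, A, Y

data = [simulate_person() for _ in range(100_000)]

def mean_y(subset):
    subset = list(subset)
    return sum(y for _, _, y in subset) / len(subset)

# Naive contrast, biased because sicker patients are screened more often.
naive = mean_y(d for d in data if d[1] == 1) - mean_y(d for d in data if d[1] == 0)

# Plug-in g-formula: E[Y^a] = sum over l of E[Y | A = a, L = l] * P(L = l).
def g_formula(a):
    total = 0.0
    for l in (0, 1):
        stratum = [d for d in data if d[0] == l]
        total += mean_y(d for d in stratum if d[1] == a) * len(stratum) / len(data)
    return total

adjusted = g_formula(1) - g_formula(0)

print(f"Naive risk difference:     {naive:+.3f}")     # attenuated by confounding
print(f"G-formula risk difference: {adjusted:+.3f}")  # close to the true -0.10
```

In the time-varying settings the excerpt describes, with treatment-confounder feedback, the same idea requires the full g-methods machinery rather than this one-step standardization.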

David Kane
Preceptor in Statistical Methods and Mathematics