Rubin Suggestion

Consider Don Rubin’s suggestion for how to teach introductory data science and/or statistics:

We should start the inferential process with clear statements of scientific objectives and associated estimands, and a precise description of the hypothetical data set from which we would simply calculate the estimands. Such a data set could be composed mostly of unobserved or even unobservable values, but it defines the objects of inference. Then we should think about the mechanism that led to observed values being observed and unobserved values being missing. To me, statistical inference concerns estimating missing values from observed values rather than pondering p-values, and one cannot even begin to do this without assuming something about the process that created the missing values: was it, for example, simple random sampling or was it a censoring mechanism?

This inferential process is most directly conceptualized as finding the probability distribution of missing values given observed values and scientific assumptions, i.e., formally, finding the posterior predictive distribution of the missing values, whence the posterior distribution of the estimands can be calculated.

There is no need to teach technical details of this Bayesian approach; instead we can use ideas based on simulating the unknowns under scientifically motivated models. We should not avoid discussing the scientific context of any statistical problem, because this can greatly affect the resulting inference. The approach of distributionally filling in (or multiply imputing) the missing values is intuitive and reveals the natural uncertainty of inference. The standard frequentist approach, which treats some known functions of the observed values as unknown (e.g. sufficient statistics) and treats some unknowns (e.g. parameters) as known, leads to the confusing double-negative logic of ‘failing to reject the null hypothesis’ etc.

After understanding this natural approach to assess uncertainty of inference, then we can teach the importance of evaluating operating characteristics of such procedures.

Rubin (2011, page 288), Discussion of “Towards more accessible conceptions of statistical inferences” by C. J. Wild, M. Pfannkuch, M. Regan and N. J. Horton, Journal of the Royal Statistical Society. Series A (Statistics in Society), Vol. 174, No. 2 (April 2011), pp. 247-295.

This seems exactly right to me. My project this summer is to follow this advice in revamping Gov 1005.
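
To make the quoted workflow concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Rubin's code or course material. The estimand is the share of "yes" answers in a finite population. The hypothetical complete data set has one answer per person, most of them unobserved. The observed answers are assumed to come from simple random sampling, so the missingness can be ignored. The model is assumed to be independent Bernoulli outcomes with a uniform prior on the success rate. The function name `impute_population_proportion`, the sample of 50, and the population of 1,000 are all made up for the example.

```python
import random
import statistics


def impute_population_proportion(observed, population_size, n_draws=2000, seed=0):
    """Simulate the estimand (population share of 1s) by filling in unobserved units.

    Assumes simple random sampling, a Bernoulli model, and a uniform Beta(1, 1) prior.
    """
    rng = random.Random(seed)
    n_obs = len(observed)
    successes = sum(observed)
    n_missing = population_size - n_obs

    draws = []
    for _ in range(n_draws):
        # Step 1: draw the success rate from its Beta posterior given the observed values.
        theta = rng.betavariate(1 + successes, 1 + n_obs - successes)
        # Step 2: impute every unobserved unit given that rate.
        # Together, steps 1 and 2 are one draw from the posterior predictive
        # distribution of the missing values.
        imputed_successes = sum(rng.random() < theta for _ in range(n_missing))
        # Step 3: calculate the estimand from the completed data set.
        draws.append((successes + imputed_successes) / population_size)
    return draws


# Hypothetical data: 30 of 50 sampled people answered "yes".
observed = [1] * 30 + [0] * 20
draws = sorted(impute_population_proportion(observed, population_size=1000))

posterior_mean = statistics.mean(draws)
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(f"posterior mean of the population share: {posterior_mean:.3f}")
print(f"central 95% interval: ({lower:.3f}, {upper:.3f})")
```

Each pass through the loop produces one completed data set, and the spread of the estimand across those data sets is the uncertainty of the inference. Swapping in a different missingness mechanism, such as censoring, would change step 2, which is the point of asking how the missing values came to be missing.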

David Kane
Data Scientist