Two must-read statistics books

17 Jul, 2019 at 16:36 | Posted in Statistics & Econometrics | 4 Comments

Mathematical statistician David Freedman’s Statistical Models and Causal Inference (Cambridge University Press, 2010) and Statistical Models: Theory and Practice (Cambridge University Press, 2009) are marvellous books. They ought to be mandatory reading for every serious social scientist, including economists and econometricians, who doesn’t want to succumb to ad hoc assumptions and unsupported statistical conclusions!

How do we calibrate the uncertainty introduced by data collection? Nowadays, this question has become quite salient, and it is routinely answered using well-known methods of statistical inference, with standard errors, t-tests, and P-values … These conventional answers, however, turn out to depend critically on certain rather restrictive assumptions, for instance, random sampling …

Thus, investigators who use conventional statistical technique turn out to be making, explicitly or implicitly, quite restrictive behavioral assumptions about their data collection process … More typically, perhaps, the data in hand are simply the data most readily available …

The moment that conventional statistical inferences are made from convenience samples, substantive assumptions are made about how the social world operates … When applied to convenience samples, the random sampling assumption is not a mere technicality or a minor revision on the periphery; the assumption becomes an integral part of the theory …

In particular, regression and its elaborations … are now standard tools of the trade. Although rarely discussed, statistical assumptions have major impacts on analytic results obtained by such methods.

Consider the usual textbook exposition of least squares regression. We have n observational units, indexed by i = 1, ..., n. There is a response variable yi, conceptualized as μi + εi, where μi is the theoretical mean of yi while the disturbances or errors εi represent the impact of random variation (sometimes of omitted variables). The errors are assumed to be drawn independently from a common (gaussian) distribution with mean 0 and finite variance. Generally, the error distribution is not empirically identifiable outside the model; so it cannot be studied directly, even in principle, without the model. The error distribution is an imaginary population and the errors εi are treated as if they were a random sample from this imaginary population, a research strategy whose frailty was discussed earlier.

Usually, explanatory variables are introduced and μi is hypothesized to be a linear combination of such variables. The assumptions about the μi and εi are seldom justified or even made explicit, although minor correlations in the εi can create major bias in estimated standard errors for coefficients …
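The point about minor correlations and standard errors can be made concrete with a small simulation. What follows is only a sketch under assumptions of my own, not anything taken from Freedman's book: the function name `simulate_se_bias`, the cluster layout (40 clusters of 50 units), the within-cluster error correlation of 0.05 and the coefficients 1.0 and 2.0 are all hypothetical choices. It compares the spread of the OLS slope across repeated samples with the standard error that the textbook formula reports.

```python
import numpy as np


def simulate_se_bias(n_clusters=40, cluster_size=50, rho=0.05,
                     n_reps=500, seed=0):
    """Compare the actual spread of the OLS slope with its textbook SE.

    All parameter values are illustrative assumptions, not from the book.
    """
    rng = np.random.default_rng(seed)
    n = n_clusters * cluster_size
    cluster = np.repeat(np.arange(n_clusters), cluster_size)
    slopes = np.empty(n_reps)
    nominal_se = np.empty(n_reps)

    for r in range(n_reps):
        # Regressor measured at the cluster level (one value per cluster).
        x = rng.normal(size=n_clusters)[cluster]
        # Errors: a small shared cluster shock plus an individual shock,
        # so Var(eps) = 1 and the within-cluster correlation equals rho.
        shared = rng.normal(scale=np.sqrt(rho), size=n_clusters)[cluster]
        own = rng.normal(scale=np.sqrt(1.0 - rho), size=n)
        y = 1.0 + 2.0 * x + shared + own

        # Ordinary least squares and the usual covariance formula,
        # which is valid only if the errors are independent.
        X = np.column_stack([np.ones(n), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)

        slopes[r] = beta[1]
        nominal_se[r] = np.sqrt(cov[1, 1])

    return slopes.mean(), slopes.std(ddof=1), nominal_se.mean()


if __name__ == "__main__":
    mean_slope, actual_sd, reported_se = simulate_se_bias()
    print(f"mean slope estimate : {mean_slope:.3f}")
    print(f"actual SD of slope  : {actual_sd:.4f}")
    print(f"average reported SE : {reported_se:.4f}")
    print(f"ratio actual/report : {actual_sd / reported_se:.2f}")
```

In this setup the slope estimate itself stays close to the assumed value of 2.0; it is the reported uncertainty that goes wrong. With a cluster-level regressor, the usual design-effect approximation 1 + (m − 1)ρ, with m the cluster size, suggests that the true sampling variance is roughly three and a half times what the textbook formula claims, even though an error correlation of 0.05 would hardly be noticed in the data.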

Why do μi and εi behave as assumed? To answer this question, investigators would have to consider, much more closely than is commonly done, the connection between social processes and statistical assumptions …

We have tried to demonstrate that statistical inference with convenience samples is a risky business. While there are better and worse ways to proceed with the data at hand, real progress depends on deeper understanding of the data-generation mechanism. In practice, statistical issues and substantive issues overlap. No amount of statistical maneuvering will get very far without some understanding of how the data were produced.

More generally, we are highly suspicious of efforts to develop empirical generalizations from any single dataset. Rather than ask what would happen in principle if the study were repeated, it makes sense to actually repeat the study. Indeed, it is probably impossible to predict the changes attendant on replication without doing replications. Similarly, it may be impossible to predict changes resulting from interventions without actually intervening.
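The warning about convenience samples quoted above can also be illustrated numerically. The sketch below rests entirely on assumptions of my own: the function name `ci_coverage`, an artificial population with mean about 50 and standard deviation about 10, and a hypothetical selection mechanism in which units with larger values are slightly easier to reach (the `tilt` parameter). It counts how often a conventional 95% confidence interval for the mean actually contains the population mean, first under random sampling and then under the convenience design.

```python
import numpy as np


def ci_coverage(tilt=0.01, pop_size=20_000, sample_size=500,
                n_reps=1000, seed=1):
    """Coverage of the nominal 95% CI for the mean under two designs.

    The population and the selection mechanism are made-up illustrations.
    """
    rng = np.random.default_rng(seed)
    y = rng.normal(loc=50.0, scale=10.0, size=pop_size)
    true_mean = y.mean()

    # Hypothetical availability: units with larger y are a bit more
    # likely to end up in the data at hand.
    w = np.exp(tilt * (y - 50.0))
    designs = {"random": None, "convenience": w / w.sum()}

    coverage = {}
    for name, probs in designs.items():
        hits = 0
        for _ in range(n_reps):
            # Sampling with replacement keeps the sketch simple; with a
            # population this large the difference is negligible.
            sample = rng.choice(y, size=sample_size, replace=True, p=probs)
            half_width = 1.96 * sample.std(ddof=1) / np.sqrt(sample_size)
            hits += int(abs(sample.mean() - true_mean) <= half_width)
        coverage[name] = hits / n_reps
    return coverage


if __name__ == "__main__":
    for design, share in ci_coverage().items():
        print(f"{design:12s} coverage of nominal 95% interval: {share:.2f}")
```

The interval is computed in exactly the same way in both cases. Only the data-generating mechanism differs, and that is precisely the part the standard formula takes on trust.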


  1. I read the book (Statistical Models and Causal Inference, 2010) and understood at least part of the reasoning (like your example from the book), but by the end I was bewildered. How do you do it, then? Sampling theory for large populations would take care of some problems with convenience samples, but that seems not to be enough, since the questions asked in a survey are themselves related to other variables. How should you proceed?

    • Well, sometimes I think one could just say as Keynes did when criticizing Tinbergen: let them just go on! Given that you are explicit and transparent about what you assume (“Assumptions 1-5 are made because otherwise I couldn’t apply our impressive-looking mathematical-statistical machinery, assumptions 6-8 are totally unrealistic, assumptions 9-12 are harmless approximations to reality, and assumptions 13-17 are pretty good descriptions of the real-world”), there shouldn’t really be that many reasons to complain. If only mainstream economists were explicit about this and put up “warning signs” at the beginning of their papers, I guess I could reduce my own critical activities by at least 90% 🙂

  2. We do have this well worked out apparatus for drawing samples of colored balls from urns, but it seems like researchers are under tremendous and inappropriate pressure to make their data collection procedure an analogue to drawing colored balls from urns. Reference to “convenience” samples seems like it could be an expression of this bigotry in favor of urns.
    In real life, accurate data collection can present many practical challenges that cannot be met by ritualized attempts to create an analogy to drawing balls from urns.
    Not least of these challenges is the fact that the phenomenon to be studied is, naturally, a data generating process that has, convenient or not, created a sample of survivors to be observed. The researcher’s data collection procedure is secondary. If a researcher narcissistically exhibits his ritual adherence to the urn drawing analogy, while ignoring phenomenal selection by survival, how interesting will his interpretation of the data be?
    Accurate measurement is rarely as simple as drawing a colored ball from an urn. Say, the researcher is using a survey instrument, asking people to answer questions. People lie. If you let people opt out of your survey, your data is inconveniently sampling itself. But, try getting an involuntary survey past ethical review. Use of trained interviewers might overcome the deceptions or reluctance to answer, but critics will complain that the analogy to drawing colored balls from urns has been violated.

    • That about says it all (and if someone isn’t convinced, please read my latest post) 🙂
