Causal inference and the rhetoric of imaginary populations (wonkish)

11 Jun, 2019 at 11:10 | Posted in Statistics & Econometrics | 4 Comments

The most expedient population and data generation model to adopt is one in which the population is regarded as a realization of an infinite super population. This setup is the standard perspective in mathematical statistics, in which random variables are assumed to exist with fixed moments for an uncountable and unspecified universe of events …

This perspective is tantamount to assuming a population machine that spawns individuals forever (i.e., the analog to a coin that can be flipped forever). Each individual is born as a set of random draws from the distributions of Y¹, Y°, and additional variables collectively denoted by S …

Because of its expediency, we will usually write with the superpopulation model in the background, even though the notions of infinite superpopulations and sequences of sample sizes approaching infinity are manifestly unrealistic.
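The superpopulation setup quoted above can be made concrete in a few lines — a minimal, purely illustrative simulation in which each individual is an i.i.d. draw of (Y¹, Y⁰, S) from fixed distributions. The distributions, the effect size, and the variable names are our own assumptions for illustration, not anything specified in the quoted text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the superpopulation model: the "population machine" spawns
# individuals as i.i.d. draws of (Y1, Y0, S) from fixed distributions.
# All distributional choices below are invented for illustration.
n = 100_000
S = rng.normal(size=n)                          # additional variables
Y0 = 1.0 + 0.5 * S + rng.normal(size=n)         # potential outcome, untreated
Y1 = Y0 + 2.0 + rng.normal(scale=0.5, size=n)   # potential outcome, treated

# Under the model, the sample mean of Y1 - Y0 converges to the fixed
# population moment E[Y1 - Y0] = 2 as n grows without bound -- exactly
# the "sample sizes approaching infinity" idealization the quote flags.
ate_hat = (Y1 - Y0).mean()
```

Note that the fixed moment E[Y¹ − Y⁰] exists only because the infinite superpopulation is stipulated to exist; that stipulation is precisely what the post goes on to question.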

In econometrics one often gets the feeling that many of its practitioners think of it as a kind of automatic inferential machine: input data and out comes causal knowledge. This is like pulling a rabbit from a hat. Great — but first you have to put the rabbit in the hat. And this is where assumptions come into the picture.

The assumption of imaginary ‘super populations’ is one of the many dubious assumptions used in modern econometrics.

As social scientists — and economists — we have to confront the all-important question of how to handle uncertainty and randomness. Should we define randomness in terms of probability? If we do, we have to accept that to speak of randomness we also have to presuppose the existence of nomological probability machines, since probabilities cannot be spoken of — and, strictly speaking, do not exist at all — without specifying such system-contexts. Accepting a probability domain and a sample space of infinite populations also implies that judgments are made on the basis of observations that are actually never made!

Infinitely repeated trials or samplings never take place in the real world. So that cannot be a sound inductive basis for a science with aspirations of explaining real-world socio-economic processes, structures or events. It’s not tenable.


In Statistical Models and Causal Inference: A Dialogue with the Social Sciences David Freedman also touched on this fundamental problem, arising when you try to apply statistical models outside overly simple nomological machines like coin tossing and roulette wheels:

Lurking behind the typical regression model will be found a host of such assumptions; without them, legitimate inferences cannot be drawn from the model. There are statistical procedures for testing some of these assumptions. However, the tests often lack the power to detect substantial failures. Furthermore, model testing may become circular; breakdowns in assumptions are detected, and the model is redefined to accommodate. In short, hiding the problems can become a major goal of model building.

Using models to make predictions of the future, or the results of interventions, would be a valuable corrective. Testing the model on a variety of data sets – rather than fitting refinements over and over again to the same data set – might be a good second-best … Built into the equation is a model for non-discriminatory behavior: the coefficient d vanishes. If the company discriminates, that part of the model cannot be validated at all.

Regression models are widely used by social scientists to make causal inferences; such models are now almost a routine way of demonstrating counterfactuals. However, the “demonstrations” generally turn out to depend on a series of untested, even unarticulated, technical assumptions. Under the circumstances, reliance on model outputs may be quite unjustified. Making the ideas of validation somewhat more precise is a serious problem in the philosophy of science. That models should correspond to reality is, after all, a useful but not totally straightforward idea – with some history to it. Developing appropriate models is a serious problem in statistics; testing the connection to the phenomena is even more serious …

In our days, serious arguments have been made from data. Beautiful, delicate theorems have been proved, although the connection with data analysis often remains to be established. And an enormous amount of fiction has been produced, masquerading as rigorous science.

And as if this wasn’t enough, one could — as we’ve seen — also seriously wonder what kind of ‘populations’ these statistical and econometric models ultimately are based on. Why should we as social scientists — and not as pure mathematicians working with formal-axiomatic systems without the urge to confront our models with real target systems — unquestioningly accept models based on concepts like the ‘infinite super populations’ used in e.g. the ‘potential outcome’ framework that has become so popular lately in social sciences?

Of course one could treat observational or experimental data as random samples from real populations. I have no problem with that (although it has to be noted that most ‘natural experiments’ are not based on random sampling from some underlying population — which, of course, means that the effect-estimators, strictly speaking, are only unbiased for the specific groups studied). But probabilistic econometrics does not content itself with that kind of populations. Instead, it creates imaginary populations of ‘parallel universes’ and assumes that our data are random samples from that kind of ‘infinite super populations.’
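The parenthetical point about non-random sampling can be made concrete with a minimal simulation sketch. The finite population, the income distribution, and the ‘only above-median individuals respond’ mechanism are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical *finite, real* population of 1,000,000 incomes
# (log-normal, parameters made up for illustration).
population = rng.lognormal(mean=10.0, sigma=0.5, size=1_000_000)
true_mean = population.mean()

# A genuine random sample: its mean is an unbiased estimator of the
# population mean, and no imaginary superpopulation is needed.
random_sample = rng.choice(population, size=5_000, replace=False)

# A "convenience" sample: suppose only individuals above the median
# respond.  Its mean is unbiased only for that specific group -- the
# estimator says nothing reliable about the whole population.
above_median = population[population > np.median(population)]
convenience_sample = rng.choice(above_median, size=5_000, replace=False)
```

The random sample's mean lands close to the true population mean, while the convenience sample's mean is systematically too high — which is the sense in which estimators from non-random samples are "only unbiased for the specific groups studied."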

But this is actually nothing else but hand-waving! And it is inadequate for real science. As David Freedman writes:

With this approach, the investigator does not explicitly define a population that could in principle be studied, with unlimited resources of time and money. The investigator merely assumes that such a population exists in some ill-defined sense. And there is a further assumption, that the data set being analyzed can be treated as if it were based on a random sample from the assumed population. These are convenient fictions … Nevertheless, reliance on imaginary populations is widespread. Indeed regression models are commonly used to analyze convenience samples … The rhetoric of imaginary populations is seductive because it seems to free the investigator from the necessity of understanding how data were generated.

In social sciences — including economics — it’s always wise to ponder C. S. Peirce’s remark that universes are not as common as peanuts …


  1. Prof. Syll has difficulties in understanding how “judgments are made on the basis of observations that are actually never made!”
    However, all of us do this very often in ordinary life.
    For example, from past experience and knowledge of the experiences of others, when crossing a busy road we can “see” the consequences of stepping in front of a fast moving vehicle.
    Prof. Syll asks:
    “Why should we as social scientists … unquestioningly accept models based on concepts like the ‘infinite super populations’?”
    The following answers are compelling:
    1) Such concepts are innate in humans,
    e.g. in the rapid learning of toddlers that bumps are correlated with pain depending on speed and direction of movement, hardness of objects, part of body, etc.

    2) There is overwhelming empirical evidence that our ancestors were very good at statistical thinking – that’s how they survived and we evolved. They succeeded at hunting, fishing, scavenging, foraging, marauding, philandering etc. by making judgements about prospects in an uncertain world with imperfect information.
    A study of indigenous Maya people found that “probabilistic reasoning does not depend on formal education”.

    3) Even chimps and many other animals naturally think probabilistically.
    From a series of 7 experiments with Bonobos, Chimpanzees, Gorillas and Orangutans it was concluded that:
    “a basic form of drawing inferences from populations to samples is not uniquely human, but evolutionarily more ancient:
    It is shared by our closest living primate relatives, the great apes, and perhaps by other species in the primate lineage and beyond and it thus clearly antedates language and formal mathematical thinking both phylogenetically and ontogenetically.”
    Rakoczy et al. (2014), ‘Apes are intuitive statisticians’, Cognition, 131(1): 60–68.

    4) Contrary to Prof. Syll’s beliefs that the world is dominated by “fundamental uncertainty”, Keynes argued:
    (a) There are “psychological characteristics of human nature and those social practices and institutions which, though not unalterable, are unlikely to undergo a material change over a short period of time except in abnormal or revolutionary circumstances” (General Theory, Chapter 3)
    (b) There are “fundamental psychological law(s), upon which we are entitled to depend with great confidence both a priori from our knowledge of human nature and from the detailed facts of experience, that men are disposed, as a rule and on the average” (General Theory, Chapter 8, Part III).
    (c) “Experience shows that some such psychological law must actually hold…[otherwise] … there would be a violent instability” (General Theory, Chapter 18, Part III (i)).

    • Kingsley,
      Re point 4 – there may be psychological laws which can be discerned. These are quite something else compared to knowledge of the future. Uncertainty refers to knowledge about the future, not psychological laws.

      • Henry,
        My reading is that Keynes saw “psychological laws” and “social practices and institutions” as factors which are likely to persist over time “on the average”, including for a “short period” into the future.
        This opens the possibility of making probabilistic forecasts in the social sciences as in other sciences.
        Of course, this does not apply in “abnormal or revolutionary circumstances”.
        Keynes himself made probabilistic estimates of the investment-income multiplier in the USA for 1929-early 1930s:
        (a) “the multiplier seems to have been less than 3 and probably fairly stable in the neighbourhood of 2·5. ”
        General Theory, Chapter 10 part IV.
        (b) Keynes later revised his estimate: “the changes in money-incomes were from three to five times the changes in net investment … within a reasonable margin of error.”
        General Theory, Appendix 2 (last paragraph), from Economic Journal 1936.
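For what it’s worth, the multiplier figures quoted above can be inverted through the textbook relation k = 1/(1 − c), where c is the marginal propensity to consume. The inversion below is a back-of-the-envelope sketch of ours, not a calculation Keynes states:

```python
def implied_mpc(k: float) -> float:
    """Marginal propensity to consume c implied by multiplier k = 1/(1 - c)."""
    return 1.0 - 1.0 / k

# Keynes's early estimate k ~ 2.5 implies c = 0.6.
# His revised range k in [3, 5] implies c between 2/3 and 0.8.
```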

        • Kingsley,
          You can make all the probabilistic forecasts you like. The reality is that NO-ONE knows what the outcome will be. Just because a modeled outcome has a high probability does not mean it is the outcome that will occur.
          What you call Keynes’ probabilistic estimates seem to me to be no more than eyeballing estimates.
