Why data is NOT enough to answer scientific questions

7 Aug, 2018 at 10:50 | Posted in Statistics & Econometrics | 3 Comments

[Cover image: The Book of Why]

Ironically, the need for a theory of causation began to surface at the same time that statistics came into being. In fact modern statistics hatched out of the causal questions that Galton and Pearson asked about heredity and out of their ingenious attempts to answer them from cross-generation data. Unfortunately, they failed in this endeavor and, rather than pause to ask “Why?”, they declared those questions off limits, and turned to develop a thriving, causality-free enterprise called statistics.

This was a critical moment in the history of science. The opportunity to equip causal questions with a language of their own came very close to being realized, but was squandered. In the following years, these questions were declared unscientific and went underground. Despite heroic efforts by the geneticist Sewall Wright (1889-1988), causal vocabulary was virtually prohibited for more than half a century. And when you prohibit speech, you prohibit thought, and you stifle principles, methods, and tools.

Readers do not have to be scientists to witness this prohibition. In Statistics 101, every student learns to chant: “Correlation is not causation.” With good reason! The rooster crow is highly correlated with the sunrise, yet it does not cause the sunrise.

Unfortunately, statistics took this common-sense observation and turned it into a fetish. It tells us that correlation is not causation, but it does not tell us what causation is. In vain will you search the index of a statistics textbook for an entry on “cause.” Students are never allowed to say that X is the cause of Y — only that X and Y are related or associated.

A popular idea in quantitative social sciences is to think of a cause (C) as something that increases the probability of its effect or outcome (O). That is:

P(O|C) > P(O|-C)

However, as is also well-known, a correlation between two variables, say A and B, does not necessarily imply that one is a cause of the other, in either direction, since they may both be effects of a common cause, C.

In statistics and econometrics, we usually solve this confounder problem by controlling for C, i.e. by holding C fixed. This means that we actually look at different populations: those in which C occurs in every case, and those in which C doesn’t occur at all. Within each of these populations C does not vary, so knowing the value of A does not influence the probability of C [P(C|A) = P(C)]. So if there still exists a correlation between A and B in either of these populations, there has to be some other cause operating. But if all other possible causes have been controlled for too, and there is still a correlation between A and B, we may safely conclude that A is a cause of B, since by controlling for all other possible causes, the correlation between the putative cause A and all the other possible causes (D, E, F, …) is broken.
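To make the argument concrete, here is a minimal sketch of the common-cause case. The probabilities are hypothetical and chosen only for illustration, and A and B are assumed to depend on nothing but C. It computes exactly what happens to the 'probability raising' criterion before and after holding C fixed.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_C = F(1, 2)                                   # P(C)
P_A_GIVEN_C = {True: F(4, 5), False: F(1, 5)}   # P(A | C), P(A | not C)
P_B_GIVEN_C = {True: F(4, 5), False: F(1, 5)}   # P(B | C), P(B | not C)


def joint(a, b, c):
    """P(A=a, B=b, C=c) when A and B each depend only on C."""
    pc = P_C if c else 1 - P_C
    pa = P_A_GIVEN_C[c] if a else 1 - P_A_GIVEN_C[c]
    pb = P_B_GIVEN_C[c] if b else 1 - P_B_GIVEN_C[c]
    return pc * pa * pb


def prob_b(given):
    """P(B | given), by exact enumeration of the eight possible worlds."""
    worlds = list(product([True, False], repeat=3))
    denom = sum(joint(*w) for w in worlds if given(*w))
    numer = sum(joint(*w) for w in worlds if given(*w) and w[1])
    return numer / denom


# Pooled population: A appears to raise the probability of B.
pooled_a = prob_b(lambda a, b, c: a)            # P(B | A)
pooled_not_a = prob_b(lambda a, b, c: not a)    # P(B | not A)

# Population in which C occurs in every case: the association is gone.
within_c_a = prob_b(lambda a, b, c: a and c)            # P(B | A, C)
within_c_not_a = prob_b(lambda a, b, c: not a and c)    # P(B | not A, C)

print("P(B|A) =", pooled_a, " P(B|-A) =", pooled_not_a)
print("P(B|A,C) =", within_c_a, " P(B|-A,C) =", within_c_not_a)
```

In this toy set-up P(B|A) > P(B|-A) although A has no influence on B whatsoever, and the inequality turns into an equality once C is held fixed. The sketch only works because we wrote down the causal structure ourselves, which is exactly the knowledge we do not get from the data.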

This is, of course, a very demanding prerequisite, since we may never actually be sure to have identified all putative causes. Even in scientific experiments, the number of uncontrolled causes may be innumerable. Since nothing less will do, we all understand how hard it is to actually get from correlation to causality. This also means that relying only on statistics or econometrics is not enough to deduce causes from correlations.

Some people think that randomization may solve the empirical problem. By randomizing we are getting different populations that are homogeneous with regard to all variables except the one we think is a genuine cause. In that way, we are supposedly able to do without actually knowing what all these other factors are.

If you succeed in performing an ideal randomization with different treatment groups and control groups, that is attainable. But it presupposes that you really have been able to establish, and not just assume, that all other causes but the putative one (A) have the same probability distribution in the treatment and control groups, and that the probability of assignment to treatment or control groups is independent of all other possible causal variables.

Unfortunately, real experiments and real randomizations seldom or never achieve this. So, yes, we may do without knowing all causes, but it takes ideal experiments and ideal randomizations to do that, not real ones.
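A small simulation may illustrate the point. It is only a sketch under stated assumptions: the effect size, the single background cause C, and the selection rule are all hypothetical, and the function `simulate` is made up for this illustration. It compares self-selected treatment with coin-flip assignment, and reports how unevenly C ends up distributed between the groups.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.0  # hypothetical effect of A on B, chosen for illustration


def simulate(n, randomized, seed):
    """Return (estimated effect of A on B, imbalance of C between groups)."""
    rng = random.Random(seed)
    b_treated, b_control, c_treated, c_control = [], [], [], []
    for _ in range(n):
        c = rng.gauss(0, 1)  # background cause C, unknown to the analyst
        if randomized:
            a = rng.random() < 0.5       # coin flip, independent of C
        else:
            a = c + rng.gauss(0, 1) > 0  # self-selection driven by C
        b = TRUE_EFFECT * a + 2.0 * c + rng.gauss(0, 1)
        if a:
            b_treated.append(b)
            c_treated.append(c)
        else:
            b_control.append(b)
            c_control.append(c)
    effect = mean(b_treated) - mean(b_control)
    imbalance = mean(c_treated) - mean(c_control)
    return effect, imbalance


obs_effect, obs_imbalance = simulate(20_000, randomized=False, seed=1)
rct_effect, rct_imbalance = simulate(20_000, randomized=True, seed=1)
small_effect, small_imbalance = simulate(50, randomized=True, seed=1)

print("self-selected, n=20000:", obs_effect, obs_imbalance)
print("randomized,    n=20000:", rct_effect, rct_imbalance)
print("randomized,    n=50:   ", small_effect, small_imbalance)
```

With a very large sample, the coin flip makes the groups nearly identical in C and the difference in means lands close to the assumed effect, while self-selection does not. In the small randomized sample, C is balanced only by luck, and the imbalance left over by chance goes straight into the estimate. The large-sample run is the 'ideal' case; real experiments look more like the small one, with many more background causes than the single C assumed here.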

That means that in practice we do have to have sufficient background knowledge to deduce causal knowledge. Without old knowledge, we can’t get new knowledge, and — no causes in, no causes out.
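'No causes in, no causes out' can be shown with a toy calculation. The sketch below assumes two hypothetical models with made-up probabilities: in one A causes B, in the other B causes A. It computes the joint distribution each model implies, and then what each model says would happen to B if we intervened and forced A to occur.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen only for illustration.
P_CAUSE = F(1, 2)                                  # P(cause occurs)
P_EFFECT_GIVEN = {True: F(4, 5), False: F(1, 5)}   # P(effect | cause), P(effect | no cause)


def joint_a_causes_b():
    """Joint distribution over (A, B) when A is the cause and B the effect."""
    table = {}
    for a in (True, False):
        for b in (True, False):
            pa = P_CAUSE if a else 1 - P_CAUSE
            pb = P_EFFECT_GIVEN[a] if b else 1 - P_EFFECT_GIVEN[a]
            table[(a, b)] = pa * pb
    return table


def joint_b_causes_a():
    """Joint distribution over (A, B) when B is the cause and A the effect."""
    table = {}
    for a in (True, False):
        for b in (True, False):
            pb = P_CAUSE if b else 1 - P_CAUSE
            pa = P_EFFECT_GIVEN[b] if a else 1 - P_EFFECT_GIVEN[b]
            table[(a, b)] = pb * pa
    return table


# Observationally, the two models cannot be told apart.
same_data = joint_a_causes_b() == joint_b_causes_a()

# Under an intervention that forces A to occur, they disagree:
p_b_if_a_forced_model1 = P_EFFECT_GIVEN[True]  # A causes B: B follows A
p_b_if_a_forced_model2 = P_CAUSE               # B causes A: B is unaffected

print("identical joint distributions:", same_data)
print("P(B) when A is forced:", p_b_if_a_forced_model1, "vs", p_b_if_a_forced_model2)
```

Both models generate exactly the same data, yet they give different answers to the causal question. Which answer is right has to be settled by background knowledge about the direction of causation, since no amount of data from this distribution can do it.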

Econometrics is basically a deductive method. Given the assumptions (such as manipulability, transitivity, Reichenbach probability principles, separability, additivity, linearity, etc., etc.) it delivers deductive inferences. The problem, of course, is that we will never completely know when the assumptions are right. Real target systems are seldom epistemically isomorphic to axiomatic-deductive models/systems, and even if they were, we still have to argue for the external validity of the conclusions reached from within these epistemically convenient models/systems. Causal evidence generated by statistical/econometric procedures may be valid in closed models, but what we usually are interested in is causal evidence in the real target system we happen to live in.

Advocates of econometrics want to have deductively automated answers to fundamental causal questions. But to apply ‘thin’ methods we have to have ‘thick’ background knowledge of what’s going on in the real world, and not in idealized models. Conclusions can only be as certain as their premises — and that also applies to the quest for causality in econometrics.

The central problem with the present ‘machine learning’ and ‘big data’ hype is that so many — falsely — think that they can get away with analysing real-world phenomena without any (commitment to) theory. But — data never speaks for itself. Without a prior statistical set-up, there actually are no data at all to process. And — using a machine learning algorithm will only produce what you are looking for.

Clever data-mining tricks are never enough to answer important scientific questions. Theory matters.


  1. Statistics tell us very little except when used to test the outcomes of well-designed studies or when external controls, time, and/or physical forces limit the causal arrow to only one direction. Statistics merely indicate the likelihood that your correlation occurred by mere chance.

    For example, there is a correlation between snowfall and auto accidents. Clearly accidents do not cause snow, so it is reasonable to surmise that snow causes accidents. Other ways exist to burrow down on causality, such as the elimination of moderator and mediator effects that might be contributing causal factors. OLS regression is one way to do that. MANOVA, MDA, interrupted time-series analyses, 2SLS, maximum likelihood statistics, and dynamical models (an emerging favorite of mine) offer alternatives.

    Indeed econometricians are about as bad as epidemiologists in falsely inferring cause from correlation. We can all recall research in which drinking red wine was proven by correlation to extend your longevity, but white wine was not; then white was good too; then, most recently, we hear that each drink of any wine knocks an hour or two off your life expectancy. Duh.

    What is at the heart of the problem is the dichotomy between research TO PROVE (aka junk science) vs research TO FIND OUT (real science). Finally, one thing we know without statistics is that econometric analyses have failed to predict economic catastrophes. They do seem to be good at helping diagnose contributing factors post catastrophe. Alas, those analyses rarely drive market controls (Dodd-Frank RIP).

  2. You cannot use axiomatic-deductive methods of analysis to devise an ersatz, generic theory entirely out of concepts of abstract probability and use that to bootstrap an observed fog of statistics into useful knowledge? How sad and disappointing.

  3. “how hard it is to actually get from correlation to causality … statistics or econometrics is not enough to deduce causes from correlations … we have to have ‘thick’ background knowledge of what’s going on in the real world”

    Prof. Syll is misleading in saying “how hard it is to actually get from correlation to causality”. This can be read as implying that discovering causal relationships is rare. In fact we succeed in doing this in multiple ways almost every day of our lives, and we would die very young if we couldn’t do this. Even apes can do it!
    Prof. Syll is also misleading in saying “statistics or econometrics is not enough to deduce causes from correlations”. Neither statistics nor econometrics can “deduce” causation. Causation is an empirical matter. It may be strongly or weakly suggested by data but it can’t be deduced.
    Of course empirical judgements need to be consistent with ALL of the best available data. But Prof. Syll is misleading in the way he suggests that there is need for a “‘thick’ background knowledge”. He seems to imply that it is possible to gain “thick” knowledge beyond mere empirical knowledge based on observed data. There is no evidence for this peculiar theory; it is merely part of Prof. Syll’s philosophy, “dialectical transcendental critical realism”, which was invented by Roy Bhaskar.
