Causality in the social sciences, and in economics, can never be solely a question of statistical inference. Causality entails more than predictability, and explaining social phenomena in real depth requires theory. Analysis of variation, the foundation of all econometrics, can never in itself reveal how these variations are brought about. Only when we are able to tie actions, processes or structures to the statistical relations detected can we say that we are getting at relevant explanations of causation.
For more on these issues, see the chapter “Capturing causality in economics and the limits of statistical inference” in my On the use and misuse of theories and models in economics.
Most scientists use two closely related statistical approaches to make inferences from their data: significance testing and hypothesis testing. Significance testers and hypothesis testers seek to determine if apparently interesting patterns (“effects”) in their data are real or illusory. They are concerned with whether the effects they observe could just have emanated from randomness in the data.
The first step in this process is to nominate a “null hypothesis” which posits that there is no effect. Mathematical procedures are then used to estimate the probability that an effect at least as big as that which was observed would have arisen if the null hypothesis was true. That probability is called “p”.
If p is small (conventionally less than 0.05, or 5%) then the significance tester will claim that it is unlikely an effect of the observed magnitude would have arisen by chance alone. Such effects are said to be “statistically significant”. Sir Ronald Fisher who, in the 1920s, developed contemporary methods for generating p values, interpreted small p values as being indicative of “real” (not chance) effects. This is the central idea in significance testing.
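To make the definition of p concrete, here is a minimal sketch of one way to estimate it: a permutation test on the difference in group means. The data values, the function name `permutation_p_value` and the choice of a permutation test are assumptions made for illustration; they are not taken from the text quoted above, and Fisher's own procedures differ in detail.

```python
import random

def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Estimate p: the share of random relabellings whose absolute mean
    difference is at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(sum(treated) / len(treated) - sum(control) / len(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are arbitrary,
        # so shuffling them shows what "chance alone" produces.
        rng.shuffle(pooled)
        group_a, group_b = pooled[:n_treated], pooled[n_treated:]
        diff = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical measurements, invented for this example.
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7]
control = [4.8, 5.0, 5.2, 4.6, 5.5, 4.9]
print(f"estimated p = {permutation_p_value(treated, control):.4f}")
```

Note what the number does and does not say: it is the frequency of differences this large when the labels carry no information, nothing more.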
Significance testing has been under attack since it was first developed … Jerzy Neyman and Egon Pearson argued that Fisher’s interpretation of p was dodgy. They developed an approach called hypothesis testing in which the p value serves only to help the researcher make an optimised choice between the null hypothesis and an alternative hypothesis: If p is greater than or equal to some threshold (such as 0.05) the researcher chooses to believe the null hypothesis. If p is less than the threshold the researcher chooses to believe the alternative hypothesis. In the long run (over many experiments) adoption of the hypothesis testing approach minimises the rate of making incorrect choices.
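The long-run claim can be illustrated with a small simulation. The sketch below assumes two groups drawn from the same normal distribution with a known standard deviation of 1, so the null hypothesis is true in every simulated experiment; the function name and all parameter values are hypothetical. The fraction of experiments in which the null is rejected should then settle near the chosen threshold.

```python
import random
from statistics import NormalDist

def false_rejection_rate(n_experiments=20_000, n_per_group=20,
                         alpha=0.05, seed=1):
    """Share of experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    std_error = (2 / n_per_group) ** 0.5            # known sigma = 1 assumed
    rejections = 0
    for _ in range(n_experiments):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        z = (mean_a - mean_b) / std_error
        if abs(z) > critical:
            rejections += 1
    return rejections / n_experiments

rate = false_rejection_rate()
print(f"rejected a true null in {rate:.1%} of experiments")
```

The guarantee concerns the whole sequence of experiments. It says nothing about whether any single rejection in that sequence was one of the mistaken ones.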
Critics have pointed out that there is limited value in knowing only that errors have been minimised in the long run – scientists don’t just want to know they have been wrong as infrequently as possible, they want to know if they can believe their last experiment!
Today’s scientists typically use a messy concoction of significance testing and hypothesis testing. Neither Fisher nor Neyman would be satisfied with much of current statistical practice.
Scientists have enthusiastically adopted significance testing and hypothesis testing because these methods appear to solve a fundamental problem: how to distinguish “real” effects from randomness or chance. Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted.
Consider a clinical trial designed to investigate the effectiveness of a new treatment for some disease. After the trial has been conducted the researchers might ask “is the observed effect of treatment real, or could it have arisen merely by chance?” If the calculated p value is less than 0.05 the researchers might claim the trial has demonstrated the treatment was effective. But even before the trial was conducted we could reasonably have expected the treatment was “effective” – almost all drugs have some biochemical action and all surgical interventions have some effects on health. Almost all health interventions have some effect; it’s just that some treatments have effects that are large enough to be useful and others have effects that are trivial and unimportant.
So what’s the point in showing empirically that the null hypothesis is not true? Researchers who conduct clinical trials need to determine if the effect of treatment is big enough to make the intervention worthwhile, not whether the treatment has any effect at all.
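A short calculation shows why rejecting the null says little about whether an effect matters. The sketch assumes a two-sample z-test with unit variance and a true difference of 0.01 standard deviations, a size chosen here to stand for "trivial"; the function name and the sample sizes are illustrative assumptions.

```python
from statistics import NormalDist

def two_sided_p(effect_in_sd, n_per_group):
    """p value of a two-sample z-test when the observed mean difference
    equals effect_in_sd (measured in standard deviations, sigma = 1)."""
    std_error = (2 / n_per_group) ** 0.5
    z = effect_in_sd / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The effect never changes; only the sample size does.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sided_p(0.01, n):.3g}")
```

With enough participants the same negligible difference crosses any conventional threshold, which is why the size of the effect, and not its p value, is the quantity of interest.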
A more technical issue is that p tells us the probability of observing the data given that the null hypothesis is true. But most scientists think p tells them the probability the null hypothesis is true given their data. The difference might sound subtle but it’s not. It is like the difference between the probability that a prime minister is male and the probability a male is prime minister! …
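The gap between the two conditional probabilities can be made explicit with Bayes' rule. The sketch below is a simplified calculation: it conditions on the event "the result was significant" and does not use an exact p value, and the prior share of true nulls and the power are hypothetical inputs that the quoted text does not supply.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | result is significant), via Bayes' rule.

    prior_null: assumed share of tested hypotheses where the null holds
    alpha:      P(significant | null is true)
    power:      P(significant | null is false)
    """
    false_positives = alpha * prior_null
    true_positives = power * (1 - prior_null)
    return false_positives / (false_positives + true_positives)

# Hypothetical inputs: 90% of tested nulls are true, alpha 5%, power 80%.
posterior = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"P(significant | null) = 0.05, but P(null | significant) = {posterior:.2f}")
```

Under these assumed inputs the two quantities differ by a factor of about seven, and changing the prior changes the answer, which is precisely the information p alone does not contain.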
Significance testing and hypothesis testing are so widely misinterpreted that they impede progress in many areas of science. What can be done to hasten their demise? Senior scientists should ensure that a critical exploration of the methods of statistical inference is part of the training of all research students. Consumers of research should not be satisfied with statements that “X is effective”, or “Y has an effect”, especially when support for such claims is based on the evil p.
Knowing the contents of a toolbox, of course, requires statistical thinking, that is, the art of choosing a proper tool for a given problem. Instead, one single procedure that I call the “null ritual” tends to be featured in texts and practiced by researchers. Its essence can be summarized in a few lines:
The null ritual:
1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p < 0.05, p < 0.01, or p < 0.001 (whichever comes next to the obtained p-value).
3. Always perform this procedure …
The routine reliance on the null ritual discourages not only statistical thinking but also theoretical thinking. One does not need to specify one’s hypothesis, nor any challenging alternative hypothesis … The sole requirement is to reject a null that is identified with “chance.” Statistical theories such as Neyman–Pearson theory and Wald’s theory, in contrast, begin with two or more statistical hypotheses.
In the absence of theory, the temptation is to look first at the data and then see what is significant. The physicist Richard Feynman … has taken notice of this misuse of hypothesis testing. I summarize his argument:
To report a significant result and reject the null in favor of an alternative hypothesis is meaningless unless the alternative hypothesis has been stated before the data was obtained.
Feynman’s conjecture is again and again violated by routine significance testing, where one looks at the data to see what is significant. Statistical packages allow every difference, interaction, or correlation against chance to be tested. They automatically deliver ratings of “significance” in terms of stars, double stars, and triple stars, encouraging the bad after-the-fact habit. The general problem Feynman addressed is known as overfitting … Fitting per se has the same problems as story telling after the fact, which leads to a “hindsight bias.” The true test of a model is to fix its parameters on one sample, and to test it in a new sample. Then it turns out that predictions based on simple heuristics can be more accurate than routine multiple regressions … Less can be more. The routine use of linear multiple regression exemplifies another mindless use of statistics …
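The after-the-fact habit and its remedy can both be seen in a simulation built on pure noise. In this sketch, which is an illustration under stated assumptions and not a reconstruction of Feynman's or the author's argument, 200 unrelated predictors are screened against an outcome, the strongest correlation is kept, and the same relationship is then checked in a fresh sample. All names and sizes are hypothetical.

```python
import random

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def dredge_then_retest(n_predictors=200, n_obs=30, seed=0):
    """Return (best correlation found by searching, its value in new data)."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0, 1) for _ in range(n_obs)]

    # First sample: look at everything, keep whatever looks strongest.
    outcome_a = noise()
    in_sample = max(
        (correlation(noise(), outcome_a) for _ in range(n_predictors)),
        key=abs,
    )
    # New sample: every variable is noise, so the selected predictor
    # is simply redrawn together with a fresh outcome.
    out_of_sample = correlation(noise(), noise())
    return in_sample, out_of_sample

results = [dredge_then_retest(seed=s) for s in range(20)]
mean_in = sum(abs(a) for a, _ in results) / len(results)
mean_out = sum(abs(b) for _, b in results) / len(results)
print(f"mean |r| after searching: {mean_in:.2f}; in new data: {mean_out:.2f}")
```

With 30 observations, a correlation above roughly 0.36 in absolute value would earn a star at the 5% level, and the search reliably finds one even though nothing is there. The new sample removes it.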
We know but often forget that the problem of inductive inference has no single solution. There is no uniformly most powerful test, that is, no method that is best for every problem. Statistical theory has provided us with a toolbox with effective instruments, which require judgment about when it is right to use them … Judgment is part of the art of statistics.
To stop the ritual, we also need more guts and nerves. We need some pounds of courage to cease playing along in this embarrassing game. This may cause friction with editors and colleagues, but it will in the end help them to enter the dawn of statistical thinking.
It will be remembered that the seventy translators of the Septuagint were shut up in seventy separate rooms with the Hebrew text and brought out with them, when they emerged, seventy identical translations. Would the same miracle be vouchsafed if seventy multiple correlators were shut up with the same statistical material? And anyhow, I suppose, if each had a different economist perched on his a priori, that would make a difference to the outcome.
Take a look at a map of Africa showing male circumcision rates, and superimpose on it data on HIV/AIDS prevalence. There is a very close correspondence between the two, with the exceptions being cities with large numbers of recent uncircumcised male migrants. One might therefore conclude that male circumcision reduces the chances of contracting HIV/AIDS, and indeed there are medical reasons to believe this may be so. But maybe some third, underlying variable explains both circumcision and HIV/AIDS prevalence. That is, those who choose to be circumcised have special characteristics which make them less likely to contract HIV/AIDS, so a comparison of HIV/AIDS rates between circumcised and uncircumcised men will give a biased estimate of the impact of circumcision on HIV/AIDS prevalence. There is such a factor: being Muslim. Muslim men are circumcised and less likely to engage in risky sexual behaviour exposing themselves to HIV/AIDS, partly as they do not drink alcohol. Again we are not comparing like with like: circumcised men have different characteristics compared to uncircumcised men, and these characteristics affect the outcome of interest.
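The logic of a third variable can be sketched with simulated data. In the example below, membership of a group raises the chance of an exposure and independently lowers the risk of an outcome, while the exposure itself has no causal effect by construction. Every probability is invented for illustration, the variable names are hypothetical, and the simulation says nothing about the real-world effect of circumcision; it only shows how a naive comparison can mislead.

```python
import random

def simulate(n=200_000, seed=3):
    """Tally people and cases by (group member, exposed)."""
    rng = random.Random(seed)
    counts = {}  # (in_group, exposed) -> [people, cases]
    for _ in range(n):
        in_group = rng.random() < 0.5
        # Group membership makes exposure far more likely ...
        exposed = rng.random() < (0.9 if in_group else 0.2)
        # ... and lowers risk; exposure itself plays no role here.
        case = rng.random() < (0.05 if in_group else 0.15)
        cell = counts.setdefault((in_group, exposed), [0, 0])
        cell[0] += 1
        cell[1] += case
    return counts

def rate(counts, exposed, in_group=None):
    """Outcome rate among the exposed or unexposed, optionally within a group."""
    cells = [v for (g, e), v in counts.items()
             if e == exposed and (in_group is None or g == in_group)]
    return sum(v[1] for v in cells) / sum(v[0] for v in cells)

counts = simulate()
naive_diff = rate(counts, True) - rate(counts, False)
diff_in = rate(counts, True, True) - rate(counts, False, True)
diff_out = rate(counts, True, False) - rate(counts, False, False)
print(f"naive difference (exposed minus unexposed): {naive_diff:+.3f}")
print(f"within group: {diff_in:+.3f}; outside group: {diff_out:+.3f}")
```

The naive comparison suggests a sizeable protective effect, yet comparing like with like inside each group shows a difference close to zero, which is the true effect in this simulation.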