Significance testing — an embarrassing ritual

29 Apr, 2015 at 10:17 | Posted in Statistics & Econometrics | 1 Comment

Knowing the contents of a toolbox, of course, requires statistical thinking, that is, the art of choosing a proper tool for a given problem. Instead, one single procedure that I call the “null ritual” tends to be featured in texts and practiced by researchers. Its essence can be summarized in a few lines:

The null ritual:
1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p < 0.05, p < 0.01, or p < 0.001 (whichever comes next to the obtained p-value).
3. Always perform this procedure …

The routine reliance on the null ritual discourages not only statistical thinking but also theoretical thinking. One does not need to specify one’s hypothesis, nor any challenging alternative hypothesis … The sole requirement is to reject a null that is identified with “chance.” Statistical theories such as Neyman–Pearson theory and Wald’s theory, in contrast, begin with two or more statistical hypotheses.

In the absence of theory, the temptation is to look first at the data and then see what is significant. The physicist Richard Feynman … has taken notice of this misuse of hypothesis testing. I summarize his argument:

Feynman’s conjecture:
To report a significant result and reject the null in favor of an alternative hypothesis is meaningless unless the alternative hypothesis has been stated before the data was obtained.

Feynman’s conjecture is again and again violated by routine significance testing, where one looks at the data to see what is significant. Statistical packages allow every difference, interaction, or correlation against chance to be tested. They automatically deliver ratings of “significance” in terms of stars, double stars, and triple stars, encouraging the bad after-the-fact habit. The general problem Feynman addressed is known as overfitting … Fitting per se has the same problems as storytelling after the fact, which leads to a “hindsight bias.” The true test of a model is to fix its parameters on one sample, and to test it in a new sample. Then it turns out that predictions based on simple heuristics can be more accurate than routine multiple regressions … Less can be more. The routine use of linear multiple regression exemplifies another mindless use of statistics …
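To make the after-the-fact habit concrete, here is a minimal simulation sketch. It is not from the quoted text. It assumes pure Gaussian noise, a hypothetical helper named `dredge_then_replicate`, and a normal-approximation cutoff of 1.96 for the 5% level. It tests many unrelated predictors against one outcome, picks the "best" one after looking, and then checks a relationship of the same (non-existent) kind in a fresh sample.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% cutoff; normal approximation, an assumption that n is large


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for the null of zero correlation with n observations."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def dredge_then_replicate(n_obs=200, n_predictors=100, seed=0):
    """Hypothetical illustration: every variable is noise, so any 'finding' is spurious."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(n_obs)]

    # Sample 1: correlate many unrelated predictors with one outcome.
    outcome = noise()
    rs = [pearson_r(noise(), outcome) for _ in range(n_predictors)]
    hits = sum(1 for r in rs if abs(t_statistic(r, n_obs)) > Z_CRIT)

    # Look at the data first, then keep whatever looks strongest.
    best_r = max(rs, key=abs)

    # Sample 2: fresh draws of predictor and outcome, unrelated by construction.
    fresh_r = pearson_r(noise(), noise())
    return hits, best_r, fresh_r


if __name__ == "__main__":
    hits, best_r, fresh_r = dredge_then_replicate()
    print(f"'significant' noise predictors: {hits} of 100")
    print(f"strongest in-sample r: {best_r:+.3f}")
    print(f"r in a fresh sample:   {fresh_r:+.3f}")
```

With a 5% cutoff, on the order of five in a hundred noise predictors should earn their stars by chance alone, and the strongest of them will typically look far better in the sample it was picked from than in a new one. That is the gap between fitting and prediction the passage describes.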

We know but often forget that the problem of inductive inference has no single solution. There is no uniformly most powerful test, that is, no method that is best for every problem. Statistical theory has provided us with a toolbox with effective instruments, which require judgment about when it is right to use them … Judgment is part of the art of statistics.

To stop the ritual, we also need more guts and nerves. We need some pounds of courage to cease playing along in this embarrassing game. This may cause friction with editors and colleagues, but it will in the end help them to enter the dawn of statistical thinking.

1 Comment

  1. It is very difficult to follow what Gigerenzer is trying to say in the quoted passage.
    He claims that it is wrong “to look first at the data and then see what is significant” because this is a “misuse of hypothesis testing”.
    However, looking at data is not hypothesis testing, so it can’t be a misuse of hypothesis testing.
    Moreover, this conflicts with the excellent advice of David Giles:
    “1. Always, but always, plot your data.
    2. Remember that data quality is at least as important as data quantity.
    3. Always ask yourself, “Do these results make economic/common sense”?
    4. Check whether your “statistically significant” results are also “numerically/economically significant”.

    Gigerenzer then claims, without explanation, that reporting in “favor of an alternative hypothesis is meaningless unless the alternative hypothesis has been stated before the data was obtained”. Surely this is nonsense – the validity of data is not in any way affected by the date of its collection relative to the date when a hypothesis was first stated, or by whether examination of the data influenced the formulation of a hypothesis.
    Of course, it is always good to get extra data, but this is not Gigerenzer’s point – he claims that any conclusions from the original data are “meaningless”.
    Gigerenzer’s claim would invalidate almost all empirical work where it is impossible to get new experimental data, e.g. macroeconomics.

    Gigerenzer describes his claim as “Feynman’s conjecture”. This follows Gigerenzer’s 2004 paper:

    However, this paper is merely Gigerenzer’s interpretation of anecdotal remarks made by Feynman in an informal lecture in 1963, which was published only in the 1990s, after Feynman’s death.
