Significance testing — an embarrassing ritual29 April, 2015 at 10:17 | Posted in Statistics & Econometrics | 1 Comment
Knowing the contents of a toolbox, of course, requires statistical thinking, that is, the art of choosing a proper tool for a given problem. Instead, one single procedure that I call the “null ritual” tends to be featured in texts and practiced by researchers. Its essence can be summarized in a few lines:
The null ritual:
1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
2. Use 5% as a convention for rejecting the null. If signiﬁcant, accept your research hypothesis. Report the result as p < 0.05, p < 0.01, or p < 0.001 (whichever comes next to the obtained p-value).
3. Always perform this procedure …
The routine reliance on the null ritual discourages not only statistical thinking but also theoretical thinking. One does not need to specify one’s hypothesis, nor any challenging alternative hypothesis … The sole requirement is to reject a null that is identiﬁed with “chance.” Statistical theories such as Neyman–Pearson theory and Wald’s theory, in contrast, begin with two or more statistical hypotheses.
In the absence of theory, the temptation is to look ﬁrst at the data and then see what is signiﬁcant. The physicist Richard Feynman … has taken notice of this misuse of hypothesis testing. I summarize his argument:
To report a signiﬁcant result and reject the null in favor of an alternative hypothesis is meaningless unless the alternative hypothesis has been stated before the data was obtained.
Feynman’s conjecture is again and again violated by routine signiﬁcance testing, where one looks at the data to see what is signiﬁcant. Statistical packages allow every difference, interaction, or correlation against chance to be tested. They automatically deliver ratings of “signiﬁcance” in terms of stars, double stars, and triple stars, encouraging the bad afterthe-fact habit. The general problem Feynman addressed is known as overﬁtting … Fitting per se has the same
problems as story telling after the fact, which leads to a “hindsight bias.” The true test of a model is to ﬁx its parameters on one sample, and to test it in a new sample. Then it turns out that predictions based on simple heuristics can be more accurate than routine multiple regressions … Less can be more. The routine use of linear multiple regression exempliﬁes another mindless use of statistics …
We know but often forget that the problem of inductive inference has no single solution. There is no uniformly most powerful test, that is, no method that is best for every problem. Statistical theory has provided us with a toolbox with effective instruments, which require judgment about when it is right to use them … Judgment is part of the art of statistics.
To stop the ritual, we also need more guts and nerves. We need some pounds of courage to cease playing along in this embarrassing game. This may cause friction with editors and colleagues, but it will in the end help them to enter the dawn of statistical thinking.