## Sizeless science and the cult of statistical significance

19 January, 2013 at 14:47 | Posted in Statistics & Econometrics

Last year yours truly had an interesting luncheon discussion with **Deirdre McCloskey** on her controversy with **Kevin Hoover** on significance testing. It got me thinking about where the fetish status of significance testing comes from, and why we are still teaching and practising it without serious qualification despite its obvious inadequacies.

A non-trivial part of teaching statistics consists of teaching students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how carefully you try to explicate what the probabilities generated by these statistical tests – *p-values* – really are, most students still misinterpret them.

When giving a statistics course for the *Swedish National Research School in History*, I asked the students at the exam to explain how one should correctly interpret *p-values*. Although the correct definition is p(data|null hypothesis), a majority of the students misinterpreted the *p-value* either as the *likelihood of a sampling error* (which is wrong, since the very computation of the p-value is based on the assumption that sampling errors are what cause the sample statistics not to coincide with the null hypothesis), or as the probability of the null hypothesis being true given the data (which is also wrong, since that is p(null hypothesis|data) rather than the correct p(data|null hypothesis)).

This is not to be blamed on the students’ ignorance, but rather on significance testing not being particularly transparent (conditional probability inference is difficult even for those of us who teach and practise it). A lot of researchers fall prey to the same mistakes. So – given that it is anyway very unlikely that any population parameter is exactly zero, and that, contrary to assumption, most samples in social science and economics are neither random nor of the right distributional shape – why continue to press students and researchers to do null hypothesis significance testing, a procedure that relies on a backward logic that students and researchers usually don’t understand?
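The gap between p(data|null hypothesis) and p(null hypothesis|data) can be made vivid with a small simulation. In this sketch (all numbers invented for illustration), 90% of studies test a true null and 10% a real effect; among the results that come out “significant” at the 5% level, the share of true nulls is far above 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims = 50, 20_000

# Illustrative prior: 90% of simulated studies have a true null (effect 0),
# 10% a real effect of 0.5 standard deviations. All numbers are made up.
effects = rng.choice([0.0, 0.5], size=sims, p=[0.9, 0.1])
pvals = np.array([
    stats.ttest_1samp(rng.normal(mu, 1.0, n), 0.0).pvalue for mu in effects
])

# p = P(data at least this extreme | H0). It is NOT P(H0 | data):
# among the "significant" results, far more than 5% come from true nulls.
sig = pvals < 0.05
frac_null_given_sig = (effects[sig] == 0.0).mean()
print(f"share of true nulls among p < 0.05 results: {frac_null_given_sig:.2f}")
```

The exact share depends entirely on the assumed prior and power, which is precisely the point: the p-value alone cannot deliver p(null hypothesis|data).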

In a recent review of Deirdre’s and **Stephen Ziliak**’s *The Cult of Statistical Significance* (University of Michigan Press 2008), mathematical statistician **Olle Häggström** succinctly summarizes what the debate is all about:

> Stephen Ziliak and Deirdre McCloskey claim in their recent book *The Cult of Statistical Significance* [ZM] that the reliance on statistical methods has gone too far and turned into a ritual and an obstacle to scientific progress.
>
> A typical situation is the following. A scientist formulates a null hypothesis. By means of a significance test, she tries to falsify it. The analysis leads to a *p*-value, which indicates how likely it would have been, if the null hypothesis were true, to obtain data at least as extreme as those she actually got. If the *p*-value is below a certain prespecified threshold (typically 0.01 or 0.05), the result is deemed *statistically significant*, which, although far from constituting a definite disproof of the null hypothesis, counts as evidence against it.
>
> Imagine now that a new drug for reducing blood pressure is being tested and that the fact of the matter is that the drug does have a positive effect (as compared with a placebo) but that the effect is so small that it is of no practical relevance to the patient’s health or well-being. If the study involves sufficiently many patients, the effect will nevertheless with high probability be detected, and the study will yield statistical significance. The lesson to learn from this is that in a medical study, statistical significance is not enough – the detected effect also needs to be large enough to be *medically significant*. Likewise, empirical studies in economics (or psychology, geology, etc.) need to consider not only statistical significance but also economic (psychological, geological, etc.) significance.
>
> A major point in *The Cult of Statistical Significance* is the observation that many researchers are so obsessed with statistical significance that they neglect to ask themselves whether the detected discrepancies are large enough to be of any subject-matter significance. Ziliak and McCloskey call this neglect *sizeless science* …
>
> *The Cult of Statistical Significance* is written in an entertaining and polemical style. Sometimes the authors push their position a bit far, such as when they ask themselves: “If null hypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?” (p. 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need *both* statistical *and* subject-matter significance.

Statistical significance doesn’t say that something is important or true. And although Häggström has a point in his last remark, I still think – since there already are far better and more relevant tests that can be done (see e.g. my posts here and here) – it is high time to reconsider the proper function of what has now really become a statistical fetish.
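Häggström’s blood-pressure example is easy to reproduce in a few lines. In this sketch (a hypothetical trial; the effect size, spread, and sample size are all invented for illustration), the drug lowers blood pressure by a clinically meaningless 0.2 mmHg, yet a large enough sample makes the result highly “significant”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical trial: the drug really does lower blood pressure, but only
# by 0.2 mmHg on average -- clinically irrelevant. Numbers are invented.
n = 200_000
treated = rng.normal(-0.2, 10.0, n)   # change in blood pressure, drug group
placebo = rng.normal(0.0, 10.0, n)    # change in blood pressure, placebo

t_stat, p_value = stats.ttest_ind(treated, placebo)
effect = treated.mean() - placebo.mean()

# With n this large the tiny effect is "statistically significant"
# even though it has no medical significance.
print(f"effect = {effect:.2f} mmHg, p = {p_value:.1e}")
```

Statistical significance here is a statement about sample size as much as about the drug, which is exactly the “sizeless science” complaint.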

## 14 Comments


Actually it’s the fight against p-values that’s closer to being a fetish. In many cases (e.g. linear regression), the p-value and the confidence interval represent exactly the same information, and one can be derived from the other. Computing a p-value and comparing it with 0.05, or computing a 95% confidence interval and checking whether it overlaps zero, is the same thing. One might argue about which form conveys the information better – but in practice, economists often already report standard errors or t-stats, in addition to (or instead of) p-values.

Comment by ivansml— 19 January, 2013 #
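The equivalence ivansml describes can be checked numerically. A minimal OLS sketch (made-up data; standard textbook formulas for the slope’s standard error) showing that “p < 0.05” and “the 95% confidence interval excludes zero” are one and the same decision rule:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)      # made-up data for illustration

# OLS slope, standard error, t-statistic and two-sided p-value
X = np.column_stack([np.ones(n), x])
beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
dof = n - 2
se = np.sqrt((rss[0] / dof) * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta[1] / se
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# 95% confidence interval for the slope
crit = stats.t.ppf(0.975, dof)
ci = (beta[1] - crit * se, beta[1] + crit * se)

# The two decision rules coincide exactly:
same_decision = (p_value < 0.05) == (not ci[0] <= 0.0 <= ci[1])
```

The interval, unlike the bare p-value, also displays the magnitude of the estimate, which is where the two reporting conventions differ in practice.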

Not only is the (constant and repetitive) fight against p-values a fetish of its own, it is a fetish conveniently designed to establish a research area in which to publish criticisms. The reformers call for confidence intervals, but without supplements they commit the same fallacies that worry those calling for reforms. A little spoof may be found on my blog today:

http://errorstatistics.com/2013/01/19/saturday-night-brainstorming-and-task-forces-2013-tfsi-on-nhst/

Comment by Mayo— 20 January, 2013 #

I beg to differ.

P-values say nothing about the direction or size of a difference in effect between e.g. treatment and control groups in an experiment.

Confidence intervals reflect the results at the level of the data measurements. P-values don’t.

P-values give an oversimplified, binary (yes/no) picture of what are often complex decisions.

Comment by Lars P Syll— 19 January, 2013 #

A p-value by itself does not. A p-value together with a coefficient estimate does, and reporting the estimates themselves is standard practice (in fact, I don’t recall seeing a paper that would report only p-values and nothing else).

Comment by ivansml— 19 January, 2013 #

I find it all very one-dimensional. When I study chemical systems I rarely find that the truth about these fairly simple systems turns out to be flat.

Comment by Martin— 19 January, 2013 #

It is worth noting that the p-value is incorrectly defined here as a conditional probability: p(data|null). Ziliak and McCloskey are “reformers” who, unfortunately, are spreading blatantly incorrect definitions of concepts such as power. The reformers, in short, are in need of reform. (Interested readers can search my blog on this.)

Comment by Mayo— 20 January, 2013 #

Deborah, I’m not quite sure I follow you. We do define p-value as the probability of obtaining a test statistic value larger than its observed value, given that the null hypothesis is true. And this is what the abbreviated formula you criticize says (and also what I remember Fisher himself said), so what’s the problem? Could you elaborate a little?

Comment by Lars P Syll— 20 January, 2013 #

First, I don’t see where it is explained what your “abbreviation” is an abbreviation for (did I miss that?). Second, the P-value is not a conditional probability. The P-value, in relation to a statistic d(X) (and a null hypothesis Ho, and a model M), gives the probability of {d(X) > d(x)} under the assumption that x was generated by the process described in Ho (according to statistical model M).

Comment by Mayo— 20 January, 2013 #

The “abbreviation” is found in lots of statistics and econometrics textbooks as a handy/shorter version of the longer and more “correct” formula that I thought (until now) we all agreed on.

Your definition of the p-value is not the one usually used in statistics texts, and even Aris Spanos, e.g. in his Probability Theory and Statistical Inference (p. 689), writes that only “IN A CERTAIN SENSE the p-value might be INTERPRETED as a measure of how appropriately the null hypothesis describes the mechanism that has given rise to the observed data.” I’m not sure statisticians really make an evaluation of the process/mechanism generating x, although of course I agree that in a strict sense we can never talk about probabilities outside probabilistic/nomological machines/models. (I’ve read your “Error” book, which I appreciated for its critique of Bayesianism, but on this issue I’m not convinced you’re right, and although it was quite a while ago since I read Fisher, I’m not sure he would have agreed either.)

Comment by Lars P Syll— 21 January, 2013 #

You’re focusing on the wrong part of my comment. Go ahead and put in Ho:

P(d(X) > d(x); Ho).

And yes, you do need a statistical model to compute this.

Comment by Mayo— 21 January, 2013 #
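Once a model M is specified, the quantity P(d(X) > d(x); Ho) in Mayo’s formulation can be computed directly. A minimal sketch for a one-sided z-test of Ho: mu = 0 with known sigma = 1 (all numbers invented), obtaining the tail probability both analytically and by simulating d(X) under the null process:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 30
x = rng.normal(0.4, 1.0, n)          # the observed sample (made-up data)
d_obs = np.sqrt(n) * x.mean()        # test statistic d(x)

# Under Ho and model M, d(X) ~ N(0, 1); the p-value is a tail probability
# of that sampling distribution, not a conditional probability.
p_analytic = stats.norm.sf(d_obs)

# Same number, obtained by simulating d(X) under the process described in Ho:
d_sim = np.sqrt(n) * rng.normal(0.0, 1.0, (100_000, n)).mean(axis=1)
p_mc = (d_sim >= d_obs).mean()
```

The simulation makes the “;” in P(d(X) > d(x); Ho) concrete: Ho fixes the data-generating process, and the probability is computed within that model rather than conditioned on an event.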

We need models to compute p-values, yes, and once we ponder that we get into the much more interesting question of “statistical adequacy” – do the data “fit” the model? – and the fact (which your fellow error-statistician Aris Spanos also noticed in his Erasmus Journal for Philosophy and Economics article a couple of years ago) that probably more than 99% of all “applied” articles published in the American Economic Review during the last decades wouldn’t live up to the standards of the usual adequacy tests. And if you don’t pass the adequacy test, the value of the p-value is zero! So, my view is that of course you can use significance tests (as they are usually defined), but that people often put too much emphasis on them, and that they are totally uninteresting so long as we haven’t FIRST properly done a “statistical adequacy” test.

Comment by Lars P Syll— 21 January, 2013 #

“Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing”, by Raymond Hubbard (Drake University) and R. Murray Lindsay (University of Lethbridge), 2008.

Abstract:

“Reporting p values from statistical significance tests is common in the empirical literature. Sir Ronald Fisher saw the p value as playing a useful role in knowledge development by acting as an ‘objective’ measure of inductive evidence against the null hypothesis. We review several reasons why the p value is an unobjective and inadequate measure of evidence when statistically testing hypotheses. A common theme throughout many of these reasons is that p values exaggerate the evidence against H0. This, in turn, calls into question the validity of much published work based on comparatively small, including .05, p values. Indeed, if researchers were fully informed about the limitations of the p value as a measure of evidence, this inferential index could not possibly enjoy its ongoing ubiquity. Replication with extension research focusing on sample statistics, effect sizes, and their confidence intervals is a better vehicle for reliable knowledge development than using p values. Fisher would also have agreed with the need for replication research.”

http://wiki.bio.dtu.dk/~agpe/papers/pval_notuseful.pdf

“Why P=0.05?”, by Gerard E. Dallal (2007): “Historical background to the origins of p-values and the choice of 0.05 as the cut-off for significance. The standard level of significance used to justify a claim of a statistically significant effect is 0.05. For better or worse, the term statistically significant has become synonymous with P≤0.05.”

http://www.jerrydallal.com/LHSP/p05.htm

Comment by Jan Milch— 21 January, 2013 #

Before leaving off, let me recommend dropping what you call your abbreviated statement of p-values: It is extremely misleading and promotes the supposition that p-values are likelihoods.

Comment by Mayo— 21 January, 2013 #
