## Avoiding statistical ‘dichotomania’

19 Sep, 2022 at 11:49 | Posted in Statistics & Econometrics | 2 Comments

We are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis …

The rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions …

Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.

One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold …

We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.

Valentin Amrhein, Sander Greenland, Blake McShane
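The claim that P values vary "to a surprising degree" from study to study is easy to check by simulation. The sketch below is only an illustration under assumptions that are not in the quoted text: a true effect of 0.3 standard deviations, 50 observations per study, a known standard deviation of 1, and a two-sided z-test of the null hypothesis of zero effect. The function names are made up for this example.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, sigma=1.0):
    """Two-sided P value for H0: mean = 0, assuming a known standard deviation."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def replicate_p_values(true_effect=0.3, n=50, studies=1000, seed=1):
    """Run the same hypothetical study many times and collect the P values."""
    rng = random.Random(seed)
    return [
        z_test_p_value([rng.gauss(true_effect, 1.0) for _ in range(n)])
        for _ in range(studies)
    ]


p_values = sorted(replicate_p_values())
print("smallest P value:", p_values[0])
print("median P value:  ", p_values[len(p_values) // 2])
print("largest P value: ", p_values[-1])
print("share below 0.05:", sum(p < 0.05 for p in p_values) / len(p_values))
```

Every simulated study here samples the same population with the same real effect, yet the P values should spread from far below 0.001 to well above 0.3, and only a little more than half fall below 0.05. Sorting such studies into 'significant' and 'non-significant' piles would suggest a conflict that does not exist.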

In its standard form, a significance test is not the kind of ‘severe test’ we look for when trying to confirm or disconfirm empirical scientific hypotheses. This is problematic for many reasons, one being the strong tendency to accept the null hypothesis simply because it cannot be rejected at the standard 5% significance level. In their standard form, significance tests are thus biased against new hypotheses, since they make it hard to disconfirm the null hypothesis.

And as has been shown over and over again when significance tests are applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” Standard scientific methodology tells us that when there is only say a 10 % probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more ‘reasonable’ to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same 10 % result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
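One way to make the point about many independent tests concrete is Fisher's method for combining P values. This is a technique brought in here purely for illustration, not something the argument above depends on. It assumes the tests are independent and that each P value is valid under the same null hypothesis. The function name is hypothetical, and the sketch relies on the closed form of the chi-square survival function for even degrees of freedom so that only the standard library is needed.

```python
import math


def fisher_combined_p_value(p_values):
    """Combine independent P values with Fisher's method.

    Under the null hypothesis, -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom, where k is the number of tests.
    For even degrees of freedom the survival function has the closed form
    exp(-h) * sum_{j=0}^{k-1} h**j / j!, with h equal to half the statistic.
    """
    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)
    tail_sum = sum(half_statistic ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_statistic) * tail_sum


# Five independent studies, each reporting a P value of 0.10.
combined = fisher_combined_p_value([0.10] * 5)
print("combined P value:", combined)
```

Five independent results of 0.10 combine to a P value of about 0.01, so a run of individually 'non-significant' results can be much less compatible with the null hypothesis than any one of them suggests. The combined figure is still a P value, with all the caveats that apply to one, and it says nothing about whether the underlying model is right.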

Most importantly — we should never forget that the underlying parameters we use when performing significance tests are model constructions. Our P values mean next to nothing if the model is wrong.
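A small simulation shows how badly P values can mislead when the model behind them is wrong. The setup is an assumption chosen for illustration: the data come from an autocorrelated AR(1) process with a true mean of zero, while the test wrongly treats the observations as independent draws with standard deviation 1. All names and parameter values are hypothetical, and the 0.05 cut-off is used only to count how often the test raises a false alarm.

```python
import random
from statistics import NormalDist, mean


def ar1_sample(rng, n, phi):
    """Draw n values from a stationary AR(1) process with mean 0 and unit variance."""
    innovation_sd = (1 - phi ** 2) ** 0.5
    x = rng.gauss(0.0, 1.0)
    values = []
    for _ in range(n):
        values.append(x)
        x = phi * x + rng.gauss(0.0, innovation_sd)
    return values


def naive_p_value(sample):
    """Two-sided z-test of mean = 0 that assumes independent draws with SD 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_alarm_rate(phi, n=50, studies=2000, alpha=0.05, seed=7):
    """Share of simulated studies with P < alpha although the null is true."""
    rng = random.Random(seed)
    hits = sum(naive_p_value(ar1_sample(rng, n, phi)) < alpha for _ in range(studies))
    return hits / studies


print("independent data (phi = 0.0):   ", false_alarm_rate(0.0))
print("autocorrelated data (phi = 0.7):", false_alarm_rate(0.7))
```

With independent data the false alarm rate should sit close to the nominal 5 %. With autocorrelation of 0.7 it should climb to several times that, even though the null hypothesis is true in both cases. The arithmetic of the test is identical in the two runs; only the fit between model and data differs.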

1. It’s always good to see these points taken up. Nonetheless, it was hard for me to follow the passage
“…when there is only say a 10 % probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more ‘reasonable’ to conclude that we have a case of disconfirmation…”
– The only meaning I can see in the phrase “probability that pure sampling error could account for the observed difference between the data and the null hypothesis” is that it is a posterior probability that the null hypothesis is correct and there was no bias (systematic error) present. True, the phrase is often confused with the usual P-value, but it isn’t, grammatically or logically, because a P-value is a probability that a statistic would be as or more extreme than observed given the null hypothesis and no bias, not the probability sampling error alone could account for something (which is a causal null hypothesis).
This confusion is addressed as item #2 on p. 4 of
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf
Our most recent discussions of confusions about P-values and how to address them include
Greenland S, Mansournia M, Joffe M (2022). To curb research misreporting, replace significance and confidence by compatibility. Preventive Medicine, 164, https://www.sciencedirect.com/science/article/pii/S0091743522001761.
Amrhein V, Greenland S (2022). Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. Journal of Information Technology, 37, 316-320. https://doi.org/10.1177%2F02683962221105904

• It is indeed a great pleasure to have a blog with such eminently competent readers! Thanks for sharing your views and links, Sander, and for pointing out a somewhat ill-phrased passage in my argumentation 🙂 The point I wanted to make in that passage was only that in many situations and contexts one could argue that it is acceptable to conclude that sample data generalize to the intended population even when the 5% P-value hurdle (a totally arbitrary level, as Fisher never tired of pointing out) has not been passed.
