## Hypothesis and significance tests — the art of asking the wrong questions

3 May, 2015 at 09:39 | Posted in Statistics & Econometrics | 3 Comments

Most scientists use two closely related statistical approaches to make inferences from their data: significance testing and hypothesis testing. Significance testers and hypothesis testers seek to determine if apparently interesting patterns (“effects”) in their data are real or illusory. They are concerned with whether the effects they observe could just have emanated from randomness in the data.

The first step in this process is to nominate a “null hypothesis” which posits that there is no effect. Mathematical procedures are then used to estimate the probability that an effect at least as big as that which was observed would have arisen if the null hypothesis were true. That probability is called “p”.
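One way to make this definition concrete is a permutation test. The sketch below is only an illustration, not the procedure any particular study uses: the measurements are invented, and the function name `permutation_p_value` and its parameters are hypothetical. It shuffles the group labels many times (which is what "no effect" would imply) and counts how often the shuffled difference in means is at least as big as the observed one.

```python
import random

def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Estimate a one-sided p value for the difference in group means."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_big = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        shuffled_treated = pooled[:n_treated]
        shuffled_control = pooled[n_treated:]
        diff = (sum(shuffled_treated) / len(shuffled_treated)
                - sum(shuffled_control) / len(shuffled_control))
        if diff >= observed:
            at_least_as_big += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_big + 1) / (n_permutations + 1)

# Hypothetical measurements, invented purely for illustration.
treated = [5.1, 6.3, 5.8, 7.0, 6.1, 6.6]
control = [4.9, 5.2, 5.0, 5.7, 4.6, 5.4]

p = permutation_p_value(treated, control)
print(f"estimated p = {p:.4f}")
```

The resulting number is the share of label shuffles that produce an effect at least as large as the observed one, which is all that p claims to measure.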

If p is small (conventionally less than 0.05, or 5%) then the significance tester will claim that it is unlikely an effect of the observed magnitude would have arisen by chance alone. Such effects are said to be “statistically significant”. Sir Ronald Fisher who, in the 1920s, developed contemporary methods for generating p values, interpreted small p values as being indicative of “real” (not chance) effects. This is the central idea in significance testing.

Significance testing has been under attack since it was first developed … Jerzy Neyman and Egon Pearson argued that Fisher’s interpretation of p was dodgy. They developed an approach called hypothesis testing in which the p value serves only to help the researcher make an optimised choice between the null hypothesis and an alternative hypothesis: If p is greater than or equal to some threshold (such as 0.05) the researcher chooses to believe the null hypothesis. If p is less than the threshold the researcher chooses to believe the alternative hypothesis. In the long run (over many experiments) adoption of the hypothesis testing approach limits the rate of making incorrect choices: the rate of falsely rejecting a true null hypothesis is held at the chosen threshold, and the test is designed to miss real effects as rarely as possible.
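The long-run character of this guarantee can be seen in a small simulation. This is a minimal sketch under stated assumptions: normally distributed data with a known standard deviation of 1, a two-sided z test, 20 observations per experiment, and a hypothetical true effect of 0.5 for the second run. The function names are made up for the example.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample, sigma=1.0):
    """Two-sided p value for H0: population mean is 0, with known sigma."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_mean, n=20, alpha=0.05, experiments=5_000, seed=1):
    """Fraction of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / experiments

# Null hypothesis true: every rejection is an incorrect choice.
false_positive_rate = rejection_rate(true_mean=0.0)
# Null hypothesis false (assumed effect of 0.5): every rejection is correct.
power = rejection_rate(true_mean=0.5)

print(f"rejections when the null is true:  {false_positive_rate:.3f}")
print(f"rejections when the null is false: {power:.3f}")
```

Across thousands of simulated experiments the false-rejection rate settles near the 0.05 threshold, yet nothing in the output says whether any single experiment reached the right conclusion.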

Critics have pointed out that there is limited value in knowing only that errors have been minimised in the long run – scientists don’t just want to know they have been wrong as infrequently as possible, they want to know if they can believe their last experiment!

Today’s scientists typically use a messy concoction of significance testing and hypothesis testing. Neither Fisher nor Neyman would be satisfied with much of current statistical practice.

Scientists have enthusiastically adopted significance testing and hypothesis testing because these methods appear to solve a fundamental problem: how to distinguish “real” effects from randomness or chance. Unfortunately significance testing and hypothesis testing are of limited scientific value – they often ask the wrong question and almost always give the wrong answer. And they are widely misinterpreted.

Consider a clinical trial designed to investigate the effectiveness of a new treatment for some disease. After the trial has been conducted the researchers might ask “is the observed effect of treatment real, or could it have arisen merely by chance?” If the calculated p value is less than 0.05 the researchers might claim the trial has demonstrated the treatment was effective. But even before the trial was conducted we could reasonably have expected the treatment was “effective” – almost all drugs have some biochemical action and all surgical interventions have some effects on health. Almost all health interventions have some effect; it’s just that some treatments have effects that are large enough to be useful and others have effects that are trivial and unimportant.

So what’s the point in showing empirically that the null hypothesis is not true? Researchers who conduct clinical trials need to determine if the effect of treatment is big enough to make the intervention worthwhile, not whether the treatment has any effect at all.
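The gap between "has some effect" and "has a worthwhile effect" shows up clearly in a simulated large trial. The sketch below rests on invented numbers: an outcome with standard deviation 15, a true treatment effect of 0.6 units, 50,000 patients per arm, and a hypothetical `smallest_worthwhile_effect` of 5 units. It uses a normal approximation for the difference in means.

```python
import random
from statistics import NormalDist, mean, stdev

def compare_groups(treated, control):
    """Return the difference in means, an approximate 95% interval, and a two-sided p."""
    diff = mean(treated) - mean(control)
    se = (stdev(treated) ** 2 / len(treated)
          + stdev(control) ** 2 / len(control)) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, (diff - 1.96 * se, diff + 1.96 * se), p

rng = random.Random(42)
n = 50_000                        # assumed patients per arm
smallest_worthwhile_effect = 5.0  # hypothetical clinical threshold

# Simulated outcomes: the treatment shifts the mean by a trivial 0.6 units.
control = [rng.gauss(100.0, 15.0) for _ in range(n)]
treated = [rng.gauss(100.6, 15.0) for _ in range(n)]

diff, ci, p = compare_groups(treated, control)
print(f"p = {p:.2g}")
print(f"estimated effect = {diff:.2f}, 95% interval = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"worthwhile? {ci[0] >= smallest_worthwhile_effect}")
```

With enough patients the p value becomes very small even though the whole interval sits far below the threshold for a useful treatment, so the estimate and its interval answer the question the researchers actually care about.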

A more technical issue is that p tells us the probability of observing data at least as extreme as ours given that the null hypothesis is true. But most scientists think p tells them the probability the null hypothesis is true given their data. The difference might sound subtle but it’s not. It is like the difference between the probability that a prime minister is male and the probability a male is prime minister! …
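Bayes' rule shows how far apart the two probabilities can be. The numbers below are assumptions chosen only for illustration: 90% of tested null hypotheses are true, the threshold is 0.05, and a real effect is detected 80% of the time. The calculation conditions on "the result was significant" rather than on the exact data, which is a simplification.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | result is significant), by Bayes' rule."""
    p_significant_and_null = prior_null * alpha
    p_significant_and_effect = (1 - prior_null) * power
    return p_significant_and_null / (p_significant_and_null + p_significant_and_effect)

# All three inputs are illustrative assumptions, not estimates from any study.
alpha = 0.05
posterior = prob_null_given_significant(prior_null=0.9, alpha=alpha, power=0.8)

print(f"P(significant | null true) = {alpha:.2f}")
print(f"P(null true | significant) = {posterior:.2f}")  # prints 0.36
```

Under these assumptions a significant result still leaves about a one-in-three chance that the null hypothesis is true, even though the chance of a significant result under the null is only one in twenty.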

Significance testing and hypothesis testing are so widely misinterpreted that they impede progress in many areas of science. What can be done to hasten their demise? Senior scientists should ensure that a critical exploration of the methods of statistical inference is part of the training of all research students. Consumers of research should not be satisfied with statements that “X is effective”, or “Y has an effect”, especially when support for such claims is based on the evil p.

Rob Herbert

1. I did a wash
a sock was awash
I wondered why
an hypothesis I must try
to determine what’s true or what’s quash!!! (he, he)

2. Dumbbell distributions are very common, and yet theoreticians rarely confront the implications by framing the problem accordingly. A dumbbell (named for the weight-training apparatus) is the distribution that results from drawing observations from a population that is being generated by two (or more) quite different data-generating processes.

Let’s say you are interested in rainfall. You have data on rainfall amounts on every day for a given place and time period. The population daily average rainfall is not very informative; rainfall amounts on the days it rains are much higher than on the days it does not rain. The overall population average is not a very good predictor of rainfall amount, nor is the overall rate of rainy days during the whole time period likely to be a good predictor of whether it will rain on any particular day: it is much more likely to rain during rainy seasons, etc. (In other words, it rains when there’s a process operating that generates rain.)

Actually learning about weather and rain will be a matter of distinguishing rainy days from dry days, that is, distinguishing and identifying when the processes that generate rain with a high probability are in operation. And so meteorologists may be interested, for example, in identifying oscillations that give rise to monsoons or the so-called El Niño in the Pacific.

Medical treatments involve a similar confounding. The problem isn’t whether there is an effect, but locating and identifying patients who have a problem the effect meliorates at a practically acceptable level of risk.

Formally, the medical researcher may be tasked with determining whether Sildenafil, known to affect biochemical processes related to the regulation of blood pressure, relieves the measured symptoms of angina. The stylized rituals of measuring and sorting do much to disguise and obstruct the real work, which is figuring out that Viagra does solve a commercially important problem. If Sildenafil did not have a salient effect in a very large portion of the whole population, would researchers even notice? And what of the scientific, as opposed to commercial, purpose: do we advance understanding of the biochemical processes regulating blood pressure?

The classic framing of the problem for theoreticians is given by the example of the athlete experiencing a “hot streak”. The common-sense impression of fans, coaches and athletes themselves is that athletes experience periods when they are “on” and performing at a peak, and other periods when they are “off”, usually for unknown reasons, and performing at a subpar level. The difference in performance, as reflected in scoring statistics such as golf scores or time to complete a race, may be numerically small, especially in elite athletes, but still practically important, as it may be the difference between winning a race and losing, or between completing a high-risk performance and having an accident resulting in injury.

The conventional assessment of theoreticians of the problem of the “hot streak” is complacent: applying conventional tools, they conclude that “hot streaks” are illusory. They cannot reject the hypothesis that a single, controlled process generates the variation (in the data around a single central tendency) as a residual. The tools of statistical analysis cannot cross the threshold, to take the next step, of distinguishing the distribution of data generated during a “hot streak” from the distribution generated during an “off” period. Unfortunately, it is precisely the power to find the difference, so that it can be investigated, that science needs.
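The rainfall example from earlier in this comment can be sketched as a two-process simulation. Everything numeric here is an assumption made for illustration: ten years of days, a 25% chance that the rain-generating process is active, and exponentially distributed rainfall averaging 12 mm on days when it is.

```python
import random
from statistics import mean

rng = random.Random(7)
days = 3650           # assumed: about ten years of daily records
p_rain = 0.25         # assumed share of days on which the rain process operates
rain_mean_mm = 12.0   # assumed average rainfall on those days

# Process 1 (dry day) yields exactly 0 mm; process 2 (rainy day) yields a positive amount.
is_rainy = [rng.random() < p_rain for _ in range(days)]
rainfall = [rng.expovariate(1 / rain_mean_mm) if wet else 0.0 for wet in is_rainy]

overall_mean = mean(rainfall)
rainy_mean = mean(r for r, wet in zip(rainfall, is_rainy) if wet)
dry_mean = mean(r for r, wet in zip(rainfall, is_rainy) if not wet)

print(f"overall daily mean: {overall_mean:.1f} mm")
print(f"mean on rainy days: {rainy_mean:.1f} mm")
print(f"mean on dry days:   {dry_mean:.1f} mm")
```

The overall average describes neither kind of day well. The simulation knows which process generated each observation; with real data that label is missing, and recovering it is the work described above.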