Read my lips — statistical significance is NOT a substitute for doing real science!
24 July, 2014 at 14:12 | Posted in Theory of Science & Methodology
Frequentist hypothesis testing has come under sustained and vigorous attack in recent years … But there are a couple of good things about Frequentist hypothesis testing that I haven’t seen many people discuss. Both of these have to do not with the formal method itself, but with social conventions associated with the practice …
Why do I like these social conventions? Two reasons. First, I think they cut down a lot on scientific noise. “Statistical significance” is sort of a first-pass filter that tells you which results are interesting and which ones aren’t. Without that automated filter, the entire job of distinguishing interesting results from uninteresting ones falls to the reviewers of a paper, who have to read through the paper much more carefully than if they can just scan for those little asterisks of “significance”.
A non-trivial part of teaching statistics is made up of teaching students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how careful you try to be in explicating what the probabilities generated by these statistical tests – p-values – really are, most students still misinterpret them. And a lot of researchers obviously also fall prey to the same mistakes:
Are women three times more likely to wear red or pink when they are most fertile? No, probably not. But here’s how hardworking researchers, prestigious scientific journals, and gullible journalists have been fooled into believing so.
The paper I’ll be talking about appeared online this month in Psychological Science, the flagship journal of the Association for Psychological Science, which represents the serious, research-focused (as opposed to therapeutic) end of the psychology profession.
“Women Are More Likely to Wear Red or Pink at Peak Fertility,” by Alec Beall and Jessica Tracy, is based on two samples: a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia. Here’s the claim: “Building on evidence that men are sexually attracted to women wearing or surrounded by red, we tested whether women show a behavioral tendency toward wearing reddish clothing when at peak fertility. … Women at high conception risk were more than three times more likely to wear a red or pink shirt than were women at low conception risk. … Our results thus suggest that red and pink adornment in women is reliably associated with fertility and that female ovulation, long assumed to be hidden, is associated with a salient visual cue.”
Pretty exciting, huh? It’s (literally) sexy as well as being statistically significant. And the difference is by a factor of three—that seems like a big deal.
Really, though, this paper provides essentially no evidence about the researchers’ hypotheses …
The way these studies fool people is that they are reduced to sound bites: Fertile women are three times more likely to wear red! But when you look more closely, you see that there were many, many possible comparisons in the study that could have been reported, with each of these having a plausible-sounding scientific explanation had it appeared as statistically significant in the data.
The standard in research practice is to report a result as “statistically significant” if its p-value is less than 0.05; that is, if there is less than a 1-in-20 chance that the observed pattern in the data would have occurred if there were really nothing going on in the population. But of course if you are running 20 or more comparisons (perhaps implicitly, via choices involved in including or excluding data, setting thresholds, and so on), it is not a surprise at all if some of them happen to reach this threshold.
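To put a rough number on this: assuming, purely for illustration, that the comparisons are independent and that there is really nothing going on in any of them, the chance that at least one comes out "significant" at the 5% level is 1 − 0.95^k for k comparisons. The function name below is made up for this sketch.

```python
def prob_at_least_one_false_positive(n_comparisons: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent comparisons has p < alpha
    when every null hypothesis is true (illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


if __name__ == "__main__":
    for k in (1, 5, 20, 50):
        chance = prob_at_least_one_false_positive(k)
        print(f"{k:>2} comparisons: {chance:.3f}")
```

With 20 comparisons the chance is a little under two in three. Real comparisons are rarely independent, so this is an order-of-magnitude guide rather than an exact figure.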
The headline result, that women were three times as likely to be wearing red or pink during peak fertility, occurred in two different samples, which looks impressive. But it’s not really impressive at all! Rather, it’s exactly the sort of thing you should expect to see if you have a small data set and virtually unlimited freedom to play around with the data, and with the additional selection effect that you submit your results to the journal only if you see some catchy pattern. …
Statistics textbooks do warn against multiple comparisons, but there is a tendency for researchers to consider any given comparison alone without considering it as one of an ensemble of potentially relevant responses to a research question. And then it is natural for sympathetic journal editors to publish a striking result without getting hung up on what might be viewed as nitpicking technicalities. Each person in this research chain is making a decision that seems scientifically reasonable, but the result is a sort of machine for producing and publicizing random patterns.
There’s a larger statistical point to be made here, which is that as long as studies are conducted as fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant, we will see lots of dramatic claims based on data patterns that don’t represent anything real in the general population. Again, this fishing can be done implicitly, without the researchers even realizing that they are making a series of choices enabling them to over-interpret patterns in their data.
Indeed. If anything, this underlines how important it is not to equate science with statistical calculation. All science entails human judgement, and using statistical models doesn’t relieve us of that necessity. When we work with misspecified models, the scientific value of significance testing is actually zero – even though we are making valid statistical inferences! Statistical models and concomitant significance tests are no substitutes for doing real science. Or as a noted German philosopher once famously wrote:
There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits.
Statistical significance doesn’t say that something is important or true. Since there already are far better and more relevant tests that can be done (see e.g. here and here), it is high time to consider what should be the proper function of what has now really become a statistical fetish. Given that it is in any case very unlikely that any population parameter is exactly zero, and that, contrary to assumption, most samples in social science and economics are neither random nor of the right distributional shape, why continue to press students and researchers to do null hypothesis significance testing, testing that relies on a weird backward logic that students and researchers usually don’t understand?
Suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise the mean test results by 100 points (null hypothesis). Instead, when sampling, it turns out it only raises them by 75 points, with a standard error (telling us how much the mean varies from one sample to another) of 20.
Does this imply that the data do not disconfirm the hypothesis? Given the usual normality assumptions on sampling distributions, the one-tailed p-value is approximately 0.11. Thus, approximately 11% of the time we would expect a score this low or lower if we were sampling from this voucher system population. That means – using the ordinary 5% significance level – we would not reject the null hypothesis, although the test has shown that it is “likely” that the hypothesis is false.
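The 0.11 figure can be checked in a few lines. This is a minimal sketch that assumes, as in the text, a normal sampling distribution; the variable names are mine and not part of any original analysis.

```python
from statistics import NormalDist

hypothesized_gain = 100.0  # null hypothesis: vouchers raise the mean by 100 points
observed_gain = 75.0       # gain actually observed in the sample
standard_error = 20.0      # standard error of the sample mean

# Distance between observation and hypothesis, in standard-error units.
z = (observed_gain - hypothesized_gain) / standard_error

# One-tailed p-value: probability of a result this low or lower under the null.
p_one_tailed = NormalDist().cdf(z)

print(f"z = {z:.2f}, one-tailed p = {p_one_tailed:.3f}")
```

The observed gain sits 1.25 standard errors below the hypothesized one, which is where the roughly 11% comes from, and why the null survives at the 5% level.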
In its standard form, a significance test is not the kind of “severe test” that we are looking for in our search for being able to confirm or disconfirm empirical scientific hypotheses. This is problematic for many reasons, one being that there is a strong tendency to accept the null hypothesis simply because it can’t be rejected at the standard 5% significance level. In their standard form, significance tests are biased against new hypotheses by making it hard to disconfirm the null hypothesis.
And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” But looking at our example, standard scientific methodology tells us that since there is only an 11% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
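One way to make the point about repeated tests concrete is to pool the z-scores. The sketch below uses Stouffer's method, which is my choice of illustration and not something the argument above depends on. It assumes the tests are independent and that each yields the same z of −1.25 as in the voucher example; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_one_tailed_p(z_scores: list[float]) -> float:
    """Combine independent z-scores with Stouffer's method and return the
    lower-tail p-value of the combined statistic."""
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return NormalDist().cdf(combined_z)


if __name__ == "__main__":
    # Hypothetical replications, each as far below the null as our example.
    for n_tests in (1, 2, 4):
        p = stouffer_one_tailed_p([-1.25] * n_tests)
        print(f"{n_tests} test(s): combined one-tailed p = {p:.4f}")
```

Under these assumptions, four replications that each fail to reject the null at the 5% level reject it comfortably when taken together.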
And, most importantly, of course we should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-value of 0.11 means next to nothing if the model is wrong. As David Freedman writes in Statistical Models and Causal Inference:
I believe model validation to be a central issue. Of course, many of my colleagues will be found to disagree. For them, fitting models to data, computing standard errors, and performing significance tests is “informative,” even though the basic statistical assumptions (linearity, independence of errors, etc.) cannot be validated. This position seems indefensible, nor are the consequences trivial. Perhaps it is time to reconsider.