Spurious statistical significance – when science gets it all wrong

28 July, 2013 at 11:20 | Posted in Statistics & Econometrics | 2 Comments

A non-trivial part of teaching statistics is made up of teaching students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how careful you try to be in explicating what the probabilities generated by these statistical tests – p-values – really are, most students still misinterpret them. And a lot of researchers obviously also fall prey to the same mistakes:

Are women three times more likely to wear red or pink when they are most fertile? No, probably not. But here’s how hardworking researchers, prestigious scientific journals, and gullible journalists have been fooled into believing so.

The paper I’ll be talking about appeared online this month in Psychological Science, the flagship journal of the Association for Psychological Science, which represents the serious, research-focused (as opposed to therapeutic) end of the psychology profession.

“Women Are More Likely to Wear Red or Pink at Peak Fertility,” by Alec Beall and Jessica Tracy, is based on two samples: a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia. Here’s the claim: “Building on evidence that men are sexually attracted to women wearing or surrounded by red, we tested whether women show a behavioral tendency toward wearing reddish clothing when at peak fertility. … Women at high conception risk were more than three times more likely to wear a red or pink shirt than were women at low conception risk. … Our results thus suggest that red and pink adornment in women is reliably associated with fertility and that female ovulation, long assumed to be hidden, is associated with a salient visual cue.”

Pretty exciting, huh? It’s (literally) sexy as well as being statistically significant. And the difference is by a factor of three—that seems like a big deal.

Really, though, this paper provides essentially no evidence about the researchers’ hypotheses …

The way these studies fool people is that they are reduced to sound bites: Fertile women are three times more likely to wear red! But when you look more closely, you see that there were many, many possible comparisons in the study that could have been reported, with each of these having a plausible-sounding scientific explanation had it appeared as statistically significant in the data.

The standard in research practice is to report a result as “statistically significant” if its p-value is less than 0.05; that is, if there is less than a 1-in-20 chance that the observed pattern in the data would have occurred if there were really nothing going on in the population. But of course if you are running 20 or more comparisons (perhaps implicitly, via choices involved in including or excluding data, setting thresholds, and so on), it is not a surprise at all if some of them happen to reach this threshold.

The headline result, that women were three times as likely to be wearing red or pink during peak fertility, occurred in two different samples, which looks impressive. But it’s not really impressive at all! Rather, it’s exactly the sort of thing you should expect to see if you have a small data set and virtually unlimited freedom to play around with the data, and with the additional selection effect that you submit your results to the journal only if you see some catchy pattern. …

Statistics textbooks do warn against multiple comparisons, but there is a tendency for researchers to consider any given comparison alone without considering it as one of an ensemble of potentially relevant responses to a research question. And then it is natural for sympathetic journal editors to publish a striking result without getting hung up on what might be viewed as nitpicking technicalities. Each person in this research chain is making a decision that seems scientifically reasonable, but the result is a sort of machine for producing and publicizing random patterns.

There’s a larger statistical point to be made here, which is that as long as studies are conducted as fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant, we will see lots of dramatic claims based on data patterns that don’t represent anything real in the general population. Again, this fishing can be done implicitly, without the researchers even realizing that they are making a series of choices enabling them to over-interpret patterns in their data.

Andrew Gelman
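
Gelman’s arithmetic is easy to check numerically. Below is a minimal simulation sketch (my own illustration, in Python; nothing here comes from the paper or from Gelman’s article): twenty comparisons run on pure noise, each tested at the 5 per cent level, yield at least one “statistically significant” result in roughly two out of three cases, just as 1 - 0.95^20 ≈ 0.64 suggests.

```python
# A minimal sketch of the multiple-comparisons problem described above.
# Assumptions (illustrative only): 20 independent comparisons per study,
# each a two-sample t-test at the 0.05 level, run on data where every
# null hypothesis is true, with small groups of 24 observations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

n_studies = 10_000      # simulated "research projects"
n_comparisons = 20      # comparisons available to each project
n_per_group = 24        # small samples, as in the undergraduate sample

studies_with_hit = 0
for _ in range(n_studies):
    p_values = []
    for _ in range(n_comparisons):
        # Two groups drawn from the same distribution: no real effect.
        a = rng.normal(size=n_per_group)
        b = rng.normal(size=n_per_group)
        p_values.append(stats.ttest_ind(a, b).pvalue)
    if min(p_values) < 0.05:
        studies_with_hit += 1

print(f"Share of pure-noise studies with at least one p < 0.05: "
      f"{studies_with_hit / n_studies:.2f}")   # roughly 1 - 0.95**20 ≈ 0.64
```

The exact number matters less than the pattern: once the number of explicit or implicit comparisons grows, “statistical significance” somewhere in the data is close to guaranteed.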

Indeed. If anything, this underlines how important it is not to equate science with statistical calculation. All science entails human judgement, and using statistical models doesn’t relieve us of that necessity. When we work with misspecified models, the scientific value of significance testing is actually zero – even though we are making valid statistical inferences! Statistical models and concomitant significance tests are no substitute for doing real science. Or as a noted German philosopher once famously wrote:

There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits.

Statistical significance doesn’t say that something is important or true. Since there are already far better and more relevant tests that can be done (see e.g. here and here), it is high time to consider what the proper function should be of what has now really become a statistical fetish. Given that it is in any case very unlikely that any population parameter is exactly zero, and that, contrary to assumption, most samples in social science and economics are neither random nor of the right distributional shape, why continue to press students and researchers to do null hypothesis significance testing – testing that relies on a weird backward logic that students and researchers usually don’t understand?
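
By way of illustration, here is a hedged sketch, with made-up numbers (not the Beall and Tracy data), of the kind of estimation-based reporting discussed in the comments below: report the effect size – here a simple difference in proportions – together with a confidence interval, rather than a bare p-value.

```python
# Illustrative only: hypothetical counts of women wearing red/pink in a
# high- and a low-conception-risk group. Report the risk difference with
# a 95% Wald confidence interval instead of a null hypothesis test.
import numpy as np
from scipy import stats

red_high, n_high = 19, 50   # hypothetical high-fertility group
red_low,  n_low  = 6, 50    # hypothetical low-fertility group

p1, p2 = red_high / n_high, red_low / n_low
diff = p1 - p2                                   # effect size (risk difference)

se = np.sqrt(p1 * (1 - p1) / n_high + p2 * (1 - p2) / n_low)
z = stats.norm.ppf(0.975)                        # ≈ 1.96 for a 95% interval
ci_low, ci_high = diff - z * se, diff + z * se

print(f"risk difference: {diff:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

Reported this way, a reader sees both how large the estimated difference is and how uncertain it is – information a bare p-value never conveys.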

Added 31/7: Beall and Tracy have a comment on the critique here.

2 Comments

  1. Lars, very interesting as usual. But I don’t fully understand: are you endorsing Geoff Cumming’s view that we can use confidence intervals rather than p values on social science data?

    • Thanx Philip 🙂
      My critique is mainly directed at the mindless habit of drawing far-reaching conclusions based on p-values and null hypothesis significance testing. Especially since there are alternatives that are easier to understand and more informative – such as (especially endorsed by Cumming) confidence intervals and effect sizes. As a general rule I agree with Cumming that – IF you want to pursue this kind of analysis – it is better to make an estimation and report confidence intervals and effect sizes than to load the dice and perform rather pointless null hypothesis testing, reporting p-values that few seem to use and interpret properly.

