Misunderstanding the p-value – here we go again

19 Mar, 2013 at 20:59 | Posted in Statistics & Econometrics | 6 Comments

A non-trivial part of teaching statistics consists of teaching students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how careful you try to be in explicating what the probabilities generated by these statistical tests – p-values – really are, most students still misinterpret them.

Giving a statistics course for the Swedish National Research School in History, I asked the students at the exam to explain how one should correctly interpret p-values. Although the correct definition is p(data|null hypothesis), a majority of the students misinterpreted the p-value either as the likelihood of a sampling error (which of course is wrong, since the very computation of the p-value is based on the assumption that sampling errors are what cause the sample statistics not to coincide with the null hypothesis) or as the probability of the null hypothesis being true, given the data (which of course is also wrong, since that is p(null hypothesis|data) rather than the correct p(data|null hypothesis)).
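To see how far apart p(data|null hypothesis) and p(null hypothesis|data) can be, here is a minimal sketch in Python. It is not part of the exam or the course material. The function name and all three input numbers (the share of tested hypotheses for which the null is true, the significance level, and the power) are hypothetical values chosen only for illustration, and putting a probability on the null hypothesis at all is a Bayesian assumption that the significance test itself does not make.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Return P(H0 | significant result) by Bayes' rule.

    prior_null: assumed share of tested hypotheses where H0 is true (hypothetical)
    alpha:      P(significant | H0 true), the significance level
    power:      P(significant | H0 false), assumed for illustration
    """
    false_positives = prior_null * alpha
    true_positives = (1.0 - prior_null) * power
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    posterior = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.5)
    # P(significant | H0) is 0.05 by construction; P(H0 | significant) is not.
    print(f"P(significant | H0) = 0.05, P(H0 | significant) = {posterior:.2f}")
```

Under these made-up inputs, almost half of the significant results come from true null hypotheses, even though the probability of a significant result given the null is only 5 percent. Other inputs give other numbers; the point is only that the two conditional probabilities need not be close.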

This is not to be blamed on students’ ignorance, but rather on significance testing not being particularly transparent (conditional probability inference is difficult even for those of us who teach and practice it). A lot of researchers fall prey to the same mistakes. So – given that it is in any case very unlikely that any population parameter is exactly zero, and that, contrary to assumption, most samples in social science and economics are neither random nor of the right distributional shape – why continue to press students and researchers to do null hypothesis significance testing, testing that relies on a weird backward logic that students and researchers usually don’t understand?

That the media often misunderstand what p-values and significance testing are all about is well known. Andrew Gelman gives a recent telling example:

The New York Times has a feature in its Tuesday science section, Take a Number … Today’s column, by Nicholas Bakalar, is in error. The column begins:

“When medical researchers report their findings, they need to know whether their result is a real effect of what they are testing, or just a random occurrence. To figure this out, they most commonly use the p-value.”

This is wrong on two counts. First, whatever researchers might feel, this is something they’ll never know. Second, results are a combination of real effects and chance, it’s not either/or.

Perhaps the above is a forgivable simplification, but I don’t think so; I think it’s a simplification that destroys the reason for writing the article in the first place. But in any case I think there’s no excuse for this, later on:

“By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.”

This is the old, old error of confusing p(A|B) with p(B|A). I’m too rushed right now to explain this one, but it’s in just about every introductory statistics textbook ever written. For more on the topic, I recommend my recent paper, P Values and Statistical Practice, which begins:

“The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings … The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations) …”

I can’t get too annoyed at science writer Bakalar for garbling the point—it confuses lots and lots of people—but, still, I hate to see this error in the newspaper.

On the plus side, if a newspaper column runs 20 times, I guess it’s ok for it to be wrong once—we still have 95% confidence in it, right?

Statistical significance doesn’t say that something is important or true. And since there already are far better and more relevant tests that can be done (see e.g. here and here), it is high time to give up on this statistical fetish.


  1. Although the correct definition is p(data|null hypothesis)…

    Are you sure this is the correct definition?

    • Yes I’m absolutely sure this is the correct – although for “pedagogical” reasons somewhat abbreviated – form of the longer and more “technical” definition, which is “the p-value is the probability of obtaining our observed results (data), or results that are more extreme (than our data), if the null hypothesis is true.” In short form we usually write this as – p(data|null hypothesis).

      • Hi Lars,

        Though as a mnemonic it might have its merits, defining the p-value as P(X|H_0) might not be a good pedagogical idea, for the following reason:

        Imagine a student picks this definition and goes:

        P(X|H_0) = \frac{P(H_0|X) \cdot P(X)}{P(H_0)}

        The student might wonder about P(H_0), when actually it makes no sense since H_0 is not a random variable, and end up more confused than anything else.

        Sure, you can warn them, but if they forget the warning they will mix everything up. So don’t you think it might be better to use a mnemonic not based on conditional probabilities?

        • IF my students were to make that mistake, I would worry. But – like most other students without Bayesian inclinations – they don’t.

    • I’m pretty sure this is not the correct definition. A simple counterexample is when the data come from a normal (or any other continuous) distribution. Then P(data|anything)=0.

      • The p-value is the probability of getting the observed data value or more extreme values if the null hypothesis is true. As I think I made amply clear in the comment section debate with Deborah Mayo, the conditional expression I use in the post is a standard shorthand representation of that definition in statistics.
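The exchange above turns on the difference between the probability of one exact data value and the probability of the observed value or more extreme values. A minimal sketch in Python of the longer definition, not taken from the post or the comments: it assumes a test statistic that is standard normal under the null hypothesis and a two-sided test, and the function name and the observed value 1.96 are hypothetical choices for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z is standard normal under the null hypothesis."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    upper_tail = 1.0 - std_normal.cdf(abs(z))
    return 2.0 * upper_tail  # both tails count as "more extreme"


if __name__ == "__main__":
    z_observed = 1.96  # hypothetical observed test statistic
    p = two_sided_p_value(z_observed)
    print(f"two-sided p-value for z = {z_observed}: {p:.3f}")
    # The density at the observed point is not a probability; for a
    # continuous distribution any single exact value has probability zero.
    print(f"density at z = {z_observed}: {NormalDist().pdf(z_observed):.3f}")
```

For a continuous distribution the probability of any single exact value is indeed zero, which is why the full definition refers to a tail area. The shorthand p(data|null hypothesis) stands for that tail area.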
