P values – noisy measures of evidence
12 February, 2013 at 10:47 | Posted in Statistics & Econometrics
In theory, the P value is a continuous measure of evidence, but in practice it is typically trichotomized approximately into strong evidence, weak evidence, and no evidence (these can also be labeled highly significant, marginally significant, and not statistically significant at conventional levels), with cutoffs roughly at P = 0.01 and 0.10.
One big practical problem with P values is that they cannot easily be compared. The difference between a highly significant P value and a clearly nonsignificant P value is itself not necessarily statistically significant. (Here, I am using “significant” to refer to the 5% level that is standard in statistical practice in much of biostatistics, epidemiology, social science, and many other areas of application.) Consider a simple example of two independent experiments with estimates (standard error) of 25 (10) and 10 (10). The first experiment is highly statistically significant (two and a half standard errors away from zero, corresponding to a normal-theory P value of about 0.01) while the second is not significant at all. Most disturbingly here, the difference is 15 (14), which is not close to significant. The naive (and common) approach of summarizing an experiment by a P value and then contrasting results based on significance levels, fails here, in implicitly giving the imprimatur of statistical significance on a comparison that could easily be explained by chance alone … [T]his is not simply the well-known problem of arbitrary thresholds, the idea that a sharp cutoff at a 5% level, for example, misleadingly separates the P = 0.051 cases from P = 0.049. This is a more serious problem: even an apparently huge difference between clearly significant and clearly nonsignificant is not itself statistically significant.
In short, the P value is itself a statistic and can be a noisy measure of evidence.
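The arithmetic behind the two-experiment example can be checked with a short sketch. It assumes what the passage assumes: normal-theory two-sided P values and independent experiments, so that the standard error of the difference is the square root of the sum of the squared standard errors. The function names `two_sided_p` and `difference` are illustrative choices, not part of the original text.

```python
import math


def two_sided_p(estimate, std_error):
    """Normal-theory two-sided P value for an estimate and its standard error."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def difference(est_a, se_a, est_b, se_b):
    """Difference of two independent estimates and its standard error."""
    # Independence assumed: variances add, so the SE is sqrt(se_a^2 + se_b^2).
    return est_a - est_b, math.hypot(se_a, se_b)


# Values taken from the example: estimates (standard errors) of 25 (10) and 10 (10).
p_first = two_sided_p(25, 10)
p_second = two_sided_p(10, 10)
diff, se_diff = difference(25, 10, 10, 10)
p_diff = two_sided_p(diff, se_diff)

print(f"first experiment:  P = {p_first:.3f}")
print(f"second experiment: P = {p_second:.3f}")
print(f"difference: {diff} ({se_diff:.1f}), P = {p_diff:.3f}")
```

Under these assumptions the first experiment gives P of about 0.01, the second about 0.32, and the difference of 15 (14) about 0.29: one result clears the 5% level and the other does not, yet the contrast between them falls well short of it.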