Did p values work? Read my lips – they didn’t!29 January, 2013 at 21:36 | Posted in Statistics & Econometrics, Theory of Science & Methodology | Leave a comment
Jager and Leek may well be correct in their larger point, that the medical literature is broadly correct. But I don’t think the statistical framework they are using is appropriate for the questions they are asking. My biggest problem is the identification of scientific hypotheses and statistical “hypotheses” of the “theta = 0″ variety.
Based on the word “empirical” title, I thought the authors were going to look at a large number of papers with p-values and then follow up and see if the claims were replicated. But no, they don’t follow up on the studies at all! What they seem to be doing is collecting a set of published p-values and then fitting a mixture model to this distribution, a mixture of a uniform distribution (for null effects) and a beta distribution (for non-null effects). Since only statistically significant p-values are typically reported, they fit their model restricted to p-values less than 0.05. But this all assumes that the p-values have this stated distribution. You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on. Also, as noted above, the problem isn’t really effects that are exactly zero, the problem is that a lot of effects are lots in the noise and are essentially undetectable given the way they are studied.
Jager and Leek write that their model is commonly used to study hypotheses in genetics and imaging. I could see how this model could make sense in those fields … but I don’t see this model applying to published medical research, for two reasons. First … I don’t think there would be a sharp division between null and non-null effects; and, second, there’s just too much selection going on for me to believe that the conditional distributions of the p-values would be anything like the theoretical distributions suggested by Neyman-Pearson theory.
So, no, I don’t at all believe Jager and Leek when they write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time.” They’re doing this by basically assuming the model that is being questioned, the textbook model in which effects are pure and in which there is no p-hacking.
Indeed. If anything, this underlines how important it is not to equate science with statistical calculation. All science entail human judgement, and using statistical models doesn’t relieve us of that necessity. Working with misspecified models, the scientific value of significance testing is actually zero – even though you’re making valid statistical inferences! Statistical models and concomitant significance tests are no substitutes for doing real science. Or as a noted German philosopher once famously wrote:
There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits.