In loving memory of my brother – Peter “Uncas” Pålsson
Jager and Leek may well be correct in their larger point, that the medical literature is broadly correct. But I don’t think the statistical framework they are using is appropriate for the questions they are asking. My biggest problem is the identification of scientific hypotheses and statistical “hypotheses” of the “theta = 0” variety.
Based on the word “empirical” in the title, I thought the authors were going to look at a large number of papers with p-values and then follow up and see if the claims were replicated. But no, they don’t follow up on the studies at all! What they seem to be doing is collecting a set of published p-values and then fitting a mixture model to this distribution, a mixture of a uniform distribution (for null effects) and a beta distribution (for non-null effects). Since only statistically significant p-values are typically reported, they fit their model restricted to p-values less than 0.05. But this all assumes that the p-values have this stated distribution. You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on. Also, as noted above, the problem isn’t really effects that are exactly zero, the problem is that a lot of effects are lost in the noise and are essentially undetectable given the way they are studied.
Jager and Leek write that their model is commonly used to study hypotheses in genetics and imaging. I could see how this model could make sense in those fields … but I don’t see this model applying to published medical research, for two reasons. First … I don’t think there would be a sharp division between null and non-null effects; and, second, there’s just too much selection going on for me to believe that the conditional distributions of the p-values would be anything like the theoretical distributions suggested by Neyman-Pearson theory.
So, no, I don’t at all believe Jager and Leek when they write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time.” They’re doing this by basically assuming the model that is being questioned, the textbook model in which effects are pure and in which there is no p-hacking.
Indeed. If anything, this underlines how important it is not to equate science with statistical calculation. All science entails human judgement, and using statistical models doesn’t relieve us of that necessity. Working with misspecified models, the scientific value of significance testing is actually zero – even though you’re making valid statistical inferences! Statistical models and concomitant significance tests are no substitutes for doing real science. Or as a noted German philosopher once famously wrote:
There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits.
Arnold Zellner’s KISS rule – Keep It Sophisticatedly Simple – has its application even outside of econometrics. An example is the film music of Stefan Nilsson. Here in the breathtakingly beautiful “Fäboden” from Bille August’s and Ingmar Bergman’s masterpiece The Best Intentions.
In many social sciences p values and null hypothesis significance testing (NHST) are often used to draw far-reaching scientific conclusions – despite the fact that they are as a rule poorly understood and that there exist alternatives that are easier to understand and more informative.
Not least, confidence intervals (CIs) and effect sizes are to be preferred to the Neyman-Pearson-Fisher mishmash approach that is so often practised by applied researchers.
Running a Monte Carlo simulation with 100 replications of a fictitious sample having N = 20, confidence intervals of 95%, a normally distributed population with a mean = 10 and a standard deviation of 20, taking two-tailed p values on a zero null hypothesis, we get varying CIs (since they are based on varying sample standard deviations), but with a minimum of 3.2 and a maximum of 26.1 we still get a clear picture of what would happen in an infinite limit sequence. On the other hand, p values (even though from a purely mathematical statistical sense more or less equivalent to CIs) vary strongly from sample to sample, and jumping around between a minimum of 0.007 and a maximum of 0.999 they don’t give you a clue of what will happen in an infinite limit sequence! So, I can’t but agree with Geoff Cumming:
The problems are so severe we need to shift as much as possible from NHST … The first shift should be to estimation: report and interpret effect sizes and CIs … I suggest p should be given only a marginal role, its problem explained, and it should be interpreted primarily as an indicator of where the 95% CI falls in relation to a null hypothesised value.
[In case you want to do your own Monte Carlo simulation, here's an example using Gretl:
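# create an empty data set with N = 20 observations
nulldata 20
loop 100 --progressive
# draw a sample from a normal population with mean 10 and standard deviation 15
series y = normal(10,15)
scalar zs = (10-mean(y))/sd(y)
scalar df = $nobs-1
scalar ybar = mean(y)
scalar ysd = sd(y)
# standard error of the sample mean
scalar ybarsd = ysd/sqrt($nobs)
# t statistic for the hypothesis that the population mean is 10
scalar tstat = (ybar-10)/ybarsd
pvalue t df tstat
# bounds of the 95% confidence interval
scalar lowb = mean(y) - critical(t,df,0.025)*ybarsd
scalar uppb = mean(y) + critical(t,df,0.025)*ybarsd
# pvalue() returns the right-tail probability
scalar pval = pvalue(t,df,tstat)
store E:\pvalcoeff.gdt lowb uppb pval
endloop
]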
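For readers without Gretl, the experiment described in the text can also be sketched in Python using only the standard library. This is a rough illustration, not the script behind the figures quoted above. It assumes the settings given in the prose (N = 20, mean 10, standard deviation 20, two-tailed p values against a zero null). The function names `t_pdf`, `two_sided_p`, `t_critical` and `simulate` and the seed are made up for the example, and the t distribution is handled by simple numerical integration rather than a statistics library.

```python
import math
import random
import statistics

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-tailed p value: 1 minus the central area, via Simpson's rule on [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1 - central)

def t_critical(df, alpha=0.05):
    """Critical value t* with two_sided_p(t*, df) = alpha, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > alpha:
            lo = mid  # p still too large, so t* lies further out
        else:
            hi = mid
    return (lo + hi) / 2

def simulate(reps=100, n=20, mu=10.0, sigma=20.0, null=0.0, seed=1):
    """Return (lower bound, upper bound, p value) for each replication."""
    rng = random.Random(seed)  # the seed is arbitrary, chosen for reproducibility
    df = n - 1
    crit = t_critical(df)
    rows = []
    for _ in range(reps):
        y = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = statistics.mean(y)
        se = statistics.stdev(y) / math.sqrt(n)  # standard error of the mean
        p = two_sided_p((ybar - null) / se, df)
        rows.append((ybar - crit * se, ybar + crit * se, p))
    return rows

rows = simulate()
pvals = [p for _, _, p in rows]
covered = sum(lo <= 10.0 <= hi for lo, hi, _ in rows)
print(f"p values range from {min(pvals):.3f} to {max(pvals):.3f}")
print(f"{covered} of {len(rows)} intervals cover the true mean of 10")
```

Changing `seed`, `sigma` or `null` in the call to `simulate` shows how much the spread of p values depends on the particular draw and on the settings chosen.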
Yours truly gives a PhD course in statistics for students in education and sports this semester. And between teaching them all about Chebyshev’s Theorem, Beta Distributions, Moment-Generating Functions and the Neyman-Pearson Lemma, I try to remind them that statistics can actually also be fun …
Almost a hundred years after John Maynard Keynes wrote his seminal A Treatise on Probability (1921), it is still very difficult to find statistics textbooks that seriously try to incorporate his far-reaching and incisive analysis of induction and evidential weight.
The standard view in statistics – and the axiomatic probability theory underlying it – is to a large extent based on the rather simplistic idea that “more is better.” But as Keynes argues – “more of the same” is not what is important when making inductive inferences. It’s rather a question of “more but different.”
Variation, not replication, is at the core of induction. Finding that p(x|y) = p(x|y & w) doesn’t make w “irrelevant.” Knowing that the probability is unchanged when w is present gives p(x|y & w) another evidential weight (“weight of argument”). Running 10 replicative experiments does not make you as “sure” of your inductions as running 10 000 varied experiments – even if the probability values happen to be the same.
According to Keynes we live in a world permeated by unmeasurable uncertainty – not quantifiable stochastic risk – which often forces us to make decisions based on anything but “rational expectations.” Keynes rather thinks that we base our expectations on the confidence or “weight” we put on different events and alternatives. To Keynes expectations are a question of weighing probabilities by “degrees of belief,” beliefs that often have precious little to do with the kind of stochastic probabilistic calculations made by the rational agents as modeled by “modern” social sciences. And often we “simply do not know.” As Keynes writes in Treatise:
The kind of fundamental assumption about the character of material laws, on which scientists appear commonly to act, seems to me to be [that] the system of the material universe must consist of bodies … such that each of them exercises its own separate, independent, and invariable effect, a change of the total state being compounded of a number of separate changes each of which is solely due to a separate portion of the preceding state … Yet there might well be quite different laws for wholes of different degrees of complexity, and laws of connection between complexes which could not be stated in terms of laws connecting individual parts … If different wholes were subject to different laws qua wholes and not simply on account of and in proportion to the differences of their parts, knowledge of a part could not lead, it would seem, even to presumptive or probable knowledge as to its association with other parts … These considerations do not show us a way by which we can justify induction … /427 No one supposes that a good induction can be arrived at merely by counting cases. The business of strengthening the argument chiefly consists in determining whether the alleged association is stable, when accompanying conditions are varied … /468 In my judgment, the practical usefulness of those modes of inference … on which the boasted knowledge of modern science depends, can only exist … if the universe of phenomena does in fact present those peculiar characteristics of atomism and limited variety which appears more and more clearly as the ultimate result to which material science is tending.
Science according to Keynes should help us penetrate to “the true process of causation lying behind current events” and disclose “the causal forces behind the apparent facts.” Models can never be more than a starting point in that endeavour. He further argued that it was inadmissible to project history on the future. Consequently we cannot presuppose that what has worked before will continue to do so in the future. That statistical models can get hold of correlations between different “variables” is not enough. If they cannot get at the causal structure that generated the data, they are not really “identified.”
How strange that writers of statistics textbooks as a rule do not even touch upon these aspects of scientific methodology that seem to be so fundamental and important for anyone trying to understand how we learn and orient ourselves in an uncertain world. An educated guess as to why would be that Keynes’s concepts cannot be squeezed into a single calculable numerical “probability.” In the quest for quantities one turns a blind eye to qualities and looks the other way – but Keynes’s ideas keep creeping out from under the statistics carpet.
It’s high time that statistics textbooks give Keynes his due.