Including all relevant material – good, bad, and indifferent – in meta-analysis admits the subjective judgments that meta-analysis was designed to avoid. Several problems arise in meta-analysis: regressions are often non-linear; effects are often multivariate rather than univariate; coverage can be restricted; bad studies may be included; the data summarised may not be homogeneous; grouping different causal factors may lead to meaningless estimates of effects; and the theory-directed approach may obscure discrepancies. Meta-analysis may not be the one best method for studying the diversity of fields for which it has been used …
Glass and Smith carried out a meta-analysis of research on class size and achievement and concluded that “a clear and strong relationship between class size and achievement has emerged.”10 The study was done and analysed well; it might almost be cited as an example of what meta-analysis can do. Yet the conclusion is very misleading, as is the estimate of effect size it presents: “between class-size of 40 pupils and one pupil lie more than 30 percentile ranks of achievement.” Such estimates imply a linear regression, yet the regression is extremely curvilinear, as one of the authors’ figures shows: between class sizes of 20 and 40 there is absolutely no difference in achievement; it is only with unusually small classes that there seems to be an effect. For a teacher the major result is that for 90% of all classes the number of pupils makes no difference at all to their achievement. The conclusions drawn by the authors from their meta-analysis are formally correct, but they are statistically meaningless and particularly misleading. No estimate of effect size is meaningful unless regressions are linear, yet such linearity is seldom investigated, or, if not present, taken seriously.
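To see how misleading a linear effect-size summary can be when the underlying regression is curvilinear, here is a minimal simulation sketch (with invented numbers, not Glass and Smith's data): achievement is assumed flat for ordinary class sizes and to rise only for unusually small ones, yet a straight-line fit still reports a large 1-versus-40 "effect":

```python
# A minimal sketch (invented numbers, not Glass and Smith's data) of how a
# linear effect-size summary can misrepresent a curvilinear relationship.
import numpy as np

rng = np.random.default_rng(0)

class_size = np.arange(1, 41)
# Hypothetical "true" mean achievement (percentile ranks): a plateau at 50
# for ordinary classes, with gains only when classes get unusually small.
true_mean = 50 + 30 * np.exp(-class_size / 5)
achievement = true_mean + rng.normal(0, 2, size=class_size.size)

# A linear fit, as implied by a single effect-size estimate
slope, intercept = np.polyfit(class_size, achievement, 1)
print(f"fitted slope: {slope:.2f} percentile ranks per pupil")
print(f"implied 1-vs-40 difference: {slope * (1 - 40):.1f} percentile ranks")

# What the curvilinear data actually show for ordinary class sizes
mid_classes = achievement[(class_size >= 20) & (class_size < 30)].mean()
large_classes = achievement[class_size >= 30].mean()
print(f"observed 20-29 vs 30-40 difference: {mid_classes - large_classes:.1f} percentile ranks")
```

The fitted slope implies a large advantage of tiny classes over large ones, while the comparison teachers actually care about (20-29 versus 30-40 pupils) shows essentially nothing.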
Systematic reviews in the sciences are extremely important to undertake in our search for robust evidence and explanations — simply averaging data from different populations, places, and contexts is not.
Much has been said about significance testing – most of it negative. Methodologists constantly point out that researchers misinterpret p-values. Some say that it is at best a meaningless exercise and at worst an impediment to scientific discoveries. Consequently, I believe it is extremely important that students and researchers correctly interpret statistical tests. This visualization is meant as an aid for students when they are learning about statistical hypothesis testing.
Modern econometrics is fundamentally based on assuming — usually without any explicit justification — that we can gain causal knowledge by considering independent variables that may have an impact on the variation of a dependent variable. This is, however, far from self-evident. Often the fundamental causes are constant forces that are not amenable to the kind of analysis econometrics supplies us with. As Stanley Lieberson has it in his modern classic Making It Count:
One can always say whether, in a given empirical context, a given variable or theory accounts for more variation than another. But it is almost certain that the variation observed is not universal over time and place. Hence the use of such a criterion first requires a conclusion about the variation over time and place in the dependent variable. If such an analysis is not forthcoming, the theoretical conclusion is undermined by the absence of information …
Moreover, it is questionable whether one can draw much of a conclusion about causal forces from simple analysis of the observed variation … To wit, it is vital that one have an understanding, or at least a working hypothesis, about what is causing the event per se; variation in the magnitude of the event will not provide the answer to that question.
Trygve Haavelmo was making a somewhat similar point back in 1941, when criticizing the treatment of the interest variable in Tinbergen’s regression analyses. The regression coefficient of the interest rate variable being zero was, according to Haavelmo, not sufficient for inferring that “variations in the rate of interest play only a minor role, or no role at all, in the changes in investment activity.” Interest rates may very well play a decisive indirect role by influencing other causally effective variables. And:
the rate of interest may not have varied much during the statistical testing period, and for this reason the rate of interest would not “explain” very much of the variation in net profit (and thereby the variation in investment) which has actually taken place during this period. But one cannot conclude that the rate of interest would be inefficient as an autonomous regulator, which is, after all, the important point.
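Haavelmo's point is easy to illustrate with a little simulation (all numbers invented): a variable that hardly varies over the sample period will "explain" almost none of the observed variation, even though a modest counterfactual change in it would have a large effect:

```python
# A hedged illustration (all numbers invented) of Haavelmo's point: a variable
# that hardly varies in the sample will "explain" little of the observed
# variation, yet may still be a powerful autonomous regulator.
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Assumed structural relation: investment reacts strongly to the interest rate.
beta_interest = -5.0
interest = 3.0 + rng.normal(0, 0.02, n)      # nearly constant over the sample
other = rng.normal(0, 1.0, n)                # other causal factors
investment = 100 + beta_interest * interest + 4.0 * other + rng.normal(0, 0.5, n)

# Share of observed variance attributable to each regressor (rough decomposition)
var_total = investment.var()
print("variance share, interest :", round((beta_interest**2) * interest.var() / var_total, 4))
print("variance share, other    :", round((4.0**2) * other.var() / var_total, 4))

# Counterfactual: a 2-point change in the interest rate shifts investment by
# beta_interest * 2 = -10 units, a large effect that the variance share conceals.
print("counterfactual effect of a 2-point rate change:", beta_interest * 2.0)
```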
Causality in economics — and other social sciences — can never solely be a question of statistical inference. Causality entails more than predictability, and really explaining social phenomena in depth requires theory. Analysis of variation — the foundation of all econometrics — can never in itself reveal how these variations are brought about. Only when we are able to tie actions, processes or structures to the statistical relations detected can we say that we are getting at relevant explanations of causation. Too much in love with axiomatic-deductive modeling, neoclassical economists especially tend to forget that accounting for causation — how causes bring about their effects — demands deep subject-matter knowledge and acquaintance with intricate fabrics and contexts. As Keynes already argued in his A Treatise on Probability, statistics and econometrics should not primarily be seen as means of inferring causality from observational data, but rather as descriptions of patterns of associations and correlations that we may use as suggestions of possible causal relations.
I do not claim that the technique developed in the present paper will, like a stone of the wise, solve all the problems of testing “significance” with which the economic statistician is confronted. No statistical technique, however refined, will ever be able to do such a thing. The ultimate test of significance must consist in a network of conclusions and cross checks where theoretical economic considerations, intimate and realistic knowledge of the data and a refined statistical technique concur.
Noah Smith has a post up trying to defend p-values and traditional statistical significance testing against the increasing attacks launched against it:
Suddenly, everyone is getting really upset about p-values and statistical significance testing. The backlash has reached such a frenzy that some psych journals are starting to ban significance testing. Though there are some well-known problems with p-values and significance testing, this backlash doesn’t pass the smell test. When a technique has been in wide use for decades, it’s certain that LOTS of smart scientists have had a chance to think carefully about it. The fact that we’re only now getting the backlash means that the cause is something other than the inherent uselessness of the methodology.
That doesn’t sound very convincing.
Maybe we should apply yet another smell test …
A non-trivial part of teaching statistics consists in teaching students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how careful you try to be in explicating what the probabilities generated by these statistical tests – p-values – really are, most students still misinterpret them.
This is not to be blamed on the students’ ignorance, but rather on significance testing not being particularly transparent (conditional probability inference is difficult even for those of us who teach and practice it). A lot of researchers fall prey to the same mistakes. So — given that it is in any case very unlikely that any population parameter is exactly zero, and that, contrary to assumption, most samples in social science and economics are not random and do not have the right distributional shape — why continue to press students and researchers to do null hypothesis significance testing, testing that relies on a weird backward logic that students and researchers usually don’t understand?
Statistical significance doesn’t say that something is important or true. And since there already are far better and more relevant tests that can be done, it is high time to give up on this statistical fetish.
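To see why significance is not importance, consider a minimal sketch (hypothetical numbers): with a large enough sample, a difference of one hundredth of a standard deviation comes out "highly significant":

```python
# A minimal sketch of why "statistically significant" need not mean important:
# with a large enough sample, a practically negligible difference (invented
# here) produces a very small p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1_000_000

# Two populations whose means differ by a trivial 0.01 standard deviations
a = rng.normal(0.00, 1.0, n)
b = rng.normal(0.01, 1.0, n)

t, p = stats.ttest_ind(a, b)
print(f"mean difference: {b.mean() - a.mean():.4f} (about 1% of a standard deviation)")
print(f"p-value: {p:.2e}  -> 'significant', yet practically meaningless")
```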
Jager and Leek may well be correct in their larger point, that the medical literature is broadly correct. But I don’t think the statistical framework they are using is appropriate for the questions they are asking. My biggest problem is the identification of scientific hypotheses and statistical “hypotheses” of the “theta = 0” variety.
Based on the word “empirical” in the title, I thought the authors were going to look at a large number of papers with p-values and then follow up and see if the claims were replicated. But no, they don’t follow up on the studies at all! What they seem to be doing is collecting a set of published p-values and then fitting a mixture model to this distribution, a mixture of a uniform distribution (for null effects) and a beta distribution (for non-null effects). Since only statistically significant p-values are typically reported, they fit their model restricted to p-values less than 0.05. But this all assumes that the p-values have this stated distribution. You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on. Also, as noted above, the problem isn’t really effects that are exactly zero, the problem is that a lot of effects are lost in the noise and are essentially undetectable given the way they are studied.
Jager and Leek write that their model is commonly used to study hypotheses in genetics and imaging. I could see how this model could make sense in those fields … but I don’t see this model applying to published medical research, for two reasons. First … I don’t think there would be a sharp division between null and non-null effects; and, second, there’s just too much selection going on for me to believe that the conditional distributions of the p-values would be anything like the theoretical distributions suggested by Neyman-Pearson theory.
So, no, I don’t at all believe Jager and Leek when they write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time.” They’re doing this by basically assuming the model that is being questioned, the textbook model in which effects are pure and in which there is no p-hacking.
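Gelman's p-hacking worry is easy to make concrete with a schematic simulation (not Jager and Leek's actual model): even when every null hypothesis is true, selectively reporting the most favorable of several comparisons changes the distribution of published p-values that such a mixture model takes as given:

```python
# A schematic simulation (not Jager and Leek's actual model) of Gelman's
# objection: even when every null hypothesis is true, selective reporting
# ("p-hacking") changes the distribution of reported p-values, so a mixture
# model that takes the textbook distributions as given can be badly misled.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_studies, n_per_group, n_tries = 2_000, 30, 10

honest, hacked = [], []
for _ in range(n_studies):
    # The null is true in every study: all groups come from the same distribution.
    ps = [stats.ttest_ind(rng.normal(0, 1, n_per_group),
                          rng.normal(0, 1, n_per_group)).pvalue
          for _ in range(n_tries)]
    honest.append(ps[0])      # report the single pre-specified comparison
    hacked.append(min(ps))    # report the most favorable of n_tries comparisons

honest, hacked = np.array(honest), np.array(hacked)
print("share of p < 0.05, honest reporting :", round((honest < 0.05).mean(), 3))
print("share of p < 0.05, p-hacked reports :", round((hacked < 0.05).mean(), 3))
```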
Indeed. If anything, this underlines how important it is — and on this Noah Smith and yours truly agree — not to equate science with statistical calculation. All science entails human judgement, and using statistical models doesn’t relieve us of that necessity. When we work with misspecified models, the scientific value of significance testing is actually zero – even though you’re making valid statistical inferences! Statistical models and concomitant significance tests are no substitutes for doing real science. Or as a noted German philosopher once famously wrote:
There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits.
In its standard form, a significance test is not the kind of “severe test” that we are looking for in our search for being able to confirm or disconfirm empirical scientific hypotheses. This is problematic for many reasons, one being that there is a strong tendency to accept the null hypothesis since it can’t be rejected at the standard 5% significance level. In their standard form, significance tests bias against new hypotheses by making it hard to disconfirm the null hypothesis.
And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” Standard scientific methodology tells us that when there is only, say, a 10% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same 10% result as our reported one, I guess most researchers would count the null hypothesis as even more disconfirmed.
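For a small worked illustration of that last point, here is how several independent tests, each giving a p-value of roughly 0.10, combine into much stronger evidence against the null (using Fisher's combination method purely as an example, with hypothetical p-values):

```python
# Several independent tests that each give p of roughly 0.10 jointly
# constitute much stronger evidence against the null than any single one
# of them (Fisher's method, used here purely as an illustration).
from scipy import stats

pvalues = [0.10, 0.11, 0.09, 0.10, 0.12]   # five hypothetical independent tests
statistic, combined_p = stats.combine_pvalues(pvalues, method="fisher")
print(f"combined p-value (Fisher): {combined_p:.4f}")
```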
Most importantly — we should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-values mean next to nothing if the model is wrong. As eminent mathematical statistician David Freedman writes:
I believe model validation to be a central issue. Of course, many of my colleagues will be found to disagree. For them, fitting models to data, computing standard errors, and performing significance tests is “informative,” even though the basic statistical assumptions (linearity, independence of errors, etc.) cannot be validated. This position seems indefensible, nor are the consequences trivial. Perhaps it is time to reconsider.
Statistical significance tests DO NOT validate models!
In journal articles a typical regression equation will have an intercept and several explanatory variables. The regression output will usually include an F-test, with p – 1 degrees of freedom in the numerator and n – p in the denominator. The null hypothesis will not be stated. The missing null hypothesis is that all the coefficients vanish, except the intercept.
If F is significant, that is often thought to validate the model. Mistake. The F-test takes the model as given. Significance only means this: if the model is right and the coefficients are 0, it is very unlikely to get such a big F-statistic. Logically, there are three possibilities on the table:
i) An unlikely event occurred.
ii) Or the model is right and some of the coefficients differ from 0.
iii) Or the model is wrong.
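Freedman's point is easy to demonstrate with a small sketch (invented data): generate data from a decidedly non-linear relationship, fit a straight line anyway, and the overall F-test still comes out "highly significant", which of course validates nothing:

```python
# A hedged illustration of Freedman's point: the data are generated from a
# relation that violates the linear model's assumptions, yet the overall
# F-test of the fitted straight line is still "highly significant".
# Significance takes the model as given; it does not validate it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 300

x = rng.uniform(0, 4, n)
y = np.sin(3 * x) + 0.5 * x**2 + rng.normal(0, 0.3, n)   # decidedly non-linear

# Fit the (wrong) straight-line model y = b0 + b1*x
X = np.column_stack([np.ones(n), x])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

p = X.shape[1]
rss = (resid**2).sum()
tss = ((y - y.mean())**2).sum()
f_stat = ((tss - rss) / (p - 1)) / (rss / (n - p))        # p-1 and n-p df, as in the quote
f_pvalue = stats.f.sf(f_stat, p - 1, n - p)
print(f"F = {f_stat:.1f}, p = {f_pvalue:.2e}  -> 'significant', but the model is wrong")
```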
I had one of the most satisfying eureka experiences of my career while teaching flight instructors … about the psychology of effective training. I was telling them about an important principle of skill training: rewards for improved performance work better than punishment of mistakes…
When I finished my enthusiastic speech, one of the most seasoned instructors in the group raised his hand and made a short speech of his own. He began by conceding that rewarding improved performance might be good for the birds, but he denied that it was optimal for flight cadets. This is what he said: “On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver. The next time they try the same maneuver they usually do worse. On the other hand, I have often screamed into a cadet’s earphone for bad execution, and in general he does better on his next try. So please don’t tell us that reward works and punishment does not, because the opposite is the case” …
What he had observed is known as regression to the mean, which in that case was due to random fluctuations in the quality of performance. Naturally, he praised only a cadet whose performance was far better than average. But the cadet was probably just lucky on that particular attempt and therefore likely to deteriorate regardless of whether or not he was praised. Similarly, the instructor would shout into a cadet’s earphones only when the cadet’s performance was unusually bad and therefore likely to improve regardless of what the instructor did. The instructor had attached a causal interpretation to the inevitable fluctuations of a random process …
I had stumbled onto a significant fact of the human condition: the feedback to which life exposes us is perverse. Because we tend to be nice to other people when they please us and nasty when they do not, we are statistically punished for being nice and rewarded for being nasty …
It took Francis Galton several years to figure out that correlation and regression are not two concepts – they are different perspectives on the same concept: whenever the correlation between two scores is imperfect, there will be regression to the mean …
Causal explanations will be evoked when regression is detected, but they will be wrong because the truth is that regression to the mean has an explanation but does not have a cause.
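The flight-instructor episode is easy to reproduce in a small simulation (invented numbers): let performance be skill plus noise, let feedback have no causal effect whatsoever, and the "praised" cadets will still deteriorate while the "screamed-at" cadets improve:

```python
# A small simulation (invented numbers) of the flight-instructor episode:
# performance is skill plus random noise, feedback has no causal effect at
# all, and yet the worst performers "improve" and the best "deteriorate" on
# the next attempt: pure regression to the mean.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

skill = rng.normal(0, 1, n)
attempt1 = skill + rng.normal(0, 1, n)
attempt2 = skill + rng.normal(0, 1, n)    # feedback plays no role whatsoever

praised = attempt1 > np.quantile(attempt1, 0.9)       # unusually good first attempt
screamed_at = attempt1 < np.quantile(attempt1, 0.1)   # unusually bad first attempt

print("praised cadets:     ", round(attempt1[praised].mean(), 2), "->",
      round(attempt2[praised].mean(), 2))
print("screamed-at cadets: ", round(attempt1[screamed_at].mean(), 2), "->",
      round(attempt2[screamed_at].mean(), 2))
```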
Could it be better to discard 90% of the reported research? Surprisingly, the answer is yes to this statistical paradox. This paper has shown how publication selection can greatly distort the research record and its conventional summary statistics. Using both Monte Carlo simulations and actual research examples, we show how a simple estimator, which uses only 10 percent of the reported research, reduces publication bias and improves efficiency over conventional summary statistics that use all the reported research.
The average of the most precise 10 percent, ‘Top10,’ of the reported estimates of a given empirical phenomenon is often better than conventional summary estimators because of its heavy reliance on the reported estimate’s precision (i.e., the inverse of the estimate’s standard error). When estimates are chosen, in part, for their statistical significance, studies cursed with imprecise estimates have to engage in more intense selection from among alternative statistical techniques, models, data sets, and measures to produce the larger estimate that statistical significance demands. Thus, imprecise estimates will contain larger biases.
Studies that have access to more data will tend to be more precise, and hence less biased. At the level of the original empirical research, the statistician’s motto, “the more data the better,” holds because more data typically produce more precise estimates. It is only at the meta-level of integrating, summarizing, and interpreting an entire area of empirical research (meta-analysis) that the removal of 90% of the data might actually improve our empirical knowledge. Even when the authors of these larger and more precise studies actively select for statistical significance in the desired direction, smaller significant estimates will tend to be reported. Thus, precise studies will, on average, be less biased and thereby possess greater scientific quality, ceteris paribus.
We hope that the statistical paradox identified in this paper refocuses the empirical sciences upon precision. Precision should be universally adopted as one criterion of research quality, regardless of other statistical outcomes.
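Here is a schematic sketch of the idea (my own toy simulation, not Stanley and Doucouliagos's design): when reported estimates are selected for statistical significance, imprecise studies must inflate their estimates to clear the hurdle, so averaging only the most precise tenth gets much closer to the truth than averaging everything:

```python
# A schematic sketch (not Stanley and Doucouliagos's own simulation design) of
# why averaging only the most precise 10% of reported estimates can beat
# averaging everything when results are selected for statistical significance.
import numpy as np

rng = np.random.default_rng(6)
true_effect, n_studies, n_specs = 0.1, 2_000, 20

se = rng.uniform(0.02, 0.5, n_studies)        # study precision varies widely
reported = np.empty(n_studies)
for i in range(n_studies):
    # Each study tries several specifications and reports the first one that
    # clears the conventional significance hurdle (or else its largest estimate).
    candidates = true_effect + se[i] * rng.normal(0, 1, n_specs)
    significant = candidates[candidates / se[i] > 1.96]
    reported[i] = significant[0] if significant.size else candidates.max()

top10 = reported[np.argsort(se)[: n_studies // 10]]   # the 10% most precise studies
print("true effect:                 ", true_effect)
print("mean of all reported results:", round(reported.mean(), 3))
print("mean of the 'Top10' subset:  ", round(top10.mean(), 3))
```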
Thus we have “econometric modelling”, that activity of matching an incorrect version of [the parameter matrix] to an inadequate representation of [the data generating process], using insufficient and inaccurate data. The resulting compromise can be awkward, or it can be a useful approximation which encompasses previous results, throws light on economic theory and is sufficiently constant for prediction, forecasting and perhaps even policy. Simply writing down an “economic theory”, manipulating it to a “condensed form” and “calibrating” the resulting parameters using a pseudo-sophisticated estimator based on poor data which the model does not adequately describe constitutes a recipe for disaster, not for simulating gold! Its only link with alchemy is self-deception.