The increasing use of natural and quasi-natural experiments in economics during the last couple of decades has led some economists to triumphantly declare it as a major step on a recent path toward empirics, where instead of being a deductive philosophy, economics is now increasingly becoming an inductive science.
In their plaidoyer for this view, the work of Joshua Angrist and Jörn-Steffen Pischke is often apostrophized, so let’s start with one of their later books and see if there is any real reason to share the optimism on this ‘empirical turn’ in economics.
In their new book, Mastering ‘Metrics: The Path from Cause to Effect, Angrist and Pischke write:
Our first line of attack on the causality problem is a randomized experiment, often called a randomized trial. In a randomized trial, researchers change the causal variables of interest … for a group selected using something like a coin toss. By changing circumstances randomly, we make it highly likely that the variable of interest is unrelated to the many other factors determining the outcomes we want to study. Random assignment isn’t the same as holding everything else fixed, but it has the same effect. Random manipulation makes other things equal hold on average across the groups that did and did not experience manipulation. As we explain … ‘on average’ is usually good enough.
Angrist and Pischke may “dream of the trials we’d like to do” and consider “the notion of an ideal experiment” something that “disciplines our approach to econometric research,” but to maintain that ‘on average’ is “usually good enough” is an allegation that in my view is rather unwarranted, and for many reasons.
First of all it amounts to nothing but hand waving to simpliciter assume, without argumentation, that it is tenable to treat social agents and relations as homogeneous and interchangeable entities.
Randomization is used to basically allow the econometrician to treat the population as consisting of interchangeable and homogeneous groups (‘treatment’ and ‘control’). The regression models one arrives at by using randomized trials tell us the average effect that variations in variable X have on the outcome variable Y, without having to explicitly control for effects of other explanatory variables R, S, T, etc., etc. Everything is assumed to be essentially equal except the values taken by variable X.
In a usual regression context one would apply an ordinary least squares estimator (OLS) in trying to get an unbiased and consistent estimate:
Y = α + βX + ε,
where α is a constant intercept, β a constant “structural” causal effect and ε an error term.
The problem here is that although we may get an estimate of the “true” average causal effect, this may “mask” important heterogeneous effects of a causal nature. Although we get the right answer of the average causal effect being 0, those who are “treated” (X=1) may have causal effects equal to -100 and those “not treated” (X=0) may have causal effects equal to 100. Contemplating being treated or not, most people would probably be interested in knowing about this underlying heterogeneity and would not consider the OLS average effect particularly enlightening.
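To make the masking concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything taken from Angrist and Pischke: a hypothetical population in which half the units gain 100 from treatment and half lose 100, a baseline outcome of 50, and a coin-toss style assignment that treats exactly half the units. With a binary X the OLS slope is just the difference in group means.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(0)  # fixed seed, arbitrary choice
n = 10_000

# Hypothetical individual causal effects: +100 for half the units, -100 for the rest.
effects = [100.0] * (n // 2) + [-100.0] * (n // 2)

# Random assignment: exactly half the units are treated (X = 1).
treated = [1] * (n // 2) + [0] * (n // 2)
rng.shuffle(treated)

baseline = 50.0  # assumed outcome without treatment
y = [baseline + e * t for e, t in zip(effects, treated)]

beta_hat = ols_slope(treated, y)
true_average_effect = mean(effects)
effects_seen = sorted(set(effects))

print(f"OLS estimate of beta: {beta_hat:.2f}")
print(f"true average effect:  {true_average_effect:.2f}")
print(f"individual effects:   {effects_seen}")
```

The estimate lands close to the true average of zero, and nothing in the regression output hints that no single unit has an effect anywhere near zero.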
Limiting model assumptions in economic science always have to be closely examined since if we are going to be able to show that the mechanisms or causes that we isolate and handle in our models are stable in the sense that they do not change when we “export” them to our “target systems”, we have to be able to show that they do not only hold under ceteris paribus conditions and a fortiori only are of limited value to our understanding, explanations or predictions of real economic systems.
Real world social systems are not governed by stable causal mechanisms or capacities. The kinds of “laws” and relations that econometrics has established are laws and relations about entities in models that presuppose causal mechanisms being atomistic and additive. When causal mechanisms operate in real world social target systems they only do it in ever-changing and unstable combinations where the whole is more than a mechanical sum of parts. If economic regularities obtain they do it (as a rule) only because we engineered them for that purpose. Outside man-made “nomological machines” they are rare, or even non-existent. Unfortunately that also makes most of the achievements of econometrics – as most of contemporary endeavours of mainstream economic theoretical modeling – rather useless.
Remember that a model is not the truth. It is a lie to help you get your point across. And in the case of modeling economic risk, your model is a lie about others, who are probably lying themselves. And what’s worse than a simple lie? A complicated lie.
Sam L. Savage The Flaw of Averages
When Joshua Angrist and Jörn-Steffen Pischke in an earlier article of theirs [“The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con out of Econometrics,” Journal of Economic Perspectives, 2010] say that “anyone who makes a living out of data analysis probably believes that heterogeneity is limited enough that the well-understood past can be informative about the future,” I really think they underestimate the heterogeneity problem. It does not just turn up as an external validity problem when trying to “export” regression results to different times or different target populations. It is also often an internal problem to the millions of regression estimates that economists produce every year.
But when the randomization is purposeful, a whole new set of issues arises — experimental contamination — which is much more serious with human subjects in a social system than with chemicals mixed in beakers … Anyone who designs an experiment in economics would do well to anticipate the inevitable barrage of questions regarding the valid transference of things learned in the lab (one value of z) into the real world (a different value of z) …
Absent observation of the interactive compounding effects z, what is estimated is some kind of average treatment effect which is called by Imbens and Angrist (1994) a “Local Average Treatment Effect,” which is a little like the lawyer who explained that when he was a young man he lost many cases he should have won but as he grew older he won many that he should have lost, so that on the average justice was done. In other words, if you act as if the treatment effect is a random variable by substituting β_t for β_0 + β′z_t, the notation inappropriately relieves you of the heavy burden of considering what are the interactive confounders and finding some way to measure them …
If little thought has gone into identifying these possible confounders, it seems probable that little thought will be given to the limited applicability of the results in other settings.
Evidence-based theories and policies are highly valued nowadays. Randomization is supposed to control for bias from unknown confounders. The received opinion is that evidence based on randomized experiments therefore is the best.
More and more economists have also lately come to advocate randomization as the principal method for ensuring valid causal inferences.
I would however rather argue that randomization, just as econometrics, promises more than it can deliver, basically because it requires assumptions that in practice are not possible to maintain.
Especially when it comes to questions of causality, randomization is nowadays considered some kind of “gold standard”. Everything has to be evidence-based, and the evidence has to come from randomized experiments.
But just like econometrics, randomization is basically a deductive method. Given the assumptions (such as manipulability, transitivity, separability, additivity, linearity, etc.) these methods deliver deductive inferences. The problem, of course, is that we will never completely know when the assumptions are right. And although randomization may contribute to controlling for confounding, it does not guarantee it, since genuine randomness presupposes infinite experimentation and we know all real experimentation is finite. And even if randomization may help to establish average causal effects, it says nothing of individual effects unless homogeneity is added to the list of assumptions. Real target systems are seldom epistemically isomorphic to our axiomatic-deductive models/systems, and even if they were, we still have to argue for the external validity of the conclusions reached from within these epistemically convenient models/systems. Causal evidence generated by randomization procedures may be valid in “closed” models, but what we usually are interested in is causal evidence in the real target system we happen to live in.
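The point about finite experimentation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the confounder u is a hypothetical standard-normal variable that the experimenter never observes, and the sample sizes (20 and 2000) and the number of repetitions are arbitrary. It measures how far apart the treatment and control groups end up on u after a purely random split.

```python
import random
from statistics import mean

def mean_abs_imbalance(n, reps, rng):
    """Average |mean(u | treated) - mean(u | control)| over repeated random splits."""
    gaps = []
    for _ in range(reps):
        # Hypothetical unobserved confounder, one value per unit.
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        idx = list(range(n))
        rng.shuffle(idx)  # random assignment: first half treated, second half control
        treated, control = idx[: n // 2], idx[n // 2:]
        gap = mean(u[i] for i in treated) - mean(u[i] for i in control)
        gaps.append(abs(gap))
    return mean(gaps)

rng = random.Random(42)  # fixed seed, arbitrary choice
small = mean_abs_imbalance(n=20, reps=200, rng=rng)
large = mean_abs_imbalance(n=2000, reps=200, rng=rng)

print(f"average imbalance with n=20:   {small:.3f} standard deviations")
print(f"average imbalance with n=2000: {large:.3f} standard deviations")
```

Randomization balances the groups only in expectation. In any single finite trial some imbalance on unmeasured factors remains, and it shrinks only slowly as the sample grows.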
When does a conclusion established in population X hold for target population Y? Only under very restrictive conditions!
Angrist’s and Pischke’s “ideally controlled experiments” tell us with certainty what causes what effects — but only given the right “closures”. Making appropriate extrapolations from (ideal, accidental, natural or quasi) experiments to different settings, populations or target systems, is not easy. “It works there” is no evidence for “it will work here”. Causes deduced in an experimental setting still have to show that they come with an export-warrant to the target population/system. The causal background assumptions made have to be justified, and without licenses to export, the value of “rigorous” and “precise” methods — and ‘on-average-knowledge’ — is despairingly small.
Like us, you want evidence that a policy will work here, where you are. Randomized controlled trials (RCTs) do not tell you that. They do not even tell you that a policy works. What they tell you is that a policy worked there, where the trial was carried out, in that population. Our argument is that the changes in tense – from “worked” to “work” – are not just a matter of grammatical detail. To move from one to the other requires hard intellectual and practical effort. The fact that it worked there is indeed fact. But for that fact to be evidence that it will work here, it needs to be relevant to that conclusion. To make RCTs relevant you need a lot more information and of a very different kind.
So, no, I find it hard to share the enthusiasm and optimism on the value of (quasi)natural experiments and all the statistical-econometric machinery that comes with it. Guess I’m still waiting for the export-warrant …
In econometrics one often gets the feeling that many of its practitioners think of it as a kind of automatic inferential machine: input data and out comes causal knowledge. This is like pulling a rabbit from a hat. Great, but first you have to put the rabbit in the hat. And this is where assumptions come into the picture.
As social scientists — and economists — we have to confront the all-important question of how to handle uncertainty and randomness. Should we equate randomness with probability? If we do, we have to accept that to speak of randomness we also have to presuppose the existence of nomological probability machines, since probabilities cannot be spoken of – and actually, to be strict, do not at all exist – without specifying such system-contexts.
Accepting a domain of probability theory and a sample space of “infinite populations” — which is legion in modern econometrics — also implies that judgments are made on the basis of observations that are actually never made! Infinitely repeated trials or samplings never take place in the real world. So that cannot be a sound inductive basis for a science with aspirations of explaining real-world socio-economic processes, structures or events. It’s not tenable.
In his book Statistical Models and Causal Inference: A Dialogue with the Social Sciences David Freedman touches on this fundamental problem, arising when you try to apply statistical models outside overly simple nomological machines like coin tossing and roulette wheels:
Lurking behind the typical regression model will be found a host of such assumptions; without them, legitimate inferences cannot be drawn from the model. There are statistical procedures for testing some of these assumptions. However, the tests often lack the power to detect substantial failures. Furthermore, model testing may become circular; breakdowns in assumptions are detected, and the model is redefined to accommodate. In short, hiding the problems can become a major goal of model building.
Using models to make predictions of the future, or the results of interventions, would be a valuable corrective. Testing the model on a variety of data sets – rather than fitting refinements over and over again to the same data set – might be a good second-best … Built into the equation is a model for non-discriminatory behavior: the coefficient d vanishes. If the company discriminates, that part of the model cannot be validated at all.
Regression models are widely used by social scientists to make causal inferences; such models are now almost a routine way of demonstrating counterfactuals. However, the “demonstrations” generally turn out to depend on a series of untested, even unarticulated, technical assumptions. Under the circumstances, reliance on model outputs may be quite unjustified. Making the ideas of validation somewhat more precise is a serious problem in the philosophy of science. That models should correspond to reality is, after all, a useful but not totally straightforward idea – with some history to it. Developing appropriate models is a serious problem in statistics; testing the connection to the phenomena is even more serious …
In our days, serious arguments have been made from data. Beautiful, delicate theorems have been proved, although the connection with data analysis often remains to be established. And an enormous amount of fiction has been produced, masquerading as rigorous science.
Making outlandish statistical assumptions does not provide a solid ground for doing relevant social science.
Many treatments of regression seem to take for granted that the investigator knows the relevant variables, their causal order, and the functional form of the relationships among them; measurements of the independent variables are assumed to be without error. Indeed, Gauss developed and used regression in physical science contexts where these conditions hold, at least to a very good approximation. Today, the textbook theorems that justify regression are proved on the basis of such assumptions.
In the social sciences, the situation seems quite different. Regression is used to discover relationships or to disentangle cause and effect. However, investigators have only vague ideas as to the relevant variables and their causal order; functional forms are chosen on the basis of convenience or familiarity; serious problems of measurement are often encountered.
Regression may offer useful ways of summarizing the data and making predictions. Investigators may be able to use summaries and predictions to draw substantive conclusions. However, I see no cases in which regression equations, let alone the more complex methods, have succeeded as engines for discovering causal relationships …
The larger problem remains. Can quantitative social scientists infer causality by applying statistical technology to correlation matrices? That is not a mathematical question, because the answer turns on the way the world is put together. As I read the record, correlational methods have not delivered the goods. We need to work on measurement, design, theory. Fancier statistics are not likely to help much.
If you only have time to study one mathematical statistician, the choice should be easy — David Freedman.
I consider the work of Bernanke (1986) on the relationship between money and output because I agree with Robert King’s observation that it is an “admirable piece of normal science” even if I do not share his view that the paper is “a tribute to macroeconomics”. Bernanke’s thoughtfulness, care and attention to detail is apparent. His scientific objectivity in considering his own theory and that of others is exemplary … I single Bernanke’s work out because it is an example of excellent research within the tradition it represents.
Bernanke’s paper does not really reach a substantive conclusion that could possibly change the views of any serious observer of the economy … The only firm conclusion reached is that structural interpretations of VARs are very sensitive to the model one assumes and that future research using VARs should take this into account. It is hard to see how such findings can stimulate new theoretical developments or bring about improvements in our ability to predict, control or explain economic events …
Investigators like Bernanke, who believe that they can learn something new about causal relationships without introducing information beyond that contained in time series whose properties have already been studied thousands of times, are shadow boxing with reality. As the foregoing observations suggest, their identifying assumptions are obviously implausible once stated in English. Formalism and the attendant matrix algebra serves primarily to obscure the futility of the exercise in which they are engaged.
And, best of all, it is totally free!
Gretl is up to the tasks you may have, so why spend money on expensive commercial programs?
The latest snapshot version of Gretl – 2015d – can be downloaded here.
With this new version also comes a handy primer on Hansl — the scripting language of Gretl.
So just go ahead. With Gretl and Hansl, econometrics has never been easier to master!
In an article posted earlier on this blog (What are the key assumptions of linear regression models?), yours truly tried to argue that since econometrics doesn’t content itself with only making “optimal” predictions, but also aspires to explain things in terms of causes and effects, econometricians need loads of assumptions, and that the most important of these are additivity and linearity.
Let me take the opportunity to elaborate a little more on why I find these assumptions of such paramount importance, and why they ought to be much more argued for, on both epistemological and ontological grounds, if they are to be used at all.
Limiting model assumptions in economic science always have to be closely examined since if we are going to be able to show that the mechanisms or causes that we isolate and handle in our models are stable in the sense that they do not change when we “export” them to our “target systems”, we have to be able to show that they do not only hold under ceteris paribus conditions and a fortiori only are of limited value to our understanding, explanations or predictions of real economic systems. As the always eminently quotable Keynes wrote (emphasis added) in Treatise on Probability (1921):
The kind of fundamental assumption about the character of material laws, on which scientists appear commonly to act, seems to me to be [that] the system of the material universe must consist of bodies … such that each of them exercises its own separate, independent, and invariable effect, a change of the total state being compounded of a number of separate changes each of which is solely due to a separate portion of the preceding state … Yet there might well be quite different laws for wholes of different degrees of complexity, and laws of connection between complexes which could not be stated in terms of laws connecting individual parts … If different wholes were subject to different laws qua wholes and not simply on account of and in proportion to the differences of their parts, knowledge of a part could not lead, it would seem, even to presumptive or probable knowledge as to its association with other parts … These considerations do not show us a way by which we can justify induction … /427 No one supposes that a good induction can be arrived at merely by counting cases. The business of strengthening the argument chiefly consists in determining whether the alleged association is stable, when accompanying conditions are varied … /468 In my judgment, the practical usefulness of those modes of inference … on which the boasted knowledge of modern science depends, can only exist … if the universe of phenomena does in fact present those peculiar characteristics of atomism and limited variety which appears more and more clearly as the ultimate result to which material science is tending.
Econometrics may be an informative tool for research. But if its practitioners do not investigate and make an effort of providing a justification for the credibility of the assumptions on which they erect their building, it will not fulfill its tasks. There is a gap between its aspirations and its accomplishments, and without more supportive evidence to substantiate its claims, critics will continue to consider its ultimate argument as a mixture of rather unhelpful metaphors and metaphysics. Maintaining that economics is a science in the “true knowledge” business, yours truly remains a skeptic of the pretences and aspirations of econometrics. So far, I cannot really see that it has yielded very much in terms of relevant, interesting economic knowledge.
The marginal return on its ever higher technical sophistication in no way makes up for the lack of serious under-labouring of its deeper philosophical and methodological foundations that already Keynes complained about. The rather one-sided emphasis of usefulness and its concomitant instrumentalist justification cannot hide that neither Haavelmo, nor the legions of probabilistic econometricians following in his footsteps, give supportive evidence for their considering it “fruitful to believe” in the possibility of treating unique economic data as the observable results of random drawings from an imaginary sampling of an imaginary population. After having analyzed some of its ontological and epistemological foundations, I cannot but conclude that econometrics on the whole has not delivered “truth”. And I doubt if it has ever been the intention of its main protagonists.
Our admiration for technical virtuosity should not blind us to the fact that we have to have a cautious attitude towards probabilistic inferences in economic contexts. Science should help us penetrate to “the true process of causation lying behind current events” and disclose “the causal forces behind the apparent facts” [Keynes 1971-89 vol XVII:427]. We should look out for causal relations, but econometrics can never be more than a starting point in that endeavour, since econometric (statistical) explanations are not explanations in terms of mechanisms, powers, capacities or causes. Firmly stuck in an empiricist tradition, econometrics is only concerned with the measurable aspects of reality. But there is always the possibility that there are other variables – of vital importance and although perhaps unobservable and non-additive, not necessarily epistemologically inaccessible – that were not considered for the model. Those that were can hence never be guaranteed to be more than potential causes, and not real causes. A rigorous application of econometric methods in economics really presupposes that the phenomena of our real world economies are ruled by stable causal relations between variables. A perusal of the leading econom(etr)ic journals shows that most econometricians still concentrate on fixed parameter models and that parameter-values estimated in specific spatio-temporal contexts are presupposed to be exportable to totally different contexts. To warrant this assumption one, however, has to convincingly establish that the targeted acting causes are stable and invariant so that they maintain their parametric status after the bridging. The endemic lack of predictive success of the econometric project indicates that this hope of finding fixed parameters is a hope for which there really is no other ground than hope itself.
Real world social systems are not governed by stable causal mechanisms or capacities. As Keynes wrote in his critique of econometrics and inferential statistics already in the 1920s (emphasis added):
The atomic hypothesis which has worked so splendidly in Physics breaks down in Psychics. We are faced at every turn with the problems of Organic Unity, of Discreteness, of Discontinuity – the whole is not equal to the sum of the parts, comparisons of quantity fail us, small changes produce large effects, the assumptions of a uniform and homogeneous continuum are not satisfied. Thus the results of Mathematical Psychics turn out to be derivative, not fundamental, indexes, not measurements, first approximations at the best; and fallible indexes, dubious approximations at that, with much doubt added as to what, if anything, they are indexes or approximations of.
The kinds of “laws” and relations that econometrics has established are laws and relations about entities in models that presuppose causal mechanisms being atomistic and additive. When causal mechanisms operate in real world social target systems they only do it in ever-changing and unstable combinations where the whole is more than a mechanical sum of parts. If economic regularities obtain they do it (as a rule) only because we engineered them for that purpose. Outside man-made “nomological machines” they are rare, or even non-existent. Unfortunately that also makes most of the achievements of econometrics – as most of contemporary endeavours of mainstream economic theoretical modeling – rather useless.
Following our recent post on econometricians’ traditional privileging of unbiased estimates, there were a bunch of comments echoing the challenge of teaching this topic, as students as well as practitioners often seem to want the comfort of an absolute standard such as best linear unbiased estimate or whatever. Commenters also discussed the tradeoff between bias and variance, and the idea that unbiased estimates can overfit the data.
I agree with all these things but I just wanted to raise one more point: In realistic settings, unbiased estimates simply don’t exist. In the real world we have nonrandom samples, measurement error, nonadditivity, nonlinearity, etc etc etc.
So forget about it. We’re living in the real world …
It’s my impression that many practitioners in applied econometrics and statistics think of their estimation choice kinda like this:
1. The unbiased estimate. It’s the safe choice, maybe a bit boring and maybe not the most efficient use of the data, but you can trust it and it gets the job done.
2. A biased estimate. Something flashy, maybe Bayesian, maybe not, it might do better but it’s risky. In using the biased estimate, you’re stepping off base—the more the bias, the larger your lead—and you might well get picked off …
If you take the choice above and combine it with the unofficial rule that statistical significance is taken as proof of correctness (in econ, this would also require demonstrating that the result holds under some alternative model specifications, but “p less than .05” is still key), then you get the following decision rule:
A. Go with the safe, unbiased estimate. If it’s statistically significant, run some robustness checks and, if the result doesn’t go away, stop.
B. If you don’t succeed with A, you can try something fancier. But . . . if you do that, everyone will know that you tried plan A and it didn’t work, so people won’t trust your finding.
So, in a sort of Gresham’s Law, all that remains is the unbiased estimate. But, hey, it’s safe, conservative, etc, right?
And that’s where the present post comes in. My point is that the unbiased estimate does not exist! There is no safe harbor. Just as we can never get our personal risks in life down to zero … there is no such thing as unbiasedness. And it’s a good thing, too: recognition of this point frees us to do better things with our data right away.
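One of the real-world complications listed above, measurement error, is easy to demonstrate. The following is a minimal sketch with made-up numbers: a hypothetical true slope of 2, a regressor observed with noise that has the same variance as the regressor itself, and an arbitrary sample size. Under those assumptions the textbook “unbiased” OLS slope is pulled roughly halfway towards zero.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(1)  # fixed seed, arbitrary choice
n = 20_000
true_beta = 2.0  # hypothetical structural slope

x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_beta * x + rng.gauss(0.0, 1.0) for x in x_true]

# What the analyst actually records: the regressor plus measurement noise.
x_observed = [x + rng.gauss(0.0, 1.0) for x in x_true]

slope_clean = ols_slope(x_true, y)      # feasible only if x were measured perfectly
slope_noisy = ols_slope(x_observed, y)  # what the data at hand deliver

print(f"slope with error-free regressor: {slope_clean:.2f}")
print(f"slope with mismeasured regressor: {slope_noisy:.2f}")
```

The estimator is the same in both runs. Only the data quality differs, which is why unbiasedness is a property of an idealized setting rather than of the procedure itself.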
One of my favourite “problem situating lecture arguments” against Bayesianism goes something like this: Assume you’re a Bayesian turkey and hold a nonzero probability belief in the hypothesis H that “people are nice vegetarians that do not eat turkeys and that every day I see the sun rise confirms my belief.” For every day you survive, you update your belief according to Bayes’ Rule
P(H|e) = [P(e|H)P(H)]/P(e),
where evidence e stands for “not being eaten” and P(e|H) = 1. Given that there do exist other hypotheses than H, P(e) is less than 1 and a fortiori P(H|e) is greater than P(H). Every day you survive increases your probability belief that you will not be eaten. This is totally rational according to the Bayesian definition of rationality. Unfortunately — as Bertrand Russell famously noticed — for every day that goes by, the traditional Christmas dinner also gets closer and closer …
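The turkey's updating can be traced numerically. The sketch below assumes, purely for illustration, a single rival hypothesis under which the turkey survives any given day with probability 0.99, a prior of 0.5 on H, and 300 days of observation. None of these numbers come from the argument above; only the rule P(H|e) = [P(e|H)P(H)]/P(e) with P(e|H) = 1 does.

```python
def bayes_update(prior_h, p_e_given_h, p_e_given_alt):
    """Posterior P(H|e) when H has a single rival hypothesis."""
    p_e = p_e_given_h * prior_h + p_e_given_alt * (1.0 - prior_h)
    return p_e_given_h * prior_h / p_e

P_E_GIVEN_H = 1.0      # under H the turkey is never eaten
P_E_GIVEN_ALT = 0.99   # assumed daily survival probability under the rival hypothesis
DAYS = 300             # assumed number of days observed

belief = 0.5           # assumed prior P(H)
beliefs = [belief]
for _ in range(DAYS):
    belief = bayes_update(belief, P_E_GIVEN_H, P_E_GIVEN_ALT)
    beliefs.append(belief)

for day in (0, 1, 30, 100, 300):
    print(f"day {day:3d}: P(H | survived so far) = {beliefs[day]:.4f}")
```

The belief in H rises every single day, including the last day before dinner. The updating is coherent throughout, and it is no protection against a hypothesis space that left out the butcher's calendar.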
For more on my own objections to Bayesianism:
Bayesianism — a patently absurd approach to science
Bayesianism — preposterous mumbo jumbo
One of the reasons I’m a Keynesian and not a Bayesian
Keynes and Bayes in paradise
Because I was there when the economics department of my university got an IBM 360, I was very much caught up in the excitement of combining powerful computers with economic research. Unfortunately, I lost interest in econometrics almost as soon as I understood how it was done. My thinking went through four stages:
1. Holy shit! Do you see what you can do with a computer’s help?
2. Learning computer modeling puts you in a small class where only other members of the caste can truly understand you. This opens up huge avenues for fraud.
3. The main reason to learn stats is to prevent someone else from committing fraud against you.
4. More and more people will gain access to the power of statistical analysis. When that happens, the stratification of importance within the profession should be a matter of who asks the best questions.
Disillusionment began to set in. I began to suspect that all the really interesting economic questions were FAR beyond the ability to reduce them to mathematical formulas. Watching computers being applied to other pursuits than academic economic investigations over time only confirmed those suspicions.
1. Precision manufacture is an obvious application for computing. And for many applications, this worked magnificently. Any design that combined straight lines and circles could be easily described for computerized manufacture. Unfortunately, the really interesting design problems can NOT be reduced to formulas. A car’s fender, for example, cannot be described using formulas; it can only be described by specifying an assemblage of multiple points. If math formulas cannot describe something as common and uncomplicated as a car fender, how can they hope to describe human behavior?
2. When people started using computers for animation, it soon became apparent that human motion was almost impossible to model correctly. After a great deal of effort, the animators eventually put tracing balls on real humans and recorded that motion before transferring it to the animated character. Formulas failed to describe simple human behavior, like a toddler trying to walk.
Lately, I have discovered a Swedish economist who did NOT give up econometrics merely because it sounded so impossible. In fact, he still teaches the stuff. But for the rest of us, he systematically destroys the pretensions of those who think they can describe human behavior with some basic formulas.
Wonder who that Swedish economist could be …
Andrew: I believed, and still believe, in checking the fit of a model by comparing data to hypothetical replications. This is not the same as significance testing in which a p-value is used to decide whether to reject a model or whether to believe that a finding is true.
Mayo: I don’t know that significance tests are used to decide that a finding is true, and I’m surprised to see you endorsing/spreading the hackneyed and much lampooned view of significance tests, p-values, etc. despite so many of us trying to correct the record. And statistical hypothesis testing denies uncertainty? Where in the world do you get this? (I know it’s not because they don’t use posterior probabilities…) But never mind, let me ask: when you check the fit of a model using p-value assessments, are you not inferring the adequacy/inadequacy of the model? Tell me what you are doing if not. I don’t particularly like calling it a decision, neither do many people, and I like viewing the output as “whether to believe” even less. But I don’t know what your output is supposed to be.
Andrew: I don’t think hypothesis testing inherently denies uncertainty. But I do think that it is used by many researchers as a way of avoiding uncertainty: it’s all too common for “significant” to be interpreted as “true” and “non-significant” to be interpreted as “zero.” Consider, for example, all the trash science we’ve been discussing on this blog recently, studies that may have some scientific content but which get ruined by their authors’ deterministic interpretations. When I check the fit of a model, I’m assessing its adequacy for some purpose. This is not the same as looking for p < .05 or p < .01 in order to go around saying that some theory is now true.
Mayo: I fail to see how a deterministic interpretation could go hand in hand with error probabilities; and I never hear even the worst test abusers declare a theory is not true, give me a break… So when you assess adequacy for a purpose, what does this mean? Adequate vs inadequate for a purpose is pretty dichotomous. Do you assess how adequate? I’m unclear as to where the uncertainty enters for you, because as I understand it, it is not in terms of a posterior probability.
Andrew: Here’s a quote from a researcher, I posted it on the blog a few days ago: “Our results demonstrate that physically weak males are more reluctant than physically strong males to assert their self-interest…” Here’s another quote: “Ovulation led single women to become more liberal, less religious, and more likely to vote for Barack Obama. In contrast, ovulation led married women to become more conservative, more religious, and more likely to vote for Mitt Romney.” These are deterministic statements based on nothing more than p-values that happen to be statistically significant. Researchers make these sorts of statements all the time. It’s not your fault, I’m not saying you would do this, but it’s a serious problem. Along similar lines, we’ll see claims that a treatment has an effect on men and not on women, when really what is happening is that p < .05 for the men in the study and p > .05 for the women. In addition to brushing away uncertainty, people also seem to want to brush away variation, thus talking about “the effect” as if it is a constant across all groups and all people. A recent example featured on this blog was a study primarily of male college students which was referred to repeatedly (by its authors, not just by reporters and public relations people) as a study of “men” with no qualifications. P.S. Bayesians do this too, indeed there’s a whole industry (which I hate) of Bayesian methods for getting the posterior probability that a null hypothesis is true. Bayesians use different methods but often have the misguided goal of other statisticians, to deny uncertainty and variation.
Mayo: These moves from observed associations, and even correlations, to causal claims are poorly warranted, but these are classic fallacies that go beyond tests to reading all manner of “explanations” into the data. I find it very odd to view this as a denial of uncertainty by significance tests. Even if they got their statistics right, the link from stat to substantive causal claim would exist. I just find it odd to regard the statistical vs substantive and correlation vs cause fallacies, which every child knows, as some kind of shortcoming with significance tests. Any method or no method can commit these fallacies, especially from observational studies. But when you berate the tests as somehow responsible, you misleadingly suggest that other methods are better, rather than worse. At least error statistical methods can identify the flaws at 3 levels (data, statistical inference, stat -> substantive causal claim) in a systematic way. We can spot the flaws a mile off… I still don’t know where you want the uncertainty to show up; I’ve indicated how I do.
Andrew: You write, “I still don’t know where you want the uncertainty to show up;” I want the uncertainty to show up in a posterior distribution for continuous parameters, as described in my books.
Mayo: You write, “I want the uncertainty to show up in a posterior distribution for continuous parameters”. Let’s see if I have this right. You would report the posterior probabilities that a model was adequate for a goal. Yes? Now you have also said you are a falsificationist. So is your falsification rule to move from a low enough posterior probability in the adequacy of a model, to the falsity of a claim that the model is adequate (for the goal)? And would high enough posterior in the adequacy of a model translate into something like, not being able to falsify its adequacy or perhaps, accepting it as adequate (the latter would not be falsificationist, but might be more sensible than the former)? Or are you no longer falsificationist-leaning?
Andrew: No, I would not “report the posterior probabilities that a model was adequate for a goal.” That makes no sense to me. I would report the posterior distribution of parameters and make probabilistic predictions within a model.
Mayo: Well if you’re going to falsify as a result, you need a rule from these posteriors to infer the predictions are met satisfactorily or not. Else there is no warrant for rejecting/improving the model. That’s the kind of thing significance tests can do. But specifically, with respect to the misleading interpretations of data that you were just listing, it isn’t obvious how they are avoided by you. The data may fit these hypotheses swimmingly. Anyhow, this is not the place to discuss this further. In signing off, I just want to record my objection to (mis)portraying statistical tests and other error statistical methods as flawed because of some blatant, age-old misuses or misleading language, like “demonstrate” (flaws that are at least detectable and self-correctable by these same methods, whereas they might remain hidden by other methods now in use). [Those examples should not even be regarded as seeking evidence but at best colorful and often pseudoscientific interpretations.] When the Higgs particle physicists found their 2 and 3 standard deviation effects were disappearing with new data—just to mention a recent example from my blog—they did not say the flaw was with the p-values! They tightened up their analyses and made them more demanding. They didn’t report posterior distributions for the properties of the Higgs, but they were able to make inferences about their values, and identify gaps for further analysis.
For my own take on significance tests see here.
Decisions based on statistical significance testing certainly make life easier. But significance testing doesn’t give us the knowledge we want. It only gives an answer to a question we as researchers never ask: what is the probability of getting the result we have got, assuming that there is no difference between two sets of data (e.g. control group vs. experimental group, sample vs. population)? On the question we really are interested in, namely how probable and reliable our hypothesis is, it remains silent.
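The gap between the two questions can be made concrete with a small calculation. What follows is only an illustrative sketch, not a method taken from any of the authors quoted above. It assumes a simple two-hypothesis setup in which the null is either true or false, and the prior probability of the null and the power of the test are hypothetical numbers chosen for the example. The function names are likewise made up for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z: the probability of a result
    at least this extreme, assuming the null hypothesis is true."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def prob_null_given_significant(prior_null, alpha, power):
    """Bayes' rule for P(null is true | test rejected at level alpha).

    prior_null and power are assumptions supplied by the user;
    a significance test by itself provides neither.
    """
    false_positives = alpha * prior_null          # null true, test rejects
    true_positives = power * (1.0 - prior_null)   # null false, test rejects
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # The question the test answers: P(data this extreme | null).
    print(round(two_sided_p_value(1.96), 3))

    # The question we care about: P(null | significant result).
    # Same alpha, two hypothetical research settings.
    print(round(prob_null_given_significant(0.5, 0.05, 0.8), 3))
    print(round(prob_null_given_significant(0.9, 0.05, 0.5), 3))
```

With the same 5% significance level, the probability that the null is nevertheless true after a "significant" finding comes out at roughly 6% in the first hypothetical setting and roughly 47% in the second. The p-value is identical in both cases; everything that differs comes from inputs the test does not supply.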