- Always, but always, plot your data.
- Remember that data quality is at least as important as data quantity.
- Always ask yourself, “Do these results make economic/common sense?”
- Check whether your “statistically significant” results are also economically significant.
- Be sure that you know exactly what assumptions are used/needed to obtain
the results relating to the properties of any estimator or test that you use.
- Just because someone else has used a particular approach to analyse a
problem that looks like yours, that doesn’t mean they were right!
- “Test, test, test”! (David Hendry). But don’t forget that “pre-testing”
raises some important issues of its own.
- Don’t assume that the computer code that someone gives to you is
relevant for your application, or that it even produces correct results.
- Keep in mind that published results represent only a fraction of the
results that the author obtained; the rest go unpublished.
- Don’t forget that “peer-reviewed” does NOT mean “correct results”, or
even “best practices were followed”.
Some variants of ‘data mining’ can be classified as the greatest of the basement sins, but other variants of ‘data mining’ can be viewed as important ingredients in data analysis. Unfortunately, these two variants usually are not mutually exclusive and so frequently conflict in the sense that to gain the benefits of the latter, one runs the risk of incurring the costs of the former.
Hoover and Perez (2000, p. 196) offer a general definition of data mining as referring to “a broad class of activities that have in common a search over different ways to process or package data statistically or econometrically with the purpose of making the final presentation meet certain design criteria.” Two markedly different views of data mining lie within the scope of this general definition. One view of ‘data mining’ is that it refers to experimenting with (or ‘fishing through’) the data to produce a specification … The problem with this, and why it is viewed as a sin, is that such a procedure is almost guaranteed to produce a specification tailored to the peculiarities of that particular data set, and consequently will be misleading in terms of what it says about the underlying process generating the data. Furthermore, traditional testing procedures used to ‘sanctify’ the specification are no longer legitimate, because these data, since they have been used to generate the specification, cannot be judged impartial if used to test that specification …
An alternative view of ‘data mining’ is that it refers to experimenting with (or ‘fishing through’) the data to discover empirical regularities that can inform economic theory … Hand et al (2000) describe data mining as the process of seeking interesting or valuable information in large data sets. Its greatest virtue is that it can uncover empirical regularities that point to errors/omissions in theoretical specifications …
In summary, this second type of ‘data mining’ identifies regularities in or characteristics of the data that should be accounted for and understood in the context of the underlying theory. This may suggest the need to rethink the theory behind one’s model, resulting in a new specification founded on a more broad-based understanding. This is to be distinguished from a new specification created by mechanically remolding the old specification to fit the data; this would risk incurring the costs described earlier when discussing the first variant of ‘data mining.’
The issue here is how should the model specification be chosen? As usual, Leamer (1996, p. 189) has an amusing view: “As you wander through the thicket of models, you may come to question the meaning of the Econometric Scripture that presumes the model is given to you at birth by a wise and beneficent Holy Spirit.”
In practice, model specifications come from both theory and data, and given the absence of Leamer’s Holy Spirit, properly so.
Brad DeLong wonders why Cliff Asness is clinging to a theoretical model that has clearly been rejected by the data …
There’s a version of this in econometrics, i.e. you know the model is correct, you are just having trouble finding evidence for it. It goes as follows. You are testing a theory you came up with, but the data are uncooperative and say you are wrong. But instead of accepting that, you tell yourself “My theory is right, I just haven’t found the right econometric specification yet. I need to add variables, remove variables, take a log, add an interaction, square a term, do a different correction for misspecification, try a different sample period, etc., etc., etc.” Then, after finally digging out that one specification of the econometric model that confirms your hypothesis, you declare victory, write it up, and send it off (somehow never mentioning the intense specification mining that produced the result).
Too much econometric work proceeds along these lines. Not quite this blatantly, but that is, in effect, what happens in too many cases. I think it is often best to think of econometric results as the best case the researcher could make for a particular theory rather than a true test of the model.
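The specification search described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the outcome is pure noise, each "specification" is a simple regression on a fresh, unrelated regressor, and the function names, the |t| > 2 cutoff, and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import random


def t_stat_slope(x, y):
    """t-statistic of the slope in a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


def share_of_searches_with_a_hit(n_searches=500, n_specs=20, n_obs=50, seed=1):
    """Fraction of searches that find at least one |t| > 2, although y is
    unrelated to every candidate regressor by construction."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_searches):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        for _ in range(n_specs):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            if abs(t_stat_slope(x, y)) > 2.0:
                hits += 1
                break  # stop at the first "confirming" specification
    return hits / n_searches


share = share_of_searches_with_a_hit()
print(f"searches ending in a 'significant' result: {share:.0%}")
```

If each individual test has a false-positive rate of about 5%, trying 20 unrelated specifications should produce at least one "significant" result in roughly 1 − 0.95^20 ≈ 64% of searches, even though there is nothing to find.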
Mark touches the spot. For the sake of balancing the overly rosy picture of econometric achievements given in the usual econometrics textbooks today, it may also be interesting to see how Trygve Haavelmo, with the completion (in 1958) of the twenty-fifth volume of Econometrica, assessed the role of econometrics in the advancement of economics. Although mainly positive about the “repair work” and “clearing-up work” done, Haavelmo also found some grounds for despair:
We have found certain general principles which would seem to make good sense. Essentially, these principles are based on the reasonable idea that, if an economic model is in fact “correct” or “true,” we can say something a priori about the way in which the data emerging from it must behave. We can say something, a priori, about whether it is theoretically possible to estimate the parameters involved. And we can decide, a priori, what the proper estimation procedure should be … But the concrete results of these efforts have often been a seemingly lower degree of accuracy of the would-be economic laws (i.e., larger residuals), or coefficients that seem a priori less reasonable than those obtained by using cruder or clearly inconsistent methods.
There is the possibility that the more stringent methods we have been striving to develop have actually opened our eyes to recognize a plain fact: viz., that the “laws” of economics are not very accurate in the sense of a close fit, and that we have been living in a dream-world of large but somewhat superficial or spurious correlations.
And as the quote below shows, Frisch also shared some of Haavelmo’s — and Keynes’s — doubts on the applicability of econometrics:
I have personally always been skeptical of the possibility of making macroeconomic predictions about the development that will follow on the basis of given initial conditions … I have believed that the analytical work will give higher yields – now and in the near future – if they become applied in macroeconomic decision models where the line of thought is the following: “If this or that policy is made, and these conditions are met in the period under consideration, probably a tendency to go in this or that direction is created”.
Peter Dorman is one of those rare economists that it is always a pleasure to read. Here his critical eye is focused on economists’ infatuation with homogeneity and averages:
You may feel a gnawing discomfort with the way economists use statistical techniques. Ostensibly they focus on the difference between people, countries or whatever the units of observation happen to be, but they nevertheless seem to treat the population of cases as interchangeable—as homogenous on some fundamental level. As if people were replicants.
You are right, and this brief talk is about why and how you’re right, and what this implies for the questions people bring to statistical analysis and the methods they use.
Our point of departure will be a simple multiple regression model of the form
y = β0 + β1 x1 + β2 x2 + … + ε
where y is an outcome variable, x1 is an explanatory variable of interest, the other x’s are control variables, the β’s are coefficients on these variables (or a constant term, in the case of β0), and ε is a vector of residuals. We could apply the same analysis to more complex functional forms, and we would see the same things, so let’s stay simple.
What question does this model answer? It tells us the average effect that variations in x1 have on the outcome y, controlling for the effects of other explanatory variables. Repeat: it’s the average effect of x1 on y.
This model is applied to a sample of observations. What is assumed to be the same for these observations? (1) The outcome variable y is meaningful for all of them. (2) The list of potential explanatory factors, the x’s, is the same for all. (3) The effects these factors have on the outcome, the β’s, are the same for all. (4) The proper functional form that best explains the outcome is the same for all. In these four respects all units of observation are regarded as essentially the same.
Now what is permitted to differ across these observations? Simply the values of the x’s and therefore the values of y and ε. That’s it.
Thus measures of the difference between individual people or other objects of study are purchased at the cost of immense assumptions of sameness. It is these assumptions that both reflect and justify the search for average effects …
In the end, statistical analysis is about imposing a common structure on observations in order to understand differentiation. Any structure requires assuming some kinds of sameness, but some approaches make much more sweeping assumptions than others. An unfortunate symbiosis has arisen in economics between statistical methods that excessively rule out diversity and statistical questions that center on average (non-diverse) effects. This is damaging in many contexts, including hypothesis testing, program evaluation, forecasting—you name it …
The first step toward recovery is admitting you have a problem. Every statistical analyst should come clean about what assumptions of homogeneity are being made, in light of their plausibility and the opportunities that exist for relaxing them.
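Dorman’s point about average effects can be made concrete with simulated data. The sketch below assumes two hypothetical groups of equal size whose true effects of x on y are +2 and −1; the group labels, effect sizes, and helper name are illustrative and do not come from the talk.

```python
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(0)
n_per_group = 2000

# Same distribution of x in both groups, but opposite-signed effects.
x_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
x_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
y_a = [2.0 * x + rng.gauss(0, 1) for x in x_a]   # group A: effect +2
y_b = [-1.0 * x + rng.gauss(0, 1) for x in x_b]  # group B: effect -1

slope_pooled = ols_slope(x_a + x_b, y_a + y_b)
slope_a = ols_slope(x_a, y_a)
slope_b = ols_slope(x_b, y_b)

print(f"pooled: {slope_pooled:.2f}  group A: {slope_a:.2f}  group B: {slope_b:.2f}")
```

The pooled slope lands near the average of the two effects (about 0.5), an "effect" that describes no unit in either group. Estimating the groups separately recovers the two underlying effects, but only because this sketch knows in advance where the heterogeneity lies.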
Firmly stuck in an empiricist tradition, econometrics is only concerned with the measurable aspects of reality. But there is always the possibility that there are other variables – of vital importance and although perhaps unobservable and non-additive, not necessarily epistemologically inaccessible – that were not considered for the model.
Real world social systems are not governed by stable causal mechanisms or capacities. If economic regularities obtain, they do so, as a rule, only because we engineered them for that purpose. Outside man-made “nomological machines” they are rare, or even non-existent. Unfortunately that also makes them rather useless.
Remember that a model is not the truth. It is a lie to help you get your point across. And in the case of modeling economic risk, your model is a lie about others, who are probably lying themselves. And what’s worse than a simple lie? A complicated lie.
Sam L. Savage, The Flaw of Averages
In Gretl it’s extremely simple to do this kind of bootstrapping. Run the regression and you get an output window with the regression results. Click on Analysis at the top of the window and then on Bootstrap and select the options Confidence interval and Resample residuals. After having selected the coefficient for which you want bootstrapped estimates, you just click OK and a window will appear showing the 95% confidence interval for the coefficient. It’s as simple as that!
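For readers who want to see what resampling residuals involves, here is a minimal Python sketch of a residual bootstrap with a percentile confidence interval for the slope of a simple regression. It is not a description of Gretl’s internal procedure, which may construct the interval differently; the function names, the simulated data, and the number of replications are assumptions made for illustration.

```python
import random


def fit_ols(x, y):
    """Return (intercept, slope) of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile interval for the slope, resampling OLS residuals."""
    rng = random.Random(seed)
    intercept, slope = fit_ols(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        # Keep x fixed; rebuild y from fitted values plus resampled residuals.
        y_star = [fi + rng.choice(resid) for fi in fitted]
        slopes.append(fit_ols(x, y_star)[1])
    slopes.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return slopes[lo_idx], slopes[hi_idx]


# Simulated data with a known slope of 1.5 (illustrative values only).
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [1.0 + 1.5 * xi + rng.gauss(0, 2) for xi in x]

point = fit_ols(x, y)[1]
low, high = residual_bootstrap_ci(x, y)
print(f"slope = {point:.3f}, 95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```

Resampling residuals keeps the regressors fixed and treats the errors as exchangeable, so the interval is only as trustworthy as that assumption.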
In econometrics one often gets the feeling that many of its practitioners think of it as a kind of automatic inferential machine: input data and out comes causal knowledge. This is like pulling a rabbit from a hat. Great, but first you have to put the rabbit in the hat. And this is where assumptions come into the picture.
The assumption of imaginary “superpopulations” is one of the many dubious assumptions used in modern econometrics, and as Clint Ballinger has highlighted, this is a particularly questionable rabbit pulling assumption:
Inferential statistics are based on taking a random sample from a larger population … and attempting to draw conclusions about a) the larger population from that data and b) the probability that the relations between measured variables are consistent or are artifacts of the sampling procedure.
However, in political science, economics, development studies and related fields the data often represents as complete an amount of data as can be measured from the real world (an ‘apparent population’). It is not the result of a random sampling from a larger population. Nevertheless, social scientists treat such data as the result of random sampling.
Because there is no source of further cases a fiction is propagated—the data is treated as if it were from a larger population, a ‘superpopulation’ where repeated realizations of the data are imagined. Imagine there could be more worlds with more cases and the problem is fixed …
What ‘draw’ from this imaginary superpopulation does the real-world set of cases we have in hand represent? This is simply an unanswerable question. The current set of cases could be representative of the superpopulation, and it could be an extremely unrepresentative sample, a one in a million chance selection from it …
The problem is not one of statistics that need to be fixed. Rather, it is a problem of the misapplication of inferential statistics to non-inferential situations.
As social scientists – and economists – we have to confront the all-important question of how to handle uncertainty and randomness. Should we define randomness with probability? If we do, we have to accept that to speak of randomness we also have to presuppose the existence of nomological probability machines, since probabilities cannot be spoken of – and actually, to be strict, do not at all exist – without specifying such system-contexts. Accepting Haavelmo’s domain of probability theory and sample space of infinite populations – just like Fisher’s “hypothetical infinite population, of which the actual data are regarded as constituting a random sample”, von Mises’s “collective” or Gibbs’s “ensemble” – also implies that judgments are made on the basis of observations that are actually never made!
Infinitely repeated trials or samplings never take place in the real world. So that cannot be a sound inductive basis for a science with aspirations of explaining real-world socio-economic processes, structures or events. It’s not tenable.
As David Salsburg once noted on probability theory, in his lovely The Lady Tasting Tea:
[W]e assume there is an abstract space of elementary things called ‘events’ … If a measure on the abstract space of events fulfills certain axioms, then it is a probability. To use probability in real life, we have to identify this space of events and do so with sufficient specificity to allow us to actually calculate probability measurements on that space … Unless we can identify [this] abstract space, the probability statements that emerge from statistical analyses will have many different and sometimes contrary meanings.
Just like, e.g., John Maynard Keynes and Nicholas Georgescu-Roegen, Salsburg is very critical of the way social scientists – including economists and econometricians – uncritically and without argument have come to simply assume that one can apply probability distributions from statistical theory to their own area of research:
Probability is a measure of sets in an abstract space of events. All the mathematical properties of probability can be derived from this definition. When we wish to apply probability to real life, we need to identify that abstract space of events for the particular problem at hand … It is not well established when statistical methods are used for observational studies … If we cannot identify the space of events that generate the probabilities being calculated, then one model is no more valid than another … As statistical models are used more and more for observational studies to assist in social decisions by government and advocacy groups, this fundamental failure to be able to derive probabilities without ambiguity will cast doubt on the usefulness of these methods.
Importantly, this also means that if you cannot show that the data satisfy all the conditions of the probabilistic nomological machine – including, e.g., the distribution of the deviations corresponding to a normal curve – then the statistical inferences drawn lack sound foundations.
In his great book Statistical Models and Causal Inference: A Dialogue with the Social Sciences David Freedman also touched on these fundamental problems, arising when you try to apply statistical models outside overly simple nomological machines like coin tossing and roulette wheels (emphasis added):
Lurking behind the typical regression model will be found a host of such assumptions; without them, legitimate inferences cannot be drawn from the model. There are statistical procedures for testing some of these assumptions. However, the tests often lack the power to detect substantial failures. Furthermore, model testing may become circular; breakdowns in assumptions are detected, and the model is redefined to accommodate. In short, hiding the problems can become a major goal of model building.
Using models to make predictions of the future, or the results of interventions, would be a valuable corrective. Testing the model on a variety of data sets – rather than fitting refinements over and over again to the same data set – might be a good second-best … Built into the equation is a model for non-discriminatory behavior: the coefficient d vanishes. If the company discriminates, that part of the model cannot be validated at all.
Regression models are widely used by social scientists to make causal inferences; such models are now almost a routine way of demonstrating counterfactuals. However, the “demonstrations” generally turn out to depend on a series of untested, even unarticulated, technical assumptions. Under the circumstances, reliance on model outputs may be quite unjustified. Making the ideas of validation somewhat more precise is a serious problem in the philosophy of science. That models should correspond to reality is, after all, a useful but not totally straightforward idea – with some history to it. Developing appropriate models is a serious problem in statistics; testing the connection to the phenomena is even more serious …
In our days, serious arguments have been made from data. Beautiful, delicate theorems have been proved, although the connection with data analysis often remains to be established. And an enormous amount of fiction has been produced, masquerading as rigorous science.
And as if this wasn’t enough, one could, as we’ve seen, also seriously wonder what kind of “populations” these statistical and econometric models ultimately are based on. Why should we as social scientists – and not as pure mathematicians working with formal-axiomatic systems without the urge to confront our models with real target systems – unquestioningly accept Haavelmo’s “infinite population”, Fisher’s “hypothetical infinite population”, von Mises’s “collective” or Gibbs’s “ensemble”?
Of course one could treat our observational or experimental data as random samples from real populations. I have no problem with that. But probabilistic econometrics does not content itself with that kind of population. Instead it creates imaginary populations of “parallel universes” and assumes that our data are random samples from that kind of population.
But this is actually nothing else but hand-waving! And it is inadequate for real science. As David Freedman writes in Statistical Models and Causal Inference (emphasis added):
With this approach, the investigator does not explicitly define a population that could in principle be studied, with unlimited resources of time and money. The investigator merely assumes that such a population exists in some ill-defined sense. And there is a further assumption, that the data set being analyzed can be treated as if it were based on a random sample from the assumed population. These are convenient fictions … Nevertheless, reliance on imaginary populations is widespread. Indeed regression models are commonly used to analyze convenience samples … The rhetoric of imaginary populations is seductive because it seems to free the investigator from the necessity of understanding how data were generated.
In social sciences — including economics — it’s always wise to ponder C. S. Peirce’s remark that universes are not as common as peanuts …
In 1997, Christopher, the eleven-week-old child of a young lawyer named Sally Clark, died in his sleep: an apparent case of Sudden Infant Death Syndrome (SIDS) … One year later, Sally’s second child, Harry, also died, aged just eight weeks. Sally was arrested and accused of killing the children. She was convicted of murdering them, and in 1999 was given a life sentence …
In this case the mistaken evidence came from Sir Roy Meadow, a paediatrician. Despite not being an expert statistician or probabilist, he felt able to make a statement about probabilities … He asserted that the probability of two SIDS deaths in a family like Sally Clark’s was 1 in 73 million. A probability as small as this suggests we might apply Borel’s law: we shouldn’t expect to see an improbable event …
Unfortunately, however, Meadow’s 1 in 73 million probability is based on a crucial assumption: that the deaths are independent; that one such death in a family does not make it more or less likely that there will be another …
Now … that assumption does seem unjustified: data show that if one SIDS death has occurred, then a subsequent child is about ten times more likely to die of SIDS … To arrive at a valid conclusion, we would have to compare the probability that the two children had been murdered with the probability that they had both died from SIDS … There is a factor-of-ten difference between Meadow’s estimate and the estimate based on recognizing that SIDS events in the same family are not independent, and that difference shifts the probability from favouring homicide to favouring SIDS deaths …
Following widespread criticism of the misuse and indeed misunderstanding of statistical evidence, Sally Clark’s conviction was overturned, and she was released in 2003.
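The arithmetic behind the factor of ten can be sketched in a few lines. The only inputs are the two figures quoted above (1 in 73 million and “about ten times more likely”). The single-death probability is simply backed out as the square root of Meadow’s figure, which is what the independence assumption implies, and the variable names are illustrative.

```python
# Meadow's figure for two SIDS deaths, which assumes independence.
p_two_independent = 1 / 73_000_000

# Under independence, P(two deaths) = p * p, so the implied single-death
# probability is the square root of Meadow's figure.
p_single = p_two_independent ** 0.5

# "About ten times more likely" after a first SIDS death (from the text).
risk_multiplier = 10
p_two_dependent = p_single * (risk_multiplier * p_single)

print(f"implied single-death odds: 1 in {1 / p_single:,.0f}")
print(f"two deaths, independent:   1 in {1 / p_two_independent:,.0f}")
print(f"two deaths, dependent:     1 in {1 / p_two_dependent:,.0f}")
```

Even the corrected figure, roughly 1 in 7.3 million, says nothing about guilt on its own. As the passage notes, it has to be compared with the probability of a double murder, which is not given here.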
Back in 1943, eminent French mathematician Émile Borel published a book titled Les probabilités et la vie, in which he introduced what has been called Borel’s law: “Events with a sufficiently small probability never occur.”
Borel’s law has also been called the infinite monkey theorem since Borel illustrated his thinking using the classic example with monkeys randomly hitting the keys of a typewriter and by chance producing the complete works of Shakespeare:
Such is the sort of event which, though its impossibility may not be rationally demonstrable, is, however, so unlikely that no sensible person will hesitate to declare it actually impossible. If someone affirms having observed such an event we would be sure that he is deceiving us or has himself been the victim of fraud.
Wikipedia gives the historical background and a proof of the theorem:
Variants of the theorem include multiple and even infinitely many typists, and the target text varies between an entire library and a single sentence. The history of these statements can be traced back to Aristotle’s On Generation and Corruption and Cicero’s De natura deorum (On the Nature of the Gods), through Blaise Pascal and Jonathan Swift, and finally to modern statements with their iconic typewriters. In the early 20th century, Émile Borel and Arthur Eddington used the theorem to illustrate the timescales implicit in the foundations of statistical mechanics.
There is a straightforward proof of this theorem. As an introduction, recall that if two events are statistically independent, then the probability of both happening equals the product of the probabilities of each one happening independently. For example, if the chance of rain in Moscow on a particular day in the future is 0.4 and the chance of an earthquake in San Francisco on that same day is 0.00003, then the chance of both happening on that day is 0.4 × 0.00003 = 0.000012, assuming that they are indeed independent.
Suppose the typewriter has 50 keys, and the word to be typed is banana. If the keys are pressed randomly and independently, it means that each key has an equal chance of being pressed. Then, the chance that the first letter typed is ‘b’ is 1/50, and the chance that the second letter typed is ‘a’ is also 1/50, and so on. Therefore, the chance of the first six letters spelling banana is
$$(1/50) \times (1/50) \times (1/50) \times (1/50) \times (1/50) \times (1/50) = (1/50)^6 = 1/15\,625\,000\,000,$$
less than one in 15 billion, but not zero, hence a possible outcome.
From the above, the chance of not typing banana in a given block of 6 letters is $1 - (1/50)^6$. Because each block is typed independently, the chance $X_n$ of not typing banana in any of the first $n$ blocks of 6 letters is
$$X_n = \left(1 - (1/50)^6\right)^n.$$
As $n$ grows, $X_n$ gets smaller. For an $n$ of a million, $X_n$ is roughly 0.9999, but for an $n$ of 10 billion $X_n$ is roughly 0.53 and for an $n$ of 100 billion it is roughly 0.0017. As $n$ approaches infinity, the probability $X_n$ approaches zero; that is, by making $n$ large enough, $X_n$ can be made as small as is desired, and the chance of typing banana approaches 100%.
The same argument shows why at least one of infinitely many monkeys will produce a text as quickly as it would be produced by a perfectly accurate human typist copying it from the original. In this case $X_n = \left(1 - (1/50)^6\right)^n$, where $X_n$ represents the probability that none of the first $n$ monkeys types banana correctly on their first try. When we consider 100 billion monkeys, the probability falls to 0.17%, and as the number of monkeys $n$ increases, the value of $X_n$ – the probability of the monkeys failing to reproduce the given text – approaches zero arbitrarily closely. The limit, for $n$ going to infinity, is zero.
However, for physically meaningful numbers of monkeys typing for physically meaningful lengths of time the results are reversed. If there are as many monkeys as there are particles in the observable universe ($10^{80}$), and each types 1,000 keystrokes per second for 100 times the life of the universe ($10^{20}$ seconds), the probability of the monkeys replicating even a short book is nearly zero.
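The figures quoted in the proof are easy to check numerically. The sketch below evaluates the probability of never typing banana in n blocks for the three values of n mentioned, and then puts a rough upper bound on the physical case. The function names and the 100-character text length are hypothetical choices for illustration; a short book would be far longer, which only strengthens the conclusion.

```python
import math

KEYS = 50
P_BLOCK = (1 / KEYS) ** 6  # chance that a given 6-letter block spells banana


def prob_no_banana(n_blocks):
    """X_n = (1 - (1/50)^6)^n, computed via log1p to keep precision."""
    return math.exp(n_blocks * math.log1p(-P_BLOCK))


def log10_expected_matches(text_length, log10_keystrokes=103):
    """log10 of an upper bound on the expected number of exact matches.

    10^80 monkeys x 10^3 keystrokes per second x 10^20 seconds gives
    10^103 keystrokes in total; each starting position matches a given
    text of the stated length with probability 50^(-text_length).
    """
    return log10_keystrokes - text_length * math.log10(KEYS)


for n in (10**6, 10**10, 10**11):
    print(f"n = {n:.0e}: X_n = {prob_no_banana(n):.4f}")

# A hypothetical 100-character passage, far shorter than any book.
print(f"log10(expected matches) = {log10_expected_matches(100):.1f}")
```

Since the probability of at least one match cannot exceed the expected number of matches, a value of about −67 on the log10 scale means that even a 100-character passage is out of reach for every monkey the observable universe could hold.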
For more on Borel’s law and the fact that incredibly unlikely things still keep happening, see David Hand’s The Improbability Principle (Bantam Press, 2014).