Mathematical statistician David A. Freedman‘s Statistical Models and Causal Inference (Cambridge University Press, 2010) is a marvellous book. It ought to be mandatory reading for every serious social scientist – including economists and econometricians – who doesn’t want to succumb to ad hoc assumptions and unsupported statistical conclusions!
How do we calibrate the uncertainty introduced by data collection? Nowadays, this question has become quite salient, and it is routinely answered using wellknown methods of statistical inference, with standard errors, t -tests, and P-values … These conventional answers, however, turn out to depend critically on certain rather restrictive assumptions, for instance, random sampling …
Thus, investigators who use conventional statistical technique turn out to be making, explicitly or implicitly, quite restrictive behavioral assumptions about their data collection process … More typically, perhaps, the data in hand are simply the data most readily available …
The moment that conventional statistical inferences are made from convenience samples, substantive assumptions are made about how the social world operates … When applied to convenience samples, the random sampling assumption is not a mere technicality or a minor revision on the periphery; the assumption becomes an integral part of the theory …
In particular, regression and its elaborations … are now standard tools of the trade. Although rarely discussed, statistical assumptions have major impacts on analytic results obtained by such methods.
Consider the usual textbook exposition of least squares regression. We have n observational units, indexed by i = 1, . . . , n. There is a response variable yi , conceptualized as μi + i , where μi is the theoretical mean of yi while the disturbances or errors i represent the impact of random variation (sometimes of omitted variables). The errors are assumed to be drawn independently from a common (gaussian) distribution with mean 0 and finite variance. Generally, the error distribution is not empirically identifiable outside the model; so it cannot be studied directly—even in principle—without the model. The error distribution is an imaginary population and the errors i are treated as if they were a random sample from this imaginary population—a research strategy whose frailty was discussed earlier.
Usually, explanatory variables are introduced and μi is hypothesized to be a linear combination of such variables. The assumptions about the μi and i are seldom justified or even made explicit—although minor correlations in the i can create major bias in estimated standard errors for coefficients …
Why do μi and i behave as assumed? To answer this question, investigators would have to consider, much more closely than is commonly done, the connection between social processes and statistical assumptions …
We have tried to demonstrate that statistical inference with convenience samples is a risky business. While there are better and worse ways to proceed with the data at hand, real progress depends on deeper understanding of the data-generation mechanism. In practice, statistical issues and substantive issues overlap. No amount of statistical maneuvering will get very far without some understanding of how the data were produced.
More generally, we are highly suspicious of efforts to develop empirical generalizations from any single dataset. Rather than ask what would happen in principle if the study were repeated, it makes sense to actually repeat the study. Indeed, it is probably impossible to predict the changes attendant on replication without doing replications. Similarly, it may be impossible to predict changes resulting from interventions without actually intervening.
To understand real world “non-routine” decisions and unforeseeable changes in behaviour, ergodic probability distributions are of no avail. In a world full of genuine uncertainty — where real historical time rules the roost — the probabilities that ruled the past are not necessarily those that will rule the future.
When we cannot accept that the observations, along the time-series available to us, are independent … we have, in strict logic, no more than one observation, all of the separate items having to be taken together. For the analysis of that the probability calculus is useless; it does not apply … I am bold enough to conclude, from these considerations that the usefulness of ‘statistical’ or ‘stochastic’ methods in economics is a good deal less than is now conventionally supposed … We should always ask ourselves, before we apply them, whether they are appropriate to the problem in hand. Very often they are not … The probability calculus is no excuse for forgetfulness.
John Hicks, Causality in Economics, 1979:121
To simply assume that economic processes are ergodic — and a fortiori in any relevant sense timeless — is not a sensible way for dealing with the kind of genuine uncertainty that permeates open systems such as economies.
Added 25 February: Commenting on this article, Paul Davidson writes:
After reading my article on the fallacy of rational expectations, Hicks wrote to me in a letter dated 12 February 1983 in which he said “I have just been reading your RE [rational expectations] paper … I do like it very much … You have now rationalized my suspicions and shown me that I missed a chance of labeling my own point of view as nonergodic. One needs a name like that to ram a point home.”
It is clearly the case that experienced modellers could easily come up with significantly different models based on the same set of data thus undermining claims to researcher-independent objectivity. This has been demonstrated empirically by Magnus and Morgan (1999) who conducted an experiment in which an apprentice had to try to replicate the analysis of a dataset that might have been carried out by three different experts (Leamer, Sims, and Hendry) following their published guidance. In all cases the results were different from each other, and different from that which would have been produced by the expert, thus demonstrating the importance of tacit knowledge in statistical analysis.
Magnus and Morgan conducted a further experiment which involved eight expert teams, from different universities, analysing the same sets of data each using their own particular methodology. The data concerned the demand for food in the US and in the Netherlands and was based on a classic study by Tobin (1950) augmented with more recent data. The teams were asked to estimate the income elasticity of food demand and to forecast per capita food consumption. In terms of elasticities, the lowest estimates were around 0.38 whilst the highest were around 0.74 – clearly vastly different especially when remembering that these were based on the same sets of data. The forecasts were perhaps even more extreme – from a base of around 4000 in 1989 the lowest forecast for the year 2000 was 4130 while the highest was nearly 18000!
Our results suggest that sexual frequency is highest in households with traditionally gendered divisions of labor … The coefficient for men’s share of core household labor is negative: households in which men do more female-typed (core) tasks report lower sexual frequency. The coefficient for men’s share of non-core household labor, on the other hand, is positive: households in which men do more male-typed (non-core) tasks report more sex. These effects are statistically significant and substantively large. Overall, these results suggest that sexuality is governed by enactments of femininity and masculinity through appropriately gendered performances of household labor that coincide with sexual scripts organizing heterosexual desire.
To illustrate the substantive size of these effects, Figure 1 shows predicted values for sexual frequency, varying the share of household labor performed by men while setting all other variables to their means. As the figure shows, shifting from a household in which women perform all of the core household tasks to one where women perform none of the core household tasks is associated with a decline in sexual frequency of nearly 1.6 times per month. Given a mean sexual frequency in this sample of slightly over five, this is a large difference. The figure represents two extreme values, but even households in which men do 40 percent of core household task hours report substantially lower sexual frequency than households in which women perform all core housework. The effect for men’s share of non-core housework is similar although somewhat smaller.
[h/t Andrew Gelman]