Against multiple regression analysis28 January, 2016 at 18:35 | Posted in Statistics & Econometrics | 2 Comments
Distinguished social psychologist Richard E. Nisbett has a somewhat atypical aversion to multiple regression analysis . In his Intelligence and How to Get It (Norton 2011) he wrote (p. 17):
Researchers often determine the individual’s contemporary IQ or IQ earlier in life, socioeconomic status of the family of origin, living circumstances when the individual was a child, number of siblings, whether the family had a library card, educational attainment of the individual, and other variables, and put all of them into a multiple-regression equation predicting adult socioeconomic status or income or social pathology or whatever. Researchers then report the magnitude of the contribution of each of the variables in the regression equation, net of all the others (that is, holding constant all the others). It always turns out that IQ, net of all the other variables, is important to outcomes. But … the independent variables pose a tangle of causality – with some causing others in goodness-knows-what ways and some being caused by unknown variables that have not even been measured. Higher socioeconomic status of parents is related to educational attainment of the child, but higher-socioeconomic-status parents have higher IQs, and this affects both the genes that the child has and the emphasis that the parents are likely to place on education and the quality of the parenting with respect to encouragement of intellectual skills and so on. So statements such as “IQ accounts for X percent of the variation in occupational attainment” are built on the shakiest of statistical foundations. What nature hath joined together, multiple regressions cannot put asunder.
And now he is back with a half an hour lecture — The Crusade Against Multiple Regression Analysis — posted on The Edge website a week ago (watch the lecture here).
Now, I think that what Nisbett says is right as far as it goes, although it would certainly have strengthened Nisbett’s argumentation if he had elaborated more on the methodological question around causality, or at least had given some mathematical-statistical-econometric references. Unfortunately, his alternative approach is not more convincing than regression analysis. As so many other contemporary social scientists today, Nisbett seems to think that randomization may solve the empirical problem. By randomizing we are getting different “populations” that are homogeneous in regards to all variables except the one we think is a genuine cause. In this way we are supposed to be able to not have to actually know what all these other factors are.
If you succeed in performing an ideal randomization with different treatment groups and control groups that is attainable. But it presupposes that you really have been able to establish – and not just assume – that the probability of all other causes but the putative have the same probability distribution in the treatment and control groups, and that the probability of assignment to treatment or control groups are independent of all other possible causal variables.
Unfortunately, real experiments and real randomizations seldom or never achieve this. So, yes, we may do without knowing all causes, but it takes ideal experiments and ideal randomizations to do that, not real ones.
As I have argued — e. g. here — that means that in practice we do have to have sufficient background knowledge to deduce causal knowledge. Without old knowledge, we can’t get new knowledge – and, no causes in, no causes out.
Nisbett is well worth reading and listening to, but on the issue of the shortcomings of multiple regression analysis, no one sums it up better than eminent mathematical statistician David Freedman in his Statistical Models and Causal Inference:
If the assumptions of a model are not derived from theory, and if predictions are not tested against reality, then deductions from the model must be quite shaky. However, without the model, the data cannot be used to answer the research question …
In my view, regression models are not a particularly good way of doing empirical work in the social sciences today, because the technique depends on knowledge that we do not have. Investigators who use the technique are not paying adequate attention to the connection – if any – between the models and the phenomena they are studying. Their conclusions may be valid for the computer code they have created, but the claims are hard to transfer from that microcosm to the larger world …
Regression models often seem to be used to compensate for problems in measurement, data collection, and study design. By the time the models are deployed, the scientific position is nearly hopeless. Reliance on models in such cases is Panglossian …
Given the limits to present knowledge, I doubt that models can be rescued by technical fixes. Arguments about the theoretical merit of regression or the asymptotic behavior of specification tests for picking one version of a model over another seem like the arguments about how to build desalination plants with cold fusion and the energy source. The concept may be admirable, the technical details may be fascinating, but thirsty people should look elsewhere …
Causal inference from observational data presents may difficulties, especially when underlying mechanisms are poorly understood. There is a natural desire to substitute intellectual capital for labor, and an equally natural preference for system and rigor over methods that seem more haphazard. These are possible explanations for the current popularity of statistical models.
Indeed, far-reaching claims have been made for the superiority of a quantitative template that depends on modeling – by those who manage to ignore the far-reaching assumptions behind the models. However, the assumptions often turn out to be unsupported by the data. If so, the rigor of advanced quantitative methods is a matter of appearance rather than substance.