## Shortcomings of regression analysis

16 April, 2018 at 08:45 | Posted in Statistics & Econometrics | Comments Off on Shortcomings of regression analysisDistinguished social psychologist Richard E. Nisbett has a somewhat atypical aversion to multiple regression analysis. In his *Intelligence and How to Get It* (Norton 2011) he writes:

Researchers often determine the individual’s contemporary IQ or IQ earlier in life, socioeconomic status of the family of origin, living circumstances when the individual was a child, number of siblings, whether the family had a library card, educational attainment of the individual, and other variables, and put all of them into a multiple-regression equation predicting adult socioeconomic status or income or social pathology or whatever. Researchers then report the magnitude of the contribution of each of the variables in the regression equation, net of all the others (that is, holding constant all the others). It always turns out that IQ, net of all the other variables, is important to outcomes. But … the independent variables pose a tangle of causality – with some causing others in goodness-knows-what ways and some being caused by unknown variables that have not even been measured. Higher socioeconomic status of parents is related to educational attainment of the child, but higher-socioeconomic-status parents have higher IQs, and this affects both the genes that the child has and the emphasis that the parents are likely to place on education and the quality of the parenting with respect to encouragement of intellectual skills and so on. So statements such as “IQ accounts for X percent of the variation in occupational attainment” are built on the shakiest of statistical foundations. What nature hath joined together, multiple regressions cannot put asunder.

Now, I think that what Nisbett says is right as far as it goes, although it would certainly have strengthened Nisbett’s argumentation if he had elaborated more on the methodological question around causality, or at least had given some mathematical-statistical-econometric references. Unfortunately, his alternative approach is not more convincing than regression analysis. As so many other contemporary social scientists today, Nisbett seems to think that randomization may solve the empirical problem. By randomizing we are getting different “populations” that are homogeneous in regards to all variables except the one we think is a genuine cause. In this way, we are supposed to be able to not have to actually know what all these other factors are.

If you succeed in performing an *ideal* randomization with different treatment groups and control groups that is attainable. *But* it presupposes that you really have been able to establish — and not just assume — that the probability of all other causes but the putative have the same probability distribution in the treatment and control groups, and that the probability of assignment to treatment or control groups is independent of all other possible causal variables.

Unfortunately, *real *experiments and *real* randomizations seldom or never achieve this. So, yes, we may do without knowing *all *causes, but it takes *ideal* experiments and *ideal* randomizations to do that, not *real *ones.

As I have argued — e. g. here — that means that in practice we do have to have sufficient background knowledge to deduce causal knowledge. Without old knowledge, we can’t get new knowledge — and, no causes in, no causes out.

On the issue of the shortcomings of multiple regression analysis, no one sums it up better than eminent mathematical statistician David Freedman:

If the assumptions of a model are not derived from theory, and if predictions are not tested against reality, then deductions from the model must be quite shaky. However, without the model, the data cannot be used to answer the research question …

In my view, regression models are not a particularly good way of doing empirical work in the social sciences today, because the technique depends on knowledge that we do not have. Investigators who use the technique are not paying adequate attention to the connection – if any – between the models and the phenomena they are studying. Their conclusions may be valid for the computer code they have created, but the claims are hard to transfer from that microcosm to the larger world …

Regression models often seem to be used to compensate for problems in measurement, data collection, and study design. By the time the models are deployed, the scientific position is nearly hopeless. Reliance on models in such cases is Panglossian …

Given the limits to present knowledge, I doubt that models can be rescued by technical fixes. Arguments about the theoretical merit of regression or the asymptotic behavior of specification tests for picking one version of a model over another seem like the arguments about how to build desalination plants with cold fusion and the energy source. The concept may be admirable, the technical details may be fascinating, but thirsty people should look elsewhere …

Causal inference from observational data presents may difficulties, especially when underlying mechanisms are poorly understood. There is a natural desire to substitute intellectual capital for labor, and an equally natural preference for system and rigor over methods that seem more haphazard. These are possible explanations for the current popularity of statistical models.

Indeed, far-reaching claims have been made for the superiority of a quantitative template that depends on modeling — by those who manage to ignore the far-reaching assumptions behind the models. However, the assumptions often turn out to be unsupported by the data. If so, the rigor of advanced quantitative methods is a matter of appearance rather than substance.

Create a free website or blog at WordPress.com.

Entries and comments feeds.