## Causal inference from observational data

1 Jun, 2021 at 14:23 | Posted in Economics | 1 Comment

Researchers often determine the individual’s contemporary IQ or IQ earlier in life, socioeconomic status of the family of origin, living circumstances when the individual was a child, number of siblings, whether the family had a library card, educational attainment of the individual, and other variables, and put all of them into a multiple-regression equation predicting adult socioeconomic status or income or social pathology or whatever. Researchers then report the magnitude of the contribution of each of the variables in the regression equation, net of all the others (that is, holding constant all the others). It always turns out that IQ, net of all the other variables, is important to outcomes. But … the independent variables pose a tangle of causality – with some causing others in goodness-knows-what ways and some being caused by unknown variables that have not even been measured. Higher socioeconomic status of parents is related to educational attainment of the child, but higher-socioeconomic-status parents have higher IQs, and this affects both the genes that the child has and the emphasis that the parents are likely to place on education and the quality of the parenting with respect to encouragement of intellectual skills and so on. So statements such as “IQ accounts for X percent of the variation in occupational attainment” are built on the shakiest of statistical foundations. What nature hath joined together, multiple regressions cannot put asunder.

Now, I think this is right as far as it goes, although it would certainly have strengthened Nisbett’s argument had he elaborated more on the methodological questions around causality, or at least given some mathematical-statistical-econometric references. Unfortunately, his alternative approach is no more convincing than regression analysis. Like so many other contemporary social scientists, Nisbett seems to think that randomization solves the empirical problems. By randomizing we get different ‘populations’ that are homogeneous with regard to all variables except the one we think is a genuine cause. In that way we are supposedly able to avoid having to know what all those other factors actually are.

If you succeed in performing an *ideal* randomization with different treatment and control groups, that is indeed attainable. *But* it presupposes that you have actually been able to establish — and not just assume — that all causes other than the putative one have the same probability distribution in the treatment and control groups, and that assignment to the treatment or control group is independent of all other possible causal variables.
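A toy simulation (Python, with invented numbers) of what an *ideal* randomization buys: assignment is made independent of an unmeasured background cause `u`, so the simple difference in group means recovers the treatment effect even though `u` is never observed or measured.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

u = rng.normal(size=n)                 # unmeasured background cause
t = rng.integers(0, 2, size=n)         # ideal randomization: t independent of u
y = 2.0 * t + u + rng.normal(size=n)   # true treatment effect = 2

effect = y[t == 1].mean() - y[t == 0].mean()
print(round(effect, 2))  # ≈ 2.0
```

The unknown cause `u` still adds noise, but by construction it is balanced across the groups in expectation, which is exactly what the idealization assumes.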

Unfortunately, *real* experiments and *real* randomizations seldom or never achieve this. So, yes, we may do without knowing *all* causes, but it takes *ideal* experiments and *ideal* randomizations to do that, not *real* ones. That means that in practice we do have to have sufficient background knowledge to deduce causal knowledge. Without old knowledge, we can’t get new knowledge — ‘no causes in, no causes out.’
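A toy Python sketch (invented numbers) of how a *real*, imperfect randomization can fail: if the probability of ending up in the treatment group drifts with an unmeasured cause `u` — through differential take-up or attrition, say — the independence assumption breaks and the naive difference in means is biased, even though the ‘design’ looks randomized.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

u = rng.normal(size=n)                     # unmeasured background cause
# Imperfect assignment: probability of treatment rises with u,
# so assignment is no longer independent of the other causes.
p_treat = 1.0 / (1.0 + np.exp(-u))
t = rng.random(size=n) < p_treat
y = 2.0 * t + u + rng.normal(size=n)       # true treatment effect is still 2

effect = y[t].mean() - y[~t].mean()
print(round(effect, 2))  # well above the true effect of 2.0
```

Nothing in the observed data flags the problem; only background knowledge about how assignment actually worked can.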

On the issue of the shortcomings of multiple regression analysis, no one sums it up better than David Freedman:

Regression models often seem to be used to compensate for problems in measurement, data collection, and study design. By the time the models are deployed, the scientific position is nearly hopeless …

Causal inference from observational data presents many difficulties, especially when underlying mechanisms are poorly understood. There is a natural desire to substitute intellectual capital for labor, and an equally natural preference for system and rigor over methods that seem more haphazard. These are possible explanations for the current popularity of statistical models.

Indeed, far-reaching claims have been made for the superiority of a quantitative template that depends on modeling – by those who manage to ignore the far-reaching assumptions behind the models. However, the assumptions often turn out to be unsupported by the data. If so, the rigor of advanced quantitative methods is a matter of appearance rather than substance.
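The kind of failure Freedman and Nisbett describe can be sketched in a few lines of Python (with made-up data): an unmeasured common cause `u` drives both the measured regressor and the outcome, and OLS duly assigns `u`’s influence to the measured variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

u = rng.normal(size=n)                 # unmeasured common cause
x = 0.8 * u + rng.normal(size=n)       # measured regressor; no direct effect on y
y = 1.0 * u + rng.normal(size=n)       # outcome driven by u alone

slope = np.polyfit(x, y, 1)[0]         # OLS slope of y on x
print(round(slope, 2))  # ≈ 0.49, although the true direct effect of x is 0
```

The regression output looks as precise and ‘rigorous’ as ever; the bias lives entirely in the untestable assumption that nothing like `u` exists.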

## 1 Comment


If what we are tasked to explain is a normalized score on an exam, an “IQ test”, we might expect variables that address variation in how well the subjects test: how motivated are the test-takers to do well on the test? It should not be set aside as an idle question or irrelevant to the ultimate objects of the inquiry.


I would not suggest that intelligence does not vary or is not important in many ways, but I would aver that far from being at the center of statistical studies of intelligence in society, statistical method becomes an excuse to ignore and obscure human intelligence as the multi-faceted phenomenon it is, embedded not just in society but in personality and social circumstances and experience.

Comment by Bruce Wilder— 1 Jun, 2021 #