## Let’s empty the econometric garbage can!

8 Jan, 2015 at 17:06 | Posted in Statistics & Econometrics

This is where statistical analysis enters. Validation comes in many different forms, of course, and much good theory testing is qualitative in character. Yet when applicable, statistical theory is our most powerful inductive tool, and in the end, successful theories have to survive quantitative evaluation if they are to be taken seriously. Moreover, statistical analysis is not confined to theory evaluation. Quantitative analysis also discovers empirical generalizations that theory must account for. Scientific invention emerges from data and experiment as often as data and experiment are used to confirm prior theory …

How is all this empirical creativity and validation to be achieved? Most empirical researchers … believe that they know the answer. First, they say, decide which explanations of a given phenomenon are to be tested. One or more such hypotheses are set out. Then “control variables” are chosen— factors which also affect the phenomenon under study, but not in a way relevant to the hypotheses under discussion. Then measures of all these explanatory factors are entered into a regression equation (linearly), and each variable is assigned a coefficient with a standard error. Hypotheses whose factors acquire a substantively and statistically significant coefficient are taken to be influential, and those that do not are treated as rejected. Extraneous influences are assumed to be removed by the “controls” …

In the great majority of applied work with all these methods, a particular statistical distribution is specified for the dependent variable, conditional on the independent variables. The explanatory factors are postulated to exert their influence through one or more parameters, usually just the mean of the statistical distribution for the dependent variable. The function that connects the independent variables to the mean is known as the “link function” …

In practice, researchers nearly always postulate a linear specification as the argument of the link function … Computer packages often make this easy: One just enters the variables into the specification, and linearity is automatically applied. In effect, we treat the independent variable list as a garbage can: Any variable with some claim to relevance can be tossed in. Then we carry out least squares or maximum likelihood estimation (MLE) or Bayesian estimation or generalized method of moments, perhaps with the latest robust standard errors. It all sounds very impressive. It is certainly easy: We just drop variables into our mindless linear functions, start up our computing routines, and let ’er rip …

Linear link functions are not self-justifying. Garbage-can lists of variables entered linearly into regression, probit, logit, and other statistical models have no explanatory power without further argument. In the absence of careful supporting argument, the results belong in the statistical rubbish bin …

In sum, we need to abandon mechanical rules and procedures. “Throw in every possible variable” won’t work; neither will “rigidly adhere to just three explanatory variables and don’t worry about anything else.” Instead, the research habits of the profession need greater emphasis on classic skills that generated so much of what we know in quantitative social science: plots, crosstabs, and just plain looking at data. Those methods are simple, but sophisticatedly simple. They often expose failures in the assumptions of the elaborate statistical tools we are using, and thus save us from inferential errors.

Christopher H. Achen

This paper is one of my absolute favourites. Why? I guess it’s because Achen reaffirms my firm conviction that since there is no absolutely certain knowledge to be had in the social sciences, economics included, explicit argumentation and justification ought to play an extremely strong role if purported knowledge claims are to be properly warranted. Or as Achen puts it, without careful supporting arguments, “just dropping variables into SPSS, STATA, S or R programs accomplishes nothing.”
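To make Achen's point about "just plain looking at data" a bit more concrete, here is a minimal sketch in Python. Everything in it is my own illustration and not taken from Achen's paper: the data are simulated, the U-shaped relationship is an assumption chosen to make the point, and the helper names `ols_slope_intercept` and `binned_means` are hypothetical. The idea is simply that a variable dropped linearly into a regression can come out with a slope near zero even though the outcome depends on it strongly, something a crude crosstab of group means reveals at once.

```python
import random
import statistics


def ols_slope_intercept(xs, ys):
    """Least-squares fit of y = a + b*x for a single regressor."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def binned_means(xs, ys, edges):
    """Mean of y within each [edges[i], edges[i+1]) bin of x: a crude crosstab."""
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [y for x, y in zip(xs, ys) if lo <= x < hi]
        means.append(statistics.fmean(in_bin))
    return means


# Simulated data (an assumption for illustration): y depends on x only
# through x squared, plus noise, so the true relationship is U-shaped.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(2000)]
ys = [x ** 2 + rng.gauss(0, 1) for x in xs]

# The "garbage can" step: enter x linearly and read off its coefficient.
intercept, slope = ols_slope_intercept(xs, ys)

# The "look at the data" step: average y within unit-wide bins of x.
means = binned_means(xs, ys, [-3, -2, -1, 0, 1, 2, 3])

print(f"linear slope: {slope:.3f}")
print("mean of y by bin of x:", [round(m, 2) for m in means])
```

In this constructed case the fitted slope is close to zero, so a mechanical reading would treat the variable as having no influence, while the binned means are high at both ends and low in the middle. The example proves nothing about any real data set; it only shows how a simple tabulation can expose a failure of the linearity assumption that the regression output itself hides.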