Sherlock Holmes inference and econometric testing

23 Oct, 2014 at 15:10 | Posted in Statistics & Econometrics

Sherlock Holmes stated that ‘It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.’ True as this may be in the context of a crime investigation, the principle does not apply to statistical testing. In a crime investigation one wants to know what actually happened: who did what, when, and how. Testing is somewhat different.

With testing, not only what happened is of interest, but also what could have happened, and what would have happened had the circumstances repeated themselves. The particular events under study are treated as draws from a larger population. It is the distribution of this population one is primarily interested in, not so much the particular realizations of that distribution. So it is not the particular sequence of heads and tails in coin flipping that is of interest, but what it says about whether the coin is biased. Likewise, it is not (only) whether inflation and unemployment moved together in the sixties that is interesting, but what that tells us about the true trade-off between these two economic variables. In short, one wants to test.
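To make the coin example concrete, here is a minimal sketch in Python (assuming SciPy 1.7 or later is available; the counts are invented for illustration) of asking what an observed sequence of heads and tails says about the underlying coin, rather than about the sequence itself.

```python
from scipy.stats import binomtest

# A hypothetical observed sequence: 60 heads out of 100 flips.
# The question is not what this particular sequence was, but what it
# says about the underlying distribution, i.e. whether the coin is fair.
n_flips = 100
n_heads = 60

result = binomtest(n_heads, n_flips, p=0.5, alternative='two-sided')
print(f"Observed {n_heads}/{n_flips} heads, p-value = {result.pvalue:.3f}")
# A small p-value would suggest a biased coin; a large one means the
# observed sequence is quite compatible with a fair coin.
```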

The tested hypothesis has to come from somewhere, and to base it, like Holmes, on data is a valid procedure … The theory should, however, not be tested on the same data it was derived from. Using significance as a selection criterion in a regression equation constitutes a violation of this principle …
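One way to honour that principle, sketched below under the assumption that enough data are available to split the sample (the variables and the data-generating process are invented for illustration), is to select the specification on one part of the data and then test the resulting hypothesis on a held-out part it was not derived from.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: y depends on x1 only; x2 is irrelevant noise.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))

# Split: derive the specification on the first half ...
X_train, y_train = X[:100], y[:100]
# ... and test the chosen hypothesis on the second half only.
X_test, y_test = X[100:], y[100:]

fit_train = sm.OLS(y_train, X_train).fit()
print("Selection-stage p-values:", fit_train.pvalues.round(3))

# Suppose the selection stage suggests 'x1 matters'. That hypothesis is
# then tested on data it was NOT derived from:
fit_test = sm.OLS(y_test, X_test[:, :2]).fit()   # constant and x1 only
print("Hold-out p-value for x1:", round(fit_test.pvalues[1], 3))
```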

Consider for example time series econometrics … It may not be clear a priori which lags matter, while it is clear that some definitely do … The Box-Jenkins framework first models the autocorrelation structure of a series as well as possible, postponing inference to the next stage. In that next stage other variables or their lagged values may be related to the time series under study. While this justifies why time series analysis uses data mining, it leaves unaddressed the issue of the true level of significance …
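The two-stage logic can be caricatured in a few lines. The sketch below assumes statsmodels is installed, uses an invented series and a hand-picked lag order, and is not the full Box-Jenkins identification machinery: first model the autocorrelation structure of the series itself, then relate what remains to another variable.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)

# Hypothetical AR(1) series plus a contribution from an exogenous variable z.
n = 300
z = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + 0.3 * z[t] + rng.normal()

# Stage 1: model the autocorrelation structure of y itself (an AR(1) here,
# chosen by hand for the example; in practice the order would be selected,
# e.g. by comparing information criteria).
stage1 = ARIMA(y, order=(1, 0, 0)).fit()

# Stage 2: only now relate what the ARMA part leaves unexplained to z.
stage2 = sm.OLS(stage1.resid, sm.add_constant(z)).fit()
print("Stage-2 coefficient on z:", round(stage2.params[1], 3))
```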

This is sometimes recommended in a general-to-specific approach, where the most general model is estimated and insignificant variables are subsequently discarded. As superfluous variables increase the variance of estimators, omitting irrelevant variables in this way may increase efficiency. The problem is that the variables were included in the first place because they were thought to be (potentially) relevant. If, for example, twenty variables believed a priori to be potentially relevant are included, then one or more is bound to be insignificant (depending on the power, which cannot be trusted to be high). Omitting relevant variables, whether insignificant or not, generally biases all other estimates as well, due to the well-known omitted-variable bias. The data are thus used both to specify the model and to test it, and without further adjustment this double use of the data is bound to be misleading, if not incorrect. The tautological nature of the procedure is apparent: since significance is the selection criterion, it is not very surprising that the selected variables are significant.
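The circularity is easy to demonstrate by simulation. In the sketch below (assuming numpy and statsmodels; all numbers are invented), none of the twenty candidate regressors actually matters, yet whichever ones survive the significance cut look significant in the refitted model, precisely because significance was the selection criterion.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

n_obs, n_vars = 100, 20
# Twenty candidate regressors that are, by construction, all irrelevant.
X = rng.normal(size=(n_obs, n_vars))
y = rng.normal(size=n_obs)

# General model: include everything, then discard the insignificant variables.
general = sm.OLS(y, sm.add_constant(X)).fit()
keep = np.where(general.pvalues[1:] < 0.05)[0]   # indices of 'significant' regressors
print("Variables surviving the 5% cut:", keep)

if keep.size > 0:
    # Specific model: refit with the survivors only. Because significance was
    # the selection criterion, their reported p-values look impressive even
    # though every true coefficient is zero.
    specific = sm.OLS(y, sm.add_constant(X[:, keep])).fit()
    print("p-values in the 'specific' model:", specific.pvalues[1:].round(3))
```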

D. A. Hollanders, ‘Five methodological fallacies in applied econometrics’
