Data mining and the meaning of the Econometric Scripture

20 Oct, 2014 at 21:19 | Posted in Statistics & Econometrics | 1 Comment

Some variants of ‘data mining’ can be classified as the greatest of the basement sins, but other variants of ‘data mining’ can be viewed as important ingredients in data analysis. Unfortunately, these two variants usually are not mutually exclusive and so frequently conflict in the sense that to gain the benefits of the latter, one runs the risk of incurring the costs of the former.

Hoover and Perez (2000, p. 196) offer a general definition of data mining as referring to “a broad class of activities that have in common a search over different ways to process or package data statistically or econometrically with the purpose of making the final presentation meet certain design criteria.” Two markedly different views of data mining lie within the scope of this general definition. One view of ‘data mining’ is that it refers to experimenting with (or ‘fishing through’) the data to produce a specification … The problem with this, and why it is viewed as a sin, is that such a procedure is almost guaranteed to produce a specification tailored to the peculiarities of that particular data set, and consequently will be misleading in terms of what it says about the underlying process generating the data. Furthermore, traditional testing procedures used to ‘sanctify’ the specification are no longer legitimate, because these data, since they have been used to generate the specification, cannot be judged impartial if used to test that specification …

An alternative view of ‘data mining’ is that it refers to experimenting with (or ‘fishing through’) the data to discover empirical regularities that can inform economic theory … Hand et al (2000) describe data mining as the process of seeking interesting or valuable information in large data sets. Its greatest virtue is that it can uncover empirical regularities that point to errors/omissions in theoretical specifications …

In summary, this second type of ‘data mining’ identifies regularities in or characteristics of the data that should be accounted for and understood in the context of the underlying theory. This may suggest the need to rethink the theory behind one’s model, resulting in a new specification founded on a more broad-based understanding. This is to be distinguished from a new specification created by mechanically remolding the old specification to fit the data; this would risk incurring the costs described earlier when discussing the first variant of ‘data mining.’

The issue here is how should the model specification be chosen? As usual, Leamer (1996, p. 189) has an amusing view: “As you wander through the thicket of models, you may come to question the meaning of the Econometric Scripture that presumes the model is given to you at birth by a wise and beneficent Holy Spirit.”

In practice, model specifications come from both theory and data, and given the absence of Leamer’s Holy Spirit, properly so.

Peter Kennedy
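
Kennedy’s point about the first variant, that data used to generate a specification cannot impartially test it, can be illustrated with a small simulation. The sketch below is not from Kennedy; it is a minimal illustration under stated assumptions: the dependent variable and all candidate regressors are pure noise, the ‘specification search’ simply keeps whichever of k candidates has the largest |t| on the first half of the sample, and the function names and parameter values (t_stat, rejection_rates, n, k, reps, crit) are hypothetical choices made for the example. The critical value 2.048 is the two-sided 5% point of a t distribution with 28 degrees of freedom, which matches the default half-sample of 30 observations.

```python
import math
import random


def t_stat(x, y):
    """t-statistic for the slope in a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)  # sample correlation
    return r * math.sqrt((n - 2) / (1 - r * r))


def rejection_rates(n=60, k=20, reps=1000, crit=2.048, seed=0):
    """Return (mined, fresh): the share of replications in which the selected
    regressor looks 'significant' on the data used to select it, and on
    held-out data that played no part in the search.

    Assumption: y and every candidate x are independent N(0, 1) noise,
    so there is no true relationship to find.
    """
    rng = random.Random(seed)
    half = n // 2
    mined = fresh = 0
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        candidates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
        # 'Fishing': keep the candidate with the largest |t| on the first half.
        best = max(candidates, key=lambda x: abs(t_stat(x[:half], y[:half])))
        # Test on the same data that generated the specification ...
        mined += int(abs(t_stat(best[:half], y[:half])) > crit)
        # ... and on the second half, which the search never saw.
        fresh += int(abs(t_stat(best[half:], y[half:])) > crit)
    return mined / reps, fresh / reps


if __name__ == "__main__":
    mined, fresh = rejection_rates()
    print(f"rejection rate on the mined half: {mined:.2f}")
    print(f"rejection rate on the fresh half: {fresh:.2f}")
```

With 20 independent noise candidates, the chance that at least one clears a nominal 5% test is 1 − 0.95^20, or about 0.64, so the rate on the mined half should come out far above 5%, while the rate on the fresh half should stay close to 5%. Holding out data is only one possible safeguard, and it says nothing against the second, exploratory variant of data mining that Kennedy defends.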

1 Comment

  1. I’ll bet you didn’t know that the late Canadian folk singer Stan Rogers had a very popular 1979 song about data mining called ‘White Collar Holler’. The equipment mentioned is a little dated, but the lyrics do put one in mind of certain econometricians at work.
