Big Data — poor science

17 Feb, 2017 at 15:19 | Posted in Statistics & Econometrics | 1 Comment

Almost everything we do these days leaves some kind of data trace in some computer system somewhere. When such data is aggregated into huge databases, it is called “Big Data”. It is claimed that social science will be transformed by the application of computer processing and Big Data. The argument is that social science has historically been “theory rich” and “data poor”, and that we will now be able to apply the methods of “real science” to “social science”, producing new validated and predictive theories which we can use to improve the world.

What’s wrong with this? … Firstly, what is this “data” we are talking about? In its broadest sense, it is some representation, usually in a symbolic form, that is machine readable and processable. And how will this data be processed? Using some form of machine learning or statistical analysis. But what will we find? Regularities or patterns … What do such patterns mean? Well, that will depend on who is interpreting them …

Looking for “patterns or regularities” presupposes a definition of what a pattern is and that presupposes a hypothesis or model, i.e. a theory. Hence big data does not “get us away from theory” but rather requires theory before any project can commence.

What is the problem here? The problem is that a certain kind of approach is being propagated within the “big data” movement that claims not to be committed a priori to any theory or view of the world. The idea is that data is real and theory is not, and that theory should be induced from the data in a “scientific” way.

I think this is wrong and dangerous. Why? Because it is not clear or honest while appearing to be so. Any statistical test or machine learning algorithm expresses a view of what a pattern or regularity is, and any data has been collected for a reason, based on what is considered appropriate to measure. One algorithm will find one kind of pattern and another will find something else. One data set will evidence some patterns and not others. Selecting an appropriate test depends on what you are looking for. So the question posed by the thought experiment remains: “what are you looking for, what is your question, what is your hypothesis?”
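The claim that one algorithm finds one kind of pattern and another finds something else can be made concrete with a small sketch. The Python below is an illustration added here, not part of Hales's text: the seven data points and the helper name `pearson` are made up for the example. It asks the same data two different questions. A test that looks for a straight-line relationship reports nothing, while a test that presupposes a quadratic relationship reports a perfect fit.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient: measures *linear* association only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: y depends on x exactly, but not along a straight line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

# Question 1: "Is y linearly related to x?"
linear_r = pearson(xs, ys)

# Question 2: "Is y linearly related to x squared?"
# Asking this already encodes a hypothesis about the form of the pattern.
quadratic_r = pearson([x * x for x in xs], ys)

print(f"linear test:    r = {linear_r:.3f}")
print(f"quadratic test: r = {quadratic_r:.3f}")
```

The data are identical in both cases. Whether a “pattern” is reported depends entirely on the hypothesis built into the test, which is the sense in which the choice of algorithm is already a theoretical commitment.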

Ideas matter. Theory matters. Big data is not a theory-neutral way of circumventing the hard questions. In fact it brings these questions into sharp focus and it’s time we discuss them openly.

David Hales

1 Comment

  1. I recently completed a doctoral program in agricultural science. Relying on output from computation-intensive programs and algorithms is, of course, deeply ingrained in the worldview of my profession. Our professors even stated that “with the computational power available today, having a correlation or well-fitted model takes the place of causation”. Apparently, understanding mechanisms and cause/effect relationships in ANY form is viewed by some as unnecessary.

    Traditionally, agronomic science was subject to rigorous peer review and unflinching adherence to biometric methods. Today, computational power gives researchers a model that fits their data; they draw conclusions quickly and move on. It is small wonder that progress in understanding the relationship between genes and their physical expression is so slow.
