Lies, Damned Lies and Big Data

By David Hales

Almost everything we do these days leaves some kind of data trace in some computer system somewhere. When such data is aggregated into huge databases it is called “Big Data”. It is claimed social science will be transformed by the application of computer processing and Big Data. The argument is that social science has, historically, been “theory rich” and “data poor” and now we will be able to apply the methods of “real science” to “social science” producing new validated and predictive theories which we can use to improve the world.

What’s wrong with this? On one level nothing. We know so little about the social world that anything is worth a try. Mining these huge databases will almost certainly lead to new ideas and insights. However, before we run headlong into this new world of big data, promoted as it is by corporations such as IBM and the large consultancies, perhaps we might benefit from a little critical reflection.

Firstly, what is this “data” we are talking about? In its broadest sense it is some representation, usually in a symbolic form, that is machine readable and processable. And how will this data be processed? Using some form of machine learning or statistical analysis. But what will we find? Regularities or patterns (for a useful discussion of patterns within complex systems, see Greg Fisher’s previous post, Patterns Amid Complexity). What do such patterns mean? Well, that will depend on who is interpreting them.

Given this level of generality, if someone tells you they are working on “big data” it tells you almost nothing. One way to approach the issue if confronted with a “big data” project is to ask the following question based on a thought experiment:

“Imagine you had a massive computer database that contained all possible measurements that could ever be made over the entire span of all space and time. You could query it with any question and it would deliver the result instantaneously. All big data is merely a subset of this ‘biggest data that could ever exist’. What would your project ask it?”

If no coherent answer can be produced to this question then any such project is at best directionless and at worst not conscious of its aims.

One answer might be “looking for patterns or regularities in the data”.

Looking for “patterns or regularities” presupposes a definition of what a pattern is and that presupposes a hypothesis or model, i.e. a theory. Hence big data does not “get us away from theory” but rather requires theory before any project can commence.

What is the problem here? The problem is that a certain kind of approach is being propagated within the “big data” movement that claims not to be committed a priori to any theory or view of the world. The idea is that data is real and theory is not, and that theory should be induced from the data in a “scientific” way[1].

I think this is wrong and dangerous. Why? Because it is not clear or honest while appearing to be so. Any statistical test or machine learning algorithm expresses a view of what a pattern or regularity is and any data has been collected for a reason based on what is considered appropriate to measure. One algorithm will find one kind of pattern and another will find something else. One data set will evidence some patterns and not others. Selecting an appropriate test depends on what you are looking for. So the question posed by the thought experiment remains “what are you looking for, what is your question, what is your hypothesis?”
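
A minimal sketch of that point, under stated assumptions: the data set and both scoring functions below are invented for illustration and come from no real project. The same numbers are tested twice. The first test assumes that a pattern is a straight line in x; the second assumes that a pattern is a straight line in x squared.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: y is completely determined by x, but not linearly.
xs = [i / 100 for i in range(-100, 101)]
ys = [x ** 2 for x in xs]

# Hypothesis A: "a pattern is a straight-line relationship between x and y".
linear_score = abs(pearson(xs, ys))

# Hypothesis B: "a pattern is a straight-line relationship between x squared and y".
quadratic_score = abs(pearson([x ** 2 for x in xs], ys))

print(f"linear test:    {linear_score:.3f}")
print(f"quadratic test: {quadratic_score:.3f}")
```

The first score comes out close to zero and the second close to one. Neither test is wrong; each simply encodes a different prior view of what counts as a regularity, and that view had to be chosen before the data could say anything.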

It seems to me that one must at least try to answer this question if one is to pursue social science. Not just because it is good science but also because it has ethical and political implications. The view one takes of social phenomena, either consciously or through algorithms and data, frames what is and is not conceivable for past and future social reality. If you doubt the importance of such ideas, look at the history of the 20th century. Ideas matter. Theory matters. Big data is not a theory-neutral way of circumventing the hard questions. In fact it brings these questions into sharp focus and it’s time we discuss them openly.

Right now we are “data rich” and “theory poor”. We need new theory for the 21st century. That requires critical discussion, reflection, honesty and humility. It is not clear to me that such concerns are prominent within much of the “big data” movement.

Here is a more eloquent and playful take on these issues, by a colleague of mine, in the genre of that wonderful Orwell fable:





[1] Essentially then this is a Baconian view of social science.  As you might have realized, my sympathies are more in the Popperian mode of thinking.

2 Responses to “Lies, Damned Lies and Big Data”

  1. Hi Greg,

    I don’t have any ‘big data’ to support this assertion but I agree with your concerns!

    I make a similar point in a recent post, “Mystic Megaproject – Predicting the future with Big Data and Big Science (or not)”:

    Cheers, Chris

  2. carl allen says:

    Data is a rather strange resource to process.

    It can yield metal ore to be processed but also gems to be cut and polished.

    The ores form frames (patterns and regularities) and the gems are then placed into the frames.

    All rather elementary, I’m afraid, but so clear that it can be difficult to see.
