Delivering the 2014 Significance Lecture yesterday at the Royal Statistical Society International Conference, Harford made a compelling case that important statistical lessons of the past still apply, even though big data advocates might try to wish them away.
Harford began by very clearly defining what he meant by big data, at least in the context of this talk. The data he was referring to was 'found data', the type that's created when our mobile phones ping mobile phone masts, when we update Facebook, search the web or tweet our frustrations about a particular story in the news.
There are opportunities in this data, Harford said. It might be used to improve our understanding of human behaviour in all kinds of ways. But, he warned, there are threats there too – and those threats are amplified when we forget to apply basic statistical concepts while handling the data.
Harford gave the example of Google Flu Trends, a fascinating experiment by Google.org to use search data to estimate the prevalence of flu in the US population at any given time. It was a simple idea that delivered extraordinarily accurate results when compared to official Centers for Disease Control and Prevention (CDC) data – and it did so in almost real-time, something the CDC couldn't do.
For a while, Google Flu Trends was the poster child for the power of big data. But then it started to go wrong. Its estimates began to overshoot those of the CDC by a factor of two. Various explanations were offered. Maybe Google's suggested search terms were prompting more people to search for flu-related topics than would otherwise have done so? Perhaps media coverage of flu outbreaks was leading people to search online for information on flu even if they themselves weren't exhibiting symptoms?
The problem, generally speaking, was that Flu Trends was basing its estimates on the volume of searches for key words or phrases, but it did not – could not – know why people were searching those terms without asking them. There were other factors driving search behaviour that it had not accounted for.
Hammering home the point that size isn't everything when it comes to data, Harford recounted the story of The Literary Digest's famous and faulty prediction of an Alf Landon landslide in the 1936 US presidential election. The prediction was based on a survey of two million Digest readers, automobile owners and households with telephones. But the magazine had failed to realise that the sorts of households that could afford phones, cars and magazine subscriptions during the depths of the Great Depression were probably not representative of the overall population. Instead, it was George Gallup's opinion poll, with a considerably smaller but more balanced sample, that correctly called the election for Franklin Roosevelt.
The magazine folded soon after. 'Was this correlation or causation?' Harford wondered.
Hidden biases in data are a problem. Even the largest of datasets have bits of information missing. Quoting Microsoft researcher Kate Crawford, Harford noted that researchers may think they have all the data, but there will always be people missing from any dataset.
To illustrate this, Harford pointed to the City of Boston's Street Bump smartphone app – a clever idea to tackle the problem of potholes. Bostonians were encouraged to download the app and set it running when out in their cars so that when their vehicles hit a pothole, the bump would be recorded by the phone's accelerometer and location data sent to the city's public works department. What happened, of course, was that most of the potholes that were identified and fixed were those in young, affluent areas – areas where people owned smartphones and could download the app.
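The selection effect Harford describes can be seen in a toy simulation. The neighbourhood names, pothole counts and smartphone-ownership rates below are invented for illustration; the point is only that when reports come solely from smartphone owners, the area with the most potholes can end up generating the fewest reports.

```python
import random

random.seed(42)

# Hypothetical figures: every neighbourhood has potholes, but a report
# is only generated when a passing driver happens to run the app,
# which we approximate by the local smartphone-ownership rate.
neighbourhoods = [
    # (name, actual potholes, assumed share of residents with smartphones)
    ("affluent", 40, 0.80),
    ("middle",   50, 0.40),
    ("poor",     60, 0.10),
]

reports = {}
for name, potholes, smartphone_share in neighbourhoods:
    reports[name] = sum(
        1 for _ in range(potholes) if random.random() < smartphone_share
    )

for name, potholes, _ in neighbourhoods:
    print(f"{name}: {potholes} actual potholes, {reports[name]} reported")
```

Under these assumed numbers, the ranking inverts: the poorest area has the most potholes on the ground but the fewest in the data, which is exactly the kind of hidden bias a public works department acting on the reports alone would never see.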
City officials might have thought they had found a way to record every pothole, but that wasn't the case. As Harford concluded: 'Some might think we are now able to measure everything; that we can turn everything into numbers. But we need to be wise enough to know that is always an illusion.'