When British Prime Minister Theresa May called a snap election for 8 June 2017, it seemed like a smart move politically. Her Conservative Party was riding high in the opinion polls, with a YouGov poll in the Times giving them 44%, a lead of 21 points over their nearest rivals, the Labour Party. Were an election to be held the next day (as surveys often suppose), May looked to be on course for a convincing win.
But then came the obvious question: “Can we actually trust the polls?” The media seemed sceptical. Though they had not shied away from reporting poll results in the months since the 2015 general election, they were clearly still sore about the errors made last time, when survey results mostly indicated the country was heading for a hung parliament.
So, can we trust the polls this time around? It’s not possible to say until we have election results to compare them to. But what we can do is consider the work that’s been done to try to fix whatever went wrong in 2015.
There’s a lot to cover, so I’ve broken the story up by key dates and periods:
- The election – 7 May 2015
- The reaction – 8-10 May
- The suspects
- Early speculation – 11 May-18 June
- The Sturgis inquiry meets – 19 June
- The investigation focuses – 20 June-31 December
- Unrepresentative samples indicted – 1 January-30 March 2016
- The Sturgis inquiry report – 31 March
- A heated debate – 1 April-22 June
- EU referendum and reaction – 23 June-19 July
- US presidential election and reaction – 20 July-31 December
- The calm before the storm – 8 December 2016-18 April 2017
- Have the polls been fixed?
The night before the 2015 General Election, the atmosphere was tense but calm. The polling consensus was that there would be a hung parliament, with the coalition government that had been in power since 2010 continuing – albeit with the Conservatives’ junior coalition partners, the Liberal Democrats, wounded but still viable. There were some dissenting views, however. An analyst, Matt Singh, had noticed an odd tendency by pollsters to underestimate the Conservatives, and Damian Lyons Lowe of the company Survation was puzzling over an anomalous poll which appeared to show the Conservatives heading for victory. But mostly people were confident that a hung parliament was the most likely outcome.
Skip forward 24 hours and the nation had voted, with the more politically engaged parts of the electorate settling down to watch David Dimbleby host the BBC election special, with Andrew Neil interrogating Michael Gove, Paddy Ashdown, Alistair Campbell and other senior figures. As Big Ben struck 10pm, exit poll results were released – and they showed something quite unexpected: the Conservatives were the largest party with a forecast of 316 Commons seats; the Liberal Democrats had dropped from 57 seats to just 10. As the night wore on, the Conservative victory gradually exceeded that predicted by the exit poll, and by the middle of the next day, the party had achieved a majority. The government to command the confidence of the new House would be very different to that predicted by the polls.
The polls had failed.
The public reaction to the failure of the polls was derisive – but the failure shouldn’t have come as a surprise. Similar misses were made in 1970 and 1992. The response from the industry was swift: within a day, the British Polling Council (BPC) and the Market Research Society (MRS) announced that a committee of inquiry was to be formed and recommendations made. The committee consisted mostly of academics, with a scattering of pollsters, and was chaired by Professor Patrick Sturgis of the National Centre for Research Methods (NCRM) at the University of Southampton. The committee has since come to be referred to as the “Sturgis inquiry”.
When pollsters or academics get together to talk about polling, certain well-worn issues will surface. Some of these are well-known to the public; others are more obscure. During the investigation, these issues were chewed over until prime suspects were identified and, eventually, one selected as the most probable cause. In no particular order, the list of suspects is given below.
|Assumption decay
||Polling samples are weighted according to certain assumptions. Those assumptions become less valid as time passes, and the weighting may become biased.
|Benchmarking
||Resetting the weights and assumptions after an event with a known vote distribution, usually immediately after an election.
|Late swing
||The propensity of many voters to make up their minds, or change their minds, close to the election or even on the day itself. It was accepted as the cause of the 1970 polling failure, then accepted with reservations as a cause of the 1992 failure. Would it achieve traction for 2015, or would it be seen as implausible a third time around?
|Shy Tories
||Confusingly, this phrase is used to describe two different but similar problems – people refusing to answer pollsters, and people lying to pollsters – and the trouble either causes if it varies by party. The former is sometimes called missing not at random (MNAR), the latter voter intention misreporting. The phrase differential response may also be used, and social desirability bias or social satisficing may describe the motivation. Despite the name, the problem is not limited to a specific party. Differential don’t knows may be accorded their own category.
|Lazy Labour*
||This phrase is used to describe people who say they will vote but in fact do not cast a vote. Better known as differential turnout or differential turnout by party. Again, despite the name, it is not limited to a specific party.
|Mode effects
||Effects that vary depending on the method used to obtain samples, e.g. telephone polls or online polls.
|Non-random and/or non-representative sample
||In the UK, samples taken via telephone polls or online opt-in panels are usually neither random nor representative, due to causes such as quota sampling or self-selection. The similar terms unrepresentative sample or biased sample may be used.
|Sample size
||Samples too small to produce sufficient power for the results.
|Margin of error
||The margin of error is the band within which one may expect the actual number to lie, with a certain probability. An associated but different term is confidence interval. The AAPOR restricts the use of the term “margin of error” for non-random samples, due to a lack of theoretical justification.
|Voter registration
||Those registered to vote. Registration has recently changed to individual voter registration, raising the possibility of differential registration.
|Likely voters
||The phrase is more formalised in the US; in the UK it simply refers to people who are likely to vote. A similar term is propensity to vote. If it differs by party, see Lazy Labour above.
|Overseas voters
||Some jurisdictions allow votes to be cast outside their borders, usually by post.
|Question format
||How the sample questions are asked and their order may alter the result.
*The phrase is invented for this article
The first days of the post-election investigations were wide ranging, with speculation over several possible causes. One of the first pollsters off the mark was ComRes, who on 11 May proposed benchmarking between elections to prevent assumption decay, and stronger methods of identifying registered and likely voters. The BBC considered the obvious suspects: late swing, Shy Tories, too-small samples and Lazy Labour, but it thought they were all unconvincing. YouGov similarly discarded late swing and mode effects (online versus phone polls) and said they would investigate turnout, differential response rates and unrepresentative samples. There was brief interest in regulating pollsters via a Private Member’s Bill, but objections were raised and the bill later stalled.
By June, ComRes had considered the matter further, and proposed that turnout in the 2015 election had varied by socioeconomic group and that this differential turnout could be modelled using a voter turnout model based on demographics. With admirable efficiency, they called this model the ComRes Voter Turnout Model.
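The idea behind a demographically-based turnout model can be sketched in a few lines. The turnout probabilities and poll responses below are invented for illustration – they are not ComRes’s actual figures or method:

```python
# Sketch of demographic turnout weighting, in the spirit of a
# voter turnout model based on socioeconomic group. All turnout
# probabilities and responses are illustrative assumptions only.

# Hypothetical probability of voting, by socioeconomic group
turnout_prob = {"AB": 0.75, "C1": 0.68, "C2": 0.62, "DE": 0.57}

# Hypothetical raw sample: (socioeconomic group, stated vote intention)
sample = [
    ("AB", "Con"), ("AB", "Con"), ("C1", "Con"), ("C1", "Lab"),
    ("C2", "Lab"), ("C2", "Lab"), ("DE", "Lab"), ("DE", "Lab"),
]

# Weight each respondent by their group's probability of actually
# voting, then compute weighted vote shares.
totals = {}
for group, vote in sample:
    totals[vote] = totals.get(vote, 0.0) + turnout_prob[group]

weight_sum = sum(totals.values())
shares = {vote: t / weight_sum for vote, t in totals.items()}
print(shares)
```

Because the higher-turnout groups lean Conservative in this toy sample, the weighted Conservative share (about 41.6%) comes out above the raw share (37.5%) – which is the direction of correction such a model is intended to supply.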
At this point, Jon Mellon of the British Election Study (BES) joined the fray, publishing his initial thoughts on 11 June. The possible causes he was investigating were late swing, differential turnout/registration, biased samples or weighting, differential “don’t knows” and Shy Tories, although he would only draw conclusions once the BES post-election waves had been studied.
The inquiry team held their first formal meeting on 19 June, at which they heard presentations from most of the main polling companies, namely ComRes, ICM, Ipsos Mori, Opinium, Populus, Survation and YouGov. Most companies (though not Survation) thought there was little evidence of late swing being a cause. Shy Tories was mentioned but did not gain much traction. Discussion revolved around altering factors or weights, such as different age groups, socioeconomic groups, geographical spread, previous voting, and others. Most pollsters were considering turnout problems – although some defined that as people saying they would vote but not doing so (differential turnout), and others defined it as not sampling enough non-voters (unrepresentative samples). Martin Boon of ICM was concerned about the resurgence of Labour overestimates and the difficulty of obtaining a representative sample, particularly given that response rates had dropped so low that he needed 20,000 phone numbers to get 1,000 responses.
Jon Mellon from BES came in again, this time with Chris Prosser. They started to narrow down the list of suspects, discarding late swing, “don’t knows” and Shy Tories, instead thinking differential turnout and unrepresentative samples were the avenues to explore. Conversely, ICM was still working on their Shy Tories adjustment method by fine-tuning it to cope with those who refused to say both how they intended to vote and how they voted last time – but the company acknowledged that this was just the precursor to other changes.
Proceedings were briefly interrupted by an article by Dan Hodges noting the existence of “herding” (the tendency of polls to clump together as the election date approaches) and accusing the pollsters of collusion. This was met by a rapid rebuttal from Matt Singh, who disagreed that herding happens and gave his reasons. Ironically, Singh’s reasoning was later contradicted by the inquiry, which concluded that herding did happen, albeit innocently.
By November, Mellon and Prosser had firmed up their conclusion: they still discarded late swing, “don’t knows” and Shy Tories, but they now downplayed differential turnout and named unrepresentative samples as the prime suspect. Having too many politically-engaged people and an age imbalance in the sample, combined with weighting to match the general population instead of the voting population, resulted in the observed inaccuracy in the pre-election polls.
By December, YouGov was agreeing with the BES. It discounted late swing, misreporting and differential turnout, and instead identified the problem as too many politically-engaged people on its panel and too few older people, leaving its samples with too many young respondents and not enough over-70s. It proposed curing this by attracting more politically-disengaged people and introducing sampling targets for the over-70s. It also agreed with the BES’s observation that weighting to the population, rather than to those who would vote, acted as an error multiplier, and noted ICM’s statement that phone polls continued to collect too many Labour voters in the raw sample. As the year came to an end, Anthony Wells of YouGov wrote that “despite the many difficulties there are in getting a representative sample of the British public, I still think those difficulties are surmountable, and that ultimately, it’s still worth trying to find out and quantify what the public think”.
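The “error multiplier” point – that weighting to the general population rather than to likely voters inflates the share of parties favoured by low-turnout groups – can be made concrete with a toy example. All the numbers below are invented for illustration:

```python
# Toy demonstration: weighting to the *population* age mix, rather
# than to the mix of people who actually *vote*, overstates the
# party favoured by low-turnout groups. All figures are invented.

pop_share   = {"young": 0.5, "old": 0.5}   # population age mix
turnout     = {"young": 0.5, "old": 0.8}   # probability of voting
lab_support = {"young": 0.6, "old": 0.35}  # Labour support in group

# Weighting the sample to the population age mix:
lab_pop = sum(pop_share[g] * lab_support[g] for g in pop_share)

# Weighting to the mix of people who actually vote:
voters = {g: pop_share[g] * turnout[g] for g in pop_share}
total = sum(voters.values())
lab_vote = sum((voters[g] / total) * lab_support[g] for g in voters)

print(f"weighted to population: {lab_pop:.3f}")   # 0.475
print(f"weighted to voters:     {lab_vote:.3f}")  # ~0.446
```

With these invented figures, population weighting puts Labour nearly three points higher than voter weighting – the same sample, amplified into a larger error purely by the choice of weighting target.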
2016 started with a notable intervention. John Curtice published “The benefits of random sampling: Lessons from the 2015 UK General Election”, which targeted unrepresentative samples as the prime cause of the polling failure and suggested random sampling as the cure.
On 19 January, the Sturgis inquiry released its preliminary findings. It rejected many things that had been suggested as the cause: postal votes, falling voter registration, overseas voters, question wording/ordering, differential turnout reporting, mode effect, late swing and voter intention misreporting. In common with Curtice and the BES, it concluded that unrepresentative samples were to blame. It pointed out that herding existed, albeit innocently. But in deference to the pollsters, the inquiry acknowledged that random sampling was impracticable (given the cost implications) and would not be recommended.
As for the pollsters, Ipsos Mori agreed that it had sampled too many engaged voters (particularly younger people who tend to support Labour) and hadn’t interviewed enough non-voters, but it still thought it had too many respondents who said they were going to vote Labour but didn’t actually cast a vote.
By this point the pollsters had made interim changes of various magnitudes – ComRes, BMG and YouGov the biggest, others only minor. All had further changes in mind but were waiting for the Sturgis inquiry to publish its report.
Meanwhile, a strange phenomenon was surfacing: online polls and telephone polls were showing a marked difference for the upcoming EU referendum, and people were beginning to notice…
On 31 March, the Sturgis inquiry released its report. It discarded postal votes, overseas voters, voter non-registration, Shy Tories (whether via question wording or deliberate misreporting), mode effects (online vs. phone) and differential turnout. It indicted unrepresentative samples as the prime cause of the failure – specifically an age imbalance, with too many people under the age of 70 and not enough aged 75 and older. It gently contradicted Matt Singh’s assertion that there was no herding, although it did not think herding caused the problem.
It made 12 recommendations split into three groups (see box).
For the pollsters, it recommended refining their techniques and weights to correct their unrepresentative samples. It also made some recommendations about things that did not cause the problem but were nevertheless thought improvable, such as how they handled “don’t knows” and postal votes, and how they estimated turnout.
For the Economic and Social Research Council (ESRC), it recommended the funding of a pre- and post-election random probability survey as part of the British Election Study in the next general election campaign.
For the BPC, it recommended improving transparency by changing the rules to require pollsters to report explicitly their methods, statistical tests and confidence intervals, and to require poll registration and provision of microdata. The recommendations regarding confidence intervals were a nod to a debate on whether margin of error should be used for non-random samples: the theoreticians state that the theoretical justification is weak and misleading; the pragmatists state that it is a good rule of thumb and alternatives, such as credible intervals, are themselves problematic.
The BPC agreed to put into practice the inquiry’s recommendations, some immediately, some by early 2017, some before 2020. The pollster ComRes agreed with the report, pointing out that it already included questions when appropriate about whether respondents had voted by post and was trying to deal with unrepresentative samples. But, just like Ipsos Mori in January, it still considered differential turnout to be relevant and had implemented its new ComRes Voter Turnout Model accordingly.
While all this work was going on, the referendum polls intruded. After some years of discussion, the UK Government had decided to hold a referendum on the UK’s membership of the European Union, to take place on 23 June 2016. The debate in the weeks before the referendum generated more heat than light, and contributing to the heat was the discussion about the difference between telephone polls and online polls. The telephone polls gave Remain a small but definite lead; the online polls were neck and neck. The difference was too marked to ignore.
The problem became farcical when ICM released two polls on 16 May covering the same period: one a telephone poll with an 8% Remain lead, the other an online poll with a 4% Leave lead. Plainly there was something wrong, and a few days later there was a Twitter spat between Peter Kellner and Stephan Shakespeare, respectively the past president and current CEO of YouGov, with Kellner saying the phone polls were right and Shakespeare saying not. Time was spent over the next week or so discussing the problem with no real resolution: some said one or the other was right; others said the truth was somewhere between the two. But nobody noticed that both phone and online polls were overestimating Remain.
The referendum was getting ever closer. Opinium switched to weighting targets based on a rolling average, including the 2015 BES face-to-face survey and other elements to remove the interviewer effect. Ipsos Mori started discarding those who were not registered to vote in the referendum, weighting to educational attainment, and using different questions to estimate turnout. Pollsters churned out polls until the day of the referendum. Would they get it wrong again?
The result of the referendum came in at 51.89% Leave, 48.11% Remain. Some pollsters had predicted this in their final polls – TNS, Opinium, ICM – but many had not. The UK newspapers’ verdict was implacable: the polls had failed again.
But the truth was more nuanced: the race was close, and although the telephone polls were wrong, the online polls reflected reality and were far closer, although Populus’s online 10-point lead for Remain did muddy the water. This success for online polls was noted. Another point – that some pollsters were not coping too well with undecided voters – was also made. ORB had reallocated undecideds three-to-one to Remain; ComRes and Populus allocated them according to answers on the impact of EU departure. All three had larger Remain leads. This point was later picked up by YouGov.
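The effect of reallocating undecided voters is easy to see with a short calculation. The three-to-one split to Remain is ORB’s published approach as described above; the raw poll figures below are hypothetical:

```python
# How reallocating "don't knows" changes a headline figure.
# The 3:1 split in Remain's favour is ORB's reallocation rule;
# the raw poll numbers below are invented for illustration.

remain, leave, dont_know = 0.44, 0.46, 0.10

# Reallocate the undecideds 3:1 to Remain
remain_adj = remain + 0.75 * dont_know
leave_adj  = leave  + 0.25 * dont_know

print(f"Remain {remain_adj:.1%}, Leave {leave_adj:.1%}")
```

With these invented figures, a two-point Leave lead in the raw numbers becomes a three-point Remain lead in the headline figure – which is why the choice of reallocation rule mattered so much to the final polls.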
The pollsters made their responses. ComRes pointed out that regional and demographic effects were difficult to model, while Populus argued that handling undecideds and differential turnout was difficult in a referendum – going so far as to say that a demographically-based propensity-to-vote model for referendums was useless. YouGov analysed the results and noted that although online polls were better than telephone polls, there was significant variation within modes. They also found that harder-to-reach people were different to easy-to-reach ones (though perhaps not enough to justify chasing the more elusive respondents), that weighting by education and attitude might help but is not a panacea, that more sophisticated turnout modelling is not necessarily helpful and may indeed make things worse, and that it is difficult to handle “don’t knows”.
Attention next moved to the US presidential election, to be held on 8 November. In a prescient post, Anthony Wells of YouGov made some observations on whether the UK problems could be read across to the US. He noted that US polls needed to make sure their samples weren’t skewed by a lack of old and uneducated voters, and a surfeit of the young, educated and politically engaged.
In the event, the pollsters got the national vote approximately correct, and the American pollsters prided themselves on their performance relative to the British. But they exhibited the same behaviour as the British pollsters: arguing in public and under-representing the uneducated in polls. And the US modellers and pundits had the same problems: the pundits cherry-picking polls and the modellers reaching problematic conclusions – namely that Hillary Clinton would win the presidency rather than Donald Trump.
On 8 December, the NCRM and BPC/MRS held a seminar at the Royal Statistical Society headquarters entitled “Opinion Polling in the EU Referendum: Challenges and Lessons”. Presenters included John Curtice (BPC), Ben Page (Ipsos Mori), Adam Drummond (Opinium), Patrick Sturgis (NCRM), Stephen Fisher (Oxford) and Will Jennings (Southampton) – the latter three being part of the Sturgis inquiry panel.
The seminar mirrored the public debate: the pollsters fell into one camp, the academics another.
The pollsters (Page, Drummond) focussed on the logistics: which mode, quotas, weights and turnout adjustments to use. Page, for Ipsos Mori, favoured changing newspaper weights, altering the turnout filter and adding an education quota. Drummond, for Opinium, weighted attitudinally; although this had not worked for the referendum, he said they would keep it for the general election. Both agreed that turnout was important and difficult to predict in a referendum.
The academics (Curtice, Sturgis, Fisher, Jennings) focussed on the theory: new factors, the liberal/authoritarian axis displacing the left/right axis, undecideds, unrepresentative samples, confidence intervals. Sturgis said that the phone vs. online difference may be the product of social desirability bias or sample composition, and that quota sampling depends on assumptions that might be wrong. Jennings spoke of judging accuracy by mean absolute error (MAE) or by whether a poll picked the winner. Fisher said it was important but difficult to split undecideds correctly, and that estimating turnout by reported likelihood to vote (LTV) was better than by demography for the referendum. All agreed that correcting unrepresentative samples was crucial.
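The mean absolute error measure Jennings described is straightforward to compute. In this sketch the result figures are the actual referendum shares quoted earlier, while the final-poll figures are hypothetical:

```python
# Mean absolute error (MAE) of a poll against the result, one of
# the accuracy measures Jennings discussed. The poll shares below
# are hypothetical; the result is the actual referendum outcome.

def mae(poll: dict, result: dict) -> float:
    """Average absolute gap, in percentage points, across options."""
    return sum(abs(poll[k] - result[k]) for k in result) / len(result)

result = {"Leave": 51.89, "Remain": 48.11}
final_poll = {"Leave": 49.0, "Remain": 51.0}  # hypothetical poll

print(f"MAE: {mae(final_poll, result):.2f} points")
```

Note how the two criteria can diverge: this hypothetical poll has an MAE of under three points, which sounds respectable, yet it names the wrong winner – the tension that made “MAE or the winner” worth debating.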
As the year ended, things began to quieten down. The AAPOR announced the Kennedy inquiry, the US equivalent of the Sturgis inquiry, due to report in May 2017. The next UK General Election was scheduled for 2020, so people turned their attention to the 2017 local elections, the 2017 French presidential election and the 2017 German federal elections. There was plenty of time to prepare.
Then, on 18 April 2017, British Prime Minister Theresa May announced a snap general election for 8 June 2017.
The 2015 polling failure was caused by unrepresentative samples containing too many young, politically-engaged people and too few of the oldest voters (the over-70s and over-75s). The proposed solutions mirror an internal debate in the polling community, between the theoreticians (mostly academics) who want the process to be theoretically valid, and the pragmatists (mostly pollsters) who know the resources to hand and what the market can bear.
In the matter of unrepresentative samples, whilst it is accepted that it was a cause of the failure, opinions differ on what to do about it. Some pollsters are trying to fix the problem directly by changing their quotas or weighting mechanisms to get the correct proportions of non-graduates, older people and the politically unengaged – but whether these will work in action is not known. Other pollsters are trying to cure the problem indirectly by dealing with differential turnout.
In the matter of differential turnout and undecideds, pollsters have added corrective measures and/or new models based on demography, but they did not cope well in the referendum and in some cases made the results worse.
In the matter of mode effects, online polls have won out over phone polls, despite the latter being marginally better in 2015. Following the referendum, it was thought that online polls find it easier to get a representative sample than phone polls, and longer fieldwork times don’t necessarily help close the gap. ComRes and ICM have moved to online polling, leaving Ipsos Mori as the sole phone pollster.
In the matter of random versus non-random sampling, the pragmatists have won. Every time there is a polling failure, someone suggests random sampling instead of quota sampling, and every time it’s rejected for the same reasons: it’s too expensive and difficult. That, and the failure of three random-sample polls for the EU referendum, makes the retention of non-random sampling inevitable.
In the matter of margin of error, the pragmatists have also won. Pollsters (and even some academics) continue to preface explanations of the margin of error with the statistical theory, despite the edicts of the AAPOR against it and the BPC pointing out the problems with it. Kantar TNS coped with the letter – if not the spirit – of recommendation 11 of the Sturgis inquiry by saying that: “On the assumption that – conditional on the weights – the sample is practically equivalent to a random sample, the survey estimates have a margin of error of up to ±3 percentage points.” (Emphasis added.)
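The “up to ±3 percentage points” in the Kantar TNS caveat is what the textbook formula gives for a typical poll of about 1,000 respondents – if, as their wording makes explicit, one grants the random-sampling assumption. A minimal sketch of that calculation:

```python
import math

# Margin of error under the simple-random-sample assumption that
# the Kantar TNS caveat makes explicit: z * sqrt(p*(1-p)/n),
# with z = 1.96 for 95% confidence.
def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

# A typical poll of ~1,000 respondents, worst case p = 0.5
print(f"±{100 * margin_of_error(1000):.1f} points")
```

For n = 1,000 this gives roughly ±3.1 points, which is where the conventional caveat comes from. The theoreticians’ objection is precisely that, for a quota or opt-in sample, the random-sampling assumption behind this formula does not hold – hence Kantar’s carefully conditional phrasing.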
The reader is entitled to ask if the problems with the polls are now fixed. The somewhat glib answer is no: polls are never fixed; they are only recalibrated. Political opinion polling in the UK is based on the premise that if you take a non-random and/or non-representative sample and weight it according to past assumptions, it will behave like a random representative sample, yielding results describable by statistical theory. But those assumptions decay with time, and if sufficient time has passed, or public behaviour changes rapidly enough, then the poll will fail.
Given the enormous poll lead enjoyed by Prime Minister May’s Conservative Party, it would not be surprising if they won the 2017 election, and the public will accept that as a win for the pollsters. But it is a time of rapid change, and the underlying stresses have only been relieved, not cured. If past experience is any guide, one day there will be another polling failure. The only question is when.
BPC members should:
1 include questions during the short campaign to determine whether respondents have already voted by post. Where respondents have already voted by post they should not be asked the likelihood to vote question.
2 review existing methods for determining turnout probabilities. Too much reliance is currently placed on self-report questions which require respondents to rate how likely they are to vote, with no strong rationale for allocating a turnout probability to the answer choices.
3 review current allocation methods for respondents who say they don’t know, or refuse to disclose which party they intend to vote for. Existing procedures are ad hoc and lack a coherent theoretical rationale. Model-based imputation procedures merit consideration as an alternative to current approaches.
4 take measures to obtain more representative samples within the weighting cells they employ.
5 investigate new quota and weighting variables which are correlated with propensity to be observed in the poll sample and vote intention.
The Economic and Social Research Council (ESRC) should:
6 fund a pre- as well as a post-election random probability survey as part of the British Election Study in the 2020 election campaign.
BPC rules should be changed to require members to:
7 state explicitly which variables were used to weight the data, including the population totals weighted to and the source of the population totals.
8 clearly indicate where changes have been made to the statistical adjustment procedures applied to the raw data since the previous published poll. This should include any changes to sample weighting, turnout weighting, and the treatment of Don’t Knows and Refusals.
9 commit, as a condition of membership, to releasing anonymised poll micro-data at the request of the BPC management committee to the Disclosure Sub Committee and any external agents that it appoints.
10 pre-register vote intention polls with the BPC prior to the commencement of fieldwork. This should include basic information about the survey design such as mode of interview, intended sample size, quota and weighting targets, and intended fieldwork dates.
11 provide confidence (or credible) intervals for each separately listed party in their headline share of the vote.
12 provide statistical significance tests for changes in vote shares for all listed parties compared to their last published poll.
About the author
Timothy Martyn Hill is a statistician who used to work for the Office for National Statistics and now works in the private sector.
A fully referenced version of this article is also available.