A typical article on data science hails new data sources, new tools, and new visualisations, and thereby supports the case for the value of data science.
But this article takes a different angle: it talks about potential pitfalls that can face data scientists. It is based on our work as the Office for Statistics Regulation (OSR), the UK’s regulator for official statistics. We see lots of great work done by statisticians in government. But we also see some of the challenges they face – and data scientists are also likely to encounter the same challenges.
The problems arise from the fact that neither statisticians nor data scientists do their work in isolation. The work usually takes place within organisations – businesses, government bodies, think tanks, academic institutions – and as a result, the statisticians and/or data scientists are not the only players who get to influence how data science is presented and used.
What are the pitfalls we see in our work as regulator?
Pseudo data science
The first type of pitfall is pseudo data science.
Pseudo data science is a term we use to describe attempts to pass off crude work as being more data science-y than it really is. That reflects a sense in public life that data science is new, innovative, somehow the Future. In this context, people who are not data scientists can be tempted to dress themselves up in the clothes of data science to enhance their credibility. This dressing up is usually well-intentioned: it tends to come from communications professionals who want to illuminate and explain complex issues in an engaging way.
The trouble is, it can sometimes backfire. In our work at OSR, we have over the last year seen several examples where organisations have sought to publish visualisations that look like they are the product of in-depth data analysis – when in fact they have been drawn by communications staff using graphic design packages. Examples include inflation, nurses’ pay, and comparisons of UK economic performance with other countries. To be fair, whenever we have pointed out issues like this, organisations have responded well, putting in place new procedures to ensure that analysts sign off on this kind of visualisation. Nevertheless, we suspect that the temptation to indulge in pseudo data science will remain strong – and we may need to intervene on similar cases in future.
Unintelligent transparency
The second pitfall is a failure of intelligent transparency.
There is a raw form of transparency – quoting a single number (a naked number, we call it), or dumping data out into the public domain with no explanation. This is not intelligent transparency. Intelligent transparency involves being clear about where data come from and making underlying data available so that others can understand and verify the statements that are being made. Raw transparency and naked numbers treat an audience with little respect; intelligent transparency helps the audience understand and appreciate what sits behind high-level claims.
To communications teams, data science outputs can sometimes seem easy to cherry-pick for the most attractive number. Again, like pseudo data science, this reflects largely good intentions – to communicate complex things through ideas. But it becomes easy for a single, unsupported number to be used and reused until it loses most of its meaning. We call this weaponization of data, and it is the antithesis of intelligent transparency. And there is a lot of it about – for example the way in which the former Prime Minister of the UK talked repeatedly about employment; or claims about Scotland’s capacity for renewable energy. These examples indicate the pathology of weaponization that can impact data science outputs. They also act as a reminder that data scientists can counter weaponization of their own outputs by delivering engaging and insightful communication.
Context collapse
The third type of pitfall surrounds context collapse.
This idea comes from the work of the philosopher Lucy McDonald (who in turn has built on the ideas of danah boyd). What is context collapse? Imagine a swimming pool – with neat divisions of the pool into different lanes. All is clearly labelled – fast, medium, slow – for lane swimmers, who are in turn separated from the splash area for families and the deep end for divers. Removing the lanes, and thus taking away any signposting, increases the likelihood of things going wrong. The fast swimmers doing front crawl clash with the slower breaststroke swimmers; both are constantly having to avoid the families with young children; and all need to watch for the periodic big splashes created by the divers. This is the online communication environment, in which formerly private and casual statements can go viral; in which a brief statement in a media environment can be picked up on and circulated many times; and in which some bad actors (the divers) may wish to deliberately disrupt the debate by breaking all the rules.
How can this affect data science? It happens when individual bits of data are taken from their context, and used in service of a different, and bigger, argument. A good example is data on Covid vaccinations. Here, UK organisations like the Office for National Statistics and the UK Health Security Agency published comprehensive data in good faith about vaccinations and their impact. Some of the underlying data, however, was taken out of the broader context and used in isolation to support criticisms of vaccines – criticisms that the wider evidence base did not support.
The challenge then became how the organisations should respond. At an organisational level, they did not wish to withdraw the data – because that would reduce transparency. Instead, they sought both to caveat their data more clearly and to directly rebut the more egregious misuses of the data. In a sense, then, what began as an individual analytical output became part of a broader organisational judgement on positioning in the face of misinformation.
It is fair to say that there is not yet a clear consensus on how to address this third pitfall. Practice is emerging all the time and we at OSR continue to support producers of data as they grapple with it.
There are other potential pitfalls to using data science. But what unites these three – pseudo data science; unintelligent transparency; and context collapse – is that they relate to situations where data science rubs up against broader organisational dynamics, around communications, presentation and organisational strategy.
And the meta-message is this: for data scientists to thrive in organisations, they need to be good at more than data science. They need to be skilled at working alongside and influencing colleagues from other functions. Only through this form of data leadership can the pitfalls be dealt with effectively.
This article is based on a presentation at the Data Science for Health Equity group in May 2023.
- About the author
- Ed Humpherson is head of the Office for Statistics Regulation, which provides independent regulation of all official statistics in the UK. The aim of OSR is to enhance public confidence in the trustworthiness, quality and value of statistics produced by government.
- Copyright and licence
- © 2023 Ed Humpherson
This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.
- How to cite
- Humpherson, Ed. 2023. “‘Pseudo data science’ and other pitfalls: lessons from the UK’s stats regulator on how not to be misleading.” Real World Data Science, September 18, 2023. URL