At the 2025 Royal Statistical Society conference in Edinburgh, a lively group of statisticians and data scientists gathered to tackle a quietly critical issue: data quality. Our workshop, titled “Why we should all be data quality detectives”, drew around 40 participants into a dynamic conversation about why data quality is often overlooked and what we can do to change that.

The Case for Data Quality
If you search for “data quality disasters” on any search engine, you will find many results. Similarly, literature on data quality measures offers abundant advice. But within the scientific research community, data quality is often ignored. For example, how often have you encountered the term “data quality” in the guidelines when submitting or reviewing an academic paper? We would venture to say, hardly ever (or never).
This is puzzling, because high-quality data (i.e., data that is fit for purpose) is essential; without it, results become almost meaningless. Data serves as the foundation of our work. So why isn’t its quality given the prominence it deserves? Why aren’t we, as statisticians and data scientists, advocating data quality more vocally?
Recent publications may shed some light: it appears that “Everyone wants to do the model work, not the data work” [1], and that statisticians may feel uneasy with elements that are not easily quantifiable [2]. Or perhaps we are all guilty of “premature enumeration” (as Tim Harford puts it), rushing into data analysis without having a good look at the data first. Whatever the case, data quality work or “data cleaning/wrangling” is not seen as fun.
For us, as self-confessed “data quality detectives”, the reverse is true, and we began the workshop by reframing data quality not as a tedious chore, but as an empowering and even enjoyable part of the analytical process. We spend hours looking at the data, enjoying the delayed gratification of finally getting to (trustable) results.
In Rosemary’s case, her attitude was shaped by key experiences early in her statistical career. Her doctoral research focused on developing methods to automatically classify magnetic resonance spectra of human leg adipose tissue based on diet—specifically distinguishing between vegans and omnivores. The study recruited 33 vegans, while the control group included 34 omnivores and 8 vegetarians, primarily staff from the MRI unit at Hammersmith Hospital. With limited experience at the time, she began experimenting with various techniques, starting with k-means cluster analysis. Although she hoped the clusters would reflect dietary groups, the analysis instead produced two distinct clusters—one containing just two spectra and the other containing the rest. After consulting colleagues, she learned that the two outlier spectra had been acquired using a different protocol and were mistakenly included in the dataset. While she might have identified the error eventually, catching it early saved her several weeks of work — and won her some kudos with colleagues.
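
To make the point concrete, here is a minimal sketch of this kind of early detective work: a quick, exploratory k-means run that flags records which don’t belong before any modelling begins. It is not Rosemary’s original analysis; the data are synthetic, the feature dimensions and the “different protocol” shift are made up, and it assumes NumPy and scikit-learn are available.

```python
# A quick exploratory clustering check on synthetic "spectra".
# Not the original analysis: data, dimensions and the protocol shift are made up.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# 75 records acquired under the usual protocol, summarised by 10 features,
# plus 2 records acquired under a different (hypothetical) protocol.
usual_protocol = rng.normal(loc=0.0, scale=1.0, size=(75, 10))
odd_protocol = rng.normal(loc=8.0, scale=1.0, size=(2, 10))
X = np.vstack([usual_protocol, odd_protocol])

# Ask for two clusters. If the data are homogeneous, the split should be
# roughly even; a tiny cluster is a red flag worth chasing before analysis.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sizes = np.bincount(labels)
print("Cluster sizes:", sizes)

if sizes.min() <= 2:
    suspects = np.flatnonzero(labels == sizes.argmin())
    print("Suspiciously small cluster; check records:", suspects)
```

A check like this costs a few lines and a few minutes, and, as in Rosemary’s case, can save weeks of wasted analysis.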

Detective Work at the Tables
During the workshop, we split into six groups to investigate two questions: Why does data quality get overlooked? What strategies can raise its profile?
The discussions were rich and revealing. Many pointed to organisational gaps — no clear strategy, limited training, and confusion over who is responsible for data quality. Others highlighted cultural issues: time pressures, lack of curiosity, and a tendency to assume someone else has already checked the data.
Simple Excel errors are also common. We heard of one case: a study comparing a new, advanced imaging machine with an older model. The results were presented in a spreadsheet containing several measurements. As expected, the correlation matrix showed strong correlations between most columns—except for the first, which was the main measure of interest. It quickly became apparent that the sort function had been applied to that column alone, scrambling its values relative to the rest of each row and rendering them effectively random. Unfortunately, the researcher had not kept a backup of the original data, so the entire experiment was compromised. During the COVID-19 pandemic, a similar technical mistake involving Excel led to thousands of positive cases being omitted from the UK’s official daily figures. These are the kinds of simple issues that could have been caught with a basic data check.
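
As an illustration, here is a minimal sketch (with entirely made-up measurements and column names, assuming NumPy and pandas) of the kind of basic check that would have caught the sorted-column error: once one column has been sorted in isolation, its row-by-row alignment with the other measurements is destroyed and its correlations with them collapse towards zero.

```python
# A basic pre-analysis check on made-up measurements: spot a column whose
# row order has been scrambled (e.g. by sorting it on its own in Excel).
# Column names and values are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
truth = rng.normal(50, 10, size=200)
df = pd.DataFrame({
    "new_machine": truth + rng.normal(0, 2, size=200),  # main measure of interest
    "old_machine": truth,
    "repeat_scan": truth + rng.normal(0, 3, size=200),
})

# Simulate the Excel mishap: the first column is sorted on its own,
# breaking its row-by-row alignment with the other measurements.
df["new_machine"] = np.sort(df["new_machine"].to_numpy())

corr = df.corr()
print(corr.round(2))

# Flag any column whose strongest correlation with the others is implausibly weak.
for col in corr.columns:
    strongest = corr[col].drop(col).abs().max()
    if strongest < 0.3:
        print(f"Check '{col}': near-zero correlation with every other measurement")
```

Printing a correlation matrix before any analysis, and querying anything implausibly weak, is cheap insurance.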

Other examples involved data quality issues that arose because the quality checks applied to a dataset had been tailored too narrowly to its original research focus; further problems only became apparent when the same data was later used for a different purpose. The conclusion: you can’t be complacent about the quality of the data you’re using.
Strategies for Change
The second question sparked even more ideas. Suggestions ranged from embedding data quality education early (even at school level) to implementing cultural changes that lead to greater transparency. Participants called for:
- Training and upskilling across roles
- Transparent reporting of errors and limitations
- Positive feedback loops for data collectors
- Rewarding quality work and error detection
- Modernising systems and improving interoperability
- Using AI and automation to support quality checks
- Publication guidelines recommending more transparent reporting of “initial data analysis”
One standout idea: organisations could promote a “data amnesty” culture where errors can be acknowledged without blame. This is something Roger experienced during his time as Chief Statistician for the Scottish Government. There, he occasionally encountered serious data quality issues that required official statistics to be revised or delayed. Being transparent with users about these issues was a key principle of the Code of Practice for Official Statistics. A conscious effort was made — through training and through taking a certain approach to handling such situations — to foster a culture of openness and accountability. Staff were supported to create and implement plans to address the problems, learn from them, and communicate clearly with users. This transparency was essential to maintaining trust in both our processes and the statistics we produced.
A Call to Action
We walked away from the workshop with a clear conclusion: data quality needs a culture shift. It’s not enough to care — we need to prioritise and celebrate the work of those who keep our data trustworthy, while educating other stakeholders about what it involves.
Shaping the next steps will require keeping this conversation going within the data community, and Real World Data Science can play an integral role in that. As a direct result of this piece, we have updated our submission guidelines to include recommendations for transparent data reporting, and we would like to publish more stories of data disasters – or disasters averted through careful attention to data quality.
As one attendee put it, “We need to challenge the data and learn best practice from the get-go.” It’s time to embrace our inner data detectives; the integrity of our insights depends on it.
Please share your own data disaster stories in the comments, or in the Real World Data Science inbox.
About the authors
Rosemary Tate is a Chartered Biostatistician and Computer Scientist with over 30 years of experience in medical research and statistical consulting. She has a BSc in mathematics, a DPhil in Computer Science and AI, and an MSc in Medical Statistics. She has been scientific manager of a large EU-funded project and held lectureships at the Institutes of Child Health and Psychiatry. An independent statistical consultant since 2016, she now spends most of her time as a “Data Quality Agent Provocateur”.
Roger Halliday is CEO at Research Data Scotland, providing leadership to improve public wellbeing by transforming how data is used in research, innovation and insight. Roger was Scotland’s Chief Statistician from 2011 to 2022. During that time he was also Scottish Government Chief Data Officer (2017-20), and jointly led the Scottish Government Covid Analytical Team during the pandemic. Before that, he worked in the Department of Health in England as a policy analyst managing evidence for decision-making across NHS issues. He became an honorary Professor at the University of Glasgow in 2019.
Copyright and licence
© 2025 Rosemary Tate and Roger Halliday
How to cite
Tate, Rosemary and Halliday, Roger. 2025. “Why We Should All Be Data Quality Detectives.” Real World Data Science, October 30, 2025. URL
References
- [1] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. Everyone wants to do the model work, not the data work: Data cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.
- [2] Thomas Redman and Roger Hoerl. Data quality and statistics: Perfect together? Quality Engineering, 35(1):152–159, 2023.