<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/the-pulse/</link>
<atom:link href="https://realworlddatascience.net/the-pulse/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<image>
<url>https://realworlddatascience.net/images/rwds-logo-150px.png</url>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/the-pulse/</link>
<height>83</height>
<width>144</width>
</image>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Wed, 15 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Call for Submissions: is AI statistics?</title>
  <dc:creator>Editorial Board</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2026/04/15/AI-is-stats-CFS.html</link>
  <description><![CDATA[ 





<p>The Royal Statistical Society has recently set out a clear and compelling message: <a href="https://rss.org.uk/news-publication/news-publications/2026/general-news/ai-is-statistical-that-matters/">AI is Statistics</a>. This simple phrase captures a powerful truth about the foundations, practice, and future of artificial intelligence—and the central role of statistical thinking within it. It is also, of course, intentionally provocative and necessarily simplifies a highly complex and nuanced area.</p>
<p>This nuance and complexity are acknowledged and addressed in the paper itself, but no single publication can fully capture the breadth of perspectives on this topic, which is why we’ve launched a call for submissions to encourage a richer, more multidisciplinary dialogue. We are inviting writers, researchers, and practitioners across disciplines to respond to this theme with original pieces that inform, challenge, and inspire.</p>
<p>We are particularly interested in contributions that:</p>
<ul>
<li>Illuminate how statistical ideas underpin modern AI methods</li>
<li>Explore the relationship between data, uncertainty, and decision-making in AI systems</li>
<li>Offer case studies of statistics in real-world AI applications</li>
<li>Examine ethical, societal, or policy implications through a statistical lens</li>
<li>Challenge or expand the “AI is Statistics” framing in thoughtful ways</li>
<li>Communicate complex ideas accessibly to a broad audience</li>
</ul>
<p>You might want to watch <a href="https://realworlddatascience.net/the-pulse/posts/2026/03/10/rwds_big_questions_ai_statistics.html">our panel of data scientists’ recent discussion</a> for inspiration.</p>
<p>We welcome a range of formats, including opinion pieces, explainers, case studies, and thought leadership essays.</p>
<p>This is an opportunity to shape an important narrative, one that positions statistics not just as a supporting discipline, but as a driving force behind trustworthy, effective, and responsible AI.</p>
<p><strong>Help us tell the story: AI is Statistics.</strong></p>
<p>To make your submission, please review our <a href="https://realworlddatascience.net/contributor-docs/contributor-guidelines.html">contributor guidelines</a> and email us at rwds@rss.org.uk.</p>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2026 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@johnsonvr?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Virgina Johnson</a> on <a href="https://unsplash.com/photos/turned-on-red-open-neon-sigange-QmNnZj_Ok-M?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Real World Data Science Editorial Board. 2026. “Call for Submissions: is AI statistics?” <em>Real World Data Science</em>, April 15, 2026. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2026/04/15/AI-is-stats-CFS.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>Call for contributions</category>
  <category>Updates</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2026/04/15/AI-is-stats-CFS.html</guid>
  <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/08/Images/open.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>RWDS Big Questions: How do we highlight the role of statistics in AI?</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/posts/2026/03/10/rwds_big_questions_ai_statistics.html</link>
  <description><![CDATA[ 





<p>Artificial intelligence may be today’s headline act, but behind many of its most powerful systems lies something older, deeper, and quietly essential: statistics. This week, the RSS released <a href="https://rss.org.uk/RSS/media/File-library/Policy/2026/AI-is-Statistics-FINAL.pdf">a landmark position paper titled <em>AI is Statistics</em></a>. Introducing the paper, Donna Philips, Chair of the Society’s AI Task Force, which led its development, argues: <a href="https://rss.org.uk/news-publication/news-publications/2026/general-news/ai-is-statistical-that-matters/">“AI systems are built on statistical pattern recognition. They need to be developed, evaluated and governed with rigorous statistical precision.”</a></p>
<p>That this is not widely understood is problematic for many reasons. If AI is seen as magic rather than applied statistics, it becomes easier to believe it is objective, infallible, or autonomous—when in reality it is probabilistic and assumption-driven. Organisations may prioritise tools and branding over rigorous data collection, experimental design, and evaluation. Without a statistical lens, questions like “How certain are we?”, “Compared to what?”, and “Under what conditions?” are less likely to be asked. And, ultimately, the demand for “AI talent” may overlook the statistical expertise required to build reliable systems.</p>
<p>In this latest episode of Real World Data Science Big Questions, our expert panel tackles a deceptively simple question: How can we better highlight the role of statistics in AI? Watch below, and read on for some key takeaways and analysis.</p>
<section id="watch-the-discussion" class="level2">
<h2 class="anchored" data-anchor-id="watch-the-discussion">Watch the discussion</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/nrpglKlimXA" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="takeaways-at-a-glance" class="level2">
<h2 class="anchored" data-anchor-id="takeaways-at-a-glance">Takeaways at a glance</h2>
<ul>
<li><strong>AI is built on statistical thinking</strong> – even when it’s not labelled that way.</li>
<li><strong>Job titles change; core skills don’t.</strong></li>
<li><strong>Statistics sometimes undersells itself</strong> by focusing on mechanics over impact.</li>
<li><strong>Communication and visualisation are central,</strong> not peripheral, to modern statistical work.</li>
<li><strong>Kindness, collaboration, and trust</strong> are professional assets.</li>
<li><strong>The future belongs to skill-based identities, not title-based ones.</strong></li>
</ul>
</section>
<section id="key-themes-and-analysis" class="level2">
<h2 class="anchored" data-anchor-id="key-themes-and-analysis">Key themes and analysis</h2>
<p><strong>The “rebranding” meme</strong></p>
<p>The panel opens with a familiar joke: take statistics, put a new frame around it, call it machine learning or AI, and suddenly everyone pays attention. It’s humorous—but revealing.</p>
<p>Many roles advertised today as “AI” or “data science” positions are deeply statistical at heart. They involve modelling uncertainty, validating assumptions, managing bias, evaluating performance, and interpreting results. In other words: core statistical competencies.</p>
<p>Rather than resisting this relabelling, the panel suggests recognising it as part of the natural evolution of fields. The key question becomes not “What should we call ourselves?” but “What value are we delivering?”</p>
<p><strong>Identity versus skills</strong></p>
<p>One of the strongest messages from the discussion is this: don’t over-identify with a job title.</p>
<p>“Statistician,” “data scientist,” “AI specialist” are all potentially transient labels, whereas the skills underpinning them remain the same:</p>
<ul>
<li>Framing problems carefully</li>
<li>Questioning assumptions (“Are you sure? Are you sure-sure-sure?”)</li>
<li>Quantifying uncertainty</li>
<li>Designing analyses that are robust and defensible</li>
</ul>
<p>The panel suggests that the healthiest professional stance is to focus less on identity and more on what you can do and what you care about.</p>
<p><strong>The communication gap: loving the sausage-making</strong></p>
<p>Statisticians, the panel observes, sometimes make things harder than they need to be—at least in how they explain their work.</p>
<p>“We’re too interested in the mechanics,” one panellist notes. “Nobody cares how you made the sausage.”</p>
<p>This doesn’t mean rigour is unimportant. It means that impact must lead the narrative. Instead of focusing first on models, methods, and diagnostics, statisticians might begin with:</p>
<ul>
<li>What problem was solved?</li>
<li>How did this make life easier, safer, or better?</li>
<li>What decision did this enable?</li>
</ul>
<p>AI has been marketed effectively because it is framed in terms of transformation and possibility. Statistics can claim that space too, without sacrificing integrity.</p>
<p><strong>Visualisation and bringing data to life</strong></p>
<p>Visualisation is a key bridge between statistical thinking and real-world impact. Good visualisation:</p>
<ul>
<li>Makes uncertainty legible</li>
<li>Builds trust</li>
<li>Enables decision-making</li>
<li>Tells stories grounded in evidence</li>
</ul>
<p>In a world flooded with dashboards and generative outputs, the ability to present data clearly and responsibly is not a soft skill. It is core infrastructure.</p>
<p><strong>Trust, collaboration, and professional culture</strong></p>
<p>People want to work with statisticians they trust, which flows not only from technical competence but from clarity, openness, and collaboration.</p>
<p>As AI systems become more powerful—and more controversial—the professionals who can explain, contextualise, and responsibly deploy them will be in high demand.</p>
</section>
<section id="from-background-discipline-to-visible-foundation" class="level2">
<h2 class="anchored" data-anchor-id="from-background-discipline-to-visible-foundation">From background discipline to visible foundation</h2>
<p>If AI continues to evolve—as it surely will—so too will the labels attached to those who work in it. But uncertainty, inference, modelling, and critical thinking aren’t going anywhere.</p>
<p>We would love to receive contributions to the site that tackle this issue.</p>
<p>Is statistics undervalued in the AI conversation, or quietly thriving?</p>
<p>Where, in your experience, does statistical thinking most visibly shape AI work?</p>
<p>And where is it least acknowledged?</p>
<p>Have you seen statistical work rebranded as AI in your organisation?</p>
<p>We are actively seeking submissions on these topics so, if you would like to be part of the conversation, <a href="mailto:rwds@rss.org.uk"><strong>get in touch</strong></a>.</p>
<div class="article-btn">
<p><a href="https://realworlddatascience.net/the-pulse/posts/2026/01/21/rwds-big-questions-challenges-today.html">Explore more videos in the series</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/annieroseflynn/">Annie Flynn</a> is Head of Content at the <a href="https://rss.org.uk">Royal Statistical Society</a>.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Annie Flynn<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong>:<br>
Flynn, Annie. 2026. “<strong>RWDS Big Questions: How do we highlight the role of statistics in AI?</strong>” <em>Real World Data Science</em>, March 10, 2026. <a href="https://realworlddatascience.net/the-pulse/posts/2026/03/10/rwds_big_questions_ai_statistics.html">URL</a></p>
</div>
</div>
</div>


</div>
</section>

 ]]></description>
  <category>AI</category>
  <category>Governance</category>
  <category>Policy</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2026/03/10/rwds_big_questions_ai_statistics.html</guid>
  <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2026/03/10/images/hjnm.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Real World Data Science Featured on Practical Significance Podcast</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/posts/2026/03/04/rwds_big_questions_ai_regulation.html</link>
  <description><![CDATA[ 





<p>We have some exciting news to share!</p>
<p>Some representatives of our editorial board were interviewed on this month’s episode of <a href="https://magazine.amstat.org/podcast-2/">Practical Significance</a>, the podcast from the <a href="https://www.amstat.org/">American Statistical Association</a>. The conversation was a pleasure, and we hope you’ll check it out.</p>
<p>Practical Significance is a lively, thought-provoking series that examines how statistics and data science shape real-world problems, careers, and decisions. In the featured episode, hosts Donna LaLonde and Ron Wasserstein take a deep dive into what we’re building at Real World Data Science — from our commitment to clear explanation and practical examples to the methodological depth that underpins our articles. Together, we explore why this combination is resonating with a global community of data practitioners, researchers, and decision-makers.</p>
<p>In the conversation, we reflect on the themes our editorial team is most excited about, our ambitions for the site, and how practising data scientists can turn their day-to-day experiences into compelling, publishable articles. Whether you’re a seasoned practitioner, an emerging data scientist, or simply someone who appreciates a well-told data story, the episode offers insight into the ideas driving our work.</p>
<p>Tune in to listen to the episode wherever you get your podcasts.</p>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/annieroseflynn/">Annie Flynn</a> is Head of Content at the <a href="https://rss.org.uk">Royal Statistical Society</a>.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Annie Flynn<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
</div>
</div>


</div>

 ]]></description>
  <guid>https://realworlddatascience.net/the-pulse/posts/2026/03/04/rwds_big_questions_ai_regulation.html</guid>
  <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2026/03/04/images/podthu.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>RWDS Big Questions: How do we balance innovation and regulation in the world of AI?</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/posts/2026/02/18/rwds_big_questions_ai_regulation.html</link>
  <description><![CDATA[ 





<p>AI development is accelerating, while regulation moves more deliberately. That tension creates a core challenge: how do we maintain momentum without breaking the things that matter? The aim isn’t to slow innovation unnecessarily, but to ensure progress happens at a pace that protects individuals and society. Responsible actors should not be disadvantaged — yet safeguards are essential to maintain trust.</p>
<p>For the latest video in our RWDS Big Questions series, our panel explores this delicate balance. From risk-based frameworks and transparency to global inequality in AI development, the conversation surfaces the tensions, trade-offs and practical realities facing policymakers, technologists and data scientists alike.</p>
<section id="watch-the-discussion" class="level2">
<h2 class="anchored" data-anchor-id="watch-the-discussion">Watch the discussion</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/L69nxuy9caI" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="takeaways-at-a-glance" class="level2">
<h2 class="anchored" data-anchor-id="takeaways-at-a-glance">Takeaways at a glance</h2>
<ul>
<li><strong>Innovation and regulation are not opposites</strong> – both are essential, but difficult to balance.</li>
<li><strong>Responsible progress requires proportionality</strong> – not all AI applications carry the same level of risk.</li>
<li><strong>Transparency enables better governance</strong> – open dialogue between developers and regulators is key.</li>
<li><strong>Risk-based frameworks provide structure</strong> – distinguishing low-, high-, and unacceptable-risk uses helps focus oversight.</li>
<li><strong>Global disparities complicate regulation</strong> – some regions are regulating advanced AI systems, while others are still building foundational capacity.</li>
<li><strong>Innovation needs protected space</strong> – experimentation, iteration, and even failure are critical before formal standardisation.</li>
</ul>
</section>
<section id="key-themes-and-analysis" class="level2">
<h2 class="anchored" data-anchor-id="key-themes-and-analysis">Key themes and analysis</h2>
<p><strong>Proportional regulation through risk</strong></p>
<p>Not all AI systems pose the same level of harm. A risk-based approach — distinguishing low-, high-, and unacceptable-risk uses — offers a practical middle ground. It avoids blanket restrictions while ensuring stronger oversight where impact is greatest. The debate becomes less about whether to regulate, and more about how proportionate that regulation should be.</p>
<p><strong>Transparency as common ground</strong></p>
<p>Openness can bridge the gap between technologists and regulators. Clear communication about capabilities, limitations and risks enables more informed policy decisions. When innovation happens transparently and in dialogue with regulators, governance can evolve alongside technology rather than lagging behind it.</p>
<p><strong>The global unevenness of AI governance</strong></p>
<p>AI regulation is developing unevenly across regions. While parts of the West are formalising frameworks, many countries are still building foundational AI capacity. This raises difficult questions about sequencing: should regulation lead innovation, or follow it? A one-size-fits-all model may not reflect global realities.</p>
<p><strong>Protecting space to experiment</strong></p>
<p>Innovation requires room to test, iterate and occasionally fail. Early experimentation should not be overburdened with rigid controls — but successful, scalable systems must eventually transition into more standardised and regulated environments. The challenge is designing pathways that support both creativity and accountability.</p>
</section>
<section id="looking-ahead" class="level2">
<h2 class="anchored" data-anchor-id="looking-ahead">Looking ahead</h2>
<p>As AI continues to evolve, the balance between innovation and regulation will remain dynamic — and contested. This conversation opens up important questions, and we would love to hear our readers’ thoughts about how we move some of the principles mentioned in the video into practice.</p>
<ul>
<li>How do we facilitate transparent channels of communication between those developing AI and those designing the regulatory frameworks that will govern it?</li>
<li>What should determine whether an AI system is low, high, or unacceptable risk?</li>
<li>How do we define a “safe speed” for AI development — and who gets to decide?</li>
</ul>
<p>We are actively seeking submissions on these topics so, if you would like to be part of the conversation, <a href="mailto:rwds@rss.org.uk"><strong>get in touch</strong></a>.</p>
<div class="article-btn">
<p><a href="https://realworlddatascience.net/the-pulse/posts/2026/01/21/rwds-big-questions-challenges-today.html">Explore more videos in the series</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/annieroseflynn/">Annie Flynn</a> is Head of Content at the <a href="https://rss.org.uk">Royal Statistical Society</a>.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Annie Flynn<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong>:<br>
Flynn, Annie. 2026. “<strong>RWDS Big Questions: How do we balance innovation and regulation in the world of AI?</strong>” <em>Real World Data Science</em>, February 18, 2026. <a href="https://realworlddatascience.net/the-pulse/posts/2026/02/18/rwds_big_questions_ai_regulation.html">URL</a></p>
</div>
</div>
</div>


</div>
</section>

 ]]></description>
  <category>AI</category>
  <category>Governance</category>
  <category>Policy</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2026/02/18/rwds_big_questions_ai_regulation.html</guid>
  <pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2026/02/18/images/BQthumb.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Colorado’s AI Law Pause: What It Means for People Working in Data Science</title>
  <dc:creator>Dr. Stefani Langehennig, University of Denver Daniels College of Business</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/posts/2026/02/02/colorado-AI.html</link>
  <description><![CDATA[ 





<p>In 2024, Colorado became the first U.S. state to pass a <a href="https://leg.colorado.gov/bills/sb24-205">comprehensive law</a> aimed at regulating “high-risk” artificial intelligence systems: models used in areas such as hiring, housing, credit, and healthcare. The law adopted a risk-based approach, placing additional obligations on systems that shape consequential decisions, including requirements around documentation, monitoring, and human oversight. Less than a year later, lawmakers delayed its implementation and began reconsidering key provisions, citing uncertainty about feasibility, cost, and enforcement.</p>
<p>Colorado’s approach drew explicitly on international models, most notably the <a href="https://artificialintelligenceact.eu/">European Union’s AI Act</a>, which similarly classifies AI systems by risk and ties higher-risk uses to stronger accountability requirements. Colorado’s experience is not only a story about state politics. It serves as a useful case study for a more practical question: what happens when ambitious AI governance principles meet the realities of building and maintaining production data systems?</p>
<p>For data scientists, analysts, machine learning engineers, and others responsible for real-world data products, this moment signals that AI governance is no longer a peripheral policy concern. It is becoming an operational constraint.</p>
<section id="from-governance-principles-to-technical-work" class="level2">
<h2 class="anchored" data-anchor-id="from-governance-principles-to-technical-work">From Governance Principles to Technical Work</h2>
<p>Colorado’s law followed a pattern increasingly visible in global AI governance, particularly the European Union’s AI Act. These frameworks share a risk-based logic in that systems that influence consequential decisions face higher expectations for transparency, oversight, and accountability.</p>
<p>At a high level, these expectations (fairness, consumer protection, responsible use) sound abstract. In practice, they translate directly into technical work:</p>
<ul>
<li>Clear documentation of model purpose, training data, and limitations</li>
<li>Records showing where data comes from and how it changes over time</li>
<li>Reproducible experiments and versioned artifacts</li>
<li>Ongoing monitoring for performance drift and unintended impacts</li>
<li>Defined processes for human review and intervention</li>
</ul>
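<p>As one concrete illustration of documentation that travels with a system, here is a minimal sketch in Python. Everything in it, including the <code>ModelCard</code> class and the example model and its fields, is hypothetical, invented for illustration rather than drawn from Colorado’s law or any specific governance framework:</p>

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ModelCard:
    """A minimal documentation record intended to travel with a deployed model."""
    name: str
    purpose: str
    training_data: str
    limitations: list = field(default_factory=list)
    version: str = "0.1.0"

    def to_json(self) -> str:
        # Serialize so the card can be stored alongside versioned model artifacts
        return json.dumps(asdict(self), indent=2)

# Hypothetical example: a scoring model used to support, not replace, human review
card = ModelCard(
    name="loan-review-scorer",
    purpose="Rank applications for human review; not an automated decision system.",
    training_data="Internal applications, 2019-2023; excludes thin-file applicants.",
    limitations=[
        "Not validated for applicants under 21",
        "No drift monitoring before 2024",
    ],
)
print(card.to_json())
```

<p>Storing a record like this next to the model artifact itself is one lightweight way to keep purpose, data provenance, and known limitations attached to a system as it moves between teams.</p>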
<blockquote class="blockquote">
<p>None of this lives in legislation. It lives in scripts, workflows, dashboards, deployment systems, and operational infrastructure.</p>
</blockquote>
<p>Colorado’s stalled implementation of AI policy surfaced a familiar pattern: many organizations are well equipped to optimize model performance, but far less prepared to operationalize accountability at scale. The friction emerged not because governance goals were controversial, but because the supporting technical infrastructure was uneven.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2026/02/02/images/thumb.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="why-uncertainty-becomes-a-design-risk" class="level2">
<h2 class="anchored" data-anchor-id="why-uncertainty-becomes-a-design-risk">Why Uncertainty Becomes a Design Risk</h2>
<p>One challenge Colorado encountered was definitional ambiguity. For example, what qualifies as “high risk”, what safeguards are sufficient, and how should harms be assessed? These questions are not merely legal; they are technical and context-dependent.</p>
<p>Different data sources, deployment approaches, and users lead to different answers. For teams building data systems today, that uncertainty creates risk. When teams cannot easily see how data moves through a system, how models change over time, or how decisions are produced, adapting later becomes costly and disruptive.</p>
<p>Recent federal signals add another layer of complexity. President Trump’s executive order discouraging state-level AI regulation aims to reduce fragmented policy on AI, but it does not replace state experimentation with a concrete national policy. Teams now operate in a moving landscape shaped by state initiatives, evolving federal priorities, and international regimes like the EU AI Act. In this environment, aiming for minimal compliance is risky. Teams are better served by designing systems that are flexible and easy to observe from the start.</p>
</section>
<section id="responsibility-does-not-end-at-deployment" class="level2">
<h2 class="anchored" data-anchor-id="responsibility-does-not-end-at-deployment">Responsibility Does Not End at Deployment</h2>
<p>A lesson emerging from both policy debates and practice is that accountability does not stop when a model goes live. Responsibility shifts across teams over time, from data scientists to engineers, product owners, operators, and decision-makers.</p>
<p>This challenge is the focus of the <a href="https://senseaboutscience.org/responsible-handover-of-ai/">Responsible Handover of AI framework</a> developed by <a href="https://senseaboutscience.org/">Sense about Science</a>, which emphasizes the need for clear transitions of responsibility as AI systems move from development into real-world use. Rather than treating deployment as a handoff to “the business”, the framework highlights the risks that arise when assumptions, limitations, and responsibilities are not carried forward with the system.</p>
<p>For practitioners, this framing maps governance concerns onto familiar operational questions, such as who monitors systems after deployment, which development assumptions still matter in production, how limitations are communicated to users, and what happens when systems are updated or handed over to new teams.</p>
<p>Without explicit handover practices, accountability gaps emerge because responsibility becomes diffuse as systems evolve. From this perspective, many regulatory requirements are not adding entirely new work, rather they formalize practices teams already rely on. This includes documentation that travels with systems, monitoring in production, and clear escalation paths when something goes wrong.</p>
</section>
<section id="practical-steps-teams-can-take-now" class="level2">
<h2 class="anchored" data-anchor-id="practical-steps-teams-can-take-now">Practical Steps Teams Can Take Now</h2>
<p>Regardless of how U.S. and international regulation ultimately settles, several investments pay off immediately while also reducing future risk:</p>
<ul>
<li><em>Standardize documentation</em>. Ensure model summaries and data descriptions travel with systems as they move between teams</li>
<li><em>Build end-to-end visibility</em>. Version datasets, features, models, and configurations so results can be reproduced</li>
<li><em>Instrument monitoring early</em>. Track input drift, unstable predictions, performance decay, and downstream impacts once systems are in production</li>
<li><em>Clarify governance workflows</em>. Define who approves releases, who monitors systems, and how responsibility shifts over time</li>
<li><em>Translate risk for leadership</em>. Gaps in documentation and visibility tend to come back later as messy, expensive fixes; addressing them early saves time and pain</li>
</ul>
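<p>As a sketch of what "instrument monitoring early" can mean in practice, the population stability index (PSI) is one common way to track input drift between a training baseline and production data. The threshold and the synthetic data below are illustrative only:</p>

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a production feature distribution against its
    training baseline. PSI > 0.2 is a common rule-of-thumb
    threshold for investigating input drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets at a tiny share to avoid log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)     # feature at training time
drifted = rng.normal(0.5, 1.2, 5000)  # same feature in production
print(population_stability_index(baseline, drifted))
```

<p>Running a check like this on every scoring batch, and alerting when it crosses a threshold, is the kind of lightweight instrumentation that is cheap to add early and expensive to retrofit.</p>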
<blockquote class="blockquote">
<p>These practices are not limited to machine learning. Any system that informs decisions can create similar accountability challenges.</p>
</blockquote>
</section>
<section id="governance-lives-in-the-data-stack" class="level2">
<h2 class="anchored" data-anchor-id="governance-lives-in-the-data-stack">Governance Lives in the Data Stack</h2>
<p>There’s still no settled agreement on how AI should be governed. But for people building real-world data systems, the implications of that debate are already concrete. Accountability increasingly lives in the data stack: in how workflows are instrumented, how models are monitored, and how decisions can be examined after the fact.</p>
<p>This is not simply about regulatory compliance. It is about building systems that are transparent, resilient, and trustworthy at scale. Organizations that treat governance as a core technical problem (rather than an external policy constraint imposed later) will be best positioned to navigate whatever regulatory balance ultimately emerges.</p>
<div class="article-btn">
<p><a href="../../../../../the-pulse/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/stefani-langehennig-phd-418820144/">Dr.&nbsp;Stefani Langehennig</a> is an Assistant Professor of the Practice in the Business Information &amp; Analytics Department at the University of Denver’s Daniels College of Business. She is also the lead director for the Center for Analytics and Innovation with Data (CAID). As a former data scientist, she has worked with both academic and industry partners in the U.S. and abroad, helping organizations evaluate and implement data analytics and AI solutions. Her research focuses on computational social science methods, the impact of data transparency on political behavior, and legislative policy capacity.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Stefani Langehennig<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Langehennig, Stefani. 2026. “<strong>Colorado’s AI Law Pause: What It Means for People Working in Data Science</strong>.” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/the-pulse/posts/2026/02/02/colorado-AI.html">URL</a></p>
</div>
</div>
</div>


</div>
</section>

 ]]></description>
  <category>AI governance</category>
  <category>Applied data science</category>
  <category>Data engineering and MLOps</category>
  <category>Technology policy</category>
  <category>Operational risk</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2026/02/02/colorado-AI.html</guid>
  <pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2026/02/02/images/thumb.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>RWDS Big Questions: What Are the Key Challenges Facing Data Scientists Today?</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/posts/2026/01/21/rwds-big-questions-challenges-today.html</link>
  <description><![CDATA[ 





<p>Data science is operating in a moment of paradox. We have more data, more tools, and more computational power than ever before — yet many of the core challenges feel stubbornly human.</p>
<p>In this video, experienced practitioners from varied backgrounds reflect on what they see as the biggest obstacles facing the profession today.</p>
<p>This video is part of our thought-leadership series, RWDS Big Questions, where members of our community answer one key question in multiple ways, offering diverse perspectives from across the industry.</p>
<p>Watch the video below to hear insights that span technical, organisational, and personal dimensions. Together, they reveal a set of deeply connected themes and, importantly, opportunities for the field to mature. Scroll down for analysis and practical takeaways.</p>
<hr>
<section id="video-what-are-the-key-challenges-facing-data-scientists-today" class="level2">
<h2 class="anchored" data-anchor-id="video-what-are-the-key-challenges-facing-data-scientists-today">Video: What Are the Key Challenges Facing Data Scientists Today?</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/4WgCvhkTCiQ" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2026/01/21/images/challengesinfo1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="the-patterns-behind-the-problems" class="level2">
<h2 class="anchored" data-anchor-id="the-patterns-behind-the-problems">The Patterns Behind the Problems</h2>
<p>Although the challenges raised span technical, organisational, and personal domains, they are connected by a small number of deeper themes that shape modern data science.</p>
<section id="the-gap-between-capability-and-understanding" class="level3">
<h3 class="anchored" data-anchor-id="the-gap-between-capability-and-understanding">The gap between capability and understanding</h3>
<p>Across multiple perspectives, there is a recurring mismatch between what our tools can do and how well we understand their limitations. From AI systems trained on poor-quality data to models built on artificial or incomplete datasets, technical capability is often outpacing validation, interpretation, and critical scrutiny.</p>
<p>This gap widens further as advanced tools become more accessible to non-specialists, increasing the risk of confident but flawed outputs.</p>
</section>
<section id="speed-amplifies-existing-weaknesses" class="level3">
<h3 class="anchored" data-anchor-id="speed-amplifies-existing-weaknesses">Speed amplifies existing weaknesses</h3>
<p>Pressure to move quickly doesn’t create new problems so much as it magnifies existing ones. Poor data quality, weak validation, and organisational silos become far more consequential when decisions must be made rapidly.</p>
<p>The demand for instant answers leaves little room for reflection, experimentation, or uncertainty — despite these being essential to good data science.</p>
</section>
<section id="data-science-is-constrained-by-its-environment" class="level3">
<h3 class="anchored" data-anchor-id="data-science-is-constrained-by-its-environment">Data science is constrained by its environment</h3>
<p>Many of the challenges raised point away from algorithms and towards the environments in which they are deployed. Organisational readiness, digital infrastructure, and especially incentive structures strongly shape how data science is practised and whether it creates impact.</p>
<p>When teams are rewarded for control rather than collaboration, silos persist, data sharing becomes risky, and even the most robust models struggle to influence decisions.</p>
</section>
<section id="uncertainty-is-a-constant" class="level3">
<h3 class="anchored" data-anchor-id="uncertainty-is-a-constant">Uncertainty is a constant</h3>
<p>The personal experience of data scientists mirrors these structural challenges. In a field defined by rapid change, uncertainty about where to focus, what to learn, and how to stay relevant is common.</p>
<p>This is not just a skills issue, but a signal that data science is still evolving, without a single, stable definition of what “good” looks like.</p>
</section>
</section>
<section id="looking-ahead" class="level2">
<h2 class="anchored" data-anchor-id="looking-ahead">Looking Ahead</h2>
<p>Taken together, these themes suggest that the biggest challenges in data science are not isolated problems to be solved individually. They are interconnected tensions between speed and rigour, access and expertise, innovation and organisational inertia.</p>
<p>Addressing them requires interdisciplinary, systems-level thinking.</p>
<p>Which of these challenges resonates most with your own experience in data science? How can practitioners use these tensions as inflection points to actively shape the field, rather than simply react to it?</p>
<div class="article-btn">
<p><a href="../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/annieroseflynn/">Annie Flynn</a> is Head of Content at the <a href="https://rss.org.uk">Royal Statistical Society</a>.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Annie Flynn<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Flynn, Annie. 2026. “<strong>RWDS Big Questions: What Are the Key Challenges Facing Data Scientists Today?</strong>” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/the-pulse/posts/2026/01/21/rwds-big-questions-challenges-today.html">URL</a></p>
</div>
</div>
</div>


</div>
</section>

 ]]></description>
  <category>Big Questions</category>
  <category>Data science</category>
  <category>Practice</category>
  <category>Careers</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2026/01/21/rwds-big-questions-challenges-today.html</guid>
  <pubDate>Wed, 21 Jan 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2026/01/21/images/thumb.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Book Review: AI in Business: Towards the Autonomous Enterprise by Sarah Burnett</title>
  <link>https://realworlddatascience.net/the-pulse/posts/2025/08/04/AI-in-Bus-Review.html</link>
  <description><![CDATA[ 





<center>
As AI continues to reshape industries, business leaders are increasingly seeking guidance on how to harness its potential responsibly and effectively. Keeping abreast of these conversations is crucial for data practitioners seeking to guide their non-technical stakeholders and foster cross-functional collaboration. The recently released second edition of <strong><a href="https://shop.bcs.org/page/detail/ai-in-business/?SF1=work_exact&amp;ST1=AIINBUSINESS2">AI in Business: Towards the Autonomous Enterprise</a></strong> (BCS, 2024) aims to help decision-makers understand the strategic opportunities and challenges of AI in a business context. We asked Ed Rochead, Chair of the <a href="https://alliancefordatascienceprofessionals.com/">Alliance for Data Science Professionals</a>, to reflect on the themes, frameworks and case studies it offers.
</center>
<div id="fig-cde" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/the-pulse/posts/2025/08/04/Images/bookcover.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: AI in Business: Towards the Autonomous Enterprise
</figcaption>
</figure>
</div>
<p>The key words in the title of the book are ‘autonomous enterprise’: this is a book about using AI to make enterprises more effective through autonomy, rather than (as the cover picture of a robot might suggest) embedding autonomous robotic systems in a business.</p>
<p>The volume is in three main parts. The first introduces AI, with useful explanations of terms like “generative AI”, and offers some thoughts about how AI might be used for innovation and efficiency in a business. The second gives case studies on how real, recognisable organisations have used AI to automate their operations with some success. The third reflects upon the future, focusing on how an organisation might start the journey towards autonomy, as well as some thoughts on the impact of autonomy on society.</p>
<p>A chapter I found particularly helpful is ‘What You Need to Know About AI’. It explains the relevant terms at the level required by an industry or business leader, rather than giving an in-depth technical account of the concepts, and in this the author succeeds.</p>
<p>At the heart of the book are the case studies, involving organisations that include international companies, an NHS Trust, and a district council. The reader is likely to have some personal experience of these types of organisations, for instance as a patient or resident. This made the case studies even more engaging, at least to me: not only could I put myself in the shoes of a leader in the organisation concerned, I could also empathise with those affected by the system. This familiarity really brought the content to life, making the author’s choice of case studies an inspired one.</p>
<p>The closing section is intriguing. The chapter introducing the first steps an organisation might take towards autonomy is helpful, as it illustrates the stages of autonomy using the example of buying a car (spoiler alert – in the last stage it is manufactured and then drives itself to the consumer’s home!). The second chapter in this section, looking towards the future, gets the reader thinking more broadly about the impact of automation on society. This includes the thorny issues of ethics and impact on things like (un)employment; both areas are covered engagingly and thought-provokingly.</p>
<p>Although impressed with the content of the book, I found the typeface very cramped and small, and joked with friends that 200 pages of material is crammed into 160. This makes the contents a harder read than they might otherwise be.</p>
<p>In one sense, each of the three sections would make a good, if short, book but, read together in the sequence provided, they become more than the sum of their parts, with the first part informative, the second part engaging, and the third thought provoking. <em>AI in Business</em> is an excellent read for an organisational leader seeking inspiration to automate. It gives enough language and concept familiarity to enable such a reader to ask sensible questions of technical experts, and an idea of the art of the possible. Someone with an interest in how AI might change the world around us could also find this a fascinating and informative read.</p>
<div class="article-btn">
<p><a href="../../../../../the-pulse/index.html">Discover more The Pulse</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>About the author</dt>
<dd>
<a href="https://www.linkedin.com/in/prof-edward-r-17768847/">Professor Edward Rochead, M.Math (Hons), PGDip, CMath, FIMA</a> is a mathematician employed by the government, currently leading work on STEM Skills and Data. Ed is chair of the <a href="https://alliancefordatascienceprofessionals.com/">Alliance for Data Science Professionals</a>, a Visiting Professor at Loughborough University, an Honorary Professor at the University of Birmingham, Chartered Mathematician, and Fellow of the IMA and RSA.
</dd>
<dt>Copyright and licence</dt>
<dd>
© 2025 Royal Statistical Society
</dd>
<dd>
Thumbnail image by <a href="https://www.bcs.org/">BCS</a> <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">

</div>
</div>
</div>



 ]]></description>
  <category>AI</category>
  <category>Data Science</category>
  <category>Machine learning</category>
  <category>Collaboration</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2025/08/04/AI-in-Bus-Review.html</guid>
  <pubDate>Mon, 04 Aug 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2025/08/04/Images/bookcover2.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>From medical history to medical foresight: why the NHS needs its own foundation AI model for prevention</title>
  <link>https://realworlddatascience.net/the-pulse/posts/2025/07/28/NHS-foundation-AI.html</link>
  <description><![CDATA[ 





<p>The point of this article is to persuade you, in a mildly entertaining way, that developing a sovereign foundation AI model should be a priority for the NHS, professional bodies, and patients, but that we need to get the research right.</p>
<p><strong>Risk is personal</strong></p>
<p>How do we move from treating disease to preventing disease? The traditional approach has been to publicise well-evidenced public health interventions: don’t smoke, drink less, eat vegetables, exercise, vaccinate, wear sunscreen. This is all very good advice at the population level, but for the individual it’s hard to know what to worry about and what to prioritise. I, being a clumsy man with bad ankles and a lack of spatial awareness, am at risk of going to A&amp;E with (another) concussion. You will be different.</p>
<p><strong>A little bit of history</strong></p>
<p>Individualised risk models in healthcare are not new. Traditional statistical approaches have used tabular data to predict healthcare events and have done a good job. These models are converted into questionnaires that clinicians can use to make decisions based on your risk. If you have had the NHS health check, a clinician will have measured your blood pressure, cholesterol, height and weight, and asked a few questions about your medical history. They will then feed this into a model, and the output is your risk of having a heart attack or stroke over the next ten years (1). There are also automated approaches built into the systems your GP uses that help stratify the population based on individual risk of things like frailty (2).</p>
<p>These kinds of models are usually based on a snapshot of data and require bespoke data pipelines and engineering to massage the data into the right shape for the model: wonderful news for data scientists and statisticians, as it leads to a proliferation of finely tuned models which can keep us in gainful employment for many years. However, each one has significant costs to develop, test, validate, deploy and integrate into clinical practice.</p>
<p>Another issue with these traditional models is that they squash a medical history into a single row of data for each patient, losing the chronology of health. Intuitively, we would expect that the sequence of events matters in predicting healthcare outcomes and traditional approaches struggle to capture this.</p>
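<p>To make the contrast concrete, here is a toy sketch (all codes and dates are invented) of the same medical history represented as a flattened snapshot row versus an ordered event sequence:</p>

```python
from datetime import date

# A toy medical history as timestamped, coded events
# (codes and dates invented for illustration).
history = [
    (date(2015, 3, 1), "smoking_recorded"),
    (date(2018, 6, 12), "type2_diabetes_dx"),
    (date(2021, 9, 30), "statin_prescribed"),
]

# Traditional tabular modelling flattens this into one row,
# keeping aggregates but discarding the order of events.
snapshot = {
    "n_events": len(history),
    "has_diabetes": any(code == "type2_diabetes_dx" for _, code in history),
    "years_since_last_event": (date(2025, 1, 1) - history[-1][0]).days / 365.25,
}

# Sequence models instead consume the ordered codes directly,
# so "diabetes then statin" differs from "statin then diabetes".
sequence = [code for _, code in sorted(history)]
print(snapshot)
print(sequence)
```

<p>The snapshot answers "what has happened, in aggregate"; the sequence preserves "in what order", which is exactly the information a transformer-style model can exploit.</p>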
<p><strong>Using a sequence of events to predict a sequence of events</strong></p>
<p>Sequences of events are easier for data engineers too. It’s much simpler to join together all the data into a sequence than perform a series of complex aggregations and transformations for every model. The simpler the data engineering needed to create the inputs the easier it is to scale as you are making fewer assumptions about the data.</p>
<p>So, if they are easier to engineer, and they capture more information, why are they not the standard way of predicting health outcomes? Because modelling sequences is harder than modelling a row of data. As the model sees more of a sequence it has to hold that memory somewhere so that it can accumulate the appropriate information. Models that could do this started appearing in the machine learning literature in the early 1990s (3), but for a long time we had neither the data, the computing power, nor quite the right kind of algorithms to make them useful. Today they have become feasible in healthcare due to the rise of electronic healthcare records, standardised codes for classifying events, and the rise of the transformer model. Transformer models combine the ability to hold an internal “memory” of the sequence with the capacity to pay attention to different aspects of the sequence, which basically makes them magic.</p>
<p>These models have demonstrated state of the art accuracy in predicting future events using electronic patient histories. Examples for those interested in reading more include BEHRT (4), Med-Bert (5), TransformerEHR (6) and the more recent generative transformer model ETHOS (7). These can be used for a range of healthcare prediction tasks whilst delivering state of the art predictive accuracy, again, magic.</p>
<p>A recent preprint (8) from Microsoft has also demonstrated that these EHR models act in a similar way to the large language models like those backing ChatGPT: their performance scales predictably with processing power, data and the size of the model. This means that more data will probably lead to a better model, and we can optimise model performance for a given computational budget.</p>
<p><strong>So what?</strong></p>
<p>Why should you care about this? If we can take these architectures and train them on data at the scale of the NHS, then each individual patient could have a relatively accurate prediction of their most likely next healthcare events (9). It would be your medical history projected forward, providing a narrative that is easier to understand than a page of risk scores. It’s your potential medical future. This could help with changing behaviour to reduce future risk, something we all struggle with. I think of it like the medical version of the ghost of Christmas future, but using a chain of events rather than clinking ghost chains.</p>
<p>We are already seeing heavy usage of publicly available large language models for healthcare. 10% of a representative sample of Australians used ChatGPT for medical advice, rising to 26% of 25-34 year olds (10); I assume the UK is similar. It seems that the public is much more ready than the health system to use these models, and regulation is struggling to keep up, for good reason: they may not actually help.</p>
<p><strong>The underwhelming evidence</strong></p>
<p>As of August 2024 there were 950 AI models approved by the FDA, with a significant proportion of those for clinical decision support, but only 2.4% of these are supported by randomised controlled trials (11).</p>
<p>This is important, as what works on a machine learning researcher’s infrastructure may not work in a clinical setting. In 2018, a comprehensive health economic evaluation of a risk prediction model for identifying people at risk of hospital admission found that those in the treatment arm had a higher healthcare cost and there was no significant impact on the number of people being admitted to hospital, despite accurate predictions (12). Some prediction models even cause harmful self-fulfilling prophecies when used for decision making (the paper is well worth a read) (13).</p>
<p><strong>The prize</strong></p>
<p>The UK government is clear about the ambition to be an “AI maker” not an “AI taker”. Given the expected improvement in accuracy from scaling these EHR models, there is an opportunity for the UK to leverage what should be one of its greatest data assets (decades of longitudinal electronic healthcare records from cradle to grave) and create a sovereign foundational model that supports patient care. These are being developed now in the US and elsewhere. A meta-analysis in 2023 found over 80 foundational healthcare models; there are many more today, and there is concern that at some point it will be cheaper for the NHS to bring one in and pay for it than to train its own.</p>
<p><strong>Foresight</strong></p>
<p>Fortunately we have made some progress in the UK with NHS data. Foresight (14), a transformer model developed in London on data from 1.4 million patients, has demonstrated impressive results. This model has been taken forward for covid research, to see if the same approach can better predict disease/COVID-19 onset, hospitalisation and death, for all individuals, across all backgrounds and diseases, using national data made available during the pandemic for research specifically on covid. This is being done through the British Heart Foundation’s collaboration with NHS England’s secure data environment (15).</p>
<p>However, just because we can do this, it does not mean that we should. Researchers need to be careful to stay within the bounds of their project and make extraordinary efforts to engage with the public. We have to ensure that our data is not being exploited inappropriately for commercial gain. The Royal College of General Practitioners has raised concerns that this model goes beyond what they agreed to. Professor Kamila Hawthorne, Chair of the Royal College of GPs, said: “As data controllers, GPs take the management of their patients’ medical data very seriously, and we want to be sure data isn’t being used beyond its scope, in this case to train an AI programme.” The project has been paused for the time being, despite having been approved and specifically targeted at covid research.</p>
<p>The best model for predicting outcomes from covid or the risk factors involved in covid is likely to be a population scale generative transformer model. This research will determine whether that hypothesis is true and whether this kind of data could provide more accurate predictions for patients. The NHS data and the model are kept inside a secure data environment with personal identifiers stripped out. No patient details are passed to researchers and no data or code leaves that environment without explicit permission. This research seems like something we should do.</p>
<p>Despite the potential of AI-assisted clinicians for differential diagnosis (with recent evidence that they perform better than both clinicians alone and clinicians using search (16)), and the attractiveness of having your medical history and your medical future in your pocket, we are a way off this reality. The gap between research and demonstrating the cost-effectiveness of AI solutions in the real world is significant, but all the component parts needed to close it exist: the data, the models, the research capability and the political will.</p>
<p>We will get there. Foundational models in healthcare are no longer a theoretical possibility, but an imminent reality. The UK has a rare opportunity to lead, not follow, by building a sovereign AI model trained on NHS data to accelerate the transition from treating disease to preventing disease. To get there, we must confront hard questions about patient engagement and real-world benefit. But to stop research based solely on the sophistication of the method is to misunderstand the moment. I think patients expect us to do better.</p>
<div class="keyline">
<hr>
</div>
<p><strong>References</strong></p>
<ol type="1">
<li><p>Hippisley-Cox, J., Coupland, C.A.C., Bafadhel, M. et al.&nbsp;Development and validation of a new algorithm for improved cardiovascular risk prediction. Nat Med 30, 1440–1447 (2024). https://doi.org/10.1038/s41591-024-02905-y</p></li>
<li><p>Clegg A, Bates C, Young J, Ryan R, Nichols L, Ann Teale E, Mohammed MA, Parry J, Marshall T. Development and validation of an electronic frailty index using routine primary care electronic health record data. Age Ageing, May;45(3):353-60, (2016) https://doi.org/10.1093/ageing/afw039.</p></li>
<li><p>Jeffrey L. Elman. Finding structure in time. Cognitive Science, Volume 14, 179-211 (1990). https://doi.org/10.1016/0364-0213(90)90002-E</p></li>
<li><p>Li, Y., Rao, S., Solares, J.R.A. et al.&nbsp;BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020). https://doi.org/10.1038/s41598-020-62922-y</p></li>
<li><p>Rasmy, L., Xiang, Y., Xie, Z. et al.&nbsp;Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 4, 86 (2021). https://doi.org/10.1038/s41746-021-00455-y</p></li>
<li><p>Yang, Z., Mitra, A., Liu, W. et al.&nbsp;TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat Commun 14, 7857 (2023). https://doi.org/10.1038/s41467-023-43715-z</p></li>
<li><p>Renc, P., Jia, Y., Samir, A.E. et al.&nbsp;Zero shot health trajectory prediction using transformer. npj Digit. Med. 7, 256 (2024). https://doi.org/10.1038/s41746-024-01235-0</p></li>
<li><p>Grout R, Gupta R, Bryant R, Elmahgoub MA, Li Y, Irfanullah K, Patel RF, Fawkes J, Inness C. Predicting disease onset from electronic health records for population health management: a scalable and explainable Deep Learning approach. Front Artif Intell. 2024 Jan 8;6:1287541. doi: 10.3389/frai.2023.1287541.</p></li>
<li><p>Sheng Zhang et al.&nbsp;Exploring Scaling Laws for EHR Foundation Models (2025) arXiv:2505.22964v1</p></li>
<li><p>Julie Ayre, Erin Cvejic and Kirsten J McCaffery. Use of ChatGPT to obtain health information in Australia, 2024: insights from a nationally representative survey Med J Aust (2025). doi: 10.5694/mja2.52598</p></li>
<li><p>Windecker D, Baj G, Shiri I, Kazaj PM, Kaesmacher J, Gräni C, Siontis GCM. Generalizability of FDA-Approved AI-Enabled Medical Devices for Clinical Use. JAMA Netw Open. 2025 Apr 1;8(4):e258052. doi: 10.1001</p></li>
<li><p>Snooks H et al.&nbsp;Predictive risk stratification model: a randomised stepped-wedge trial in primary care (PRISMATIC). Southampton (UK): NIHR Journals Library; 2018 Jan.&nbsp;PMID: 29356470.</p></li>
<li><p>van Amsterdam WAC, van Geloven N, Krijthe JH, Ranganath R, Cinà G. When accurate prediction models yield harmful self-fulfilling prophecies. Patterns (N Y). 2025 Apr 11;6(4):101229. doi: 10.1016/j.patter.2025.101229.</p></li>
<li><p>Kraljevic, Zeljko et al.&nbsp;Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. The Lancet Digital Health, Volume 6, Issue 4, e281-e290.</p></li>
<li><p>CVD-COVID-UK/COVID-IMPACT: Projects CCU078: Foresight: a generative AI model of patient trajectories across the COVID-19 pandemic https://bhfdatasciencecentre.org/projects/ccu078/</p></li>
<li><p>McDuff, D., Schaekermann, M., Tu, T. et al.&nbsp;Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025). https://doi.org/10.1038/s41586-025-08869-4</p></li>
</ol>
<div class="article-btn">
<p><a href="../../../../../the-pulse/index.html">Discover more from The Pulse</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>About the author</dt>
<dd>
<a href="https://www.linkedin.com/in/will-browne-391b1930/">Will Browne</a> is co-founder of healthcare technology company <a href="https://www.emrys.health/">Emrys Health</a>, where he works on the development of infrastructure for transformative, equitable and accessible healthcare. He is Events Secretary of the <a href="https://rss.org.uk/membership/rss-groups-and-committees/sections/data-science-section/">RSS Data Science and AI section</a> and a member of the <a href="https://rss.org.uk/policy-campaigns/policy-groups/ai-task-force/">RSS AI Taskforce</a>.
</dd>
<dt>Copyright and licence</dt>
<dd>
© 2025 Royal Statistical Society
</dd>
<dd>
Thumbnail image by <a href="https://unsplash.com/@tugcegungormezler?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Tugce Gungormezler</a> on Unsplash. <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">

</div>
</div>
</div>



 ]]></description>
  <category>AI</category>
  <category>Data Science</category>
  <category>Machine learning</category>
  <category>Deep learning</category>
  <category>Econometrics</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2025/07/28/NHS-foundation-AI.html</guid>
  <pubDate>Mon, 28 Jul 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2025/07/28/Images/NHS.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Call for Submissions</title>
  <dc:creator>Editorial Board</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/08/relaunch-CFS.html</link>
  <description><![CDATA[ 





<p>Get ready to engage with Real World Data Science as we unveil an exciting editorial refresh! We’re thrilled to announce that submissions are now open across our four dynamic sections: The Pulse, Applied Insights, Foundations &amp; Frontiers, and People &amp; Paths. Join us as we redefine the conversation in data science with fresh perspectives and insights. Real World Data Science is relaunching to meet the pace and complexity of today’s data-driven world in real time, with the RSS’s trademark steadying presence. We will be publishing high-quality case studies, tutorials and think-pieces that bridge the gap between rigorous analysis and real-time relevance, and that speak directly to the latest events and emerging trends.</p>
<p>All submissions will be peer-reviewed by members of the Real World Data Science <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">Editorial Board</a>.</p>
<section id="our-audience" class="level2">
<h2 class="anchored" data-anchor-id="our-audience">Our Audience:</h2>
<p>People working in data science who are looking for practical insights, methodological rigour and thought leadership to inform their work and decision-making.</p>
</section>
<section id="our-voice" class="level2">
<h2 class="anchored" data-anchor-id="our-voice">Our Voice:</h2>
<p>Authoritative, Trustworthy, Cutting Edge</p>
</section>
<section id="our-editorial-sections" class="level2">
<h2 class="anchored" data-anchor-id="our-editorial-sections">Our Editorial Sections:</h2>
<p>Real World Data Science has four editorial sections. Please read through and consider where your piece would fit best. Each piece we publish needs to be tailored towards the focus of one of these sections.</p>
<p><a href="https://realworlddatascience.net/the-pulse/">THE PULSE</a></p>
<p>News, updates and real time commentary.</p>
<p>Purpose: To respond to current events, trends and debates in the data science world with rigour, insight and relevance.</p>
<p>Content Types: Articles that speak directly to current events/trends/launches</p>
<p>Example Call To Action: Invite readers to share your commentary with their networks as a trusted voice in the space. Invite engagement, discussion and debate over the topics.</p>
<p><a href="https://realworlddatascience.net/applied-insights/">APPLIED INSIGHTS</a><br>
How data science is used to solve real-world problems in business, public policy and beyond.</p>
<p>Purpose: To showcase real-world applications of data science, including hands-on tutorials, project walk-throughs, and case studies from industry, academia, or public service.</p>
<p>Content Types:</p>
<ul>
<li>High-quality step-by-step tutorials with code<br>
</li>
<li>Case studies detailing a problem, approach, and outcome<br>
</li>
<li>Lessons learned from real-world deployments</li>
</ul>
<p>Example Call To Action: Readers should walk away with something to try.</p>
<p><a href="https://realworlddatascience.net/foundation-frontiers/">FOUNDATIONS &amp; FRONTIERS</a><br>
The ideas behind the impact: the concepts, tools and methods that make data science possible.</p>
<p>Purpose: To deepen understanding of the theoretical and ethical foundations of data science, and to spotlight thought leadership and emerging ideas.</p>
<p>Content Types:</p>
<ul>
<li>Think-piece style articles with an engaging angle on methodology, ethics and standards<br>
</li>
<li>Interviews with thought-leaders<br>
</li>
<li><a href="https://realworlddatascience.net/foundation-frontiers/datasciencebites/">Data Science Bites</a> - our handy summaries/explainers of academic papers</li>
</ul>
<p>Example Call To Action: Invite discussion and engagement – pose questions and challenges to the reader.</p>
<p><a href="https://realworlddatascience.net/people-paths/">PEOPLE &amp; PATHS</a><br>
Strategic reflections on careers, leadership and professional evolution in data science.</p>
<p>Purpose: To explore the evolving nature of data science careers through the lens of experience, leadership, and long-term impact. This section highlights how professionals shape and are shaped by the field—through roles, decisions, and philosophies.</p>
<p>Content Types:</p>
<ul>
<li>Profiles of/interviews with senior professionals reflecting on career philosophy and leadership<br>
</li>
<li>Roundtables with experts on hiring, mentoring, or organisational design</li>
<li>Commentary on career-defining trends, such as the rise of AI governance or the shift toward interdisciplinary teams</li>
</ul>
<p>Example Call To Action: Encourage readers to share our strategic insights with their community.</p>
</section>
<section id="use-of-ai-in-submissions" class="level2">
<h2 class="anchored" data-anchor-id="use-of-ai-in-submissions">Use of AI in Submissions</h2>
<p>We recognise that LLMs and other generative AI tools are increasingly part of the data science workflow, from code generation and data cleaning to drafting documentation and shaping analysis. We welcome a transparent approach in submissions that have made use of these tools, and ask that authors include a declaration outlining where and how AI was used in the development of their submission. This helps us maintain transparency, uphold standards of reproducibility, and better understand the evolving role of AI in real-world data science practice.</p>
<p>To make your submission, please review our <a href="https://realworlddatascience.net/contributor-docs/contributor-guidelines.html">contributor guidelines</a> and email us at rwds@rss.org.uk.</p>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@johnsonvr?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Virgina Johnson</a> on <a href="https://unsplash.com/photos/turned-on-red-open-neon-sigange-QmNnZj_Ok-M?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Real World Data Science Editorial Board. 2025. “Call for Submissions” Real World Data Science, July 7, 2025. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/07/relaunch-CFS.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>Call for contributions</category>
  <category>Updates</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/08/relaunch-CFS.html</guid>
  <pubDate>Mon, 07 Jul 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/08/Images/open.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>We’re Back: Real World Data Science Relaunches</title>
  <dc:creator>Editorial Board</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/07/editors-relaunch.html</link>
  <description><![CDATA[ 





<p>You may have noticed our brief hiatus. Since publishing our series on AI - which covered <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/04/29/gen-ai-human-intel.html">the quest for human-level intelligence</a>, <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/ai-series-3.html">data-set risks</a>, <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/ai-series-2.html">ethical considerations</a> and much more - the ongoing deluge of content and commentary on AI in the wider world has continued to accelerate. This year has seen a surge in developments that sit at the intersection of data science and AI: from the growing use of synthetic data to overcome privacy and bias challenges, to the rise of multi-modal models that demand increasingly sophisticated data engineering and integration techniques. The emergence of Agentic AI has sparked new conversations around data provenance, model interpretability, and the reproducibility crisis in machine learning. Meanwhile, the meteoric rise of open-source disruptor DeepSeek triggered stock-market ruptures and industry panic, before <a href="https://www.theguardian.com/technology/2025/jan/27/deepseek-cyberattack-ai">cyber-attacks</a>, <a href="https://www.wiz.io/blog/wiz-research-uncovers-exposed-deepseek-database-leak">data leaks</a> and a <a href="https://gizmodo.com/deepseek-gets-an-f-in-safety-from-researchers-2000558645?utm_source=pocket_shared">failed safety test</a> complicated its standing - a parable for the volatility of the space, where data governance failures and safety oversights can rapidly derail innovation. 
At the same time, governments worldwide are <a href="https://www.aa.com.tr/en/europe/macron-announces-112b-in-ai-investment-over-coming-years/3477218?utm_source=pocket_saves">investing heavily</a> in <a href="https://assets.publishing.service.gov.uk/media/67851771f0528401055d2329/ai_opportunities_action_plan.pdf?utm_source=substack&amp;utm_medium=email">national data infrastructure</a> and advanced analytics capabilities, while grappling with how best to regulate a field that is evolving faster than policy can keep up.</p>
<p>The world of data science has been a dizzying place over the last few months, so we took a moment to pause and take stock. In the face of rapid change and constant noise, it felt important to reflect with intention on the role Real World Data Science can and should play in this evolving landscape. Now we’re back - ready to rejoin the conversation with renewed clarity and purpose.</p>
<p>As a project from the <a href="https://rss.org.uk/">Royal Statistical Society</a>, in partnership with the <a href="https://www.amstat.org/">American Statistical Association</a>, we are backed by organisations with nearly two centuries of history in championing sound evidence, rigorous methodology and ethical data use. These values form the foundation of our next phase - distilled into the essential pillars: data, evidence and decision. With an esteemed editorial board representing the cutting-edge of industry and academia, and an international network of practitioners working at the coalface of modern data science, we are uniquely placed to navigate the pace and complexity of today’s data-driven world. Real World Data Science will meet that world in real time with the RSS’s trademark steadying presence, bridging the gap between rigorous analysis and real-time relevance.</p>
<p>We are now returning with a slightly refreshed site, encompassing four editorial sections:<br>
<a href="https://realworlddatascience.net/the-pulse/">The Pulse</a> - covering news, updates and real-time commentary<br>
<a href="https://realworlddatascience.net/applied-insights/">Applied Insights</a> - exploring how data science is used to solve real-world problems in business, public policy and beyond<br>
<a href="https://realworlddatascience.net/foundation-frontiers/">Foundations &amp; Frontiers</a> - unpicking the ideas behind the impact: the concepts, tools and methods that make data science possible<br>
<a href="https://realworlddatascience.net/people-paths/">People &amp; Paths</a> - offering strategic reflections on careers, leadership and professional evolution in data science.</p>
<p>You can find the full details of these sections, plus guidance around submitting to them, in our new <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/08/relaunch-CFS.html">Call for Submissions</a>.</p>
<p>Despite these updates, we remain committed to providing content that is useful and relevant for practising data scientists seeking to learn good practices in the field and new potential applications.</p>
<p>The choices we make now will shape how data and AI serve society for years to come. If you’re working on the front lines of these changes, whether through research, practice, or critical reflection, we invite you to share your insights and help us build a future for data science that is thoughtful, transparent and grounded in real world understanding.</p>
<p><a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">Meet the Team</a></p>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Real World Data Science Editorial Board. 2025. “We’re Back: Real World Data Science Relaunches” Real World Data Science, July 7, 2025. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/07/editors-relaunch.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>Call for contributions</category>
  <category>Updates</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/07/editors-relaunch.html</guid>
  <pubDate>Mon, 07 Jul 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2025/07/07/images/team.png" medium="image" type="image/png" height="77" width="144"/>
</item>
<item>
  <title>New open access journal - RSS: Data Science and Artificial Intelligence</title>
  <link>https://realworlddatascience.net/the-pulse/posts/2024/08/01/RWDS-journal.html</link>
  <description><![CDATA[ 





<p><img src="https://realworlddatascience.net/the-pulse/posts/2024/08/01/images/RSS-DSAI-Logo-blue.png" class="img-fluid" style="width:80.0%" alt="RSS Data Science and AI logo"><br>
</p>
<p>The Royal Statistical Society (RSS) is launching a new fully open access journal, <em>RSS: Data Science and Artificial Intelligence</em>. Created in recognition of the growing importance of data science and artificial intelligence in science and society, the new journal’s remit spans the breadth of data science; you can <a href="https://academic.oup.com/rssdat/pages/general-instructions">submit articles</a> covering disciplines including statistics, machine learning, deep learning, econometrics, bioinformatics, engineering and computational social science.</p>
<p>As well as three primary paper types - method papers, applications papers and behind-the-scenes papers - RSS: Data Science and Artificial Intelligence will publish editorials, op-eds, interviews, and reviews/perspectives in line with <a href="https://academic.oup.com/rssdat/pages/about">its goal to become a primary destination for data scientists</a>.</p>
<p>Published by Oxford University Press, this new journal is the first addition to the RSS family of world-class statistics journals since 1952.</p>
<p><a href="https://academic.oup.com/rssdat/pages/why-publish">Learn more</a> about why <em>RSS: Data Science and Artificial Intelligence</em> is the ideal platform for showcasing your research.</p>
<div class="keyline">
<hr>
</div>
<section id="meet-the-journals-editors-in-chief-and-editorial-board" class="level3">
<h3 class="anchored" data-anchor-id="meet-the-journals-editors-in-chief-and-editorial-board">Meet the journal’s editors-in-chief and editorial board</h3>
<p>&nbsp;</p>
<div class="grid">
<div class="g-col-12 g-col-md-4">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2024/08/01/images/Mukherjee_Sach.jpg" class="img-fluid" alt="Photo of Sach Mukherjee, Director of Research in Machine Learning for Biomedicine at the MRC"></p>
<p><strong>Sach Mukherjee</strong> is Director of Research in Machine Learning for Biomedicine at the Medical Research Council (MRC) Biostatistics Unit, University of Cambridge, and Head of Statistics and Machine Learning at the German Center for Neurodegenerative Diseases.</p>
</div>
<div class="g-col-12 g-col-md-4">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2024/08/01/images/silvia-chiappa.jpeg" class="img-fluid" alt="Silvia Chiappa, Research Scientist at Google DeepMind"></p>
<p><strong>Silvia Chiappa</strong> is a Research Scientist at <a href="https://deepmind.com/">Google DeepMind</a> London, where she leads the Causal Intelligence team, and Honorary Professor at the <a href="https://www.ucl.ac.uk/computer-science/">Computer Science Department</a> of University College London.</p>
</div>
<div class="g-col-12 g-col-md-4">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2024/08/01/images/neil-lawrence.png" class="img-fluid" alt="Neil Lawrence, DeepMind Professor of Machine Learning at the University of Cambridge"></p>
<p><strong>Neil Lawrence</strong> is the inaugural DeepMind Professor of Machine Learning at the University of Cambridge. He has been working on machine learning models for over 20 years. He recently returned to academia after three years as Director of Machine Learning at Amazon.</p>
</div>
</div>
<p><br>
</p>
<p><strong>View the full editorial board here:</strong> <a href="https://academic.oup.com/rssdat/pages/editorial-board">Editorial Board | RSS Data Science | Oxford Academic (oup.com)</a></p>
<div class="article-btn">
<p><a href="../../../../../the-pulse/index.html">Discover more from The Pulse</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">

</div>
</div>
</div>


</section>

 ]]></description>
  <category>AI</category>
  <category>Data Science</category>
  <category>Machine learning</category>
  <category>Deep learning</category>
  <category>Econometrics</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2024/08/01/RWDS-journal.html</guid>
  <pubDate>Thu, 01 Aug 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2024/08/01/images/RSS-DS-AI-cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Editor’s note: Not saying goodbye, just saying…</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/03/editors-note.html</link>
  <description><![CDATA[ 





<p>It’s not easy to leave a brilliant group of people you’ve worked with for almost a decade, but in a month’s time I’ll be moving on from the Royal Statistical Society (RSS).</p>
<p>When I joined RSS in June 2014 I was looking for new challenges. I wanted to find out more about the ways statistics and data are used to understand and solve problems and inform decisions in science, business and industry, public policy, health… I could go on! Working for the RSS certainly delivered on that front: as editor of <a href="https://significancemagazine.com/">Significance</a> for eight years and of Real World Data Science more recently, I have had many opportunities to learn.</p>
<p>Pretty much every day of my working life for the past nine years, eight months or so involved speaking with expert statisticians and data scientists or reading about their work. When there were things I didn’t understand, they were always happy to explain. When I shared my ideas for how to make their articles clearer or more readable, they took the time to listen. Together, we worked to create accessible, engaging stories about statistics and data. There have been hundreds of these collaborations over the years – too many to namecheck individually – but I have enjoyed them all, and I’ve learned something from each of them.</p>
<p>Before I head off to pursue a new set of challenges and learning opportunities, I want to say a big thank you to all the RSS staff and members, past and present, that I’ve been lucky to call my colleagues. Thank you also to the staff and members of the American Statistical Association who have been valued partners on Significance over the years and now RWDS too. It’s been a privilege to work with you all.</p>
<p>The chance to launch RWDS has been a particular highlight of my time at RSS, and I am grateful to have had the support and input of The Alan Turing Institute and many of its wonderful staff and researchers on this project. I’m excited to see the site continue to grow and develop into a valuable resource for the data science community, and I look forward to reading an upcoming series of articles that will explore the statistical and data science perspectives on AI – stay tuned for more on this soon.</p>
<p>Statistics and data will continue to be a big part of my life, so this isn’t “goodbye.” Instead, I’ll just say, let’s keep in touch – and thank you for reading!</p>
<div class="article-btn">
<p><a href="../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@peet818?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Pete Pedroza</a> on <a href="https://unsplash.com/photos/thank-you-text-VyC0YSFRDTU?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2024. “Editor’s note: Not saying goodbye, just saying…” Real World Data Science, March 6, 2024. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/03/06/editors-note.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>People</category>
  <category>Updates</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/03/editors-note.html</guid>
  <pubDate>Wed, 06 Mar 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/03/images/thank-you.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>£10m for UK regulators to ‘jumpstart’ AI capabilities, as government commits to white paper approach</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/02/08/llms-whitepaper-response.html</link>
  <description><![CDATA[ 





<p>The UK government this week announced a £10 million investment to “jumpstart regulators’ AI capabilities” as part of its commitment to a “pro-innovation approach to AI regulation.” But will this be sufficient to answer criticisms that it has so far been “too slow” to give regulators the tools they need to police the growing usage of AI?</p>
<p>It was March last year when a Department for Science, Innovation and Technology (DSIT) white paper first set out the government’s <a href="https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper">principles- and context-based approach to regulating artificial intelligence</a>. This proposed to focus regulatory attention on “the context in which AI is deployed” rather than target specific technologies. Under this model, existing regulators, including the Information Commissioner’s Office, Ofcom, and the Competition and Markets Authority, would be responsible for ensuring that technologies deployed within their domains adhered to established rules – e.g., data protection regulation – and a common set of principles:</p>
<ul>
<li>Safety, security and robustness.</li>
<li>Appropriate transparency and explainability.</li>
<li>Fairness.</li>
<li>Accountability and governance.</li>
<li>Contestability and redress.</li>
</ul>
<p>The approach was broadly well received, as was clear from a debate at techUK’s <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/08/digital-ethics-summit.html">Digital Ethics Summit</a> last December. However, concerns were expressed about whether regulators would be funded sufficiently to meet the expectations set out in the March white paper. Also, the Royal Statistical Society, in <a href="https://rss.org.uk/RSS/media/File-library/Policy/2023/RSS-AI-white-paper-response-v2-2.pdf">its response to the white paper</a>, worried that “splitting responsibilities for regulating the use of AI between existing regulators does not meet the scale of the challenge,” and that “central leadership is required to give a clear, coherent and easily communicable framework that can be applied to all sectors.”</p>
<p>While the DSIT white paper proposed that a range of “central functions” be created to support regulators, <a href="https://committees.parliament.uk/committee/170/communications-and-digital-committee/news/199728/uk-will-miss-ai-goldrush-unless-government-adopts-a-more-positive-vision/">evidence presented to a House of Lords inquiry</a> last November suggested that regulators “did not appear to know what was happening” with these mooted teams and were “keen to see progress” on this front.</p>
<p>In reporting the outcomes of its inquiry last week, the House of Lords Communications and Digital Committee concluded that government was being “too slow” to give regulators the tools required to meet the objectives set out in the white paper, and that “speedier resourcing of government‑led central support teams is needed.”</p>
<p>“Relying on existing regulators to ensure good outcomes from AI will only work if they are properly resourced and empowered,” the committee said.</p>
<p>The £10 million funding for regulators announced this week is therefore likely to be welcomed. Money is earmarked to “help regulators develop cutting-edge research and practical tools to monitor and address risks and opportunities in their sectors, from telecoms and healthcare to finance and education,” according to <a href="https://www.gov.uk/government/news/uk-signals-step-change-for-regulators-to-strengthen-ai-leadership">a DSIT press release</a>. Speaking on February 6 at <a href="https://parliamentlive.tv/event/index/68ecee17-2896-4002-8736-1608229db364?in=15:27:41">a hearing of the Lords Communications and Digital Committee</a>, Michelle Donelan, Secretary of State for Science, Innovation and Technology, said that the government would “stay on top” of what regulators need to be able to fulfil their responsibilities for regulating the use of AI in their sectors.</p>
<section id="consultation-response" class="level2">
<h2 class="anchored" data-anchor-id="consultation-response">Consultation response</h2>
<p>News of the funding for regulators came as part of <a href="https://www.gov.uk/government/consultations/ai-regulation-a-pro-innovation-approach-policy-proposals/outcome/a-pro-innovation-approach-to-ai-regulation-government-response">a long-awaited response by the government to the consultation on its AI regulation white paper</a>. The response essentially confirmed that the government was proceeding with its principles- and context-based approach to regulating AI, having received “strong support from stakeholders across society.”</p>
<p>This approach is right for today, the government said, “as it allows us to keep pace with rapid and uncertain advances in AI.” However, it acknowledged that “the challenges posed by AI technologies will ultimately require legislative action in every country once understanding of risk has matured.”</p>
<p>“Highly capable general-purpose AI systems” would, for example, present a particular challenge to the government’s current approach. It explained: “Even though some regulators can enforce existing laws against the developers of the most capable general-purpose systems within their current remits, the wide range of potential uses means that general-purpose systems do not currently fit neatly within the remit of any one regulator, potentially leaving risks without effective mitigations.”</p>
<p>As a next step in delivering on the white paper approach, the government is asking key regulators to publish an update on their strategic approach to AI by the end of April. This was welcomed by Royal Statistical Society (RSS) president Andrew Garrett, who said:</p>
<blockquote class="blockquote">
<p>“Urgency is certainly warranted, and the directive for key regulators to disclose their approach in the coming months is a positive development. Ensuring consistency and coherence not only among key regulators but also those who follow is crucial.”</p>
</blockquote>
<p>Garrett also <a href="https://realworlddatascience.net/foundation-frontiers/interviews/posts/2023/10/25/evaluating-ai.html">reiterated the need for government to engage with statisticians and data scientists</a>, particularly through its <a href="https://realworlddatascience.net/the-pulse/posts/2023/12/06/ai-fringe.html">new AI Safety Institute</a> (AISI). In the white paper consultation response, AISI is billed as being “fundamental to informing the UK’s regulatory framework”: it will “advance the world’s knowledge of AI safety by carefully examining, evaluating, and testing new frontier AI systems” and will also “research new techniques for understanding and mitigating AI risk.” Garrett said:</p>
<blockquote class="blockquote">
<p>“As always, fostering diversity of representation within government and regulatory bodies remains paramount; it cannot solely rely on input from major tech companies. It is especially important that the AI Safety Institute engages with a diverse array of voices, including statisticians and data scientists who play a pivotal role in both the development of AI systems and novel evaluation methodologies.”</p>
</blockquote>
</section>
<section id="risks-and-opportunities" class="level2">
<h2 class="anchored" data-anchor-id="risks-and-opportunities">Risks and opportunities</h2>
<p>Calls for a “diversity of representation within government and regulatory bodies” certainly chime with a warning bell sounded by the Lords Communications and Digital Committee last week, in the February 2 release of its <a href="https://committees.parliament.uk/committee/170/communications-and-digital-committee/news/199728/uk-will-miss-ai-goldrush-unless-government-adopts-a-more-positive-vision/">inquiry report into large language models and generative AI</a>. “Regulatory capture” by big commercial interests was highlighted as a danger to be avoided, amid concern that “the AI safety debate is being dominated by views narrowly focused on catastrophic risk, often coming from those who developed such models in the first place” and that “this distracts from more immediate issues like copyright infringement, bias and reliability.”<sup>1</sup></p>
<p>The committee called for enhanced governance and transparency measures in DSIT and AISI to guard against regulatory capture, and for a rebalancing away from a “narrow focus on high-stakes AI safety” toward a “more positive vision for the opportunities [of AI] and a more deliberate focus on near-term risks” including cyber security and disinformation.</p>
<p>It also wants to see greater action by the government in support of copyright. “Some tech firms are using copyrighted material without permission, reaping vast financial rewards,” reads the report. “The legalities of this are complex but the principles remain clear. The point of copyright is to reward creators for their efforts, prevent others from using works without permission, and incentivise innovation. The current legal framework is failing to ensure these outcomes occur and the Government has a duty to act. It cannot sit on its hands for the next decade and hope the courts will provide an answer.”</p>
<p>Again, here’s RSS president Andrew Garrett’s take on the Lords committee report:</p>
<center>
<iframe src="https://www.linkedin.com/embed/feed/update/urn:li:share:7159583475350585344" height="1091" width="504" frameborder="0" allowfullscreen="" title="Embedded post">
</iframe>
</center>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@yaopey?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Yaopey Yong</a> on <a href="https://unsplash.com/photos/white-concrete-building-near-body-of-water-during-night-time-flmPTUCjkto?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2024. “£10m for UK regulators to ‘jumpstart’ AI capabilities, as government commits to white paper approach.” Real World Data Science, February 8, 2024. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/02/08/llms-whitepaper-response.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>See, for example, <a href="https://realworlddatascience.net/the-pulse/posts/2023/06/05/no-AI-probably-wont-kill-us.html">“No, AI probably won’t kill us all – and there’s more to this fear campaign than meets the eye.”</a>↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>Large language models</category>
  <category>Public policy</category>
  <category>Risk</category>
  <category>Regulation</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/02/08/llms-whitepaper-response.html</guid>
  <pubDate>Thu, 08 Feb 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/02/08/images/parliament-and-thames.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>UK government sets out 10 principles for use of generative AI</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/22/gen-ai-framework.html</link>
  <description><![CDATA[ 





<p>The UK government has published <a href="https://www.gov.uk/government/publications/generative-ai-framework-for-hmg/generative-ai-framework-for-hmg-html">a framework for the use of generative AI</a>, setting out 10 principles for departments and staff to think about if using, or planning to use, this technology.</p>
<p>It covers the need to understand what generative AI is and what its limitations are, the lawful, ethical and secure use of the technology, and a requirement for “meaningful human control.”</p>
<p>The focus is on large language models (LLMs) as, according to the framework, these have “the greatest level of immediate application in government.”</p>
<p>It lists a number of promising use cases for LLMs, including the synthesis of complex data, software development, and the summarisation of text and audio. However, the document cautions against using generative AI for fully automated decision-making, or in contexts where data is limited or explainability of decision-making is required. For example, it warns that:</p>
<blockquote class="blockquote">
<p>“although LLMs can give the appearance of reasoning, they are simply predicting the next most plausible word in their output, and may produce inaccurate or poorly-reasoned conclusions.”</p>
</blockquote>
<p>And on the issue of explainability, it says that:</p>
<blockquote class="blockquote">
<p>“generative AI is based on neural networks, which are so-called ‘black boxes’. This makes it difficult or impossible to explain the inner workings of the model which has potential implications if in the future you are challenged to justify decisioning or guidance based on the model.”</p>
</blockquote>
<p>The framework goes on to discuss some of the practicalities of building generative AI solutions. It talks specifically about the value a multi-disciplinary team can bring to such projects, and emphasises the role of data scientists:</p>
<blockquote class="blockquote">
<p>“data scientists … understand the relevant data, how to use it effectively, and how to build/train and test models.”</p>
</blockquote>
<p>It also speaks to the need to “understand how to monitor and mitigate generative AI drift, bias and hallucinations” and to have “a robust testing and monitoring process in place to catch these problems.”</p>
<p>What do you make of the <a href="https://www.gov.uk/government/publications/generative-ai-framework-for-hmg/generative-ai-framework-for-hmg-html">Generative AI Framework for His Majesty’s Government</a>? What does it get right, and what needs more work?</p>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>And in case you missed it…
</div>
</div>
<div class="callout-body-container callout-body">
<p>New York State issued a policy on the <a href="https://its.ny.gov/acceptable-use-artificial-intelligence-technologies">Acceptable Use of Artificial Intelligence Technologies</a> earlier this month. Similar to the UK government framework, it references the need for human oversight of AI models and rules out use of “automated final decision systems.” There is also discussion of fairness, equity and explainability, and AI risk assessment and management.</p>
</div>
</div>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@therawhunter?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Massimiliano Morosinotto</a> on <a href="https://unsplash.com/photos/brown-tower-clock-under-cloudy-sy-paINk01G8Xk?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2024. “UK government sets out 10 principles for use of generative AI.” Real World Data Science, January 22, 2024. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/22/gen-ai-framework.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>AI ethics</category>
  <category>Large language models</category>
  <category>Monitoring</category>
  <category>Public policy</category>
  <category>Risk</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/22/gen-ai-framework.html</guid>
  <pubDate>Mon, 22 Jan 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/22/images/uk-parliament.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>When will the cherry trees bloom? Get ready to make and share your predictions!</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/18/cherry-blossom.html</link>
  <description><![CDATA[ 





<p>The <a href="https://competition.statistics.gmu.edu/">2024 International Cherry Blossom Prediction Competition</a> will open for entries on February 1, and Real World Data Science is once again proud to be a sponsor.</p>
<p>Contestants are invited to submit predictions for the date cherry trees will bloom in 2024 at five different locations – Kyoto, Japan; Liestal-Weideli, Switzerland; Vancouver, Canada; and Washington, DC and New York City, USA.</p>
<p>The competition organisers will provide all the publicly available data they can find for the bloom dates of cherry trees in these locations, and contestants will then be challenged to use this data “in combination with any other publicly available data (e.g., climate data) to provide reproducible predictions of the peak bloom date.”</p>
<p>“For this competition, we seek accurate, interpretable predictions that offer strong narratives about the factors that determine when cherry trees bloom and the broader consequences for local and global ecosystems,” say the organisers. “Your task is to predict the peak bloom date for 2024 and to estimate a prediction interval, a lower and upper endpoint of dates during which peak bloom is most probable.”</p>
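<p>The required output — a point prediction plus a prediction interval — can be sketched with a toy example. The snippet below is a minimal illustration in Python (which, like R, can be used in a Quarto document): it fits a linear trend to made-up bloom dates and reports a rough interval of ± 2 residual standard deviations. The data values are hypothetical, not from the competition:</p>

```python
import statistics

# Hypothetical historical peak-bloom dates as (year, day-of-year) pairs.
# The real competition supplies actual records for each location.
history = [(2014, 99), (2015, 97), (2016, 94), (2017, 96),
           (2018, 95), (2019, 92), (2020, 91), (2021, 88),
           (2022, 90), (2023, 87)]
years = [y for y, _ in history]
days = [d for _, d in history]

# Ordinary least-squares fit of bloom day against year.
ybar, dbar = statistics.mean(years), statistics.mean(days)
slope = sum((y - ybar) * (d - dbar) for y, d in history) / \
        sum((y - ybar) ** 2 for y in years)

# Point prediction for 2024, plus a crude interval of
# +/- 2 residual standard deviations around it.
pred = dbar + slope * (2024 - ybar)
resid = [d - (dbar + slope * (y - ybar)) for y, d in history]
spread = 2 * statistics.stdev(resid)
print(f"predicted day-of-year: {pred:.1f} "
      f"(interval: {pred - spread:.1f} to {pred + spread:.1f})")
```

<p>A competitive entry would of course draw on climate covariates and a properly calibrated interval; this only shows the shape of the deliverable: a point prediction with lower and upper endpoints.</p>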
<p>So that organisers can reproduce the predictions, entrants must submit all data and code in a <a href="https://quarto.org/">Quarto document</a>.</p>
<p>There’s cash and prizes on offer for the best entries, including having your work featured on Real World Data Science. <a href="https://competition.statistics.gmu.edu/">Head on over to the competition website for full details and rules</a>.</p>
<p>And, if you are looking for some inspiration, check out this <a href="https://realworlddatascience.net/applied-insights/tutorials/posts/2023/04/13/flowers.html">tutorial on the law of the flowering plants</a>, written by Jonathan Auerbach, a co-organiser of the prediction competition.</p>
<p>Good luck to all entrants!</p>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Photo by <a href="https://unsplash.com/@ajny?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">AJ</a> on <a href="https://unsplash.com/photos/pink-flowers-McsNra2VRQQ?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2024. “When will the cherry trees bloom? Get ready to make and share your predictions!” Real World Data Science, January 18, 2024. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/18/cherry-blossom.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>Coding</category>
  <category>Prediction</category>
  <category>Reproducible research</category>
  <category>Statistics</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/18/cherry-blossom.html</guid>
  <pubDate>Thu, 18 Jan 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/18/images/cherry-blossom.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Creating a web publication with Quarto: the Real World Data Science origin story</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/03/posit-conf-video.html</link>
  <description><![CDATA[ 





<p>When I attended posit::conf(2023) in Chicago last year, I gave a talk about creating Real World Data Science using Quarto, the open source publishing system developed by Posit. That talk is now online, along with all the other conference talks and keynotes.</p>
<p>My talk, “From Journalist to Coder: Creating a Web Publication with Quarto,” is embedded below. You can also find a selection of talks on <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/09/19/positconf-blog.html">our posit::conf highlights blog</a>. The <a href="https://www.youtube.com/playlist?list=PL9HYL-VRX0oRFZslRGHwHuwea7SvAATHp">full conference playlist is on YouTube</a>.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/ncDEqHxMWnE?si=A1GmLphRPlmspJCj" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2024. “Creating a web publication with Quarto: the Real World Data Science origin story.” Real World Data Science, January 03, 2024. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/03/posit-conf-video.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>Coding</category>
  <category>Communication</category>
  <category>Events</category>
  <category>Communities</category>
  <category>Open source</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/03/posit-conf-video.html</guid>
  <pubDate>Wed, 03 Jan 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2024/01/03/images/video-grab.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>A Christmas card in R for the Real World Data Science community</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/rwds-xmas-card.html</link>
  <description><![CDATA[ 





<p>A few weeks back, I managed to catch Nicola Rennie’s presentation to the <a href="https://www.meetup.com/en-AU/oxford-r-user-group/events/297417319/">Oxford R User Group on how to create Christmas cards in R</a>. It was a fun session, and thanks to Nicola’s clear and concise explanations, I felt emboldened to attempt my own design, using her code as a base.</p>
<p>If you missed the Meetup session, Nicola has kindly written <a href="../../../../../../applied-insights/tutorials/posts/2023/12/12/xmas-cards.html">a tutorial for Real World Data Science</a> that walks through all the necessary steps to create a snowman against a snowy night’s sky. You’ll want to read that tutorial first before returning to this blog.</p>
<p>My design uses the same basic setting as Nicola’s but updates the scene to reflect the Real World Data Science (RWDS) brand colours, and replaces the snowman with a Christmas tree adorned with coloured baubles.</p>
<section id="snowy-sky" class="level2">
<h2 class="anchored" data-anchor-id="snowy-sky">Snowy sky</h2>
<p>We begin by loading the following packages, adding a couple of extras to the ones Nicola uses:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggforce)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(sf)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(png)</span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(patchwork) </span></code></pre></div></div>
<p>Then we add the sky, now recoloured in RWDS purple using <code>fill</code> and <code>color</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">s1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_void</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(</span>
<span id="cb2-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.background =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_rect</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#939bc9"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#939bc9"</span>)</span>
<span id="cb2-5">  )</span>
<span id="cb2-6">s1</span></code></pre></div></div>
<p>We use the same code as Nicola to create the snowflakes, but we do this step first, before adding snow on the ground, as we’re using the RWDS site background colour, hex code <code>#f0eeeb</code>, to represent our settled snow:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add snowflakes</span></span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20231225</span>)</span>
<span id="cb3-3">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb3-4">snowflakes <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb3-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n),</span>
<span id="cb3-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n)</span>
<span id="cb3-7">)</span>
<span id="cb3-8">s2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> s1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(</span>
<span id="cb3-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> snowflakes,</span>
<span id="cb3-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mapping =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(</span>
<span id="cb3-12">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> x,</span>
<span id="cb3-13">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> y</span>
<span id="cb3-14">    ),</span>
<span id="cb3-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>,</span>
<span id="cb3-16">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pch =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span></span>
<span id="cb3-17">  )</span>
<span id="cb3-18">s2</span>
<span id="cb3-19"></span>
<span id="cb3-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># snow on ground</span></span>
<span id="cb3-21">s3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> s2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">annotate</span>(</span>
<span id="cb3-23">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">geom =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rect"</span>,</span>
<span id="cb3-24">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xmin =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xmax =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb3-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ymin =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ymax =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>,</span>
<span id="cb3-26">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#f0eeeb"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#f0eeeb"</span></span>
<span id="cb3-27">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-28">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlim</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-29">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylim</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-30">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coord_fixed</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">expand =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb3-31">s3</span></code></pre></div></div>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/images/s2.png" class="img-fluid" alt="Purple square with white snowflakes."></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/images/s3.png" class="img-fluid" alt="Purple square with white snowflakes and off-white rectangle at the bottom."></p>
</div>
</div>
</div>
</section>
<section id="oh-christmas-tree" class="level2">
<h2 class="anchored" data-anchor-id="oh-christmas-tree">Oh, Christmas tree</h2>
<p>To build her snowman, Nicola created a series of circles that were stacked and overlaid. A simple Christmas tree, though, requires a series of triangles. So, taking Nicola’s snowman’s nose (also a triangle) as our starting point, we coded three sets of coordinates – <code>tree_pts1</code>, <code>tree_pts2</code>, and <code>tree_pts3</code> – for three triangles of decreasing size that would sit on top of one another.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># coordinates for tree base</span></span>
<span id="cb4-2">tree_pts1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb4-4">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>,</span>
<span id="cb4-5">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>,</span>
<span id="cb4-6">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>,</span>
<span id="cb4-7">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span></span>
<span id="cb4-8">  ),</span>
<span id="cb4-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb4-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">byrow =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb4-11">)</span>
<span id="cb4-12"></span>
<span id="cb4-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># coordinates for tree middle</span></span>
<span id="cb4-14">tree_pts2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb4-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb4-16">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>,</span>
<span id="cb4-17">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>,</span>
<span id="cb4-18">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>,</span>
<span id="cb4-19">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span></span>
<span id="cb4-20">  ),</span>
<span id="cb4-21">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb4-22">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">byrow =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb4-23">)</span>
<span id="cb4-24"></span>
<span id="cb4-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># coordinates for tree top</span></span>
<span id="cb4-26">tree_pts3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb4-27">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb4-28">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.65</span>,</span>
<span id="cb4-29">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>,</span>
<span id="cb4-30">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.65</span>,</span>
<span id="cb4-31">    <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.65</span></span>
<span id="cb4-32">  ),</span>
<span id="cb4-33">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb4-34">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">byrow =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb4-35">)</span>
<span id="cb4-36"></span>
<span id="cb4-37"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># put tree together</span></span>
<span id="cb4-38">tree <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">st_multipolygon</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(tree_pts1),</span>
<span id="cb4-39">                             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(tree_pts2),</span>
<span id="cb4-40">                             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(tree_pts3)))</span>
<span id="cb4-41">s4 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> s3 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-42">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_sf</span>(</span>
<span id="cb4-43">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> tree,</span>
<span id="cb4-44">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chartreuse4"</span>,</span>
<span id="cb4-45">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chartreuse4"</span></span>
<span id="cb4-46">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-47">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coord_sf</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">expand =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb4-48">s4</span></code></pre></div></div>
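<p>One detail worth flagging about those matrices: each has four rows for a three-cornered shape because simple-features polygon rings must be closed, so the first vertex is repeated as the last. A minimal sketch (ours, not from the original post), assuming the <code>sf</code> package is loaded:</p>

```r
# Sketch: sf polygon rings must be closed, so a triangle takes
# four (x, y) rows -- the first vertex repeated as the last.
library(sf)

tri <- matrix(
  c(0.2, 0.3,   # bottom-left corner
    0.5, 0.6,   # apex
    0.8, 0.3,   # bottom-right corner
    0.2, 0.3),  # repeat the first point to close the ring
  ncol = 2, byrow = TRUE
)
poly <- st_polygon(list(tri))
st_area(poly)  # base 0.6 * height 0.3 / 2 = 0.09
```

<p>Dropping the repeated final row makes <code>st_polygon()</code> error with an unclosed-ring complaint, which is the quickest way to see why every coordinate matrix in the tree code ends where it began.</p>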
<p>A tree also requires a trunk, so we borrowed one of the rectangles from Nicola’s snowman’s hat for this purpose:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">s5 <span class="ot" style="color: #003B4F;
background-color: null;

font-style: inherit;">&lt;-</span> s4<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">annotate</span>(</span>
<span id="cb5-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">geom =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rect"</span>,</span>
<span id="cb5-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xmin =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.45</span>,</span>
<span id="cb5-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xmax =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.55</span>,</span>
<span id="cb5-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ymin =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>,</span>
<span id="cb5-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ymax =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>,</span>
<span id="cb5-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"brown"</span></span>
<span id="cb5-9">  )</span>
<span id="cb5-10">s5</span></code></pre></div></div>
<p>And, of course, no Christmas tree is complete without decorations. The “rocks” that formed the buttons and eyes on Nicola’s snowman were updated to become gold and red baubles for our tree:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add gold baubles</span></span>
<span id="cb6-2">s6 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> s5 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gold"</span>,</span>
<span id="cb6-4">             <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb6-5">               <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.57</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.62</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.45</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>),</span>
<span id="cb6-6">               <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.325</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.45</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.35</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.57</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.52</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>),</span>
<span id="cb6-7">               <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.5</span>)</span>
<span id="cb6-8">             ),</span>
<span id="cb6-9">             <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mapping =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> y, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> size)</span>
<span id="cb6-10">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_size_identity</span>()</span>
<span id="cb6-12">s6</span>
<span id="cb6-13"></span>
<span id="cb6-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add red baubles</span></span>
<span id="cb6-15">s7 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> s6 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red3"</span>,</span>
<span id="cb6-17">             <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb6-18">               <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.525</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.43</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.38</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.55</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>),</span>
<span id="cb6-19">               <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.375</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.55</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.65</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.43</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.48</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.375</span>),</span>
<span id="cb6-20">               <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.5</span>)</span>
<span id="cb6-21">             ),</span>
<span id="cb6-22">             <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mapping =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> y, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> size)</span>
<span id="cb6-23">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb6-24">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_size_identity</span>()</span>
<span id="cb6-25">s7</span></code></pre></div></div>
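<p>A note on <code>scale_size_identity()</code>, since the code above relies on it twice: because <code>size</code> is mapped inside <code>aes()</code>, ggplot2 would normally rescale the values and draw a legend; the identity scale instead uses the data values verbatim as point sizes. A stripped-down sketch of the same pattern (our example, assuming only <code>ggplot2</code>):</p>

```r
library(ggplot2)

# Three baubles whose `size` column is used verbatim as the point size
baubles <- data.frame(x = c(0.4, 0.5, 0.6), y = 0.5, size = c(2, 3, 4.5))
p <- ggplot(baubles, aes(x = x, y = y, size = size)) +
  geom_point(colour = "gold") +
  scale_size_identity()  # no rescaling, no size legend
p
```

<p>Without the identity scale, the 2-to-4.5 range generated by <code>runif()</code> would be remapped onto ggplot2's default size range and a legend would appear on the card.</p>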
<div class="quarto-layout-panel" data-layout-ncol="3">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/images/s4.png" class="img-fluid" alt="Purple square with white snowflakes and green tree in foreground."></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/images/s5.png" class="img-fluid" alt="Purple square with white snowflakes and green tree in foreground, now with brown trunk at foot of tree."></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/images/s7.png" class="img-fluid" alt="Purple square with white snowflakes and green tree in foreground. Tree is decorated with red and gold baubles of various sizes."></p>
</div>
</div>
</div>
</section>
<section id="seasons-greetings" class="level2">
<h2 class="anchored" data-anchor-id="seasons-greetings">Season’s greetings</h2>
<p>The final step was to add text to the top of the image, wishing you all a Merry Christmas, and our logo to the bottom, so you know who the card is from:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add text</span></span>
<span id="cb7-2">s8 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> s7 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">annotate</span>(</span>
<span id="cb7-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">geom =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,</span>
<span id="cb7-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>,</span>
<span id="cb7-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.875</span>,</span>
<span id="cb7-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Merry Christmas"</span>,</span>
<span id="cb7-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red3"</span>,</span>
<span id="cb7-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fontface =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>,</span>
<span id="cb7-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span></span>
<span id="cb7-11">  )</span>
<span id="cb7-12">s8</span>
<span id="cb7-13"></span>
<span id="cb7-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add logo </span></span>
<span id="cb7-15">path <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"images/rwds-logo-150px.png"</span></span>
<span id="cb7-16">img <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">readPNG</span>(path, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">native =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) </span>
<span id="cb7-17">s9 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> s8 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>                   </span>
<span id="cb7-18">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">inset_element</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p =</span> img, </span>
<span id="cb7-19">                <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">left =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3265</span>, </span>
<span id="cb7-20">                <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bottom =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, </span>
<span id="cb7-21">                <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">right =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6735</span>, </span>
<span id="cb7-22">                <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">top =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span></span>
<span id="cb7-23">  ) </span>
<span id="cb7-24">s9</span></code></pre></div></div>
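<p>Two packages do the work in that last block and are, we assume, loaded earlier in the post: <code>png</code> supplies <code>readPNG()</code>, and patchwork supplies <code>inset_element()</code>, whose <code>left</code>/<code>bottom</code>/<code>right</code>/<code>top</code> arguments are fractions of the plot area. A small sketch of the placement logic using a plot in place of the logo image:</p>

```r
# Sketch of patchwork's inset_element(): it places one element inside
# another, with left/bottom/right/top given as fractions of the plot area.
library(ggplot2)
library(patchwork)

base  <- ggplot() + xlim(0, 1) + ylim(0, 1)
inset <- ggplot(data.frame(x = 1, y = 1), aes(x, y)) + geom_point()

# left/right of 0.3265/0.6735 centre the inset horizontally;
# bottom/top of 0/0.2 confine it to the lowest fifth of the card.
combined <- base +
  inset_element(inset, left = 0.3265, bottom = 0, right = 0.6735, top = 0.2)
```
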
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/images/rwds-christmas-card.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Purple square with white snowflakes and green tree in foreground. Tree is decorated with red and gold baubles of various sizes. Text over tree reads Merry Christmas. Under tree is a logo for the Real World Data Science website."></p>
</figure>
</div>
<p>I hope you like the Christmas card! From all of us at Real World Data Science, thank you for your support throughout 2023. Merry Christmas, happy holidays, and best wishes for 2024!</p>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2023. “A Christmas card in R for the Real World Data Science community.” Real World Data Science, December 12, 2023. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/rwds-xmas-card.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>R</category>
  <category>Data visualisation</category>
  <category>Updates</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/rwds-xmas-card.html</guid>
  <pubDate>Tue, 12 Dec 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/12/images/rwds-christmas-card-thumb.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>AI and digital ethics in 2023: a ‘remarkable, eventful year’</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/08/digital-ethics-summit.html</link>
  <description><![CDATA[ 





<p>What a difference a year makes! That was the general tone of the conversation coming out of techUK’s <a href="https://www.techuk.org/digital-ethics-summit-2023-seizing-the-moment.html">Digital Ethics Summit</a> this week. At last year’s event, ChatGPT was but a few days old. An exciting, enticing prospect, sure – but not yet the phenomenon it would soon become. My notes from last year include only two mentions of the AI chatbot: Andrew Strait of the Ada Lovelace Institute expressing concern about the way ChatGPT had been released straight to the public, and Jack Stilgoe of UCL warning of the threat such technology poses to the social contract – public data trains it, while private firms profit.</p>
<p>A lot has happened since last December, as many of the speakers at Wednesday’s summit pointed out. UNESCO’s Gabriela Ramos commented on how <a href="https://www.gov.uk/government/publications/ai-safety-summit-programme">the UK’s AI Safety Summit</a>, <a href="https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/">US President Joe Biden’s executive order on AI</a>, and other international initiatives had brought about “a change in the conversation” on AI risk, safety, and assurance. Simon Staffell of Microsoft spoke of “a huge amount of progress” being made, building from principles into voluntary actions that companies and countries can take.</p>
<p>Luciano Floridi of Yale University described 2023 as a “remarkable, eventful year which we didn’t quite expect,” with various international efforts helping to build consensus on what needs to be done, and what needs to be regulated, to ensure the benefits of AI can be realised while harms are minimised. Camille Ford of the Centre for European Policy Studies noted that while attempts at global governance of AI make for a “crowded space” – with more than 200 documents in circulation – there are at least principles in common across the various initiatives, focusing on aspects such as transparency, reliability and trustworthiness, safety, privacy, and accountability and liability.</p>
<p>However, in some respects, we’ve not come as far as we could or should have over the past 12 months. Ford, for instance, called for more conversation on AI safety, and a frank discussion of whose terms define AI safety. Not only are there the risks and harms of AI outputs to consider, but also environmental harms, exploitative labour practices, and more besides. <a href="https://realworlddatascience.net/the-pulse/posts/2023/12/06/ai-fringe.html">Echoing the Royal Statistical Society’s recent AI debate</a>, Ford said we need to focus on the risks we face now, rather than being consumed by discussions about the existential and catastrophic risks of AI – which, for many, are still firmly in the realm of science fiction.</p>
<p>There also remains “a big mismatch” between the AI knowledge and skills that reside within tech companies and that of other communities, said Zeynep Engin of Data for Policy. And many speakers were clear that the global south needs a more prominent voice in the AI debate.</p>
<section id="regulatory-approaches" class="level2">
<h2 class="anchored" data-anchor-id="regulatory-approaches">Regulatory approaches</h2>
<p>The UK government’s AI Safety Summit has been criticised for focusing too much on the hypothetical existential risks of AI. But, on regulation at least, there was broad agreement that the UK’s principles- and sector-based approach, outlined in <a href="https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper">a March 2023 white paper</a>, is the right one. That’s not to say it’s perfect: discussions were had about whether regulatory bodies would be adequately funded to regulate the use of AI in their sectors, while Hetan Shah of the British Academy wondered “where was the golden thread” linking the AI white paper to the AI Safety Summit and its various pronouncements, including <a href="https://www.gov.uk/government/news/prime-minister-launches-new-ai-safety-institute">plans for an AI Safety Institute</a>. (On the Safety Institute in particular, Lord Tim Clement-Jones was sceptical of yet another body being drafted in to debate these issues – a point made by panellists at the RSS’s recent AI debate.)</p>
<p>Delegates also got to hear from the UK’s Information Commissioner directly. John Edwards delivered a keynote address in which he acknowledged the huge excitement surrounding the benefits AI promises to bring, while cautioning that deployment and use of AI must be done in accordance with existing rules on data protection and privacy. The technology may be new, he said, but the same old data rules apply: “Our legislation is founded on technology-neutral principles of general application. They are capable of adapting to numerous new technologies, as they have over the last 30 years and will continue to do.”</p>
<p>He warned that noncompliance with data protection rules and regulations “will not be profitable,” and that persistent misuse of AI and personal data for competitive advantage would be punished. Edwards concluded by saying that AI is built on the data of human individuals and should therefore be used to improve their lives, and not put them or their personal data at risk.</p>
</section>
<section id="elections-in-an-era-of-generative-ai" class="level2">
<h2 class="anchored" data-anchor-id="elections-in-an-era-of-generative-ai">Elections in an era of generative AI</h2>
<p>One major looming risk is the use of generative AI to create mis- and disinformation during election campaigns. Hans-Petter Dalen of IBM suggested that next year is perhaps the biggest year for elections in the history of mankind, with votes due in the UK, US, and India, to name but a few. Generative AI represents not a new threat, he said, but an “amplified” one – a point further developed by Henry Parker of Logically.ai. Parker spoke of the risk of large-scale breakdown in trust due to mis- or disinformation campaigns. Thanks to AI tools, he said, we are now seeing the “democratisation of disinformation.” What once might have cost millions of dollars and required a team of hundreds of people can now be done much more cheaply and with fewer human resources. As the Royal Society’s Areeq Chowdhury said, the challenge of disinformation has only become harder.</p>
<p>Asked how to counter this, Dalen said that if he were a politician, “I would certainly get my own blockchain and all my content would have been digitally watermarked from source – that’s what the blockchain does.” But digital watermarking is only part of the answer, added Parker. Identifying mis- and disinformation is both a question of provenance and of dissemination. Logically.ai is using AI as a tool to analyse behaviours around the circulation of mis- and disinformation, Parker said – positioning AI as but one solution to a problem it has helped exacerbate.</p>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@kajtek?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Kajetan Sumila</a> on <a href="https://unsplash.com/photos/a-screenshot-of-a-computer-bxaqUeVIGHU?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2023. “AI and digital ethics in 2023: a ‘remarkable, eventful year.’” Real World Data Science, December 8, 2023. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/08/digital-ethics-summit.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>AI</category>
  <category>Large language models</category>
  <category>Ethics</category>
  <category>Regulation</category>
  <category>Risk</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/08/digital-ethics-summit.html</guid>
  <pubDate>Fri, 08 Dec 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/12/08/images/kajetan-sumila-bxaqUeVIGHU-unsplash.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Evaluating artificial intelligence: How data science and statistics can make sense of AI models</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/posts/2023/12/06/ai-fringe.html</link>
  <description><![CDATA[ 





<p>A little over a month ago, governments, technology firms, multilateral organisations, and academic and civil society groups came together at Bletchley Park – home of Britain’s World War II code breakers – to discuss the safety and risks of artificial intelligence.</p>
<p>One output from that event was <a href="https://www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023">a declaration</a>, signed by countries in attendance, of their resolve to “work together in an inclusive manner to ensure human-centric, trustworthy and responsible AI that is safe, and supports the good of all.”</p>
<p>We also heard from UK prime minister Rishi Sunak of <a href="https://www.gov.uk/government/news/prime-minister-launches-new-ai-safety-institute">plans for an AI Safety Institute</a>, to be based in the UK, which will “carefully test new types of frontier AI before and after they are released to address the potentially harmful capabilities of AI models, including exploring all the risks, from social harms like bias and misinformation, to the most unlikely but extreme risk, such as humanity losing control of AI completely.”</p>
<p>But at a panel debate at the Royal Statistical Society (RSS) the day before the Bletchley Park gathering, data scientists, statisticians, and machine learning experts questioned whether such an institute would be sufficient to meet the challenges posed by AI; whether data inputs – compared to AI model outputs – are getting the attention they deserve; and whether the summit was overly focused on <a href="https://realworlddatascience.net/the-pulse/posts/2023/06/05/no-AI-probably-wont-kill-us.html">AI doomerism</a> and neglecting more immediate risks and harms. There were also calls for AI developers to be more driven to solve real-world problems, rather than just pursuing AI for AI’s sake.</p>
<p>The RSS event was chaired by Andrew Garrett, the Society’s president, and formed part of the national <a href="https://aifringe.org/">AI Fringe programme of activities</a>. The panel featured:</p>
<ul>
<li>Mihaela van der Schaar, John Humphrey Plummer professor of machine learning, artificial intelligence and medicine at the University of Cambridge and a fellow at The Alan Turing Institute.</li>
<li>Detlef Nauck, head of AI and data science research at BT, and a member of the <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">Real World Data Science editorial board</a>.</li>
<li>Mark Levene, principal scientist in the Department of Data Science at the National Physical Laboratory.</li>
<li>Martin Goodson, chief executive of Evolution AI, and former chair of the RSS Data Science and AI Section.</li>
</ul>
<p>What follows are some edited highlights and key takeaways from the discussion.</p>
<div class="keyline">
<hr>
</div>
<section id="ai-safety-and-ai-risks" class="level2">
<h2 class="anchored" data-anchor-id="ai-safety-and-ai-risks">AI safety, and AI risks</h2>
<p><strong>Andrew Garrett:</strong> For those who were listening to the commentary last week, the PM [prime minister] made a very interesting speech. Rishi Sunak announced the creation of the world’s first AI Safety Institute in the UK, to examine, evaluate and test new types of AI. He also stated that he pushed hard to agree the first-ever international statement about the risks of AI because, in his view, there wasn’t a shared understanding of the risks that we face. He cited the example of the IPCC, the Intergovernmental Panel on Climate Change, as a model for establishing a truly global panel to publish a “state of AI science” report. And he also announced an investment in raw computing power, so around a billion pounds in a supercomputer, and £2.5 billion in quantum computers, making them available for researchers and businesses as well as government.</p>
<p>The RSS provided two responses this year to prominent [AI policy] reviews. The first was in June <a href="https://rss.org.uk/RSS/media/File-library/Policy/2023/RSS-AI-white-paper-response-v2-2.pdf">on the AI white paper</a>, and the second was on <a href="https://rss.org.uk/RSS/media/File-library/Policy/RSS_Evidence_Communications_and_Digital_Lords_Select_Committee_Inquiry_Large_Language_Models_September_2023.pdf">the House of Lords Select Committee inquiry into large language models</a> back in September. How do they relate to what the PM said? There’s some good news here, and maybe not quite so good news.</p>
<p>First, the RSS had requested investments in AI evaluation and a risk-based approach. And you could argue, by stating that there will be a safety institute, that that certainly ticks one of the boxes. We also recommended investment in open source, in computing power, and in data access. In terms of computing power, that was certainly in the [PM’s] speech. We spoke about strengthening leadership, and in particular including practitioners in the [AI safety] debate. A lot of academics and maybe a lot of the big tech companies have been involved in the debate, but we want to get practitioners – those close to the coalface – involved in the debate. I’m not sure we’ve seen too much of that. We recommended that strategic direction was provided, because it’s such a fast-moving area, and the fact that the Bletchley Park Summit is happening tomorrow, I think, is good for that. And we also recommended that data science capability was built amongst the regulators. I don’t think there was any mention of that.</p>
<p>That’s the context [for the RSS event today]. What I’m going to do now is ask each of the panellists to give an introductory statement around the AI summit, focusing on the safety aspects. What do they see as the biggest risk? And how would they mitigate or manage this risk?</p>
<p><strong>Detlef Nauck:</strong> I work at BT and run the AI and data science research programme. We’ve been looking at the safety, reliability, and responsibility of AI for quite a number of years already. Five years ago, we put in place a responsible AI framework in the company, and this is now very much tied into our data governance and risk management frameworks.</p>
<p>Looking at the AI summit, they’re focusing on what they call “frontier models,” and they’re missing a trick here because I don’t think we need to worry about all-powerful AI; we need to worry about inadequate AI that is being used in the wrong context. For me, AI is programming with data, and that means I need to know what sort of data has been used to build the model, and I need AI vendors to be upfront about it and to tell me: what data they have used to build it, how they have built it, and whether they’ve tested for bias. And there are no protocols around this. So, therefore, I’m very much in favour of AI evaluation. But I don’t want to wait for an institute for AI evaluation. I want the academic research that needs to be done around this, which hasn’t been done. I want everybody who builds AI systems to take this responsibility and document properly what they’re doing.</p>
<div class="pullquote-container">
<div class="grid">
<div class="g-col-12 g-col-lg-4">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2023/12/06/images/llm-3d-shapes-crop.png" class="img-fluid quarto-figure quarto-figure-left figure-img"></p>
</figure>
</div>
</div>
<div class="g-col-12 g-col-lg-8 pullquote-grid pullquote">
<p><img src="https://realworlddatascience.net/images/pullquote-purple.png" class="img-fluid" width="50"></p>
<p>I hear more and more a lot of companies talking about AI general intelligence, and how AI is going to take over the world, and I’m tremendously concerned about this. There is an opportunity to build AI that is human empowering, that keeps us strong, able, capable, intelligent, and can support us in all our human capabilities.</p>
</div>
</div>
</div>
<p><strong>Mihaela van der Schaar:</strong> I am an AI researcher building AI and machine learning technology. Before talking about the risks, I also would like to say that I see tremendous potential for good. Many of these machine learning AI models can transform areas that I find extremely important – healthcare and education – for the better. That being said, there are substantial risks, and we need to be very careful about that. First, if not designed well, AI can be both unsafe and biased, and that could lead to tremendous harm, especially in medicine and education. I completely agree with all the points that the Royal Statistical Society has made not only about open source but also about data access. This AI technology cannot be built unless you have access to high-quality data, and what I see happening a lot, especially in industry, is people have data sources that they’ll keep private, build second-rate or third-rate technology on them, and then turn that into commercialised products that are sold to us for a lot of money. If data is made widely available, the best as well as the safest AI can be produced, rather than monopolised.</p>
<p>Another area of risk that I’m especially worried about is human marginalisation. I hear more and more a lot of companies talking about AI general intelligence, and how AI is going to take over the world, and I’m tremendously concerned as an AI researcher about this. There is an opportunity to build AI that is human empowering, that keeps us strong, able, capable, intelligent, and can support us in all our human capabilities.</p>
<p><strong>Martin Goodson:</strong> The AI Safety Summit is starting tomorrow. But, unfortunately, I think the government are focusing on the wrong risks. There are lots of risks to do with AI, and if you look at the scoping document for the summit, it says that what they’re interested in is misuse risk and the risk of loss of control. Misuse risk is that bad actors will gain access to information that they shouldn’t have and build chemical weapons and things like that. And the loss of control risk is that we will have this superintelligence which is going to take over, and we could see – as the document actually mentions – the risk of the extinction of the human race, which I think is a bit overblown.</p>
<p>Both of these risks – the misuse risk and the loss of control risk – are potential risks. But we don’t really know how likely they are. We don’t even know whether they’re possible. But there are lots of risks that we do know are possible, like loss of jobs, and reductions in salary, particularly of white-collar jobs – that seems inevitable. There’s another risk, which is really important, which is the risk of monopolistic control by the small number of very powerful AI companies. These are the risks which are not just likely but are actually happening now – people are losing their jobs right now because of AI – and in terms of monopolistic control, OpenAI is the only company that has anything like a large language model as powerful as GPT-4. Even the mighty Google can’t really compete. This is a huge risk, I think, because we have no control over pricing: they could raise the prices if they wanted to; they could constrain access; they could only give access to certain people that they want to give access to. We don’t have any control over these systems.</p>
<p><strong>Mark Levene:</strong> I work in NPL as a principal scientist in the data science department. I’m also emeritus professor in Birkbeck, University of London. I have a long-standing expertise in machine learning and focus in NPL on trustworthy AI and uncertainty quantification. I believe that measurement is a key component in locking-in AI safety. Trustworthy AI and safe AI both have similar goals but different emphases. We strive to demonstrate the trustworthiness of an AI system so that we can have confidence in the technology making what we perceive as responsible decisions. Safe AI puts the emphasis on the prevention of harmful consequences. The risk [of AI] is significant, and it could potentially be catastrophic if we think of nuclear power plants, or weapons, and so on. I think one of the problems here is, who is actually going to take responsibility? This is a big issue, and not necessarily an issue for the scientist to decide. Also, who is accountable? For instance, the developers of large language models: are they the ones that are accountable? Or is it the people who deploy the large language models and are fine-tuning them for their use cases?</p>
<p>The other thing I want to emphasise is the socio-technical characteristics [of the AI problem]. We need to get an interdisciplinary team of people to actually try and tackle these issues.</p>
</section>
<section id="do-we-need-an-ai-safety-institute" class="level2">
<h2 class="anchored" data-anchor-id="do-we-need-an-ai-safety-institute">Do we need an AI Safety Institute?</h2>
<p><strong>Andrew Garrett:</strong> Do we need to have an AI Safety Institute, as Rishi Sunak has said? And if we don’t need one, why not?</p>
<p><strong>Detlef Nauck:</strong> I’m more in favour of encouraging academic research in the field and funding the kind of research projects that can look into how to build AI safely, [and] how to evaluate what it does. One of the key features of this technology is it has not come out of academic research; it has been built by large tech companies. And so, I think we have to do a bit of catch-up in scientific research, and in understanding how we build these models, what they can do, and how we control them.</p>
<p><strong>Mihaela van der Schaar:</strong> This technology has a life of its own now, and we are using it for all sorts of things that maybe were not initially intended. So, shall we create an AI [safety] institute? We can, but we need to realise first that testing AI and showing that it’s safe in all sorts of ways is complicated. I would dare say that doing that well is a big research challenge by itself. I don’t think just one institute will solve it. And I feel the industry needs to bear some of the responsibility. I was very impressed by Professor [Geoffrey] Hinton, who came to Cambridge and said, “I think that some of these companies should invest as much money in making safe AI as developing AI.” That resonated quite a lot with me.</p>
<p>Also, let’s not forget, many academic researchers wear two hats nowadays: they are professors, and they are working for big tech [companies] for a lot of money. So, if we take this academic and put them in this AI tech safety institute, we have the potential for corruption. I’m not saying that this will happen. But one needs to be very aware, and there needs to be a very big separation between who develops [AI technology] and who tests it. And finally, we need to realise that we may require an enormous amount of computation to be able to validate and test correctly, and very few academic or governmental organisations may have [that].</p>
<div class="pullquote-container">
<div class="grid">
<div class="g-col-12 g-col-lg-8 pullquote-grid pullquote">
<p><img src="https://realworlddatascience.net/images/pullquote-purple.png" class="img-fluid" width="50"></p>
<p>I think it’s an insult to the UK’s scientific legacy that we’re reduced to testing software that has been made by US companies. We have huge talents in this country. Why aren’t we using that talent to actually build something instead of testing something that someone else has made?</p>
</div>
<div class="g-col-12 g-col-lg-4">
<div class="quarto-figure quarto-figure-right">
<figure class="figure">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2023/12/06/images/llm-3d-shapes-crop.png" class="img-fluid quarto-figure quarto-figure-right figure-img"></p>
</figure>
</div>
</div>
</div>
</div>
<p><strong>Martin Goodson:</strong> Can I disagree with this idea of an evaluation institute? I think it’s a really, really bad idea, for two reasons. The first is an argument about fairness. If you look at drug regulation, who pays for clinical trials? It’s not the government. It’s the pharmaceutical companies. They spend billions on clinical trials. So, why do we want to do this testing for free for the big tech companies? We’re just doing product development for them. It’s insane! They should be paying to show that their products are safe.</p>
<p>The other reason is, I think it’s an insult to the UK’s scientific legacy that we’re reduced to testing software that has been made by US companies. I think it’s pathetic. We were one of the main leaders of the Human Genome Project, and we really pushed it – the Wellcome Trust and scientists in the UK pushed the Human Genome Project because we didn’t want companies to have monopolistic control over the human genome. People were idealistic, there was a moral purpose. But now, we’re so reduced that all we can do is test some APIs that have been produced by Silicon Valley companies. We have huge talents in this country. Why aren’t we using that talent to actually build something instead of testing something that someone else has made?</p>
<p><strong>Mark Levene:</strong> Personally, I don’t see any problem in having an AI institute for safety or any other AI institutes. I think what’s important in terms of taxpayers’ money is that whatever institute or forum is invested in, it’s inclusive. One thing that the government should do is, we should have a panel of experts, and this panel should be interdisciplinary. And what this panel can do is it can advise government of the state of play in AI, and advise the regulators. And this panel doesn’t have to be static, it doesn’t have to be the same people all the time.</p>
<p><strong>Andrew Garrett:</strong> To evaluate something, whichever way you choose to do it, you need to have an inventory of those systems. So, with the current proposal, how would this AI Safety Institute have an inventory of what anyone was doing? How would it even work in practice?</p>
<p><strong>Martin Goodson:</strong> Unless we voluntarily go to them and say, “Can you test out our stuff?” then they wouldn’t. That’s the third reason why it’s a terrible idea. You’d need a licensing regime, like for drugs. You’d need to license AI systems. But teenagers in their bedrooms are creating AI systems, so that’s impossible.</p>
</section>
<section id="lets-do-reality-centric-ai" class="level2">
<h2 class="anchored" data-anchor-id="lets-do-reality-centric-ai">Let’s do reality-centric AI!</h2>
<p><strong>Andrew Garrett:</strong> What are your thoughts about Rishi Sunak wanting the UK to be an AI powerhouse?</p>
<p><strong>Martin Goodson:</strong> It’s not going to be a powerhouse. This stuff about us being world-leading in AI, it’s just a fiction. It’s a fairy tale. There are no real supercomputers in the UK. There are moves to build something, like you mentioned in your introduction, Andrew. But what are they going to do with it? If they’re just going to build a supercomputer and carry on doing the same kinds of stuff that they’ve been doing for years, they’re not going to get anywhere. There needs to be a big project with an aim. You can build as many computers as you want. But if you haven’t got a plan for what to do with them, what’s the point?</p>
<p><strong>Mihaela van der Schaar:</strong> I really would agree with that. What about solving some real problem: trying to solve cancer; trying to solve our crisis in healthcare, where we don’t have enough infrastructure and doctors to take care of us? What about solving the climate change problem, or even traffic control, or preventing the next financial crisis? I wrote a little bit about that, and I call it “let’s do reality-centric AI.” Let’s have some goal that’s human empowering, take a problem that we have – energy, climate, cancer, Alzheimer’s, better education for children, and more diverse education for children – and let us solve these big challenges, and in the process we will build AI that’s hopefully more human empowering, rather than just saying, “Oh, we are going to solve everything if we have general AI.” Right now, I hear too much about AI for the sake of AI. I’m not sure, despite all the technology we build, that we have advanced in solving some real-world problems that are important for humanity – and imminently important.</p>
<p><strong>Martin Goodson:</strong> So, healthcare – I tried to make an appointment with my GP last week, and they couldn’t get me an appointment for four weeks. In the US you have this United States Medical Licensing Examination, and in order to practise medicine you need to pass all three components, with a score of about 60%. They are really hard tests. GPT-4 gets over 80% in all three of those. So, it’s perfectly plausible, I think, that an AI could do at least some of the role of the GP. But, you’re right, there is no mission to do that, there is no ambition to do that.</p>
<p><strong>Mihaela van der Schaar:</strong> Forget about replacing the doctors with ChatGPT, which I’m less sure is such a good idea. But, building AI to do the planning of healthcare, to say, “[Patient A], based on what we have found out about you, you’re not as high risk, maybe you can come in four weeks. But [patient B], you need to come tomorrow, because something is worrisome.”</p>
<p><strong>Martin Goodson:</strong> We can get into the details, but I think we are agreeing that a big mission to solve real problems would be a step forward, rather than worrying about these risks of superintelligences taking over everything, which is what the government is doing right now.</p>
</section>
<section id="managing-misinformation" class="level2">
<h2 class="anchored" data-anchor-id="managing-misinformation">Managing misinformation</h2>
<p><strong>Andrew Garrett:</strong> We have some important elections coming up in 2024 and 2025. We haven’t talked much about misinformation and disinformation. So, I’m interested to hear your views here. How much of a problem is that?</p>
<p><strong>Detlef Nauck:</strong> There’s a problem in figuring out when it happens, and that’s something we need to get our heads around. One thing that we’re looking at is, how do we make communication safe from bad actors? How do you know that you’re talking to the person you see on the camera and it’s not a deep fake? Detection mechanisms don’t really work, and they can be circumvented. So, it seems like what we need is new standards for communication systems, like watermarks and encryption built into devices. A camera should be able to say, “I’ve produced this picture, and I have watermarked it and it’s encrypted to a certain level,” and if you don’t see that, you can’t trust that what you see comes from a genuine camera, and it’s not artificially created. It’s more difficult around text and language – you can’t really watermark text.</p>
<p><strong>Mark Levene:</strong> Misinformation is not just a derivative of AI. It’s a derivative of social networks and lots of other things.</p>
<p><strong>Mihaela van der Schaar:</strong> I would agree that this is not only a problem with AI. We need to emphasise the role of education, and lifelong education. This is key to being able to comprehend, to judge for ourselves, to be trained to judge for ourselves. And maybe we need to teach different methods – from young kids to adults who are already working – to really exercise our own judgement. And that brings me to this AI for human empowerment. Can we build AI that is training us to become smarter, to become more able, more capable, more thoughtful, in addition to providing sources of information that are reliable and trustworthy?</p>
<p><strong>Andrew Garrett:</strong> So, empower people to be able to evaluate AI themselves?</p>
<p><strong>Mihaela van der Schaar:</strong> Yes, but not only AI – all information that is given to us.</p>
<p><strong>Martin Goodson:</strong> On misinformation, I think this is really an important topic, because large language models are extremely persuasive. I asked ChatGPT a puzzle question, and it calculated all of this stuff and gave me paragraphs of explanations, and the answer was [wrong]. But it was so convincing I was almost convinced that it was right. The problem is, these things have been trained on the internet and the internet is full of marketing – it’s trillions of words of extremely persuasive writing. So, these things are really persuasive, and when you put that into a political debate or an election campaign, that’s when it becomes really, really dangerous. And that is extremely worrying and needs to be regulated.</p>
<div class="pullquote-container">
<div class="grid">
<div class="g-col-12 g-col-lg-4">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/the-pulse/posts/2023/12/06/images/llm-3d-shapes-crop.png" class="img-fluid quarto-figure quarto-figure-left figure-img"></p>
</figure>
</div>
</div>
<div class="g-col-12 g-col-lg-8 pullquote-grid pullquote">
<p><img src="https://realworlddatascience.net/images/pullquote-purple.png" class="img-fluid" width="50"></p>
<p>At the moment, if you type something into ChatGPT and you ask for references, half of them will be made up. We know that, and also OpenAI knows that. But it could be that, if there’s regulation that things are traceable, you should be able to ask, ‘How did this information come about? Where did it come from?’</p>
</div>
</div>
</div>
<p><strong>Mark Levene:</strong> You need ways to detect it. Even that is a big challenge. I don’t know if it’s impossible, because, if there’s regulation, for example, there should be traceability of data. So, at the moment, if you type something into ChatGPT and you ask for references, half of them will be made up. We know that, and also OpenAI knows that. But it could be that, if there’s regulation that things are traceable, you should be able to ask, “How did this information come about? Where did it come from?” But I agree that if you just look at an image or some text, and you don’t know where it came from, it’s easy to believe. Humans are easily fooled, because we’re just the product of what we know and what we’re used to, and if we see something that we recognise, we don’t question it.</p>
</section>
<section id="audience-qa" class="level2">
<h2 class="anchored" data-anchor-id="audience-qa">Audience Q&amp;A</h2>
<section id="how-can-we-help-organisations-to-deploy-ai-in-a-responsible-way" class="level3">
<h3 class="anchored" data-anchor-id="how-can-we-help-organisations-to-deploy-ai-in-a-responsible-way">How can we help organisations to deploy AI in a responsible way?</h3>
<p><strong>Detlef Nauck:</strong> Help for the industry to deploy AI reliably and responsibly is something that’s missing, and for that, trust in AI is one of the things that needs to be built up. And you can only build up trust in AI if you know what these things are doing and they’re properly documented and tested. So that’s the kind of infrastructure, if you like, that’s missing. It’s not all big foundation models. It’s about, how do you actually use this stuff in practice? And 90% of that will be small, purpose-built AI models. That’s an area where the government can help. How do you empower smaller companies that don’t have the background of how AI works and how it can be used, how can they be supported in knowing what they can buy and what they can use and how they can use it?</p>
<p><strong>Mark Levene:</strong> One example from healthcare which comes to mind: when you do a test, let’s say, a blood test, you don’t just get one number, you should get an interval, because there’s uncertainty. What current [AI] models do is they give you one answer, right? In fact, there’s a lot of uncertainty in the answer. One thing that can build trust is to make transparent the uncertainty that the AI outputs.</p>
</section>
<section id="how-can-data-scientists-and-statisticians-help-us-understand-how-to-use-ai-properly" class="level3">
<h3 class="anchored" data-anchor-id="how-can-data-scientists-and-statisticians-help-us-understand-how-to-use-ai-properly">How can data scientists and statisticians help us understand how to use AI properly?</h3>
<p><strong>Martin Goodson:</strong> One big thing, I think, is in culture. In machine learning – academic research and in industry – there isn’t a very scientific culture. There isn’t really an emphasis on observation and experimentation. We hire loads of people coming out of an MSc or a PhD in machine learning, and they don’t know anything, really, about doing an experiment or selection bias or how data can trip you up. All they think about is, you get a benchmark set of data and you measure the accuracy of your algorithm on that. And so there isn’t this culture of scientific experimentation and observation, which is what statistics is all about, really.</p>
<p><strong>Mihaela van der Schaar:</strong> I agree with you, this is where we are now. But we are trying to change it. As a matter of fact, at the next big AI conference, NeurIPS, we plan to do a tutorial to teach people exactly this and bring some of these problems to the forefront, because trying really to understand errors in data, biases, confounders, misrepresentation – this is the biggest problem AI has today. We shouldn’t just build yet another, let’s say, classifier. We should spend time to improve the ability of these machine learning models to deal with all sorts of data.</p>
</section>
<section id="do-we-honestly-believe-yet-another-institute-and-yet-more-regulation-is-the-answer-to-what-were-grappling-with-here" class="level3">
<h3 class="anchored" data-anchor-id="do-we-honestly-believe-yet-another-institute-and-yet-more-regulation-is-the-answer-to-what-were-grappling-with-here">Do we honestly believe yet another institute, and yet more regulation, is the answer to what we’re grappling with here?</h3>
<p><strong>Detlef Nauck:</strong> I think we all agree, another institute is not going to cut it. One of the main problems is regulators are not trained on AI, so it’s the wrong people looking into it. This is where some serious upskilling is required.</p>
</section>
<section id="are-we-wrong-to-downplay-the-existential-or-catastrophic-risks-of-ai" class="level3">
<h3 class="anchored" data-anchor-id="are-we-wrong-to-downplay-the-existential-or-catastrophic-risks-of-ai">Are we wrong to downplay the existential or catastrophic risks of AI?</h3>
<p><strong>Martin Goodson:</strong> If I was an AI, a superintelligent AI, the easiest path for me to cause the extinction of the human race would be to spread misinformation about climate change, right? So, let’s focus on misinformation, because that’s an immediate danger to our way of life. Why are we focusing on science fiction? Let’s focus on reality.</p>
</section>
<section id="ai-tech-has-advanced-but-evaluation-metrics-havent-moved-forward.-why" class="level3">
<h3 class="anchored" data-anchor-id="ai-tech-has-advanced-but-evaluation-metrics-havent-moved-forward.-why">AI tech has advanced, but evaluation metrics haven’t moved forward. Why?</h3>
<p><strong>Mihaela van der Schaar:</strong> First, the AI community that I’m part of innovates at a very fast pace, and they don’t reward metrics. I am a big fan of metrics, and I can tell you, I can publish a method in these top conferences much faster than I can publish a metric. Number two, we often have in AI very stupid benchmarks, where we test everything on one dataset, and these datasets may be very wrong. On a more positive note, this is an enormous opportunity for machine learners and statisticians to work together and advance this very important field of metrics, of test sets, of data generating processes.</p>
<p><strong>Martin Goodson:</strong> The big problem with metrics right now is contamination, because most of the academic metrics and benchmark sets that we’re talking about, they’re published on the internet, and these systems are trained on the internet. I’ve already said that I don’t think this [evaluation] institute should exist. But if it did exist, there’s one thing that they could do, which is important, and that would be to create benchmark datasets that they do not publish. But obviously, you may decide, also, that the traditional idea of having a training set and a test set just doesn’t make any sense anymore. And there are loads of issues with data contamination, and data leakage between the training sets and the test sets.</p>
</section>
</section>
<section id="closing-thoughts-what-would-you-say-to-the-ai-safety-summit" class="level2">
<h2 class="anchored" data-anchor-id="closing-thoughts-what-would-you-say-to-the-ai-safety-summit">Closing thoughts: What would you say to the AI Safety Summit?</h2>
<p><strong>Andrew Garrett:</strong> If you were at the AI Safety Summit and you could make one point very succinctly, what would it be?</p>
<p><strong>Martin Goodson:</strong> You’re focusing on the wrong things.</p>
<p><strong>Mark Levene:</strong> What’s important is to have an interdisciplinary team that will advise the government, rather than to build these institutes, and that this team should be independent and a team which will change over time, and it needs to be inclusive.</p>
<p><strong>Mihaela van der Schaar:</strong> AI safety is complex, and we need to realise that people need to have the right expertise to be able to really understand the risks. And there is risk, as I mentioned before, of potential collusion, where people are both building the AI and saying it’s safe, and we need to separate these two worlds.</p>
<p><strong>Detlef Nauck:</strong> Focus on the data, not the models. That’s what’s important to build AI.</p>
<div class="article-btn">
<p><a href="../../../../../the-pulse/index.html">Discover more The Pulse</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Royal Statistical Society
</dd>
</dl>
<p>Images by <a href="https://cream3d.com/">Wes Cockx</a> &amp; <a href="https://deepmind.google/discover/visualising-ai/">Google DeepMind</a> / <a href="https://www.betterimagesofai.org">Better Images of AI</a> / AI large language models / <a href="https://creativecommons.org/licenses/by/4.0/">Licenced by CC-BY 4.0</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2023. “Evaluating artificial intelligence: How data science and statistics can make sense of AI models.” Real World Data Science, December 6, 2023. <a href="https://realworlddatascience.net/the-pulse/posts/2023/12/06/ai-fringe.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>AI</category>
  <category>Large language models</category>
  <category>Accountability</category>
  <category>Regulation</category>
  <category>Metrics</category>
  <category>Events</category>
  <guid>https://realworlddatascience.net/the-pulse/posts/2023/12/06/ai-fringe.html</guid>
  <pubDate>Wed, 06 Dec 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/posts/2023/12/06/images/llm-3d-shapes.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>How data science and statistics can shape the UK’s AI strategy</title>
  <dc:creator>Brian Tarran</dc:creator>
  <link>https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/10/30/ai-conf-panel.html</link>
  <description><![CDATA[ 





<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/7aZrkQIComM?si=7efQPy5m3ZCxe4sg" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<section id="about-the-panelists" class="level2">
<h2 class="anchored" data-anchor-id="about-the-panelists">About the panelists</h2>
<p><strong>Andrew Garrett</strong> (chair) is president of the Royal Statistical Society. He is executive vice president of scientific operations at the clinical research organisation ICON plc, where he is responsible for the strategic direction and operational delivery of a range of clinical trial services. Having worked extensively in the area of rare diseases, he has held various biostatistics managerial positions in the pharmaceutical industry, including vice president of biostatistics, medical writing and regulatory affairs at Quintiles (now IQVIA).</p>
<p><strong>Peter Wells</strong> is a technologist, who accidentally started a second career in public policy. He has both worked on AI policy and helped design AI-enabled services. After 20 years in the telecoms industry, he found himself spending 2014 developing digital government policy for the Labour Party. Since then he has worked with multiple governments and organisations including the Open Data Institute, Projects by IF, Google, Meta and the Government Digital Service.</p>
<p><strong>Maxine Setiawan</strong> is a data scientist specialising in AI and data risk and trusted AI in EY UK&amp;I. She works to help clients from various industries assess and manage risks from analytics and AI systems, and implement AI governance to ensure AI systems are built on principles of fairness, accountability, and trustworthiness. She combines her socio-technical background with an MSc in Social Data Science from the University of Oxford, and her experience working in data science within consulting firms.</p>
<p><strong>Sophie Carr</strong> is chair of the <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">Real World Data Science editorial board</a> and is the founder and owner of Bays Consulting, a data science company. Having trained as an aeronautical engineer, Sophie completed her PhD in Bayesian analysis part time whilst she worked and, following redundancy, founded her own company. She is the VP for education and statistical literacy at the RSS and sits on the executive committees of the Academy for Mathematical Sciences and the International Centre for Mathematical Sciences. She is also currently <a href="https://ima.org.uk/12382/worlds-most-interesting-mathematician-2019-dr-sophie-carr/">the world’s most interesting mathematician</a>.</p>
<p><strong>Chris Nemeth</strong> is a professor of statistics at Lancaster University. His primary research area is in probabilistic machine learning and computational statistics. He holds an EPSRC-funded Turing AI fellowship on Probabilistic Algorithms for Scalable and Computable Approaches to Learning (PASCAL), and through his fellowship he works closely with partners including Shell, Tesco, Elsevier, Microsoft Research and The Alan Turing Institute. He is chair of the <a href="https://rss.org.uk/membership/rss-groups-and-committees/sections/statistical-computing/">Royal Statistical Society Section on Computational Statistics and Machine Learning</a>.</p>
<p><strong>Karen Tingay</strong> is a principal statistical methodologist at the Office for National Statistics where she specialises in natural language processing and in managing complex survey imputation. She established and heads up the Text Data Subcommunity, a large network of public sector analysts to build capability and best practice guidance in managing and analysing unstructured text data, on behalf of the Government Data Science Community. She sits on several cross-government and international working groups on responsible use of generative AI.</p>
<div class="article-btn">
<p><a href="../../../../../../the-pulse/editors-blog/index.html">Back to Editors’ blog</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian. 2023. “How data science and statistics can shape the UK’s AI strategy.” Real World Data Science, October 30, 2023. <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/10/30/ai-conf-panel.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>AI</category>
  <category>Large language models</category>
  <category>Events</category>
  <guid>https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/10/30/ai-conf-panel.html</guid>
  <pubDate>Mon, 30 Oct 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/the-pulse/editors-blog/posts/2023/10/30/images/panel.png" medium="image" type="image/png" height="105" width="144"/>
</item>
</channel>
</rss>
