Evaluating artificial intelligence: How data science and statistics can make sense of AI models

How can we help organisations to deploy AI in a responsible way? Why have evaluation metrics trailed behind advances in AI technology? And, are data inputs receiving the attention they deserve in debates over AI development? Read on for highlights of the Royal Statistical Society’s recent AI Fringe event.

Large language models

Brian Tarran


December 6, 2023

A little over a month ago, governments, technology firms, multilateral organisations, and academic and civil society groups came together at Bletchley Park – home of Britain’s World War II code breakers – to discuss the safety and risks of artificial intelligence.

One output from that event was a declaration, signed by countries in attendance, of their resolve to “work together in an inclusive manner to ensure human-centric, trustworthy and responsible AI that is safe, and supports the good of all.”

We also heard from UK prime minister Rishi Sunak of plans for an AI Safety Institute, to be based in the UK, which will “carefully test new types of frontier AI before and after they are released to address the potentially harmful capabilities of AI models, including exploring all the risks, from social harms like bias and misinformation, to the most unlikely but extreme risk, such as humanity losing control of AI completely.”

But at a panel debate at the Royal Statistical Society (RSS) the day before the Bletchley Park gathering, data scientists, statisticians, and machine learning experts questioned whether such an institute would be sufficient to meet the challenges posed by AI; whether data inputs – compared to AI model outputs – are getting the attention they deserve; and whether the summit was overly focused on AI doomerism and neglecting more immediate risks and harms. There were also calls for AI developers to be more driven to solve real-world problems, rather than just pursuing AI for AI’s sake.

The RSS event was chaired by Andrew Garrett, the Society’s president, and formed part of the national AI Fringe programme of activities. The panel featured:

What follows are some edited highlights and key takeaways from the discussion.

AI safety, and AI risks

Andrew Garrett: For those who were listening to the commentary last week, the PM [prime minister] made a very interesting speech. Rishi Sunak announced the creation of the world’s first AI Safety Institute in the UK, to examine, evaluate and test new types of AI. He also stated that he pushed hard to agree the first ever international statement about the risks of AI because, in his view, there wasn’t a shared understanding of the risks that we face. He used the example of the IPCC, the Intergovernmental Panel on Climate Change, to establish a truly global panel to publish a “state of AI science” report. And he also announced an investment in raw computing power, so around a billion pounds in a supercomputer, and £2.5 billion in quantum computers, making them available for researchers and businesses as well as government.

The RSS provided two responses this year to prominent [AI policy] reviews. The first was in June on the AI white paper, and the second was on the House of Lords Select Committee inquiry into large language models back in September. How do they relate to what the PM said? There’s some good news here, and maybe not quite so good news.

First, the RSS had requested investments in AI evaluation and a risk-based approach. And you could argue, by stating that there will be a safety institute, that that certainly ticks one of the boxes. We also recommended investment in open source, in computing power, and in data access. In terms of computing power, that was certainly in the [PM’s] speech. We spoke about strengthening leadership, and in particular including practitioners in the [AI safety] debate. A lot of academics and maybe a lot of the big tech companies have been involved in the debate, but we want to get practitioners – those close to the coalface – involved in the debate. I’m not sure we’ve seen too much of that. We recommended that strategic direction was provided, because it’s such a fast-moving area, and the fact that the Bletchley Park Summit is happening tomorrow, I think, is good for that. And we also recommended that data science capability was built amongst the regulators. I don’t think there was any mention of that.

That’s the context [for the RSS event today]. What I’m going to do now is ask each of the panellists to give an introductory statement around the AI summit, focusing on the safety aspects. What do they see as the biggest risk? And how would they mitigate or manage this risk?

Detlef Nauck: I work at BT and run the AI and data science research programme. We’ve been looking at the safety, reliability, and responsibility of AI for quite a number of years already. Five years ago, we put up a responsible AI framework in the company, and this is now very much tied into our data governance and risk management frameworks.

Looking at the AI summit, they’re focusing on what they call “frontier models,” and they’re missing a trick here because I don’t think we need to worry about all-powerful AI; we need to worry about inadequate AI that is being used in the wrong context. For me, AI is programming with data, and that means I need to know what sort of data has been used to build the model, and I need AI vendors to be upfront about it and to tell me: What is the data that they have used to build it, how have they built it, or if they’ve tested for bias? And there are no protocols around this. So, therefore, I’m very much in favour of AI evaluation. But I don’t want to wait for an institute for AI evaluation. I want the academic research that needs to be done around this, which hasn’t been done. I want everybody who builds AI systems to take this responsibility and document properly what they’re doing.

I hear more and more a lot of companies talking about AI general intelligence, and how AI is going to take over the world, and I’m tremendously concerned about this. There is an opportunity to build AI that is human empowering, that keeps us strong, able, capable, intelligent, and can support us in all our human capabilities.

Mihaela van der Schaar: I am an AI researcher building AI and machine learning technology. Before talking about the risks, I also would like to say that I see tremendous potential for good. Many of these machine learning AI models can transform for the better areas that I find extremely important – healthcare and education. That being said, there are substantial risks, and we need to be very careful about that. First, if not designed well, AI can be both unsafe as well as biased, and that could lead to tremendous impact, especially in medicine and education. I completely agree with all the points that the Royal Statistical Society has made not only about open source but also about data access. This AI technology cannot be built unless you have access to high quality data, and what I see a lot happening, especially in industry, is people have data sources that they’ll keep private, build second-rate or third-rate technology on them, and then turn that into commercialised products that are sold to us for a lot of money. If data is made widely available, the best as well as the safest AI can be produced, rather than monopolised.

Another area of risk that I’m especially worried about is human marginalisation. I hear more and more a lot of companies talking about AI general intelligence, and how AI is going to take over the world, and I’m tremendously concerned as an AI researcher about this. There is an opportunity to build AI that is human empowering, that keeps us strong, able, capable, intelligent, and can support us in all our human capabilities.

Martin Goodson: The AI Safety Summit is starting tomorrow. But, unfortunately, I think the government are focusing on the wrong risks. There are lots of risks to do with AI, and if you look at the scoping document for the summit, it says that what they’re interested in is misuse risk and the risk of loss of control. Misuse risk is that bad actors will gain access to information that they shouldn’t have and build chemical weapons and things like that. And the loss of control risk is that we will have this super intelligence which is going to take over and we should see, as is actually mentioned, the risk of the extinction of the human race, which I think is a bit overblown.

Both of these risks – the misuse risk and the loss of control risk – are potential risks. But we don’t really know how likely they are. We don’t even know whether they’re possible. But there are lots of risks that we do know are possible, like loss of jobs, and reductions in salary, particularly of white-collar jobs – that seems inevitable. There’s another risk, which is really important, which is the risk of monopolistic control by the small number of very powerful AI companies. These are the risks which are not just likely but are actually happening now – people are losing their jobs right now because of AI – and in terms of monopolistic control, OpenAI is the only company that has anything like a large language model as powerful as GPT-4. Even the mighty Google can’t really compete. This is a huge risk, I think, because we have no control over pricing: they could raise the prices if they wanted to; they could constrain access; they could only give access to certain people that they want to give access to. We don’t have any control over these systems.

Mark Levene: I work in NPL as a principal scientist in the data science department. I’m also emeritus professor in Birkbeck, University of London. I have a long-standing expertise in machine learning and focus in NPL on trustworthy AI and uncertainty quantification. I believe that measurement is a key component in locking-in AI safety. Trustworthy AI and safe AI both have similar goals but different emphases. We strive to demonstrate the trustworthiness of an AI system so that we can have confidence in the technology making what we perceive as responsible decisions. Safe AI puts the emphasis on the prevention of harmful consequences. The risk [of AI] is significant, and it could potentially be catastrophic if we think of nuclear power plants, or weapons, and so on. I think one of the problems here is, who is actually going to take responsibility? This is a big issue, and not necessarily an issue for the scientist to decide. Also, who is accountable? For instance, the developers of large language models: are they the ones that are accountable? Or is it the people who deploy the large language models and are fine-tuning them for their use cases?

The other thing I want to emphasise is the socio-technical characteristics [of the AI problem]. We need to get an interdisciplinary team of people to actually try and tackle these issues.

Do we need an AI Safety Institute?

Andrew Garrett: Do we need to have an AI Safety Institute, as Rishi Sunak has said? And if we don’t need one, why not?

Detlef Nauck: I’m more in favour of encouraging academic research in the field and funding the kind of research projects that can look into how to build AI safely, [and] how to evaluate what it does. One of the key features of this technology is it has not come out of academic research; it has been built by large tech companies. And so, I think we have to do a bit of catch up in scientific research and in understanding how are we building these models, what can they do, and how do we control them?

Mihaela van der Schaar: This technology has a life of its own now, and we are using it for all sorts of things that maybe initially was not even intended. So, shall we create an AI [safety] institute? We can, but we need to realise first that testing AI and showing that it’s safe in all sorts of ways is complicated. I would dare say that doing that well is a big research challenge by itself. I don’t think just one institute will solve it. And I feel the industry needs to bear some of the responsibility. I was very impressed by Professor [Geoffrey] Hinton, who came to Cambridge and said, “I think that some of these companies should invest as much money in making safe AI as developing AI.” I resonated quite a lot with that.

Also, let’s not forget, many academic researchers have two hats nowadays: they are professors, and they are working for big tech [companies] for a lot of money. So, if we take this academic, we put them in this AI tech safety institute, we have potential for corruption. I’m not saying that this will happen. But one needs to be very aware, and there needs to be a very big separation between who develops [AI technology] and who tests it. And finally, we need to realise that we may require an enormous amount of computation to be able to validate and test correctly, and very few academic or governmental organisations may have [that].

I think it’s an insult to the UK’s scientific legacy that we’re reduced to testing software that has been made by US companies. We have huge talents in this country. Why aren’t we using that talent to actually build something instead of testing something that someone else has made?

Martin Goodson: Can I disagree with this idea of an evaluation institute? I think it’s a really, really bad idea, for two reasons. The first is an argument about fairness. If you look at drug regulation, who pays for clinical trials? It’s not the government. It’s the pharmaceutical companies. They spend billions on clinical trials. So, why do we want to do this testing for free for the big tech companies? We’re just doing product development for them. It’s insane! They should be paying to show that their products are safe.

The other reason is, I think it’s an insult to the UK’s scientific legacy that we’re reduced to testing software that has been made by US companies. I think it’s pathetic. We were one of the main leaders of the Human Genome Project, and we really pushed it – the Wellcome Trust and scientists in the UK pushed the Human Genome Project because we didn’t want companies to have monopolistic control over the human genome. People were idealistic, there was a moral purpose. But now, we’re so reduced that all we can do is test some APIs that have been produced by Silicon Valley companies. We have huge talents in this country. Why aren’t we using that talent to actually build something instead of testing something that someone else has made?

Mark Levene: Personally, I don’t see any problem in having an AI institute for safety or any other AI institutes. I think what’s important in terms of taxpayers’ money is that whatever institute or forum is invested in, it’s inclusive. One thing that the government should do is, we should have a panel of experts, and this panel should be interdisciplinary. And what this panel can do is it can advise government of the state of play in AI, and advise the regulators. And this panel doesn’t have to be static, it doesn’t have to be the same people all the time.

Andrew Garrett: To evaluate something, whichever way you chose to do it, you need to have an inventory of those systems. So, with the current proposal, how would this AI Safety Institute have an inventory of what anyone was doing? How would it even work in practice?

Martin Goodson: Unless we voluntarily go to them and say, “Can you test out our stuff?” then they wouldn’t. That’s the third reason why it’s a terrible idea. You’d need a licencing regime, like for drugs. You’d need to licence AI systems. But teenagers in their bedrooms are creating AI systems, so that’s impossible.

Let’s do reality-centric AI!

Andrew Garrett: What are your thoughts about Rishi Sunak wanting the UK to be an AI powerhouse?

Martin Goodson: It’s not going to be a powerhouse. This stuff about us being world leading in AI, it’s just a fiction. It’s a fairy tale. There are no real supercomputers in the UK. There are moves to build something, like you mentioned in your introduction, Andrew. But what are they going do with it? If they’re just going to build a supercomputer and carry on doing the same kinds of stuff that they’ve been doing for years, they’re not going to get anywhere. There needs to be a big project with an aim. You can build as many computers as you want. But if you haven’t got a plan for what to do with them, what’s the point?

Mihaela van der Schaar: I really would agree with that. What about solving some real problem: trying to solve cancer; trying to solve our crisis in healthcare, where we don’t have enough infrastructure and doctors to take care of us? What about solving the climate change problem, or even traffic control, or preventing the next financial crisis? I wrote a little bit about that, and I call it “let’s do reality-centric AI.” Let’s have some goal that’s human empowering, take a problem that we have – energy, climate, cancer, Alzheimer’s, better education for children, and more diverse education for children – and let us solve these big challenges, and in the process we will build AI that’s hopefully more human empowering, rather than just saying, “Oh, we are going to solve everything if we have general AI.” Right now, I hear too much about AI for the sake of AI. I’m not sure, despite all the technology we build, that we have advanced in solving some real-world problems that are important for humanity – and imminently important.

Martin Goodson: So, healthcare– I tried to make an appointment with my GP last week, and they couldn’t get me an appointment for four weeks. In the US you have this United States Medical Licencing Examination, and in order to practice medicine you need to pass all three components, you need to pass them by about 60%. They are really hard tests. GPT-4 for gets over 80% in all three of those. So, it’s perfectly plausible, I think, that an AI could do at least some of the role of the GP. But, you’re right, there is no mission to do that, there is no ambition to do that.

Mihaela van der Schaar: Forget about replacing the doctors with ChatGPT, which I’m less sure is such a good idea. But, building AI to do the planning of healthcare, to say, “[Patient A], based on what we have found out about you, you’re not as high risk, maybe you can come in four weeks. But [patient B], you need to come tomorrow, because something is worrisome.”

Martin Goodson: We can get into the details, but I think we are agreeing that a big mission to solve real problems would be a step forward, rather than worrying about these risks of superintelligences taking over everything, which is what the government is doing right now.

Managing misinformation

Andrew Garrett: We have some important elections coming up in 2024 and 2025. We haven’t talked much about misinformation, and then disinformation. So, I’m interested to get your views here. How much is that a problem?

Detlef Nauck: There’s a problem in figuring out when it happens, and that’s something we need to get our heads around. One thing that we’re looking at is, how do we make communication safe from bad actors? How do you know that you’re talking to the person you see on the camera and it’s not a deep fake? Detection mechanisms don’t really work, and they can be circumvented. So, it seems like what we need is new standards for communication systems, like watermarks and encryption built into devices. A camera should be able to say, “I’ve produced this picture, and I have watermarked it and it’s encrypted to a certain level,” and if you don’t see that, you can’t trust that what you see comes from a genuine camera, and it’s not artificially created. It’s more difficult around text and language – you can’t really watermark text.

Mark Levene: Misinformation is not just a derivative of AI. It’s a derivative of social networks and lots of other things.

Mihaela van der Schaar: I would agree that this is not only a problem with AI. We need to emphasise the role of education, and lifelong education. This is key to being able to comprehend, to judge for ourselves, to be trained to judge for ourselves. And maybe we need to teach different methods – from young kids to adults that are already working – to really exercise our own judgement. And that brings me to this AI for human empowerment. Can we build AI that is training us to become smarter, to become more able, more capable, more thoughtful, in addition to providing sources of information that are reliable and trustworthy?

Andrew Garrett: So, empower people to be able to evaluate AI themselves?

Mihaela van der Schaar: Yes, but not only AI – all information that is given to us.

Martin Goodson: On misinformation, I think this is really an important topic, because large language models are extremely persuasive. I asked ChatGPT a puzzle question, and it calculated all of this stuff and gave me paragraphs of explanations, and the answer was [wrong]. But it was so convincing I was almost convinced that it was right. The problem is, these things have been trained on the internet and the internet is full of marketing – it’s trillions of words of extremely persuasive writing. So, these things are really persuasive, and when you put that into a political debate or an election campaign, that’s when it becomes really, really dangerous. And that is extremely worrying and needs to be regulated.

At the moment, if you type something into ChatGPT and you ask for references, half of them will be made up. We know that, and also OpenAI knows that. But it could be that, if there’s regulation that things are traceable, you should be able to ask, ‘How did this information come about? Where did it come from?’

Mark Levene: You need ways to detect it. Even that is a big challenge. I don’t know if it’s impossible, because, if there’s regulation, for example, there should be traceability of data. So, at the moment, if you type something into ChatGPT and you ask for references, half of them will be made up. We know that, and also OpenAI knows that. But it could be that, if there’s regulation that things are traceable, you should be able to ask, “How did this information come about? Where did it come from?” But I agree that if you just look at an image or some text, and you don’t know where it came from, it’s easy to believe. Humans are easily fooled, because we’re just the product of what we know and what we’re used to, and if we see something that we recognise, we don’t question it.

Audience Q&A

How can we help organisations to deploy AI in a responsible way?

Detlef Nauck: Help for the industry to deploy AI reliably and responsibly is something that’s missing, and for that, trust in AI is one of the things that needs to be built up. And you can only build up trust in AI if you know what these things are doing and they’re properly documented and tested. So that’s the kind of infrastructure, if you like, that’s missing. It’s not all big foundation models. It’s about, how do you actually use this stuff in practice? And 90% of that will be small, purpose-built AI models. That’s an area where the government can help. How do you empower smaller companies that don’t have the background of how AI works and how it can be used, how can they be supported in knowing what they can buy and what they can use and how they can use it?

Mark Levene: One example from healthcare which comes to mind: when you do a test, let’s say, a blood test, you don’t just get one number, you should get an interval, because there’s uncertainty. What current [AI] models do is they give you one answer, right? In fact, there’s a lot of uncertainty in the answer. One thing that can build trust is to make transparent the uncertainty that the AI outputs.

How can data scientists and statisticians help us understand how to use AI properly?

Martin Goodson: One big thing, I think, is in culture. In machine learning – academic research and in industry – there isn’t a very scientific culture. There isn’t really an emphasis on observation and experimentation. We hire loads of people coming out of an MSc or a PhD in machine learning, and they don’t know anything, really, about doing an experiment or selection bias or how data can trip you up. All they think about is, you get a benchmark set of data and you measure the accuracy of your algorithm on that. And so there isn’t this culture of scientific experimentation and observation, which is what statistics is all about, really.

Mihaela van der Schaar: I agree with you, this is where we are now. But we are trying to change it. As a matter of fact, at the next big AI conference, NeurIPS, we plan to do a tutorial to teach people exactly this and bring some of these problems to the forefront, because trying really to understand errors in data, biases, confounders, misrepresentation – this is the biggest problem AI has today. We shouldn’t just build yet another, let’s say, classifier. We should spend time to improve the ability of these machine learning models to deal with all sorts of data.

Do we honestly believe yet another institute, and yet more regulation, is the answer to what we’re grappling with here?

Detlef Nauck: I think we all agree, another institute is not going to cut it. One of the main problems is regulators are not trained on AI, so it’s the wrong people looking into it. This is where some serious upskilling is required.

Are we wrong to downplay the existential or catastrophic risks of AI?

Martin Goodson: If I was an AI, a superintelligent AI, the easiest path for me to cause the extinction of the human race would be to spread misinformation about climate change, right? So, let’s focus on misinformation, because that’s an immediate danger to our way of life. Why are we focusing on science fiction? Let’s focus on reality.

AI tech has advanced, but evaluation metrics haven’t moved forward. Why?

Mihaela van der Schaar: First, the AI community that I’m part of innovates at a very fast pace, and they don’t reward metrics. I am a big fan of metrics, and I can tell you, I can publish much faster a method in these top conferences then I can publish a metric. Number two, we often have in AI very stupid benchmarks, where we test everything on one dataset, and these datasets may be very wrong. On a more positive note, this is an enormous opportunity for machine learners and statisticians to work together and advance this very important field of metrics, of test sets, of data generating processes.

Martin Goodson: The big problem with metrics right now is contamination, because most of the academic metrics and benchmark sets that we’re talking about, they’re published on the internet, and these systems are trained on the internet. I’ve already said that I don’t think this [evaluation] institute should exist. But if it did exist, there’s one thing that they could do, which is important, and that would be to create benchmark datasets that they do not publish. But obviously, you may decide, also, that the traditional idea of having a training set and a test set just doesn’t make any sense anymore. And there are loads of issues with data contamination, and data leakage between the training sets and the test sets.

Closing thoughts: What would you say to the AI Safety Summit?

Andrew Garrett: If you were at the AI Safety Summit and you could make one point very succinctly, what would it be?

Martin Goodson: You’re focusing on the wrong things.

Mark Levene: What’s important is to have an interdisciplinary team that will advise the government, rather than to build these institutes, and that this team should be independent and a team which will change over time, and it needs to be inclusive.

Mihaela van der Schaar: AI safety is complex, and we need to realise that people need to have the right expertise to be able to really understand the risks. And there is risk, as I mentioned before, of potential collusion, where people are both building the AI and saying it’s safe, and we need to separate these two worlds.

Detlef Nauck: Focus on the data, not the models. That’s what’s important to build AI.

Discover more Viewpoints

Copyright and licence
© 2023 Royal Statistical Society

Images by Wes Cockx & Google DeepMind / Better Images of AI / AI large language models / Licenced by CC-BY 4.0.

How to cite
Tarran, Brian. 2023. “Evaluating artificial intelligence: How data science and statistics can make sense of AI models.” Real World Data Science, December 6, 2023. URL