Large language models: Do we need to understand the maths, or simply recognise the limitations?

Do we need to understand the inner workings of large language models before we use them? Or, is it enough to simply teach people to recognise that model outputs can’t always be relied upon, and that some use cases are better than others? Members of the Royal Statistical Society’s Data Science and AI Section discuss these and other questions in part two of our Q&A.

Tags: Large language models, AI, Communication, Regulation

Author: Brian Tarran

Published: May 18, 2023

Part 1 of our conversation with the Royal Statistical Society’s Data Science and AI (DS&AI) Section ended on a discussion around the need to verify that large language models (LLMs), when embedded in workflows and operational processes, are working as intended. But there was also acknowledgement that this could be difficult to achieve, not least of all because, as Giles Pavey said, “nobody knows exactly how these things work – not even the people who build them.” And then, of course, there is the speed at which developments are taking place: Trevor Duguid Farrant made the point that an expert may not even have a chance to finish reviewing the latest version of an LLM before a new iteration is rolled out.

These issues – of verification, explainability and interpretability – are of particular interest to data scientists like Anjali Mazumder, whose work focuses on the impact AI technologies could have, and are having, on society and individuals.

In part 2 of our Q&A about ChatGPT and other LLM-powered advances, and what all of this might mean for data science, Mazumder kicks off the conversation by setting out her perspective.

Our full list of interviewees, in order of appearance, is: Anjali Mazumder, Detlef Nauck, Martin Goodson, Louisa Nolan, Piers Stobbs, Trevor Duguid Farrant, Giles Pavey, and Adam Davison.


Anjali Mazumder: I work in research, but I also sit at the intersection of government, industry, and civil society, looking at how they’re using these technologies. For me, it’s about knowing what the opportunities are but also understanding the limitations, the risks and the harms, and how we balance those and put in place oversight mechanisms that act as safeguards to ensure that these technologies don’t cause harm. We’re taking a very socio-technical approach that requires an interdisciplinary team to understand these issues and what should be done. Part of this is about not only the outcomes and the impact but also the upstream side of it – recognising that these models have been built on the work of people who have done the labelling, and that this also has implications – to say nothing of the associated environmental and energy issues!

Detlef Nauck: I think the regulators really have to look at this. It has come completely out of left field for them. All the regulators we are monitoring regulate the space as it was three years ago – they are mainly concerned about predictive models and bias. But if you look at, say, what Microsoft wants to do – putting GPT into Office 365 and into Bing – that will completely change how people interact with and consume information. I think the large tech companies really have a responsibility here, when they make this public, to make sure that people understand what this technology actually is, and how it can be used and has to be used.

Also, they need to open up about how these things have been built. There are a lot of stories around how OpenAI used cheap labour in order to do the labelling and reinforcement learning for ChatGPT, and these things have to become public knowledge; they need to become part of some kind of Kitemark for these models: “Ethically built, properly checked, hallucinate only a little bit. Whatever you do, don’t take it for granted. Check it!” That’s the kind of disclaimer they need to put on these models.

If you look at what Microsoft wants to do – putting GPT into Office 365 and into Bing – that will completely change how people interact with and consume information. Large tech companies really have a responsibility here, when they make this public, to make sure that people understand what this technology actually is.

Regulatory principles always seem to stress that AI systems should be understandable, and we should be able to explain how we get particular outputs. But a lot of our conversation has focused on how we don’t really know how these models work. So, is that, in itself, a problem, and is it something that the data science community can help with – to dig into how these things work and try and get that across – to help meet these principles of explainability and interpretability?

DN: That’s a very specialist job, I would say – specialist research into how these mathematical structures work. It’s not something I could do, and I’ve not seen any significant work there. One thing that we are interested in is whether we can do knowledge embedding, so that you can “teach” concepts that these models can then use to communicate, and that would lead to smaller systems where you have some understanding of what’s inside. But all of this kind of work, I think, is very much just beginning.

Martin Goodson: Do we actually need this? There’s sort of a big assumption there that you need to understand how LLMs work in order to build in the kinds of things that we care about as a society. But we don’t understand how humans think. Of course, we can ask a human, “Why did you make that decision?” You can’t understand the cause of that decision – that’s a complex neuroscience question. But you can ask what the reason is for making a decision, and you can ask an LLM what its reasoning is as well. I think a lot of these questions about explainability are stuck in the past, when you’re trying to explain how a linear model works. It’s really not the same thing when you’ve got an LLM where you can just say, “Why did you make that decision?”

Louisa Nolan: I was going to say something very similar. Most people don’t know how most things work…

DN: My point was, these things are still largely like the Infinite Improbability Drive in The Hitchhiker’s Guide to the Galaxy. You press a button, and you don’t really know what comes out, and that’s the problem we need to get our heads around.

LN: But people don’t know what percentages are, and yet we still use them for decision making. I don’t think people need to understand the maths behind LLMs, and I think it would be a hopeless job to expect everybody to do that. What we do need to understand is what LLMs can and can’t do. What’s the body of work that they are drawing from? What isn’t in there? What are the things that you need to check? So, for some things, it’ll be brilliant: if you’ve written something and you want it rewritten for a nine-year-old; if you want to summarise a paper; if you want to write code, as long as you already know how to code – these could be real labour-saving tasks. If you’re using ChatGPT to write a thesis about something that you haven’t looked at, that’s where the danger is. It’s this kind of simple understanding that people need to get in their heads – and the maths, except for the people who care about it, is beside the point, and probably detrimental, because it means that people won’t engage with it.

DN: I agree, but I wasn’t talking about the general public. I meant, the people who build these things – they should know what they’re doing.

There’s a big assumption that you need to understand how LLMs work in order to build in the kinds of things that we care about as a society. But we don’t understand how humans think. You can ask a human what their reason is for making a decision, and you can ask an LLM what its reasoning is as well.

We talked there about communication. There was a webinar recently by the Royal Statistical Society’s Glasgow Local Group, and the presenter, Hannah Rose Kirk, showed how you can take tabular data and statistical results and ask ChatGPT to produce a nice paragraph or two that explains the numbers. Is this the sort of thing that any of you have experimented with? Have you had any successes at using ChatGPT to translate data into readable English that decision makers can act on?

Piers Stobbs: I have an interesting use case. We had a basic survey: 200-odd responses, multiple languages, and we just said, “Please summarise the results of this survey contained in this CSV file.” And it came up with five or six relevant bullet points. What was amazing was that we could then interrogate it. For example, “Please compare the results that were in English versus in French, and describe the differences.” Again, it did it, but then you have the issue of, was it all correct? Well, the bulk of it was. Now I am intrigued by whether you can ask it to do correlations and some actual statistical things on a dataset, and whether it gets that right. I don’t know. We’ve not really got to that. But, to go back to one of the earlier discussion points around productivity, that initial survey work could easily have taken a week of someone’s time if we hadn’t run it through ChatGPT.
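As a rough illustration of the kind of workflow Piers describes, the sketch below feeds free-text survey responses from a CSV file into the OpenAI chat API, asks for a summary, and then poses a follow-up question in the same conversation. The file name, the column names (“language”, “response”), the model choice and the openai 0.x-era ChatCompletion interface (current when this interview took place) are all illustrative assumptions, not details of the actual project.

```python
# Minimal sketch: summarise free-text survey responses with the OpenAI chat API.
# Assumes a hypothetical survey.csv with "language" and "response" columns,
# and the openai 0.x-era interface (openai.ChatCompletion) from early 2023.
import csv

import openai

openai.api_key = "YOUR_API_KEY"  # assumption: in practice, supply this via an environment variable

with open("survey.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Flatten the responses into one block of text, tagged with the response language.
survey_text = "\n".join(f"[{r['language']}] {r['response']}" for r in rows)

messages = [
    {"role": "system", "content": "You are a careful analyst. Summarise only what is in the data provided."},
    {"role": "user", "content": "Please summarise the results of this survey as five or six bullet points:\n" + survey_text},
]

reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
summary = reply["choices"][0]["message"]["content"]
print(summary)

# Follow-up interrogation, as Piers describes: keep the whole conversation in
# the messages list so the model can refer back to the data and its own summary.
messages.append({"role": "assistant", "content": summary})
messages.append({"role": "user", "content": "Please compare the English and French responses and describe the differences."})
followup = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(followup["choices"][0]["message"]["content"])
```

Even with a sketch like this, the bullet points would still need spot-checking against the raw responses – which is exactly the issue the conversation turns to next.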

Trevor Duguid Farrant: Piers, in this case you’re interested in checking and seeing if it’s right. If you’d asked a group of people to do that survey for you and get the results, you’d have just accepted whatever they gave you back. You wouldn’t have questioned it.

PS: That’s true. And the results were plausible, certainly.

AM: I think one of the challenges is that the results could seem like they’re plausible, right – whether that’s a statistical output or a text output. This was not a proper experiment, but I asked ChatGPT about colleagues and friends who are quite prominent researchers – “Who is so-and-so?” – and it produced biographies that were quite plausibly them, but they weren’t them. It might have listed the correct PhD, say, but the date was off by a year, or the date was correct but it was from the wrong institution. So, depending on what the issue is, these seemingly plausible results could have more serious implications.

LN: So, just to join those two things together: for me, the question is not, “Do we understand how ChatGPT works?” As Martin says, we don’t understand how humans work, and surely we’re trying to develop something that enhances human thinking in some way. The more important question for me is, “How do we know that what is produced is useful?”

For me, the question is not, ‘Do we understand how ChatGPT works?’ We don’t understand how humans work, and surely we’re trying to develop something that enhances human thinking in some way. The more important question is, ‘How do we know that what is produced is useful?’

Giles, you mentioned previously that you’re doing some work at Unilever around how to minimise hallucination. I don’t know how much you can say on what direction that’s going in, and how successful you expect that to be, but that’s obviously going to be a really important part of refining these models to be more widely usable.

Giles Pavey: I’m by no means an expert, but there’s quite a lot you can do with both the architecture of it and also the pre-prompts that you put in – more or less saying, “Quote what the source is, and if you’re not sure, then tell me you’re not sure.” I think what’s interesting is the question of whether we’ll have to rely on OpenAI or Microsoft to do that work, and it will be just another thing that we have to trust them for. Or, will it be something that people within an organisation can put in themselves?
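As a very rough sketch of the “pre-prompt” side of what Giles describes, the snippet below wraps each question in a system message that tells the model to quote its sources and to say when it is unsure, and keeps the temperature low. The wording of the guardrail prompt, the model name and the openai 0.x-era interface are assumptions for illustration – this is not Unilever’s approach, and a pre-prompt alone does not guarantee the model will stop hallucinating.

```python
# Minimal sketch of a hallucination-limiting "pre-prompt" of the kind Giles describes.
# The prompt wording, model name and openai 0.x-era interface are illustrative
# assumptions, not any organisation's actual implementation.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supplied securely in practice

GUARDRAIL_PROMPT = (
    "Answer using only the reference material provided below. "
    "Quote the source passage for every claim you make. "
    "If the material does not contain the answer, or you are not sure, "
    "say 'I am not sure' rather than guessing."
)


def ask(question: str, reference_material: str) -> str:
    """Send a question with a guardrail system message and the trusted source text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # a lower temperature tends to reduce free invention
        messages=[
            {"role": "system", "content": GUARDRAIL_PROMPT},
            {"role": "user", "content": f"Reference material:\n{reference_material}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]
```

The architectural side Giles mentions – retrieving trusted documents and passing them in as the reference material – is the part that organisations would either build for themselves or have to rely on OpenAI or Microsoft to provide.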

MG: I think it’s absolutely critical that open-source models are developed that can compete with these tech companies, otherwise there’s going to be a huge transfer of power to these companies.

GP: Arguably, the single biggest issue is: who elected Sam Altman (no one), and are we, as a society, happy with him having so much power over our future?

To close us out, I’d like to return to a question Trevor posed earlier, which is: How might organisations like the Royal Statistical Society help companies to embrace LLMs and start using them, so that everyone can benefit from the technology?

Adam Davison: My instinct is that there’s some great parallel here with the stuff that the Data Science and AI Section have been doing in general, where we’ve said, “OK, there’s lots of good advice out there on how to do things in data science, but how do you make it practical? How do you make it real? How do you apply those ethical principles? How do you make sure you have people with the right technical understanding in charge of projects to get value?” If, five years ago, the hype around data science was leading organisations to hire 100 data scientists in the hope that something innovative would happen, well then, we don’t want those same organisations now thinking that they need to hire 100 prompt engineers and keep their fingers crossed for something special. Our focus has been on “industrial strength data science”, so I think we can extend that to show what “industrial strength LLM usage” looks like in practice.

Want to hear more from the RSS Data Science and AI Section? Sign up for its newsletter at rssdsaisection.substack.com.

Copyright and licence
© 2023 Royal Statistical Society

This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Images are not covered by this licence. Thumbnail image by Google DeepMind on Unsplash.

How to cite
Tarran, Brian. 2023. “Large language models: Do we need to understand the maths, or simply recognise the limitations?” Real World Data Science, May 18, 2023. URL