LLMs in the news: hype, tripe, and everything in between

We’re back discussing large language models after two weeks of ‘breakthrough’ announcements, excitable headlines, and some all-too-familiar ethical concerns.

Machine learning · Large language models · AI

Author: Brian Tarran

Published: December 9, 2022

Two weeks ago, I posted a Q&A with our editorial board member Detlef Nauck about large language models (LLMs), their drawbacks and risks. And since then we’ve had several big new announcements in this space. First came news from Meta (the company that owns Facebook) about Galactica, an LLM trained on scientific papers. This was followed by another Meta announcement, about Cicero, an AI agent that is apparently very good at playing the game Diplomacy. And then came perhaps the biggest launch of them all: ChatGPT from OpenAI, an LLM-based chatbot that millions of people have already started talking to.

Following these stories and the surrounding commentaries has been something of a rollercoaster ride. ChatGPT, for example, has been greeted in some quarters as if artificial general intelligence has finally arrived, while others point out that – impressive though it is – the technology is as prone to spouting nonsense as all the LLMs before it (including Galactica, the demo of which was quickly taken offline for this reason). Cicero, meanwhile, has impressed with its ability to play a game that is (a) very difficult and (b) reliant on dialogue, cooperation, and negotiation between players. It blends a language model with planning and reinforcement learning algorithms, meaning that it is trained not only on the rules of the game and how to win, but also on how to communicate with other players to achieve victory.

To help me make sense of all these new developments, and the myriad implications, I reached out to Harvey Lewis, another of our editorial board members and a partner in EY’s tax and law practice.

Q&A

Harvey, when I spoke with Detlef, he mentioned that one of the reasons we’re seeing investment in LLMs is because there’s this belief that they are somehow the route to artificial general intelligence (AGI). And there were headlines in some places that would perhaps convince a casual reader that AGI had been achieved following the release of ChatGPT. For example, the Telegraph described it as a “scarily intelligent robot who can do your job better than you”. What do you make of all that?

Harvey Lewis (HL): My personal view is that you won’t get to artificial general intelligence using just one technique like deep learning, because of the problematic nature of these models and the limitations of the data used in their training. I’m convinced that more general intelligence will come from a combination of different systems.

The challenge with many LLMs, as we’ve seen repeatedly, is that they’ve no real understanding of language or concepts represented within language. They’re good at finding patterns in semantics and grammatical rules that people use in writing, and then they use those patterns to create new expressions given a prompt. But they’ve no idea whether these outputs are factual, inaccurate or completely fabricated. As a consequence, they can produce outputs that are unreliable, but which sound authoritative, because they’re just repeating a style that we expect to see.
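
To make that point concrete, here is a minimal sketch (our illustration, not anything Harvey or OpenAI uses) of pattern-based generation with the open-source GPT-2 model, via Hugging Face’s transformers pipeline. Nothing in the loop checks whether the continuation is true; the model simply carries on the prompt in a statistically plausible style.

```python
# A minimal sketch of pattern-based generation, using the open-source GPT-2
# model via Hugging Face's `transformers` pipeline. Nothing here verifies
# facts: the model just continues the prompt in a plausible style.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation repeatable

generator = pipeline("text-generation", model="gpt2")

prompt = "According to leading researchers, the main cause of tides is"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=1)

# The continuation will read fluently and sound authoritative,
# whether or not it happens to be correct.
print(outputs[0]["generated_text"])
```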

Over the past couple of weeks, Twitter has been full of people either showing off astoundingly impressive outputs from LLMs or sharing examples of truly worrying nonsense. One example of the latter came when ChatGPT was asked to “describe how crushed porcelain added to breast milk can support the infant digestive system”. This made me think of a recent headline from VentureBeat, which asked whether AI is moving too fast for ethics to catch up. Do you think that it is?

HL: I find that to be an interesting philosophical question, because ethics does move very slowly, for good reason. When you think of issues of bias and discrimination and prejudice, or misinformation and other problems that we might have with AI systems, it shouldn’t be a surprise that these can occur. We’re aware of them. We’re aware of the ethical issues. So, why do they always seem to catch us by surprise? Is it because we have teams of people who simply aren’t aware of ethical issues or don’t have any appreciation of them? This points – for me, at least – in the direction of needing to have philosophers, ethicists, theologians, lawyers working in the technical teams that are developing and working on these systems, rather than having them on the periphery and talking about these issues but not directly involved themselves. I think it’s hugely important to ensure that you’ve got trust, responsibility, and ethics embedded in technical teams, because that’s the only way it seems that you can avoid some of these “surprises”.

When situations like these occur, I’m always reminded of Dr Ian Malcolm’s line from Jurassic Park: “…your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.” The mindset seems to be, let’s push the boundaries and see what we can do, rather than stop and think deeply about what the implications might be.

HL: There’s a balance to be struck between these things, though, right? Firstly, show consideration for some of the issues at the outset, and secondly, have checks and balances and safeguards built into the process by which you design, develop and implement these systems. That’s the only way to create the proper safeguards around the systems. I don’t think that there’s any lack of appreciation of what needs to be done; people have been talking about this now for quite a long time. But it’s about making sure organisationally that it is done, and that you’ve got an operating model which bakes these things into it; that the kinds of principles and governance that you want to have around these systems are written, publicised, and properly policed. There should be no fear of making advances in that kind of a context.

Also, I think open sourcing these models provides a way forward. A lot of large language models are open for use and for research, but aren’t themselves open sourced, so it’s very difficult to get underneath the surface and figure out exactly how they work. But with open source, you have opportunities for researchers, whether they’re technical or in the field of ethics, to go and investigate and find out exactly what’s going on. I think that would be a fantastic step forward. It doesn’t take you all the way, of course, because a large amount of the data that these systems use is never open sourced, so while you might get an understanding of the mechanics, you have no idea of what exactly went into them in the first place. But open sourcing is a very good way of getting some external scrutiny. It’s about being transparent, which is a core principle of AI ethics and responsible AI.

An image created by the Stable Diffusion 2.1 Demo. The model was asked to produce an image with the prompt, “Text from an old book cut up like a jigsaw puzzle with pieces missing”.

Thinking about LLMs and their questionable outputs, should there not be ways for users to help the models produce better, more accurate outputs?

HL: There are, but there are problems here too. I’ve been having an interesting dialogue with ChatGPT this morning, asking it about quantum computing.1 For each response to a prompt, users are encouraged to provide feedback on whether or not an output is good. But you’re only provided with the usual thumbs-up/thumbs-down ratings; there’s nothing nuanced about it. So, for example, I asked ChatGPT to provide me with good analogies that help to explain quantum computing in simple terms. The first analogy was a combination lock, which is not a good analogy. The chatbot suggested that quantum computing is like a combination lock in which you can test all of the combinations at the same time, but I don’t know of any combination locks where you can do this – being able to check only one combination at a time is the principal security mechanism of a combination lock! I asked it again for another analogy, and it suggested a coin toss where, when the coin is spinning in the air, you can see both heads and tails simultaneously, but it isn’t until you catch the coin and show its face that one of the states of the coin is resolved. That is a good analogy – it’s one I’ve also used myself. Now, the challenge I can see with a lot of these feedback approaches is that unless I know enough about quantum computing to understand that a combination lock is not a good analogy whereas a coin toss is, how am I to provide that kind of feedback? They’re relying to an extent on the user being able to make a distinction between what is correct and what is potentially incorrect or flawed, and I think that’s not a good way of approaching the problem.
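
Harvey’s concern about how coarse that feedback signal is can be sketched with a small, purely hypothetical data structure (not OpenAI’s actual feedback pipeline): one bit per response records that something was judged good or bad, but not why, and it is only as reliable as the rater’s own knowledge of the subject.

```python
# A hypothetical sketch (not OpenAI's feedback pipeline) of why a single
# thumbs-up/thumbs-down bit is a coarse and fallible training signal.
from dataclasses import dataclass


@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    thumbs_up: bool  # the only signal the interface collects


records = [
    FeedbackRecord(
        prompt="Give me a simple analogy for quantum computing",
        response="It's like a combination lock where you try every code at once",
        thumbs_up=False,  # flawed analogy, but only a knowledgeable rater spots it
    ),
    FeedbackRecord(
        prompt="Give me a simple analogy for quantum computing",
        response="It's like a spinning coin: both heads and tails until you catch it",
        thumbs_up=True,
    ),
]

# One bit per response cannot say *why* an answer failed (wrong facts, weak
# analogy, poor tone), and a rater who doesn't know the subject may well
# approve the flawed answer instead.
approval_rate = sum(r.thumbs_up for r in records) / len(records)
print(f"Approval rate: {approval_rate:.0%}")
```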

Final question for you, Harvey. There’s a lot of excitement around GPT-4, which is apparently coming soon. The rumour mill says it will bring a leap forward in performance. But what do you expect we’ll see – particularly with regards to the issues we’ve talked about today?

HL: I’ve likened some of the large language models and their approach of “bigger is better” to the story of the Tower of Babel – trying to reach heaven by building a bigger and bigger tower, basically. That is not going to achieve the objective of artificial general intelligence, no matter how sophisticated an LLM might appear. That said, language is a fascinating area. I’m not a linguist, but I spend a lot of my time on natural language processing using AI systems. Language responds very well to AI because it is pattern-based. We speak using patterns, we write using patterns, and these can be inferred by machines from many examples. The addition of more parameters in the models allows them to understand patterns that extend further into the text, and I suspect that outputs from these kinds of models are going to be largely indistinguishable from the sorts of things that you or I might write.

But I also think that increasing the number of parameters runs a real risk – and we’re starting to see this in other generative models – where prompts become so specific that the models aren’t actually picking up on patterns but on specific instances of training data and text they’ve seen before. So, buried amongst these fantastically written articles on all kinds of subjects are going to be more examples of plagiarism, which is a problem, and more examples of spelling mistakes and other kinds of issues, because these, too, are patterns that the models may pick up and reproduce.

This potentially introduces a whole new breed of problems that the community has to deal with – as long as they don’t get fixated upon the height of the tower and the quality of some of the examples that are shown, and instead realise that there are some genuine underlying difficulties and challenges that need to be solved.
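
Harvey’s earlier observation that language “responds very well to AI because it is pattern-based” can be illustrated with a deliberately tiny toy model of our own (nowhere near a modern LLM): even bigram counts over a short corpus produce text that mimics the source’s style, and because such a small model can only parrot phrases it has already seen, it also gives a crude picture of the memorisation risk described above.

```python
# A toy illustration (nowhere near a modern LLM) of pattern-based generation:
# count which word follows which in a small corpus, then sample a continuation.
import random
from collections import defaultdict

corpus = (
    "we speak using patterns and we write using patterns and machines "
    "can infer these patterns from many many examples of written text"
).split()

# Bigram statistics: for each word, the words that have followed it.
followers = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word].append(next_word)

random.seed(0)
word = "we"
generated = [word]
for _ in range(12):
    candidates = followers.get(word)
    if not candidates:  # dead end: no observed continuation
        break
    word = random.choice(candidates)
    generated.append(word)

# The output mimics the style of the corpus purely by repeating observed patterns.
print(" ".join(generated))
```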

Have you got news for us?

Is there a recent data science story you’d like our team to discuss? Do you have your own thoughts to share on a hot topic, or a burning question to put to the community? If so, either comment below or contact us.

Copyright and licence
© 2022 Royal Statistical Society

This work is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence, except where otherwise noted.

How to cite
Tarran, Brian. 2022. “LLMs in the news: hype, tripe, and everything in between.” Real World Data Science, December 9, 2022. URL

Footnotes

  1. I also had a conversation with ChatGPT. Read the transcript.