Why large language models should come with a content warning

The outputs of LLMs seem impressive, but users need to be wary of possible bias, plagiarism and model ‘hallucinations’.

Machine learning
Large language models

Brian Tarran


November 23, 2022

Anyone who has ever been set a writing task will probably have wished at some point that somebody else could write it for them. As a journalist of 20-plus years, the thought has certainly crossed my mind more than a few times. Which probably explains why a recent headline in Nature caught my attention: “Could AI help you to write your next paper?”

The article, by Matthew Hutson, looks at how researchers are using artificial intelligence (AI) tools built on large language models (LLMs) as “assistants”. Starting with a prompt, such as “Write a headline for a blog post about large language models being used by academic researchers as research assistants”, an LLM will produce a text output. For example, using the same prompt with OpenAI’s GPT-3, I got:

“New research assistants: large language models”

Asked to “Write a headline for a blog post that critiques academic researchers’ use of large language models as research assistants”, GPT-3 produced:

“Too much reliance on large language models can hinder research”

And when I asked “Why can too much reliance on large language models hinder research?”, GPT-3 wrote:

“If academic researchers rely too heavily on large language models as research assistants, they may become reliant on the models and may not be able to conduct research without them. This can hinder research because the researchers may not be able to think for themselves and may not be able to come up with new ideas.”

A fair point, I suppose. But I sense there’s more to this story, and rather than continue quizzing GPT-3, I sat down with Detlef Nauck, a member of the Real World Data Science Editorial Board and head of AI and data science research for BT’s Applied Research Division, to ask a few more questions.


Thanks for joining me today, Detlef. To start, could you give a brief overview of these large language models, what they are, and how they work?

Detlef Nauck (DN): Essentially, LLMs match sequences to sequences. Language is treated as a sequence of patterns, and this is based on word context similarity. The way these things work is that they either reuse or create a word vector space, where a word is mapped to something like a 300-dimensional vector based on the context it’s normally found in. In these vector spaces, words like “king” and “queen”, for example, would be very similar to each other, because they appear in similar contexts in the written texts that are used to train these models. Based on this, LLMs can produce coherent sequences of words.

But the drawback of this approach is that these models have bias, because they are trained with biased language. If you talk about “women”, for example, and you look at which job roles are similar to “women” in a vector space, you find the stereotypically “female” professions but not technical professions, and that is a problem. Let’s say you take the word vector for “man” and the word vector for “king”, and you subtract “man” and then add this to “woman”, then you end up with “queen”. But if you do the same with “man”, “computer scientist”, and “woman”, then you end up maybe at “nurse” or “human resources manager” or something. These models embed the typical bias in society that is expressed through language.

The other issue is that LLMs are massive. GPT-3 has something like 75 billion parameters, and it cost millions to train it from scratch. It’s not energy efficient at all. It’s not sustainable. It’s not something that normal companies can afford. You might need something like a couple of hundred GPUs [graphics processing units] running for a month or so to train an LLM, and this is going to cost millions in cloud environments if you don’t own the hardware yourself. Large tech companies do own the hardware, so for them it’s not a problem. But the carbon that you burn by doing this, you could probably fly around the globe once. So it’s not a sustainable approach to building models.

Also, LLMs are quite expensive to use. If you wanted to use one of these large language models in a contact centre, for example, then you would have to run maybe a few hundred of them in parallel because you get that many requests from customers. But to provide this capacity, the amount of memory needed would be massive, so it is probably still cheaper to use humans – with the added benefit that humans actually understand questions and know what they are talking about.

A black keyboard at the bottom of the picture has an open book on it, with red words in labels floating on top, with a letter A balanced on top of them. The perspective makes the composition form a kind of triangle from the keyboard to the capital A. The AI filter makes it look like a messy, with a kind of cartoon style.

Letter Word Text Taxonomy by Teresa Berndtsson / Better Images of AI / CC-BY 4.0

Researchers are obviously quite interested in LLMs, though, and they are asking scientific questions of these models to see what kinds of answers they get.

DN: Yes, they are. But you don’t really know what is going to come out of an LLM when you prompt it. And you may need to craft the input to get something out that is useful. Also, LLMs sometimes make up stuff – what the Nature article refers to as “hallucinations”.

These tools have copyright issues, too. For example, they can generate computer code because code has been part of their training input, but various people have looked into it and found that some models generate code verbatim from what others have posted to GitHub. So, it’s not guaranteed that what you get out is actually new text. It might be just regurgitated text. A student might find themselves in a pickle where they think that they have created a text that seems new, but actually it has plagiarism in some of the passages.

There’s an article in Technology Review that gives some examples of how these systems might fail. People believe these things know what they’re talking about, but they don’t. For them, it’s just pattern recognition. They don’t have actual knowledge representation; they don’t have any concepts embedded.

To summarise, then: LLMs are expensive. They sometimes produce nonsense outputs. And there’s a risk that you’ll be accused of plagiarism if you use the text that’s produced. So, what should our response be to stories like this recent Nature article? How should we calibrate our excitement for LLMs?

DN: You have to treat them as a tool, and you have to make sure that you check what they produce. Some people believe if you just make LLMs big enough, we’ll be able to achieve artificial general intelligence. But I don’t believe that, and other people like Geoffrey Hinton and Yann LeCun, they say there’s no way that you get artificial general intelligence through these models, that it’s not going to happen. I’m of the same opinion. These models will be forever limited by the pattern recognition approach that they use.

But, still, is this a technology that you have an eye on in your professional capacity? Are you thinking about how these might be useful somewhere down the line?

DN: Absolutely, but we are mainly interested in smaller, more energy efficient, more computationally efficient models that are built on curated language, that can actually hold a conversation, and where you can represent concepts and topics and context explicitly. At the moment, LLMs can only pick up on context by accident – if it is sufficiently expressed in the language that they process – but they might lose track of it if things go on for too long. Essentially, they have a short-term memory: if you prompt them with some text, and they generate text, this stays in their short term memory. But if you prompt them with a long, convoluted sentence, they might not have the capacity to remember what was said at the beginning of the sentence, and so then they lose track of the context. And this is because they don’t explicitly represent context and concepts.

The other thing is, if you use these systems for dialogues, then you have to script the dialogue. They don’t sustain a dialogue by themselves. You create a dialogue tree, and what they do is they parse the text that comes from the user and then generate a response to it. And the response is then guided by the dialogue tree. But this is quite brittle; it can break. If you run out of dialogue tree, you need to pass the conversation over to a person. Systems like Siri and Alexa are like that, right? They break very quickly. So, you want these systems to be able to sustain conversations based on the correct context.

Have you got news for us?

Is there a recent data science story you’d like our team to discuss? Do you have your own thoughts to share on a hot topic, or a burning question to put to the community? If so, either comment below or contact us.

← Back to Editors’ blog

This work is licensed under CC BY-SA 4.0