What do defence cases in litigation, statistical analyses, book summaries and a description of Young’s double slit experiment in the manner of poet Robert Burns have in common? They are all tasks that people have rightly or wrongly attempted to delegate to large language models.
The playground of generative AI algorithms based on large language models extends well beyond generating text to creating images, videos and even music from prompts. The capabilities of these algorithms, and the sheer range of tasks that large language models such as OpenAI’s Generative Pre-trained Transformer (GPT) series can have a go at, are striking. These large language models, and the chatbots built on them, have also been catapulted into the centre of mainstream public attention with huge success – who has not heard of ChatGPT? The net result has been something akin to a feeding frenzy as individuals and businesses alike strive to be among the first to benefit from them.
Like others, many data scientists closely familiar with these kinds of algorithms share some enthusiasm for their potential utility, but many also advocate an element of caution. There are some obvious caveats, including accuracy and cost – not just financial cost but also the huge energy cost of running these algorithms, a real-world consequence that is affecting the planet today yet is often eclipsed by fears that AI might take over the world some time in the future. Another concern is security: should you be sharing the information you are working from with a third party at all? However, while a lot of attention has focused on what these algorithms can do, fewer people have been asking what they actually do – what we know about the initial programming, the training data, the final algorithm and the range of possible outputs, all of which provide useful pointers as to whether a particular algorithm is appropriate for the task in hand, and how best to benefit from it.
Demystifying machine learning
Definitions of artificial intelligence vary, often circling around the theme of a system reaching an “intelligent” decision or output based on multiple inputs, although how “intelligent” might be defined can be hairier still. Nonetheless, there is currently a broad consensus that some kind of machine learning is a route to achieving it. Through machine learning “you are letting the computer adjust the importance of its inputs, and their relationships, to determine an appropriate output”, as Napier chief data scientist and chair of the Royal Statistical Society DS & AI Section Janet Bastiman describes it. The term “machine learning” was coined by IBM scientist Arthur Samuel in 1959, and machine learning has largely been achieved by two approaches. One is “random forests”, based on constructing multiple decision trees. The other is the neural network, first devised by American psychologist Frank Rosenblatt and simulated on an IBM 704 computer in 1957. Here, a set of artificial neurons – components closer to a capacitor than a biological neuron – is connected to another layer of neurons, which is connected to another layer, and so on (Figure 1). Crucially, the connections are strengthened or weakened through “learning” based on training data, which allows the network to recognise patterns and extract meaningful features.
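To make the layered picture concrete, here is a minimal sketch in Python with NumPy (my choice of language, not anything used in the article) of a tiny fully connected network passing an input through two layers of artificial neurons; the weight matrices stand in for the connections that training would strengthen or weaken, and here they are simply random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "network": 4 inputs -> 3 hidden neurons -> 1 output.
# The weight matrices are the connections between layers; training
# would adjust these values, here they are just random placeholders.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 1))

def relu(x):
    return np.maximum(0, x)          # a common neuron "activation"

def forward(x):
    hidden = relu(x @ W1)            # first layer of neurons
    return hidden @ W2               # second layer produces the output

x = np.array([0.2, -1.0, 0.5, 0.1])  # one example with four input features
print(forward(x))
```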
“Machine learning is just like linear regression with tonnes of bells and whistles,” says Daniela Witten, professor of mathematical statistics at the University of Washington in the US, referring to a statistical method for fitting a line to a set of data points that dates back over a hundred years. There are many other traditional approaches to statistical learning, some of them nonlinear, but the “bells and whistles” Witten describes are largely a matter of scale and flexibility: whereas a traditional regression model might have 5 inputs or variables, the machine learning version might have 15 million, and instead of assuming a linear relationship the fit is allowed to be far more flexible. However, the fundamental statistical ideas underlying both sets of models are the same. For this reason, although some may beg to differ, she feels doing machine learning before you understand statistics is like trying to jump rope before you can walk. “It’s not that you can’t do it but why would you?” she adds.
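As a hedged illustration of the same underlying idea with flexibility bolted on, the sketch below (using scikit-learn, which is an assumption of mine rather than a tool named in the article) fits both an ordinary linear regression and a random forest to the same synthetic data; the forest can follow the curvature that the straight line cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data with a nonlinear relationship plus noise.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

linear = LinearRegression().fit(X, y)                      # classic straight-line fit
forest = RandomForestRegressor(random_state=0).fit(X, y)   # far more flexible fit

x_new = np.array([[2.0]])
print("linear prediction:", linear.predict(x_new))
print("forest prediction:", forest.predict(x_new), "true signal ~", np.sin(2.0))
```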
Broadly speaking, machine learning can be classified in two ways. One is “supervised”, which means that the training data is somehow labelled, for instance with a known output collected from real-world records. The alternative is “unsupervised”, where the algorithm is set the task of finding relationships within the input data itself. There are also neural network approaches that fall somewhere between the two, such as reinforcement learning, where an algorithm may at first generate outputs for a task at random, so its performance is initially poor, but improves with feedback that reinforces outputs closer to those desired. One approach that enjoyed great popularity for a time used another machine learning algorithm to provide this feedback; that judge would initially also be poor, but improves as it is pitted against the algorithm learning to do the task. These are generative adversarial networks (GANs). GANs are still used a lot, but usually within a pipeline, and they may be pre-trained so they are not starting from scratch as they used to be.
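A minimal, hedged illustration of the supervised/unsupervised distinction (again using scikit-learn as a stand-in, not a tool mentioned in the article): a classifier learns from labelled examples, while a clustering algorithm is handed only the inputs and must find structure on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Two blobs of points; the labels record which blob each point came from.
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)

# Supervised: the model sees both the inputs and the known labels.
clf = LogisticRegression().fit(X, labels)

# Unsupervised: the model sees only the inputs and groups them itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(clf.predict([[5.0, 5.0]]))    # predicted label for a new point
print(clusters[:5], clusters[-5:])  # cluster assignments it discovered
```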
As the number of layers increased from just a few, the term “deep learning” was adopted, along with alternative structures. The first models operated with every neuron in each layer connected to every neuron in the previous layer. “This is very wasteful because not every part of your input relates to each other that much,” says Petar Veličković, staff research scientist at Google DeepMind and affiliated lecturer at the University of Cambridge. He cites images as among the first scenarios where people began to tweak the approach, in what is called a convolutional neural network. Based on the assumption that the pixels for each object in an image sit adjacent to each other rather than at opposing corners of the image, the neurons in a convolutional neural network connect only to the neurons in the next layer that are nearby in image space. In this way the convolutional neural network assumes a kind of structure in the input data – that the image is contiguous, so the pixels for edges and other features are in contact.
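The local connectivity can be seen in a minimal sketch of a single convolution step (plain NumPy, offered as an illustrative assumption rather than any particular library’s implementation): each output value is computed from a small 3×3 neighbourhood of the image, not from every pixel at once.

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.random((6, 6))       # a toy 6x6 greyscale "image"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])  # a simple edge-detecting filter

# Slide the 3x3 kernel over the image: each output value depends only
# on a local 3x3 patch, which is what "local connectivity" means here.
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = np.sum(patch * kernel)

print(out)
```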
“Transformers are also a neural network but they encode a different kind of structure,” Veličković tells Real World Data Science, describing the architecture at the heart of the large language models creating such a buzz at present. Language has structure too – letters make up words, which make up sentences, and so on. So it makes sense to program some of that structure into the algorithm rather than leaving it to work it all out for itself. “You would need a lot more training data than there is on the internet to train a system without such structure by itself,” adds Veličković. Transformers break the training data into tokens, and a key component, first reported in 2017, is the way each token then connects with, or “attends” to, all other similar tokens. Whether tokens are “similar” is determined by their dot products, a way of multiplying the vectors of numbers that represent each token (Figure 2). Exploiting this “dot product attention” significantly improves the efficiency of the training process.
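A minimal NumPy sketch of the idea, following the textbook formulation of scaled dot-product attention (real models add many refinements such as multiple heads and learned projections, which are not shown here): each token is compared with every other token via dot products, the results are turned into weights, and those weights mix the tokens’ value vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    # Similarity of every token with every other token via dot products,
    # scaled by the square root of the vector dimension.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)   # how much each token "attends" to the others
    return weights @ V                   # weighted mix of the value vectors

rng = np.random.default_rng(4)
n_tokens, dim = 5, 8
Q = rng.normal(size=(n_tokens, dim))
K = rng.normal(size=(n_tokens, dim))
V = rng.normal(size=(n_tokens, dim))

print(dot_product_attention(Q, K, V).shape)  # (5, 8): one updated vector per token
```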
Taking the world by storm
The transformer architecture has proved very powerful, as seen in the surge to prominence of AI systems based on pre-trained transformer algorithms, such as ChatGPT, BERT and PaLM, although this likely has at least as much to do with the marketing of the recent releases as with the algorithm itself. “It was a small evolution rather than a revolution,” says Bastiman of recent GPT releases, explaining that an increase in the number of parameters and in the amount of training data gave rise to something that could provide broader answers and was ready for the mass market. Nonetheless, she adds, “There had been GPT2, GPT1 and all the other previous ones had been released quietly and had all been quite good.”
The marketing spin has not stopped with the product releases: terms like “hallucination” have entered the lexicon to describe instances when the output is wrong and potentially dangerous (Figure 3). “The language we are using to describe these models is different to how we describe human intelligence to deliberately instil the sense this is better,” adds Bastiman. “So even if the model is incorrect these terms imply that it is still doing something amazing.”
The success of this marketing does have its advantages, as Veličković points out, thrusting AI into the spotlight and inviting people to try the algorithms, which, thanks to a growth in web-based user interfaces like ChatGPT, can reach a much broader audience. This is not only encouraging developers that they are doing something potentially useful but also prompting important discussions around potential bias and ethics issues, which many would argue ought to be considered before anything else. Nonetheless, Veličković also doubts whether the current AI fanfare can be attributed to advances in the algorithms alone, pointing out that neural networks have been around since the 1950s, and that the 1980s and 1990s saw the invention of most of the building blocks needed to scale such systems up: the backpropagation algorithm, convolutional and recurrent neural networks, long short-term memory networks, and early variants of linear self-attention and graph neural networks. “It’s just that we needed gamers,” he tells Real World Data Science, suggesting that hardware and engineering have been key to the recent successes of AI. “We needed people to drive the development of graphics cards which are really good hardware for training these things.”
Clearly, advances in processing power and in hardware such as GPUs, which make it feasible to run these algorithms at all, massively affect their potential impact. Although the field no longer relies on GPUs developed for gamers, GPUs are still widely used, as they offer a good return on investment and are easier to get hold of than alternatives like tensor processing units. Certainly a significant development over the past decade or so is the increase in size of not just the data sets but the algorithms themselves. Implementing algorithms at colossal scales that require data centres places incredibly challenging demands on the hardware and on the electrical and computational engineering needed to set them running and keep them from failure. “When you have a data centre, failure is a common thing,” says Veličković, listing multiple vulnerabilities that balloon at scale: hardware failures, electrical failures, even apparently exotic events like solar flares that can flip bits and scramble data, leading to nonsense output. “People underestimate this but good engineering is now the bread and butter of how these systems work.”
Managing expectations
The explosion in scale has also created fundamental distinctions between how people work with machine learning algorithms and how they work with statistical methods. Witten highlights “the ability to gauge uncertainty” by quantifying measures such as confidence intervals and error bars as a key contribution of statistics. “Often with these machine learning models things get very complicated and we do not yet have a way to quantify that uncertainty.”
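As a small, hedged illustration of the kind of uncertainty quantification classical statistics offers (using statsmodels, my choice here rather than anything named in the article), an ordinary least squares fit reports a confidence interval for each coefficient alongside the estimate itself.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Simple synthetic data: y depends linearly on x, plus noise.
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=100)

X = sm.add_constant(x)            # add an intercept term
fit = sm.OLS(y, X).fit()

print(fit.params)                 # point estimates for intercept and slope
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals for each coefficient
```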
This quantification of uncertainty contrasts with the kind of output produced by the generative AI applications that have grabbed media attention recently. For instance, asking a large language model to describe Young’s double slit experiment in the style of Robert Burns may sound like quite a specific prompt, and it may seem impressive if the algorithm returns something akin to what was asked, but the number of possible responses that could be deemed “correct” is effectively unlimited. A lot of applications of generative AI – many with more real-world impact than describing iconic experiments in archaic Scots verse – similarly have a vast set of reasonable outputs.
“We shouldn’t be surprised if ChatGPT does well with a question that has a million reasonable answers,” says Witten, contrasting these scenarios with questions that she suggests might have more real-world importance, like whether a patient with breast cancer will respond to a particular treatment. “Actually ChatGPT often gets into trouble if there is a problem with just one answer, and that answer is not part of the training set.”
For predictive AI there is often only one useful answer – the outcome that will come to pass. This has implications if machine learning is used for predictions, particularly in real-world settings that affect real people. “If you are deploying an AI model for some healthcare application like what breast cancer treatment you are going to respond best to, we really better make sure that the model works, and that we understand the uncertainty of those predictions,” says Witten. She feels that over the past few years the machine learning community has increasingly recognised the importance of bringing statistical thinking to bear within the context of complex machine learning/AI algorithms: in particular, interpretability and uncertainty quantification have become major areas of interest in machine learning. Witten suggests that statistics is making progress here, citing as an example conformal inference, “which allows recalibration of the predictions of a machine learning model in order to quantify uncertainty appropriately.”
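Conformal inference can be sketched in a few lines; the split-conformal version below (NumPy and scikit-learn, a simplified illustration under my own assumptions rather than a full treatment of the method) uses held-out calibration residuals to wrap a point-prediction model in intervals with a target coverage level.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)

# Synthetic data split into training, calibration and test sets.
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=600)
X_train, y_train = X[:300], y[:300]
X_cal, y_cal = X[300:500], y[300:500]
X_test = X[500:]

# Any point-prediction model can be wrapped; a random forest is used here.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Split conformal: use calibration residuals to choose an interval width
# aiming for roughly 90% coverage on future points.
residuals = np.abs(y_cal - model.predict(X_cal))
alpha = 0.1
n = len(residuals)
q = np.quantile(residuals, np.ceil((n + 1) * (1 - alpha)) / n)

preds = model.predict(X_test)
lower, upper = preds - q, preds + q
print(lower[:3], upper[:3])   # approximate 90% prediction intervals
```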
Explain yourself
Understanding the uncertainties of the output is one thing, but many of these algorithms have now reached a scale that totally obfuscates what they are doing with the input data to reach their outputs. There may be specialists who understand how they are programmed, but there are just too many variables to track; even for them, the final process the algorithm lands on for generating its output from the various inputs is a black box with no neat mathematical description, unlike statistical techniques such as regression.
“You can draw a picture with circles and arrows, and arrows cross in a certain way, but you don’t have a clear idea of how one feature that you started with maps to the output,” says Witten. “On an actual quantitative level of scientific understanding we don’t have that.” If decisions are being made for and about people based on AI, people will also sometimes want to know how a conclusion was drawn. “When we want to make decisions there’s a level of deferred trust,” says Bastiman, citing work by the Alan Turing Institute that began in the late 2010s. “We as humans want explanations from machines in the same scenarios that we want them from humans but that’s not going to be the same for all people.” For example, a person who has had a bad experience in the past will need more convincing than one who has not. Bastiman suggests that a very normal cognitive bias can be generalised as most people wanting more explanation if the model output is not in their favour. “Similarly, a person accepted for a job where AI is used, may not require any explanation, while another candidate the AI rejected may challenge the decision and want to know why.”
Hybrid implementations including a human in the loop may help to a degree. However, to get a handle on the workings of the algorithm itself, Bastiman points out that it is possible to introduce layers in the algorithm that help extract how the output is reached, even for unsupervised neural networks. “That’s where a lot of effort goes from data scientists and machine learning engineers to make sure the model has that level of transparency and makes sense,” she adds, emphasising the need to ensure a model has these features before it is released and put into use. The process is far from straightforward, as the explanation needs to be at the right level and with the right terminology for a range of audiences, be they data scientists, quality assurance professionals, decision makers, end users or impacted individuals. “People say you can’t explain things when what they really mean is that it’s difficult.”
Veličković suggests a lot could be gained in terms of being able to analyse AI algorithms by marrying them with elements of classical algorithms, which are “nicely implemented and interpretable.” Classical algorithms are also impervious to changes in the input data such as an increase by a factor of 10, which can completely throw an algorithm based on machine learning. “The problem is they are trained to be really useful and give you an answer at all times so they won’t even give you a confidence estimate, they will just try to answer even if the answer is completely broken,” he adds. A lot of his research has focused on “out-of-distribution generalisation” – the way classical algorithms work with any input data – to see how these features might be sewn into AI to extract the best of both worlds. “There’s a lot of research to be done still but our findings so far indicate that if you want out-of-distribution generalisation you need to look at what makes your problem special and put some of those properties inside your neural network model.”
Even what we know about the way an algorithm reaches a decision has caused concern when it comes to critical real-world applications – for example, determining the likelihood that someone convicted of a crime will reoffend. (More on this to come in the special issue article on ethics.) With many commercial algorithms the details of the training data are unknown, or the data essentially constitutes the whole internet, which as Witten points out is “a pretty bad place a lot of the time.” While ChatGPT may seem an unlikely choice for anything like gauging the risk of recidivism or choosing cancer treatments, concerns remain over biases propagating in the AI-generated content we might consume through marketing campaigns and other activities. “Even just thinking about deploying AI/machine learning models in critical real-world settings without the associated statistical understanding is just very deeply problematic,” says Witten, emphasising the importance of not just statisticians but also ethicists in tackling these challenges.
The fact is many of us are already interacting with multiple machine learning/AI models on a daily basis through recommendations, search engines and predictive text. “If we are going to deploy these [machine learning algorithms] at scale in a way that will affect human lives, then we first need to understand the implications for humans of these models,” says Witten. “This includes both statistical and ethical considerations.”
Coming up: Forthcoming articles in this special issue will look at machine learning and human-level intelligence, issues around data, techniques for evaluation, gauging workforce impact, governance, best practice and living with AI.
Also in the AI series
- Generative AI models and the quest for human-level artificial intelligence
- Datasets for optimised AI performance
- About the author
- Anna Demming is a freelance science writer and editor based in Bristol, UK. She has a PhD from King’s College London in physics, specifically nanophotonics and how light interacts with the very small, and has been an editor for Nature Publishing Group (now Springer Nature), IOP Publishing and New Scientist. Other publications she contributes to include The Observer, New Scientist, Scientific American, Physics World and Chemistry World.
- Copyright and licence
- © 2024 Anna Demming
This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail image courtesy of Serenechan3, reproduced under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.
- How to cite
- Demming, Anna. 2024. “What is AI? Shedding light on the method and madness in these algorithms.” Real World Data Science, April 22, 2024. URL