For many people, it starts with a question. Something simple, something they already know the answer to. A test, in other words, to see what these AI-powered chatbots are all about. But spend any amount of time with ChatGPT and other such tools and you’ll quickly start to wonder what else they might do, and how useful they might be in your day-to-day working life.
Data scientists certainly have been thinking along these lines, and to find out more about current use cases, proofs of concepts and potential applications, Real World Data Science got together with members of the Royal Statistical Society’s Data Science and AI Section (DS&AI) for a group discussion.
Our interviewees, in order of appearance, are:
- Piers Stobbs, VP science at Deliveroo, and DS&AI committee member.
- Detlef Nauck, head of AI and data science research at BT, and editorial board member, Real World Data Science.
- Adam Davison, head of data science at the Advertising Standards Authority, and DS&AI committee member.
- Trevor Duguid Farrant, senior principal statistician at Mondelez International, and DS&AI committee member.
- Giles Pavey, global director for data science at Unilever, and DS&AI vice-chair.
- Martin Goodson, CEO and chief scientist at Evolution AI, and DS&AI committee member.
The first part of our discussion focuses on how large language models are becoming part of the data science toolkit, and what this new development means for data science teams and skillsets. Stay tuned for part two, which we’ll be publishing soon!
(UPDATE: Part two is now published: “Large language models: Do we need to understand the maths, or simply recognise the limitations?”)
As data scientists, how has ChatGPT – and other tools built on large language models (LLMs) – changed your working lives?
Piers Stobbs: Up to about a year ago, although I was really impressed with the developments in deep learning and the improvements in computer vision and natural language models, it felt in line with general improvements in machine learning. And then, probably about six months ago, with things like DALL·E and ChatGPT, it felt like something changed – properly ground-breaking capabilities. And I still can’t quite get my head around the fact that you can basically have a model that tries to predict the next token, and it comes up with outputs that really feel quite sensible and human-like – if prone to hallucination.
The way I think about it is, this feels like a brand-new capability that we’ve just not really had before. It’s almost like an interface with unstructured information. Historically, you sort of have to turn text into something, and then turn something back into text, if you want to have this interface with humans. Now, we’ve got this really quite elegant way of plugging the gaps, which feels full of opportunities.
I’m having great fun playing around with the code co-pilots – GitHub’s Copilot is amazing and, productivity wise, is helping me a lot. I am now a much faster coder because there’s all those Stack Exchange lookups that I don’t have to do anymore. Again, from a personal productivity perspective, I’m using [ChatGPT] for initial drafts of documents and other things. And then I use it almost for validating things. For instance, I had a random discussion the other night with ChatGPT about logistic algorithms. It’s not going to solve problems for you, but I asked it to give some pointers of things I could be thinking about – some of which I had, some of which I hadn’t. So, it’s almost like a brainstorming helper, somehow.
But probably the thing I’m most excited about is the knowledge sharing side of it – plugging it into, or on top of, private information, and surfacing all that knowledge that is locked away in documents and intranet pages.
This feels like a brand-new capability – an interface with unstructured information. Historically, you have to turn text into something, and then turn something back into text, if you want to have this interface with humans. Now, we’ve got this elegant way of plugging the gaps.
Detlef Nauck: We’re looking into running proofs of concept to see whether LLMs do bring value. Software engineering is the most obvious one, and the easiest to set up and run. And then we want to look at making use of internal documents – so, either summarisation or creation of internal documents in appropriate language. The latter use cases are trickier to evaluate. We want to know whether the outputs produced are any good. With software engineering you can track GitHub statistics, for example. But if you give ChatGPT to somebody to write marketing material, or to get information out of a document, how do you know that the results are good? We need to get our heads around metrics for evaluation.
Adam Davison: I’ve been using it for basically anything where I don’t remember the API very well or it’s a bit confusing. Pandas is the key, right? We all use pandas, but you don’t really remember how to do some complicated thing with apply(), say, so you just ask GPT-4 to give you the answer, and it saves you that hassle. Also, I read some insightful tweet that said these chat systems are really good for things where generating the solution is hard, but verifying it is easy. And I think that’s true for some of these things. You know, you get a short piece of Python code, you can basically look at that and you can tell if it’s right.
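To give a flavour of the “generating is hard, verifying is easy” point, here is the kind of small apply() one-liner you might ask a model for and then check by eye. The data and column names are invented for this sketch and are not taken from any of the interviewees’ work.

```python
import pandas as pd

# Invented data, purely for illustration: the generated one-liner is short
# enough to verify by reading it and inspecting the output.
df = pd.DataFrame({
    "ad_text": ["WIN BIG TODAY!!!", "Guaranteed crypto returns", "Fresh veg delivered"],
    "channel": ["social", "search", "social"],
})

# Flag any row whose text contains one of a small set of risk words.
risk_words = {"win", "guaranteed"}
df["flagged"] = df["ad_text"].apply(
    lambda text: any(word in text.lower() for word in risk_words)
)
print(df)
```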
In data science, you’re a bit of a jack-of-all-trades. You need to do little bits of everything, but you’re not a specialist in anything. And so, I think for software development, it’s been really helpful. For example, right now, I’m doing a bit of frontend development in a project to visualise something. I’m never going to be a professional frontend developer, but GPT-4 can help me deal with the oddities of JavaScript far more easily than trawling through Stack Overflow posts would.
But the thing that we’re using it for, practically, is natural language processing (NLP) and classification. We have this particular problem at the Advertising Standards Authority (ASA) where we are running lots of different models that are completely unconnected to each other because every project is a different topic. So, one week we’re looking at, “Do these gambling ads appeal to young people?” and then the week after it’s, “Are these cryptocurrency ads being clear about the risks involved?” It’s very disparate, we don’t have a lot of time to iterate on models, and we don’t have huge amounts of training data. Ten years ago, when you were doing sentiment classification, you were on Mechanical Turk getting 10,000 examples, and even then it didn’t work very well, even with these really complicated models. Now, you’ve got a couple of hundred examples and, with the embeddings [from LLMs], you can get to a pretty decent classifier quite quickly. We’re also starting to experiment with using OpenAI’s fine-tuning tools, and the performance that we’ve seen from that is very impressive, to the extent that it’s making us rethink whether we bother doing anything else in some of our classifiers.
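To make the approach concrete, here is a minimal sketch of the pattern Davison describes: embed a small labelled set of texts with an LLM provider’s embedding endpoint, then fit an ordinary classifier on top. The model name and client usage assume OpenAI’s current Python SDK, and the ad-copy examples and labels are made up for illustration; they are not ASA data.

```python
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder examples: in practice this would be a few hundred labelled snippets.
texts = [
    "Bright cartoon characters race to collect casino chips",
    "Our savings account pays 4% AER, terms apply",
    "Spin the wheel now and win coins for your avatar",
    "Compare home insurance quotes in minutes",
]
labels = [1, 0, 1, 0]  # e.g. 1 = likely to appeal to under-18s

def embed(batch):
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=batch)
    return [item.embedding for item in response.data]

# Fit a simple classifier on the embeddings, then score a new, unseen piece of copy.
clf = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
print(clf.predict_proba(embed(["Collect gems and unlock the bonus level"]))[0])
```

With a couple of hundred examples, the embeddings carry most of the heavy lifting, which is why a simple model on top can be competitive with bespoke NLP pipelines.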
Five years ago, if you had a sophisticated problem involving text or images, you’d need a big research team with a big budget to tackle it. But increasingly we find, like many other people, that you can take models off the shelf and repurpose them for quite diverse tasks.
Trevor Duguid Farrant: My organisation is not as far forward as the rest of you. I’ve introduced it to the leadership team, and the digital services team – what was IT – are looking to make a decision on whether we can use it or not. I think there’ll be so much pressure they’ll have to use it, but there’s still a feeling of discomfort with it, whereas I think it’s really good and have started using it. Everyone else on the call seems to have started using it. So, can organisations like the Royal Statistical Society help companies to embrace this and start using it, and then everyone can benefit from it?
Giles Pavey: I wish I could be with Piers and Adam – actually using it – but my life has been taken over as the guy who goes and explains it to the business. Unilever is a massive business, and we are concerned about privacy, confidentiality and trustworthiness. We’ve now built an initial GPT instance on Azure and fed it with some of our own documents, and a lot of my time has been working with legal to convince ourselves that that’s okay. Now we are really trying to work out just how we manage the amazing demand for proofs of concepts and use cases – and what we’re just about to uncover, I think, is the unknown but potentially massive expense of running it.
In pure proofs of concepts, departments that have large knowledge banks are using it: research and development, and marketing, for example. And one of the big technical things that we’re working on – and, because of the size that we are, we’re doing a lot of work with OpenAI and Microsoft on this – is how to stop the models from hallucinating.
Have your experiences with ChatGPT and other tools changed your thinking about the skillsets required of data scientists and data science teams?
AD: A little bit. As someone at a small organisation, I think it’s quite exciting because, five years ago, maybe you were in a world where if you had a sophisticated problem involving text or images, you’d need a big research team with a big budget to tackle it. But increasingly we find, like many other people, that you can take models off the shelf and repurpose them for quite diverse tasks. So, I think it’s becoming increasingly viable to have a small team of people who are implementers, who aren’t necessarily backed up by a big research organisation, doing increasingly sophisticated stuff.
I don’t think it does away with the sort of things that we always bang on about in the Data Science and AI Section, like the need for an understanding of statistics and how the underlying systems really work, because I think you still need to understand what you’re doing with LLMs, as with any other machine learning technique. But, if I had to guess, what we’re going to be seeing now is that for a lot of problems, you’re going to have more of a division – so, you’re either in one of a small number of very large labs doing research on very cutting-edge big models, or you can be an implementer who is taking things off the shelf and applying them. And maybe that space in between is going to get a little bit squeezed – that would be my guess, but obviously it’s very unpredictable.
We’ve built an initial GPT instance on Azure. Now we are really trying to work out just how we manage the amazing demand for proofs of concepts and use cases – and what we’re just about to uncover, I think, is the unknown but potentially massive expense of running it.
PS: That’s exactly my view. When I first started hiring data scientists, a long time ago, you basically had to write stuff from scratch, and you needed PhDs – people who really understood, at a deep level, how the maths all works. But I think there’s been a steady progression towards valuing software engineering skills, and I think, in some ways, this is another step along that path. If I think now about implementing a chatbot over your own knowledge base, it’s basically like plugging APIs together with some Python. Adam’s point is still hugely important, though, because I think we still need the background knowledge about what’s actually going on – OK, I’m creating embeddings here, and that’s allowing this search to work so I can surface the right docs – that whole process, which an average software engineer is maybe not going to know. But I think it’s definitely blurring the lines.
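As a rough illustration of the “plugging APIs together with some Python” that Stobbs mentions, here is a minimal sketch of a chatbot over a private knowledge base: embed the documents once, embed each incoming question, retrieve the closest documents by cosine similarity, and pass them to the model as context. The model names, the tiny document list and the sample question are all assumptions made up for this example, and a real deployment would use a proper vector store rather than an in-memory array.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A toy "knowledge base"; in practice these would be chunks of intranet pages or documents.
documents = [
    "Expenses over £50 must be approved by a line manager.",
    "The VPN certificate is reset every Sunday at 02:00 UTC.",
    "New starters receive a laptop within five working days.",
]

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)

def answer(question, top_k=2):
    q = embed([question])[0]
    # Cosine similarity between the question and every document.
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = "\n".join(documents[i] for i in np.argsort(scores)[::-1][:top_k])
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, for illustration only
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return reply.choices[0].message.content

print(answer("How quickly do new starters get a laptop?"))
```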
Martin Goodson: It’s just as important now to understand how to evaluate performance. The difference is, it used to be that you were trying to figure out whether it’s 80% accurate or 85%. Now, it’s more like 99.9%. But you still need to figure it out. You still need to understand what the failure modes are and what caused them; how it is actually working, and whether it is doing what you need it to do for the product. Is it actually satisfying our needs as users or as customers of the product?
DN: I think, in the future, we will need people with the skills to run and build these models. Giles made the point about how expensive it is to run these things. Right now, you have two options: subscribe to an API, and then you are limited in how you can modify these models; or build your own – take an open-source LLM and modify it. But then you need people who know how to build a high-performance computing environment and operate it efficiently. You need to know how to actually train the models, how to curate the data, how to set the model parameters. And I always think there’s too much alchemy still going on in this field, right? It’s not proper science. People build these things and then are surprised at what they can do; they didn’t know such things would be possible. A lot of these capabilities only emerge when you make the models really, really big and, essentially, you also have no control over them – you can’t stop them hallucinating. So, these are the kinds of issues we need to get under control if we want to get any value out of them.
Prompt engineering is another one – you really need to understand how these models work and how to prompt them. If you want to give them to, say, a marketer to generate copywriting, they may not have the right ideas of how to prompt the machine. So, I could see roles developing out of data science that understand how to influence these models and make them do what we want them to do.
MG: The other angle to this is junior engineers. Now, the bar for being a useful junior engineer is that you’re better than GitHub Copilot. Why do you need a really junior person if you can just use a large language model to be the junior developer?
DN: I’m not thinking here about the data science person who needs to write some code for a project, but if you have a large software team in an organisation that produces production code, they will become more efficient by using these tools. There will still be a lot of manual work in all the testing and in putting everything together, but the teams will get more efficient and junior people will get up to speed quicker. That’s probably another advantage.
Can the Royal Statistical Society help non-tech companies embrace large language models, extolling their virtues and dispelling the myths?
PS: I think Detlef’s point about understanding is an interesting one. It definitely feels like there’s been this sort of continuum from, you know, “OK, it’s a linear regression, we know what’s going on” to complex models to ensemble models where, again, you’re combining these things you can individually understand. Even with big ImageNet architectures, billions of parameters, at least conceptually you can understand how these work and build out tools where you can understand the layers. To me, what’s different now is you’ve got this reinforcement learning layer on top, or diffusion layer, or some other additional approach – this combination of really complicated things. I honestly don’t know where to start with trying to understand why a specific output is generated, and I think that is a proper concern. That’s definitely an area of research, because I think we need to understand this.
GP: I think there’s also a question in large companies of just who owns these things. Up until this point, everybody’s been happy that AI is the realm of data science. And, suddenly, generative AI looks like it might be the realm of the IT team – that it’s a service you get off the shelf. It’s going to be interesting to see how that plays out. I really liked the point that Martin was making about being able to tell what the systems are actually doing, what they are supposed to do and how to check them, because if you don’t have a background in that area, you might just assume they work. Now, nobody knows exactly how these things work – not even the people who build them. But having a background in how you test things, and in the potential causes of things not working, is going to be incredibly powerful and useful.
TDF: Will experts like us actually be able to check it, given the speed at which new versions are coming out and the pace of development? Is it going to take us six months to check that GPT-3.5 works? Well, too late: a month later, GPT-4’s out! I just think that pace is going to keep accelerating.
Want to hear more from the RSS Data Science and AI Section? Sign up for its newsletter at rssdsaisection.substack.com.
- Copyright and licence
- © 2023 Royal Statistical Society
This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Photos are not covered by this licence. Portrait photos are supplied by interviewees and used with permission. ChatGPT homescreen photo by Levart_Photographer on Unsplash.
- How to cite
- Tarran, Brian. 2023. “How is ChatGPT changing data science?” Real World Data Science, May 11, 2023. URL