Real World Data Science

Data science and AI in the public sector: An interview with ONS’s Penny Holborn

Jonathan Gillard — Wed, 27 Mar 2024 00:00:00 GMT

About the author: Jonathan Gillard is a professor of statistics and data science at Cardiff University and a member of the editorial board of Real World Data Science.

Copyright and licence: © 2024 Royal Statistical Society

This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail background by Marcin Skalij on Unsplash.

How to cite: Gillard, Jonathan. 2024. “Data science and AI in the public sector: An interview with ONS’s Penny Holborn.” Real World Data Science, March 27, 2024. URL

Data science and AI in financial services: An interview with Nationwide’s Matthew Jones

Jonathan Gillard — Thu, 21 Mar 2024 00:00:00 GMT

Back to Careers

About the author: Jonathan Gillard is a professor of statistics and data science at Cardiff University and a member of the editorial board of Real World Data Science.

Copyright and licence: © 2024 Royal Statistical Society

This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail background by Devin Pickell on Unsplash.

How to cite: Gillard, Jonathan. 2024. “Data science and AI in financial services: An interview with Nationwide’s Matthew Jones.” Real World Data Science, March 21, 2024. URL

‘I fell in love with math, really, and fell into data science because of that’

Brian Tarran — Wed, 04 Oct 2023 00:00:00 GMT

A passion for maths and solving mathematical problems led Niclas Thomas to a PhD in machine learning with a focus on medical research. But then a conversation with a recruiter steered his career towards data science in the retail sphere. After stints at Tesco, Sainsbury’s, and Gousto, Thomas is now head of data science for Next, the clothing retailer.

In this interview with Real World Data Science, Thomas reflects on his career journey so far, from hands-on coding work to team leadership and management. He also argues for the importance of communication and storytelling as part of the data science skill set.

Transcript

This transcript has been produced using speech-to-text transcription software. It has been only lightly edited to correct mistranscriptions and remove some repetitions.

Brian Tarran
Niclas Thomas, thank you for joining us today. I hope you’re well.

Niclas Thomas
I am indeed thanks. Thank you for having me.

Brian Tarran
Today we’re meeting because we want to find out a little bit about your career in data science, how you got into it, what you’re doing now, where you see both your career and data science as a profession going next. So do you mind– can we start by giving us a brief introduction to who Niclas Thomas is?

Niclas Thomas
Yeah, of course. Yeah. So I’m currently working as head of data science at Next. My background is academia originally, a maths degree. I did my PhD in machine learning, and more in medical research, so more of an applied machine learning position where the idea was to try and predict, ultimately, predict disease from a given sample of data from blood – can you actually predict future disease? – which I think is a really interesting area; I love medical research. And then [I] switched over to more commercial role and worked in several retail data science roles: so, Tesco, Sainsbury’s, Gousto, and then now, as I said, currently head of data science at Next, where I run a team, and I imagine most listeners will be familiar with what Next do: a retail, a clothing brand on the whole, where the idea is, obviously to sell some great stuff, great products and put the right product in front of the right customer.

Brian Tarran
Can you tell us, what does your job involve? What are your sort of main tasks and responsibilities in that role?

Niclas Thomas
Yeah, so I suppose I’m lucky enough to have been head of data science at several different companies: Sainsbury’s, Gousto, and Next. So it’s always interesting to compare the role of the head of data science in each of those three. At the moment, I think there’s a core focus on, well, ultimately making sure the teams are efficient as possible. And that really means just making sure our tech stack – what tools, what programming languages, what software we use on a day to day basis – is set up for success and make sure the team have what they need to be able to do the job as efficiently as possible, whether that’s using Python or R, whether that’s how we develop code, and how we work with other people as well, being a big part of that, then. So how do we work with other software engineers? How do we work with web developers, then, to make sure that the work we do actually gets in the hands of the business and ultimately in the hands of the customer. So that’s one aspect: it’s just making sure the team is set up for success, both in terms of the ways they work and what tools they have to work as well, then. I guess the other side of that coin is what we actually work on. So understanding the value of potential work we could do, and helping the team understand what that value is, and, and ultimately giving direction of what things we want to work on next. Obviously, that’s not my decision in isolation, but understanding on the one hand, what other stakeholders want to do, what my superiors wants to do, as well. And trying to put that all into the mix to understand these are the next best projects to work on given a finite amount of people to work on these problems. And then ultimately, then, the last part, then, is ultimately helping the team deliver those projects, those products as well then, which usually means calling on my experience of having solved these problems myself, either directly when I was earlier in my career or indirectly through leading others then or, you know, being the head of a team and working with some other great people and to learn from their experiences as well.

Brian Tarran
What does data science mean to you, personally? I’m not asking you to define it for everybody. But for you, what is what is data science?

Niclas Thomas
Yeah, I wish I’d come up over the years with a great definition of this. But yeah, I mean, really, it just, I mean, at the very highest level, it just means using data to drive business value, I suppose, as I guess in my– which probably reflects the fact that it’s more of a business role that I have. But I think that in its broadest sense, I think that’s true: using data to drive insights and make decisions for the business. There are more, I guess, detailed definitions of that. So, for example, the way I’ve always differentiated between data analytics and data science is that if you want to make repeated decisions on a daily or weekly basis, then that’s when it becomes more about a data science question versus a data analytics question, because data analytics is generally about answering large one off ad hoc questions, rather than making the same decision over and over again and using methods appropriate for that. But, ultimately, that’s what data science means to me, I think: making repeated decisions using data and the scientific method to use data for good.

Brian Tarran
And so what do you think is your most important skill as a data scientist given that definition that you have of data science?

Niclas Thomas
In my role, I suppose communication ultimately becomes the most important thing. I’d say definitely earlier in my career, and I think if you’re the person actually delivering and implementing the algorithm, I think that the technical skill set obviously is really important then. But ultimately, I almost see my role as the head of data science as a hybrid– as a link between my team and the rest of the business, then. So it’s really about being able to, on the one hand, translate technical concepts into non technical descriptions of what we’re actually doing, making sure the rest of the business can understand and vice versa, then making sure I understand the business process and business terminology well enough to be able to translate that for the team, as and when needed, into a vision for a project, a product, then, and develop a strategy for that. So I think that the communication both in the strictest sense of being able to talk that through with, with my team, with other team members, with stakeholders, as well, but also more in the looser sense, then, of being able to define that strategy, being able to define what the roadmap for a particular project or a product might look like.

Brian Tarran
Can you talk us through your so your education and your training that led up to your kind of first data science job, your first data science role.

Niclas Thomas
I suppose the first time, the first time I– actually, I’d never heard about it, I think, when a recruiter approached me. This is probably going back into 2014, when I was maybe eight months into my postdoc after my PhD. I think– obviously it did exist before that, although I suppose the terminology wasn’t quite as widespread going back almost 10 years now where the term is a lot more rife. So my original background, I did a master’s in maths originally, four years. And then I remember being– the last year of that, then, I was applying for a few jobs, and I applied for one at the Met Office, where the focus obviously was predicting weather, forecasting. And I wasn’t successful in that job. But I did notice that the, on the job spec at the time, it was PhD preferred was one of the specs on that role. It was probably the first time I thought about taking on a PhD as more of a career move rather than as the natural progression to an academic career, more of a business career move if you like, then of actually how it can help you in more business settings. So that was at least when I decided to do my PhD and thought it’s certainly not going to be– and this was back in 2008, so at the time of the financial issues at the time when getting jobs was harder anyway, so it felt like a win-win of doing something that would be– I was clear I wanted to work in a data role of some sort. And that combined with the fact that I thought it would be a good career move and the financial climate at the time wasn’t brilliant. So I took on a PhD then. And then in terms of actually getting into, into my first data science position was, as I said, just after I finished my PhD, I had been working about six months, eight months as a postdoc, and then a recruiter just described a role that was available at Tesco at the time. And it sounded a lot of what I was doing in my current postdoc role at the time – making predictions based on data and exactly the same techniques – sounded really interesting. And it must have been the way the recruiter sold it at the time as well then, because it’s something I was really keen to take on and then made my move off the back of that then. So yeah, kind of moved into it a little bit, I guess, semi deliberately from taking a PhD on first, but always with the view of moving over to a business role at some point after that.

Brian Tarran
But it wasn’t like you started out your further education thinking, “I want to be a data scientist, what do I need to do to kind of get there? What are the subjects I need to focus on? What are the topics I need to research?

Niclas Thomas
Yeah. Oh, absolutely. Yeah, it certainly wasn’t by design at the very start of my journey. I fell in love with math, really, and just fell into data science because of that, really, I loved numbers and loved solving maths problems. So that’s why I did a degree in it first of all, then and certainly, you know, even midway through my degree, then I wasn’t really sure what I wanted to do. It was more, as you say, just by chance, then, that there were a few opportune moments that came around then, that opportunities came around at the right time to fall into that career.

Brian Tarran
Doing a PhD in machine learning as you did, that was quite a – in hindsight – a smart choice of PhD to pursue, I think, right?

Niclas Thomas
I think so. Yeah, I suppose it was– still even at that stage it wasn’t necessarily, again, the terminology ‘data science’ wasn’t really around. Certainly, when I started my PhD in 2009 2010. It wasn’t really terminology, at least it may have been in usage a little bit in terms of being on, you know, if you look for jobs on LinkedIn or Indeed, but it certainly wasn’t terminology that that I would have been particularly familiar with.

Brian Tarran
Your first job in data science was at Tesco. You mentioned that you were you were kind of recruited to that role there. How does it compare to your current role? So I guess, you know, what’s the difference between being a data scientist versus head of data science as you are now?

Niclas Thomas
Yeah, I think there are probably more similarities than differences, I would say. We were quite lucky in the setup in Tesco that the recruitment strategy seemed to be more focused around people who already had some experience in, generally, either already had business experience or a PhD. So we were fairly independent in solving our own– the project that we were working on and working on that. Not necessarily with the head of data science guiding us, you know, day by day, in terms of the actual nitty gritty and the technical detail, which is great, then. So it did mean that we had responsibility and ownership for our product quite early on. So yeah, I really enjoyed that. I suppose I was writing a lot more code in those days than I do now. I rarely, if ever write code at the moment. So I think that’s probably true for the last maybe three or four years, I think, only occasionally getting my hands dirty. And even when it is, it’s not really to build an algorithm, it’s more to inquire about what data we have to solve the algorithm then. So even when I do get my hands dirty, it’s more in the very early stages of the whole algorithm development lifecycle. So I think that’s probably the biggest difference is just the actual ownership of development there – probably expected, I would say, but it’s– I think that’s one of the beauties of being in your first job or two in data science. I think the– I think in most places I’ve seen, I think you’ll get ownership of, of the work, the stuff that you work on, on a day to day basis, quite early on. And you’ll be expected to contribute code and ideas for that as well, which I think most people would love. I certainly loved it at the time.

Brian Tarran
What was the most important thing you learned in your first year in that job?

Niclas Thomas
I think, again, it’s probably a lot around the ways of working, I would say – of the various ways you can [work], which I never really thought about it before. Working in academia, it was quite isolated, I suppose. You work on your own project, you work on your own work and don’t really– or at least, I found I didn’t really work with anyone else that much. Maybe that was the nature of my work as well, we’d obviously be dependent on people working in a lab to get data. But I think the day to day work, I was working quite in isolation, whereas the team aspect of working, I think, was a steep learning curve then – so agile methodology, and everything around that, which was very, very new to me. And the various ways you can do that. I’m generally not someone for overly putting processes in place in a team, only where necessary. But I think there’s some great learnings from that as well. It certainly started to shape how I think I would want to run a team if and when I got to that position.

Brian Tarran
So, Nick, what have been your career highlights so far?

Niclas Thomas
I think in terms of– there was one product we built in Sainsbury’s in particular. So in terms of, on a product level of replenishment. So how do you most efficiently get products from the back of the store onto the shelves of an individual store? And what’s the most optimal strategy to do that, which I love for a variety of reasons. A, it was one of the first full data science products that we had deployed and worked on as a team in Sainsbury’s. So there was that kind of milestone about it. I think it also stood out as a really nice move away from classic machine learning – i.e., making a prediction, a classification model – to something that was a bit more operations research based and more based on optimization. So using graph theory, making a graph network of a store. And using that to solve the problem of taking a route through the store, for example, a bit like a Google Maps for a store basically, was how we always pitched it to our stakeholders, and how can you choose the best route and again, moving more into a bit more of a vehicle routing problem, then: if you’ve got two different trolleys, how do you decide what items to put on trolley one versus trolley two? So there’s loads of interesting stuff on the technical side of things and it was, again, I felt it was probably one of the highlights – as well as the end product, it was also the one I worked on at the very start. So actually, the understanding whether it would be possible to do that, what kind of technical approach. So I think certainly from a product perspective, that’s probably stuck in my mind. Aside from that, on a more personal level, I guess, I did decide to write a book off the back of my PhD. Just mainly on my experiences from my PhD and postdoc. I mean, it’s not like a confessional. But more on the– just working with non data scientists and making it more accessible was really what I really focused on there. So having worked with clinicians, immunologists and others as part of the medical research that I did, I felt that data can be accessible if you pitch it in a way and make it easy to use. And so that was the purpose of what was largely an educational textbook.

Brian Tarran
Do you want to give a short plug for the book, what it’s called and where people can find it?

Niclas Thomas
Yeah, so it’s, Data Science for Immunologists is the name of the book. It’s available on Amazon. I’m one of the two co-authors on that then. And we do have a website, datascienceforimmunologists.com, as well then if you did want to visit and you can either buy the book, there’s a link on that website or just go straight to Amazon and it’s available there.

Brian Tarran
This next question, we’ve gone from highlights to lowlights. Have there been any mistakes or regrets that you’ve had along the way in your data science career so far?

Niclas Thomas
The main mistakes I think I’ve made before is not valuing, A, communication or soft skills, but B, the leadership and management as well then. And I think especially it’s something, when working at Gousto as well that was something that was a big focus of the team and something that I really took from my time there as well was the, I guess, the art of good management and good leadership, you know, what the difference is between the two. So I wouldn’t say there’s any one bang event that’s a mistake or regret, but it’s probably, as ever, it’s probably I would have put more emphasis on it sooner had I known that how important those skills would be.

Brian Tarran
Yeah, but I think that’s understandable to a certain extent. If you’re coming from, I guess, a role that’s very hands on, doing things yourself, getting into the messy details of a project, it can sometimes be hard to kind of take a step back and adopt more of a kind of leadership, management position, can’t it?

Niclas Thomas
Yeah, definitely. Yeah, definitely. I would agree with that. And I think it’s also, I’d probably say for a lot of people starting out, and certainly it was for me, that the technical– the technical aspect is probably why you get into a role in data science in the first place, that you just love solving problems, basically, whether that’s with code or with pen and paper. And so that’s, that’s what you want to do. And getting your mind focused elsewhere away from that is probably not viewed as the most fun thing to do, I probably wouldn’t have, when I was starting out in 2014, 2015, I probably wouldn’t have thought it was as fun or as interesting to do that as I do now, maybe. So I think that’s the other reason why it probably doesn’t get as much focus earlier on in my career anyway, at least, as it probably deserved.

Brian Tarran
How do you think your– how do you see your role, I guess, evolving over the rest of your career in data science?

Niclas Thomas
I suppose on a personal level, for me it’s, I’m always thinking of what, 10 years down the line, do I still want to be focused just on data science? Or do I want to be focused on a data role, more broadly? I suppose that’s always the main question to ask. And so by that I mean, looking at data engineering as well, data analytics, and being responsible for a wider group. I think the way the field is going anyway, I think a lot more companies seem to move to vertical management rather than horizontal. So by that, I mean having heads of data in different areas of the business. So rather than having a head of data and a head of analytics, you might have a head of data for certain aspects of the business and another head of data then that’s responsible for both in other areas of the business, then. So either way, I think that the broadening of responsibilities and not just being responsible for data science is probably one way I would see my career potentially moving. At the moment, I love just focusing just on the data science, I’m really happy doing that now. But I think that could be one way that my focus changes in the future.

Brian Tarran
What personal or professional advice would you give for anyone wanting to be a data scientist?

Niclas Thomas
Yeah, so first of all, the balance between the soft and hard skills. I think I’ve alluded to it before, but the– don’t put too much– I mean, still emphasise on the technical skills are really important, but don’t feel like it’s the be all end all. I think just understanding the softer side of how you communicate, how you tell a story, for example, and storytelling with data, I think is really important. So I’d say that’s probably one focus area. I think that the second would probably, and maybe it’s a harder one to act on, but being passionate, I think, because whenever I’m looking to recruit anyone new into my team, I think it’s as much about understanding what the potential of that person is as is what is their current performance or where their current capability is – how good they could be in the future is arguably more important. And I think a lot of that comes to ultimately someone’s– whether they have a fixed or growth mindset. So by that, I mean, ultimately, do they want to learn or not, and if they really want to learn, as a lot of data scientists do, but if they have a huge passion for or about data science, and wanting to learn about just how to get better – whether that’s a better coder, better at maths, anything around that – then if you have that attitude, I think then it’s, A, you can have a great impact on our team, but B, I think it’s a sign of someone who can be a great performer in the future.

Brian Tarran
So what do you think will be the main challenges facing data science as a field over the next few years?

Niclas Thomas
I think probably, certainly, currently maybe living up to the hype, I suppose. And matching I suppose the classic Gartner Hype Cycle of, it feels like we’re probably at the stage where there’s a lot of– the hype has been around for a few years of data science now and I think making sure we tackle the right problems, I suppose, is one of the – and by ‘we’ I mean, Next as a business or whatever business we’re working in at the time – I think it’s making sure we’re working on the right things. Because I think a lot of people will be keen to have data scientists as part of their work and the product they’re trying to build. What is the best place to spend our time, and what projects we should be working on most I think is– becomes important then because, as I say, there’s a huge demand for data scientists time, I think, in every company. And so choosing where we spend that time wisely, I think, becomes the key challenge and the important decisions for, especially for a head of data science like myself to make then, to make sure we’re best using the team’s capacity, then.

Discover more Career profiles

Copyright and licence: © 2023 Royal Statistical Society

This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.

How to cite: Tarran, Brian. 2023. “‘I fell in love with math, really, and fell into data science because of that.’” Real World Data Science, October 4, 2023. URL

‘I was inspired by the power that numerical data have to tell stories and promote policy change’

Brian Tarran — Wed, 28 Jun 2023 00:00:00 GMT

This week, in celebration of Pride, Real World Data Science is collaborating with the JEDI Outreach Group of the American Statistical Association (ASA) and the ASA LGBTQ+ Advocacy Committee to highlight the achievements of statisticians and data scientists from across the LGBTQ+ spectrum.

Members of the committee nominated two individuals to be featured as part of our career profile series, and so we are pleased to bring you interviews with Claire Morton (below) and Albert Lee.

Read on to discover more about Claire’s data science career (so far).

Hi, Claire. Thank you for sharing your career story with Real World Data Science. Please tell us a little about yourself and your role in data science.
My name is Claire Morton, and I’m an undergraduate student studying mathematical and computational science and environmental justice at Stanford University. I’m particularly interested in using statistics and data science to work with community-based organizations and advance evidence-based environmental justice policy. I have conducted quantitative research on tools to classify disadvantaged communities, oil wells, climate resilience, housing justice, and the connections between soil and health.

What drew you to study statistics and data science?
I really enjoyed my math, statistics, and coding classes in school. I was also inspired by the power that numerical data have to tell stories and promote policy change.

What do you think is your most important skill as a data scientist?
Listening to others. Listening allows me to learn new statistics skills from my mentors and to learn about how best I can work with community partners on their priorities in my research.

How does your gender and/or sexual identity factor into your career?
I am a lesbian, and, at the start of college, I didn’t have any mentors who shared my identity. I’ve now found several through the ASA and queer communities at my university, and I’m continually inspired by the achievements of queer statisticians, mathematicians, and computer scientists. My research hasn’t explicitly connected statistics and queerness yet, but I’m interested in working on projects involving hard-to-reach populations, such as queer people, in the future.

Claire Morton

It’s important to be able to take initiative to learn skills, talk to people, and solve problems as they come up – but it’s also critical to not be afraid of asking for help when you need it.

How did you get into data science?
In high school, I worked in a cell biology lab. As part of that work, I learned to model cellular processes and analyze data from my experiments. I realized that those elements were my favorite parts of the science I was doing, so I decided to study math, computer science, and/or statistics in college. I had always been interested in environmental issues, and so I got involved in quantitative research about environmental justice. I realized that this type of research allowed me to connect the skills I have to my passions, so I’ve kept working in these areas ever since.

What, or who, first inspired you to pursue this career path?
My mom! She’s also a statistician, and she has encouraged and mentored me throughout my academic journey. I’m inspired by her success as a woman in statistics.

What hurdles or challenges have you faced in your studies?
My classes can be tough, which makes it hard to stay motivated sometimes. I also struggled to maintain a healthy work-life balance at the start of college. Finally, it has been tough to learn some of the ins and outs of the research process and publications – how best to engage with research mentors, what it looks like to write and submit a paper, and some of the nuances of working in academia. I think my next big challenge is deciding what to do after college, though I’ve been trying to reframe the question as an opportunity rather than a hurdle. I’m excited to continue doing research at the intersection of statistics and public policy in the future.

What was your first job in data science, and how does it compare to your current role?
My first job in data science was as a researcher, working at a non-profit called Physicians, Scientists, and Engineers for Healthy Energy (PSE). As part of this job, I worked with community-based organizations to code quantitative optimization models to locate climate resilience hubs in California that took community priorities into account. I’m currently a student, which involves less research but gives me the chance to focus on learning new skills for my next job. I hope to be able to work in a role like my job at PSE, combining statistics/data science, community engagement, and policy impact, in the future.

What was the most important thing you learned in your first year on the job?
The importance of being adaptable and self-directed. Research projects shift and change as you uncover new information, and it’s useful to be able to shift with them. It’s important to be able to take initiative to learn skills, talk to people, and solve problems as they come up – but it’s also critical to not be afraid of asking for help when you need it.

What have been your career highlights so far?
One was publishing research about the demographics of people living near oil wells in California, which informed policymakers about racial and socioeconomic differences in exposure to oil wells and is part of a long-standing effort from activists and researchers to protect the health of Californians. I’ve also gotten to work directly with organizers on several mapping projects, which was deeply fulfilling. Finally, I loved getting to present my research at the Joint Statistical Meetings last year, and I look forward to presenting my undergraduate thesis this year.

What three things are at the top of your reading/study list?
Some statistical areas I’m hoping to learn more about are spatial statistics and survey methods. Some books I’m excited to read are The Color of Law, Data Feminism, and Thicker Than Blood: How Racial Statistics Lie. 

What advice would you give for anyone wanting to study statistics and data science?
Find mentors that inspire you, support your career goals, and challenge you to learn and grow as a statistician.

What new ideas or developments in the field are you most excited about or intrigued by?
I’m really interested in combining quantitative research with community-based research, so I think that cross-disciplinary developments are exciting. I’m intrigued by AI tools, and I’m interested to see how these tools change what the day-to-day of being a statistician looks like and the skills that are most sought after in a statistician.

And what do you think will be the main challenges facing the profession over the next few years?
The main challenges will be related to statistical literacy, both for the people consuming and doing statistics. While statistical methods and data becoming more accessible is a positive development, it has meant that more analyses are done incorrectly and that more misleading results are publicized (and absorbed) as truth. It’s getting much easier to twist numbers to support whatever we want them to say, and I think this will continue to challenge both statisticians and non-statisticians in the future.

About the ASA Pride Scholarship

The ASA Pride Scholarship was established to raise awareness for and support the success of LGBTQ+ statisticians and data scientists and allies. The scholarship will celebrate their diverse backgrounds and showcase the invaluable skills and perspectives these individuals bring to the ASA, statistics, and data science.

Apply or nominate someone for the ASA Pride Scholarship.

Discover more Career profiles

Copyright and licence: © 2023 Royal Statistical Society

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Photo of Claire Morton is not covered by this licence.

How to cite: Tarran, Brian. 2023. “‘I was inspired by the power that numerical data have to tell stories and promote policy change.’” Real World Data Science, June 28, 2023. URL

‘Living my identity takes courage. It is the same courage necessary to start a new business’

Brian Tarran — Wed, 28 Jun 2023 00:00:00 GMT

Members of the committee nominated two individuals to be featured as part of our career profile series, and so we’re pleased to bring you interviews with Albert Lee (below) and Claire Morton.

Read on to discover more about Albert’s data science career (so far).

Hi, Albert. Thank you for sharing your career story with Real World Data Science. Please tell us a little about yourself and your role in data science.
My name is Albert Lee. I’m the founding partner at Summit Consulting, a quantitative and financial consulting firm in Washington, DC. Summit delivers data-driven solutions to help make government effective and society just. I started Summit in 2003, and we recently celebrated our 20th anniversary.

I received my PhD in economics from UCLA in 1999. My professional practice is focused on econometrics – an academic specialty that blends economic theory with statistical practices – and statistical sampling.

What does your job involve?
A large portion of my time is spent running Summit and making decisions about management, personnel, and business development. That said, I am still pretty active in technical topics. I am a testifying expert in econometrics and statistical sampling. Recently, I have been leading a team of data scientists who are reformulating the edit and imputation algorithms for the US Department of Agriculture’s National Agriculture Statistical Service, which collects survey data from US agriculture sectors.

What do you think is your most important skill as a data scientist?
Explaining technical concepts is a big part of my job, and it requires the ability to consume the technical literature and know the concepts well enough that I can explain them to a lay audience (such as lawyers, judges, and program staff).

How has your gender and/or sexual identity factored into your career?
My gender and identity have given me important perspective as a data scientist and an entrepreneur. Living my identity takes courage. It is the same courage necessary to start a new business. From a young age, my identity has conditioned me to be comfortable with differences.

My identity has also taught me to see similarity among differences. Empathy is essential in client services and especially in quantitative consulting, where some of my clients feel disempowered by the complex subject matter.

Albert Lee

The data science field is moving very fast. Every day brings a new algorithm, software program, and hardware innovation. Since data science is a multidisciplinary field, keeping up with it has been challenging.

How did you get into data science?
Although I studied mathematics and economics as an undergraduate student and economics as a graduate student, my academic training was very theoretical. I didn’t work with data and computers extensively until my first job outside of academia in the early 2000s. Little did I know that it was the advent of the “big data” revolution.

At Summit we serve mostly federal agencies, who are sitting on decades of administrative data – information they collected as part of their mission but not of research quality. These agencies want to use their administrative data to automate their routine tasks (like predicting which loans will default first) and evaluate program efficacy (determining whether a training program reached its goals). Extracting and analyzing administrative data has been a big part of my career.

When I founded Summit, data science was not a recognized discipline. But as the datasets get larger, decisions about hardware setup, software programs, estimation algorithms, and data virtualization have become increasingly intertwined and interdependent. This really was my first taste of data science as we know it today.

What, or who, first inspired you to become a data scientist?
There are too many people to mention by name. I owe a lot of my career to my first two managers at KPMG, Alan Salzberg and Rick Holt. They taught me how to code and reason quantitatively. And Rob Gould at UCLA has patiently converted a theorist to an empiricist. Once a convert, now a zealot.

What were the hurdles or challenges that you needed to overcome on your route into the profession?
I am an immigrant and a first-generation college graduate. My journey was full of unknowns. Figuring out my academic and professional career has taken a lot of exploration. In this regard, the same exploration that guided my identity also guided my academic and professional journey.

And what are the challenges that you face now that you are working in data science?
The data science field is moving very fast. Every day brings a new algorithm, software program, and hardware innovation. Since data science is a multidisciplinary field, keeping up with it has been challenging. As I progress along my professional journey, striking the right balance between management, hands-on practice, and learning has been difficult as well.

What was your first job in data science, and how does it compare to your current role?
As an entrepreneur, I was given a lot of professional freedom to actualize my career. To a large extent, I have the career that I envisioned. To me, data science lives in the intersection of methods, software, and hardware. I have spent a large part of my career in this intersection.

Of course there are many things that were not part of the original vision, such as running a 100-person organization. My approach has always been intention with openness. By this metric, my current role is not far off from my original vision.

What was the most important thing you learned in your first year on the job?
The ability and the love of learning constantly, regardless of the topic.

What have been your career highlights so far?
The biggest highlight was that on June 15, 2023, Summit celebrated its 20th anniversary! Reformulating the National Agricultural Statistics Service’s edit and imputation systems is also a big deal. And being a testifying expert in some of the most consequential legal cases in the United States was a highlight as well.

What three things are at the top of your current reading/study list?
In recent years, I have been binge-reading Stoic philosophy. I have read most books by Ryan Holiday. His most recent book was Ego Is the Enemy. In between the Stoics, you will find me reading Buddhist meditation literature, including Thich Nhat Hanh’s The Heart of the Buddha’s Teaching. David McCullough’s Truman is also by my bedside.

What advice would you give for anyone wanting to be a data scientist?
Be open and multidisciplinary. Many good ideas in statistics come from other fields, such as economics, medicine, sociology, and education. Computer science enables computational statistics. Having the openness to these topics is key.

What new ideas or developments in the field are you personally most excited about or intrigued by?
Machine learning has transformed statistics both as a consumer and a contributor. It consumes statistics in that it requires cutting-edge statistical techniques and algorithms for its estimation. Machine learning has important applications in many of the statistical sciences.

And what do you think will be the main challenges facing the profession over the next few years?
The proper use of statistics or statistical ethics is an important societal challenge. Machine learning is becoming increasingly sophisticated, and its applications are more broad and pervasive. Machine learning algorithms are making more and more decisions in society, including mortgage loan approvals, residential home prices, and which prisoners receive parole. These are important and weighty decisions. How do we know that these decisions are unbiased and fair?

About the ASA Pride Scholarship

Apply or nominate someone for the ASA Pride Scholarship.

Discover more Career profiles

Copyright and licence: © 2023 Royal Statistical Society

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Photo of Albert Lee is not covered by this licence.

How to cite: Tarran, Brian. 2023. “‘Living my identity takes courage. It is the same courage necessary to start a new business.’” Real World Data Science, June 28, 2023. URL

‘Once I started to see what was possible with data science, there was no going back’

Brian Tarran — Tue, 20 Jun 2023 00:00:00 GMT

Hi, Chanuki. Thank you for sharing your career story with Real World Data Science. Please tell us a little about yourself and your role in data science.
I am Chanuki Seresinhe, head of data science at Zoopla and Hometrack. My commercial career in data science began in 2018 at Channel 4. Since then, I have worked at a few different companies – from startups to scale-ups – before ending up here at Zoopla in 2022.

I am also the founder of beautifulplaces.ai, which is a continuation of my University of Warwick and Alan Turing Institute PhD work where I provided the first large-scale evidence that beautiful places are connected to our wellbeing.

What does your job involve?
My role at Zoopla involves managing data scientists both for Zoopla and Hometrack. At Zoopla, we use data science to create an engaging experience for users who want to buy, sell and rent properties. At Hometrack, the data scientists mainly work on an automated valuation model that provides property valuations to most of the major mortgage lenders in the UK.

As a leader in data science, my role primarily involves helping stakeholders across the business to best leverage data science to reach our business goals, as well as ensuring my data science team has all the support and mentoring they need to design, develop and maintain high performing data science algorithms.

What does “data science” mean to you?
That is a really good question! One thing I have noticed is that people who aren’t familiar with the field often confuse data science and data analytics. There are indeed many similarities – both require quite a bit of knowledge to be able to leverage the correct insights from both structured and unstructured data (data hidden within images, for example). However, data science is essential when you need to make inferences. For instance, you are not only analysing data to see what certain consumers may prefer, but you are also predicting what similar consumers might prefer. Thus, having a strong grounding in statistics is really important for anyone working in this field.

Chanuki Seresinhe

In data science, getting the model right is not enough, and working with people across the business to make sure the model can be integrated into the business processes is essential.

What do you think is your most important skill as a data scientist?
Aside from a good grounding in the technical aspects of data science (which is possible for really anyone to pick up from the many good courses that are available), the most important skill is how you can leverage data science to create products that actually create value for the business. I find this to be the most challenging journey that junior data scientists find themselves having to navigate. They are really excited about the technology, and get carried away with wanting to perfect their algorithms. But when you are building commercial products, what is really important is to constantly engage with stakeholders to make sure you are building something that actually has a tangible business benefit. Early release of a model for user testing is also essential, as models only really get better once you have real user input.

How did you get into data science?
It was somewhat by accident. I previously had a long career in digital and decided to take a career break to return to university and study economics. When I was working on my Master’s degree in behavioral economic science at the University of Warwick, I saw an ad for a PhD to “use online data to understand human behaviour” and I thought this was perfect, as it combined my prior knowledge with a new area I was increasingly becoming drawn to. I quickly taught myself how to program in Python and convinced my supervisors to take me on, and from there on, I came to love data science!

What, or who, first inspired you to become a data scientist?
It was more about realising what you could do with data science. In my PhD, I was quantifying the connection between beautiful places and our wellbeing. While this has long been an intuitive connection, we were not able to test this on a large scale due to lack of data. Being able to use data science to start predicting the beauty of outdoor images was fascinating as it opened up a whole new method for potential research combining beauty with various wellbeing ratings. Once I started to see what was possible with data science, there was no going back.

What were the hurdles or challenges that you needed to overcome on your route into the profession?
For me personally it was the challenge of moving sideways into a leadership role after my PhD and not having to start all the way from the bottom. I would have loved to continue in academia and expand my research even further, but starting from the bottom earning a tiny salary after I had taken quite a large career break to do my PhD wasn’t an option for me. So I decided to go back into the commercial world and look for a senior role from the get go and luckily Channel 4 agreed to give me my first commercial stab at data science.

What are the challenges that you face now, as a working data scientist?
Trying to keep up with everything that is constantly changing in the world of data science. I love the rapid change but it can also be quite time consuming to make sure you are on top of it and giving the right advice to people.

What was your first job in data science, and how does it compare to your current role?
My first job was working as a senior data scientist at Channel 4. As a senior data scientist, even though you have additional goals to help run the team and be a mentor to more junior data scientists, you still get a great deal of time to do coding and develop your own projects. When you move more into a management role, the time you have to develop data science models diminishes. People also expect you to give in-depth guidance when you haven’t actually had much time to deep dive into a project. So, I am often trying hard to make sure I am on top of what is going on even when I have limited time, and really focus on building a strong team that can support each other and collaborate often to create better data science products. Learning to delegate is key!

What was the most important thing you learned in your first year on the job?
How hard it is to actually get organisational buy-in to use data science at scale. It is really easy to get approval to build a proof of concept (POC). However, if you do not use the time when developing your POC to also make sure to get the right stakeholder on board, your project is dead before it even starts. So, in data science, getting the model right is not enough, and working with people across the business to make sure the model can be integrated into the business processes is essential.

What have been your career highlights so far?
It has been great being able to give talks about my research, and data science in general, all around the world. I have actually come to love public speaking, and I hope that as I continue to be recognised for my expertise, I can encourage and aid potential data scientists with their careers – especially minority women, as I think that diversity in the field is very important. This is a role that is fit for people from all kinds of backgrounds and I hope I am exemplifying this.

Have there been any mistakes or regrets along the way?
In smaller companies, it can easily happen that the founders don’t fully understand data science and often use data science as a buzzword to get investors on board. Whenever you take on a new job, and data science is just getting established, make sure the founders or leaders are actually fully onboard with integrating data science into the product and understand what this means. See if they know how tricky data science can be when first integrating into a product and are actually willing to overcome the challenges with you to eventually reap the huge benefits data science can bring.

How do you think your role will evolve over the rest of your career?
I see my role evolving into being more strategic and less about the data science day-to-day modelling. It is more about being able to advise companies on how to make use of data science as a strategy and helping them figure out where in the product or process to inject it to get the most out of it for the business as a whole.

If you were starting out in data science now, what would you put at the top of your reading or study list?
Practise how you would apply using data science for a real life problem. Seek a placement, as this will pay dividends in being able to speed up your learning.

If you don’t understand the statistics involved in data science, make sure to upskill in that area before starting your first role. A lot of junior data scientists focus on learning how to code or get carried away with the modelling without first learning the importance of preparing the data in the correct way so that your predictions can work well in a real life setting.

What personal or professional advice would you give for anyone wanting to be a data scientist now?
Try to find ways to stand out from the average data scientist. When we open up applications for data science jobs, I get hundreds of applications for each one. I am looking for people who can not only do data science but who also have other stand-out qualities that they can bring to the business. This can be something along the lines of effective stakeholder engagement to deep expertise in a certain domain or technology.

What new ideas or developments in the field of data science are you personally most excited about or intrigued by?
I am really interested to see where generative AI will take us – particularly about how it can help us improve the speed of our own performance. It feels like generative AI can be a technology that can help everyone – even the everyday person – as it can help speed up so many processes, from coding to writing to ideating. While the technology is still in its early days, it is progressing rapidly and I am very curious to see where this will lead in the next year!

What do you think will be the main challenges facing data science as a field in the next few years?
Generative AI’s latest breakthroughs have made AI capabilities accessible to the broader public, but it has also stoked fears around the use of AI. The headline-grabbing narratives around AI and existential threat is distracting from other conversations that are really important. There are some very real issues we do need to solve – from biases in AI to the impact generative AI can have on wages and workforce – but this needs to be approached in a constructive and thoughtful way.

We need to find a way to engage with the public in a more meaningful way – rather than scaremongering – to have public debates on issues that actually matter.

Discover more Career profiles

Copyright and licence: © 2023 Royal Statistical Society

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Photo of Chanuki Seresinhe is not covered by this licence.

How to cite: Tarran, Brian. 2023. “‘Once I started to see what was possible with data science, there was no going back.’” Real World Data Science, June 20, 2023. URL

‘For me, data science is about bridging the gap between business requirements and the data that businesses have’

Brian Tarran — Wed, 24 May 2023 00:00:00 GMT

Transcript

This transcript has been produced using speech-to-text transcription software. It has been only lightly edited to correct mistranscriptions and remove some repetitions.

What does your job at Expedia involve?

I would probably call myself more on the analyst side. So, while my day-to-day job doesn’t necessarily involve AI, ML and productionalising models, it’s more taking business goals or requirements and taking the data and essentially bridging the gap between the two. I am on the incrementality analytics team. So, what that means is I measure the short-term returns from our marketing efforts that we have. And I do that via geotesting. So, I’m essentially working in the geotesting part of the company if you like. And before that I worked in the customer data section. So, essentially looking at customer data and working with that.

How long have you been working in data science?

More in an analyst role, but probably about seven years now, I began in Stack Overflow just as a data analyst, and then worked at DAZN – which is like a Netflix for sports – as a data analyst, and then joined here as a senior analyst, and then moved into data science in the last couple of years. I would, I would credit Stack Overflow as the place where my career kind of was birthed, if you like. I started there as an account manager, so with hardly any technical background at all required, and then moved into a role that was essentially looking after, or reporting the metrics of advertising campaigns for companies that would advertise on Stack Overflow. So that required a little technical knowledge, not much – a few pivot tables and things like that. But then the longer I stayed there, the further my career developed, and they had, at the time – probably still do – some fantastically smart people that work there, as you can imagine. I was sponsored to do a General Assembly data analytics course, which was focused around Excel, dashboarding, and SQL and essentially fell in love with data analysis. It was the most technical subject matter that I had experienced to that point, and I found a real natural affinity to it, particularly SQL. And then [I] moved into more of a data analyst role within Stack Overflow, so – as you can probably imagine – an absolute sea of proprietary data that needed analysing, and started learning R, or rather being taught R within Stack Overflow, and loved it. I think I was there for three-and-a-half years, and then moved into a data analyst role at DAZN. At this point, I did a data science General Assembly bootcamp course, and fell in love with that. And then I decided that I really loved General Assembly as a concept; I actually started a second job teaching there, so the courses that I had previously taken I was now teaching, first as a teaching assistant, and then as a lead instructor, which was one of the most, yeah, one of the most amazing experiences I’ve had actually, I learned a lot from that. And then I got a job as a senior analyst within Expedia Group, which is where I am now, and then moved into a data science role, which is what I’m doing currently. So, I actually left school at 16, and had to go into a full-time employment. And the General Assembly education that I took a part in was my first of that type. So, when I realised that data science was absolutely something that I really wanted to dedicate the rest of my life to, I decided to take on a part-time data science bachelor’s degree, which I am now about a year away from finishing. Because I’m doing it part time it takes a bit longer. But yeah, so I will have my data science bachelor’s completed, hopefully, by 2024.

Who or what inspired you to work in data science?

There were two big inspirations into getting into data sciences. So, they were actually the data scientists that I worked with at Stack Overflow. They were the first two data scientists, I believe, that Stack Overflow had ever hired. I worked very closely with them as an analyst and one of them was, had previously worked – I don’t know if it was officially an astrophysicist – but had studied black holes, and I remember thinking that was just amazing. And the other was, was very famous within the field. And they spent a lot of time giving me one-on-one training on R and SQL and basic analysis, and I was so inspired by these two individuals that I, it was also a career path that I didn’t really know existed.

Jasmine Holdsworth

What was impressed upon me in that first year [in data science] was the importance of statistics and interpreting statistics in a way that’s honest.

What does data science mean to you?

For me, it is bridging the gap between the business requirements and the data that businesses have. So, you’ll have business goals, requirements that kind of come down the line and there’s a lot of data that’s being collected, and, essentially, you have to try and be the bridge between the two. So, not just doing very complicated analyses, with very sophisticated models – at least not in my role – it’s about being able to create analysis that’s interpretable, that you can present to non-technical stakeholders that they’re going to understand to a degree. So, I do know that in different roles in different companies, it will be slightly different. But yeah, for me, it’s about making data, yeah, interpretable, to the non-technical stakeholders to enable them to do their job better.

What is your most important skill as a data scientist?

I like to think that there is one responsibility around how statistics are interpreted. So, just making sure that when you’re giving someone a statistic, that they understand what it can be used for, what it can’t be used for, and that it doesn’t kind of get halfway around the company before, you know, without any danger of it being misinterpreted. And I do think that the other is just being a translator. So, as well as teaching with General Assembly, I teach people within my company, things like SQL, R, and some basic data analysis. And I feel like it’s taking what is quite a technical, complicated subject and almost translating it into, if you like, English, so that people can kind of get some sense of what something may mean, without necessarily having to have the degree to back it up.

What hurdles did you face in becoming a data scientist?

Towards the beginning of my career to say – 5, 6 years ago – it was quite hard to get interviews. It was never hard to get interviews with technical people within companies, because you can– a technical person can see whether or not you know what you’re talking about. But recruiters don’t, and if someone is a recruiter for a technical role, their proxy for whether or not you can do the role is what’s your level of education, which is completely understandable and that’s what education is for. But it did mean that sometimes I applied for roles that were well below my, my level, and if I did so through a recruiter, then I wouldn’t hear anything back. But if I spoke to a technical person within that company, then it would be fine.

How did you overcome those hurdles?

Actually, I guess the story of how I joined Expedia is quite relevant in that way. So, I presented some, just some fun analysis that I did at an R-Ladies meetup, and I was already talking to a recruiter within Expedia Group and I said to them, oh, well, I happen to be visiting your offices to present at this meetup, so maybe I can meet you there. And they actually sent the manager of the team that they wanted me to join. So, this manager attended the meetup, watched me present, and then they ended up hiring me, which is really great. But I do really think that that was a result of being able to see me on stage, talking about stuff that I had done, showing code that I had written, and it kind of bypassed a few steps. So yeah, I would definitely say meetups and connections are very helpful to overcome that.

The most important lesson from your first year in data science?

I think that what was impressed upon me in that first year – and what really drove me to do the bootcamp courses and then, ultimately, the degree – above everything, actually, was the importance of statistics and interpreting statistics in a way that’s honest. So, I feel like– I feel like with coding, that comes quite naturally to me, and writing SQL queries, R, that was all kind of fine. That didn’t require a lot. But I really, I had an amazing manager who taught me a lot about, essentially, if you’re going to go and speak to this company about the campaign that they’ve run on our website, then you need to impress that X doesn’t necessarily mean Y, it just gives evidence to, or alludes to, and essentially just making sure that how you’re communicating things is as accurate as possible.

Any mistakes or regrets in your career so far?

When I look back on my career, I think the things that have really stayed with me that I’ve really learned from, mistakes wise, are around small little mistakes around how you interpret data. Maybe it was a, like, years ago, summing the wrong cell in Excel but not checking two or three or four times before that goes out. I’m now quite– I over-check everything. I think that the most important part of our job, as well as being the translation, is being the correct translation. You need to be reliable. People need to know that if you put out analysis that they can trust you. So, I would say I regret every small, tiny little data error that I ever made, which I can’t even recall right now, but I know have kind of cumulated enough that it has made me a very fastidious checker, I suppose.

How do you see your role in data science evolving?

I definitely see myself being an individual contributor in a consumer-facing company, just because that is basically what I’ve been doing up to this point. I don’t really have any desire to get into people management. I very much love being stuck in a room with my laptop, solving problems. Above all else, it still makes me happy. However, I do also love knowledge sharing – I love teaching, whether it’s with General Assembly, or whether it’s within the company that I work now. And I would like to kind of balance those two goals moving forward. So, keeping my role within my company as like an individual contributor and actually being like the front face for the, for the analysis that’s happening rather than kind of managing it. But also making sure that I carve out time to upskill others, because data science as a field, I mean, as you all know, is growing so much and people are coming in from different backgrounds. And I’m lucky enough to be able to kind of speak to a few people like that and do some very casual mentorship. And it makes me happy to see, so I hope that as my career develops, I will see more people maybe with backgrounds a little bit more like mine, come through and bring some diversity to the sector.

New developments or ideas you are most excited about?

It would be remiss of me to not mention like ChatGPT or generative AI, etc. But honestly, I am more interested in – or vaguely interested, I should say – in wearable technology. So, I’ve read a few very, very interesting papers and articles that talk about the development of wearable technology, not just your kind of watches, but potentially clothing, etc., that can be used for people with specific health problems to really help pinpoint, like, inflection points in time when something might happen. For example, a heart attack is about to happen, or is imminent, or is happening. So I actually feel like at the moment, this is perhaps going slightly under the radar, compared with more, you know, sexy developments like chatbots and things. But I’m very interested to see in the next one to two decades how ubiquitous wearables will be, and how closely entwined that will become with healthcare. So that’s something that I’m keeping half an eye on.

Any words of advice for budding data scientists?

You will never stop learning at all because, frankly, the field is moving very, very quickly. So, even if you were to kind of consider yourself an absolute expert today, tomorrow that may not be the case. You will constantly be learning. And I have found that learning the same thing several times through different mediums and having things explained to you different ways is so valuable. Because you may think that you understand something from, say, your bootcamp, but then when you read about it as part of your degree – this is obviously personal to me – you read about it in a different way. And you think, Oh, I’ve never thought of it like that. And then you watch a YouTube video and someone visualises it and you think, okay, I understand this all a little bit deeper now. So, constantly revising what you do know and learning anything that’s new as it comes up, I think everyone at every stage of career can kind of, can do that.

Discover more Career profiles

Copyright and licence: © 2023 Royal Statistical Society

This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.

How to cite: Tarran, Brian. 2023. “‘For me, data science is about bridging the gap between business requirements and the data that businesses have.’” Real World Data Science, May 24, 2023. URL

Large language models: Do we need to understand the maths, or simply recognise the limitations?

Brian Tarran — Thu, 18 May 2023 00:00:00 GMT

Part 1 of our conversation with the Royal Statistical Society’s Data Science and AI (DS&AI) Section ended on a discussion around the need to verify that large language models (LLMs), when embedded in workflows and operational processes, are working as intended. But there was also acknowledgement that this could be difficult to achieve, not least of all because, as Giles Pavey said, “nobody knows exactly how these things work – not even the people who build them.” And then, of course, there is the speed at which developments are taking place: Trevor Duguid Farrant made the point that an expert may not even have a chance to finish reviewing the latest version of an LLM before a new iteration is rolled out.

These issues – of verification, explainability and interpretability – are of particular interest to data scientists like Anjali Mazumder, whose work focuses on the impact AI technologies could have, and are having, on society and individuals.

In part 2 of our Q&A about ChatGPT and other LLM-powered advances, and what all of this might mean for data science, Mazumder kicks off the conversation by setting out her perspective.

Our full list of interviewees, in order of appearance, are:

Anjali Mazumder, AI and Justice & Human Rights Theme Lead at the Alan Turing Institute, and DS&AI committee member.
Detlef Nauck, head of AI and data science research at BT, and editorial board member, Real World Data Science.
Martin Goodson, CEO and chief scientist at Evolution AI, and DS&AI committee member.
Louisa Nolan, head of public health data science, Public Health Wales, and DS&AI secretary.
Piers Stobbs, VP science at Deliveroo, and DS&AI committee member.
Trevor Duguid Farrant, senior principal statistician at Mondelez International, and DS&AI committee member.
Giles Pavey, global director for data science at Unilever, and DS&AI vice-chair.
Adam Davison, head of data science at the Advertising Standards Authority, and DS&AI committee member.

Anjali Mazumder: I work in research, but I also sit in the crux of government, industry, and civil society, looking at how they’re using these technologies. For me, it’s about knowing what the opportunities are but also understanding the limitations, the risks and the harms, and how we balance those and put in place oversight mechanisms that act as safeguards to ensure that these technologies don’t cause harm. We’re taking a very socio-technical approach that requires an interdisciplinary team to understand these issues and what should be done. Part of this is about not only the outcomes and the impact but also the upstream side of it – recognising that these models have been built on the work of people who have done the labelling, and that this also has implications – to say nothing of the associated environmental issues or energy issues!

Detlef Nauck: I think the regulators really have to look at this. It has come completely out of left field for them. All the regulators that we are monitoring, they regulate the space as it was three years ago – they are mainly concerned about predictive models and bias. But if you look at, say, what Microsoft wants to do – putting GPT into Office 365 and into Bing – that will completely change how people interact with and consume information. I think the large tech companies really have a responsibility here, when they make this public, to make sure that people understand what this technology actually is, and how it can be used and has to be used.

Also, they need to open up about how these things have been built. There are a lot of stories around how OpenAI used cheap labour in order to do the labelling and reinforcement learning for ChatGPT, and these things have to become public knowledge; they need to become part of some kind of Kitemark for these models: “Ethically built, properly checked, hallucinate only a little bit. Whatever you do, don’t take it for granted. Check it!” That’s the kind of disclaimer they need to put on these models.

If you look at what Microsoft wants to do – putting GPT into Office 365 and into Bing – that will completely change how people interact with and consume information. Large tech companies really have a responsibility here, when they make this public, to make sure that people understand what this technology actually is.

Regulatory principles always seem to stress that AI systems should be understandable, and we should be able to explain how we get particular outputs. But a lot of our conversation has focused on how we don’t really know how these models work. So, is that, in itself, a problem, and is it something that the data science community can help with – to dig into how these things work and try and get that across – to help meet these principles of explainability and interpretability?

DN: That’s a very specialist job, I would say – specialist research into how these mathematical structures work. It’s not something I could do, and I’ve not seen any significant work there. One thing that we are interested in is whether we can do knowledge embedding, so that you can “teach” concepts that these models can then use to communicate, and that would lead to smaller systems where you have some understanding of what’s inside. But all of this kind of work, I think, is very much just beginning.

Martin Goodson: Do we actually need this? There’s sort of a big assumption there that you need to understand how LLMs work in order to build in the kinds of things that we care about as a society. But we don’t understand how humans think. Of course, we can ask a human, “Why did you make that decision?” You can’t understand the cause of that decision – that’s a complex neuroscience question. But you can ask what the reason is for making a decision, and you can ask an LLM what its reasoning is as well. I think a lot of these questions about explainability are stuck in the past, when you’re trying to explain how a linear model works. It’s really not the same thing when you’ve got an LLM where you can just say, “Why did you make that decision?”

Louisa Nolan: I was going to say something very similar. Most people don’t know how most things work…

DN: My point was, these things are largely still like the Improbability Drive in the Hitchhiker’s Guide to the Galaxy. You press a button, and you don’t really know what comes out, and that’s the problem we need to get our heads around.

LN: But people don’t know what percentages are, and yet we still use them for decision making. I don’t think people need to understand the maths behind LLMs, and I think it would be a hopeless job to expect everybody to do that. What we do need to understand is what LLMs can and can’t do. What’s the body of work that they are drawing from? What isn’t in there? What are the things that you need to check? So, for some things, it’ll be brilliant: if you’ve written something and you want it rewritten for a nine-year-old; if you want to summarise a paper; if you want to write code, as long as you already know how to code – these could be real labour-saving tasks. If you’re using ChatGPT to write a thesis about something that you haven’t looked at, that’s where the danger is. It’s this kind of simple understanding that people need to get in their heads – and the maths, except for the people who care about it, is beside the point, and probably detrimental, because it means that people won’t engage with it.

DN: I agree, but I wasn’t talking about the general public. I meant, the people who build these things – they should know what they’re doing.

There’s a big assumption that you need to understand how LLMs work in order to build in the kinds of things that we care about as a society. But we don’t understand how humans think. You can ask a human what their reason is for making a decision, and you can ask an LLM what its reasoning is as well.

We talked there about communication. There was a webinar recently by the Royal Statistical Society’s Glasgow Local Group, and the presenter, Hannah Rose Kirk, showed how you can take tabular data and statistical results and ask ChatGPT to produce a nice paragraph or two that explains the numbers. Is this the sort of thing that any of you have experimented with? Have you had any successes at using ChatGPT to translate data into readable English that decision makers can act on?

Piers Stobbs: I have an interesting use case. We had a basic survey: 200-odd responses, multiple languages, and we just said, “Please summarise the results of this survey contained in this CSV file.” And it came up with five or six relevant bullet points. What was amazing was that we could then interrogate it. For example, “Please compare the results that were in English versus in French, and describe the differences.” Again, it did it, but then you have the issue of, was it all correct? Well, the bulk of it was. Now I am intrigued by whether you can ask it to do correlations and some actual statistical things on a dataset, and does it get that right? I don’t know. We’ve not really got to that. But, to go back to one of the earlier discussion points around productivity, that initial survey work could have easily taken a week of someone’s time if we didn’t run it through ChatGPT.

Trevor Duguid Farrant: Piers, in this case you’re interested in checking and seeing if it’s right. If you’d asked a group of people to do that survey for you and get the results, you’d have just accepted whatever they gave you back. You wouldn’t have questioned it.

PS: That’s true. And the results were plausible, certainly.

AM: I think one of the challenges is that the results could seem like they’re plausible, right – whether that’s a statistical output or a text output. This was not a proper experiment, but I asked ChatGPT about colleagues and friends who are quite prominent researchers, asking, “Who is so-and-so?”, and it produced biographies that were quite plausibly them, but it wasn’t them. It might have listed the correct PhD, say, but the date was off by a year, or the date was correct but it was from the wrong institution. So, depending on what the issue is, these seemingly plausible results could have more serious implications.

LN: So, just to join those two things together: for me, the question is not, “Do we understand how ChatGPT works?” As Martin says, we don’t understand how humans work, and surely we’re trying to develop something that enhances human thinking in some way. The more important question for me is, “How do we know that what is produced is useful?”

For me, the question is not, ‘Do we understand how ChatGPT works?’ We don’t understand how humans work, and surely we’re trying to develop something that enhances human thinking in some way. The more important question is, ‘How do we know that what is produced is useful?’

Giles, you mentioned previously that you’re doing some work at Unilever around how to minimise hallucination. I don’t know how much you can say on what direction that’s going in, and how successful you expect that to be, but that’s obviously going to be a really important part of refining these models to be more widely usable.

Giles Pavey: I’m by no means an expert, but there’s quite a lot you can do with both the architecture of it and also the pre-prompts that you put in – more or less saying, “Quote what the source is, and if you’re not sure, then tell me you’re not sure.” I think what’s interesting is the question of whether we’ll have to rely on OpenAI or Microsoft to do that work, and it will be just another thing that we have to trust them for. Or, will it be something that people within an organisation can put in themselves?

MG: I think it’s absolutely critical that open-source models are developed that can compete with these tech companies, otherwise there’s going to be a huge transfer of power to these companies.

GP: Arguably, the single biggest issue is, who elected Sam Altman (no-one) and are we as society happy with him having so much power over our future?

To close us out, I’d like to return to a question Trevor posed earlier, which is: How might organisations like the Royal Statistical Society help companies to embrace LLMs and start using them, so that everyone can benefit from the technology?

Adam Davison: My instinct is that there’s some great parallel here with the stuff that the Data Science and AI Section have been doing in general, where we’ve said, “OK, there’s lots of good advice out there on how to do things in data science, but how do you make it practical? How do you make it real? How do you apply those ethical principles? How do you make sure you have people with the right technical understanding in charge of projects to get value?” If, five years ago, the hype around data science was leading organisations to hire 100 data scientists in the hope that something innovative would happen, well then, we don’t want those same organisations now thinking that they need to hire 100 prompt engineers and keep their fingers crossed for something special. Our focus has been on “industrial strength data science”, so I think we can extend that to show what “industrial strength LLM usage” looks like in practice.

Want to hear more from the RSS Data Science and AI Section? Sign up for its newsletter at rssdsaisection.substack.com.

← Read part one

Back to Careers

Copyright and licence: © 2023 Royal Statistical Society

This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Images are not covered by this licence. Thumbnail image by Google DeepMind on Unsplash.

How to cite: Tarran, Brian. 2023. “Large language models: Do we need to understand the maths, or simply recognise the limitations?” Real World Data Science, May 18, 2023. URL

How is ChatGPT changing data science?

Brian Tarran — Thu, 11 May 2023 00:00:00 GMT

For many people, it starts with a question. Something simple, something they already know the answer to. A test, in other words, to see what these AI-powered chatbots are all about. But spend any amount of time with ChatGPT and other such tools and you’ll quickly start to wonder what else they might do, and how useful they might be in your day-to-day working life.

Data scientists certainly have been thinking along these lines, and to find out more about current use cases, proofs of concepts and potential applications, Real World Data Science got together with members of the Royal Statistical Society’s Data Science and AI Section (DS&AI) for a group discussion.

Our interviewees, in order of appearance, are:

Piers Stobbs, VP science at Deliveroo, and DS&AI committee member.
Detlef Nauck, head of AI and data science research at BT, and editorial board member, Real World Data Science.
Adam Davison, head of data science at the Advertising Standards Authority, and DS&AI committee member.
Trevor Duguid Farrant, senior principal statistician at Mondelez International, and DS&AI committee member.
Giles Pavey, global director for data science at Unilever, and DS&AI vice-chair.
Martin Goodson, CEO and chief scientist at Evolution AI, and DS&AI committee member.

The first part of our discussion focuses on how large language models are becoming part of the data science toolkit, and what this new development means for data science teams and skillsets. Stay tuned for part two, which we’ll be publishing soon!

(UPDATE: Part two is now published: “Large language models: Do we need to understand the maths, or simply recognise the limitations?”)

As data scientists, how has ChatGPT – and other tools built on large language models (LLMs) – changed your working lives?

Piers Stobbs: Up to about a year ago, although I was really impressed with the developments in deep learning and the improvements in computer vision and natural language models, it felt in line with general improvements in machine learning. And then, probably about six months ago, with things like DALL·E and ChatGPT, it felt like something changed – properly ground-breaking capabilities. And I still can’t quite get my head around the fact that you can basically have a model that tries to predict the next token, and it comes up with outputs that really feel quite sensible and human-like – if prone to hallucination.

The way I think about it is, this feels like a brand-new capability that we’ve just not really had before. It’s almost like an interface with unstructured information. Historically, you sort of have to turn text into something, and then turn something back into text, if you want to have this interface with humans. Now, we’ve got this really quite elegant way of plugging the gaps, which feels full of opportunities.

I’m having great fun playing around with the code co-pilots – GitHub’s Copilot is amazing and, productivity wise, is helping me a lot. I am now a much faster coder because there’s all those Stack Exchange lookups that I don’t have to do anymore. Again, from a personal productivity perspective, I’m using [ChatGPT] for initial drafts of documents and other things. And then I use it almost for validating things. For instance, I had a random discussion the other night with ChatGPT about logistic algorithms. It’s not going to solve problems for you, but I asked it to give some pointers of things I could be thinking about – some of which I had, some of which I hadn’t. So, it’s almost like a brainstorming helper, somehow.

But probably the thing I’m most excited about is the knowledge sharing side of it – plugging it into, or on top of, private information, and surfacing all that knowledge that is locked away in documents and intranet pages.

Piers Stobbs

This feels like a brand-new capability – an interface with unstructured information. Historically, you have to turn text into something, and then turn something back into text, if you want to have this interface with humans. Now, we’ve got this elegant way of plugging the gaps.

Detlef Nauck: We’re looking into running proof of concepts to see whether LLMs do bring value. Software engineering is the most obvious one, and easiest to set up and run. And then we want to look at making use of internal documents – so, either summarization or creation of internal documents in appropriate language. The latter use cases are trickier to evaluate. We want to know whether the outputs produced are any good. With software engineering you can track GitHub statistics, for example. But if you give ChatGPT to somebody to write marketing material, or to get information out of a document, how do you know that the results are good? We need to get our head around metrics for evaluation.

Adam Davison: I’ve been using it for basically anything where I don’t remember the API very well or it’s a bit confusing. Pandas is the key, right? We all use pandas, but you don’t really remember how to do some complicated thing with apply(), say, so you just ask GPT-4 to give you the answer, and it saves you that hassle. Also, I read some insightful tweet that said these chat systems are really good for things where generating the solution is hard, but verifying it is easy. And I think that’s true for some of these things. You know, you get a short piece of Python code, you can basically look at that and you can tell if it’s right.

In data science, you’re a bit of a jack-of-all-trades. You need to do little bits of everything, but you’re not a specialist in anything. And so, I think for software development, it’s been really helpful. For example, right now, I’m doing a bit of frontend development in a project to visualise something. I’m never going to be a professional frontend developer, but GPT-4 can help deal with the oddities of JavaScript much more easily than it would be for me to trawl through Stack Overflow posts.

But the thing that we’re using it for, practically, is natural language processing (NLP) and classification. We have this particular problem at the Advertising Standards Authority (ASA) where we are running lots of different models that are completely unconnected to each other because every project is a different topic. So, one week we’re looking at, “Do these gambling ads appeal to young people?” and then the week after it’s, “Are these cryptocurrency ads being clear about the risks involved?” It’s very disparate, we don’t have a lot of time to iterate on models, and we don’t have huge amounts of training data. Ten years ago, when you were doing sentiment classification, you were on Mechanical Turk getting 10,000 examples, and even then it didn’t work very well after these really complicated models. Now, you’ve got a couple of hundred examples and with the embeddings [from LLMs] you can get to a pretty decent classifier quite quickly. We’re also starting to experiment with using OpenAI’s fine-tuning tools, and the performance that we’ve seen from that is very impressive, to the extent that it’s making us rethink whether we bother doing anything else in some of our classifiers.

Adam Davison

Five years ago, if you had a sophisticated problem involving text or images, you’d need a big research team with a big budget to tackle it. But increasingly we find, like many other people, that you can take models off the shelf and repurpose them for quite diverse tasks.

Trevor Duguid Farrant: My organisation is not as far forward as the rest of you. I’ve introduced it to the leadership team, and the digital services team – what was IT – are looking to make a decision on whether we can use it or not. I think there’ll be so much pressure they’ll have to use it, but there’s still a feeling of discomfort with it, whereas I think it’s really good and have started using it. Everyone else on the call seems to have started using it. So, can organisations like the Royal Statistical Society help companies to embrace this and start using it, and then everyone can benefit from it?

Giles Pavey: I wish I could be with Piers and Adam – actually using it – but my life has been taken over as the guy who goes and explains it to the business. Unilever is a massive business, and we are concerned about privacy, confidentiality and trustworthiness. We’ve now built an initial GPT instance on Azure and fed it with some of our own documents, and a lot of my time has been working with legal to convince ourselves that that’s okay. Now we are really trying to work out just how we manage the amazing demand for proofs of concepts and use cases – and what we’re just about to uncover, I think, is the unknown but potentially massive expense of running it.

In pure proofs of concepts, departments that have large knowledge banks are using it: research and development, and marketing, for example. And one of the big technical things that we’re working on – and, because of the size that we are, we’re doing a lot of work with OpenAI and Microsoft on this – is how to stop the models from hallucinating.

Have your experiences with ChatGPT and other tools changed your thinking about the skillsets required of data scientists and data science teams?

AD: A little bit. As someone at a small organisation, I think it’s quite exciting because, five years ago, maybe you were in a world where if you had a sophisticated problem involving text or images, you’d need a big research team with a big budget to tackle it. But increasingly we find, like many other people, that you can take models off the shelf and repurpose them for quite diverse tasks. So, I think it’s becoming increasingly viable to have a small team of people who are implementers, who aren’t necessarily backed up by a big research organisation, doing increasingly sophisticated stuff.

I don’t think it does away with the sort of things that we always bang on about in the Data Science and AI Section, like the need for an understanding of statistics and how the underlying systems really work, because I think you still need to understand what you’re doing with LLMs, as with any other machine learning technique. But, if I had to guess, what we’re going to be seeing now is that for a lot of problems, you’re going to have more of a division – so, you’re either in one of a small number of very large labs doing research on very cutting-edge big models, or you can be an implementer who is taking things off the shelf and applying them. And maybe that space in between is going to get a little bit squeezed – that would be my guess, but obviously it’s very unpredictable.

Giles Pavey

We’ve built an initial GPT instance on Azure. Now we are really trying to work out just how we manage the amazing demand for proofs of concepts and use cases – and what we’re just about to uncover, I think, is the unknown but potentially massive expense of running it.

PS: That’s exactly my view. When I first started hiring data scientists, a long time ago, you basically had to write stuff from scratch, and you needed PhDs – people who really understood, at a deep level, how the maths all works. But I think there’s been a steady progression towards valuing software engineering skills, and I think, in some ways, this is another step along that path. If I think now about implementing a chatbot over your own knowledge base, it’s basically like plugging APIs together with some Python. Adam’s point is still hugely important, though, because I think we still need the background knowledge about what’s actually going on – OK, I’m creating embeddings here, and that’s allowing this search to work so I can surface the right docs – that whole process, which an average software engineer is maybe not going to know. But I think it’s definitely blurring the lines.

Martin Goodson: It’s just as important now to understand how to evaluate performance. The difference is, it used to be that you were trying to figure out whether it’s 80% accurate, or 85%. Now, it’s like 99.9%. But you still need to figure it out. You still need to understand what the failure modes are, what caused it; how is it actually working, and is it doing what you need it to do for the product? Is it actually satisfying our needs as users or as customers of the products.

DN: I think in the future, the skills we will need are people who can run and build these models. Giles made the point about how expensive it is to run these things. Right now, you have two options: subscribe to an API, and then you are limited in how you can modify these models; or build your own – take an open-source LLM and modify it. But then you need people who know how to build a high-performance computing environment and operate this efficiently. You need to know how to actually train the models, how to curate the data, how to set the model parameters. And I always think there’s too much alchemy still going on in this field, right? It’s not proper science. People build these things and then are surprised at what they can do; they didn’t know such things would be possible. A lot of these capabilities only emerge when you make the models really, really big and, essentially, you also have no control over them – you can’t stop them hallucinating. So, these are the kinds of issues we need to get under control if we want to get any value out of them.

Prompt engineering is another one – you really need to understand how these models work and how to prompt them. If you want to give them to, say, a marketer to generate copywriting, they may not have the right ideas of how to prompt the machine. So, I could see roles developing out of data science that understand how to influence these models and make them do what we want them to do.

MG: The other angle to this is junior engineers. Now, the bar for being a useful junior engineer is that you’re better than GitHub Copilot. Why do you need a really junior person if you can just use a large language model to be the junior developer?

DN: I’m not thinking about the data science person who needs to write some code for a project here, but if you have a large software team in an organisation that produces production code, they will become more efficient by using these tools. But still, with all this overhead of testing and putting it all together, there will be a lot of manual work that needs to be done. But the teams will get more efficient and junior people will get up to speed quicker. That’s probably another advantage.

Trevor Duguid Farrant

Can the Royal Statistical Society help non-tech companies embrace large language models, extolling their virtues and dispelling the myths?

PS: I think Detlef’s point about understanding is an interesting one. It definitely feels like there’s been this sort of continuum from, you know, “OK, it’s a linear regression, we know what’s going on” to complex models to ensemble models where, again, you’re combining these things you can individually understand. Even with big ImageNet architectures, billions of parameters, at least conceptually you can understand how these work and build out tools where you can understand the layers. To me, what’s different now is you’ve got this reinforcement learning layer on top, or diffusion layer, or some other additional approach – this combination of really complicated things. I honestly don’t know where to start with trying to understand why a specific output is generated, and I think that is a proper concern. That’s definitely an area of research, because I think we need to understand this.

GP: I think there’s also a question in large companies of just who owns these things. Up until this point, everybody’s been happy that AI is the realm of data science. And, suddenly, generative AI looks like it might be the realm of the IT team – that it’s a service that you get off the shelf. It’s going to be interesting to see how that plays out. I really liked the point that Martin was making about being able to tell what the systems are actually doing, what they are supposed to do and how to check them, because if you don’t have a background in that area, you might just assume they work. Now, nobody knows exactly how these things work – not even the people who build them. But having a background in how you test things, for potential causes for things not working, is actually going to be incredibly powerful or useful.

TDF: Will experts like us actually be able to check it because of the speed that new versions are coming out and the developments that are happening? Is it going to take us six months to check that GPT-3.5 works? Well, too late, a month later GPT-4’s out! I just think that pace is going to keep accelerating.

Want to hear more from the RSS Data Science and AI Section? Sign up for its newsletter at rssdsaisection.substack.com.

Back to Careers

Read part two →

Copyright and licence: © 2023 Royal Statistical Society

This interview is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Photos are not covered by this licence. Portrait photos are supplied by interviewees and used with permission. ChatGPT homescreen photo by Levart_Photographer on Unsplash.

How to cite: Tarran, Brian. 2023. “How is ChatGPT changing data science?” Real World Data Science, May 11, 2023. URL

‘I always thought someone like me couldn’t work in data, let alone data science’

Brian Tarran — Mon, 24 Apr 2023 00:00:00 GMT

Hi, Sami. Thank you for sharing your career story with Real World Data Science. Please tell us a little about yourself and your role in data science.
Hello! I’m Sami Rahman, a passionate head of data engineering and data platform at Penguin Random House, the book publisher that has enriched lives through literature. I started my career in data science five years ago and I’ve evolved into a data generalist with expertise in machine learning, data infrastructure, and data strategy.

What does your job involve?
My role is about harnessing the power of data to drive extraordinary outcomes. Leading a skilled team, we empower our company to leverage data and cutting-edge technologies for informed decisions and automation. I help shape our organisation’s capabilities in data science, analytics, machine learning, data management, and strategy.

What does “data science” mean to you?
Data science, to me, is a captivating fusion of modern data technologies and computational statistics that tackles business challenges, crafts intelligent automation, and generates insightful revelations.

What do you think is your most important skill as a data scientist?
Active listening is key. A data scientist must be surgical and precise in developing models, analysis, and tools that reinforce the company’s bottom line and operations. Data science exists to create value using data.

Photo supplied by Sami Rahman, used with permission.

As I’ve transitioned into management, maintaining my coding prowess is an ongoing challenge. I stay sharp by doing data science and infrastructure development for fun, leveraging tools like ChatGPT and AirOps where I’m rusty.

How did you get into data science?
I began with a psychology degree, which led to working as business psychologist where I discovered psychometric data analysis. After a master’s in countering organised crime and terrorism and a few short jobs in counter terrorism/intelligence, I decided that it wasn’t for me. I embraced my love for statistics and research, I dove into data science, learning Python through online platforms, and secured my first data scientist role at a WPP agency called Essence.

What, or who, first inspired you to become a data scientist?
I always thought someone like me couldn’t work in data, let alone data science. Dr Suzy Moat’s fascinating talk on machine learning’s application to human behaviour and psychology showed me that a psychologist could make a significant impact in this field, inspiring my aspiration to try to have a data science career.

What were the hurdles or challenges that you needed to overcome on your route into the profession?
Breaking into data science without a typical background in maths/computer science/physics was daunting. Building a Kaggle portfolio and coding models for fun prepared me for interviews. Another challenge was learning to harmonise my “data brain” and “business brain” to solve problems efficiently. Understanding how data solutions impact business problems will always propel you forward. 

And what are the challenges that you face now, as a working data scientist?
As I’ve transitioned into management, maintaining my coding prowess is an ongoing challenge. I stay sharp by doing data science and infrastructure development for fun, leveraging tools like ChatGPT and AirOps where I’m rusty. I’m currently building my own cloud data platform and running a lot of image neural networks on it.

What was your first job in data science, and how does it compare to your current role?
As an analytics executive at WPP agency Essence, I tackled data science, cloud engineering, and analytics problems for clients. They were a lot more singular and tactical in nature. Now, as head of data engineering and data platform at Penguin Random House, I focus on shaping data and technology strategy to align with the company’s broader vision.

What was the most important thing you learned in your first year on the job?
To always consider the bigger picture: how your work integrates with the organisation/client’s objectives, delivers value, and aligns with the aspirations of other stakeholders. Actionable insights and value is the most important thing.

What have been your career highlights so far?
Two shining moments include being the first of three of HSBC UK fraud data science leaders, where each of our departments tackled a different type of crime and protected our customers, and developing data strategies and capabilities for analytics, science, and business intelligence at Penguin Random House.

Have there been any mistakes or regrets along the way?
I regret not delving deeper into natural language processing (NLP) or spatial data science, which are now more accessible and growing fields within data science. I reckon the NLP methodologies would’ve been extremely useful seeing as I’m at a publishing company now!

How do you think your role will evolve over the rest of your career?
As data technologies become more accessible, I anticipate data roles will transform. I envision a future where data professionals focus on general AI, quantum machine learning, and multi-dimensional data analytics as traditional specialisms become democratised.

If you were starting out in data science now, what three things would you put at the top of your reading/study list?
I’d recommend Skin in the Game by Nassim Nicholas Taleb, Calling Bullshit: The Art of Scepticism in a Data-Driven World by Jevin West and Carl Bergstrom,  and Artificial Intelligence: How Machine Learning Will Shape the Next Decade by Matthew Burgess.

What personal or professional advice would you give for anyone wanting to be a data scientist now?
Success in data science hinges on understanding how it can transform organisations and engaging with business stakeholders. My advice: never stop listening to the business – the stakeholders are your biggest allies. I would also try to find your niche that sets you apart from everyone else. Mine when I first started in the field was my expertise on computational psychology and behavioural machine learning. 

What new ideas or developments in the field of data science are you personally most excited about or intrigued by?
Transfer learning excites me most, as numerous large technology companies now offer pre-trained models based on billions/trillions of parameters. This will revolutionise industries worldwide, as it will be easier to build more performant models even if a company has less data.

What do you think will be the main challenges facing data science as a field in the next few years?
The challenge lies in staying relevant amidst the democratisation of data science. Through large language models, low-code, and transfer learning, advanced data science methods will become easier for non-specialists to do and use. Innovation and keeping up with modern data technologies will be crucial.

Discover more Career profiles

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Photo of Sami Rahman is not covered by this licence.

How to cite: Tarran, Brian. 2023. “‘I always thought someone like me couldn’t work in data, let alone data science.’” Real World Data Science, April 24, 2023. URL

The politics of performance measurement

Noah Wright — Tue, 18 Apr 2023 00:00:00 GMT

At the beginning of 2016, the Criminal Justice Division (CJD) of the Texas Governor’s Office received news all government agencies dread: budgets were to be cut. CJD oversaw a grant program that funded specialty courts throughout the state, however it was now being told that the program’s budget of $10.6m would be reduced 20% to $8.5m by 2018.

How should these cuts be distributed among grant holders? CJD had no meaningful performance data on which to base its decisions, and I would know: I was hired by the agency just a few months before to analyze grant performance. Still, decisions needed to be made. We had to come up with a plan of action, and the clock was ticking…

This is a story of making opportunity out of crisis, of the interaction between not just theory of change and technical implementation, but the “political” process of negotiating these changes with stakeholders in a manner that led to better decisions. Through careful outreach and continuous communication, we developed a data collection and performance assessment process that enabled us to allocate budget cuts in a manner widely accepted.

The story ends on a bittersweet note. But, along the way, there are lessons to be learned about how to find common ground, manage expectations, forge productive working partnerships, and sustain a data science project longer term.

Step 1: Consider your options

Texas had over 150 specialty courts in 2016, providing a program of specialized services – usually drug treatment – to offenders as an alternative to incarceration. About half of the state’s specialty courts received CJD grant funds (and about half of grantees received 100% of their program budget from our grants). Funding cuts of the size we needed to make would not go over well with them. Any changes to the program would have to run a gauntlet of decision-makers including advisory boards, interest groups, and professional associations, most with contacts in the legislature.

Complicating this situation further, CJD didn’t even make the final funding decisions. We administered the grants, but the merit review process fell to the Specialty Courts Advisory Council, an appointed group of specialty courts staff and related experts who annually scored the grant applications we received. We needed to get them onboard.

The way our Executive Director saw it, we had three options to implement the cut in a way that could get us buy-in from stakeholders and the Advisory Council:

Cut across the board. The Advisory Council would employ the same scoring method as the previous year but reduce each grant amount by 20%.

This option would leave long-running grantees scrambling to make up for this shortfall by reducing services, laying off staff, or spending more of their limited local funds. Worse, it would punish all grantees equally – our most successful programs would be arbitrarily defunded.

Fewer grants. Grants were scored based on the quality of their application and all grants that passed a certain threshold got funded. The Advisory Council would employ the same scoring method as the previous year but instead of funding the top $10.6m worth of grants, they would fund the top $8.5m worth.

This seemed a less bad option than cutting across the board, but we would still run into the problem of arbitrarily defunding successful programs. Grants near the bottom of the Advisory Council’s cutoff that got funded the previous year would be denied renewal only because the goalposts had moved.

Targeted funding. The Advisory Council would incorporate performance data and statewide strategic plan alignment into their scoring method and make cuts accordingly.

At the time, the Advisory Council did not take performance into consideration when scoring grant applications. They agreed in theory that a grant requesting its tenth annual renewal should perhaps at some point be assessed on its outcomes, but they had never seen CJD commit to a rigorous performance assessment process before. We administered the grants, not them, so without our commitment to develop a performance assessment process, and their trust in that commitment, this would not be a viable option.

After due consideration, option 3 emerged as the favorite of our Executive Director. On the face of it, this seemed the most “objective” approach to take. We would let the data decide who gets funded and who doesn’t, rather than cutting arbitrarily. But that would be a fallacious argument. Data does not decide. It might inform our decisions, but it would be up to us to choose the structure of the performance measurement process: what aspects to focus on, what data to collect, what benchmarks to set – all of which would later help determine funding decisions. And in any funding decision, politics inevitably plays a role.

Politics is, in its broadest sense, the negotiated decision-making between groups with opposing interests. And in developing our performance measurement process we would encounter a variety of interests – from the Advisory Council down to the grantees themselves. Success would require us to acknowledge stakeholder perspectives and address or manage them appropriately. Planning decisions made in the early phases of a project as a result of political processes directly influence the type and scope of analysis a data scientist will eventually be able to perform, so it behooves the data scientist to participate in these processes!

Step 2: Engage stakeholders and define performance

Having settled on our preferred option, our Executive Director convened a strategy session with the Advisory Council to discuss how to proceed as part of a broader strategic plan. The session began by achieving consensus on high-level goals such as “fund strategically”, “focus on success”, “build capacity”, etc. The session also helped the Advisory Council and CJD alike to clarify our conception of how we ought to fit into the specialty courts field going forward. CJD would develop its performance assessment system to help the Advisory Council target funding, but that would come as part of a larger plan that included capacity building, training and technical assistance, helping courts obtain non-CJD sources of funding, and steering grantees toward established best practice.

We left the meeting with a very basic plan that looked good on paper. Our Executive Director set to work persuading our external stakeholders of the wisdom of this new strategic direction. Meanwhile, I had to build a performance assessment process that people could trust.

CJD had no formally designated standards to measure performance data against. However, drug courts have been around for decades and there existed a large body of research supporting the program model.¹ Offering supervised drug treatment instead of incarceration had been repeatedly shown to cost less money and lower recidivism rates. I performed a literature review and spoke with numerous subject matter experts to get started on defining program-specific performance metrics.

I was conscious that imposing metrics without any feedback or input from affected parties would all but guarantee bad-faith engagement, especially if these metrics are tied to funding. A problem inherent to any performance measurement is that once something gets measured as a performance outcome, it warps the very processes it is intended to measure. This phenomenon happens so frequently that the phrase “Campbell’s Law” was coined to describe it in 1979.² Think of standardized testing at schools: once the government ties test performance to school funding it creates powerful incentives for schools to improve test scores at any cost. Even in the absence of outright cheating, struggling schools feel massive pressure to adjust their curriculum, to the point where they teach test score optimization strategies more than math, language, history, and science.

I consistently heard from specialty court scholars and practitioners alike that arrest recidivism would be the ideal outcome measure. On paper, recidivism was a direct expression of long-term program success and could also be used as an outcome variable for classification modeling. And, in practice, a court could do little to affect recidivism by way of manipulation. Courts do not make arrests – police make arrests. Once a specialty court participant finished a program, the court itself no longer intervened in their lives. If a participant got arrested within 1-3 years of completion, the program had no say in the matter.

This, however, presented an implementation problem: one-year recidivism data would, by definition, take a year past the point of implementation to collect, i.e., not soon enough to inform our cuts. And while recidivism was the best measure of success, it could not be the only measure. Recidivism was, after all, a stochastic process not within the court’s control – a crime wave or other systemic factors could move recidivism up and make it look like a successful court had actually failed. We would have to use something else as well.

The National Association of Drug Court Professionals (NADCP) publishes a book of best practice standards, and our stakeholders identified a court’s adherence to these standards as another strong performance assessment standard. These criteria, unlike recidivism, were directly under the program’s control. Does your program have the recommended staff? Does your program drug test participants frequently enough to guarantee sobriety? Does your program meet with participants regularly enough? Do you offer a continuum of services instead of a “one-size-fits-all” approach?

In addition to being much easier to measure than recidivism, best practice adherence also resists Campbell’s Law by avoiding outcome measurement. In our school metaphor, this would be like measuring school performance based on student-to-teacher ratio, variety of course offerings, attendance rates, and teacher qualifications. Far from perfect, but measuring a variety of elements that predict success and taking them as a whole represents a vast improvement over a single, easily-gamed outcome measure.

But to operationalize these standards, we would have to have good data.

Step 3: Update processes and collect data

We inherited a longstanding process in which grantees had to fill out a form every six months asking them to report performance data. This is a screenshot of what that form looked like:

No additional definitions or instructions were provided, leaving grantees with many questions: Does the request for “annual data” mean as of fiscal year or calendar year? What counts as a person being “assessed for eligibility”? And so on. Grantees did not know the answers, and neither did we. And these were the more straightforward measures. The form went on for 10 pages, most of which asked grantees to report extensively on information they had already provided as part their application.

This disaster of an assessment process did have a silver lining. When we announced we were throwing out these forms entirely we faced almost no pushback from grantees.

We knew from the start that our new assessment process would need to collect individual-level participant data instead of aggregated measures. Even with clear definitions, 75 grants would mean 75 different aggregations at work. Asking the grantees to report their individual-level participant data in a consistent format and doing the aggregations ourselves meant a single aggregation at work.

But we needed to establish trust with grantees before making this request. Strictly speaking, we could mandate the reporting of this data. However, if that angered enough of our grantees, they or their contacts might take it up with our bosses at the Governor’s Office, and our bosses could cancel any plan we came up with if they thought it was not worth the fuss. So, from day one we communicated clearly to all grantees that we would maintain total transparency when it came to definitions and calculations. Before we used any calculated metric to assess performance we would send it to the grantees themselves to review for accuracy.

To avoid the vagueness and inscrutability that characterized the old reporting process, every piece of data we asked for in the new process had a clear written definition and specific reason for being asked. These reasons usually amounted to some combination of best practices, Advisory Council recommendations, and grantee suggestions.

Implementing the new process was far from easy, however. We faced numerous administrative and technical barriers. Texas courts at this time did not share a common case management system, so we couldn’t just get a data export from everybody. Meanwhile, the Governor’s Office banned all of its divisions from all usage of the cloud. This forced us to build a more labor-intensive reporting process, in which courts would obtain blank Excel templates with required data fields. Courts had either to fill out these templates by hand or export their case management data and reconfigure it to template specifications. Then, courts submitted their data for review and we sent back any bad formatting.

We collected preliminary data at the six-month mark and made another adjustment based on these results, which we would not count toward performance measurement. A majority of courts had some kind of data error in this first case. Specific definitions of data fields had to be written and rewritten using grantee feedback over the course of the year, leading to significant changes between the six-month reports and the year-end reports.

Importantly, we had developed reporting requirements iteratively with participation from grantees and the Advisory Council from the start. By mid-2017 we had so successfully achieved buy-in that only one grantee court’s judge refused to give us data (the court’s grant manager later sent it to us).

Step 4: Analyze and report findings

In the course of this process, we established the benchmarks in Table 1 based on best practices and justification for funding. Because this was our initial rollout, we set the specific values low to function more as minimum standards than targets.

Table 1: Specialty court best practices translated into quantitative measures.

Benchmark	Best practice	Rationale
1. Number of participants	10+	CJD decision: programs should be of sufficient size to justify a grant
2. Number of graduates	5+	CJD decision: programs should be of sufficient size to justify a grant
3. Graduation rate	20%-90%	CJD decision: 0% and 100% success rates are both red flags
4. Average amount of time graduates spent in program (in months)	12-24	NADCP best practice recommended program lengths of 1-2 years
5. Percent of graduates employed, seeking education, or supported through family, partner, SSI, etc.	100%	NADCP best practice recommended against releasing participants without financial support, which all but guarantees relapse or rearrest.
6. Percent of participants with “low-risk” assessment score	0%	NADCP best practice recommended moderate- or high-risk participants. Research had shown that low-risk participants get little benefit.
7. Average sessions per participant per month	1+	NADCP best practice recommended sessions be held at least monthly.

Grantee performance data for each benchmark would be generated from the individual level data that courts sent us. Crucially, we sent our aggregations back to grantees for confirmation prior to using them in any kind of evaluation, alongside the program-wide average and the best practice values for comparison (example in the table below). If something didn’t look right, they had the chance to let us know before we took their numbers as final.

Table 2: Specialty court best practices compared with program-wide averages and grantee reported values.

Benchmark	Best practice	Program-wide average	Grantee reported values
1. Number of participants	10+	89	96
2. Number of graduates	5+	25	27
3. Graduation rate	20%-90%	71%	56%
4. Average amount of time graduates spent in program (in months)	12-24	17	14
5. Percent of graduates employed, seeking education, or supported through family, partner, SSI, etc.	100%	95%	100%
6. Percent of participants with “low-risk” assessment score	0%	18%	2%
7. Average sessions per participant per month	1+	2	3.7

In the end, we found seven grants that we could unequivocally recommend be cut. Two of the seven had effectively never gotten off the ground, and served almost no participants the entire year. The other five served mostly low-risk participants, the type of people that research had shown do not benefit from specialty court programs. Some of these grantees were inevitably disappointed at the decision, but we had so actively worked within the field to develop and justify our processes that they understood why the decision had been made.

Factors for success

In the span of one year, CJD went from collecting a large volume of useless data to a specific, targeted collection of data informed by best practices. The new collection process had high grantee compliance and stakeholder buy-in.

The following factors proved essential to getting to a place where we had useful, reliable data upon which to base future data science efforts:

Discontent with status quo

The Advisory Council wanted CJD to play a more active support role in the field. Meanwhile, everyone disliked the existing performance assessment process. As a result, most of the challenges we faced along the way related to implementation rather than defending the status quo on its merits.
A catalyst for change

Despite existing discontent, it took a funding shortfall to kickstart the process of change. It would have been unlikely for us to be able to create this system a priori.
Continuous, high-quality communication

We could impose rules and requirements all day long, but without good faith engagement from the grantees we could never collect the quality of data we needed. Note that “continuous communication” does not mean “tell them everything you do at every point”. People become overwhelmed by torrents of information.
Humility and flexibility

Had we begun this process assuming we had all of the answers, we would have been dead in the water. Continuous outreach and willingness to take criticism and suggestions shaped the process as it progressed, ultimately producing a better end-product than we could have devised on our own.
An established program model

Drug courts have been around for decades, with a vast body of supporting research and a community of practitioners and scholars we could speak to. That meant we could focus on implementation and execution instead of determining if the model worked or not.
Strong leadership support

From the very beginning, we could not have accomplished what we did without the full support and advocacy of our Executive Director.

Coda: Why knowledge transfer is vital

I wish I could write a follow-up article about how we started using classification modeling to identify the most successful programs and to promote better approaches and practices; about how we iterated the process through multiple funding cycles, tuning and perfecting it to better meet stakeholder needs. But I cannot.

The performance assessment system we built had some major weaknesses from the outset. It was labor intensive, not required by law, produced no immediate benefit to the agency itself, and was so new it had yet to be entrenched in agency practice. In other words, no institutional incentives worked in its favor. Only the continual push of our Executive Director and myself kept this new performance assessment system going, and once we left the agency, it foundered.

Still, the experience taught me much. I learned first and foremost that programs do not sustain themselves. Most of our attention had been focused on building up the best process we could. Only a minimal effort had been spent on institutionalizing and sustaining it. We had written documentation but no fundamental changes in policy or rule. We had undertaken groundbreaking efforts and built relationships, but had not planned for any meaningful knowledge transfer to other staff. While we had intended to eventually do these things, fate took us away before we could get them in place.

For any kind of change to last, sustainability must be built in from the start. In the moment, these actions can seem low-priority. Policy and rule changes can be arduous and time-consuming. Knowledge transfer from one stably employed staff to another feels redundant and wasteful. But without embedding sustainability, no success will outlast the individual people pushing for it.

Back to Careers

About the author: Noah Wright is a data scientist with the Texas Juvenile Justice Department. He is interested in the applications of data science to public policy in the context of real-world constraints, and the ethics thereof (ethics being highly relevant in his line of work). He can be reached on LinkedIn.

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.

How to cite: Wright, Noah. 2023. “The politics of performance measurement.” Real World Data Science, April 18, 2023. URL

Footnotes

Some newer types of courts (Commercial Sexual Exploitation, Mental Health, Veterans) had a much more limited body of research and had to be accommodated separately. For the sake of keeping this narrative coherent I’m focusing on drug courts, which were the majority of our programs.↩︎
Rodamar, Jeffery. 2018. “There ought to be a law! Campbell versus Goodhart.” Significance 15 (6): 9. https://doi.org/10.1111/j.1740-9713.2018.01205.x↩︎

‘Data science challenges you to keep learning – there’ll always be new advances in the field’

Brian Tarran — Tue, 28 Mar 2023 00:00:00 GMT

Hi, Tamanna. Thank you for sharing your career story with Real World Data Science. Please tell us a little about yourself and your role in data science.
I’m Tamanna Haque. I’ve been working at Jaguar Land Rover for nearly four years, recently promoted to lead data scientist working within product engineering. It’s coming up to eight years that I’ve been working in the field, and my areas of interest are the use of machine learning to provide the best products and experiences for my customers and stakeholders.

What does your job involve?
My role involves using the connected car and AI to make our products and customer experiences better, whilst leading within our wide data science team too. The data science team in Manchester, UK, originated with myself and one of my teammates – it’s since grown to nearly 40 (cross-sites and countries) and developed into a high-performing, advanced data science team.

What makes us stand out is the nature of our work – we mostly use vehicle data (of participating customers), which is different to a lot of other commercial businesses or teams who’ll focus more on transactional or web data. The data we use lends itself to some pretty interesting projects, and a general futuristic feel here.

I’m particularly interested and active in enabling a more electric and modern luxury future from the use of vehicle data.

What does “data science” mean to you?
The realisation of value! Whether that is added revenue, saved costs or improved growth, I’m led by what data science can do for the business and its customers. The use of data science can open up many exciting, value-adding opportunities.

Photo supplied by Tamanna Haque, courtesy of Jaguar Land Rover. Used with permission.

There are more routes to getting into data science nowadays, but it’s important to not lose sight of fundamentals such as statistics and mathematics. A lot of people can code-up models but it’s fair to say that only a portion of them appreciate how to do this responsibly.

What do you think is your most important skill as a data scientist?
I’ve always presented myself as a technically astute data scientist, even when entering leadership. But my niche is my ever-growing commercial awareness and passion about our products, customers and business. These aren’t new qualities, but they now align with my professional interests, as well as personal (I’ve been a fan of the Jaguar brand since childhood)!

How did you get into data science?
I did a maths degree at the University of Manchester, where I specialised in statistics. I didn’t do any post-graduate education and this was fine for me.

After graduating, I joined a digital fashion retailer (with a financial services proposition) as an analyst initially. I learned a lot about real-life data and analytics itself, whilst developing a rounded understanding about the business and how to deal with stakeholders cross-functionally. I must have served a few hundred at least(!) and left most of the ‘fancy’ stuff I learned at university aside, whilst getting to grips with so many aspects of commercial analytics. A great way for me to set solid foundations for what followed, and I personally feel this gives me a lens that others who dive straight into data science don’t have.

I was soon attracted to data science because it tapped into what I learned at university and challenges you to keep learning; there’ll always be things to learn, and new advances in the field.

What, or who, first inspired you to become a data scientist?
I have a twin sister, we’ve always been together throughout education. Even before we graduated together, she secured her first role as an analyst. This opened my eyes to data, and data science followed for us both!

What were the hurdles or challenges that you needed to overcome on your route into the profession?
I had a few people tell me I couldn’t do data science, possibly because I didn’t fit the typical data scientist stereotype in several ways. I think attitudes in the field have changed over time though and on a personal level, it’s motivated me to give it everything, and I can’t regret that.

And what are the challenges that you face now, as a working data scientist?
I need to manage my diary well to ensure effectiveness and work-life balance. I’m overseeing people, other projects, doing public speaking and trying to remain hands on. I sometimes block out chunks of time in my diary – I need some meeting-free time to produce quality technical work. I try to finish on time and enjoy a very busy social life with my family and friends. A flexible attitude to how we work helps to keep me happy and energised whilst I’m delivering from various angles.

Thinking back to your earlier roles in data science, how do they compare to your current role?
My current role is very different to my previous roles. I’m continually learning and adapting how I can be a good leader, providing support to a breadth of colleagues (in and outside the team) whilst delivering myself. I’m actively involved in setting and refining our team’s strategy and I’m enjoying leading projects which either deliver high financial impact or help set the path in terms of new tech and/or machine learning capability. There is much more responsibility but it’s easy to stay energised when working on cars and for a business I’ve long admired.

What was the most important thing you learned in your first year on the job?
I should have had more confidence in myself, but this grew – as I adjusted to the new environment I became much more assertive. My domain knowledge and data science expertise combined help to build my self-confidence, credibility and reputation.

What have been your career highlights so far?
I’m most proud of my recent promotion from senior to lead data scientist. Also it was exciting for my family and I when I gained an offer to join Jaguar Land Rover.

Have there been any mistakes or regrets along the way?
No, what’s meant to be will be!

How do you think your role will evolve over the rest of your career?
My progression has been relatively rapid, and I hope I’ve got many, many years ahead of me in my career. It’s hard to say how my role will evolve, I have a blend of responsibilities in my role which combined provide great fulfilment for me at the moment.

If you were starting out in data science now, what would you put at the top of your reading/study list?
A good understanding of analytics and the domain you’re in are my recommended prerequisites to doing data science.

Analytics is an important part of the data science lifecycle, being able to get the data yourself and communicate results with influence, for example, are just a few aspects of analytics which underpin successful data science projects.

Also, without awareness of the business and industry you’re working in, you can become very dependent on others. Data science itself can be quite challenging, so it’s great to have a solid foundation before starting out.

What personal or professional advice would you give for anyone wanting to be a data scientist now?
With the level of continuous learning required to just simply keep up, it can be more of a lifestyle and not a job, so this is something to consider!

What do you think will be the main challenges facing data science as a field in the next few years?
I still expect to see a skills gap in the field. There are more routes to getting into data science nowadays, but it’s important to not lose sight of fundamentals such as statistics and mathematics. A lot of people can code-up models but it’s fair to say that only a portion of them appreciate how to do this responsibly, understanding samples versus populations, statistical testing, which type of regularisation to use in a neural network, et cetera.

I also think there’s a challenge of questionable data science products reaching high levels of popularity and usage amongst the public… Some recent developments in this space have been extremely intelligent but raise ethical concerns. Just because something can be done with AI doesn’t mean it should, and my preferences are towards AI being ethical and (ideally) explainable.

Discover more Career profiles

This work is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence, except where otherwise noted.

How to cite: Tarran, Brian. 2023. “‘Data science challenges you to keep learning – there’ll always be new advances in the field.’” Real World Data Science, March 28, 2023. URL