<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/foundation-frontiers/</link>
<atom:link href="https://realworlddatascience.net/foundation-frontiers/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<image>
<url>https://realworlddatascience.net/images/rwds-logo-150px.png</url>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/foundation-frontiers/</link>
<height>83</height>
<width>144</width>
</image>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Fri, 10 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Inside ‘RSS: Data Science and Artificial Intelligence’ with Neil Lawrence</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2026/04/02/neil-lawrence-interview.html</link>
  <description><![CDATA[ 





<p>Real World Data Science recently had the opportunity to sit down with <a href="https://www.cst.cam.ac.uk/people/ndl21">Professor Neil Lawrence</a>, Editor-in-Chief of the Royal Statistical Society’s new journal, <a href="https://academic.oup.com/rssdat">RSS: Data Science and Artificial Intelligence</a>. Neil, who is the DeepMind Professor of Machine Learning at the University of Cambridge, a Senior AI Fellow at the <a href="https://www.turing.ac.uk/">Alan Turing Institute</a>, and a Visiting Professor at the University of Sheffield, is a leading voice in machine learning and AI. He has previous experience as Director of Machine Learning at Amazon and research interests spanning probabilistic models and real-world applications in health and developing economies. He is also passionate about public engagement—he co-hosts the <a href="https://www.thetalkingmachines.com/">Talking Machines</a> podcast and is the author of <a href="https://www.penguin.co.uk/books/455130/the-atomic-human-by-lawrence-neil-d/9781802062106">The Atomic Human</a>.</p>
<p>We recently published a <a href="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/uncertainty.html">Data Science Bite</a> breaking down <a href="https://academic.oup.com/rssdat/article/1/1/udaf002/8317136">the first position paper</a> of the newly launched journal, and had the opportunity to speak to its lead author <a href="https://realworlddatascience.net/foundation-frontiers/posts/2026/01/29/beyond-quantification-delacroix-interview.html">Professor Sylvie Delacroix</a> about its themes: how AI can better support human judgment, why it is crucial to recognise forms of uncertainty that can’t be reduced to numbers, and how participatory design can make AI a true partner, rather than a replacement, for professionals.</p>
<p>In this conversation, Neil discusses the paper and how it aligns with the journal’s vision, plus the importance of bridging machine learning and related fields to keep the human element at the heart of AI systems.</p>
<p>Watch the full interview below and scroll down for key takeaways and some analysis.</p>
<hr>
<section id="interview" class="level2">
<h2 class="anchored" data-anchor-id="interview">Interview</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/VV_FnGQXWlM" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<hr>
</section>
<section id="key-takeaways-at-a-glance" class="level2">
<h2 class="anchored" data-anchor-id="key-takeaways-at-a-glance">Key Takeaways at a Glance</h2>
<section id="the-journal-aims-to-convene-not-conclude" class="level3">
<h3 class="anchored" data-anchor-id="the-journal-aims-to-convene-not-conclude">1. The journal aims to convene, not conclude</h3>
<p>The first paper is intentionally a position paper: an invitation to discussion rather than a definitive answer. Lawrence emphasises that solutions to these challenges are distributed across the community. Progress depends on creating spaces—like the RSS journal—for thoughtful, cross-disciplinary exchange grounded in real-world practice.</p>
</section>
<section id="data-scientists-must-reassess-habits-not-just-adopt-new-tools" class="level3">
<h3 class="anchored" data-anchor-id="data-scientists-must-reassess-habits-not-just-adopt-new-tools">2. Data scientists must reassess habits, not just adopt new tools</h3>
<p>While AI can dramatically increase technical efficiency, Lawrence warns against using that efficiency to simply “do more of the same.” Instead, practitioners should reinvest time in understanding the broader human, societal, and institutional implications of their work.</p>
</section>
<section id="overconfidence-and-lack-of-accountability-in-ai-systems-pose-real-risks" class="level3">
<h3 class="anchored" data-anchor-id="overconfidence-and-lack-of-accountability-in-ai-systems-pose-real-risks">3. Overconfidence and lack of accountability in AI systems pose real risks</h3>
<p>As the journal’s position paper highlights, AI systems, unlike human stakeholders, do not carry social or reputational stakes. This can lead to overconfident outputs without accountability—particularly dangerous in high-stakes domains like healthcare, law, and education. Without better interfaces for uncertainty, professionals risk being distanced from the information they need to make sound judgments.</p>
</section>
<section id="conversational-uncertainty-is-now-central-to-real-world-ai-use" class="level3">
<h3 class="anchored" data-anchor-id="conversational-uncertainty-is-now-central-to-real-world-ai-use">4. “Conversational uncertainty” is now central to real-world AI use</h3>
<p>In many professional settings, decisions are not made through formal statistical outputs alone, but through dialogue—between clinicians, experts, or increasingly, humans and machines. Understanding how uncertainty is communicated and interpreted in these conversational settings is critical, especially as large language models become more influential.</p>
</section>
<section id="bridging-qualitative-and-quantitative-thinking-is-essential" class="level3">
<h3 class="anchored" data-anchor-id="bridging-qualitative-and-quantitative-thinking-is-essential">5. Bridging qualitative and quantitative thinking is essential</h3>
<p>A recurring theme is the need to close the long-standing divide between quantitative methods and qualitative insight. Many real-world decisions are inherently qualitative, yet current AI systems—and much of data science—are optimised for quantification. Failing to integrate these perspectives risks repeating past mistakes where “the numbers” were treated as unquestionable truth.</p>
</section>
<section id="participatory-approaches-lead-to-better-long-term-decisions" class="level3">
<h3 class="anchored" data-anchor-id="participatory-approaches-lead-to-better-long-term-decisions">6. Participatory approaches lead to better long-term decisions</h3>
<p>Although slower upfront, participatory and deliberative processes—bringing together diverse expertise and perspectives—can prevent costly mistakes and misaligned systems. In the long run, they are more effective than purely efficiency-driven approaches.</p>
</section>
</section>
<section id="join-the-conversation" class="level2">
<h2 class="anchored" data-anchor-id="join-the-conversation">Join the conversation</h2>
<p>This conversation touches on a theme we often explore here at Real World Data Science: the idea that the future of data science and AI will not be defined by technical capability alone, but by how well we integrate human judgment, context, and responsibility into our systems. The position paper—and RSS: Data Science and AI more broadly—is an open invitation to engage with these questions. Whether through research, case studies, or reflections from practice, there is a clear call for contributions that connect technical work with real-world impact.</p>
<p>As Neil suggests, the answers are unlikely to come from any single discipline or organisation. They will emerge from a broader conversation across the data science community.</p>
<p>Now is the time to be part of that conversation: answer RSS: Data Science and AI’s <a href="https://academic.oup.com/rssdat/pages/call-for-papers-uncertainty-in-the-era-of-ai">call for submissions</a>.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">

</div>
</div>


</div>
</section>

 ]]></description>
  <category>Interviews</category>
  <category>AI</category>
  <category>Ethics</category>
  <category>Uncertainty</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2026/04/02/neil-lawrence-interview.html</guid>
  <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2026/04/02/images/thumb1.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Why Great Models (Still) Fail</title>
  <dc:creator>Jennifer Hall</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2026/02/25/why_great_models_still_fail.html</link>
  <description><![CDATA[ 





<p>In the field of data science and AI, it’s easy to assume that technical excellence is the ultimate goal. Performance can be quantified in ROC curves, accuracy scores, and other metrics, but a model can be technically brilliant and still deliver no real-world impact.</p>
<p>In our earlier article <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/why-95-percent-of-ai-projects-fail.html">“Why 95% of AI Projects Fail”</a>, Lee Cleweley examined the strategic and organisational reasons AI struggles to deliver value. This piece takes that conversation to the ground level, offering a practitioner’s guide to designing models that succeed in real‑world use.</p>
<p>Success in practice goes far beyond code and algorithms. It comes down to solving the right problem, in the right way, for the right people. No matter how elegant a technical solution is, it must address real problems for real users. Achieving that requires more than strong technical workflows—it also demands an understanding of how the model and technical solution fit into the bigger picture. To do that, data science and AI practitioners, when designing their solution, need to see how it will sit within broader processes, including how end users will actually interact with and use it.</p>
<p>The importance of this skill emerged repeatedly in the “10 Key Questions to Data Science and AI Practitioners” interview series, run by the Data Science and AI Section of the Royal Statistical Society. The series gathers perspectives from practitioners at various career stages, from those starting their career to senior leaders. By posing the same ten questions, it uncovers motivations, challenges, and visions for the future while highlighting the breadth of career paths in the field. When asked what they considered the most undervalued skill, many participants highlighted the importance of something non-technical — the ability to understand the organisational context and the needs of users.</p>
<p>The importance of these skills for data science and AI practitioners is further evidenced by their emphasis in government and professional standards. The UK Government’s <a href="https://ddat-capability-framework.service.gov.uk/role/data-scientist">DDaT Capability Framework</a> highlights that data science practitioners, especially at higher levels, are expected to “design and manage processes to gather and establish user needs”. Similarly, the Royal Statistical Society, in <a href="https://rss.org.uk/RSS/media/File-library/Membership/Prof%20Dev/AdvDSP-Guidance-Notes-2024.pdf">The Alliance for Data Science Professionals Certification Guidance and Process: Advanced Data Science Professional</a>, lists as a key skill “engaging stakeholders, demonstrating the ability to clearly define a problem and agree on solutions”, including being able to “Identify and elicit project requirements”. Together, these frameworks show that engaging directly with users and stakeholders is not optional—it is a core professional expectation for data science and AI practitioners.</p>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p><strong>The Case of the Vanishing Model</strong></p>
<p>Consider a fictional scenario that may be painfully familiar to practitioners. A practitioner is asked to “build a model to predict which customers are likely to leave.”</p>
<p>They get to work: sourcing data, engineering features, and testing a range of algorithms. After three months, they deliver a model with 94% accuracy on retrospective data. It’s an elegant solution, built with a technically sophisticated approach, and they are justifiably proud.</p>
<p>Then comes the handover presentation:</p>
<ul>
<li><strong>Marketing:</strong> “How do we act on this? We already run retention campaigns—will this actually improve them?”</li>
<li><strong>Commercial:</strong> “It will cost £X per month to operate. What return should we expect?”</li>
<li><strong>Operations:</strong> “There’s no process for plugging these predictions into the CRM. Who is meant to action this?”</li>
</ul>
<p>The project stalls. Despite strong performance metrics, the model never makes it into production. The lesson is clear: even the most technically impressive solution will fail if it isn’t designed with real-world context in mind. The model simply “vanishes” and all that hard work goes to waste.</p>
<p>This example is deliberately simplified. In some organisations, practitioners may work alongside business partners, product owners, or domain leads who help shape requirements and maintain alignment with broader goals. Yet this support does not remove the practitioner’s responsibility: technical success still depends on their own clear understanding of the business requirement and a recognition that their technical solution may be a small but integral cog in a larger machine. For the machine to work effectively, all the parts must work together. A model is not just a mathematical construct; it is a product that must operate within the complex, resource-limited realities of an organisation.</p>
</div>
</div>
</div>
<section id="start-with-what-we-are-trying-to-achieve" class="level2">
<h2 class="anchored" data-anchor-id="start-with-what-we-are-trying-to-achieve">Start with What We Are Trying to Achieve</h2>
<p>Too often, data science projects begin with vague aims such as “build a model” or “forecast sales.” These are activities, not outcomes. What matters is the result the organisation is striving for—for example, increasing upsell revenue by £2M this quarter or preventing 500 contract cancellations per month through timely intervention. Asking the right questions early is essential for designing solutions that can actually be implemented. For instance, a retention model might flag 1,000 customers at high risk of leaving, but if capacity allows only 50 calls per week, the key question becomes: which 50 should be prioritised, and does contacting them actually improve retention compared to a control group?</p>
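<p>To make that concrete, here is a minimal sketch of the prioritisation step, with invented data and hypothetical column names (<code>churn_prob</code>, <code>annual_value</code>): ranking by expected value at risk, rather than raw churn probability, makes the capacity constraint explicit.</p>
<pre class="sourceCode python"><code>import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical scored customer base: 1,000 customers flagged by a churn model.
# Column names and distributions are illustrative assumptions only.
customers = pd.DataFrame({
    "customer_id": np.arange(1_000),
    "churn_prob": rng.uniform(0.5, 0.95, 1_000),    # model's churn probability
    "annual_value": rng.gamma(2.0, 1_500.0, 1_000), # contract value in GBP
})

# Rank by expected value at risk, not raw probability: a 60%-risk customer
# worth £20k may matter more than a 90%-risk customer worth £500.
customers["expected_loss"] = customers["churn_prob"] * customers["annual_value"]
weekly_capacity = 50
call_list = customers.nlargest(weekly_capacity, "expected_loss")
print(call_list.head())
</code></pre>
<p>Whether contacting those 50 actually improves retention is a separate question, best answered with a control group, as discussed below.</p>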
<p>Before writing a single line of code, it is essential to gather as much context as possible:</p>
<ul>
<li>What problems is the business actually solving?</li>
<li>How does the model fit into the wider business process?</li>
<li>Who will use the outputs, and what actions will follow?</li>
<li>How will success be measured—commercially, operationally, behaviourally?</li>
<li>What trade-offs are acceptable in cost, complexity, or speed?</li>
<li>How will performance be monitored over time?</li>
<li>What are the operational constraints?</li>
</ul>
<p>Once the essentials are understood (to the extent they can be), the vision for the project and the success metrics must be agreed collectively. All key stakeholders—technical, operational, financial, and strategic—need to be involved in defining what success looks like. Without this shared vision, each group risks optimising for its own priorities rather than the organisation’s overall goals. Crucially, the vision should extend beyond performance metrics: it should tell the story of the problem being solved and what success will mean in practice. This shared narrative becomes the project’s guiding star. To keep it on course, data science and AI teams, working with stakeholders, must guard against scope creep and shifting success criteria, ensuring that any new requests fit within the agreed scope. Flexibility still has a place—experimentation and design changes are healthy—but only when they remain consistent with the original vision and aligned with the agreed success metrics.</p>
</section>
<section id="the-power-of-test-and-learn" class="level2">
<h2 class="anchored" data-anchor-id="the-power-of-test-and-learn">The Power of Test-and-Learn</h2>
<p>Evaluation and monitoring must be built in from the beginning. Doing so ensures that systems are designed to capture the right metrics for monitoring, rather than scrambling to measure impact after the fact. This means defining not only technical performance measures but also organisational impact measures, all aligned to clear, measurable success metrics. These metrics should be developed collaboratively with stakeholders, and while data scientists may not set them alone, they play a critical role in shaping and challenging them where needed.</p>
<p>A test-and-learn approach is particularly powerful because it generates direct evidence of what works under real-world conditions. For example, a simple test-and-control design, splitting customers into two groups, one acted on and one left as business-as-usual, provides evidence of incremental benefit that is far more persuasive than retrospective accuracy scores. Unlike abstract metrics, this method shows whether interventions truly drive the desired outcomes, and it allows organisations to learn, adapt, and refine strategies over time.</p>
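<p>As a minimal sketch of that design, with invented outcomes and assumed retention rates standing in for real pilot data, a randomised split plus a two-proportion z-test is often enough to show whether the intervention delivers incremental benefit:</p>
<pre class="sourceCode python"><code>import numpy as np
import pandas as pd
from scipy.stats import norm

rng = np.random.default_rng(7)

# Hypothetical pilot: randomise flagged customers between treatment
# (retention call) and control (business as usual). All numbers are invented.
flagged = pd.DataFrame({"customer_id": np.arange(400)})
flagged["group"] = rng.permutation(np.repeat(["treatment", "control"], 200))

# After the pilot window, record outcomes (simulated here for illustration).
flagged["retained"] = np.where(
    flagged["group"] == "treatment",
    rng.binomial(1, 0.78, size=400),  # assumed retention with intervention
    rng.binomial(1, 0.70, size=400),  # assumed business-as-usual retention
)

rates = flagged.groupby("group")["retained"].mean()
uplift = rates["treatment"] - rates["control"]

# Two-proportion z-test: is the incremental retention real or noise?
pooled = flagged["retained"].mean()
se = np.sqrt(pooled * (1 - pooled) * (1 / 200 + 1 / 200))
z = uplift / se
p_value = 2 * norm.sf(abs(z))
print(f"uplift: {uplift:.1%}  z = {z:.2f}  p = {p_value:.3f}")
</code></pre>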
<p>Ultimately, evaluation is about measuring decision performance in practice, while monitoring ensures that impact remains robust as circumstances evolve.</p>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>In our fictional case, the practitioner was told simply “Predict which customers are likely to leave.”</p>
<p>Had the brief been framed instead as: “Identify the top 50 customers most likely to leave and integrate this into daily retention calls, aiming to save £1M/year in lost contracts,”</p>
<p>– the project would have taken a very different path. From the outset, the practitioner could have:</p>
<ul>
<li>Focused on the right features (e.g.&nbsp;time since last contact, usage trends).</li>
<li>Defined the appropriate technical workflows to meet the business vision, such as how best to process the predictions (e.g.&nbsp;daily batches).</li>
<li>Set evaluation criteria, and how these will be measured and monitored over time, not just for accuracy but for contracts saved and revenue retained. For example, is a dashboard needed to monitor technical and/or business metrics over time?</li>
</ul>
</div>
</div>
</div>
<p>Map the current business process end to end, noting all user interactions and data collection points. Then overlay where the model will integrate into that process—the inputs into the model pipeline, who receives the model outputs, how they are acted on, and how outcomes flow back into the system. This makes clear both the operational impact of the model and what changes are needed for it to deliver value.</p>
</section>
<section id="design-for-value-not-novelty" class="level2">
<h2 class="anchored" data-anchor-id="design-for-value-not-novelty">Design for Value, Not Novelty</h2>
<p>Data science is not about building impressive models for their own sake. It is about solving valuable problems in ways that make business sense.</p>
<p>If a model improves accuracy by two percent but costs ten times more to run, is it worth it? The answer depends on whether those extra points translate into measurable financial impact.</p>
<p>Ask:</p>
<ul>
<li>Could a simpler model deliver “good enough” accuracy at lower cost?</li>
<li>What is the marginal value of added complexity?</li>
<li>Does the design reflect operational constraints?</li>
</ul>
<blockquote class="blockquote">
<p>Here, the product mindset for data science and AI practitioners becomes critical. Treating an AI solution as a product reframes the goal from “building a model” to “delivering value.” Like any product, an AI system has costs to design, build, deploy, and maintain. Its worth lies not in technical elegance but in whether the return justifies those costs. That means asking early: is the investment worth it?</p>
</blockquote>
<p>One practical way to answer that question is by forecasting scenarios. Before scaling, estimate the expected impact under different conditions: a base case, a best case, and a worst case. For example, in a retention project, you might forecast incremental revenue by combining churn rates, average customer value, intervention costs, and expected uplift. This makes assumptions explicit and gives decision-makers a clear view of risk and upside. A solution is rarely a guaranteed win, but scenario planning allows stakeholders to judge whether the likely outcomes justify the investment.</p>
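<p>A back-of-the-envelope version of that scenario forecast fits in a few lines. All figures below are illustrative assumptions, not benchmarks:</p>
<pre class="sourceCode python"><code># Hypothetical best/base/worst forecast for a retention pilot.
# "uplift" is the assumed reduction in churn among contacted customers.
scenarios = {
    "worst": {"uplift": 0.01, "value": 2_000, "contacts": 2_500, "cost_per_contact": 15},
    "base":  {"uplift": 0.03, "value": 2_000, "contacts": 2_500, "cost_per_contact": 15},
    "best":  {"uplift": 0.06, "value": 2_000, "contacts": 2_500, "cost_per_contact": 15},
}

for name, s in scenarios.items():
    retained_revenue = s["uplift"] * s["contacts"] * s["value"]  # incremental revenue kept
    run_cost = s["contacts"] * s["cost_per_contact"]             # intervention cost
    net = retained_revenue - run_cost
    print(f"{name:>5}: net £{net:,.0f} (return {retained_revenue / run_cost:.1f}x cost)")
</code></pre>
<p>Making the assumptions explicit in this way also gives stakeholders something concrete to challenge, which is usually more productive than debating a single point estimate.</p>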
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Consider again the retention example. A complex ensemble might squeeze out a few extra percentage points of accuracy, but a straightforward logistic regression—fast, interpretable, and low-cost—might enable daily scoring and immediate action. Even if slightly less accurate, its ease of deployment and alignment with operational capacity could make it far more valuable overall. Simplicity, in many cases, is the shortest route to measurable business outcomes.</p>
<p>A product mindset also changes how value is communicated. Technical performance metrics—“87% recall with XGBoost”—speak to specialists but mean little to decision-makers. A product framing translates performance into outcomes: “This model could reduce service costs by £800k annually by targeting at-risk customers more effectively.” Such claims should be grounded in defendable assumptions: average customer value, historic retention rates, intervention costs, and expected uplift. Framing matters. Commercial cares about ROI, operations about efficiency and capacity, marketing about campaign effectiveness, and leadership about growth and risk. Lead with the “why,” not the “how,” so the role of the model in delivering value is unmistakable.</p>
<p>In our fictional retention project, the gap wasn’t the algorithm—it was the absence of product-minded, value-first design. A better path would have been to:</p>
<ul>
<li>Co-define the decision and action with Marketing: which customers will be contacted, via which channel, on what cadence.</li>
<li>Quantify a credible return on investment with Commercial by building a simple model using actual retention rates, average customer value, contact costs, and expected uplift—then present best/base/worst cases with explicit assumptions. From there, translate the ROI targets into the model performance thresholds (e.g., precision/recall, lift) required to meet them and the agreed success metrics.</li>
<li>Choose the fastest viable baseline—such as logistic regression—to enable daily scoring and interpretability, and document the marginal value required to justify moving to a more complex ensemble. Factor in time investment and run costs, align these with the ROI calculations above, and use that alignment to communicate and justify the investment. This approach also provides a clear benchmark: if the baseline model cannot meet the agreed success metrics, it helps build the case for investing in more complex methods.</li>
<li>Run a time-boxed pilot with a holdout: a four-to-six-week test-and-control experiment; measure incremental saves, revenue impact, and operational load before scaling.</li>
<li>Set guardrails and monitoring: track decision KPIs (contacts made, saves, £ retained) alongside model KPIs; agree thresholds for retraining and a rollback plan.</li>
</ul>
</div>
</div>
</div>
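<p>For reference, the kind of baseline the callout describes can be genuinely minimal. A hedged sketch, with invented features and simulated labels standing in for the real pipeline:</p>
<pre class="sourceCode python"><code>import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Invented training data; in practice the features (e.g. days since last
# contact, usage trend, tenure) come from the agreed data pipeline.
X = rng.normal(size=(500, 3))
y = rng.binomial(1, 0.3, size=500)  # churned within window (simulated)

baseline = make_pipeline(StandardScaler(), LogisticRegression())
baseline.fit(X, y)

# Coefficients are directly inspectable, one reason a simple baseline is
# easy to explain, monitor, and justify before adding complexity.
print(baseline.named_steps["logisticregression"].coef_)
</code></pre>
<p>If this baseline cannot meet the agreed success metrics, that shortfall itself becomes the documented case for investing in something more complex.</p>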
</section>
<section id="build-for-adoption" class="level2">
<h2 class="anchored" data-anchor-id="build-for-adoption">Build for Adoption</h2>
<p>Adoption must be planned from the start. Trust develops gradually, and regular check-ins with stakeholders help sustain it by keeping the project aligned with its agreed vision. These sessions are not box-ticking exercises but chances to test assumptions, surface blockers, gather continuous feedback and make timely adjustments. Ultimately, a model succeeds only if people use it — so adoption depends on seamless integration into existing processes while delivering something users can see a clear benefit from.</p>
<p>Instead of starting with purely technical questions—such as “will I need to export this to a CSV?”—it is often more effective to begin by considering the user journey. For example, if the end goal is for users to view the results in a dashboard, that should frame the discussion from the outset. Once the user’s needs are clear, the practitioner can then work with the data engineering team to determine the most appropriate technical solution, such as the optimal data format or storage approach.</p>
<p>Hence it is important to ask early:</p>
<ul>
<li>Where will predictions appear (CRM, dashboard, alert)?</li>
<li>Will outputs be delivered in tools people already use?</li>
<li>What training or support is required?</li>
<li>How will impact be made visible to leadership?</li>
<li>How best should the outputs of the model be presented to ensure they are usable and actionable for the next stage of the business process?</li>
</ul>
<p>Thinking about these questions early prevents the familiar fate of a technically brilliant model that sits idle.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2026/02/25/images/body.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>Adoption is strongest when development is iterative. Rather than disappearing into a three-month build, teams should work in cycles: release a minimum viable product (MVP), test it with users, gather feedback, and refine. The first iteration of the MVP should be the simplest form of the product while testing the core principle of what the project is trying to achieve. An MVP could be as simple as a weekly spreadsheet with a risk score; if it proves valuable, the team can then invest in automation, dashboards, or more advanced models. This staged approach reduces risk, delivers value early, and builds trust among stakeholders. Crucially, reaching an MVP quickly lets both technical and business teams see what works—and what doesn’t—in practice, instead of relying on endless planning meetings where edge cases are difficult to anticipate.</p>
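<p>To make the MVP idea concrete: a first release really can be this small. The sketch below assumes a hypothetical scored table and invents the file name:</p>
<pre class="sourceCode python"><code>import pandas as pd

# Minimal MVP: a weekly CSV of at-risk customers that Marketing can open
# in a spreadsheet. Columns and scores are illustrative assumptions.
scores = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "churn_prob": [0.91, 0.84, 0.79],
    "suggested_action": ["call", "call", "email"],
})

# One file per week; automation, dashboards, or CRM integration come later,
# once the pilot shows the scores are actually used and useful.
scores.sort_values("churn_prob", ascending=False).to_csv(
    "retention_scores_2026-W09.csv", index=False
)
</code></pre>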
<p>Communication is critical. Just as one study on doctor–patient interactions found that 91% of patients preferred doctors who avoided jargon [1], stakeholders respond more positively when practitioners present results in plain language. Clear explanations build understanding, and understanding builds trust. It is also important to explain, in accessible terms, how a model or tool works “under the hood,” so users can better grasp how decisions are being made. Adoption can be further strengthened by having champions within the business—trusted and respected leaders in the business area who engage end users, promote new tools, and support day-to-day use through training and guidance.</p>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>In the retention case, adoption failed because the model was delivered as a finished artefact, with no path to use. A better approach would have been to:</p>
<ul>
<li>Deliver an MVP: a simple risk score in a spreadsheet, tested with Marketing on a small pilot group while establishing a continuous feedback loop through feedback forms or stakeholder updates.</li>
<li>Work iteratively with data engineering to integrate predictions into the CRM step by step, rather than aiming for a big-bang deployment: define the CRM fields, score push schedule, ownership of follow-up, and SLAs; confirm who acts on the scores and how outcomes are recorded.</li>
<li>Run a test-and-control pilot to prove incremental benefit, building an evidence base for expansion.</li>
<li>Set up a lightweight KPI dashboard so everyone can see early wins in terms of contracts saved and revenue retained.</li>
<li>Create champions by involving stakeholders at every stage, so they own and advocate for the solution.</li>
</ul>
<p>Had the project taken an iterative, MVP-first approach, the practitioner would have avoided months of sunk effort and built momentum for adoption as trust grew over time. Adoption is not an afterthought—it is the decisive factor that turns technical excellence into sustained impact.</p>
</div>
</div>
</div>
</section>
<section id="the-bottom-line" class="level2">
<h2 class="anchored" data-anchor-id="the-bottom-line">The Bottom Line</h2>
<blockquote class="blockquote">
<p>Great models rarely fail because of poor algorithms; they fail because they are disconnected from the goals, workflows, strategies, and people they are meant to serve.</p>
</blockquote>
<p>To avoid the fate of the Vanishing Model, projects must begin with a clear vision — one that is co-created with stakeholders and sustained through regular check-ins. Frame every project around measurable business outcomes and define success before writing a single line of code.</p>
<p>Prove value under real-world conditions with well designed and measurable evaluation plans such as test-and-control approaches. Weigh technical ambition against practical trade-offs—cost, complexity, deployment speed, and maintainability. Translate precision, recall, and ROC curves into outcomes the business understands: contracts retained, revenue gained, costs reduced. And above all, plan for adoption from day one, so that predictions are not just accurate but usable, trusted, and embedded in daily decisions.</p>
<p>In the end, the mark of a great model is not the elegance of its algorithm but its ability to have a positive impact.</p>
<p><em>For a broader, strategic view of why organisations struggle to realise value from AI—and how leadership and structure can change the odds—check out <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/why-95-percent-of-ai-projects-fail.html">“Why 95% of AI Projects Fail.”</a></em></p>
<p><strong>Sources:</strong> [1] Allen, K. A., Charpentier, V., Hendrickson, M. A., Kessler, M., Gotlieb, R., Marmet, J., Hause, E., Praska, C., Lunos, S., &amp; Pitt, M. B. (2023). Jargon Be Gone – Patient Preference in Doctor Communication. Journal of Patient Experience, 10, Article 23743735231158942. DOI: 10.1177/23743735231158942.</p>
<div class="article-btn">
<p><a href="../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/jennifer--hall/">Jennifer Hall</a> is a Senior Analytics Manager at Barclays and Co-Vice Chair of the Royal Statistical Society’s Data Science and AI Section. She is an <a href="https://rss.org.uk/resources/resources-for-educators/rss-william-guy-lecturers/">RSS William Guy Lecturer (2025–2026)</a>; this year’s theme, Statistics and AI, aims to inspire young people to understand how statistical thinking underpins AI and shapes the world around them. Jennifer has extensive experience applying data science and advanced analytics to real-world challenges across finance, travel, healthcare, and insurance. This breadth of experience has strengthened her commitment to delivering responsible, data-driven solutions that create meaningful impact for both businesses and society.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Jennifer Hall<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Hall, Jennifer. 2026. “<strong>Why Great Models (Still) Fail</strong>.” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2026/02/25/why_great_models_still_fail.html">URL</a></p>
</div>
</div>


</div>
</div>
</section>

 ]]></description>
  <category>AI</category>
  <category>Data Science</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2026/02/25/why_great_models_still_fail.html</guid>
  <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2026/02/25/images/thumb.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Beyond Quantification: Interview with Professor Sylvie Delacroix on Navigating Uncertainty with AI</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2026/01/29/beyond-quantification-delacroix-interview.html</link>
  <description><![CDATA[ 





<p>We recently published a <a href="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/uncertainty.html"><em>Data Science Bite</em></a> breaking down the first position paper of the newly launched journal, <a href="https://academic.oup.com/rssdat"><em>RSS: Data Science and Artificial Intelligence</em></a>. The paper, <a href="https://academic.oup.com/rssdat/article/1/1/udaf002/8317136"><em>Beyond Quantification: Navigating Uncertainty in Professional AI Systems</em></a>, argues that if AI is truly to support professional decision-making in high-stakes fields, we must move beyond probabilistic measures and use participatory approaches that allow experts to collectively express and navigate non-quantifiable forms of uncertainty.</p>
<p><em>Real World Data Science</em> recently had the opportunity to speak to the paper’s lead author, Professor Sylvie Delacroix, about how AI can better support human judgment, why it is crucial to recognise forms of uncertainty that can’t be reduced to numbers, and how participatory design can make AI a true partner, rather than a replacement, for professionals.</p>
<p>Watch the full interview below and scroll down for key takeaways and some analysis.</p>
<hr>
<section id="interview-beyond-quantification-and-uncertainty-in-ai" class="level2">
<h2 class="anchored" data-anchor-id="interview-beyond-quantification-and-uncertainty-in-ai">Interview: Beyond Quantification and Uncertainty in AI</h2>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/tJDy293oqPk" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<hr>
</section>
<section id="key-takeaways-at-a-glance" class="level2">
<h2 class="anchored" data-anchor-id="key-takeaways-at-a-glance">Key Takeaways at a Glance</h2>
<section id="not-all-uncertainty-is-measurable" class="level3">
<h3 class="anchored" data-anchor-id="not-all-uncertainty-is-measurable">Not all uncertainty is measurable</h3>
<p>AI often focuses on quantifiable uncertainty, like probabilities or confidence scores, but ethical and contextual uncertainties are equally important in professions like healthcare, education, and justice.</p>
<blockquote class="blockquote">
<p>“The problem is that if we design these systems in a way that means they’re only capable of communicating these quantifiable types of uncertainty, we risk systematically undermining the significance and importance of non-quantifiable types of uncertainty… which are fundamentally ethical and contextual.”</p>
</blockquote>
</section>
<section id="participatory-ai-matters" class="level3">
<h3 class="anchored" data-anchor-id="participatory-ai-matters">Participatory AI matters</h3>
<p>Systems should let professionals shape how uncertainty is expressed, supporting collaboration and collective judgment rather than replacing human decision-making.</p>
<blockquote class="blockquote">
<p>“The intervention that we want is ideally one that means the systems are mouldable by the users over time… that’s what we mean by participatory interfaces.”</p>
</blockquote>
</section>
<section id="the-goal-is-to-support-and-foster-human-intelligence-not-replace-it" class="level3">
<h3 class="anchored" data-anchor-id="the-goal-is-to-support-and-foster-human-intelligence-not-replace-it">The goal is to support and foster human intelligence, not replace it</h3>
<p>The most valuable AI tools help professionals reflect, reason, and intuitively navigate complex situations, rather than just process more data faster.</p>
</section>
<section id="real-world-ai-is-already-in-use" class="level3">
<h3 class="anchored" data-anchor-id="real-world-ai-is-already-in-use">Real-world AI is already in use</h3>
<p>GPs, teachers, and other professionals are using AI in sensitive ways, sometimes for informal “sense-making” conversations that influence moral judgments.</p>
</section>
<section id="small-refinements-have-big-impact" class="level3">
<h3 class="anchored" data-anchor-id="small-refinements-have-big-impact">Small refinements have big impact</h3>
<p>Features like expressing incompleteness, ethical uncertainty, or alternative perspectives can significantly strengthen professional agency when developed with participatory input.</p>
<blockquote class="blockquote">
<p>“You could imagine a GP flagging an output and saying… it turns out the output could have been very dangerous because it didn’t include key diagnostic tools… and you could then imagine an interesting conversation with other GPs to figure out together how incompleteness should be expressed.”</p>
</blockquote>
</section>
<section id="efficiency-should-not-undermine-judgment" class="level3">
<h3 class="anchored" data-anchor-id="efficiency-should-not-undermine-judgment">Efficiency should not undermine judgment</h3>
<p>AI can save time, but systems must preserve the dynamic, normative nature of the professional practices within which they are deployed to ensure long-term effectiveness.</p>
</section>
<section id="the-time-to-act-is-now" class="level3">
<h3 class="anchored" data-anchor-id="the-time-to-act-is-now">The time to act is now</h3>
<p>Professionals, designers, and regulators need to collectively shape AI tools before design choices are frozen, ensuring they support human-centred, ethical practice.</p>
<blockquote class="blockquote">
<p>“If professionals just wait for regulation to intervene, there’s a risk that regulation will arrive only when design choices are frozen… we all have agency in this; we can’t afford to be passive.”</p>
</blockquote>
<hr>
</section>
</section>
<section id="join-the-conversation" class="level2">
<h2 class="anchored" data-anchor-id="join-the-conversation">Join the conversation</h2>
<p><a href="https://academic.oup.com/rssdat"><em>RSS: Data Science and Artificial Intelligence</em></a> has an open <a href="https://academic.oup.com/rssdat/pages/call-for-papers-uncertainty-in-the-era-of-ai">call for submissions</a> responding to the paper.</p>
<p>Sylvie Delacroix’s work is a call to action for data scientists, designers, and professionals alike. We have a window of opportunity to shape AI systems that encourage humans to keep re-articulating the values they care about.</p>
<p>We want to hear from you. As AI tools become more integrated into high-stakes professions, how can we ensure that systems support human judgment in all its facets rather than simply optimising for efficiency?</p>
<p>Read the full paper <a href="https://academic.oup.com/rssdat/article/1/1/udaf002/8317136">here</a>, or our accessible digest <a href="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/uncertainty.html">here</a>, and join the conversation about building AI tools that truly serve people, not just processes.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the speaker:</dt>
<dd>
<a href="https://delacroix.uk/">Professor Sylvie Delacroix</a>is the Inaugural Jeff Price Chair in Digital Law at Kings College London. She is also the director of the <a href="https://www.kcl.ac.uk/research/centre-for-data-futures">Centre for Data Futures</a> and a visiting professor at Tohoku University. Her research focuses on the role played by habit within ethical agency, the social sustainability of the data ecosystem that makes generative AI possible and bottom-up data empowerment.
</dd>
</dl>
</div>
</div>


</div>
</section>

 ]]></description>
  <category>Interviews</category>
  <category>AI</category>
  <category>Ethics</category>
  <category>Uncertainty</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2026/01/29/beyond-quantification-delacroix-interview.html</guid>
  <pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2026/01/29/images/thumb.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Why Data Quality Is the New Competitive Edge For Data Scientists</title>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2026/01/27/data-qual-is-competitive-edge.html</link>
  <description><![CDATA[ 





<section id="discipline-of-data-science" class="level2">
<h2 class="anchored" data-anchor-id="discipline-of-data-science">Discipline of Data Science</h2>
<p>Data science is a relatively recent field compared to the disciplines that it consists of, namely statistics and computer science. New knowledge and disciplines often arise as combinations of existing fields and data science is no exception. Born into a world with increasingly large datasets, data science combined statistical methods with the tools of computer science to analyze large amounts of data.</p>
<p>As a discipline, data science gained popularity in industry and academia through the 2000s and early 2010s, in part due to an influential paper published in the Harvard Business Review <span class="citation" data-cites="davenport2012sexiest">(1)</span>. This paper carried the provocative title of “Data Scientist: The Sexiest Job of the 21st Century”. After its publication, universities developed data science programs, governments poured funding into Big Data initiatives and organizations built data science teams to tackle problems. Powerful machine learning algorithms such as neural nets, random forests and ensemble methods provided ways to develop complex models that could be used on these large datasets.</p>
<p>As Hoerl noted in an earlier piece <span class="citation" data-cites="hoerl2025future">(2)</span>:</p>
<blockquote class="blockquote">
<p>“There was a hiring rush for data scientists, not just in technology companies, but in virtually all sectors of the economy. For example, GE hired a new Chief Digital Officer from Oracle, Bill Ruh, in 2011. Ruh opened a new Software Center (later renamed “GE Digital”) in San Ramon, California in 2012, and by 2016 had hired 1,400 data scientists there.”</p>
</blockquote>
</section>
<section id="current-state-of-play-with-generative-ai" class="level2">
<h2 class="anchored" data-anchor-id="current-state-of-play-with-generative-ai">Current State of Play with Generative AI</h2>
<p>While text-based data analyses such as NLP (natural language processing) have been around for a while, a new type of approach has quickly become widespread with the rapid rise of generative AI methods. These models, known as LLMs (large language models), have entered the everyday vernacular as all organizations grapple with how to use these tools. Since the debut of ChatGPT in late 2022, LLMs have become increasingly sophisticated, trained on a wider variety of data sources and optimized for a variety of different scenarios.</p>
<p>The accuracy of these models depends on the quality of the training data used. While LLMs are known to hallucinate or give inaccurate answers, they tend to perform better in situations where the training data is precise and exact, with less nuance than language often carries. As a result, LLMs can produce large amounts of computer code, having been trained on large corpora of accurate code. Some specialized tools (e.g.&nbsp;<a href="https://claude.ai/">Claude Code</a>) have been specifically designed to generate accurate code that can access, clean, combine, analyze and visualize data. While LLMs are not perfect in code generation, they can increase the efficiency of an experienced coder.</p>
<p>As a result of these LLMs, less knowledge is required to analyze and work with large datasets. A user can provide a specific prompt on the business question or research objective, upload the relevant data and have an LLM provide a relevant data analysis, complete with the underlying code used to generate that analysis.</p>
</section>
<section id="the-new-data-scientist" class="level2">
<h2 class="anchored" data-anchor-id="the-new-data-scientist">The “New” Data Scientist</h2>
<p>As LLMs improve in their ability to quickly generate vast amounts of accurate code, what does that mean for a discipline which has prided itself on its data wrangling skills? As Davenport and Patil noted, “data scientists’ most basic, universal skill is the ability to write code” <span class="citation" data-cites="davenport2012sexiest">(1)</span>. Within data science, coding has long been seen as a viable career path.</p>
<p>When a skill becomes accessible to a wider variety of people and can be automated, how does one distinguish oneself in an organization? When coding can be done by AI tools, what happens to those who are known for their coding abilities? For a discipline to be recognized as a discipline, it must have some distinguishing characteristic that sets it apart from other disciplines. In addition, disciplines become more valuable and prominent as their contribution to society grows.</p>
<p>So, what are the skills that data scientists have that can’t be done well by AI? While AI capabilities are rapidly increasing, we do believe there are things that may be beyond the reach of AI for the time being.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2026/01/27/images/infographic.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>We believe that the greatest factor limiting AI is data quality. There seems to be a growing consensus on the importance of data quality, as noted in Davenport, Hoerl, and Redman <span class="citation" data-cites="davenport2025unstructured">(3)</span>, Davenport and Tiwari <span class="citation" data-cites="davenport2024generative">(4)</span>, and Redman <span class="citation" data-cites="redman2020source">(5)</span> — as well as <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/data-detectives.html">recent Real World Data Science pieces</a>. As the adage goes, “garbage in, garbage out”. With poor, inaccurate data used in training, the resulting AI output will also be poor and inaccurate. A related fear is that, as AI models start to use AI-generated content as a data source, a recursive loop emerges that degrades the quality of any AI output <span class="citation" data-cites="shumailov2024collapse">(6)</span>.</p>
<p>Discussions of data quality are limited in books, university courses and training programs. When they do occur, they are often restricted to the question of “are the data right?” and to data cleaning. Data cleaning is often focused on eliminating outliers or invalid points. While there is often good reason to remove invalid points, those outliers can sometimes be the source of valuable insights.</p>
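<p>A small illustration of the difference, using invented sensor readings and an assumed sentinel code for missing values: flag suspect points for review rather than silently dropping them, so potential insights are preserved.</p>
<pre class="sourceCode python"><code>import numpy as np
import pandas as pd

# Invented readings; -999.0 is an assumed "missing value" sentinel code.
readings = pd.DataFrame({"sensor_id": [1, 2, 3, 4, 5],
                         "value": [21.3, 22.1, 150.0, 20.8, -999.0]})

# Robust outlier rule (median absolute deviation), used only to *flag*,
# never to delete: an unusual value may be an error, or the insight itself.
median = readings["value"].median()
mad = (readings["value"] - median).abs().median()
readings["flag"] = np.where(
    readings["value"] == -999.0, "invalid_code",
    np.where((readings["value"] - median).abs() > 5 * mad,
             "outlier_review", "ok"),
)
print(readings)
</code></pre>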
<p>However, data quality is much more than data cleaning or checking to make sure the data are accurate. There is an element of contextual understanding and process knowledge that enables the data scientist to properly prepare the data for analysis. We are skeptical of AI’s ability to fully understand context and the nuances of assumptions that go into data analysis. In an earlier piece, Jensen provided some examples of the limitations of AI when it comes to proper data cleaning <span class="citation" data-cites="jensen2024cleaning">(7)</span>. For any set of data, subject matter knowledge of how the data were collected and what they represent is crucial to a proper analysis.</p>
<p>This creates an opportunity for data scientists to become more valuable. By employing probing questions to better understand the context of the data, they will be in a better position to identify data quality issues and ways to improve the data quality, thus leading to better model output.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>As coding becomes easier in an AI-enabled world, where anyone can code and analyze data, the skill set of a data scientist becomes less unique. Data scientists were once in high demand because they set themselves apart as coding wizards who could wrangle large datasets and extract insights. To remain successful and continue to deliver value, data scientists must now pivot their skillset. The real limiting factor in successful data science is data quality. A renewed focus on owning, improving and governing data quality will not only strengthen outcomes but also provide future job security and increase the value data scientists bring to organisations.</p>
<p><em>Note that this article is based on the following paper and contains some of the same ideas: Hoerl, Roger W. 2025. <a href="https://www.tandfonline.com/doi/full/10.1080/08982112.2025.2556222">“The Future of Statistics in an AI Era.”</a> Quality Engineering, published September 10, 2025.</em></p>
<p>You can find out more about the <em>Real World Data Science</em> stance on data quality from our article <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/data-detectives.html">‘Why We Should All Be Data Quality Detectives’</a>.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors:</dt>
<dd>
<a href="https://www.linkedin.com/in/roger-hoerl-6b6b3a5/">Roger Hoerl</a> is Brate-Peschel Professor of Statistics at Union College, after previously heading the Applied Statistics Laboratory at <a href="https://www.ge.com/news/reports/tag/ge%20global%20research">GE Global Research</a> for many years. He has been elected to the International Statistical Institute and the International Academy for Quality, recieved numerous statistic awards, and authored five books in the areas of statistics and business improvements.
</dd>
<dd>
<a href="https://www.linkedin.com/in/willis-jensen-305bba6/">Willis Jensen</a> is data and analytics expert, currently Senior Manager of People Analytics and Business Intelligence at <a href="https://chghealthcare.com/">CHG Healthcare</a>. He is an Adjunct Professor of Statistics at Brigham Young University, <a href="https://willisjensen.substack.com/">writes on Substack</a> and is a member of the Real World Data Science <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">editorial board</a>.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Roger Hoerl and Willis Jensen<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Hoerl, Roger, and Willis Jensen. 2026. “<strong>Why Data Quality Is the New Competitive Edge for Data Scientists</strong>.” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2026/01/27/data-qual-is-competitive-edge.html">URL</a></p>
</div>
</div>
</div>


</div>

</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-davenport2012sexiest" class="csl-entry">
<div class="csl-left-margin">1. </div><div class="csl-right-inline">Davenport TH, Patil DJ. Data scientist: The sexiest job of the 21st century. Harvard Business Review. 2012 Oct;70–6.</div>
</div>
<div id="ref-hoerl2025future" class="csl-entry">
<div class="csl-left-margin">2. </div><div class="csl-right-inline">Hoerl RW. The future of statistics in an AI era. Quality Engineering. 2025. doi:<a href="https://doi.org/10.1080/08982112.2025.2556222">10.1080/08982112.2025.2556222</a></div>
</div>
<div id="ref-davenport2025unstructured" class="csl-entry">
<div class="csl-left-margin">3. </div><div class="csl-right-inline">Davenport TH, Hoerl RW, Redman TC. To create value with AI, improve the quality of your unstructured data. Harvard Business Review [Internet]. 2025 May. Available from: <a href="https://hbr.org/2025/05/to-create-value-with-ai-improve-the-quality-of-your-unstructured-data">https://hbr.org/2025/05/to-create-value-with-ai-improve-the-quality-of-your-unstructured-data</a></div>
</div>
<div id="ref-davenport2024generative" class="csl-entry">
<div class="csl-left-margin">4. </div><div class="csl-right-inline">Davenport TH, Tiwari P. Is your company’s data ready for generative AI? Harvard Business Review [Internet]. 2024 Mar. Available from: <a href="https://hbr.org/2024/03/is-your-companys-data-ready-for-generative-ai">https://hbr.org/2024/03/is-your-companys-data-ready-for-generative-ai</a></div>
</div>
<div id="ref-redman2020source" class="csl-entry">
<div class="csl-left-margin">5. </div><div class="csl-right-inline">Redman TC. To improve data quality, start at the source. Harvard Business Review [Internet]. 2020 Feb. Available from: <a href="https://hbr.org/2020/02/to-improve-data-quality-start-at-the-source">https://hbr.org/2020/02/to-improve-data-quality-start-at-the-source</a></div>
</div>
<div id="ref-shumailov2024collapse" class="csl-entry">
<div class="csl-left-margin">6. </div><div class="csl-right-inline">Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, Gal Y. AI models collapse when trained on recursively generated data. Nature. 2024;631(8022):755–9.</div>
</div>
<div id="ref-jensen2024cleaning" class="csl-entry">
<div class="csl-left-margin">7. </div><div class="csl-right-inline">Jensen WA. Can data cleaning be automated? [Internet]. 2024. Available from: <a href="https://willisjensen.substack.com/p/can-data-cleaning-be-automated">https://willisjensen.substack.com/p/can-data-cleaning-be-automated</a></div>
</div>
</div></section></div> ]]></description>
  <category>Data quality</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2026/01/27/data-qual-is-competitive-edge.html</guid>
  <pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2026/01/27/images/thumb.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Keeping the Science in Data Science</title>
  <dc:creator>Willis Jensen, Fatemeh Torabi, Monnie McGee, Isabel Sassoon</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2025/12/03/scienceindatascience.html</link>
  <description><![CDATA[ 





<p>Have you ever run an elegant ML model that landed flat with those who were supposed to use the insights? Do you find yourself deep into building hundreds of features for your model without knowing exactly what they all mean? Do you spend the bulk of your time tweaking your algorithms while aiming for incremental improvements in accuracy? If so, you might be focused more on the “data” aspects of “data science” than the “science” aspects.</p>
<section id="two-foundational-elements-of-data-science" class="level2">
<h2 class="anchored" data-anchor-id="two-foundational-elements-of-data-science">Two Foundational Elements of Data Science</h2>
<p>“Data Science” contains two essential components, “data” and “science”, and the field requires holding both in equilibrium. Data is the raw material molded in the service of Science: Data may come first in the name, but Science is no less important. Data is the foundation, and Science gives it purpose.</p>
<p>What do we mean by Science? We’re referring specifically to the scientific method as an approach to gaining knowledge: the process of formulating ideas and hypotheses about the world around us and collecting data to determine the validity of those ideas. By hypotheses we don’t mean only strict statistical hypothesis tests, but the general process of formulating a research question, gathering appropriate data and advancing human knowledge, regardless of the statistical techniques or machine learning algorithms employed. Science, at its core, is about using data to gain insight and understanding about the complex universe we inhabit.</p>
<p>The scientific method has a long history and is generally defined in terms of steps such as these, <a href="https://en.wikipedia.org/wiki/Scientific_method">noted in Wikipedia</a> (Scientific Method, 2025):</p>
<ol type="1">
<li><p>Characterizations (observations, definitions, and measurements of the subject of inquiry)</p></li>
<li><p>Hypotheses (theoretical, hypothetical explanations of observations and measurements of the subject)</p></li>
<li><p>Predictions (inductive and deductive reasoning from the hypothesis or theory)</p></li>
<li><p>Experiments (tests of all of the above)</p></li>
</ol>
<p>The relationship between Data and Science is cyclical. Performing good science requires gathering good data, informed by proper experimental design techniques, which in turn require appropriate analysis and interpretation in the context of the science. And good research (science) often generates more questions than it answers, giving way to the need to gather more data, and so on. As such, data science is more than just confirming hypotheses or generating insights; it becomes the application of the scientific method.</p>
<p>The scientific method should be the scaffold supporting what data scientists do. While data scientists come from a variety of backgrounds, many have more training in computer science than in statistical methodology, and more experience with software tools than with executing the scientific method.</p>
<p>In an influential, relevant paper, Shmueli (2010) described two major types of statistical modeling: (i) explanatory models, which attempt to determine causal effects, and (ii) predictive models, which seek accurate predictions. While predictive models can lead to understanding and possibly to explanatory models, explanatory models tend to be preferred by those seeking scientific explanations for phenomena.</p>
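<p>To make the contrast concrete, here is a minimal sketch of the two flavours in scikit-learn (our illustration with a stock dataset, not Shmueli’s example): an interpretable logistic regression whose coefficients invite explanation, alongside a flexible random forest judged mainly on held-out accuracy.</p>
<pre><code># A minimal sketch of Shmueli's explanatory/predictive distinction,
# using scikit-learn (our illustration, not from the paper).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Explanatory flavour: a small, interpretable model whose standardized
# coefficients can be examined (causal claims still need careful design).
explain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
explain.fit(X_train, y_train)
coefs = explain.named_steps["logisticregression"].coef_[0]
print(sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1]))[:3])

# Predictive flavour: a flexible algorithmic model judged on held-out accuracy.
predict = RandomForestClassifier(n_estimators=300, random_state=0)
predict.fit(X_train, y_train)
print("logistic accuracy:", round(explain.score(X_test, y_test), 3))
print("forest accuracy:  ", round(predict.score(X_test, y_test), 3))</code></pre>
<p>Neither model is “right”: the two answer different questions, and the scientific aim should determine which is fitted.</p>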
</section>
<section id="data-modeling-culture-versus-algorithmic-culture" class="level2">
<h2 class="anchored" data-anchor-id="data-modeling-culture-versus-algorithmic-culture">Data Modeling Culture Versus Algorithmic Culture</h2>
<p>Leo Breiman, <a href="https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full">in a famous paper</a>, described two paradigms: (i) the data modelling approach, which assumes a model for the process that generates the data, and (ii) the algorithmic approach, which relies on flexible methods that make no assumptions about an underlying data-generating model. Breiman (2001a) felt that the statistics discipline was missing out on opportunities by focusing on data modelling and making too little use of algorithmic approaches. He practiced what he preached, developing new algorithmic methods and encouraging the field to increase its focus on algorithms. For example, he introduced <a href="https://link.springer.com/article/10.1023/A:1010933404324">Random Forests</a>, starting a cascade of more algorithmic approaches to modelling (Breiman, 2001b).</p>
<p>This wise counsel from Breiman encouraged those working with data to build more expertise in algorithms, promoting the algorithmic culture as a way to harness the power of new computational techniques. We would equate Breiman’s algorithmic approach with a greater focus on the Data side of Data Science, and the data modelling approach with a greater focus on the Science side.</p>
</section>
<section id="balancing-data-and-science" class="level2">
<h2 class="anchored" data-anchor-id="balancing-data-and-science">Balancing Data and Science</h2>
<p>Just as pendulums swing back and forth, the field’s pendulum has swung too hard towards predictive accuracy (the Data side) at the expense of contextual interpretation (the Science side). This swing is evidenced by the growing demand for explainable ML methods (see Alangari et al.&nbsp;2023 for an example). One such method uses Shapley values to elicit and rank the most important features in an ML model (see Rozemberczki et al.&nbsp;2022 for an introduction). It seems ironic that, in the rush to gain accuracy with sophisticated models containing hundreds of features, end users still want something they can understand and explain. In other words, they still want scientific knowledge and an understanding of cause and effect, even for complicated problems.</p>
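<p>For readers who want to see what this looks like in practice, here is a hedged sketch (ours, with an arbitrary public dataset) of computing Shapley values with the <code>shap</code> package and turning them into a global feature ranking:</p>
<pre><code># A sketch of ranking features by Shapley values with the `shap` package
# (our example; any tree ensemble and dataset could be substituted).
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # shape: (n_samples, n_features)

# Mean absolute Shapley value per feature gives a global importance ranking.
importance = abs(shap_values).mean(axis=0)
for name, score in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")</code></pre>
<p>Such a ranking is a starting point for scientific questions about <em>why</em> those features matter, not an end in itself.</p>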
<p>So what is the best approach from a scientific perspective? Throw as many features into a model as you can think of and see which ones turn out to be the most important? Or is there some thought and care that can go into feature selection, considering what might matter given your knowledge of the science behind the problem?</p>
<p>We’re not suggesting that it is bad to include many features in a model; we’re suggesting that the context of the problem can provide insight into which features might matter. Of course, we don’t want to jump to conclusions about what we think is important and miss opportunities to learn. We seek a balance: drawing on previous knowledge and experience without increasing the risk of confirmation bias in the feature selection process.</p>
<p>In software engineering, there is a well-known warning: “premature optimisation is the root of all evil” (see Hyde 2009). The same applies in Data Science. Too often, teams rush to optimise models, tuning hyper-parameters, stacking architectures, and searching for marginal gains, before clearly defining the scientific question or validating whether the data and assumptions are appropriate. This tendency leads to models that are mathematically elegant but scientifically ungrounded. Optimisation should follow understanding, not precede it. A model that captures the right question with moderate accuracy is far more valuable than one that optimises the wrong target to perfection. This limitation of models is reflected in the famous aphorism “All models are wrong, but some are useful,” most commonly associated with the British statistician George Box, who wrote (Box 1976):</p>
<p><em>“Since all models are wrong, the scientist cannot obtain a ‘correct’ one by excessive elaboration. On the contrary, following William of Occam, he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist, so overelaboration and overparameterisation is often the mark of mediocrity.”</em></p>
<p>Is it okay to use black box models where accuracy is paramount and to ignore explainability? Yes, for some problems. But should we use black box models for all problems? No.&nbsp;The key to being a good data scientist/statistician is to recognize when one approach provides more value than another and to use the best approach for the problem at hand.</p>
<p>So how does one give more attention to the Science side of Data Science? It starts with more attention to the question of interest. The type of question matters less than its presence: it might be a research question, a business problem to solve, or something sparked by curiosity. And often it is not a single question but a series of cascading questions, each one digging deeper towards root causes. To manage this complexity effectively, it helps to adopt a modular approach, structuring analytical work into well-defined, interlinked components that mirror the scientific process. Each module focuses on a specific purpose: formulating and refining hypotheses, understanding data provenance and quality, developing and validating models, and translating findings into meaningful actions. Such modularisation keeps the process transparent and iterative, prevents premature optimisation, and ensures that model development remains anchored to the underlying scientific inquiry rather than drifting towards technical over-engineering. With this increased attention, we believe sampling methods and experimental design will remain fundamental.</p>
<p>Here’s one example, loosely based on our work experience. A business executive has reports showing an increase in turnover at their organization, which is driving up hiring and recruiting costs and making the company less profitable. We find that turnover is higher among those newer to the company, which leads to the question: why are these newer employees leaving? This suggests a hypothesis that newer employees are not getting the leadership support they need, which leads to questions about the effectiveness of leadership training programs, which in turn leads to questions about how we measure that effectiveness. By continuing to ask questions, we can target root causes more precisely and thus increase the impact of our work.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/12/03/images/businessexec.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>With the availability of so many algorithms and approaches able to process large amounts of data, it can be tempting to gravitate towards them. When teaching analysis and applied statistics courses, it is important to look beyond the methods and consider the overall aim. Several long-standing frameworks can help in finding the right balance. One example is PPDAC (Problem, Plan, Data, Analysis, Conclusions; MacKay and Oldford 2000), which emphasises all the steps beyond the modelling itself and the importance of considering them all. Using such frameworks can help decide whether a black box approach is suitable in a given situation or whether it won’t achieve the overall intended aim.</p>
</section>
<section id="finding-balance" class="level2">
<h2 class="anchored" data-anchor-id="finding-balance">Finding Balance</h2>
<p>So how do we ensure a good balance between Data and Science in Data Science?</p>
<p>One way is to ask “so what?” of any analysis you do and any model you build. Ideally, you would ask that at the beginning of a project to reduce wasted effort, but at any stage it should be clear how the analysis output will be used. “So that we can publish the output in a paper” is not a sufficient answer. You have to think about the impact of the analysis. Will it change a decision that is being made? Does it create a new insight that can be acted upon? Does it lead to a process improvement or a new product innovation? Does it lead to a new way of running an organization? Data science that doesn’t lead to some action or insight is just computation for computation’s sake. Vance et al.&nbsp;(2022) provide additional resources and advice on how to ask good questions.</p>
<p>A second way is to consider the potential explanations and meaning behind any model. Don’t become so enamored with the predictive accuracy of a model (which isn’t inherently a bad thing) that you stop asking whether there are potential scientific explanations based on the features it uses. Use a predictive model as a starting point for digging deeper and finding a smaller set of features that provide deeper insight into potential causal relationships to explore. Sometimes a simpler model that is easier to “explain”, or one that uses trusted data, provides more value than the latest and greatest algorithm.</p>
<p>A third way is a continuing emphasis on the reproducibility of results. Clean code, documentation of results, version control of analysis code, and open sharing of the code with its underlying assumptions are best practices that help ensure others can replicate the findings of any data science output. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2023/11/06/how-to-open-science.html">Sassoon (2023)</a> provides additional guidance for ensuring reproducibility and transparency of results.</p>
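<p>As a small illustration of these habits (our sketch, not a prescription from any single source), an analysis script can fix its sources of randomness and write a manifest of the computing environment alongside its results, so that others can rerun it under the same conditions:</p>
<pre><code># A sketch of two reproducibility habits: fixing randomness and
# recording the environment alongside the results.
import json
import platform
import random
import sys

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)  # any stochastic step below now reruns identically

# Save a manifest so others can replicate the findings.
manifest = {
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "seed": SEED,
}
with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)</code></pre>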
<p>A final way to find balance is to better understand how the data are generated. Hoerl (2025) touched on this issue in calling for statisticians to focus more on data quality. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/data-detectives.html">We believe this advice to be equally relevant for data scientists.</a> By recognizing the crucial importance of the data generation process, data scientists will be better able to use the right data for the problem of interest and to push for changes as needed to ensure high-quality data.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/12/03/images/balance.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="conclusions" class="level2">
<h2 class="anchored" data-anchor-id="conclusions">Conclusions</h2>
<p>We encourage data scientists to live up to their name and become experts in both Data and Science. As they find the proper balance between the two, their impact and influence will increase, and their rigorous work will stand up to scrutiny and advance human knowledge.</p>
<p><strong>References</strong></p>
<p>Alangari, N., Menai, M. E. B., Mathkour, H., &amp; Almosallam, I. (2023). Exploring evaluation methods for interpretable machine learning: A survey. <em>Information</em>, <em>14</em>(8), 469.</p>
<p>Box, G. E. P. (1976). Science and statistics. <em>Journal of the American Statistical Association</em>, <em>71</em>(356), 791–799. doi:10.1080/01621459.1976.10480949</p>
<p>Breiman, L. (2001a). Statistical modeling: The two cultures (with comments and a rejoinder by the author). <em>Statistical Science</em>, <em>16</em>(3), 199–231.</p>
<p>Breiman, L. (2001b). Random forests. <em>Machine Learning</em>, <em>45</em>, 5–32.</p>
<p>Hoerl, R. W. (2025). The future of statistics in an AI era. <em>Quality Engineering</em>. Advance online publication.</p>
<p>Hyde, R. (2009). The fallacy of premature optimization. <em>Ubiquity</em>, <em>2009</em>(February).</p>
<p>MacKay, R. J., &amp; Oldford, R. W. (2000). Scientific method, statistical method and the speed of light. <em>Statistical Science</em>, <em>15</em>(3), 254–278.</p>
<p>Rozemberczki, B., Watson, L., Bayer, P., Yang, H. T., Kiss, O., Nilsson, S., &amp; Sarkar, R. (2022). The Shapley value in machine learning. In <em>The 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence</em> (pp.&nbsp;5572–5579). International Joint Conferences on Artificial Intelligence Organization.</p>
<p>Sassoon, I. (2023, November 6). How to ‘open science’: A brief guide to principles and practices. <em>Real World Data Science</em>. https://realworlddatascience.net/foundation-frontiers/posts/2023/11/06/how-to-open-science.html</p>
<p>Scientific method. (2025, November). In <em>Wikipedia</em>.</p>
<p>Shmueli, G. (2010). To explain or to predict? <em>Statistical Science</em>, <em>25</em>(3), 289–310.</p>
<p>Vance, E. A., Trumble, I. M., Alzen, J. L., &amp; Smith, H. S. (2022). Asking great questions. <em>Stat</em>, <em>11</em>(1), e471.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
This article was authored by some of our editorial board members. You can find their bios <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">on our team page</a>.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 Willis Jensen, Fatemeh Torabi, Monnie McGee, Isabel Sassoon.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<p>: Jensen, Willis. Torabi, Fatemeh. McGee, Monnie. Sassoon, Isabel. “Keeping the Science in Data Science,” Real World Data Science, December 04, 2025. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/scienceindatascience.html">URL</a></p>
</div>


</div>
</div>
</section>

 ]]></description>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2025/12/03/scienceindatascience.html</guid>
  <pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2025/12/03/images/thumbnail2.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>AI for Social Good: Interview with the Founder of Mike Hudson Foundation</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/MHF-interview.html</link>
  <description><![CDATA[ 





<p>Since the emergence of generative AI such as OpenAI’s ChatGPT at the end of 2022, data science practitioners have watched large language models (LLMs) become both a transformative capability and a source of organisational anxiety. Few people sit closer to this intersection than <strong>Mike Hudson</strong>, a former fintech entrepreneur who now leads initiatives deploying AI for social good through the <a href="https://www.mikehudsonfoundation.org/">Mike Hudson Foundation</a>. MHF’s current projects include <a href="https://www.knowbot.uk/">Knowbot</a>, an LLM-powered tool designed to make complex websites more accessible. Knowbot uses LLMs to read and distill curated information to answer users’ complex questions.</p>
<p>Previously, MHF created <a href="https://www.testramp.org/">TestRAMP</a>, an ambitious effort during the COVID-19 pandemic to mobilise private lab PCR capacity for public benefit which also identified <a href="https://www.smf.co.uk/publications/testramp-marketplace-covid-testing/">potential lessons for future crisis situations</a>.</p>
<p>In this interview, Mike speaks candidly about the challenges and opportunities of deploying AI in nonprofit settings, and the lessons learned from building responsible, high-impact technology during moments of rapid change.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/images/simplequestion.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<section id="the-appetite-for-ai-in-the-third-sector" class="level2">
<h2 class="anchored" data-anchor-id="the-appetite-for-ai-in-the-third-sector">The Appetite for AI in the Third Sector</h2>
<p><strong>Q: What led you to start working on Knowbot in the first place?</strong></p>
<p>I used to be a business entrepreneur with several tech and fintech businesses. That came to an end when I sold those businesses and then, when Covid started, it seemed like there was an opportunity to give something back. I founded TestRAMP, my first nonprofit, and that experience was so enjoyable and so productive. Once the pandemic eased, we had a foundation in place but no clear mission. Then ChatGPT launched, the world got excited about large language models, and we thought: could we use our tech backgrounds and some funding to build something for the social good?</p>
<p>Knowbot grew out of a simple question: Is there an appetite within the world of nonprofits for LLMs — and is there a use case for them? We needed a very simple, low-risk, easy-to-grasp AI use case that could act as a gateway for organisations cautiously exploring LLMs. Knowbot became that gateway.</p>
<p>We pushed it out to some of the non-profits we were already working with and it’s been interesting. There is some appetite for it, and the appetite is increasing. We use Knowbot as a jumping-off point to start a conversation about AI.</p>
<p><strong>Q: Are there patterns in which charities are more open to adopting AI?</strong></p>
<p>The biggest differentiator isn’t size or budget — it’s culture. We are finding it most productive to work with medium-to-large nonprofits with a scientific or research-oriented culture, where the internal decision-makers tend to “get” AI more quickly. Because our single biggest challenge, far and away above any technical or scaling challenges, or anything to do with IT or LLM, has been accessing the right decision-makers inside organisations.</p>
<p>The sector is understandably cautious and we are new kids on the block — Knowbot didn’t exist a year ago. And we have realised that word of mouth recommendations will be key to our growth: credibility has to be built one relationship at a time with the right nonprofit partners.</p>
</section>
<section id="designing-for-maximum-ease" class="level2">
<h2 class="anchored" data-anchor-id="designing-for-maximum-ease">Designing for Maximum Ease</h2>
<p><strong>Q: How did you approach technical deployment?</strong></p>
<p>From day one, we knew getting anything onto a nonprofit’s website would be difficult. So, technically speaking, we’ve made it as easy as we possibly can. Knowbot runs almost entirely on our servers. On a partner’s website, it appears as a JavaScript button in the corner of the user’s screen which, when clicked, loads a small Knowbot window where they can ask questions. Knowbot loads its interface from our servers in Frankfurt. There’s no client-side coding or backend integration required. That was deliberate, to allow a very straightforward deployment, and it should take new partners only a couple of hours to implement. We’ve recently added an even simpler option: some partners now just link to a branded page that looks like their website but actually lives on our servers.</p>
<p>Behind the scenes, Knowbot is written mostly in Python and uses LLMs from Anthropic, OpenAI, Meta, Google, and Perplexity. We don’t develop our own models — few organisations on Earth have the budget for that. Instead, we “build the car around the engine,” and it’s a slightly different car for each non-profit that we work with.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/images/thumbnailsocial.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="ethics-by-design-domain-restriction-as-a-safety-mechanism" class="level2">
<h2 class="anchored" data-anchor-id="ethics-by-design-domain-restriction-as-a-safety-mechanism">Ethics by Design: Domain Restriction as a Safety Mechanism</h2>
<p><strong>Q: As an AI-for-good nonprofit, how do you approach issues of risk, bias, and governance?</strong></p>
<p>With nonprofits, we are hyper-aware of these issues. Some of the organisations we work with handle sensitive information, so we need to be sure we’re not introducing new risks. With a non-profit that is working in healthcare, for example, if there is a possibility that Knowbot may be used for medical-adjacent questions, then we need to think carefully about whether that is something we should be doing and, if so, whether we can do it safely.</p>
<p>One major choice we made early on was domain restriction. Whereas most of the big answer engines out there, like ChatGPT, search the full public internet, Knowbot will only use its internal knowledge and the specific website(s) on which it’s based (i.e.&nbsp;the non-profit’s own website(s)). That means the knowledge is curated and the nonprofit knows exactly what information Knowbot can draw on, which dramatically reduces the risk of hallucination, misinformation, or unsafe advice.</p>
<p>We also adjust our prompts continually based on feedback. For example, we hadn’t anticipated that users would ask questions like “Who are you?” or “What is Knowbot?” Because the model had no context, it responded unpredictably. So we now require partners to include a “What is Knowbot?” page on their site, which Knowbot can reference.</p>
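<p><em>Editor’s note:</em> Knowbot’s code is not public, but the domain restriction Mike describes can be sketched in a few lines of Python. Everything below (names, URLs, structure) is a hypothetical illustration, not Knowbot’s implementation:</p>
<pre><code># Hypothetical sketch of domain-restricted retrieval (not Knowbot's actual code).
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example-nonprofit.org"}  # the partner's own site(s)

def allowed(url: str) -> bool:
    """Keep only passages sourced from the curated allow-list."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def build_context(passages):
    """passages: iterable of (url, text) pairs from a retrieval index."""
    return "\n\n".join(text for url, text in passages if allowed(url))

# Only the filtered context (never the open web) is placed in the LLM
# prompt, so answers can draw solely on curated content.</code></pre>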
</section>
<section id="evolving-with-the-technology" class="level2">
<h2 class="anchored" data-anchor-id="evolving-with-the-technology">Evolving with the Technology</h2>
<p><strong>Q: What technical challenges have you encountered?</strong></p>
<p>I think we have been really lucky in terms of timing. To do what we’re doing now five years ago would have been completely impossible, because the LLMs weren’t there. From an infrastructure perspective, newer cloud hosting tools such as Render now let us deploy servers in minutes. That’s just a breath of fresh air. It takes away a lot of the operational heartache. And coding has become dramatically easier—AI assistance means we can build things now that would have taken us weeks before.</p>
<p>We’ve also been lucky in that LLM technology has improved fast enough that we’ve been able to incorporate new functionality almost as quickly as our nonprofit partners have requested it. For example, we now allow partners to restrict Knowbot to particular sections or topics within a website, or to sit across multiple websites. This has only become practical as models and retrieval tools have matured. Development has become faster, too: for example, we can now submit an entire codebase into our coding LLMs and ask questions about it. By comparison, when ChatGPT first launched we could only submit a small section of a program at a time.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/images/impact.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="impact-what-knowbot-is-changing" class="level2">
<h2 class="anchored" data-anchor-id="impact-what-knowbot-is-changing">Impact: What Knowbot Is Changing</h2>
<p><strong>What impact have you seen so far?</strong></p>
<p>There hasn’t been anybody who has stopped and decided that it’s not for them, which is encouraging. Most of our nonprofit partners have started small — monitoring answers, getting staff comfortable — and have then expanded deployment. Feedback has been invaluable and very positive.</p>
<p>We can measure impact partly by usage volume, but more importantly by value. Getting 1,000 questions about “where is the ice cream stall?” on a venue website is fine, but the real impact is when researchers or decision-makers can extract complex information from, say, a conservation charity or a healthcare resource site. That’s where AI becomes transformational.</p>
<p>We want to add more value with more nonprofits and other for-good actors. We think there are some obvious things that should be done more generally. Whether we do them through Knowbot or whether we flag other suitable tech partners isn’t clear yet, and probably isn’t that important. But, for example, long-term, we would like to see tools like Knowbot become standard on sites like NHS.uk, gov.uk, NICE, and others. These sites hold high-quality knowledge but can be very difficult to navigate due to the sheer volume of information they contain. Traditional on-website search tools simply aren’t up to the job and LLM-based retrieval is a natural upgrade. The sooner these sites start adopting tools such as Knowbot, the better information people are going to get. Whether that means members of the public, or whether it means professionals that are looking for technical advice, the biggest wins come from the right information helping people make valuable decisions.</p>
</section>
<section id="advice-for-nonprofits-considering-llms" class="level2">
<h2 class="anchored" data-anchor-id="advice-for-nonprofits-considering-llms">Advice for Nonprofits Considering LLMs</h2>
<p><strong>What advice would you give nonprofits thinking about adopting LLMs?</strong></p>
<p>Start small. Don’t try to design a global solution from day one. There is so much hype around AI — some justified, some not — that it feels like a high-risk decision. LLM AI is still new and evolving rapidly. Understand that you don’t know what you don’t know, and be prepared to experiment on a small scale, before building out.</p>
<p>Use low-impact tools first. Let different teams build familiarity. Learn what the models can and can’t do in your context. Build internal confidence gradually.</p>
<p>And for practitioners inside nonprofits facing resistance: the conversation is the same one we have externally. Be clear about risks, mitigation, and the fact that this is a learning process. Build trust.</p>
<p><strong>Q: Where do you see the next big opportunity for AI in the public interest?</strong></p>
<p>We’re at the beginning of another new chapter in generative AI. Until now, most LLMs have been about retrieval and synthesis. The next transformational phase is agentic AI: systems that can do things, taking actions autonomously or semi-autonomously. That will be incredibly consequential for society, with new risks and huge potential benefits. Getting that right is absolutely essential. Future technology aside, there remain enormous public interest opportunities even for today’s tech. There is always a lag between new technology and its adoption and implementation.</p>
<p><strong>Q: Anything you’d like readers to know about Knowbot?</strong></p>
<p>Yes! Come and talk to us. Every conversation teaches us something new, whether or not the organisation ends up using Knowbot. We’re eager to collaborate with nonprofits and with tech companies building the next generation of models.</p>
<p><em>Mike Hudson is an entrepreneur in technology &amp; electronic markets. He now uses his expertise to help solve social problems. Mike founded TestRAMP, a pandemic nonprofit social market described as a “major contribution to Covid PCR testing &amp; genomic sequencing”, and donated its £2.4mn profits to charity. Mike is a Fellow of ZSL &amp; adviser to its CEO. He is an honorary Research Fellow at City, University of London. Mike is a member of the Responsible AI Institute. He is a Foundation Fellow at St Antony’s College, University of Oxford.</em></p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Annie Flynn</strong> is Head of Content at the RSS.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 Annie Flynn
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Flynn, Annie. 2025. “AI for Social Good: Interview with the Founder of the Mike Hudson Foundation,” Real World Data Science, November 27, 2025. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/MHF-interview.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/MHF-interview.html</guid>
  <pubDate>Thu, 27 Nov 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/images/thumbnailsocial.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Beyond Quantification: Navigating Uncertainty in Professional AI Systems</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/uncertainty.html</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-note callout-titled" style="margin-top: 0rem;">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="true" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>About the paper and this post
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse show">
<div class="callout-body-container callout-body">
<p><strong>Title:</strong> Beyond Quantification: Navigating Uncertainty in Professional AI Systems</p>
<p><strong>Author(s) and year:</strong> Sylvie Delacroix, Diana Robinson, Umang Bhatt, Jacopo Domenicucci, Jessica Montgomery, Gaël Varoquaux, Carl Henrik Ek, Vincent Fortuin, Yulan He, Tom Diethe, Neill Campbell, Mennatallah El-Assady, Søren Hauberg, Ivana Dusparic and Neil D. Lawrence (2025)</p>
<p><strong>Status:</strong> Published in <em>RSS: Data Science and Artificial Intelligence</em>, open access: <a href="https://academic.oup.com/rssdat/article/1/1/udaf002/8317136">HTML</a></p>
</div>
</div>
</div>
<p>As artificial intelligence systems—especially large language models (LLMs)—become woven into everyday professional practice, they increasingly influence sensitive decisions in healthcare, education, and law. These tools can draft medical notes, comment on student essays, propose legal arguments, and summarise complex documents. But while AI can now answer many questions confidently, professionals know that confidence is not always what matters most.</p>
<p>Consider a doctor who suspects a patient may be experiencing domestic abuse, or a teacher trying to distinguish between a student’s misunderstanding and a culturally shaped interpretation of a text. These are situations where uncertainty isn’t just about missing data—it’s about interpretation, ethics, and human judgment.</p>
<p>Yet much current AI research focuses on quantifying uncertainty: assigning probability scores, confidence levels, or error bars. The authors of this paper argue that while such numbers help in some cases, they miss the forms of uncertainty that truly matter in professional decision-making. If AI systems rely only on numeric confidence, they risk eroding the very expertise they aim to support.</p>
<p>This paper asks a simple but transformative question: <strong>What if uncertainty isn’t always something to quantify, but something to communicate?</strong></p>
<section id="why-quantification-isnt-enough" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="why-quantification-isnt-enough">Why quantification isn’t enough</h2>
<p>The authors highlight a fundamental mismatch between the way today’s AI systems handle uncertainty and the way real professionals experience it. They distinguish between:</p>
<ul>
<li><p>Epistemic uncertainty – when we simply don’t know enough yet (e.g., missing data, incomplete measurements). <em>This can often be quantified.</em></p></li>
<li><p>Hermeneutic uncertainty – when a situation allows multiple legitimate interpretations, often shaped by culture, ethics, or context. <em>This cannot meaningfully be reduced to a percentage.</em></p></li>
</ul>
<div class="column-page">
<p><img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/images/uncertaintykinds.png" class="img-fluid"></p>
</div>
<p>Professional judgment often depends on this second kind. Teachers, doctors, and lawyers rely on tacit skills: subtle perceptions, ethical intuitions, and context-sensitive interpretation. AI systems trained on statistical patterns struggle to reflect this nuance.</p>
<p>When an AI model gives a probability score — “I’m 70% sure this infection is bacterial” — it communicates something useful. But if the real uncertainty stems from ethical or contextual complexity (e.g., whether asking a patient certain questions might put them at risk), probability scores offer a false sense of clarity.</p>
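<p>To make the quantifiable half of this distinction concrete, here is a minimal sketch (ours, not the paper’s): epistemic uncertainty is often estimated from the disagreement among an ensemble of models trained on resampled data. Where the members disagree, more data would help; hermeneutic uncertainty has no analogous statistic.</p>
<pre><code># A minimal sketch of quantifying epistemic uncertainty via ensemble
# disagreement (our illustration, not the paper's method).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
ensemble = BaggingClassifier(LogisticRegression(max_iter=1000),
                             n_estimators=25, random_state=0).fit(X, y)

# Each member's predicted probability of class 1 for a few examples.
probs = np.stack([m.predict_proba(X[:5])[:, 1] for m in ensemble.estimators_])

# High spread across members = high epistemic uncertainty (collect more data);
# low spread with probability near 0.5 looks more like irreducible noise.
print("mean prob:", probs.mean(axis=0).round(2))
print("spread:   ", probs.std(axis=0).round(2))</code></pre>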
<p>The paper gives practical examples:</p>
<ul>
<li><p>A medical AI might be highly confident about symptoms but blind to the social dynamics suggesting abuse.</p></li>
<li><p>An educational AI may accurately flag grammar issues but miss culturally sensitive interpretations in a student essay.</p></li>
</ul>
<p>In both cases, the most important uncertainties are precisely the ones that cannot be captured by numbers.</p>
</section>
<section id="why-this-matters-now" class="level2">
<h2 class="anchored" data-anchor-id="why-this-matters-now">Why this matters now</h2>
<p>The authors warn that the problem becomes even more serious as we move toward agentic AI systems—multiple AI agents interacting and making decisions together. If one system miscommunicates uncertainty, the error may ripple through an entire network.</p>
<p>To address this, the authors propose shifting away from trying to algorithmically “solve” uncertainty, and instead enabling professionals themselves to shape how AI expresses it.</p>
</section>
<section id="takeaways-and-implications" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="takeaways-and-implications">Takeaways and implications</h2>
<p><strong>1. Uncertainty expression is part of professional expertise, not just a technical feature</strong></p>
<p>AI should not simply output probabilities. It should help preserve and enhance the ways professionals reason through complex, ambiguous situations. That means:</p>
<ul>
<li><p>highlighting when interpretation is required</p></li>
<li><p>surfacing multiple plausible perspectives</p></li>
<li><p>signalling when ethical judgment is involved</p></li>
<li><p>encouraging expanded inquiry rather than false certainty</p></li>
</ul>
<p>For example, instead of producing a diagnosis score, an AI assistant might say: “This pattern warrants attention to social context. Consider asking open-ended questions to understand the patient’s circumstances.”</p>
<p>This kind of prompting respects and supports professional judgment.</p>
<p><strong>2. Professionals—not engineers—must define how uncertainty is communicated</strong></p>
<p>The authors propose participatory refinement, a process where communities of practitioners (teachers, doctors, judges, etc.) collectively shape:</p>
<ul>
<li><p>the categories of uncertainty that matter in their field</p></li>
<li><p>the language and formats AI systems should use</p></li>
<li><p>how these systems should behave in ethically sensitive scenarios</p></li>
</ul>
<p>This differs from typical user feedback loops. Instead of individuals clicking “thumbs down,” whole professions deliberate on what kinds of uncertainty an AI system should express and how.</p>
<p><strong>3. This requires new technical and organisational approaches</strong></p>
<div class="column-page">
<p><img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/images/futureai.png" class="img-fluid"></p>
</div>
<p>To make participatory refinement possible, future AI systems need:</p>
<ul>
<li><p>architectures that can incorporate community-defined uncertainty frameworks</p></li>
<li><p>interfaces designed for collective sense-making, not just individual use</p></li>
<li><p>institutional support (e.g., workshops, governance processes, professional committees)</p></li>
</ul>
<p>While this takes more time than simply deploying an AI system “out of the box,” the authors argue that in fields like healthcare or law, these deliberative processes are essential, not optional.</p>
<p><strong>4. Preserving “productive uncertainty” is key for ethical, adaptive professional practice</strong></p>
<p>If AI tools flatten complex uncertainty into simple numbers, they may unintentionally narrow the space for professional judgment and ethical debate. The authors suggest that sustained ambiguity—open questions, competing interpretations, ethical reflection—is not a flaw in human reasoning but a feature of high-quality professional work.</p>
<p>Well-designed AI should help maintain that reflective space, not close it down.</p>
<p><strong>Further reading</strong></p>
<p>For readers interested in exploring more:</p>
<ul>
<li><p>David Spiegelhalter – The Art of Uncertainty (accessible introduction to uncertainty in science)</p></li>
<li><p>Iris Murdoch – The Sovereignty of Good (on moral perception)</p></li>
<li><p>Participatory AI frameworks such as STELA (Bergman et al., 2024)</p></li>
<li><p>Visual analytics research on human-in-the-loop data interpretation</p></li>
<li><p>Discussions of agentic AI systems and coordinated AI in healthcare</p></li>
<li><p>Delacroix’s work on LLMs in ethical and legal decision-making</p></li>
</ul>
</section>
<section id="in-summary" class="level2">
<h2 class="anchored" data-anchor-id="in-summary">In summary</h2>
<p>This paper argues that if AI is to genuinely assist professionals, it must go beyond quantification. Numbers alone cannot capture the ethical, interpretive, and contextual uncertainties that define professional practice. Instead, AI should help preserve and enrich human judgment by communicating uncertainty in ways co-designed with the communities who rely on it. AI should not just be <em>accurate</em> —it should be <em>appropriately uncertain</em>.</p>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>About the author</dt>
<dd>
<strong>Annie Flynn</strong> is Head of Content at the RSS.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>About DataScienceBites</dt>
<dd>
<a href="../../../../../../foundation-frontiers/datasciencebites/index.html"><strong>DataScienceBites</strong></a> is written by graduate students and early career researchers in data science (and related subjects) at universities throughout the world, as well as industry researchers. We publish digestible, engaging summaries of interesting new pre-print and peer-reviewed publications in the data science space, with the goal of making scientific papers more accessible. Find out how to <a href="../../../../../../contributor-docs/datasciencebites.html">become a contributor</a>.
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <guid>https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/uncertainty.html</guid>
  <pubDate>Fri, 21 Nov 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/11/21/images/thumb.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Why We Should All Be Data Quality Detectives</title>
  <dc:creator>A. Rosemary Tate and Roger Halliday</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/data-detectives.html</link>
  <description><![CDATA[ 





<p>At the <a href="https://rss.org.uk/news-publication/news-publications/2025/general-news/president-s-blog-reflections-on-a-record-breaking/">2025 Royal Statistical Society conference</a> in Edinburgh, a lively group of statisticians and data scientists gathered to tackle a quietly critical issue: data quality. Our workshop, titled “Why we should all be data quality detectives”, drew around 40 participants into a dynamic conversation about why data quality is often overlooked and what we can do to change that.</p>
<div id="thumbnail.png" class="quarto-figure quarto-figure-center anchored">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/images/thumbnail.png" class="img-fluid figure-img" alt="The workshop drew 40 participants."></p>
<figcaption>The workshop drew 40 participants.</figcaption>
</figure>
</div>
<section id="the-case-for-data-quality" class="level2">
<h2 class="anchored" data-anchor-id="the-case-for-data-quality">The Case for Data Quality</h2>
<p>If you search for “data quality disasters” on any search engine, you will find many results. Similarly, literature on data quality measures offers abundant advice. But within the scientific research community, data quality is often ignored. For example, how often have you encountered the term “data quality” in the guidelines when submitting or reviewing an academic paper? We would venture to say, hardly ever (or never).</p>
<p>This is puzzling, because high-quality data (i.e., data that is fit for purpose) is essential; without it, results become almost meaningless. Data serves as the foundation of our work. So why isn’t its quality given the prominence it deserves? Why aren’t we, as statisticians and data scientists, advocating data quality more vocally?</p>
<p>Recent publications may shed some light: it appears that “Everyone wants to do the model work, not the data work”<sup>1</sup>, and that statisticians may feel uneasy with elements that are not easily quantifiable<sup>2</sup>. Or perhaps we are all guilty of “premature enumeration” (as Tim Harford puts it), rushing into data analysis without having a good look at the data first. Whatever the case, data quality work or “data cleaning/wrangling” is not seen as fun.</p>
<p>For us, as self-confessed “data quality detectives”, the reverse is true, and we began the workshop by reframing data quality not as a tedious chore, but as an empowering and even enjoyable part of the analytical process. We spend hours looking at the data, enjoying the delayed gratification of finally getting to (trustable) results.</p>
<p>In Rosemary’s case, her attitude was shaped by key experiences early in her statistical career. Her doctoral research focused on developing methods to automatically classify magnetic resonance spectra of human leg adipose tissue based on diet—specifically distinguishing between vegans and omnivores. The study recruited 33 vegans, while the control group included 34 omnivores and 8 vegetarians, primarily staff from the MRI unit at Hammersmith Hospital. With limited experience at the time, she began experimenting with various techniques, starting with k-means cluster analysis. Although she hoped the clusters would reflect dietary groups, the analysis instead produced two distinct clusters—one containing just two spectra and the other containing the rest. After consulting colleagues, she learned that the two outlier spectra had been acquired using a different protocol and were mistakenly included in the dataset. While she might have identified the error later, catching it early saved her several weeks of work — and won her some kudos with colleagues.</p>
<div id="kudos.png" class="quarto-figure quarto-figure-center anchored">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/images/kudos.png" class="img-fluid figure-img" alt="Catching it early saved her several weeks of work."></p>
<figcaption>Catching it early saved her several weeks of work.</figcaption>
</figure>
</div>
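<p>Rosemary’s cluster surprise is easy to re-create. In the toy sketch below (simulated numbers standing in for the spectra), k-means isolates two points generated under a “different protocol” into their own tiny cluster, exactly the kind of result that should send an analyst back to the data’s provenance:</p>
<pre><code># Toy re-creation of the anecdote: k-means exposing two out-of-protocol
# observations as their own tiny cluster (simulated data, not the original spectra).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
spectra = rng.normal(0, 1, size=(73, 20))    # in-protocol "spectra"
outliers = rng.normal(8, 1, size=(2, 20))    # acquired under a different protocol
X = np.vstack([spectra, outliers])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))  # e.g. [73 2]; the tiny cluster is the red flag</code></pre>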
</section>
<section id="detective-work-at-the-tables" class="level2">
<h2 class="anchored" data-anchor-id="detective-work-at-the-tables">Detective Work at the Tables</h2>
<p>During the workshop, we split into six groups to investigate two questions: Why does data quality get overlooked? What strategies can raise its profile?</p>
<p>The discussions were rich and revealing. Many pointed to organisational gaps — no clear strategy, limited training, and confusion over who is responsible for data quality. Others highlighted cultural issues: time pressures, lack of curiosity, and a tendency to assume someone else has already checked the data.</p>
<p>Simple Excel errors are also common. We heard of one case: a study comparing a new, advanced imaging machine with an older model, with the results presented in a spreadsheet containing several measurements. As expected, the correlation matrix showed strong correlations between most columns—except for the first, the main measure of interest. It quickly became apparent that the sort function had been applied to that column in isolation, scrambling the values relative to their rows and rendering them effectively random. Unfortunately, the researcher had not kept a backup of the original data, so the entire experiment was compromised. During the COVID-19 pandemic, a similar technical mistake involving Excel led to <a href="https://www.bbc.co.uk/news/technology-54423988#:~:text=The%20badly%20thought%2Dout%20use,than%20a%20third%2Dparty%20contractor.">thousands of positive cases being omitted from the UK’s official daily figures</a>. These are the kinds of simple issues that could have been caught with a basic data check.</p>
<div id="excel-errors.png" class="quarto-figure quarto-figure-center anchored">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/images/excel-errors.png" class="img-fluid figure-img" alt="Simple Excel errors are also common."></p>
<figcaption>Simple Excel errors are also common.</figcaption>
</figure>
</div>
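<p>A check of this kind takes one line with pandas. In the hypothetical sketch below, scrambling a single column (as a stray sort would) makes its correlations with every other column collapse towards zero:</p>
<pre><code># Hypothetical sketch: a scrambled column betrays itself in the correlation matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200)
df = pd.DataFrame({f"m{i}": base + rng.normal(scale=0.3, size=200) for i in range(4)})

df["m0"] = rng.permutation(df["m0"].to_numpy())  # simulate the stray sort

print(df.corr().round(2))  # m0's row is now near zero: time to find the backup</code></pre>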
<p>Other examples were given of data quality issues arising when datasets were used for a specific research focus, and the quality checks applied were tailored too narrowly to that focus. Additional problems only became apparent when the same data was later used for a different research purpose. Conclusion: you can’t be complacent about the quality of the data you’re using.</p>
</section>
<section id="strategies-for-change" class="level2">
<h2 class="anchored" data-anchor-id="strategies-for-change">Strategies for Change</h2>
<p>The second question sparked even more ideas. Suggestions ranged from embedding data quality education early (even at school level) to implementing cultural changes that lead to greater transparency. Participants called for:</p>
<ul>
<li>Training and upskilling across roles</li>
<li>Transparent reporting of errors and limitations</li>
<li>Positive feedback loops for data collectors</li>
<li>Rewarding quality work and error detection</li>
<li>Modernising systems and improving interoperability</li>
<li>Using AI and automation to support quality checks</li>
<li>Publications including recommendations for more transparent reporting of “initial data analysis” in their guidelines.</li>
</ul>
<p>One standout idea: organisations could promote a “data amnesty” culture where errors can be acknowledged without blame. This is something Roger experienced during his time as Chief Statistician for the Scottish Government. There, he occasionally encountered serious data quality issues that required official statistics to be revised or delayed. Being transparent with users about these issues was a key principle of the Code of Practice for Official Statistics. A conscious effort was made, through training and through the way such situations were handled, to foster a culture of openness and accountability. Staff were supported to create and implement plans to address the problems, learn from them, and communicate clearly with users. This transparency was essential to maintaining trust in both our processes and the statistics we produced.</p>
</section>
<section id="a-call-to-action" class="level2">
<h2 class="anchored" data-anchor-id="a-call-to-action">A Call to Action</h2>
<p>We walked away from the workshop with a clear conclusion: data quality needs a culture shift. It’s not enough to care — we need to prioritise and celebrate the work of those who keep our data trustworthy, while educating other stakeholders about what it involves.</p>
<p>Shaping the next steps will require keeping this conversation going within the data community and Real World Data Science can play an integral role in that. As a direct result of this piece, we have <a href="https://realworlddatascience.net/contributor-docs/call-for-contributions.html">updated our submission guidelines</a> to include recommendations for transparent data reporting and we would like to publish more stories of data disasters – or disasters averted through careful attention to data quality.</p>
<p>As one attendee put it, “We need to challenge the data and learn best practice from the get-go.” It’s time to embrace our inner data detectives; the integrity of our insights depends on it.</p>
<p>Please share your own data disaster stories in the comments, or in the <a href="mailto:rwds@rss.org.uk">Real World Data Science inbox</a>.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Rosemary Tate</strong> is a Chartered Biostatistician and Computer Scientist with over 30 years of experience in medical research and statistical consulting. She has a BSc in Mathematics, a DPhil in Computer Science and AI, and an MSc in Medical Statistics. She has been scientific manager of a large EU-funded project and held lectureships at the Institutes of Child Health and Psychiatry. An independent statistical consultant since 2016, she now spends most of her time as a “Data Quality Agent Provocateur”.
</dd>
<dd>
<strong>Roger Halliday</strong> is CEO at Research Data Scotland, providing leadership to improve public wellbeing by transforming how data is used in research, innovation and insight. Roger was Scotland’s Chief Statistician from 2011 to 2022. During that time he was also Scottish Government Chief Data Officer (2017–20), and jointly led the Scottish Government Covid Analytical Team during the pandemic. Before that, he worked at the Department of Health in England as a policy analyst managing evidence for decision making across NHS issues. He became an honorary Professor at the University of Glasgow in 2019.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 A. Rosemary Tate and Roger Halliday
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tate, Rosemary A. and Halliday, Roger. 2025. “Why We Should All Be Data Quality Detectives.” Real World Data Science, October 30, 2025. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/10/31/data-detectives.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. Everyone wants to do the model work, not the data work: Data cascades in High-Stakes AI. In proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.↩︎</p></li>
<li id="fn2"><p>Thomas Redman and Roger Hoerl. Data quality and statistics: Perfect together? Quality Engineering, 35(1):152–159, 2023.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/data-detectives.html</guid>
  <pubDate>Thu, 30 Oct 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2025/10/30/images/thumbnail.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Code, Calculate, Change - How Statistics Fuels AI’s Real World Impact: EICC Live</title>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2025/09/17/EICC_Live.html</link>
  <description><![CDATA[ 





<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/CeZpkZzWcuo" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Artificial Intelligence (AI) is transforming how we live, work, and make decisions every day – from the content we see on social media to how we’re hired, how we navigate to work, and how spam is filtered from our inboxes. But what exactly is AI? How does it work, where did it come from, and where is it taking us?</p>
<p>Dr Sophie Carr, chair of the <em>Real World Data Science</em> <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">board of editors</a>, was joined by a panel of expert speakers for a public lecture in Edinburgh at the beginning of the month. <em>Real World Data Science</em> <a href="https://realworlddatascience.net/the-pulse/posts/2025/07/28/NHS-foundation-AI.html">contributor</a> Will Browne delivered “a hitchhiker’s guide to the history of AI”, taking us from the first ever algorithm (coded by “poetical scientist” Ada Lovelace) to today’s large language models, via a counting horse and a US naval invention.</p>
<p>Parwez Diloo, a data scientist at <a href="https://baysconsulting.co.uk/">Bays Consulting</a>, talked about how to balance technology with a human touch in recruitment processes (and the difference between maths and magic!).</p>
<p>And Amy Wilson, a lecturer in industrial mathematics at the University of Edinburgh, spoke about graphical modelling for decision-making in criminal contexts, touching on the failures of probabilistic reasoning in high-profile legal cases like those of Lucy Letby and Amanda Knox.</p>
<p>The talk was rounded off by a lively Q&amp;A session which covered the viability of AI-designed graphical models, remedies for inappropriate uses of AI, and collective action we can take to ensure AI bias does not entrench existing inequalities.</p>
<p>This talk was part of the <a href="https://www.eicc.co.uk/eicc-live/">EICC Live</a> programme, a series of free public talks held by the <a href="https://www.eicc.co.uk/">EICC</a> as part of a commitment to community engagement and quality education. It took place during the RSS 2025 International Conference and was filmed by EICC; it is published here with thanks to them.</p>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../foundations-frontiers/index.qmd">Back to Foundations &amp; Frontiers</a></p>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">

</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 EICC
</dd>
</dl>
</div>


</div>
</div>

 ]]></description>
  <category>AI</category>
  <category>Communication</category>
  <category>Skills</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2025/09/17/EICC_Live.html</guid>
  <pubDate>Wed, 17 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2025/09/17/images/ELgraphic.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>All Creatures, Great, Small, and Artificial</title>
  <dc:creator>Robyn Lowe and Edward Rochead</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2025/08/26/new veterinary medicine.html</link>
  <description><![CDATA[ 





<p>This article had its genesis when co-author Ed’s dog, Sparkle, was treated for pneumonia in the summer of 2024. Ed, a mathematician and chair of the <a href="https://alliancefordatascienceprofessionals.com/">Alliance for Data Science Professionals</a>, was intrigued by the surgery’s use of data in Sparkle’s treatment and decided to find out more about the use of data and AI in veterinary medicine. His exploration led to a guest appearance on the <a href="https://www.vetvoices.co.uk/podcasts">Vet Voices on Air</a> podcast hosted by co-author Robyn. She is a registered veterinary nurse (RVN) and the director of <a href="https://www.vetvoices.co.uk/">Veterinary Voices UK</a>. Inspired by that conversation, this article explores the ways veterinary professionals are currently applying data science principles and how professions adapt and evolve in the face of these developments.</p>
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Sparkle. Credit: Edward Rochead.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/08/26/images/Sparkle.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Sparkle. Credit: Edward Rochead.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Sparkle. Credit: Edward Rochead
</figcaption>
</figure>
</div>
<p>The use of AI, and of the data science that underpins and enables it, is increasingly ubiquitous, and one area embracing these approaches is veterinary medicine.</p>
<p>Unlike human medicine in the UK, veterinary medicine is not organised under a centralised system such as the National Health Service (NHS). Instead, veterinary care is delivered through a variety of business structures, including joint venture, independent, corporate, and charity practices. These structures differ not only in ownership and funding but also in the scope of services that the practices provide. In many cases, the availability of more specialised care may depend on the expertise of individuals within the practice. Broadly speaking, practices tend to cover farm animals, exotics, equine, small animals, or a mix of these. Some practices will also take on zoological, conservation and invertebrate work, among other specialties.</p>
<p>Veterinary surgeons and RVNs are also employed in academia, in applied research in industry or government, and as advisors in government agencies.</p>
<section id="data-in-the-veterinary-profession-challenges-and-opportunities" class="level2">
<h2 class="anchored" data-anchor-id="data-in-the-veterinary-profession-challenges-and-opportunities">Data in the Veterinary Profession: Challenges and Opportunities</h2>
<p>If Artificial Intelligence (AI) is to be used in any sphere, it needs to be trained on data, and that data should be relevant, complete, structured, accurate, consistently formatted, and labelled. Achieving this standard is a challenge not only in veterinary medicine but also in many other fields where data are fragmented and inconsistently recorded. Unlike centralised NHS data, veterinary data are often stored in individual practices or farms. These may use different formats and scales (such as imperial or metric), US or UK date formats, and twelve- or twenty-four-hour clocks. Records may also fail to follow the animal if it is sold or moves to a new practice. Such inconsistencies mirror the difficulties faced in other domains, and can make the adoption of AI in veterinary medicine particularly complex.</p>
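<p>To make the harmonisation problem concrete, here is a minimal Python sketch, with invented field names, of the kind of normalisation a single practice-level record might need before it could be pooled for training. It is illustrative only; real records vary far more than this.</p>
<pre><code class="language-python">from datetime import datetime

LB_TO_KG = 0.45359237

def normalise_record(record):
    """Harmonise one record to metric units and ISO dates.
    Field names are hypothetical, for illustration only."""
    weight = record["weight_value"]
    if record["weight_unit"] == "lb":  # imperial to metric
        weight *= LB_TO_KG
    # Practices may record dates month-first (US) or day-first (UK).
    fmt = "%m/%d/%Y" if record["date_style"] == "US" else "%d/%m/%Y"
    visit = datetime.strptime(record["visit_date"], fmt).date()
    return {"weight_kg": round(weight, 2), "visit_date": visit.isoformat()}

print(normalise_record({"weight_value": 22.0, "weight_unit": "lb",
                        "visit_date": "08/26/2025", "date_style": "US"}))
# {'weight_kg': 9.98, 'visit_date': '2025-08-26'}
</code></pre>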
<p>On the other hand, animal data has fewer constraints than human data. Article 4 of the <a href="https://www.gov.uk/data-protection">UK General Data Protection Regulation</a> (GDPR) makes it clear that the regulation applies to ‘personal data’ and specifies that ‘an identifiable natural person is one who can be identified’, which means that there is potentially more freedom to use data related to animals than to humans. (It is worth noting that the GDPR would apply to the farmer, pet owner, or veterinary staff involved, so some consideration might still be required.) Given that this data is an asset, it is worth considering whether it is owned by the animal’s owner or the veterinary professional (or their employer) in any given circumstance.</p>
</section>
<section id="how-ai-is-already-transforming-veterinary-practice" class="level2">
<h2 class="anchored" data-anchor-id="how-ai-is-already-transforming-veterinary-practice">How AI is Already Transforming Veterinary Practice</h2>
<p>AI is becoming an affordable and widely used tool in veterinary medicine. It’s now commonly applied in areas like diagnostics, treatment, and disease monitoring and prediction, despite the misconception that it’s rarely used. Preventative healthcare has always been a key aim within veterinary medicine. The obligation to ensure that both animal health and welfare and public health are accounted for is reflected by point 6.1 of the <a href="https://www.rcvs.org.uk/setting-standards/advice-and-guidance/code-of-professional-conduct-for-veterinary-surgeons/#public">Code of Professional Conduct for Veterinary Surgeons</a> and <a href="https://www.rcvs.org.uk/setting-standards/advice-and-guidance/code-of-professional-conduct-for-veterinary-nurses/#public">RVNs</a>: ‘6.1 Veterinary surgeons must seek to ensure the protection of public health and animal health and welfare’.</p>
<p><strong>Diagnostics</strong><br>
Diagnosis and prediction of diseases is one key area where AI is being used in veterinary medicine in farm animals, companion animals and beyond.</p>
<p>For example, in companion animals AI has been used to assist in the diagnosis of canine <a href="https://pubmed.ncbi.nlm.nih.gov/32006871/">hypoadrenocorticism</a>, an endocrine disease. Machine learning algorithms also show potential for improving the <a href="https://pubmed.ncbi.nlm.nih.gov/40440642/">prediction and diagnosis</a> of leptospirosis, an infectious zoonotic disease. By combining MRI data with facial image analysis, an AI tool can assist in predicting the likelihood of <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/jvim.15621">Chiari-like malformation (CM) and syringomyelia (SM)</a> from images of the dog’s head obtained via an owner’s smartphone. AI can also assist with faecal analysis: images are analysed by proprietary <a href="https://dugganvet.ie/ovacyte/">Artificial Intelligence models</a> which reference them against the reference library of Telenostic, a company specialising in parasitology diagnostic solutions. The image recognition software identifies each specific parasite species and the number of parasitic eggs or oocysts present.</p>
<p>These are just a few current examples of AI use in companion animals.</p>
<p><strong>Disease Monitoring and Prediction</strong><br>
Disease monitoring and prediction are exciting because they can help us act earlier—sometimes even preventing illness. This not only improves animal health and welfare, but also supports <a href="https://www.skeptic.org.uk/2024/05/agr-tech-will-technology-help-or-hinder-food-production-and-animal-welfare/">antimicrobial stewardship</a> by reducing unnecessary treatments, helping to combat antimicrobial resistance—a serious global threat to both animals and humans.</p>
<p>An area that demonstrates compelling evidence of these positive outcomes is <a href="https://www.skeptic.org.uk/2024/05/agr-tech-will-technology-help-or-hinder-food-production-and-animal-welfare/">farming and agriculture</a>, where farmers are able to use AI to monitor herds and act promptly to treat disease before it would be evident to human monitoring. Examples, which will be explored in more detail below, include body condition technology, lameness technology, disease recognition, grazing, land and pasture management, and biosensors and biochips, among others.</p>
<div id="fig-2" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Technology that measures body condition each time the cow passes under the camera, reporting the changes in Body Condition Score directly to the farmer via app and online portal, helping to support individual cow treatment, group rationing and herd management.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/08/26/images/herdvision1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Technology that measures body condition each time the cow passes under the camera, reporting the changes in Body Condition Score directly to the farmer via app and online portal, helping to support individual cow treatment, group rationing and herd management.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Technology that measures body condition each time the cow passes under the camera, reporting the changes in Body Condition Score directly to the farmer via app and online portal, helping to support individual cow treatment, group rationing and herd management.
</figcaption>
</figure>
</div>
<p><strong>Body Condition Technology</strong><br>
The agricultural industry typically relies on subjective visual observation, human recording and manual reporting of all the key health and welfare traits, including Body Condition Score (BCS). Although the individuals doing this are highly skilled professionals, inevitable human error, paired with the constraints of busy farm management, can mean cases get picked up later in the disease process. BCS is a major indicator of metabolic performance in dairy cows and is directly related to fertility performance and health traits. Technologies such as <a href="https://herd.vision/">Herdvision</a> use a 2D and 3D camera system to monitor BCS, resulting in improvements in cattle health and fertility, less premature culling, and savings on feeding costs.</p>
<p><strong>Lameness Technology</strong><br>
Lameness is considered one of the <a href="https://www.frontiersin.org/articles/10.3389/fvets.2019.00094/full">top cattle health and welfare challenges</a>. A <a href="https://www.sciencedirect.com/science/article/abs/pii/S1871141313001698">2013 study</a> noted that almost 70% of the dairy farmers expressed an intention to take action for improving dairy cow foot health. Cattle naturally mask the signs of pain, and as with body condition scoring we have relied on subjective visual observation, human recording and manual reporting of all the key health and welfare traits. Technology that can pick up lameness earlier, with more objectivity and with less labour intensity is hugely beneficial to both the animals’ health and welfare and the farm’s profitability.</p>
<div id="fig-3" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Images that produce prioritisation list for vets and hoof trimmers, ranking cows according to severity of immobility and identifying small changes in mobility and BCS before they are visible to the human eye.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2025/08/26/images/herdvision2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Images that produce prioritisation list for vets and hoof trimmers, ranking cows according to severity of immobility and identifying small changes in mobility and BCS before they are visible to the human eye.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: Images that produce prioritisation list for vets and hoof trimmers, ranking cows according to severity of immobility and identifying small changes in mobility and BCS before they are visible to the human eye.
</figcaption>
</figure>
</div>
<p><strong>Disease Recognition</strong><br>
As with lameness assessments, the monitoring of pain and detection of disease in the UK pig industry rely on human observation, either in person or via video footage.</p>
<p>An interdisciplinary team at Newcastle University has <a href="https://www.ukri.org/who-we-are/how-we-are-doing/research-outcomes-and-impact/bbsrc/ai-based-monitoring-aids-on-farm-disease-detection/">used artificial intelligence to develop automated systems</a> to analyse and monitor pig behaviour and health. The algorithm was tested in a controlled environment where infection and disease were present, assessing footage of pigs captured by cameras and pinpointing and quantifying changes in behaviours to identify links to disease.</p>
<p>Other computer vision and AI-based approaches have allowed the <a href="https://www.sciencedirect.com/science/article/abs/pii/S0168169920300673">automatic scoring</a> of pigs in relation to posture, aggressive episodes, tail-biting episodes, fouling, diarrhoea, stress prediction in piglets, weight estimation, and body size – all giving farmers greater insight into the health of their animals.</p>
<p><strong>Grazing, Land and Pasture Management</strong><br>
AI has enabled more efficient pasture and grazing management, allowing livestock to be moved onto new pastures when grazing quality and quantity fall below a certain threshold.</p>
<p>There are numerous methods of using Agri-Tech to monitor animals, such as the <a href="https://www.mdpi.com/1424-8220/19/3/603">SheepIT</a> project, an initiative where an automated IoT-based system controls grazing sheep. Typically, such solutions are split into two main groups: location monitoring and behaviour and activity monitoring. Location monitoring allows farmers to keep track of animals, inferring preferred pasturing areas and grazing times, and even detecting absent animals. Behaviour and activity monitoring focuses on detecting the type and duration of an animal’s activities – for example resting, eating or running - based on accelerometry and audiometry.</p>
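<p>As a sketch of how the behaviour and activity monitoring described above typically works (a generic illustration, not the SheepIT system itself): the accelerometer signal is cut into short windows, simple summary features are computed per window, and a classifier maps each window to an activity. The thresholds below are invented; real systems learn them from labelled data.</p>
<pre><code class="language-python">import numpy as np

def window_features(accel, window=50):
    """Split a 1-D acceleration-magnitude signal into fixed windows
    and compute simple per-window features (mean and variance)."""
    n = len(accel) // window
    chunks = accel[: n * window].reshape(n, window)
    return np.column_stack([chunks.mean(axis=1), chunks.var(axis=1)])

def classify(features):
    """Toy rule-based classifier: low movement variance suggests
    resting, medium suggests eating, high suggests running.
    Thresholds are hypothetical, for illustration only."""
    labels = []
    for _mean, var in features:
        if var &lt; 0.05:
            labels.append("resting")
        elif var &lt; 0.5:
            labels.append("eating")
        else:
            labels.append("running")
    return labels

signal = np.abs(np.random.default_rng(0).normal(1.0, 0.3, size=500))
print(classify(window_features(signal))[:3])
</code></pre>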
<p><strong>Biosensors and Biochips</strong><br>
In human medicine, advances in molecular medicine and cell biology have <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3270855/">driven interest in electrochemical systems to detect disease biomarkers</a> and therapeutic compounds (medications, for example). In the human literature, implantable biosensors have been reported for <a href="https://www.sciencedirect.com/science/article/abs/pii/S0956566305003544">glucose monitoring</a>, <a href="https://ieeexplore.ieee.org/abstract/document/4118162/">DNA detection</a> and <a href="https://www.sciencedirect.com/science/article/abs/pii/S0003269708007264">cultures</a>, among others. Microelectronic technology offers powerful circuits and systems for building innovative, miniaturised biochips that sense at the molecular level. These have numerous applications in veterinary medicine, from hormone detection to monitoring of pathogenic microorganisms and infections to surveillance of homeostatic mechanisms (homeostasis being the body’s regulatory processes that control many functions and maintain stability), such as <a href="https://www.anl.gov/article/biochips-to-investigate-cattle-disease-win-entrepreneurial-challenge">pathogen detection</a> in cattle mastitis.</p>
<p>Paul Horwood, Farm Vet and Founder of <a href="https://www.bsas.org.uk/events/ailive/">AI(Live)</a>, a conference on the development of AI applications in the livestock industry, sees this as a time of opportunity for the profession:</p>
<p><em>“The farm vet’s role continues to evolve from “problem-solver after the fact” to “strategic advisor at the heart of herd health planning.” Technology is helping us get there by giving us earlier insights, better data, and stronger evidence for the decisions we make every day. We’re at a pivotal moment. The technology is here. The challenge is knowing how to use it and how to lead with it. As a farming nation, we have always been innovative; as farm animal veterinary surgeons, we can either wait to be brought in at the end of the conversation, or step forward now to shape how AI is used on UK farms. Let’s choose the latter.”</em></p>
<p><strong>Shared Frontiers: Common Threads in AI Adoption Across Sectors</strong></p>
<p>The veterinary sector, like every other industry, is on a journey when it comes to the use of artificial intelligence, and many of the themes that are emerging are common to other sectors.</p>
<p>For veterinary professionals this includes:</p>
<ul>
<li>The need to radically change training of new vets and RVNs to ensure that they are prepared to embrace the new opportunities that AI will bring.<br>
</li>
<li>The need to upskill existing vets and RVNs to enable them to use these new opportunities.</li>
<li>Working with stakeholders, such as, in this case, farmers and pet owners, to evolve the business model to ensure that all parties benefit from the change.</li>
<li>A change in the attitude to data, in which it becomes seen as a business asset when it is well managed, with the ultimate benefit in this sector of promoting the wellbeing of animals.</li>
</ul>
<p>These recurring patterns offer a blueprint for understanding how professions evolve in response to developments in the field, and a reminder that AI isn’t just transforming high-tech labs and Fortune 500 boardrooms – it is quietly revolutionising industries across every sector. By looking at how specific professions, like veterinary medicine, are navigating this shift, we can better understand the broader dynamics at play when machine learning meets existing practice.</p>
</section>
<section id="bridging-disciplines-unlocking-value-through-interdisciplinary-collaboration" class="level2">
<h2 class="anchored" data-anchor-id="bridging-disciplines-unlocking-value-through-interdisciplinary-collaboration">Bridging Disciplines: Unlocking Value Through Interdisciplinary Collaboration</h2>
<p>This is a pivotal moment where the intersection of data science and veterinary medicine offers a unique opportunity for cross-sector collaboration, driving progress in both fields.</p>
<p>The field of data science has much to offer industries currently experiencing these inflection points. Although many data scientists come from ‘traditional’ backgrounds such as statistics, mathematics, or computer science, many more diverse routes into data science roles now exist. These routes include people in other professions who use data science in their working lives without necessarily calling themselves data scientists, having upskilled through training or even trial and error. The authors are already aware of veterinary professionals who are skilled data scientists, even if they may not identify as such, applying data science to veterinary research in academia or industry. The RSS, other professional bodies within the Alliance for Data Science Professionals, and data science departments in universities may find it worthwhile to offer Continuing Professional Development opportunities to the veterinary profession. Certainly, the RSS’s Data Science Practitioner and Advanced Data Science Practitioner certifications are open to veterinary professionals who have developed such skills.</p>
<p>The veterinary profession may also benefit the data science community, by providing data sets that can be applied in many ways without major GDPR issues, as well as opportunities to showcase the benefits of data science to society and to animal health and welfare through examples similar to those above.</p>
<p>One vehicle for more cross-pollination could be joint conferences. A near-term opportunity is <a href="https://www.ailive.farm/">AI (Live)</a> in September 2025, which aims to start the debate and establish the principles by which AI and livestock farming can together derive the maximum benefit, with a focus on education, governance and application.</p>
<p>By fostering collaboration across disciplines, we can ensure that the benefits of this data revolution are shared—by all creatures, great, small, and artificial. And we are happy to report that Sparkle, whose illness sparked this article, has made a full recovery and in fact recently celebrated her ninth birthday!</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<a href="https://www.linkedin.com/in/robyn-lowe-7274a596/"><strong>Robyn Lowe, BSc (Hons), Dip AVN (Surgery, Medicine, Anaesthesia), Dip HE CVN, RVN</strong></a>, is a registered veterinary nurse and Director of <a href="https://www.vetvoices.co.uk/">Veterinary Voices UK</a>, a community of veterinary professionals fostering public understanding of veterinary and animal welfare issues. She hosts the organisation’s <a href="https://open.spotify.com/show/2DcdmAMJrwRf2RdgUPcYCP">Vet Voices on Air</a> podcast.<br>

</dd>
<dd>
<a href="https://www.linkedin.com/in/prof-edward-r-17768847/"><strong>Professor Edward Rochead, M.Math (Hons), PGDip, CMath, FIMA</strong></a> is a mathematician employed by the government, currently leading work on STEM Skills and Data. Ed is chair of the Alliance for Data Science Professionals, a Visiting Professor at Loughborough University, an Honorary Professor at the University of Birmingham, Chartered Mathematician, and Fellow of the IMA and RSA.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 Robyn Lowe and Edward Rochead.
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> Text, code, and figures are licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>, except where otherwise noted. Thumbnail image by <a href="https://www.shutterstock.com/image-photo/cattle-cow-animal-farm-veterinary-agriculture-1463752661">Shutterstock/g/fotopanorama360</a> <a href="https://creativecommons.org/licenses/by/4.0/">Licenced by CC-BY 4.0</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Lowe, Robyn and Rochead, Edward. 2025. “All Creatures, Great, Small, and Artificial.” Real World Data Science, August 22, 2025. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/08/22/veterinary-medicine.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>AI</category>
  <category>Algorithms</category>
  <category>Machine Learning</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2025/08/26/new veterinary medicine.html</guid>
  <pubDate>Tue, 26 Aug 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2025/08/26/images/vet.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>RSS: Data Science and Artificial Intelligence - showcase your research</title>
  <link>https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/DSAI-journal.html</link>
  <description><![CDATA[ 





<p><img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/images/RSS-DSAI-Logo-blue.png" class="img-fluid" style="width:80.0%" alt="RSS Data Science and AI logo"><br>
</p>
<p><em>RSS: Data Science and Artificial Intelligence</em> provides a new forum for research of interest to a broad readership, spanning the data science fields. Created in recognition of the growing importance of data science and artificial intelligence in science and society, the new journal aims to fill the need for a venue that truly spans the relevant fields.</p>
<div class="img-float">
<p><img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/images/RSS-DSAI-cover.jpg" class="img-fluid" style="float: left; margin-right: 25px;;width:25.0%"></p>
</div>
<p>This new open access journal joins the RSS family of world class statistics journals and is published by Oxford University Press.</p>
<section id="scope-and-type-of-papers" class="level2">
<h2 class="anchored" data-anchor-id="scope-and-type-of-papers">Scope and type of papers</h2>
<p><em>RSS: Data Science and Artificial Intelligence</em> is seeking high-quality papers from across the breadth of these disciplines, which encompass statistics, machine learning, deep learning, econometrics, bioinformatics, engineering, computational social sciences and beyond.</p>
<p>As well as three primary paper types – method papers, applications papers and behind-the-scenes papers – <em>RSS: Data Science and Artificial Intelligence</em> will publish editorials, op-eds, interviews, and reviews/perspectives, in line with its goal to become a primary destination for data scientists.</p>
</section>
<section id="why-publish" class="level2">
<h2 class="anchored" data-anchor-id="why-publish">Why Publish?</h2>
<p><em>RSS: Data Science and Artificial Intelligence</em> offers an exciting open access venue for your work, with a broad reach and peer review overseen by editors esteemed in their fields. Discover more about <a href="https://academic.oup.com/rssdat/pages/why-publish" target="_blank">why the new journal is the ideal platform for showcasing your research</a>.</p>
</section>
<section id="submit-a-paper" class="level2">
<h2 class="anchored" data-anchor-id="submit-a-paper">Submit a paper</h2>
<p>Find out how to <a href="https://academic.oup.com/jrsssa/pages/general-instructions" target="_blank">prepare your manuscript</a> for submission and visit our submission site to <a href="https://mc.manuscriptcentral.com/rssdat" target="_blank">submit your paper</a>.</p>
<div class="keyline">
<hr>
</div>
</section>
<section id="editors" class="level2">
<h2 class="anchored" data-anchor-id="editors">Editors</h2>
<p>&nbsp;</p>
<div class="grid">
<div class="g-col-12 g-col-md-4">
<p><img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/images/Mukherjee_Sach.jpg" class="img-fluid" alt="Photo of Mukherjee, Director of Research in Machine Learning for Biomedicine at the MRC"></p>
<p><strong>Sach Mukherjee</strong> is Director of Research in Machine Learning for Biomedicine at the Medical Research Council (MRC) Biostatistics Unit, University of Cambridge, and Head of Statistics and Machine Learning at the German Center for Neurodegenerative Diseases.</p>
</div>
<div class="g-col-12 g-col-md-4">
<p><img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/images/silvia-chiappa.jpeg" class="img-fluid" alt="Silvia Chiappa, Research Scientist at Google DeepMind"></p>
<p><strong>Silvia Chiappa</strong> is a Research Scientist at <a href="https://deepmind.com/" target="_blank">Google DeepMind</a> London, where she leads the Causal Intelligence team, and Honorary Professor at the <a href="https://www.ucl.ac.uk/computer-science/" target="_blank">Computer Science Department</a> of University College London.</p>
</div>
<div class="g-col-12 g-col-md-4">
<p><img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/images/neil-lawrence.png" class="img-fluid" alt="Neil Lawrenece, DeepMind Professor of Machine Learning at the University of Cambridge"></p>
<p><strong>Neil Lawrenece</strong> is the inaugural DeepMind Professor of Machine Learning at the University of Cambridge. He has been working on machine learning models for over 20 years. He recently returned to academia after three years as Director of Machine Learning at Amazon.</p>
</div>
</div>
<p><br>
</p>
<p><strong>View the full editorial board here:</strong> <a href="https://academic.oup.com/rssdat/pages/editorial-board" target="_blank">Editorial Board | RSS Data Science | Oxford Academic (oup.com)</a></p>
</section>
<section id="open-access" class="level2">
<h2 class="anchored" data-anchor-id="open-access">Open Access</h2>
<p><em>RSS: Data Science and Artificial Intelligence</em> is fully open access (OA) and is published by Oxford University Press (OUP). Your research will be free to read and can be accessed globally. An OA license increases the visibility of your research and creates more opportunities for fellow researchers to read, share, cite, and build upon your findings.</p>
<p>The cost of publishing Open Access may be covered under a Read and Publish agreement between OUP and the corresponding author’s institution. <a href="https://academic.oup.com/pages/open-research/read-and-publish-agreements/participating-journals-and-institutions" target="_blank">Find out if your institution is participating</a>. Members of the Royal Statistical Society can submit papers at a reduced cost.</p>
<p>Explore the journal’s website now: <a href="https://www.academic.oup.com/rssdat" target="_blank">www.academic.oup.com/rssdat</a></p>
<div class="article-btn">
<p><a href="../../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">

</div>
</div>
</div>


</section>

 ]]></description>
  <category>AI</category>
  <category>Data Science</category>
  <category>Machine learning</category>
  <category>Deep learning</category>
  <category>Econometrics</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/DSAI-journal.html</guid>
  <pubDate>Wed, 05 Feb 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2025/02/05/images/RSS-DS-AI-cover.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>The machine learning victories at the 2024 Nobel Prize Awards and how to explain them</title>
  <dc:creator>Anna Demming</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/10/31/machine-learning-nobel-prizes.html</link>
  <description><![CDATA[ 





<p>Few saw it coming when, on 8 October 2024, the Nobel Committee awarded the <a href="https://www.nobelprize.org/prizes/physics/2024/prize-announcement/">2024 Nobel Prize for Physics</a> to John Hopfield for his Hopfield networks and Geoffrey Hinton for his Boltzmann machines, seminal developments towards machine learning with statistical physics at their heart. The next day, machine learning, albeit using a different architecture, bagged half of the <a href="https://www.nobelprize.org/prizes/chemistry/2024/prize-announcement/">Nobel Prize for Chemistry</a> as well, with the award going to Demis Hassabis and John Jumper for the development of an algorithm that predicts protein folding conformations. The other half of the Chemistry Nobel was awarded to David Baker for successfully building new proteins.</p>
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Close-up of a copy of the Nobel Prize Medal. Photographed on the floor of the Nobel Museum in Old Town, Stockholm. Machine learning came up a winner in both the Physics and Chemistry Nobel Prizes for 2024. Credit: Shutterstock">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/10/31/images/Nobelpic-shutterstock-991.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Close-up of a copy of the Nobel Prize Medal. Photographed on the floor of the Nobel Museum in Old Town, Stockholm. Machine learning came up a winner in both the Physics and Chemistry Nobel Prizes for 2024. Credit: Shutterstock">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Close-up of a copy of the Nobel Prize Medal. Photographed on the floor of the Nobel Museum in Old Town, Stockholm. Machine learning came up a winner in both the Physics and Chemistry Nobel Prizes for 2024. Credit: Shutterstock
</figcaption>
</figure>
</div>
<p>While the AI takeover at this year’s Nobel announcements for Physics and Chemistry came as a surprise to most, there has been keen interest in how these apparently different approaches to machine learning might actually reduce to the same thing, revealing new ways of extracting some fundamental explainability from the generative AI algorithms that have so far been considered effectively “black boxes”. The “transformer architectures” behind the likes of ChatGPT and AlphaFold are incredibly powerful but offer little explanation as to how they reach their solutions, so people have resorted to querying the algorithms and adding to them in order to extract information that might offer some insights. “This is a much more conceptual understanding of what’s going on,” says Dmitry Krotov, now a researcher at IBM Research in Cambridge, Massachusetts, who, working alongside John Hopfield, took some of the first steps towards bringing the two types of machine learning algorithm together.</p>
<section id="collective-phenomena" class="level2">
<h2 class="anchored" data-anchor-id="collective-phenomena">Collective phenomena</h2>
<p>Hopfield networks brought to neural networks some of the mathematical toolbox long applied to extract “collective phenomena” from vast numbers of essentially identical parts, such as atoms in a gas or atomic spins in magnetic materials. Although there may be too many particles to track each individually, properties like temperature and magnetic field can be extracted using statistical physics. Hopfield showed that, similarly, a useful phenomenon he described as “associative memory” could be constructed from large numbers of artificial neurons by defining an “energy” that describes the network of neurons. The energy is determined by connections between neurons, which store information about patterns. Thus the network can retrieve the memorised patterns by minimising that energy, just as stable conformations of atomic spins might be found in a magnetic material<sup>1</sup>. As the energy of the network is minimised, the pattern gets closer to the one that was memorised, just as when recalling a word or someone’s name we might first run through similar sounding words or names.</p>
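<p>A minimal Python sketch may help make the energy picture concrete. This is an illustration of the classical construction, not code from any of the papers discussed: patterns are stored in Hebbian weights, and flipping one neuron at a time towards lower energy pulls a corrupted input back to the nearest stored pattern.</p>
<pre><code class="language-python">import numpy as np

def store(patterns):
    """Hebbian learning: the weight matrix encodes +/-1 patterns."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0)  # no self-connections
    return W

def energy(W, s):
    """Hopfield energy; asynchronous updates never increase it."""
    return -0.5 * s @ W @ s

def recall(W, s, steps=200, seed=0):
    """Repeatedly flip single neurons in the direction of lower
    energy until the state settles into a stored minimum."""
    s = s.copy()
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s &gt;= 0 else -1
    return s
</code></pre>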
<p>These Hopfield networks proved a seminal step in progressing AI algorithms, enabling a kind of pattern recognition from multiple stored patterns. However, it turned out that the number of patterns that could be stored was fundamentally limited due to what are known as “local” minima. You can imagine a ball rolling down a hill – it will reach the bottom of the hill fine so long as there are no dips for it to get stuck in en route. Algorithms based on Hopfield networks were prone to getting stuck in such dips, or undesirable local minima, until Hopfield and Krotov put their heads together to find a way around it. Krotov describes himself as “incredibly lucky” that his research interests aligned so well with Hopfield’s. “He’s just such a smart and genuine person, and he has been in the field for many years,” he tells Real World Data Science. “He just knows things that no one else in the world knows.” Together they worked out they could address the problem of local minima by toggling the “activation function”.</p>
<div id="fig-2" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Energy Landscape of a Hopfield Network, highlighting the current state of the network (up the hill), an attractor state to which it will eventually converge, a minimum energy level and a basin of attraction shaded in green. Note how the update of the Hopfield Network is always going down in Energy. Credit: Mrazvan22/wikimedia">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/10/31/images/Energy_landscape.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Energy Landscape of a Hopfield Network, highlighting the current state of the network (up the hill), an attractor state to which it will eventually converge, a minimum energy level and a basin of attraction shaded in green. Note how the update of the Hopfield Network is always going down in Energy. Credit: Mrazvan22/wikimedia">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Energy Landscape of a Hopfield Network, highlighting the current state of the network (up the hill), an attractor state to which it will eventually converge, a minimum energy level and a basin of attraction shaded in green. Note how the update of the Hopfield Network is always going down in Energy. Credit: Mrazvan22/wikimedia
</figcaption>
</figure>
</div>
<p>In a Hopfield network all the neurons are connected to all the other neurons. Originally, however, the algorithm only considered interactions between two neurons at a time – i.e.&nbsp;the interaction between neuron 1 and neuron 2, neuron 1 and neuron 3, and neuron 2 and neuron 3, but not the interaction among all three together. By including such “higher order” interactions between more than two neurons, Krotov and Hopfield found they made the basins of attraction for the true minimum energy states deeper. You can think of it a little like the ball rolling down a steeper hill, so that it picks up more momentum along the slope of the main hill and is less prone to falling into little dips en route. This way Krotov and Hopfield increased the memory of Hopfield networks in what they called Dense Associative Memory, which they described in 2016<sup>2</sup>.</p>
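<p>In symbols, the change can be written as a modified energy function. Where the classical network’s energy is quadratic in the overlap between the state and each stored pattern, dense associative memory applies a rapidly growing interaction function to that overlap; the notation below follows the spirit of the 2016 paper rather than reproducing it exactly:</p>
<p>\[ E(\mathbf{s}) = -\sum_{\mu=1}^{K} F\big(\boldsymbol{\xi}^{\mu} \cdot \mathbf{s}\big), \qquad F(x) = x^{n}, \]</p>
<p>where \(\boldsymbol{\xi}^{\mu}\) are the \(K\) stored patterns and \(n = 2\) recovers the classical Hopfield network; larger \(n\) (or an exponential \(F\)) sharpens the basins of attraction and increases storage capacity.</p>
<p>Long before then, however, Geoffrey Hinton had found a different tack to follow to increase the power of this kind of neural network.</p>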
</section>
<section id="generative-ai" class="level2">
<h2 class="anchored" data-anchor-id="generative-ai">Generative AI</h2>
<p>Geoffrey Hinton showed that by defining some neurons as a hidden layer and some as a visible layer (a Boltzmann machine<sup>3</sup>), and limiting the connections so that neurons are only connected with neurons in other layers (a restricted Boltzmann machine<sup>4</sup>), finding the most likely network would generate configurations with meaningful similarities to the data – a type of generative AI. This and many other contributions by Geoffrey Hinton proved incredibly useful in the progress of machine learning. However, the generative AI algorithms grabbing headlines today have actually been devised using a “transformer” architecture, which differs from Hopfield networks and Boltzmann machines – or so it seemed initially.</p>
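<p>For reference, the standard formulation of a restricted Boltzmann machine (notation ours) assigns every joint configuration of visible units \(\mathbf{v}\) and hidden units \(\mathbf{h}\) an energy, with more probable configurations having lower energy:</p>
<p>\[ E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h}, \qquad p(\mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})}, \]</p>
<p>where the “restriction” is that the weight matrix \(W\) only connects visible units to hidden units, never units within the same layer, which is what makes sampling and training tractable.</p>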
<p>Transformer algorithms first emerged as a type of language model and were defined by a characteristic termed “attention”. “They say that each word represents a token, and essentially the task of attention is to learn long-range correlations between those tokens,” Krotov explains, using the word “bank” as an example. Whether the word means the edge of a river or a financial institution can only be ascertained from the context in which it appears. “You learn these long-range correlations, and that allows you to contextualize and understand the meaning of every word.” The approach was first reported in 2017 in a paper titled “Attention is all you need”<sup>5</sup> by researchers at Google Brain and Google Research.</p>
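<p>The mechanism at the heart of that paper is scaled dot-product attention: each token’s query vector is compared against every token’s key vector, and the resulting weights, normalised by a softmax, blend the value vectors into a context-aware representation:</p>
<p>\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \]</p>
<p>where \(Q\), \(K\) and \(V\) are the query, key and value matrices and \(d_k\) is the key dimension.</p>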
<p>It was not long before people figured out that the approach would enable powerful algorithms for tasks beyond language manipulation – including Demis Hassabis and John Jumper at DeepMind, as they worked to figure out an algorithm that could predict the folding conformations of proteins. The algorithm they landed on in 2020 – AlphaFold2 – was capable of protein conformation prediction with 90% accuracy, way ahead of any other algorithm at the time, including DeepMind’s previous attempt, AlphaFold, which, although streets ahead of the field when it was developed in 2018, still only achieved an accuracy of 60%. It was for the extraordinary predictive powers for protein conformations achieved by AlphaFold2 that Hassabis and Jumper were awarded half the 2024 Nobel Prize for Chemistry.</p>
</section>
<section id="connecting-the-dots" class="level2">
<h2 class="anchored" data-anchor-id="connecting-the-dots">Connecting the dots</h2>
<p>Transformer architectures are undoubtedly hugely powerful, but how they operate can seem something of a dark art: although computer scientists know how they are programmed, even they cannot tell how the networks reach their conclusions in operation. Instead they query the algorithm and add to it to try to get some pointers as to what the trail of logic might have been. Here Hopfield networks have an advantage, because people can hope to get a grasp on what energy minima they are converging to, and in that way get a handle on their working out. However, in their paper “Hopfield networks is all you need”<sup>6</sup>, researchers in Austria and Norway showed that the activation function, which Hopfield and Krotov had toggled to make Hopfield networks store more memories, can also link them to transformer architectures – essentially, if the function is exponential they can reduce to the same thing.</p>
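<p>The correspondence can be stated compactly. With the stored patterns collected as the columns of a matrix \(X\) and an exponential interaction function, the update rule of the modern Hopfield network for a query state \(\boldsymbol{\xi}\) becomes a softmax over pattern similarities (our paraphrase of the paper’s result):</p>
<p>\[ \boldsymbol{\xi}^{\text{new}} = X\,\mathrm{softmax}\big(\beta\, X^{\top} \boldsymbol{\xi}\big), \]</p>
<p>which has exactly the form of the attention formula above, with the stored patterns playing the role of keys and values and \(\beta\) playing the role of the \(1/\sqrt{d_k}\) scaling.</p>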
<p>“We think about attention as learning long-range correlations, and this dense associative memory interpretation of attention tells you that each word creates a basin of attraction,” Krotov explains. “Essentially, the contextualization of the unknown word happens through the attraction to these different memories,” he adds. “That kind of lens of thinking about transformers through the prism of energy landscapes – it’s opened up this whole new world where you can think about what transformers are doing computationally, and how they perform that computation.”</p>
<p>“I think it’s great that the power of these tools is being recognised for the impact that they can have in accelerating innovation in new ways,” says Janet Bastiman, RSS Data Science and AI Section Chair and Chief Data Scientist at financial crime compliance solutions company Napier AI, commenting on the Nobel Prize awards. Bastiman’s most recent work has been on adding explanation to networks. She notes how the “Hopfield networks is all you need” paper highlights “the difference that layers can have on the final outcomes for specific tasks and a clear need for understanding some of the principles of the layers of networks in order to validate results and be aware of potential difficulties and ‘best’ scenarios for different use cases.”</p>
<p>Krotov also points out that since Hopfield networks are rooted in neurobiological interpretations, it helps to find “neurobiological ways of interpreting their computation” for transformer algorithms too. As such the vein Hopfield and Hinton tapped into with their seminal advances is proving ever richer in what Krotov describes as “the emerging field of the physics of neural computation”.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Anna Demming</strong> is a freelance science writer and editor based in Bristol, UK. She has a PhD from King’s College London in physics, specifically nanophotonics and how light interacts with the very small, and has been an editor for Nature Publishing Group (now Springer Nature), IOP Publishing and New Scientist. Other publications she contributes to include The Observer, New Scientist, Scientific American, Physics World and Chemistry World.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Anna Demming
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> Text, code, and figures are licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>, except where otherwise noted. Thumbnail image by <a href="https://www.shutterstock.com/image-photo/mute-key-on-neat-white-keyboard-1832448097">Shutterstock/Park Kang Hun</a> <a href="https://creativecommons.org/licenses/by/4.0/">Licenced by CC-BY 4.0</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Demming, Anna. 2024. “The machine learning victories at the 2024 Nobel Prize awards and how to explain them” Real World Data Science, October 31, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/10/31/machine-learning-nobel-prizes.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Hopfield J J Neural networks and physical systems with emergent collective computational abilities <em>PNAS</em> <strong>79</strong> 2554-2558 (1982) <a href="https://www.pnas.org/doi/pdf/10.1073/pnas.79.8.2554">https://www.pnas.org/doi/pdf/10.1073/pnas.79.8.2554</a>↩︎</p></li>
<li id="fn2"><p>Krotov D and Hopfield J J Dense Associative Memory for Pattern Recognition <em>NIPS</em> (2016)<a href="https://papers.nips.cc/paper_files/paper/2016/hash/eaae339c4d89fc102edd9dbdb6a28915-Abstract.html">https://papers.nips.cc/paper_files/paper/2016/hash/eaae339c4d89fc102edd9dbdb6a28915-Abstract.html</a>↩︎</p></li>
<li id="fn3"><p>Ackley D H, Hinton G E and Sejnowski T E A learning algorithm for boltzmann machines <em>Cognitive Science</em> <strong>9</strong> 147-169 (1985) <a href="https://www.sciencedirect.com/science/article/pii/S0364021385800124">https://www.sciencedirect.com/science/article/pii/S0364021385800124</a>↩︎</p></li>
<li id="fn4"><p>Salakhutdinov R, Mnih A and Hinton G Restricted Boltzmann machines for collaborative filtering <em>ICML ’07: Proceedings of the 24th international conference on Machine learning</em> 791-798 (2007) <a href="https://dl.acm.org/doi/10.1145/1273496.1273596">https://dl.acm.org/doi/10.1145/1273496.1273596</a>↩︎</p></li>
<li id="fn5"><p>Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I Attention is all you need <em>NIPS</em> (2017)<a href="https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html</a>↩︎</p></li>
<li id="fn6"><p>Ramsauer H, Schäfl B, Lehner J, Seidl P, Widrich M, Adler T, Gruber L, Holzleitner M, Pavlović M, Kjetil Sandve G, Greiff V, Kreil D, Kopp M, Klambauer G, Brandstetter J and Hochreiter S <em>arXiv</em> (2020) <a href="https://arxiv.org/abs/2008.02217">https://arxiv.org/abs/2008.02217</a>↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>Algorithms</category>
  <category>Machine Learning</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/10/31/machine-learning-nobel-prizes.html</guid>
  <pubDate>Thu, 31 Oct 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/10/31/images/Nobelpic-shutterstock-991.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Are we at risk of muting the female voice in the digital world?</title>
  <dc:creator>Anna Demming</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/09/17/digital-gender-gap.html</link>
  <description><![CDATA[ 





<p>Knowledge is power, and today a lot of that knowledge – not just what you know but who you know – is online. In 2015 the UN General Assembly laid out 17 Sustainable Development Goals (SDGs) that aim to end poverty and other deprivations while improving the welfare of both people and the planet. One of the <a href="https://sdgs.un.org/goals/goal5#targets_and_indicators">SDGs deals with gender equality</a> and emphasises the importance of digital technology for empowering women. Online, a woman can engage in commercial, social, business or networking transactions without needing to be absent from care responsibilities at home, to maintain traditional 9-5 working hours or, in some instances, even to reveal that she is a woman at all – all potentially transformative features of online engagement<sup>1</sup>. Yet whether digital technology can in fact empower women is by no means clear cut.</p>
<p>‘For me, whether digital technologies are able to empower women was fundamentally an empirical question,’ says <a href="https://www.sociology.ox.ac.uk/people/ridhi-kashyap">Ridhi Kashyap</a>, professor of demography and computational data science at Oxford University. She adds that in order to ask these questions of impact, you first need to be able to measure inequalities in digital access. However, the pace of technological change has been a lot faster than the rate at which national censuses – or other kinds of surveys useful to social scientists – update their questions, so they shed little light on the demographics around digital technologies.</p>
<p>Since then, progress in accruing data on digital access has revealed some stark gender inequalities. However, access is not the only fly in the ointment when it comes to the potential for digital technology to help towards gender equality. ‘The most harmful illegal online content disproportionately affects women and girls,’ says the <a href="https://www.gov.uk/government/publications/online-safety-act-explainer/online-safety-act-explainer#how-the-act-protects-women-and-girls">explainer for the UK’s 2023 Online Safety Act</a>. A <a href="https://www.turing.ac.uk/news/publications/understanding-gender-differences-experiences-and-concerns-surrounding-online">study by the Turing Institute</a> published earlier this year has added nuance to this picture, but confirmed that many women feel particularly vulnerable online, suggesting women may be losing a seat at the table as debate and discourse increasingly move online.</p>
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Muted. In the absence of proactive intervention, the shift of debate and discourse online risks muting women and girls as multiple factors exclude them from engaging there as productively as male counterparts. Copyright: Park Kang Hun/Shutterstock." data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/09/17/images/Minoan-Illustration.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Muted. In the absence of proactive intervention, the shift of debate and discourse online risks muting women and girls as multiple factors exclude them from engaging there as productively as male counterparts. Copyright: Park Kang Hun/Shutterstock.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Muted. In the absence of proactive intervention, the shift of debate and discourse online risks muting women and girls as multiple factors exclude them from engaging there as productively as male counterparts. Copyright: Park Kang Hun/Shutterstock.
</figcaption>
</figure>
</div>
<p>The digital gender gap has an estimated cost of US$126 billion for the 32 low- and low-to-middle-income countries analysed by the Alliance for Affordable Internet (A4AI)<sup>2</sup>, owing to the ‘untold wealth of cultural, social, and scientific knowledge lost because of the exclusion of women’s and girls’ voices from the online world.’ Focus on this issue has brought a little more clarity to the size of the problem. However, while the UK’s Online Safety Act marks some progress, questions remain as to what can be done, and whether the hope of digital technologies helping towards gender equality is still justified.</p>
<section id="gender-disparities-in-internet-access" class="level2">
<h2 class="anchored" data-anchor-id="gender-disparities-in-internet-access">Gender disparities in internet access</h2>
<p>A turning point in the conversation around digital technology and gender equality came in 2018 with work by Kashyap and collaborators, at the time based in the US and Qatar. They found that where traditional survey-based data on internet and mobile gender gaps was available, it correlated well with the gender gap on Facebook, using data extracted from Facebook’s ad platform: when Facebook’s aggregate user counts did not show women, it provided a good signal that women in those countries were largely not online. As such, the work revealed a potentially useful proxy for gauging the digital gender gap in countries where little traditional survey data was available<sup>3</sup>. <a href="https://www.digitalgendergaps.org/">The results</a> revealed an unexpectedly large gender gap, particularly in parts of South Asia and certain countries in Africa, where men were up to twice as likely as women to have access to the internet.</p>
<p>‘In some sense it was perhaps not surprising,’ says Kashyap, highlighting that having a mobile phone or similar device that grants access to the internet amounts to a kind of asset ownership, and studies of other assets indicate women are less likely to own them. ‘This is broadly reflective of economic gender inequality,’ she adds. Perhaps more surprising is that the gaps have changed very little in the five years since <a href="https://www.digitalgendergaps.org/">their website, which monitors the digital gender gap</a>, was first released, particularly in view of the pace of technological progress in general and the importance placed on closing the gap. Citing India as an example, Kashyap points out that in 2019 the ratio of women’s to men’s internet access was 0.619 – fewer than two women had access for every three men with access. In the subsequent half decade this digital gender gap has closed by just 7.1%, to a ratio of 0.663.</p>
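<p>To make those figures concrete, here is a minimal sketch of the arithmetic in Python, using only the two India ratios quoted above (the variable names are illustrative, not taken from the Digital Gender Gaps codebase):</p>
<pre><code class="language-python"># Minimal sketch: relative closure of India's digital gender gap,
# using the two female-to-male internet access ratios quoted above.
# An index like this is typically built from gendered penetration
# rates, e.g. (online_women / all_women) / (online_men / all_men);
# values below 1 indicate women are underrepresented online.
ratio_2019 = 0.619    # fewer than 2 women online for every 3 men
ratio_latest = 0.663  # roughly half a decade later

relative_change = (ratio_latest - ratio_2019) / ratio_2019
print(f"Gap closed by {relative_change:.1%}")  # ~7.1%
</code></pre>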
<div id="fig-2" class="quarto-float quarto-figure quarto-figure-center anchored" alt="The digital gender gap. Ratio of female-to-male internet use estimated using the Facebook Gender Gap Index^[Leasure D R, Yan J, Bondarenko M, Kerr D, Fatehkia M, Weber I &amp; Kashyap R. Digital Gender Gaps Web Application, v1.0.0. Zenodo, GitHub (2023) [doi:10.5281/zenodo.7897491](https://github.com/OxfordDemSci/dgg-www)]" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/09/17/images/Digital gender gap.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="The digital gender gap. Ratio of female-to-male internet use estimated using the Facebook Gender Gap Index^[Leasure D R, Yan J, Bondarenko M, Kerr D, Fatehkia M, Weber I &amp; Kashyap R. Digital Gender Gaps Web Application, v1.0.0. Zenodo, GitHub (2023) [doi:10.5281/zenodo.7897491](https://github.com/OxfordDemSci/dgg-www)]">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: The digital gender gap. Ratio of female-to-male internet use estimated using the Facebook Gender Gap Index<sup>4</sup>
</figcaption>
</figure>
</div>
<p>In countries where the gender disparity in internet access is large, there is evidence that those women who do have access come from the more affluent echelons of society. Analysis of the type of device used, which can also be retrieved from the Facebook ad platform, showed that where women are less likely to be online, the relative proportion of iOS users tends to be higher among women than among men – and, as Kashyap points out, ‘iOS users are on average wealthier’. Fortunately, mobile network providers are among the stakeholders starting to see the benefit of closing the gender gap in internet access, and are looking to tap into this part of the market through incentives and discounts on SIMs for women. However, it is unclear to what extent such schemes ultimately help close the wider gap.</p>
<p>Kashyap and her colleagues also found that a key predictor of the digital gender gap was the gender gap in educational attainment. ‘I think that’s quite telling, because it’s showing that accessing education and going to educational institutions is also a pathway to becoming more digitally integrated,’ says Kashyap, flagging that schools and educational institutions are where women and girls often access computers and digital technologies. She highlights that beyond giving people a device ‘more of the challenge’ is helping them make good use of it by ‘giving people skills to feel that this is actually meaningful for them, and allows them to do things that they wouldn’t be able to do otherwise, and feeling confident and safe and secure.’ She emphasises the importance of men valuing gender equality, highlighting work from South Asia that shows that even when women have a device, their use of it may be curtailed or scrutinised by male members of the household, sometimes on the grounds of <a href="https://www.gsma.com/solutions-and-impact/connectivity-for-good/mobile-for-development/blog/the-mobile-gender-gap-in-south-asia-is-now-widening/">doubts over women’s safety online</a>.</p>
</section>
<section id="gender-disparities-in-fears-of-online-harms" class="level2">
<h2 class="anchored" data-anchor-id="gender-disparities-in-fears-of-online-harms">Gender disparities in ‘fears’ of online harms</h2>
<p>Safety can be a knotty issue when it comes to enabling women to have a voice online. A study by the Alan Turing Institute<sup>5</sup> earlier this year suggested just 23% of women feel comfortable expressing political opinions online, compared with 40% of men. This might be down to women being exposed to online violence more than men, as previous studies of online harms have suggested. Indeed, a key takeaway from the Alan Turing Institute’s study was that women reported greater fears of exposure for all categories of harm, including types of harm that women reported experiencing less frequently than men.</p>
<p>Previous studies have largely surveyed women-only samples, so their conclusions were drawn without data on men for comparison. In contrast, the researchers at the Alan Turing Institute, including researcher <a href="https://www.turing.ac.uk/people/researchers/tvesha-sippy">Tvesha Sippy</a>, conducted a nationally representative survey of 2,000 men and women. They asked respondents whether they had been exposed to various types of online harms, about their fears surrounding such exposure, the psychological impact of those experiences, their tendency to use protective tools for digital activities, and how comfortable they felt with online behaviours such as expressing opinions and sharing information. The study revealed that women were significantly more likely than men to report experiencing some harms, such as online misogyny, cyberflashing, cyberstalking, image-based abuse and eating disorder content. However, there were several harms that men reported being the direct targets of more often than women, such as hate speech, misinformation, trolling and threats of physical violence.</p>
<p>By using a representative cohort, the Alan Turing Institute study tells a more nuanced story than those sampling women only, and it highlights the challenges of making similar assessments for minority groups. For example, those identifying as non-binary were excluded from the analysis because – although, as Sippy emphasises, ‘We do want to look at minoritised genders’ – there were not sufficient respondents in this category within the nationally representative survey for any meaningful analysis. Ultimately, a higher budget enabling larger samples would allow analysis of minority groups as well.</p>
<p>As for the greater fears of all online harms reported by women, ‘it’s a very complex phenomenon,’ Sippy tells Real World Data Science, highlighting the need for further research. She points to several possible explanations, such as differences in the impacts of the harms experienced more by women versus men, as well as fearfulness carried over from the offline world shaping behaviour online. Sippy also highlights differences in how men and women experience online harms, which may offer clues. Women were more likely to report that their fears stem from the experience of a public figure (35% of the women surveyed compared with 26% of the men) or a female friend (37% of the women compared with 27% of the men). Furthermore, the experience of a male friend was much less often cited as the source of online fears by both groups (8% of the women and 14% of the men). There is also the possibility that women’s adaptive behaviours leave them less exposed to future online harms than men, since women were more likely to use protective tools, from disabling location-sharing on a device to limiting who can engage with images, posts and tweets, or even find their profile. While protective, such adaptive behaviours could also dampen the influence women have in online discourse.</p>
<p>Rather than relying on adaptive behaviour for self-protection, it would seem a lot of people are keen to see more action from social media companies and governments to help people to feel safer online. In 2023, researchers at the Turing Institute led by senior research associate Florence Enock published a study investigating <a href="https://www.turing.ac.uk/news/publications/experiences-online-harms">attitudes to online interventions</a>. They found that 79% thought social media platforms should ban or suspend users who create harmful content and 73% thought that platforms should remove harmful content. According to the report ‘this was consistent across age, gender, educational background, income and political ideology.’</p>
<p>There are complications for social media companies, which need to balance privacy with protection, and to find the resources required to handle multilingual posts when investigating what action to take. However, Sippy feels there remains a need for a civil remedy, so that a user can ask a platform to take down harmful content without having to pursue criminal proceedings and involve the police. Where the additional resources needed for corrective action and the lack of a business incentive pose an obstacle for social media companies, government legislation may help. The same study into attitudes to online interventions also reported that more than 70% of respondents felt the government should be able to issue large fines to platforms that fail to deal with harmful content online, and 66% thought that legal action should be taken.</p>
<p>‘The Online Safety Act is a really good start,’ adds Sippy, also highlighting the importance of proposals by the previous UK government to criminalise the creation of sexually explicit deepfakes. She points to a 2019 report by AI firm Deeptrace suggesting that of 15,000 deepfake videos found online, 96% constituted nonconsensual pornography, with women disproportionately targeted<sup>6</sup>. In a recent Alan Turing Institute survey, 90% of respondents expressed concerns about deepfakes increasing misogyny and online violence against women and girls<sup>7</sup>. ‘I do see there’s more advocacy, but it remains to be seen what approach the new Government will take.’</p>
</section>
<section id="gender-disparities-for-making-an-impact-online" class="level2">
<h2 class="anchored" data-anchor-id="gender-disparities-for-making-an-impact-online">Gender disparities for making an impact online</h2>
<p>Challenges to women being heard online go beyond safety issues. Recent research by Kashyap, collaborators at the University of Oxford, and collaborators in Iran and Germany has also highlighted differences in how influential women’s professional networks are relative to men’s<sup>8</sup>. In previous work with Florianne Verkroost, also at the University of Oxford, Kashyap had investigated gender gaps in who has a LinkedIn profile and how they vary across industries<sup>9</sup>. They found that use of the platform broadly mirrors female-to-male ratios of representation in technical and managerial professions. In the later study, they investigated what insights LinkedIn data might provide into the causes of some of the gender disparities in these professions – and ultimately why women are not progressing in technical and professional jobs as well as their male counterparts.</p>
<p>‘One argument is that that’s often because they don’t have advantageous networks,’ says Kashyap, adding that women may be restricted by the need to resume care commitments at home instead of staying for drinks after work or travelling to attend conferences. One might expect online avenues for networking to mitigate such obstacles. In fact, studies of LinkedIn data did suggest that although women are less likely to be in professional and technical occupations, as reflected in the platform’s data, in some instances their representation on the platform exceeded their offline numbers. Kashyap suggests this could be ‘where they’re using online platforms to make themselves more visible, because other offline forms of networking are less available, or they have less time for it.’ Indeed, women who were on LinkedIn were more likely to report a promotion than their male counterparts, suggesting an element of positive selection among the female LinkedIn user population. However, the potential equalising impact of moving professional networking online seems to have its limits.</p>
<p>Their study of LinkedIn data showed women were less likely to report relocating for work, which Kashyap suggests ‘is a sign that the work-family trade-off is probably still remaining acute for this highly selected group.’ In another 2023 study, based on bibliometric data from over 33 million Scopus publications, Kashyap and colleagues had also reported lower mobility for women among published scientists, researchers and academics<sup>10</sup>. In addition, when Kashyap and her colleagues looked at women on LinkedIn working in the tech sector, they found these women had a lower chance than men of being connected to people at one of the “big five” tech firms, when not working in one themselves. ‘One way to interpret that is to say that they have maybe less influential online social networks, right, even when they are on the platform.’</p>
<div id="fig-4" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Leaky pipeline. The proportion of women working in science decreases towards the mid and senior career stages" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-4-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/09/17/images/shutterstock_1215562669-h350.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Leaky pipeline. The proportion of women working in science decreases towards the mid and senior career stages">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-4-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: Leaky pipeline. The proportion of women working in science decreases towards the mid and senior career stages.
</figcaption>
</figure>
</div>
<p>Kashyap suggests several reasons why women may have less influential networks online. For one, online networks are still likely to be influenced by the scenarios playing out offline, since referrals on these networks are based on the people you already know. The difference may also stem from the types of companies women tend to work in and the positions they hold. For instance, women are more likely to work in IT service support than in programming-intensive occupations, and here once again Kashyap suggests the work-family trade-off plays a role in women seeking less intensive or more flexible jobs. She highlights that girls equal or exceed the achievement of their male counterparts through school and continue to match them in their early careers, before their numbers start to drop off dramatically. ‘I think now there’s a growing recognition that this is actually a real conflict, the work family conflict,’ she tells Real World Data Science. Today’s young women are socialised to have ‘high achieving aspirations’, which can be hard to reconcile with ‘regressive norms’ for women to shoulder the bulk of caring responsibilities, particularly when starting a family.</p>
</section>
<section id="real-world-gender-disparities-in-career-development" class="level2">
<h2 class="anchored" data-anchor-id="real-world-gender-disparities-in-career-development">Real world gender disparities in career development</h2>
<p>Neuroscientist Joanne Kenney has also been following data on the gender gap in the science and tech sectors, and co-authored ‘A Snapshot of Female Representation in Twelve Academic Psychiatry Institutions Around the World’<sup>11</sup> with Elisabetta del Re, Assistant Professor of Psychiatry at Harvard Medical School. The figures published there show that globally women represent a large majority of early career scientists, but their numbers steadily decrease towards the mid and senior career stages – a negative correlation between career stage and female presence in science often referred to as the ‘leaky pipeline’ or ‘sticky floor’. ‘You don’t always hear their stories or the reasons why they’ve left,’ says Kenney, who notes that in her experience exit interviews in academia are rare. Just 24% of the UK’s total tech sector workforce are women, while black women account for only 0.7% of IT professionals, according to the 2024 UN Women UK and Kearney Consulting report <a href="https://www.kearney.com/about/diversity-equity-and-inclusion/gap-to-gateway">‘Gap to Gateway: diversity in tech as the key to the future’</a>, for which Kenney was an external collaborator. Kenney is currently working on another project, with a team of scientists from Europe, Africa, and North and South America led by del Re, to gather stories from women and other underrepresented groups in academic institutions around the world through focus groups aimed at better understanding their experiences of working in science.</p>
<p>For those who stick at it, the career path appears to be a steeper hike for women than their male counterparts. There is a citation-bias favouring male-authored articles<sup>12</sup>. Women also take on average nine years to transition to senior author whereas men take five<sup>13</sup>, and women are less likely to be promoted to leadership positions<sup>14</sup>. While women in science bear a measurably unequal career impact on entering parenthood<sup>15</sup>, some of these inequalities may also stem from sexism, which can range from fewer opportunities for mentorship and collaboration to outright harassment<sup>16</sup>.</p>
<p>‘I think a lack of mentorship and sponsorship are two big ones,’ says Kenney of the key discouraging factors for women at the mid-career point in tech and academia. In AI in particular, less than 3% of venture capital funding deals involving AI startups go to women-founded companies. The gender pay gap, which at 16% in the sector exceeds the overall pay gap of 11.6%, may be another disincentive.</p>
<p>In short, there is evidence of patriarchal subcultures at play – in the tech and science sectors and in the world in general – that can still pose a significant disadvantage to women. As Sippy points out, ‘Those subcultures also translate to the online world.’ Ultimately, while digital technologies may offer creative loopholes for side-stepping some aspects of gender bias and disadvantage, gender inequality needs to be tackled in both spaces in tandem.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Anna Demming</strong> is a freelance science writer and editor based in Bristol, UK. She has a PhD from King’s College London in physics, specifically nanophotonics and how light interacts with the very small, and has been an editor for Nature Publishing Group (now Springer Nature), IOP Publishing and New Scientist. Other publications she contributes to include The Observer, New Scientist, Scientific American, Physics World and Chemistry World.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Anna Demming
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> Text, code, and figures are licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>, except where otherwise noted. Thumbnail image by <a href="https://www.shutterstock.com/image-photo/mute-key-on-neat-white-keyboard-1832448097">Shutterstock/Park Kang Hun</a> <a href="https://creativecommons.org/licenses/by/4.0/">Licenced by CC-BY 4.0</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Demming, Anna. 2024. “Are we at risk of muting the female voice in the digital world?” Real World Data Science, September 17, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/09/17/digital-gender-gap.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Sicat M, Xu A, Mehetaj E, Ferrantino M &amp; Chemutai V Leveraging ICT Technologies in Closing the Gender Gap World Bank <em>World Bank Group, Washington DC</em> (2020) <a href="https://documents1.worldbank.org/curated/en/891391578289050252">https://documents1.worldbank.org/curated/en/891391578289050252</a>↩︎</p></li>
<li id="fn2"><p>Web Foundation. The Costs of Exclusion: Economic Consequences of the Digital Gender Gap. Alliance for Affordable Internet (2021) <a href="https://a4ai.org/report/the-costs-of-exclusion-economic-consequences-of-the-digital-gender-gap/">https://a4ai.org/report/the-costs-of-exclusion-economic-consequences-of-the-digital-gender-gap/</a>↩︎</p></li>
<li id="fn3"><p>Fatehkia M, Kashyap R &amp; Ingmar Weber I Using Facebook ad data to track the global digital gender gap <em>World Development</em> <strong>107</strong> 189-209 (2018) <a href="https://www.sciencedirect.com/science/article/pii/S0305750X18300883">https://www.sciencedirect.com/science/article/pii/S0305750X18300883</a>↩︎</p></li>
<li id="fn4"><p>Leasure D R, Yan J, Bondarenko M, Kerr D, Fatehkia M, Weber I &amp; Kashyap R. Digital Gender Gaps Web Application, v1.0.0. Zenodo, GitHub (2023) <a href="https://github.com/OxfordDemSci/dgg-www">doi:10.5281/zenodo.7897491</a>↩︎</p></li>
<li id="fn5"><p>Stevens F, Enock F E, Sippy T, Bright J, Cross M, Johansson P, Wajcman J, Margetts H Z Understanding gender differences in experiences and concerns surrounding online harms: A nationally representative survey of UK adults Alan Turing Institute (2024) <a href="https://www.turing.ac.uk/news/publications/understanding-gender-differences-experiences-and-concerns-surrounding-online">https://www.turing.ac.uk/news/publications/understanding-gender-differences-experiences-and-concerns-surrounding-online</a>↩︎</p></li>
<li id="fn6"><p>Ajder H, Patrini G, Cavalli F &amp; Cullen L The State of Deepfakes: Landscape, Threats, and Impact, (2019) <a href="https://regmedia.co.uk/2019/10/08/deepfake_report.pdf">https://regmedia.co.uk/2019/10/08/deepfake_report.pdf</a>↩︎</p></li>
<li id="fn7"><p>Sippy T, Enock F E, Bright J &amp; Margetts H Z Behind the Deepfake: 8% Create; 90% Concerned Alan Turing Institute (2024) <a href="https://www.turing.ac.uk/news/publications/behind-deepfake-8-create-90-concerned">https://www.turing.ac.uk/news/publications/behind-deepfake-8-create-90-concerned</a>↩︎</p></li>
<li id="fn8"><p>Kalhor G, Gardner H, Weber I, Kashyap R <em>Proceedings of the Eighteenth International AAAI Conference on Web and Social Media</em> <strong>18</strong> (2024) <a href="https://ojs.aaai.org/index.php/ICWSM/article/view/31353">https://ojs.aaai.org/index.php/ICWSM/article/view/31353</a>↩︎</p></li>
<li id="fn9"><p>Kashyap R &amp; Verkroost F C J Analysing global professional gender gaps using LinkedIn advertising data EPJ Data Science <strong>10</strong> 39 (2021) <a href="https://epjds.epj.org/articles/epjdata/abs/2021/01/13688_2021_Article_294/13688_2021_Article_294.html">https://doi.org/10.1140/epjds/s13688-021-00294-7</a>↩︎</p></li>
<li id="fn10"><p>Zhao X , Akbaritabar A, Kashyap R &amp; Zagheni E A gender perspective on the global migration of scholars <em>PNAS</em> <strong>120</strong> e2214664120 <a href="https://www.pnas.org/doi/10.1073/pnas.2214664120">https://doi.org/10.1073/pnas.2214664120</a>↩︎</p></li>
<li id="fn11"><p>Kenney J, Ochoa S, Alnor M A, Ben-Azu B, Diaz-Cutraro L, Folarin R, Hutch A, Luckhoff H K, Prokopez C R, Rychagov N, Surajudeen B, Walsh L, Watts T, Del Re E C A Snapshot of Female Representation in Twelve Academic Psychiatry Institutions Around the World <em>Psychiatry Research</em> (2021) <a href="https://pubmed.ncbi.nlm.nih.gov/34986430/">doi: 10.1016/j.psychres.2021.114358</a>↩︎</p></li>
<li id="fn12"><p>Dworkin J D, Linn K A, Teich E G, Zurn P, Shinohara R T &amp; Bassett D S The extent and drivers of gender imbalance in neuroscience reference lists <em>Nature</em> <strong>23</strong> 918-926 (2020) <a href="https://www.nature.com/articles/s41593-020-0658-y">https://www.nature.com/articles/s41593-020-0658-y</a>↩︎</p></li>
<li id="fn13"><p>Bearden C E Accelerating the Bending Arc Toward Equality: A Commentary on Gender Trends in Authorship in Psychiatry Journals <em>Biological Psychiatry</em> <strong>86</strong> 575-576 (2019)<a href="https://www.biologicalpsychiatryjournal.com/article/S0006-3223(19)31588-4/abstract">https://www.biologicalpsychiatryjournal.com/article/S0006-3223(19)31588-4/abstract</a>↩︎</p></li>
<li id="fn14"><p>Clark J &amp; Horton R A coming of age for gender in global health <em>The Lancet</em> <strong>393</strong> p2367-2369 (2019) <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(19)30986-9/abstract">https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(19)30986-9/abstract</a>↩︎</p></li>
<li id="fn15"><p>Morgan A C, Way S F, Hoefer M J D, Larremore D B, Galesic M &amp; Clauset A The unequal impact of parenthood in academia <em>Science Advnaces</em> <strong>7</strong> eabd1996 <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7904257/">doi: 10.1126/sciadv.abd1996</a>↩︎</p></li>
<li id="fn16"><p>O’Connor P s gendered power irrelevant in higher educational institutions? Understanding the persistence of gender inequality *Interdisciplinary Science Reviews” <strong>48</strong> 669-686 (2023) <a href="https://www.tandfonline.com/doi/full/10.1080/03080188.2023.2253667#d1e144">https://doi.org/10.1080/03080188.2023.2253667</a>↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>gender equality</category>
  <category>skills</category>
  <category>ethics</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/09/17/digital-gender-gap.html</guid>
  <pubDate>Tue, 17 Sep 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/09/17/images/Minoan-Illustration-991.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Nowcasting upgrade for better real time estimation of GDP and inflation</title>
  <dc:creator>Atmajitsinh Gohil</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2024/6/25/nowcasting-3step.html</link>
  <description><![CDATA[ 





<p>Governments, policymakers and central banks across the world are wrestling to keep rising prices under control using monetary policies such as interest rate increases. The effectiveness of such policy changes should be assessed by monitoring inflation data as well as studying the impact on real GDP, making timely and accurate access to key economic indicators crucial for policy planning. The delay in publishing economic indicators such as real GDP, inflation and other labour-related series makes this real-time assessment of the economy particularly challenging. Now Menzie Chinn at the University of Wisconsin, Baptiste Meunier at the European Central Bank and Sebastian Stumpner at the Banque de France report an approach to “nowcasting”, built on previous research, that develops a framework using different machine learning techniques and is more flexible and adaptable than traditional methods<sup>1</sup>. They report the accuracy of their 3-step framework for nowcasting global trade volumes, showing how it can outperform traditional methods, and highlight that the 3-step framework can be extended beyond world trade data.</p>
<p>Nowcasting, an amalgamation of the terms “now” and “forecasting”, provides a methodology for assessing the current state of the economy by predicting the current value of inflation or real GDP. The <a href="https://www.newyorkfed.org/research/policy/nowcast#/overview">Federal Reserve Bank of New York</a> and <a href="https://www.atlantafed.org/cqer/research/gdpnow">Federal Reserve Bank of Atlanta</a> have used nowcasting to publish real-time GDP estimates for the USA. Similarly, the <a href="https://www.clevelandfed.org/indicators-and-data/inflation-nowcasting">Federal Reserve Bank of Cleveland estimates real-time inflation</a> using nowcasting methods.</p>
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="GDP digital drawing. Credit: Shutterstock, Vink Fan">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2024/6/25/images/GDPshutterstock_2302082265-991.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="GDP digital drawing. Credit: Shutterstock, Vink Fan">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Growth of GDP with statistical graph, 3d rendering. Digital drawing. Credit: Shutterstock, Vink Fan
</figcaption>
</figure>
</div>
<p>The basic principle of nowcasting is to utilise information that is published early, such as data published at higher frequency, survey data, financial indicators or economic indicators. For example, the running estimate of real GDP (aka GDPNow) that the Federal Reserve Bank of Atlanta provides is updated six or seven times a month, on weekdays when one of the seven input data sources is released. Similarly, the real GDP growth estimate that the Federal Reserve Bank of New York provides is based on data releases in categories such as housing and construction, manufacturing, surveys, retail and consumption, income, labour, international trade, prices and others.</p>
<p>Traditional methods of nowcasting do not provide an integrated framework: forecasters need to know which variables to use, then select methods for factor extraction and regression. Chinn, Meunier and Stumpner propose a sequential framework that first selects the most important predictors. The selected variables are then summarised using Principal Component Analysis (PCA), and the resulting factors are used as explanatory variables in the regression. Although traditional methods of nowcasting also utilised many of these techniques, the authors test various combinations of pre-selection, factor extraction and regression techniques and propose a combination that improves model accuracy.</p>
<section id="model-framework-improved-flexibility-and-accuracy" class="level2">
<h2 class="anchored" data-anchor-id="model-framework-improved-flexibility-and-accuracy">Model framework improved flexibility and accuracy:</h2>
<p>The framework’s three steps are performed in sequence. The first is pre-selection of the independent variables with the highest predictive power. In the second step, the selected variables are summarised into a few factors using a factor extraction method. The final step uses the factors from step 2 as explanatory variables in a regression.</p>
<div id="fig-2" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="The various methods that can be employed in the 3 step framework in Chinn et al (2024). Credit: National Bureau of Economic Research.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2024/6/25/images/3step-framework-methods-big.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="The various methods that can be employed in the 3 step framework in Chinn et al (2024). Credit: National Bureau of Economic Research.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: The various methods that can be employed in the 3 step framework in Chinn et al (2024). Credit: National Bureau of Economic Research.
</figcaption>
</figure>
</div>
<p>Figure&nbsp;2 summarises the various methods that can be employed at each step of the 3-step framework. In their report, Chinn, Meunier and Stumpner aim to identify the best technique for each of pre-selection, factor extraction and regression. Their preferred 3-step framework performs pre-selection using Least Angle Regression (LARS) and factor extraction using Principal Component Analysis (PCA), and employs a Macroeconomic Random Forest (MRF) machine learning technique for the nowcasting regression.</p>
<p>The performance of MRF is compared with traditional methods using Root Mean Square Error (RMSE), a measure of the deviation between actual and predicted values. The accuracy of the 3-step framework is tested by holding pre-selection and factor extraction fixed, to isolate the impact of the regression technique.</p>
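<p>To make the sequence concrete, the sketch below chains the three steps using scikit-learn building blocks: LARS for pre-selection, PCA for factor extraction, and a standard random forest standing in for the authors’ Macroeconomic Random Forest, which scikit-learn does not provide. This is a minimal sketch on synthetic data; the variables, parameter choices and stand-in regressor are illustrative, not those of the paper:</p>
<pre><code class="language-python"># Minimal sketch of the 3-step nowcasting framework, assuming a
# standard RandomForestRegressor as a stand-in for the authors'
# Macroeconomic Random Forest (MRF); data here is synthetic.
import numpy as np
from sklearn.linear_model import Lars
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 50 candidate monthly indicators
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

# Step 1: pre-select the indicators with the highest predictive power
lars = Lars(n_nonzero_coefs=10).fit(X_train, y_train)
selected = np.flatnonzero(lars.coef_)

# Step 2: summarise the selected indicators into a few factors
pca = PCA(n_components=3).fit(X_train[:, selected])
F_train = pca.transform(X_train[:, selected])
F_test = pca.transform(X_test[:, selected])

# Step 3: regress the target on the extracted factors
forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(F_train, y_train)

# Compare models on RMSE, the deviation between actual and predicted
rmse = mean_squared_error(y_test, forest.predict(F_test)) ** 0.5
print(f"RMSE: {rmse:.3f}")
</code></pre>
<p>Holding steps 1 and 2 fixed while swapping the step 3 regressor, as the authors do in their comparisons, isolates the contribution of the regression technique to the RMSE.</p>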
<div id="fig-3" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Bar chart comparing the accuracy of different methods in terms of RMSE. Credit: National Bureau of Economic Research.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2024/6/25/images/method-accuracy-big.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Bar chart comparing the accuracy of different methods in terms of RMSE. Credit: National Bureau of Economic Research.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: Bar chart comparing the accuracy of different methods in terms of RMSE. Credit: National Bureau of Economic Research.
</figcaption>
</figure>
</div>
<p>Figure&nbsp;3 compares the RMSE of traditional methods, tree-based machine learning models and machine learning regression models for backcasting (t-2 and t-1), nowcasting (t) and forecasting (t+1). It highlights the greater accuracy of MRF and gradient boosting compared with traditional models and other tree models across backcasting, nowcasting and forecasting.</p>
</section>
<section id="whats-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-next">What’s Next?</h2>
<p>Organisations such as <a href="https://nowcastinglab.org/map">The Nowcasting Lab</a> provide GDP estimates for European countries. Nowcasting techniques have also been employed by humanitarian agencies, including the United Nations Refugee Agency (UNHCR), which uses nowcasting to estimate the forcibly displaced population. Nowcasting techniques, dashboards and tools have been implemented and accepted as reliable sources of information for policymaking at government organisations, central banks and financial organisations. The 3-step framework proposed by Chinn, Meunier and Stumpner is easily adaptable, flexible and more accurate, which will be valuable to the range of fields employing nowcasting.</p>
<div class="article-btn">
<p><a href="../../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Atmajitsinh Gohil</strong> is an independent researcher in the field of AI and ML, specifically managing AI and ML risk. He has worked with a consulting firm, assisting clients in model risk management. He graduated from SUNY Buffalo with an MS in Economics.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Atmajitsinh Gohil
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> Text, code, and figures are licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>, except where otherwise noted. Thumbnail image by <a href="https://www.shutterstock.com/image-illustration/growth-gdp-statistical-graph-3d-rendering-2302082265">Shutterstock Van Fink</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Gohil, Atmajitsinh. 2024. “Nowcasting upgrade for better real time estimation of GDP and inflation.” Real World Data Science, June 25, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/25/nowcasting-3step.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Nowcasting World Trade with Machine Learning: a Three-Step Approach Chinn, M. D., Meunier, B. &amp; Stumpner, S. <em>NBER</em> <a href="https://www.nber.org/papers/w31419">DOI 10.3386/w31419</a>) ↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Forecasting</category>
  <category>Machine learning</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2024/6/25/nowcasting-3step.html</guid>
  <pubDate>Tue, 25 Jun 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/datasciencebites/posts/2024/6/25/images/GDPshutterstock_2302082265-991.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>AI series: Ensuring new AI technologies help everyone thrive</title>
  <dc:creator>Anna Demming</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/06/11/ai-series-7.html</link>
  <description><![CDATA[ 





<p>“There’s some beautiful stories in clinical notes,” said Mark Sales, global strategy leader of the cloud technology company Oracle Life Sciences. He was speaking to delegates at the 2024 London Biotechnology Show about “unlocking health data and artificial intelligence within life sciences”, where opportunities abound, such as exploiting large language models (LLMs) to process some of the detailed information currently hidden in clinical notes into more structured data to inform fields like oncology. Oracle are also looking into using AI to take some of the luck out of connecting the right patients with clinical trials that might help them. The AI in Medicine and Surgery group at the University of Leeds, headed by Sharib Ali, has demonstrated the potential to reduce the number of times patients need to go through <a href="https://www.sciencedirect.com/science/article/pii/S0016508521030870">uncomfortable procedures like oesophageal scans</a> for Barrett’s syndrome, and is working on the potential to provide haptic feedback for robot-mediated surgery. The London Biotechnology Show delegates had already heard about all these opportunities. Nonetheless, Sales’s talk had opened with a note of caution: “There’s a lot more we could do, and there’s a lot more we probably shouldn’t do.”</p>
<p>It is an increasingly familiar caveat. “In the best scenario, AI could widely enrich humanity, equitably equipping people with the time, resources, and tools to pursue the goals that matter most to them,” suggest the <a href="https://partnershiponai.org/paper/shared-prosperity/">Partnership on AI</a>, a non-profit partnership of academic, civil society, industry, and media organizations. The goal of the partnership is to ensure AI brings a net positive contribution to society as a whole, not just a lucky minority – which, they suggest, will not necessarily be the case if we rely on chance and market forces to direct progress. While people working on developing and deploying AI tackle the burgeoning <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/04/22/ai-series-1.html">size and complexity of their models</a>, as well as the myriad <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/ai-series-3.html">requirements of testing and training data</a>, <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/21/ai-series-4.html">establishing whether a model is fit for purpose</a>, and <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/ai-series-6.html">dodging the numerous pitfalls that cause most AI projects to fail</a>, perhaps the greatest challenge remains the range of <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/ai-series-2.html">ethical considerations</a>: inclusiveness and fairness, robustness and reliability, transparency and accountability, privacy and security, and general forethought and design. The scope of societal impact can reach far beyond the immediate sphere of interaction with the model, or the interests of the companies deploying it, suggesting the need for some form of governance.</p>
<p>However, technology is moving fast in a lot of different directions. Even with agreed, sound values that all technological developments should respect, there is still space for companies to deploy AI models without supplying the resources and expertise needed for the roll-out to meet ethical and societal expectations. This expertise can range from the statistical skills required to ensure appropriate representation in training datasets, to the social science understanding needed to anticipate how people’s behaviour may change when interacting with the technology.</p>
<p>Although the right checks and balances to avoid potential negative societal impacts have been slower to develop than the technologies they should be regulating, some guiding principles are emerging from organisations labouring to assess with greater clarity what the real immediate and longer term hazards are, what has worked well in other sectors, and the impact of government actions so far. There is an element of urgency in the challenge. As the Partnership on AI put it, “Our current moment serves as a profound opportunity — one that we will miss if we don’t act now.”</p>
<section id="high-stakes" class="level2">
<h2 class="anchored" data-anchor-id="high-stakes">High stakes</h2>
<p>When OpenAI publicised their Voice Engine’s ability to clone human voices from just 15 seconds of audio, they too flagged the potential benefit for people with serious health conditions, since those with deteriorating speech could find a means to <a href="https://www.euronews.com/next/2024/04/01/openai-unveils-ai-voice-cloning-tech-that-only-needs-a-15-second-sample-to-work">have their speech restored</a>. However, voice clones had already been used to make robocalls to voters imitating the voice of President Joe Biden and <a href="https://news.sky.com/story/fake-ai-generated-joe-biden-robocall-tells-people-in-new-hampshire-not-to-vote-13054446">telling voters to stay at home</a>.</p>
<p>“The question you have to ask there is what’s the societal benefit of that tool? And what are the risks?” associate director at the Ada Lovelace Institute Andrew Strait told <em>Real World Data Science</em>. “They thankfully decided to not fully release it,” he adds, highlighting how the timing “right before an election year with 40 democracies across the world” could have made the release particularly problematic.</p>
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Themis, goddess of justice. External governance is required to ensure the outcomes of AI deployment are safe and just. Credit Shutterstock, Michal Bednarek">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/11/images/shutterstock_2436413315.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Themis, goddess of justice. External governance is required to ensure the outcomes of AI deployment are safe and just. Credit Shutterstock, Michal Bednarek">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Themis, goddess of justice. External governance is required to ensure the outcomes of AI deployment are safe and just. Credit Shutterstock, Michal Bednarek.
</figcaption>
</figure>
</div>
<p>While OpenAI’s voice engine might have made voice cloning more accessible had they proceeded with a full release, voice cloning is clearly still well within reach for some already. Strait cites the experiences of hundreds of performing artists in the UK over the past few months that have been brought to the attention of the Ada Lovelace Institute. “They’re brought into a room; they’re asked to record their voice and have their face and likeness scanned; and that’s the end of their career,” says Strait. The sums paid to artists on these transactions are not large either. “They are never going to be asked to come back for audition again, because they [the companies] can generate their likeness, that voice doing anything that a producer wants without any sense of attribution, further payments, or consent to be used in that way.”</p>
<p>Customer service is another sector where jobs have been threatened with replacement by generative AI chatbots. However, the technology can run into problems, since gen-AI is known to <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/11/23/LLM-content-warning.html">“hallucinate”</a>, generating false information. Air Canada recently lost a case defending its use of a chatbot that misinformed a customer that they could apply for a bereavement fare retroactively, which is not the case according to Air Canada’s bereavement fare policy. In their defence, Air Canada flagged that the chatbot had supplied a link to a webpage with the correct information, but the court ruled that there was no reason to believe the webpage information over the chatbot, or for the customer to double-check the information they had been supplied. While there are <a href="https://mitsloanedtech.mit.edu/ai/basics/addressing-ai-hallucinations-and-bias/">ways to mitigate problems</a> with gen-AI with the right teams in place, other industries have also hit problems with the accuracy and reliability of gen-AI, which may dampen the impact AI has on the labour market. All in all, the wider picture of how AI deployment may affect jobs is largely a matter of speculation. Here a US pilot scheme may soon provide a framework for a more data-informed approach to <a href="https://realworlddatascience.net/ideas/posts/2024/05/28/ai-series-5.html">tackling AI’s impact on the workforce</a>.</p>
<p>Strait highlights that conversations that centre around efficiency when weighing up the possible advantages of introducing AI can be ill informed. “If we’re talking about an allocation of resources in which we’re spending an increasing amount of money on automating certain parts of the NHS, or healthcare or the education system, or public sector services, how are we making the decisions that are determining if that is worth the value for money? Instead of investing in more doctors, more teachers, more social workers?” He tells Real World Data Science that these are the questions he and his colleagues at the Ada Lovelace Institute are often pushing governments to try to answer and evidence rather than to just assume the benefits will accrue. When it comes to measures of success of an AI model, Strait says “It’s often defined in terms of how many staff can be cut and still deliver some kind of service…This is not a good metric of success,” he adds. “We don’t want to just get rid of as many jobs as we can, right, we want to actually see improvements in care, improvements in service.”</p>
<p>Michael Katell, ethics fellow in the Turing’s Public Policy Programme and a visiting senior lecturer at the Digital Environment Research Institute (DERI) at Queen Mary University of London, suggests the problems may go deeper still when looking at the use of generative AI in the creative industries. “There are definitely parallels with prior waves of disruption,” he says, citing as an example the move to drum-based and eventually laser printing, as opposed to manual typesetting. “A key difference, though, is that, in the creative arts, we’re talking about contributions to culture, and culture is something that, I think we often take for granted.” He highlights the often overlooked role that cultural practices which enable and empower shared experiences play in holding society together. These may come in various forms, from works of art to theatre, and the working and living practices of the wider community may play an important role too. While acknowledging there may be interesting and fascinating uses of AI in art to explore, Katell adds, “If we’re not attending to maintaining some aspects, or trying to manage the changes that are happening in our culture, I think we’ll see societal level effects that are much greater than the elimination of some jobs.”</p>
</section>
<section id="the-need-for-legislation" class="level2">
<h2 class="anchored" data-anchor-id="the-need-for-legislation">The need for legislation</h2>
<p>These stakes all highlight the need for regulatory interventions. However, most governments, bar China and the EU, have so far favoured “voluntary commitments” towards AI safety, which would seem to fall short of the kind of governance over the sector that can be robustly enforced. In a recent blog, Strait, alongside the Ada Lovelace Institute’s UK public policy lead Matt Davies and associate director (Law &amp; Policy) Michael Birtwhistle, “evaluate the evaluations” of the UK’s AI Safety Institute for companies that have opted in to <a href="https://www.adalovelaceinstitute.org/blog/safety-first/">these voluntary commitments</a>. They highlight that, on the whole, the companies planning to release a product hold too much control over how the evaluation can take place, ultimately empowering them to direct tests in their favour, which inhibits efforts at robust monitoring. Furthermore, there is usually no avenue for the necessary scrutiny of training datasets. Even setting aside these limitations, Davies, Strait and Birtwhistle conclude that “conducting evaluations and assessments is meaningless without the necessary enforcement powers to block the release of dangerous or high-risk models, or to remove unsafe products from the market.”</p>
<p>The reluctance to implement firmer regulation might be attributed in some part to the perceived benefits to the state when its AI companies succeed. One perceived benefit is that the profits these companies accrue may buoy the economies of the societies they operate within. There is also the matter of national competitiveness in “AI prowess”, which stems from the potential for AI-based technology to underpin all aspects of society, prompting what has been described as an <a href="https://ainowinstitute.org/publication/a-lost-decade-the-uks-industrial-approach-to-ai">“AI arms race”</a>. Here the UK may well regret allowing Google to acquire DeepMind, whose output is responsible for bolstering the “UK’s share” of citations in the top 100 recent AI papers from 1.9% to 7.2%. However, a lack of robust regulation may prove as much a disservice to the companies releasing AI products as it is to society as a whole.</p>
<p>“The medicine sector here [in the UK] is thriving, not in spite of regulation, but because of regulation,” says Strait. “People trust that the products you develop here are safe.” Katell highlights the impact of pollution legislation on the automotive industry. “It jumped forward invention and discovery in automotive technology,” he tells <em>Real World Data Science</em>. “It seems prosaic in hindsight, but it wasn’t, it was a major innovation that was promoted by regulators, promoted by legislators.” The UK government’s chief scientific advisor Angela McLean seems to agree. “Good regulation is good for innovation,” she replied when asked about balancing regulation with favourable conditions for a flourishing AI sector at an Association of British Science Writers’ event in May. “We’re not there yet,” she added. The challenge is pinning down what good regulation looks like.</p>
</section>
<section id="regulatory-ecosystems" class="level2">
<h2 class="anchored" data-anchor-id="regulatory-ecosystems">Regulatory ecosystems</h2>
<p>As has been emphasised throughout the series, making a success of an AI project requires a unique skillset that <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/ai-series-6.html">combines expertise in AI with the domain expertise for the sector</a> the project is contributing to, and there is often a dearth of people who straddle both camps. The same hunt for “unicorns” – people with useful expertise spanning the tech sector and policymaking – can also be an obstacle to developing “good regulation”. One solution is to bring people from the different disciplines together to develop legislation collaboratively, as was arguably the case with the rollout of the General Data Protection Regulation (GDPR) in 2018. “Policymakers and academics, they worked very closely together in the crafting of that law,” says Katell. “It was one of those rare moments in which we saw the boundaries really dissolve between policy and academia in a way that delivered something that I think we can agree was largely a positive outcome.”</p>
<p>When it comes to AI, an obstacle to that kind of collaboration has been the lack of a common language. In “Defining AI in Policy versus Practice”, published in 2020<sup>1</sup>, Katell, alongside Peaks Krafft at the University of Oxford and co-authors, found that AI researchers favoured definitions of AI that “emphasise technical functionality”, whereas policymakers tended towards definitions that “compare systems to <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/04/29/gen-ai-human-intel.html">human thinking and behavior”, which AI systems remain far from achieving</a>. Strait also highlights a recurring tendency among those without experience of actually building AI systems to oversell AI capabilities, suggesting it will “help solve climate change” or “cure cancer”. “How are you measuring that?” he asks. “How are we making a clear sense of the efficacy, the proof behind those kinds of statements? Where are the case studies that actually work, and how are we determining that’s working?”</p>
<div id="fig-2" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Safety first. External governance is required to ensure the outcomes of AI deployment are safe and knock on effects have been considered. Credit Shutterstock, 3rdtimeluckystudio">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/11/images/shutterstock_2180417651.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Safety first. External governance is required to ensure the outcomes of AI deployment are safe and knock on effects have been considered. Credit Shutterstock, 3rdtimeluckystudio">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Safety first. External governance is required to ensure the outcomes of AI deployment are safe and knock-on effects have been considered. Credit: Shutterstock. Photo by 3rdtimeluckystudio.
</figcaption>
</figure>
</div>
<p>As Krafft <em>et al.</em> point out in their 2020 paper, such exaggerated perceptions of AI capabilities can also hamper regulation. “As a result of this gap,” they write, “ethical and regulatory efforts may overemphasise concern about future technologies at the expense of pressing issues with existing deployed technologies.” Here a better <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/04/22/ai-series-1.html">understanding of what AI is</a> can help focus attention on the problems that exist now – not just the potential <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/28/ai-series-5.html">workforce impact</a>, but the carbon cost of training large language models, activities like nonconsensual gen-AI porn aggravating online gender inequality, and a widening digital divide disadvantaging pupils, workers and citizens who cannot afford the latest AI tools.</p>
<p>Fortunately, there has already been progress in bridging the language divide between policymakers and the tech sector. “The current definitions [championed in policy circles] say things like technologies that can perform tasks that require intelligence when humans do them,” says Katell, which he describes as a far more sober and realistic definition than likening technologies to the way humans think and work. “This is really important,” he adds. “Because some of the problems that we see with AI now are symptomatic of the fact that they’re not humans and that they don’t have the same experience of the world.” As an example he describes someone driving a car with a child in the car seat, calling on all their training and experience of road use to navigate roads and other traffic, while juggling their attention between driving and the child. “Things that AI is too brittle to accomplish,” he adds, highlighting how a simple model may identify school buses in images quite impressively until it is presented with an image of a bus upside down. “The flexibility and adaptability, the softness of human reason, is actually its strength, its power.”</p>
<p>Getting everybody on the same page can also help provide a more multi-layered approach to governance. Empowering independent assessors of AI product safety prior to release is one thing, but as Strait points out, “It could be more like the environmental sector, where we have a whole ecosystem of environmental impact assessments, organisations and consultancies that do this kind of work for different organisations and companies.” Internal teams within companies can play an important role too, so long as they operate with sufficient independence from the companies themselves. When set up with the right balance of expertise, they can be better placed to understand, and hence assess, the technology and the practical elements of its implementation. Although such teams can be expensive, getting the technical evaluation and the consideration of ethical issues right can offer a competitive advantage for the companies themselves as well as providing a more thorough safeguard for society at large. Nonetheless, there are also obvious advantages in having external regulatory bodies, which do not need to take into account a company’s profit margins or shareholders’ needs. An ideal setup might incorporate both approaches. In fact, in their appraisal of the current UK AI Safety Institute arrangement, Davies, Strait and Birtwhistle first highlight the need to integrate the AI Safety Institute “into a regulatory structure with complementary parts that can provide appropriate, context-specific assurance that AI systems are safe and effective for their intended use.”</p>
</section>
<section id="prosperity-for-all" class="level2">
<h2 class="anchored" data-anchor-id="prosperity-for-all">Prosperity for all</h2>
<p>With all the precedents in other sectors, from environmental impact checks to pharmacology, an organised framework or ecosystem for robust, independent and meaningful evaluation of AI product safety seems imperative, albeit potentially expensive. (Davies, Strait and Birtwhistle cite £100 million a year as a typical cost for safety-driven regulatory systems in the UK<sup>2</sup>, and the expertise demands of AI could further increase costs.) However, such regulatory reform will likely slow the pace of technological development and lengthen the route to market. While the breathing space to adjust to the societal changes these technologies bring may be welcomed by some, the delay can be quite unpopular in a tech sector famed for its “move fast and break things” ethos. As Katell points out, that ethos is based on the notion that the things being broken are unimportant – when it’s vulnerable people and societies, that is “unacceptable breakage”.</p>
<p>Strait also highlights the cultural mismatch between the companies developing AI products – where the research-to-market pipeline is extremely fast – and the sectors those tools are intended to serve, such as social care, education and health. Although OpenAI eventually decided against full release of the Voice Engine, when it comes to the ethos of some AI technology companies, “The default is to put things out there and to not think through the ethical and societal implications,” says Strait, who has past experience of working for a company producing AI tools. “I think it’s so critical for data scientists and ethicists to explore, and do that translation and interrogation of what are the ethics of the sector that we’re working in?”</p>
<p>Katell voices a concern shared by many: that at present AI is under the control of a small handful of very large, powerful technology companies, and as a result the AI releases making the most impact target the agendas of the companies releasing them and their current and anticipated customer base, as opposed to the needs of society. The potential for such large tech agents to become too big to fail poses additional regulatory challenges. While many may lament the tension between a demand for open source data sets for testing AI models and the need to respect data privacy, security and confidentiality, there have already been widely reported instances where certain companies may <a href="https://www.bloomberg.com/news/articles/2024-04-04/youtube-says-openai-training-sora-with-its-videos-would-break-the-rules?embedded-checkout=true">not have met expectations for respecting copyright and terms of service</a>. In fact, the tech giants are not the only ones developing AI models, and the open source community has been known to provide valuable competition that may temper the tendency for AI to concentrate power in the hands of a few<sup>3</sup>. However, open source developers can also pose a certain amount of <a href="https://datainnovation.org/2024/03/the-eus-ai-act-creates-regulatory-complexity-for-open-source-ai/">regulatory complexity</a>.</p>
<p>There is also an argument that these efforts should broaden their scope beyond baseline AI safety and focus AI development on tools that actively promote greater wellbeing and prosperity for the many. “We need to bring in other values like fairness, justice, and simple things like explainability, gender equity, racial equity,” says Katell, highlighting some of the other qualities that demand attention. Taking explainability as an example, there is increasing awareness of the need to understand how certain outputs are reached in order for people <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/ai">to feel comfortable with the technology</a>, and the outputs requiring explanations differ from person to person. Although it can be hard to explain AI outputs, progress is being made in this direction. As Katell says, “We’re not helpless in managing these types of disruptions. It’s a matter of societies coming together and deciding that they can be managed.”</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Anna Demming</strong> is a freelance science writer and editor based in Bristol, UK. She has a PhD from King’s College London in physics, specifically nanophotonics and how light interacts with the very small, and has been an editor for Nature Publishing Group (now Springer Nature), IOP Publishing and New Scientist. Other publications she contributes to include The Observer, New Scientist, Scientific American, Physics World and Chemistry World.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. <!-- Add thumbnail image credit and any licence terms here --></p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Demming, Anna. 2024. “Ensuring new AI technologies help everyone thrive.” Real World Data Science, June 11, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/11/ai-series-7.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Krafft, P. M., Young, M., Katell, M., Huang, K. &amp; Bugingo, G. <a href="https://dl.acm.org/doi/abs/10.1145/3375627.337583">Defining AI in Policy versus Practice</a> <em>AIES ’20: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society</em> 72-78 (2020)↩︎</p></li>
<li id="fn2"><p>Smakman, J, Davies, M. &amp; Birtwhistle, M. <a href="https://www.adalovelaceinstitute.org/policy-briefing/ai-safety/">Mission critical</a> <em>Ada Lovelace Policy Briefing</em> (2023)↩︎</p></li>
<li id="fn3"><p><a href="https://www.semianalysis.com/p/google-we-have-no-moat-and-neither">Google “We Have No Moat, And Neither Does OpenAI</a> <em>semianalysis.com</em> (2023) (semianalysis.com)↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>AI ethics</category>
  <category>Regulation</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/06/11/ai-series-7.html</guid>
  <pubDate>Tue, 11 Jun 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/11/images/shutterstock_2436413315.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>AI series: What is “best practice” when working with AI in the real world?</title>
  <dc:creator>Anna Demming</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/ai-series-6.html</link>
  <description><![CDATA[ 





<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/SoOoj9iUTM0" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>Over the course of the Real World Data Science AI series, we’ve had articles laying out the nitty-gritty of what AI is and how it works – or at least how to get an explanation for its output – as well as burning issues around the data involved, evaluating these models, ethical considerations, and gauging societal impacts such as changes in workforce demands. The ideas in these articles give a firm footing for establishing what best practice with AI models should look like, but there is often a divide between theory and practice, and the same pitfalls can trip people up again and again. Here we discuss how to wrestle with real world limitations and flag these common hazards.</p>
<p>Our interviewees, in order of appearance, are:</p>
<p><strong>Ali Al-Sherbaz</strong>, academic director in digital skills at the University of Cambridge in the UK</p>
<p><strong>Janet Bastiman</strong>, chief data scientist at Napier and chair of the Royal Statistical Society Data Science &amp; AI Section</p>
<p><strong>Jonathan Gillard</strong>, professor of statistics/data science at Cardiff University, and a member of the Real World Data Science Board</p>
<p><strong>Fatemeh Torabi</strong>, senior research officer and data scientist, health data science at Swansea University, and also a member of the Real World Data Science board</p>
<p><strong>It is often said that while almost everybody is now trying to leverage AI in their projects, most AI projects fail. What nuggets of wisdom do the panel have for swelling the minority that succeed with their AI projects, and what should you do before you start doing anything?</strong></p>
<p><strong>Ali Al-Sherbaz</strong>: It’s not easy to start, especially for people who are not aware how AI works. My advice is, first, they have to understand the basics of how AI works, because the expectation could be overpromising, and that is a danger. Just 25 years ago, a master’s dissertation might be about developing a simple – we call it simple now, but it was a master’s project 25 years ago – a simple model with a neural network of a combination of nodes to classify data. Whatever the data is – it could be drawing shapes, simple shapes: square, circle, triangle – just classifying them was worth an MSc. Now, kids can do it. But that is not the same as understanding what the neural network or the AI is. It’s a matrix of numbers, and the learning process runs multiple iterations to find the best combination of these numbers – sums of products – to classify, to do something, and to train them for a certain situation; that is supervised learning. Over the last 25 years – especially in the last 10 years – computational power has got better, so AI now works better.</p>
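<p>To make that concrete, here is a minimal sketch – our illustration, not Al-Sherbaz’s – of a neural network as “a matrix of numbers” tuned by repeated iteration, on an invented toy task (classifying whether 2D points fall inside a circle):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))              # toy 2D inputs
y = ((X**2).sum(axis=1) &lt; 0.5).astype(float)       # label: inside a circle?

# The network really is just matrices of numbers:
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for step in range(2000):                           # iterative updates
    H = np.tanh(X @ W1 + b1)                       # sums of products
    p = sigmoid(H @ W2 + b2).ravel()               # predicted probability
    d2 = (p - y)[:, None] / len(X)                 # cross-entropy gradient
    dW2, db2 = H.T @ d2, d2.sum(0)
    d1 = (d2 @ W2.T) * (1 - H**2)                  # backpropagate through tanh
    dW1, db1 = X.T @ d1, d1.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2                 # nudge the numbers
    W1 -= lr * dW1; b1 -= lr * db1

print(f"training accuracy: {((p &gt; 0.5) == y).mean():.2f}")</code></pre>
<p>Twenty-five years ago this was dissertation material; today it is a few lines of numpy. Running it, as Al-Sherbaz says, is not the same as understanding it.</p>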
<p>There are other things people have to learn. There’s the statistics as well, and of course people who would like to work in AI and data science must understand the data – they should be experts in the data itself. For instance, I can talk about cybersecurity, I can talk about networking and other things, but if it comes to something regarding health data, or financial services, or stock markets, I’m not an expert in the data. So I’m not going to be actively working on those things even if I use the same AI tools. In a nutshell, this is why I think some people fail using AI and others succeed. And we should emphasise the human value. The AI is there, and it exists to help us make a better, more accurate decision, but the human value is still there. We have to insist on that.</p>
<p><strong>Janet Bastiman</strong>: I would just like to build on all of that great stuff that Ali’s just said. When you look at the non-data scientist side of it, you often get businesses who think AI can solve a certain problem. They might go out and hire a team – whether that’s directly or indirectly – and get them to try and solve a problem that, as Ali said, they may not have the domain expertise for. The business might not even have the right data for it, and AI might not even be the right way of solving that problem. I think that’s one of the fundamental things to think about – really understanding what you’re trying to solve, and how you’re going to solve it, before you start throwing complex tools and potentially very expensive teams at the problem.</p>
<p>When you look at a lot of the failures, it’s been because businesses have just gone, we can solve this problem, I’m just going to hire a team and let these intelligent people look at something. And then they’re restricted on the data that they’ve got, which won’t even answer the question; they’re restricted on the resources they have; and even restricted in terms of wider buy-in from the company. So really understand: what is it that you want to solve? What are you trying to do? Is AI the right thing? And can you even do it with the resources you have available? I think that’s a fundamental starting point. Because you can have wonderful experts, who have that domain knowledge, who understand the statistics, and all that essential stuff that Ali just said. But then from a business point of view, if you don’t give them the right data to work on, or you don’t let them do their job and tell you when they can’t do their job, then again, you’re going to be doomed to failure.</p>
<p><strong>Jonathan Gillard</strong>: Explainability is a big issue when it comes to AI models, as well. They are, at the moment, very largely “black box” – data goes in, these models get trained on that data, and answers get popped out. And when it works, well, it works fabulously well. And we’ve seen lots of examples of that happening. But often for business, industry or real life, we want to learn. We want to understand the laws of the universe, and to understand the reasons why this answer came about. Because this explainability piece is missing – because everything is hidden away almost – I think that’s a big issue in successful execution. And particularly when it comes to industries where there’s a degree of regulation as well, if you can’t explain how a particular input led to a particular output, then how can you justify to regulatory bodies that what you’ve got is satisfactory, ethical, and that you’re learning and you’re doing things in the right way?</p>
<p><strong>There have been efforts at trying to get explanations from these models. How do you think things are progressing there?</strong></p>
<p><strong>JG</strong>: Yeah, that’s a good question. I think where we are with explainability is in very simple scenarios, very simple models. This is where traditional statistical models do very well. There’s an explicit model which says if you put these things inside then you’ll get this output. So [for today’s AI] I think we’re actually very far away from having that complete explainability picture, particularly as we fetishise more and more grand models. The AI models are only getting bigger, more complex, and that makes the explainability per se even more challenging. And that’s why I think, as Ali says, at the moment, the human in the loop is absolutely crucial.</p>
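<p>As a small illustration of that contrast (our example, not Gillard’s), a classical linear model makes the input-to-output mapping explicit – each fitted coefficient is a directly readable statement about the model. The variables and values here are invented:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(20, 70, n)                # hypothetical predictors
dose = rng.uniform(0, 10, n)
outcome = 0.5 * age - 2.0 * dose + rng.normal(0, 3, n)

# Ordinary least squares: an explicit, inspectable model
X = np.column_stack([np.ones(n), age, dose])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"intercept       : {beta[0]:+.2f}")
print(f"effect per year : {beta[1]:+.2f}  (true value +0.50)")
print(f"effect per dose : {beta[2]:+.2f}  (true value -2.00)")</code></pre>
<p>No comparable one-line reading exists for the billions of weights in a large neural model, which is the gap Gillard describes.</p>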
<p>What AI does share with classical statistics (or classical data science if you want to call it that) is it can still only be as good as the data that’s put into it – that’s still a fundamental truth. I think one of the assumptions currently made about AI models – and this is where there could be a few trip-ups – is that they can create something from nothing. It’s “artificial intelligence” – the wording almost suggests it’s artificial. But fundamentally, we still need a robust, reliable and comprehensive source of data in order to train these models in the first place.</p>
<p><strong>In terms of having outsourced expertise for these projects– does that make more problems if you’re then trying to understand what this AI has done?</strong></p>
<p><strong>JB</strong>: Oh, hugely. Take domain expertise – that’s something Ali touched on – you’ve got to understand your data. Because even that fundamental initial preparation of data before you try and train anything is absolutely crucial – really looking at: where are the gaps? Where are the assumptions? How is this data even being collected? Has it been manipulated before you got to it? If you don’t understand your industry well enough, you won’t know where those pitfalls might be – and a lot of teams do this, they just take the data, put it in, turn the handle, and out comes something that looks like it’s okay. Because they’re not putting that effort in to really understand those inputs and what the models are doing – they’re just turning the handle until they get something that feels about right – what they miss is where it goes wrong. And there are some industries where the false positives and false negatives from classification, or the bad predictions, really have a severe human impact. If you don’t understand what’s going in, and the potential impact of what comes out, then it’s very, very easy to just churn these things out and go, “it’s 80% accurate, but that’s fine” without really understanding the human impact of the 20% [that it gets wrong].</p>
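<p>That “80% accurate” trap is easy to demonstrate with invented numbers – a sketch of ours, not Bastiman’s: a lazy model can post a respectable headline accuracy while missing nearly every case that matters.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(2)
y_true = (rng.random(1000) &lt; 0.10).astype(int)   # 10% of cases are positive
y_pred = (rng.random(1000) &lt; 0.05).astype(int)   # near-constant "model"

accuracy = (y_true == y_pred).mean()
fn = ((y_pred == 0) &amp; (y_true == 1)).sum()       # real cases missed
fp = ((y_pred == 1) &amp; (y_true == 0)).sum()       # false alarms

print(f"headline accuracy : {accuracy:.0%}")     # looks respectable...
print(f"false negatives   : {fn} of {y_true.sum()} real cases missed")
print(f"false positives   : {fp} cases wrongly flagged")</code></pre>
<p>In a sector where each false negative is a missed disease or an undetected fraud, the headline number says almost nothing about the human impact.</p>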
<p>Going back to what Jon said about that explainability, it’s so crucial. It is challenging, and it is difficult, but going from these opaque systems to more transparent systems – we need that for trust. As humans, we divulge our trust very differently, depending on the impact. One of the examples I use all the time is weather prediction – we don’t really care too much, because it’s not got a huge impact. But when you look at financials or medicals, we really, really want to know that that output is good, and how we got to that output. The Turing Institute has come out with some great research that says: as humans, if we want to understand why another human has told us something, then we want the same thing from the models, and that can vary from person to person. So building that explainable level into everything we do has to be one of the things we think about upfront. But you’ve got to really, truly, deeply understand that data. And it’s not just a question of offloading a data set to a generalist who can turn that handle, otherwise you will end up with huge, huge problems.</p>
<p><strong>Fatemeh Torabi</strong>: I very much agree with all the points that my colleagues raised. I also think it’s very important that we know why we are doing things. Having those incremental stages in our planning for any project, and then having a vision of where we see AI contributing to this process and giving us further efficiency – and how – is very important. If we don’t have defined measures to see how this AI algorithm is contributing to this specific element of the project, we can get really lost bringing these capabilities on board. Yes, it might generate something, but how we are going to measure that something is very important. I think, as members of the scientific community, we must all view AI as a valuable tool. However, it has its own risks and benefits.</p>
<p>For example, in healthcare, when we use AI for risk predictions, it can be a really great tool to aid clinicians and save time. However, at each stage, we need to assess the data quality, how these data are fed into the algorithm, what procedures and what models we use, and how we generate those models. And then, which discriminative models do we use to balance the risk and eventually predict the risk of outcomes in patients? It’s very much a balance between risks and benefits that determines the usefulness of these tools in practice. We have all these brilliant ideas of what best practice is. But in real terms, sometimes it’s a little bit tricky to follow through.</p>
<p><strong>Could you give us some thoughts on the sort of best practice with data, for example, that doesn’t quite turn out to be quite so easy to follow in practice, and what you might do about it?</strong></p>
<p><strong>FT</strong>: We always call these AI algorithms “data-hungry” algorithms, because the models that we fit need to see patterns in the data that we feed into them so that the learning happens. And then the discriminative functions come into place to balance and give a score to wherever the learning is happening, and give an evaluation of each step. However, the data that we put into these algorithms comes first – the quality of that data. Often in healthcare, because of its sensitivity, the data is held within a secure environment. So we cannot, at this point in time, expose an AI algorithm to a very diverse example, specifically for investigating rare diseases or rare conditions. And above that, there are also complexities in the data itself. We need to evaluate and clean the data before we feed it into these algorithms. We need to evaluate the diversity of the data itself – for example, tabular data, imaging data, genomic data – and each one requires its own tailored approach at the data cleaning stage.</p>
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="The panel. Clockwise from top left: Ali Al-Sherbaz, Janet Bastiman, Fatemeh Torabi and Jonathan Gillard">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/images/panel-991.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="The panel. Clockwise from top left: Ali Al-Sherbaz, Janet Bastiman, Fatemeh Torabi and Jonathan Gillard">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: The panel. Clockwise from top left: Ali Al-Sherbaz, Janet Bastiman, Fatemeh Torabi and Jonathan Gillard
</figcaption>
</figure>
</div>
<p>We also have another level that is now being explored in the health data science community, which is the generation of synthetic data. We can give AI models access to synthetic versions of the data that we hold. However, that also has its own challenges, because it requires reading the patterns from real data and then recreating those patterns in synthetic versions of the data.</p>
<p>For example, Dementias Platform UK is one of the pioneers in developing this. We hold a range of cohort data, patients’ data, genomics data and imaging data. For each of these, when we try to develop those processing algorithms, there are specific tailored approaches that we need to consider to ensure we are creating low-fidelity data that holds enough of the patterns in it for the AI algorithm to allow the learning to happen. However, we also need to consider whether it is safe enough to be released for use at a lower governance level compared to the actual data. So there are quite a lot of challenges, and we captured a lot of them in our <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/ai-series-3.html">article</a>.</p>
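<p>A deliberately crude sketch of the trade-off described here – enough pattern for learning, little enough detail for safety. Real health-data pipelines are far more sophisticated; the variables and numbers below are invented:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(3)
# Stand-in for a sensitive tabular dataset:
age = rng.normal(65, 10, 500)
bp = 60 + 0.9 * age + rng.normal(0, 8, 500)   # blood pressure, linked to age
real = np.column_stack([age, bp])

# Low-fidelity synthesis: resample each column independently.
# Marginal distributions survive; cross-column patterns do not,
# which lowers disclosure risk but also analytic utility.
synthetic = np.column_stack(
    [rng.choice(real[:, j], size=500) for j in range(real.shape[1])]
)

print("means, real      :", real.mean(axis=0).round(1))
print("means, synthetic :", synthetic.mean(axis=0).round(1))
print("age-bp correlation, real      :", round(np.corrcoef(real.T)[0, 1], 2))
print("age-bp correlation, synthetic :", round(np.corrcoef(synthetic.T)[0, 1], 2))</code></pre>
<p>Choosing how much structure to preserve – marginals only, or joint patterns too – is exactly the fidelity-versus-governance balance Torabi describes.</p>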
<p><strong>A A-S</strong>: I can talk about cybersecurity and the related network security data – the point being the amount of data we receive to analyse. It’s really huge. And when I say huge, I mean about one gigabyte in a couple of hours, or one terabyte in a week – that’s huge. One gigabyte of a text file – if I printed this file out on A4, it would leave me with a stack of paper three times the height of the Eiffel Tower.</p>
<p>Now, if I have cyber traffic and try to detect any cyber attack, AI helps with that. However, if we train these models properly, they have to detect cyber attacks in real time – when I say real time, we’re talking about within microseconds or a millisecond – and the decision has to be correct. AI alone doesn’t work, doesn’t help. Humans should also intervene, but rather than having 100,000 records to check for a suspected breach, AI can reduce that to 100. A human can interact with that. And then, in terms of the authentication or verification, humans alongside AI can learn whether this is a false positive, a real attack, or a false negative. This is a challenge in the cybersecurity area.</p>
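<p>The triage pattern Al-Sherbaz describes – shrinking 100,000 records to a human-sized short list – might look something like this sketch, assuming scikit-learn and entirely synthetic “traffic records”:</p>
<pre><code class="language-python">import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
# 100,000 synthetic traffic records, three made-up features each
records = rng.normal(0, 1, size=(100_000, 3))
records[:50] += 6                            # plant 50 out-of-pattern records

model = IsolationForest(random_state=0).fit(records)
scores = model.score_samples(records)        # lower score = more anomalous

shortlist = np.argsort(scores)[:100]         # top 100 for the human analyst
print(f"analyst reviews {len(shortlist)} of {len(records)} records")
print(f"planted anomalies in the shortlist: {(shortlist &lt; 50).sum()}/50")</code></pre>
<p>The human verification loop – confirming false positives and real attacks – would sit downstream of a shortlist like this one.</p>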
<p><strong>JB</strong>: I just wanted to dive in from the finance side – again the data is critical, and we have very large amounts of data. However in addition – and I think we probably suffer from the same sort of problem that Ali does in this – when I’m trying to detect things, there are people on the other side actively working against what I’m trying to detect, which I suppose is a problem that maybe Fatemeh doesn’t have in healthcare.</p>
<p>When you’re trying to build models to look for patterns, and those patterns are changing underneath you, it can be incredibly difficult. I have an issue that all of my clients’ data legally has to be kept separated – some of it has to be kept in certain parts of the world, so we can’t put it all in one place. We can try and create synthetic data that has the same nuances as the snapshots we can see at any one point in time, and we can try and put that together in one place, but what we can detect now will very quickly not be what we need to detect in a month’s time. As soon as transactions start getting stopped, as soon as suspicious activity reports are raised and banks are fined, everything switches, and how all of that financial crime occurs changes. And it’s changing on a big scale worldwide, but also subtly, because there is a team of data scientists on the other side trying desperately to circumvent the models that my team and I are building. It’s absolutely crazy. So while I would love to be able to pull all of the data that I have access to into one place and get that huge central view, legally I can’t do that because of all the worldwide jurisdictional laws around data and keeping it in certain places.</p>
<p>Then I’ve also got the ethical side of it, which is something that Fatemeh touched on. If I get it wrong, that can have a material impact on, usually, some of the most marginalised in society. The profile of some of the transactions that are highly correlated with financial crime is also highly correlated with people in borderline poverty, even in Western countries. So false positives in my world have a huge, huge ethical impact. But at the same time, we’re trying really hard to minimise those false negatives – that balance is critical, and the data side of it is such a problem.</p>
<p>Fatemeh mentioned the synthetic side of it. There’s a huge push, particularly in the UK, to get good synthetic data to really showcase some of these things that we’re trying to detect. But by the time you get that pooling and synthesising of data that you can ethically use and share around without fear of all the legal repercussions, what we’re trying to detect has already moved on. So we’re constantly several steps behind.</p>
<p>I imagine Ali has similar problems in the cybercrime space in that as soon as things are detected, the ways in which they work move on. So there’s an awful lot I think that, as an industry, although we have different verticals, we can share best practices on.</p>
<p><strong>Is there a demand for new types of expertise?</strong></p>
<p><strong>A A-S</strong>: There is a huge gap, in the UK at least and worldwide, in finding people to work as data scientists or to work with data. So we created a course in Cambridge, which we call the data science career accelerator, for people who work in data and would like to move on and learn more. We did market research, and we interviewed around 50 people, from CEOs to heads of security and heads of data science, in science departments and in industry, to tell us: what kind of skills are you after? What problems do you currently have? And then we designed this course.</p>
<p>We found that, first of all, there are people who don’t know where to start – what kind of data they need, what tools they have to learn with… Even if they learn the tools, they still need to learn what kind of machine learning process to use. And then suddenly ChatGPT turned up, and LLM [large language model] development – fitting all of that in one course is a real challenge.</p>
<p>The course has started now, with its first cohort. The big advice we have from industry is that during the course the students have to work on real world case studies, on scenarios with data that nobody has touched before – that is, it’s new, not public. We teach them on public data, but companies also have their own data, and we get consent from them to use that data for the students, so we can test the skills they have learned on virgin data that nobody has touched before.</p>
<p>We just started this month, and the students are going to start on the first project now. They are enjoying the course, but that is the challenge we have now. How do we handle it? By working side by side with industry, even during delivery. We have an academic from Cambridge, and we have experts from industry to support the learners, so they get the best of both worlds.</p>
<p><strong>The industry has changed so much in the last couple of years. Does that mean that the expertise and demands are also changing very quickly or is there a common thread that you can work with?</strong></p>
<p><strong>A A-S</strong>: Well, there is a common thread, but there are new tools – I mean, Google just released Gemini, and that’s a new skill the students have had to learn and be tested on, looking into how others feel about it and comparing it to ChatGPT, or Claude 3, or Copilot. That’s all happened in the last 12 months. And then, of course, reacting to that, reflecting on the material, teaching the material – it’s a challenge. It’s not easy, and you need to find the right person. Of course, people who have this kind of experience are in demand, and it’s hard to secure these kinds of human resources as well as to deliver the course. So there are challenges, and we have to act dynamically and be adaptive.</p>
<p><strong>What are your thoughts on the evaluation of these models, how to manage the risk of something that you haven’t thought of before, and the role of regulation?</strong></p>
<p><strong>JG</strong>: I think a lot of our discussions at the moment are assuming that we’ve got well meaning, well intentioned people and well meaning, well intentioned companies and industries, who are trying to seek to do their best ethically and regulatorily and with appropriate data, and so on. But there is a space here for bad actors in the system.</p>
<p>Unfortunately, digital transformation of human life will happen in good and bad ways – I think there are going to be those two streams to this. Individuals are very capable now of making their own large language models by following a video guide if they want to, and having that data is, of course, maybe going to enable them to do bad things with it.</p>
<p>Data is already a commodity in quite a strong way, but I do think we have to revisit data security, and even the risks of open data as well. We live in a country which, I think, does very well in producing lots of publicly available data. But that could be twisted in a way that we might not expect. And when I speak of those things, we’re usually thinking of groundwork – writing and implementing your own large language models – but there were recent examples where, just by using very clever prompting of existing large language models, you could get quite dangerous material, shall we say, which circumvented inbuilt safeguards. Again, that’s an emerging thing that we have to try and address as it comes on.</p>
<p>I think my final point with ethics and regulation is it will rapidly evolve, and it will rapidly change. And a story which I think can illustrate that is, when the first motorcar was introduced into the UK, it was law for a human to walk in front of the motorcar with a large red flag to warn passers-by of the incoming car because people weren’t really familiar with it. Now, of course, that’s in distant memory, right? We don’t have people with red flags, walking in front of cars. I do wonder, in 20 years or 50 years, what will the ethical norms regarding AI and its use be? Likewise, will we have deregulation? That seems to be the common theme in history that when we get more familiar with things, we deregulate because we’re more comfortable with their existence. That makes me quite curious about what the future holds.</p>
<p><strong>FT</strong>: Jon raised a very interesting point, and Janet touched upon keeping financial data in silos, but we are facing this in healthcare as well. Data has to be checked within a trusted research environment or secure data environment, and that creates data silos. However, efforts at this point in time are focused on enhancing these digital platforms to bring data together and federate data. Alongside progress towards new ethical or legal requirements, we are documenting what is being practised at the moment, because at the moment there are quite a lot of bubbles: each institution has its own data and applies its own rules to it. So understanding what we are currently working with – the data flows into the secure environments – is building the basis of developments in standardisation and common frameworks. A lot of projects have been focused on understanding the current situation in order to develop on it for the future.</p>
<p>We know, for example, that the Data Protection Act put forward some specific requirements, but that was developed in 2018, before AI became this massive consideration. In my academic capacity as well, we are facing what Jon mentioned, in terms of the diversity of assessments for students. For example, when we ask these questions, even if the data is provided within the course and within this defined governance, we know that the answers can possibly be aided by an AI model. So we are defining more diverse assessment methods in academic practice, to ensure that we have a way to evaluate the outcome we are receiving by the human eye, rather than being blinded by what we receive from AI and then calling it high quality output, whether in research practice or in academic practice. So there’s quite a lot of consideration of these issues, which I think is bringing our past knowledge to the current point, where we now have to balance human and machine interactions in every single process that we are facing.</p>
<p><strong>How does this change the skill set required of data scientists, as AI is getting more and more developed?</strong></p>
<p><strong>A A-S</strong>: Regarding the terminology of data scientists: when we talk about data, we immediately link that with statistics, and statistics is an old topic. There has been an accumulation of expertise in statistics for 100 years or more, to the best of my knowledge, and people who are new to data analysis have to learn about this legacy. When we develop the course, we should cover these skills in statistics and build this knowledge on top – that is, when we reach the right point, then we talk about learning or machine learning, supervised and unsupervised, and about LLMs – these are the new skills they have to learn. As I mentioned, it’s tricky: when we teach learners, we have to provide them with simple datasets to teach them something complex in statistics, because it’s a danger to teach both [data and statistics at the same time] – we will lose them, they will lose concentration, and it’s hard to follow up. So, a little bit of statistics – they have to learn the basics, like the normal distribution, the types of distribution, and what it means when we have these distributions: the meaning of the data. And that is the point I made earlier about how people should have a sense for the numbers. What does it mean when I say 0.56 in healthcare? Is that a danger? 60% – is that OK? In cybersecurity, if the probability of attack today is 60%, should I inform the police? Should I inform someone; is that important? Or, for example, for the stock market, say we have a drop of 10% – is that something we have to worry about? So making sense of the numbers is part of it.</p>
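<p>One way to formalise that sense-making (our sketch, with invented costs): the same 60% means different things in different sectors because the threshold for acting depends on the relative costs of a false alarm and a missed case.</p>
<pre><code class="language-python">def act_threshold(cost_false_alarm, cost_missed_case):
    """Acting minimises expected cost when the predicted
    probability exceeds C_fa / (C_fa + C_miss)."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_case)

p = 0.60  # the same model output in every sector

for action, c_fa, c_miss in [
    ("weather: carry an umbrella", 1, 2),
    ("cybersecurity: escalate the alert", 1, 50),
    ("healthcare: order a follow-up test", 1, 200),
    ("finance: freeze the account", 10, 5),
]:
    t = act_threshold(c_fa, c_miss)
    print(f"{action:36s} threshold {t:.2f}  act? {p &gt; t}")</code></pre>
<p>On these invented costs, 60% comfortably justifies an umbrella, an alert and a follow-up test, but not freezing an account – where a false alarm itself does real harm.</p>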
<p>That is part of personalised learning, because it depends on their background and what they have learned – it’s not straightforward, and it has to be personalised. It is not just for someone who is 18 years old coming from their A levels; it’s for a wide range. People from diverse backgrounds like to approach this data science course. And now we are in an era where people in social science, engineering, medicine, journalism and art are all interested in learning a little bit of data science and utilising AI for their benefit. So there is no one answer.</p>
<p><strong>You emphasise that people still need to be able to make sense of numbers. We’re often told that AI will devalue knowledge and devalue experience – it sounds like you don’t feel that’s the case.</strong></p>
<p><strong>A A-S</strong>: I have to stick with the following: human value is just that – value. AI without humans is worth nothing. I have one example: in 1997, software was developed to play chess against a human, and for the first time that computer programme (which we would call AI now) beat Kasparov. Guess what happened? Did chess disappear? No, we still value human-to-human competition. The value of the human is the same for art and for music. So we still have human value, and we have to maintain that for the next generation. They shouldn’t lose this human value and hand over to AI value, which I feel is zero without the human.</p>
<p><strong>J B</strong>: I think one of the things we are seeing is diversity in the backgrounds of people coming into data science, which is fantastic, because I think that really helps with understanding when things can go wrong, and how things can be misused. If you have this cookie cutter set of people who have all got a degree from the same place and all had very similar experience – this happens a lot in the financial industry, where there are about five universities that all feed into the banks – they all think and solve problems in the same way, because that’s how they’ve been trained. But as soon as you start bringing in people with different backgrounds, they’re the ones that say, hang on, this is a problem. So having those different backgrounds is really useful.</p>
<p>But then, as Ali said, there are so many people who call themselves data scientists who don’t understand data, or science. And I think he was absolutely right. If you’ve got a probability of 60%, or you’ve got a small standard deviation, when is that an issue? What do you really understand about that, based on your industry and based on your statistical knowledge? That’s so, so key. And it’s something that a lot of people who are self-trained and call themselves data scientists have missed out on. So coming back to your original question about whether it is harder or easier: in some respects, it’s a lot harder, because someone who calls themselves a data scientist now needs to do everything from fundamental research and trying to make models better, to statistics, machine learning, engineering, productionisation, efficiencies, effectiveness, ethics – it’s this huge, huge sphere. And it’s too much for one person. So you’ve really got to have well-balanced teams and support, because you can’t keep on top of your game across all of those. It’s just not possible. So I think that becomes really difficult. When I look at how things have changed, there are so many basic principles from, you know, the 80s and 90s, in standard, good quality computer programming and testing. And I think the one thing that we’re really missing as an industry is a specialist AI testing role: someone who understands enough about how models work and how they can go wrong, and can do the same thing for AI solutions as good QA analysts do for standard software engineering. Someone who can really test them to extremes – what happens when I put the wrong data in?</p>
<p>We saw this under COVID – there were a couple of days where all the numbers went wrong, because the data hadn’t been delivered correctly, or not enough of it had been delivered. There were no checks in place to say, actually, we’ve only got 10% of what we were expecting, so don’t automatically publish these results. It’s things like that that we really need to make sure are built into the systems, because those are the things that, again, could cause problems. As soon as you get a model that’s not doing the right thing – going back to our original question – you can find a company pulls that model even though it could easily be fixed. And then they’re disillusioned with AI, and won’t use it. That’s the whole project, and all of the expense and investment in it, thrown away when a bit more testing and understanding could have saved it.</p>
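<p>The missing check Bastiman describes could be as simple as this sketch – a guard that holds publication when a batch is suspiciously incomplete. The function name and the 90% floor are our inventions:</p>
<pre><code class="language-python">def safe_to_publish(records_received, records_expected, min_fraction=0.9):
    """Refuse automatic publication when the input volume
    falls far below what was expected."""
    if records_expected &lt;= 0:
        raise ValueError("expected record count must be positive")
    return records_received &gt;= min_fraction * records_expected

for received in (98_000, 10_000):
    ok = safe_to_publish(received, records_expected=100_000)
    print(received, "of 100000 received:",
          "publish" if ok else "hold for human review")</code></pre>
<p>A specialist AI tester’s job, in the sense described above, would be to make sure guards like this exist wherever bad input can silently become published output.</p>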
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Anna Demming</strong> is a freelance science writer and editor based in Bristol, UK. She has a PhD from King’s College London in physics, specifically nanophotonics and how light interacts with the very small, and has been an editor for Nature Publishing Group (now Springer Nature), IOP Publishing and New Scientist. Other publications she contributes to include The Observer, New Scientist, Scientific American, Physics World and Chemistry World.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. <!-- Add thumbnail image credit and any licence terms here --></p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Demming, Anna. 2024. “What is ‘best practice’ when working with AI in the real world?” Real World Data Science, June 4, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/ai-series-6.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>AI</category>
  <category>large language models</category>
  <category>machine learning</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/ai-series-6.html</guid>
  <pubDate>Tue, 04 Jun 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/06/04/images/panel-991.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>AI series: Meeting the unprecedented challenges AI poses in the labour market</title>
  <dc:creator>Julia Lane, Lesley Hirsch, and Adam Leonard</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/28/ai-series-5.html</link>
  <description><![CDATA[ 





<p>Roughly $280 billion of new funding was authorized to boost research and production of semiconductors in the US under the CHIPS and Science Act in 2022 – an amount greater than the inflation-adjusted initial spending to create the US Interstate Highway System. The legislation was just one of multiple acts engineered to subsidise and support emerging technologies in the US that are bound to have seismic impacts on the labor market. It signifies how swift changes in new and emerging technologies have the potential to profoundly change the demand for skills and the structure of work. Here AI has the potential to be more disruptive than any other technological development since the industrial revolution.</p>
<p>The US is not alone. Countries across the globe are trying to understand the potential for AI to affect their workforce and economic activity. Ipsos Group SA, a multinational market research company with headquarters in France, recently attempted to gauge people’s feelings towards AI across the world through a survey across 31 countries and interviews with a small cohort of AI leaders (<a href="https://www.ipsos.com/sites/default/files/ct/news/documents/2023-07/Ipsos%20Global%20AI%202023%20Report-WEB_0.pdf">Global Views on AI 2023</a>). However, although extensive, the data retrieved shares the limitations common to all surveys. The OECD’s most recent <a href="https://www.oecd-ilibrary.org/sites/08785bba-en/index.html?itemId=/content/publication/08785bba-en">Employment Outlook</a> devotes six of its seven chapters to understanding the impact of AI on the workforce. But the OECD also notes that “No comprehensive method exists by which to track and compare AI R&amp;D funding across countries and agencies.”<sup>1</sup> Not surprisingly, the inability to track, let alone compare, AI R&amp;D funding means that it is difficult to make predictions about the R&amp;D-induced global labor market consequences.</p>
<p>The lack of a comprehensive method, and the resultant uncertainty about impact, is a clarion call to action. There are many challenges that need to be addressed. A partial list would include the following: a) a lack of a common definition of AI; b) a lack of information about the needed AI capabilities and how they will change; c) the difficulty of mapping AI capabilities to occupational skills; and d) an inability to measure the impact of AI on job replacement or job augmentation.</p>
<p>Fortunately, there is hope, with new partnerships being established in the US by universities and federal and state agencies. A new data infrastructure is being developed at the Institute for Research on Innovation and Science (IRIS) at the University of Michigan, jointly with Ohio State University, funded by the federal US National Science Foundation (NSF). The pilot joins up existing data from university and state sources to trace how scientific innovation translates to the labor market<sup>2</sup>. The NSF, which has been charged with the regional implementation of the CHIPS and Science investments, is funding the pilot precisely because it needs “innovative tools to accurately assess the impact of these investments across the U.S.”<sup>3</sup></p>
<section id="how-bad-is-the-problem" class="level2">
<h2 class="anchored" data-anchor-id="how-bad-is-the-problem">How bad is the problem?</h2>
<p>The lack of data results in conflicting information. Some reports have warned of apocalyptic takeovers of the job market for many professions. Indeed, a heavily cited report by Goldman Sachs<sup>4</sup> predicted that AI could replace 300 million jobs. But the same BBC report that cited the Goldman Sachs prediction quoted the future-of-work director at Oxford University, Carl Benedikt Frey, as saying, “The only thing I am sure of is that there is no way of knowing how many jobs will be replaced by generative AI”. Simply put, as the former US Federal CIO, Suzette Kent, said, “we lack useful information for informing strategic decisions for national workforce matters.”</p>
<p>So just how much of a problem is it that there is no information on how investments in science and technology affect the labor market? Why should we worry if we cannot accurately predict the impact of AI on workers, firms, and jobs? One reason is to avoid the mistakes of the past, in which both workers and firms have borne the consequences of bad information. In recent history, digitization and globalization resulted in a devastating loss of jobs in many countries. And geographic inequality soared as jobs in the midwestern and northeastern urban centers were lost and a service economy on the coasts burgeoned. Efforts to reduce the loss of jobs and earnings came too little, too late <sup>5</sup> <sup>6</sup>. Another reason is to make evidence-based policy recommendations. For example, the US National AI Research Resources Taskforce, which was directly charged by the President and Congress with recommending ways to invest in AI research to strengthen and democratize the U.S. AI innovation ecosystem, did not have joined-up data between science investment and the workforce to inform its final recommendations. <sup>7</sup></p>
<p>In other words, governments need more timely, local, and actionable data so that they can understand changes in the tasks that employers need performed, which types of jobs and firms will be affected, and where. Concomitantly, data will be needed about the effects of AI on different population groups and different geographic areas so that the costs of change are not unfairly distributed. Armed with such information, policy makers can make investments that mitigate or counteract negative impacts and workers can be trained in the new necessary skills and matched with the firms that need them. But the swift pace of change in AI means that the urgency to create timely, local, and actionable labor market information to guide these investments has never been greater.</p>
</section>
<section id="a-new-approach" class="level2">
<h2 class="anchored" data-anchor-id="a-new-approach">A new approach</h2>
<p>The IRIS approach, called the “Industry of Ideas”, builds on the “economics of ideas” framework for which Paul Romer received the 2018 Nobel Prize in Economics <sup>8</sup> <sup>9</sup>. People who create ideas – new technologies that can be reused – form the foundations of new industries. In other words, “the discovery of new ideas lie at center of economic growth…” (Charles Jones describing Paul Romer’s conceptual framework) <sup>10</sup>.</p>
<p>The project recognizes that, as Robert Oppenheimer said, “the best way to transmit knowledge is to wrap it up in a human being”. <sup>11</sup> It uses people-centric methods for following the movement of ideas from investments in research into the marketplace. The approach identifies businesses that employ people with deep skills in AI and other emerging technology areas, and develops early, never-before-available indicators that can provide alerts about potential impacts on the current and future workforce. Initially focused on the artificial intelligence and electric vehicle industries in Ohio, the pilot is creating a data system that can be expanded and applied to other industries and other states across the country.</p>
<p>The new tools are innovative because they build on new opportunities to produce usable information that is local, that concerns relevant industries, and that directly ties investments in new technologies, such as AI, to labor market impacts.</p>
<p>Another key aspect of the NSF-piloted “Industry of Ideas” is the focus on tying innovation at its source - individual data on university research activities - to the local workforce data reported by firms to their state departments of labor. Local data is critical because many labor markets are local, not national, in scope. Even in a global economy, many businesses and workers are locally based – as are the training providers that work to ensure that labor demand and supply are well matched. Thus the Industry of Ideas pilot provides policy makers, workers, firms, and educational institutions with access to an array of local, timely, granular, actionable resources to help them make decisions. That way, local leaders who need labor market data do not need to rely on national unemployment figures, which are reported only once a month.</p>
</section>
<section id="connecting-science-investments-with-jobs" class="level2">
<h2 class="anchored" data-anchor-id="connecting-science-investments-with-jobs">Connecting science investments with jobs</h2>
<p>The Industry of Ideas approach directly connects investment in science and the labor market, moving beyond the current approach of evaluating investment by studying scientific papers and publications <sup>12</sup> <sup>13</sup>, which are disconnected from workers and jobs. The data seeds were sown almost two decades ago. President Bush’s Science Advisor, John Marburger III, who, quite sensibly, was unconvinced of the scientific and practical value of relying primarily on document-based, bibliometric approaches to studying science to understand its practical effects, called for a “Science of Science Policy” <sup>14</sup> <sup>15</sup>.</p>
<p>The Industry of Ideas is testing the potential to securely combine university and state data to measure the impact of federal investments in AI on local and regional economies. It uses people-centric data generated by the administrative processes at universities and firms. With these data the Industry of Ideas project can capture the organization of people in science at multiple levels (e.g.&nbsp;individuals, teams, projects, and institutions), their multiple sources of funding (federal scientific and programmatic agencies, philanthropic foundations, industry, and state and local government), inputs into science from vendors (such as computing services, instruments, and biological specimens), as well as the dynamics of their careers across time (individual career earnings and employment trajectories).</p>
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="The Industry of Ideas Infrastructure (provided by Jason Owen-Smith, IRIS, University of Michigan)">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/28/images/Industry-of-ideas991-724.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="The Industry of Ideas Infrastructure (provided by Jason Owen-Smith, IRIS, University of Michigan)">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: The Industry of Ideas Infrastructure (provided by Jason Owen-Smith, IRIS, University of Michigan)
</figcaption>
</figure>
</div>
<p>The IRIS infrastructure, developed over the past decade, provides administrative records on more than 41% of U.S. total R&amp;D spending at universities <sup>16</sup>. The infrastructure also provides links to survey data, as well as data from private sector suppliers <sup>17</sup>, and can trace the flows of university-funded researchers into the private sector <sup>18</sup> by joining up the university administrative data with state workforce data.</p>
</section>
<section id="tying-information-about-ai-to-skills-needs" class="level2">
<h2 class="anchored" data-anchor-id="tying-information-about-ai-to-skills-needs">Tying information about AI to skills needs</h2>
<p>How is it possible to tie changes in AI to changing needs for skills? State leaders in workforce and education agencies have identified new ways to collaborate, build staff capacity, and develop solutions, services, and products that respond to local need. An example of how to use data to get better information that more accurately connects workers with firms in a swiftly changing labor market is the New Jersey Career Navigator. It provides job seekers with recommendations on new careers, available job postings, and relevant training programs, based on skills similarity, labor market demand, and wage impacts observed in the underlying data. These recommendations, which are themselves generated by AI, show how AI technology can be used to navigate the labour market changes that AI may cause. The New Jersey Career Navigator draws on millions of wage records, providing earnings and industry information on all workers covered by unemployment insurance in New Jersey firms; employment and wage outcomes from hundreds of thousands of graduates of occupational skills training programs in New Jersey; several years of online job postings from the National Labor Exchange Research Hub (NLx); and the resumes of 400,000 New Jersey residents.</p>
<p>In other words, as the Industry of Ideas pilot evolves, new ideas from states like New Jersey can be used not only to trace the flows of ideas from academia to the workplace but also to develop a new system that targets reskilling efforts once the type and location of skills needs have been identified. The new joined-up data and evidence can be used to address challenges such as low labor force participation, and to supply education and training providers with the data they need to align their programs with the needs of the labor market. Such a system would help government, business, educators, and workers adjust regional talent pipelines continuously in response to changes in AI, and enable workers to successfully navigate the changes that it brings.</p>
</section>
<section id="new-approaches-to-classifying-industries-industries-of-ideas" class="level2">
<h2 class="anchored" data-anchor-id="new-approaches-to-classifying-industries-industries-of-ideas">New approaches to classifying industries: “Industries of Ideas”</h2>
<p>An important outcome of the new NSF pilot is the potential to transform the way in which we classify firms into industries. The current industry classifications are rule-based. They were designed for the economy as it was organized 40 years ago, so they are not equipped to describe AI. A case in point is the state of Texas – a state that anecdotally has generated a lot of high-tech jobs. Current industry data for Texas is limited because firms are grouped into industries defined by what they produce, or how they produce it, rather than by what new technology is being developed or utilized by those firms. As a result, the main source of labor market data in Texas gives an implausibly low picture of AI activity <sup>19</sup>.</p>
<p>The Industries of Ideas approach could provide states with a new way to classify firms, based on the new ideas that underpin how firms do business, grouping firms by the people who created and use the technologies they adopt <sup>20</sup>. Examples just for Ohio include funding to use AI to improve the ways in which medicine is delivered, and advancing digital agriculture, which includes things like precision livestock farming, or precision agriculture that reduces waste and improves productivity more generally. The clustering of university researchers – and the ideas embodied in them – alongside the farms that adopt those ideas as they interact with farmers represents this new type of industry cluster. Such a classification framework is a sea change from earlier industrial classifications based on what goods are physically produced - like manufacturing and agriculture <sup>21</sup>.</p>
</section>
<section id="the-future" class="level2">
<h2 class="anchored" data-anchor-id="the-future">The Future</h2>
<p>Such a bottom-up classification and analysis system, based on local links between researchers and firms, could be designed locally but scaled nationally. It could address the challenges identified at the beginning of this piece. The definition of AI firms could evolve, defined by the links between AI researchers and the firms with which they work. The lack of information about needed AI capabilities would be resolved by directly mapping firms’ skill demand and hiring patterns, as exemplified in New Jersey. The same New Jersey mapping could tie AI capabilities to occupational skills. And the direct impact of AI on job replacement or job augmentation could be mapped from the joined-up university and workforce data.</p>
<p>Of course, much needs to be done. Implementation will depend on the success of the pilot and the ability to build on existing assets. Not all states and universities have the capacity to build a similar system, but the fact that 30 universities and 15 state agencies are participating in advisory boards for the NSF Industry of Ideas pilot is grounds for hope. Indeed, a new generation of data leaders is showing the way, not only in local and regional government but also at universities and professional associations (Advisory Committee on Data for Evidence Building) <sup>22</sup>.</p>
<p>We began this paper by noting that the urgency to create timely, local, and actionable labor market information has never been greater. We close by arguing that our capacity to fundamentally change the way in which we can use data and information to understand the demand for skills and the structure of work has also never been greater. The opportunity is ours for the taking.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Julia Lane</strong> is a Professor at New York University’s Wagner Graduate School of Public Service. She was a senior advisor in the Office of the Federal CIO at the White House, supporting the implementation of the Federal Data Strategy. She recently served on two White House committees: the Advisory Committee on Data for Evidence Building and the National AI Research Resources Task Force.
</dd>
<dd>
<strong>Adam Leonard</strong> is the Chief Analytics Officer &amp; Director of the Division of Information Innovation &amp; Insight (I|3) for the Texas Workforce Commission (TWC). Adam envisioned and founded I|3 to help TWC leverage its most important untapped resource - its data – to help the agency and its partners better help employers, individuals, families, and communities achieve &amp; maintain prosperity.
</dd>
<dd>
<strong>Lesley Hirsch</strong> is the Assistant Commissioner of Research and Information at the New Jersey Department of Labor and Workforce Development. Her vision for the department is to bring cutting-edge digital tools to bear to deliver labor market intelligence to the department’s internal and external customers where, when, and how they need it and to mine every data source so it can tell its full story.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. <!-- Add thumbnail image credit and any licence terms here --></p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Lane, J., Hirsch, L. and Leonard, A. 2024. “Meeting the unprecedented challenges AI poses in the labour market.” Real World Data Science, May 28, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/13/ai-series-5.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>A new approach to measuring government investment in AI-related R&amp;D. Galindo-Rueda, F. &amp; Cairns, S. <em>oecd.ai</em> (2021)↩︎</p></li>
<li id="fn2"><p><a href="https://www.aei.org/research-products/report/the-industry-of-ideas-measuring-how-artificial-intelligence-changes-labor-markets/">The Industry of Ideas: Measuring How Artificial Intelligence Changes Labor Markets</a> Lane, J. AEI (2023)↩︎</p></li>
<li id="fn3"><p><a href="https://www.aei.org/research-products/report/the-industry-of-ideas-measuring-how-artificial-intelligence-changes-labor-markets/">NSF launches pilot to assess the impact of strategic investments on regional jobs</a> *new.nsf.gov (2023)↩︎</p></li>
<li id="fn4"><p><a href="https://www.bbc.co.uk/news/technology-65102150">AI could replace equivalent of 300 million jobs - report</a> Vallance, C. <em>BBC news</em> (2023)↩︎</p></li>
<li id="fn5"><p><a href="https://www.aeaweb.org/articles?id=10.1257/aer.103.5.1553">The Growth of Low-Skill Service Jobs and the Polarization of the US Labor Market</a> Autor, D. H. &amp; Dorn, D. <em>American Economic Review</em> <strong>103</strong> pp.&nbsp;1553-97 (2013)↩︎</p></li>
<li id="fn6"><p><a href="https://www.aeaweb.org/articles?id=10.1257/aer.104.8.2509">Explaining Job Polarization: Routine-Biased Technological Change and Offshoring</a> Goos, M., Manning, A. &amp; Salomons, A. <em>American Economic Review</em> <strong>104</strong> 2509-26 (2014)↩︎</p></li>
<li id="fn7"><p><a href="https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf">Strengthening and Democratizing the U.S. Artificial Intelligence Innovation Ecosystem</a> Office of Science and Technology Policy (2023)↩︎</p></li>
<li id="fn8"><p><a href="https://paulromer.net/deep_structure_growth/">The Deep Structure of Economic Growth</a> Romer, P. <em>paulromer.net</em> (2019)↩︎</p></li>
<li id="fn9"><p><a href="https://hdsr.mitpress.mit.edu/pub/zgu2u8y6/release/2">Interview With Paul Romer</a> Romer, P. &amp; Lane, J. (2022)↩︎</p></li>
<li id="fn10"><p><a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/sjoe.12370">Paul Romer: Ideas, nonrivalry, and endogenous growth</a>(Jones, C. I. <em>The Scandinavian Journal of Economics</em> <strong>121</strong> 859-883 (2019)↩︎</p></li>
<li id="fn11"><p><a href="https://www.science.org/doi/10.1126/science.aac5949">Wrapping it up in a person: Examining employment and earnings outcomes for Ph.D.&nbsp;recipients</a> Zolas, N. <em>et al.</em> <em>Science</em> **350 1367-1371 (2015)↩︎</p></li>
<li id="fn12"><p><a href="https://www.nature.com/articles/464488a">Let’s make science metrics more scientific</a> Lane, J. <em>Nature</em> <strong>464</strong> 488–489 (2010)↩︎</p></li>
<li id="fn13"><p><a href="https://issues.org/democratizing-government-data-lane/">A Vision for Democratizing Government Data</a> Lane, J. <em>Issues in Science and Technology</em> <strong>XXXIX</strong> (2022)↩︎</p></li>
<li id="fn14"><p><a href="https://www.nature.com/articles/464488a">Let’s make science metrics more scientific</a> Lane, J. <em>Nature</em> <strong>464</strong> 488–489 (2010)↩︎</p></li>
<li id="fn15"><p><a href="https://www.science.org/doi/10.1126/science.1114801">Wanted: Better Benchmarks</a> Marburger III, J. H. <em>Science</em> <strong>308</strong> p1087(2005)↩︎</p></li>
<li id="fn16"><p><a href="https://iris.isr.umich.edu/research-data/2022datarelease-summarydoc/">The Institute for Research on Innovation &amp; Science (IRIS). Summary Documentation for the IRIS UMETRICS 2022 Data Release</a> Nicholls, N., Brown, C. A., Ku, R. L. and Owen-Smith, J. D. <em>Ann Arbor, MI: The Institute for Research on Innovation &amp; Science</em> (2022) doi: 10.21987/df2a-ha30↩︎</p></li>
<li id="fn17"><p><a href="https://hdsr.mitpress.mit.edu/pub/u073rjxs/release/3">A Linked Data Mosaic for Policy-Relevant Research on Science and Innovation: Value, Transparency, Rigor, and Community</a> Chang, W.-Y., Garner, M., Basner, J., Weinberg, B. and Owen-Smith, J. <em>Harvard Data Science Review</em> (2022) doi: 10.1162/99608f92.1e23fb3f↩︎</p></li>
<li id="fn18"><p><a href="https://www.aei.org/research-products/report/the-industry-of-ideas-measuring-how-artificial-intelligence-changes-labor-markets/">The Industry of Ideas: Measuring How Artificial Intelligence Changes Labor Markets</a> Lane,J. <em>American Enterprise institute</em> (2023)↩︎</p></li>
<li id="fn19"><p>[Outside of the Box Use of Administra4ve and Wage Data in Texas] (https://digitaleconomy.stanford.edu/wp-content/uploads/2024/03/Adam-Leonard.pdf) Leonard, A. <em>digitaleconomy.standford.edu</em> (2024)↩︎</p></li>
<li id="fn20"><p><a href="https://www.aei.org/research-products/report/the-industry-of-ideas-measuring-how-artificial-intelligence-changes-labor-markets/">The Industry of Ideas: Measuring How Artificial Intelligence Changes Labor Markets</a> Lane,J. <em>American Enterprise institute</em> (2023)↩︎</p></li>
<li id="fn21"><p><a href="https://www.bea.gov/system/files/papers/P2007-7.pdf">Converting historical industry time series data from SIC to NAICS. The Federal Committee on Statistical Methodology</a> Yuskavage, R. <em>Federal Committee on Statistical Methodology</em> (2007)) – or by how services and goods are produced – like the delivery of health, financial, and investment services <a href="https://www.jstor.org/stable/23487551">The Statistics Corner: The NAICS Is Coming. Will We Be Ready?</a> Haver, M. A. <em>Business Economics</em> <strong>32</strong> 63-65 (1997)↩︎</p></li>
<li id="fn22"><p><a href="https://www.bea.gov/system/files/2022-10/supplemental-acdeb-year-2-report.pdf">Year 2 Report Supplemental Information</a> Advisory Committee on Data for Evidence Building (ACDEB) (2022)↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>Data management</category>
  <category>Forecasting</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/28/ai-series-5.html</guid>
  <pubDate>Tue, 28 May 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/28/images/Industry-of-ideas991-724.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>AI series: Evaluation essentials for safe and reliable AI model performance</title>
  <dc:creator>Isabel Sassoon</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/21/ai-series-4.html</link>
  <description><![CDATA[ 





<p>It took just sixteen hours for <a href="https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/">Microsoft’s shiny new chatbot</a> Tay to be shut down for profanity. The chatbot had been released on the social media platform X, then known as Twitter, following extensive evaluation and stress testing under different conditions to ensure that interacting with the chatbot would be a positive experience. Unfortunately, the testing plan had not bargained on a coordinated attack exploiting the chatbot’s vulnerability when exposed to a torrent of offensive material. Tay soon began tweeting wildly inappropriate words and images and was taken offline within hours.</p>
<p>The chatbot’s failure highlights just how hard, yet imperative, it can be to test and evaluate a model before real-world deployment. With the recent influx of accessible “off-the-shelf” machine learning algorithms, building AI models, in particular generative AI models, is now relatively straightforward. However, the ease with which models can be deployed belies the complexity of evaluating them. Deploying a model anywhere outside the data and context it has been trained on is risky if its performance has not been evaluated. The evaluation process requires clear definitions of good performance, as well as an account of the potential risks, and can throw up unexpected requirements for the test data. Not only are the subtle nuances of the initial evaluation requirements important; once the model is deployed, a process needs to be in place so that the algorithm can be monitored over time.</p>
<section id="know-your-goals" class="level2">
<h2 class="anchored" data-anchor-id="know-your-goals">Know your goals</h2>
<p>The first point to note is that checking how well the output from an AI model matches the data in the training set is not an adequate indication of how well it will perform once deployed on other data. Consider a simple model based on the equation that best fits a training data set. Data values are inevitably subject to measurement uncertainties and local conditions that add various types of noise, so taking the line defined by that equation and measuring how closely it matches the training data falls short of adequate evaluation: the more perfectly a model matches this noisy data, the less well it will fit an alternative set of data, a scenario described as “overfitting”. Similarly, what a machine learning or AI model learns when it optimises its fit to the training data may not be generalisable.</p>
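<p>As a minimal sketch of this effect (assuming numpy and scikit-learn are available; the data are made up, drawn from a straight line plus noise), a degree-9 polynomial can match fifteen noisy training points almost perfectly, yet it predicts fresh data from the same underlying line worse than a simple straight-line fit:</p>
<pre><code class="language-python">import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15).reshape(-1, 1)
x_test = np.linspace(0, 1, 100).reshape(-1, 1)

def true_line(x):
    return 2.0 * x + 1.0   # the underlying relationship generating the data

y_train = true_line(x_train).ravel() + rng.normal(0, 0.3, 15)    # noisy observations
y_test = true_line(x_test).ravel() + rng.normal(0, 0.3, 100)     # fresh noisy data

for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    # Training error falls as the degree rises; test error rises: overfitting
    print(degree, round(train_mse, 3), round(test_mse, 3))
</code></pre>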
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Reliable deployment of an algorithm requires identifying metrics, risks and rigorous, ongoing evaluation. Image created by Isabel Sassoon using firefly to show a technical report process flow of statistical model performance and a huge numbers chart.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/21/images/Evaluation_thumbnail.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Reliable deployment of an algorithm requires identifying metrics, risks and rigorous, ongoing evaluation. Image created by Isabel Sassoon using firefly to show a technical report process flow of statistical model performance and a huge numbers chart.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Evaluation plots. The kinds of charts of monitored performance and risk metrics that are plotted to evaluate an AI model. Reliable deployment of an algorithm requires identifying appropriate metrics for performance and risk as well as rigorous, ongoing evaluation. Image created by Isabel Sassoon using Adobe Firefly to show a technical report process flow of statistical model performance and a huge numbers chart.
</figcaption>
</figure>
</div>
<p>There are a number of possible approaches and factors to take into account when sourcing test data, but the first thing to consider when drawing up a process for evaluating an AI model is its objective. With this objective in mind it is then possible to pin down an appropriate measure of performance, which will shape how the test data is used to evaluate the model. Different measures suit different objectives: some are appropriate when the objective is to classify (e.g.&nbsp;is a patient high or low risk based on health data?), while others are useful for models that estimate or predict (e.g.&nbsp;what is the estimated height of a child given their parents’ heights?).</p>
<p>Classification model performance can be measured using accuracy, confusion matrices, sensitivity, specificity and the receiver operating characteristic (ROC). Classification accuracy summarises the performance of a classification model as the number of cases the model classifies correctly divided by the total number of cases in the test set. However, this can be a blunt tool, as there are cases where the cost or consequence differs depending on the direction of the error. Confusion matrices are helpful for exploring how well the model classifies each of the classes. The confusion matrix sums up the number of cases the model classifies correctly within each class, for example how many actual high-risk cases are correctly classified as high risk by the model. The cases the model classifies as high risk, for example, that are not in fact high risk are referred to as False Positives. In the context of medical tests (e.g.&nbsp;the COVID lateral flow tests), testing positive for a condition that is not actually there is potentially less damaging than testing negative when the condition is there.</p>
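<p>A minimal sketch of these two measures, assuming scikit-learn and illustrative labels (1 for high risk, 0 for low risk), might look like this:</p>
<pre><code class="language-python">from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # ground-truth labels in the test set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # the model's classifications

print(accuracy_score(y_true, y_pred))     # correct classifications / total cases: 0.8

# confusion_matrix returns counts per class; .ravel() unpacks the binary case
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp)   # False Positives: low-risk cases flagged as high risk
print(fn)   # False Negatives: high-risk cases the model missed
</code></pre>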
<div id="fig-2" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="The receiver operating characteristic can provide a helpful means of visualising performance. Credit: shutterstock .">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/21/images/shutterstock_2377152411.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="The receiver operating characteristic can provide a helpful means of visualising performance. Credit: shutterstock .">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: The receiver operating characteristic can provide a helpful means of visualising performance. Credit: shutterstock
</figcaption>
</figure>
</div>
<p>Additionally, the sensitivity and specificity can provide a more detailed look at model performance. The sensitivity is the proportion of cases labelled as positive that are classified as positive by the model, whereas the specificity is the proportion labelled as negative that it classifies as negative. It is useful to visualise model performance, and the receiver operating characteristic (ROC) provides a method to do just that. The ROC plots the True Positive rate against the False Positive rate for the model. This can be further summarised in one value as the area under the curve (AUC). The larger the AUC, the better the model is performing.</p>
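<p>The sketch below (again assuming scikit-learn; the scores stand in for a model’s predicted probabilities on a test set) computes sensitivity and specificity at a 0.5 threshold, then summarises performance across all thresholds with the AUC:</p>
<pre><code class="language-python">from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.35, 0.6, 0.2, 0.4, 0.1, 0.3, 0.7, 0.45]

y_pred = [1 if s &gt;= 0.5 else 0 for s in y_score]   # classify at a 0.5 threshold
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
print(tp / (tp + fn))   # sensitivity (true positive rate): 0.8
print(tn / (tn + fp))   # specificity (true negative rate): 1.0

# The ROC traces the trade-off as the threshold varies; the AUC summarises it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))   # closer to 1 is better
</code></pre>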
<p>Deciding whether accuracy is enough, or whether there is a need to delve into the directions of the errors, depends on the context of the model’s deployment. Other examples in medicine include the risk models that were developed to assess an individual’s risk of a specific medical condition, such as <a href="https://qrisk.org/">QRISK</a>, <sup>1</sup> which calculates a person’s risk of developing a heart attack or stroke over the next 10 years. Here model performance needs to go beyond accuracy and consider the direction of the errors the model makes. A good overview of performance evaluation is given by Flach (2019) <sup>2</sup>. Is it better to tell someone they may be at risk of disease X, run a blood test and rule it out (False Positive) than to tell them they are not at risk and not check (False Negative)? All this needs to be considered and factored into the validation of the model. It is worth noting that a systematic direction for its errors can also cause an algorithm to hit <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/ai-series-2.html">ethical problems</a>.</p>
<p>When evaluating the performance of models that estimate a numerical value (e.g.&nbsp;the height of a child from the heights of the parents), the measures used are based on how far the model’s estimate falls from the actual value (which is known for testing data). There are then a multitude of ways of summarising that quantity. The mean square error (MSE) is computed by taking the average squared difference between the estimated values from the model and the actual values in the data. Other variations include the root mean square error (RMSE) and the mean absolute error (MAE). The RMSE is computed in the same way as the MSE, but the value is then square rooted. The MAE takes a different approach, averaging the absolute errors (i.e.&nbsp;the error magnitudes). Each of these measures involves dividing by the number of rows in the data. Depending on the context, one of these measures may be better suited than the others. For example, the MSE is sensitive to outliers so can be easily skewed by a small number of extreme values, which may be useful if extreme errors need highlighting, whereas the RMSE has the advantage of being measured in the same units as the original variable the model is designed to estimate.</p>
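<p>The three summaries are quick to compute directly; a minimal sketch, assuming numpy and illustrative height data in centimetres:</p>
<pre><code class="language-python">import numpy as np

actual    = np.array([150.0, 160.0, 145.0, 170.0, 155.0])   # known test-set values
estimated = np.array([152.0, 158.0, 150.0, 168.0, 175.0])   # the model's estimates

errors = estimated - actual
mse  = np.mean(errors ** 2)       # mean square error: penalises outliers heavily
rmse = np.sqrt(mse)               # root mean square error: same units as the heights
mae  = np.mean(np.abs(errors))    # mean absolute error: average error magnitude
print(mse, rmse, mae)
</code></pre>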
<p>Large Language Models (LLMs, e.g.&nbsp;Gemini, ChatGPT) are also models trained on a data set and as such also need to be evaluated and monitored. Whereas the models discussed so far have standard metrics, evaluating LLMs is more challenging, as there is a multitude of benchmarks and metrics<sup>3</sup>. When LLMs are used to answer questions (when you ask a chatbot a question), monitoring the performance of the model (the trained LLM) can involve many dimensions. Is the answer correct? Is the answer clear? Is the answer biased? The possible metrics are varied and not as simple to capture in one measure. It is also possible to use an LLM to evaluate or score another LLM’s answer to a question. However, this adds its own risk, as LLMs are not 100% accurate or consistent themselves, and they can <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/11/23/LLM-content-warning.html">hallucinate</a>.</p>
</section>
<section id="getting-data-right" class="level2">
<h2 class="anchored" data-anchor-id="getting-data-right">Getting data right</h2>
<p>Not only is separate test data needed for an evaluation, but care is needed to ensure the test data is suitably representative. The same requirements apply to test data as to the original training data: the data set must be representative of the context the model will be deployed in. For instance, if an algorithm is being developed to handle photos from the UK, training and testing it on photos where the sun always shines may cause problems. The model needs to be trained and tested on a set of photos that includes rain and clouds; otherwise it cannot be assumed it will reliably classify such photos if they appear during deployment in the real world. Getting the training and test data sets right may mean using a smaller, more curated set rather than simply one that contains everything available.</p>
<p>These data sets also need reliable labelling, i.e.&nbsp;the labels attached to the rows of data need to be accurate, so that the model’s performance can be assessed objectively against a trusted “ground truth”. For example, if we want to evaluate the performance of a fraud transaction classification model using accuracy as the performance metric, then we need a reliable test data set with true fraud transactions to evaluate how good the model is at detecting them. A data set with a list of transactions that are not accurately identified (or labelled) as fraud or not is not helpful. Given how some commercial LLMs are trained on all the data on the “internet”, it is worth asking whether a smaller, more curated and specific training set would be better for model performance, as well as being more ethical and safer.</p>
<p>Several approaches for generating test data sets take training and test data as distinct subsets of the same initial data set <sup>4</sup>. There are different ways of doing this to make the most of the data and to evaluate the model as systematically and exhaustively as possible. Perhaps the simplest is the hold-out set, which involves setting aside a random subset of all the available data for testing the model. Depending on how much data is available, this can be 50% or less.</p>
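<p>In scikit-learn, for example, a hold-out set is a single call; a minimal sketch with made-up arrays (the 30% test fraction is an assumption, not a rule):</p>
<pre><code class="language-python">import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)   # 20 rows of feature data
y = np.arange(20)                  # matching labels

# Hold out a random 30% of rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # 14 training rows, 6 held out for evaluation
</code></pre>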
<p>A slightly more sophisticated approach is k-fold cross validation, which involves splitting all the available data into k subsets and then doing k iterations: in each iteration a different kth of the data is used as the testing data to evaluate a model built by training on the remaining (k−1)/k of the data. This is repeated k times, each time using a different one of the k subsets for testing. The performance of the model can then be averaged over the k iterations. (The measure of performance can be, say, accuracy or sensitivity, depending on the context.) For example, if k is 3 then the data is split into three, and each iteration takes a different two thirds of the data as training data to build the model, and the remaining third as testing data to evaluate it.</p>
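<p>A short sketch of 3-fold cross validation, assuming scikit-learn and one of its bundled toy data sets:</p>
<pre><code class="language-python">from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 3 iterations trains on 2/3 of the data and tests on the remaining 1/3
scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
print(scores)          # one accuracy value per fold
print(scores.mean())   # performance averaged over the k iterations
</code></pre>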
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="K fold cross validation can indicate how sensitive a model is to the test data. Credit: Fabian Flöck CC-BY-AS-3.0">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/21/images/k-fold-cross-validation.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="K fold cross validation can indicate how sensitive a model is to the test data. Credit: Fabian Flöck CC-BY-AS-3.0">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: K fold cross validation can indicate how sensitive a model is to the test data. Credit: Fabian Flöck CC-BY-AS-3.0
</figcaption>
</figure>
</div>
<p>The bootstrap is a more computationally intensive approach: it involves creating multiple samples by randomly sampling with replacement from the original data. Typically, hundreds or thousands of such samples are generated, and each will be different. These multiple samples provide multiple versions of the training and testing data, so the model can be evaluated on all these variations. Because the bootstrap samples with replacement, a given row of the original data can appear multiple times in one sample’s training or test data and not appear at all in another. As with k-fold cross validation, the performance of the model can then be averaged over these multiple iterations. It is important that the bootstrap does not rely on only a handful of iterations. Both the bootstrap and cross validation offer an opportunity to see how sensitive the model’s performance is to the characteristics of the test data, but when the data sets available are small, the bootstrap provides a more robust way of estimating the model’s performance.</p>
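<p>A hedged sketch of bootstrap evaluation, assuming scikit-learn and one of its toy data sets; 200 resamples keep the example quick, though real uses often run thousands:</p>
<pre><code class="language-python">import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
n = len(y)
scores = []
for seed in range(200):
    # Sample row indices with replacement: some rows repeat, others are left out
    idx = resample(np.arange(n), replace=True, random_state=seed)
    oob = np.setdiff1d(np.arange(n), idx)   # the left-out rows become the test set
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[idx], y[idx])
    scores.append(accuracy_score(y[oob], model.predict(X[oob])))
print(np.mean(scores), np.std(scores))   # average performance and its variability
</code></pre>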
<p>An approach that can be useful for testing whether the performance of the model is sensitive to time is the time-based split. This takes a “sliding window” that splits the data into back-to-back time periods, ensuring that the data the model is trained on is separate from, and precedes, the data it is tested on.</p>
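<p>scikit-learn’s TimeSeriesSplit offers one way to do this; a minimal sketch in which 12 rows stand in for 12 consecutive time periods, with max_train_size capping the training window to give the sliding-window behaviour described above:</p>
<pre><code class="language-python">import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3, max_train_size=6)
for train_idx, test_idx in tscv.split(X):
    # Training periods always precede the test periods and never overlap them
    print("train:", train_idx, "test:", test_idx)
</code></pre>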
</section>
<section id="maintained-monitoring" class="level2">
<h2 class="anchored" data-anchor-id="maintained-monitoring">Maintained monitoring</h2>
<p>Once an algorithm has been let loose it can be difficult to maintain rigorous monitoring, but the challenge of ongoing monitoring is worth taking on, and there are promising approaches to it. Many of the same metrics will apply in keeping a handle on the myriad issues that could arise. These range from the banal, such as data input errors, to the complex, as in the case of model drift.</p>
<p>In the first case, if a model makes use of data that is fed into it from another system (e.g.&nbsp;a billing system), any update to this other system can affect model performance. Identifying this involves checking that the characteristics of the data used to train the model and the latest data fed into the model are not too dissimilar, since a change in the data, such as an increase by a factor of 10 or 100, can cause the algorithm to fail. The magnitude of acceptable change in the data will depend on the context. Such a step change (due to a source-system update) in one of the model inputs can be identified and is potentially an easy fix.</p>
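<p>A hedged sketch of such an input check, assuming only numpy: flag any feature whose recent mean has moved far from its training-time mean (the factor-of-10 case above would fail immediately; the threshold is an assumption to be set per context):</p>
<pre><code class="language-python">import numpy as np

def check_inputs(train_X, live_X, max_ratio=3.0):
    """Return indices of features whose live mean has drifted beyond max_ratio."""
    train_means = np.abs(train_X.mean(axis=0)) + 1e-9   # small offset avoids /0
    live_means = np.abs(live_X.mean(axis=0)) + 1e-9
    ratio = np.maximum(live_means / train_means, train_means / live_means)
    return np.where(ratio &gt; max_ratio)[0]

train_X = np.random.default_rng(0).normal(100, 10, size=(1000, 3))
live_X = train_X.copy()
live_X[:, 1] *= 10                     # simulate a source-system update scaling one input
print(check_inputs(train_X, live_X))   # flags feature 1
</code></pre>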
<p>Model drift is more complex, as real-world data evolves over time. There are two types of model drift: data drift and concept drift. Data drift refers to the change that can occur to data over time, whilst concept drift<sup>5</sup> is a deterioration or change in the relationship between the target variable and the input variables of a model. An example of data drift, in the context of billing data, could be the addition of new price plans or phones to the data, whilst concept drift arises when the relationship between an outcome (for instance, leaving one mobile phone provider for another) and its underlying factors changes. In the mobile phone market, concept drift might mean that leaving for another provider is no longer dictated so much by price sensitivity as by the type of network. Both types of drift lead to a deterioration in the performance of the model as time goes by. Performance monitoring of the model is key to detecting model drift, but differentiating between data and concept drift requires additional specialist approaches. Some of these are outlined by Rotalinti et al. (2022)<sup>6</sup> and Davis et al. (2020)<sup>7</sup>.</p>
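<p>Data drift in a single feature can be screened for with a two-sample Kolmogorov-Smirnov test, as in the sketch below (assuming scipy; the significance level is an assumption, not a universal rule). Concept drift, by contrast, usually surfaces through tracking the performance metrics above over time rather than through the inputs alone:</p>
<pre><code class="language-python">import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
training_feature = rng.normal(50.0, 5.0, 5000)   # feature values seen at training time
incoming_feature = rng.normal(55.0, 5.0, 5000)   # recent live values with a shifted mean

stat, p_value = ks_2samp(training_feature, incoming_feature)
if p_value &lt; 0.01:   # assumed significance level
    print("data drift suspected: the feature's distribution has changed")
</code></pre>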
<p>In some cases, refreshing a model to account for a change in the underlying data (both training and test) can be quick and easy. However, if concept drift is detected, then it may take more than just a model refresh, as the relationship between the variable we are trying to model and the explanatory data has changed. This may involve finding new data sources and could lead to significant changes in the model, for example moving from a regression model to a neural network. Deciding to rebuild or retrain a model can also, in some cases, have an environmental impact (particularly for the more resource-intensive models such as deep learning and LLMs). Either way, where models are subject to peer review or some form of governance, this can be a more onerous task.</p>
<p>Even with every step of a model’s evaluation stringently adhered to, it is also important to assess the context of its deployment for risks and rogue scenarios that might break the model or, as in the case of Tay, despoil it. And, like all other stages of the evaluation, this should happen not just at the time of deployment but over time. When models (machine learning or other) are used to inform or make important decisions, providing information on how and when the model was evaluated, and how it is monitored, should be standard practice: not just to avoid the wasted expense of another broken AI model left on the shelf but, more importantly, to safeguard the welfare of those who come into contact with it.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Isabel Sassoon</strong> is senior lecturer in the Department of Computer Science, Brunel University London.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. <!-- Add thumbnail image credit and any licence terms here --></p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Sassoon, Isabel. 2024. “Evaluation essentials for safe and reliable AI model performance.” Real World Data Science, May 21, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/06/ai-series-4.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Hippisley-Cox, J., Coupland, C. and Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study.<em>BMJ</em> (2017) <a href="https://www.bmj.com/content/357/bmj.j2099">doi: https://doi.org/10.1136/bmj.j2099</a>.↩︎</p></li>
<li id="fn2"><p>Flach, P. (2019). Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward. <em>Proceedings of the AAAI conference on artificial intelligence</em> pp.&nbsp;9808-9814 (2019) <a href="https://ojs.aaai.org/index.php/AAAI/article/view/5055">doi: https://doi.org/10.1609/aaai.v33i01.33019808</a>.↩︎</p></li>
<li id="fn3"><p>Chang, Y. <em>et al.</em> A survey on evaluation of large language models. <em>ACM Transactions on Intelligent Systems and Technology</em> (2023) <a href="https://dl.acm.org/doi/10.1145/3641289">doi: https://doi.org/10.1145/3641289</a>.↩︎</p></li>
<li id="fn4"><p>Witten, I. H., Frank, E. and Hall, M. A. Data mining: Practical machine learning tools and techniques. Morgan Kaufmann (2011).↩︎</p></li>
<li id="fn5"><p>Bayram, F., Ahmed, B. S. and Kassler A. From concept drift to model degradation: An overview on performance-aware drift detectors. <em>Knowledge-Based Systems</em> (2022) <a href="https://www.sciencedirect.com/science/article/pii/S0950705122002854">doi: https://doi.org/10.1016/j.knosys.2022.108632</a>.↩︎</p></li>
<li id="fn6"><p>Rotalinti, Y., Tucker, A., Lonergan, M., Myles, P. and Branson, R. Detecting drift in healthcare AI models based on data availability. <em>Joint European Conference on Machine Learning and Knowledge Discovery in Databases</em> 243-258 (2022) Springer Nature Switzerland. <a href="https://link.springer.com/chapter/10.1007/978-3-031-23633-4_17">doi: https://doi.org/10.1007/978-3-031-23633-4_17</a>↩︎</p></li>
<li id="fn7"><p>Davis, S. E., Greevy Jr, R. A., Lasko, T. A., Walsh, C. G. and Matheny, M. E. Detection of calibration drift in clinical prediction models to inform model updating. <em>Journal of biomedical informatics</em> (2020) <a href="https://www.sciencedirect.com/science/article/pii/S1532046420302392">doi: https://doi.org/10.1016/j.jbi.2020.103611</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>Machine Learning</category>
  <category>Large Language Models</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/21/ai-series-4.html</guid>
  <pubDate>Tue, 21 May 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/21/images/Evaluation_thumbnail.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>AI series: On AI ethics - influencing its use in the delivery of public good</title>
  <dc:creator>Olivia Varley-Winter</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/ai-series-2.html</link>
  <description><![CDATA[ 





<p>Criminal <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">sentencing biased by race</a> in the US, <a href="https://www.theguardian.com/education/2021/feb/18/the-student-and-the-algorithm-how-the-exam-results-fiasco-threatened-one-pupils-future">students systematically downgraded</a> in UK public examinations with no process for appeal, and <a href="https://orientblackswan.com/details?id=9789352875429#:~:text=Dissent%20on%20Aadhaar%20argues%20that,surveillance%20and%20commercial%20data%2Dmining">decisions to rescind food welfare</a> in India riddled with errors and discrepancies are all instances where AI algorithms have hit the headlines. When Bill Gates wrote that the age of <a href="https://www.linkedin.com/pulse/age-ai-has-begun-bill-gates/">AI has begun</a> and “will change the way people work, learn, travel, get health care, and communicate with each other,” those probably weren’t the changes he had in mind. Nor need they be an inevitable side effect of living with AI.</p>
<p>A number of points require consideration to work safely with AI, from the potential for bias in input and training data, and consent over data use, to the transparency and fairness of applying an algorithm – who has decided the problem, or set of problems, it is to solve? The steps that are taken to explain and involve an organisation’s stakeholders in the conclusions that AI reaches also require ethical consideration, as does ethical development of AI. Its use for social policies and services highlights an additional set of problems.</p>
<p>As AI becomes more active in society, AI ethics involves not only defining the objectives for data scientists, researchers and technologists to work on. It involves governing bodies, regulators, policy makers, businesses and organisations, the media, and civil society, working to handle and communicate AI’s benefits and mitigate its harms. Organisations with international clout – such as the United Nations Educational, Scientific and Cultural Organization (<a href="https://www.unesco.org/en/artificial-intelligence/recommendation-ethics">UNESCO</a>) and the Organisation for Economic Co-operation and Development (<a href="https://www.oecd.org/gov/ethics/ethicscodesandcodesofconductinoecdcountries.htm">OECD</a>) – have prominently set out ethical principles that can broadly apply. Nonetheless, a lot can go wrong.</p>
<section id="bias-in-bias-out" class="level2">
<h2 class="anchored" data-anchor-id="bias-in-bias-out">Bias in bias out</h2>
<p>In 2016, when ProPublica launched an investigation into potential biases in a ‘risk assessment’ algorithm used by the US criminal justice system, it was the first independent investigation of its kind. This was despite the widespread use of the algorithm and its power to influence a judge’s sentence, in one instance <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">doubling the duration while increasing the severity</a> of the imprisonment. On examining 7,000 risk assessment scores and the records detailing whether the subjects of those scores had reoffended in the subsequent two years, ProPublica found: “Only 20 percent of the people predicted to commit violent crimes actually went on to do so”. Even when the full range of crimes was taken into account, “the algorithm was somewhat more accurate than a coin flip” at 61%. Part of the enthusiasm for these algorithms had been the expectation that they might bypass the prejudices and unconscious biases of human judges, enabling fairer justice. However, while many might baulk at the thought of tossing a coin to determine someone’s prison sentence, it turns out this might be a fairer approach than the algorithm, which was found to “falsely flag black defendants as future criminals” at twice the rate of white defendants.</p>
<div id="Eleanor_Roosevelt-991724" class="quarto-figure quarto-figure-center anchored">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/images/Eleanor_Roosevelt-991724.jpg" class="img-fluid figure-img" alt="Eleanor Roosevelt reads the Universal Declaration of Human Rights in 1949; FDR Presidential Library &amp; Museum 64-165 CC-BY-2.0"></p>
<figcaption>Eleanor Roosevelt reads the Universal Declaration of Human Rights in 1949; FDR Presidential Library &amp; Museum 64-165 CC-BY-2.0</figcaption>
</figure>
</div>
<p>Since ProPublica’s investigation there have been multiple reports highlighting problems with algorithms trained on historic data for use in the criminal justice system. The risk illustrated here, which can be generalised, is that such algorithms will tend to propagate social biases. In this case it means that those from ethnic minorities and lower socioeconomic backgrounds receive harsher sentences. Compounding the problem was the proprietary nature of the algorithms involved, which made it difficult to launch independent investigations. However, in the case of the algorithm investigated by ProPublica, the input data, which is taken from questions put to the defendant and their prison records, did provide clues as to the scope for unfair outcomes. Although race is not explicitly identified, it likely correlates with other data that is used as input. This means that the outcomes would be biased with respect to race all the same. A lot more work is needed to mitigate the effects of historical social injustices in how the criminal justice system uses data. Innovators in this area need to have confidence in what will be affected by their evidence base, as well as support from independent legal and ethical reviewers, and from regulators, to determine what will make a good innovation, and what will not.</p>
</section>
<section id="consent-human-rights-and-data-provenance" class="level2">
<h2 class="anchored" data-anchor-id="consent-human-rights-and-data-provenance">Consent, human rights, and data provenance</h2>
<p>The testing and training of AI algorithms can also run into other ethical questions about the ratio of public to private benefits from data and who governs those benefits. On the eve of the UK’s AI Summit in 2023, <a href="https://www.linkedin.com/pulse/ai-beneath-surface-pivotal-role-data-smart-data-research-uk-betwe/">Joe Cuddeford of Smart Data Research UK wrote</a>: “Many AI systems rely on data collected passively from individuals, raising questions about transparency, privacy, and who benefits from these data-driven advancements.”</p>
<p>Large scale AI models, such as generative AI models, are <a href="https://www.washingtonpost.com/technology/2023/10/25/data-provenance/">often trained on web-scraped data</a> from online platforms. This leads to questions about the fairness of internet data, the ownership of it (e.g.&nbsp;<a href="https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html">potential violation of copyright law</a>), and methods for users’ consent and human rights to be embedded and respected. There are, once again, questions about accuracy and bias: what do algorithms “learn” from data scraped from the internet, and is the information appropriately curated for use?</p>
<p>Similar civil liberties concerns arise when people are compelled to give up data about themselves by powerful arms of the state. For example, the national Facial Verification Testing program run by the National Institute of Standards and Technology, a part of the U.S. Department of Commerce, has held and <a href="https://slate.com/technology/2019/03/facial-recognition-nist-verification-testing-data-sets-children-immigrants-consent.html">made use of images of vulnerable individuals</a> to test and validate the performance of commercialised AI technologies. The data used by the agency for testing include ‘mugshots’ or facial images from arrests or from other encounters with law enforcement. <sup>1</sup> An additional programme focuses on testing the performance of facial recognition algorithms against an image database of sexually exploited children (<a href="https://www.nist.gov/programs-projects/chexia-face-recognition">CHEXIA-FACE</a>). Having statistics from this kind of agency testing has clear commercial benefit: it helped win the case for the vendors who could match those statistics when the <a href="https://www.met.police.uk/SysSiteAssets/media/downloads/force-content/met/advice/lfr/other-lfr-documents/lfr-accuracy-and-demographic-differential.pdf">London Metropolitan police purchased live facial recognition technology</a>. However, the interests of the people who have been documented do not come up for discussion in this form of data governance. There are many participatory methods that could be used for more ethical stewardship of the data that people are compelled to give. <sup>2</sup></p>
<p>To give minorities and vulnerable groups a part to play in data collection, data scientists can adopt strategies that consciously address bias in how data is collected. Eun Seo Jo (of Stanford University) and Timnit Gebru (formerly of Google) have suggested library and archival approaches. In “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning”, <sup>3</sup> they note that internet data is subject to historical and representative biases. Recognising and mitigating biases will “start with a statement of commitment to collecting the cultural remains of certain concepts, topics, or demographic groups.” A public mission statement, which highlights the interests of the communities and minorities they plan to support, “forces [archival] researchers to reckon with their data composition.”</p>
<p>These strategies also need to be supported by good management of data collection and curation. A 2021 report by the Royal Academy of Engineering, <a href="https://reports.raeng.org.uk/datasharing/implications-for-policy">Towards trusted data sharing: implications for policy and practice</a>, highlights that, to support the use of data for research, good data management must exist among data owners. Strong relationships with data owners, predicated on data quality and ethics, help researchers to specify what datasets they are looking for and how these can best be curated for AI purposes. Good data management helps not only AI developers but all potential users (as well as the public) to understand the scope and quality of what is being shared. “Defining the requirements for data quality, and ensuring these requirements are delivered, remains a central challenge.” (<a href="https://reports.raeng.org.uk/datasharing/implications-for-policy">RAE report</a>)</p>
<p>Advocates for accurate and fair data and machine learning have worked hard to clarify what good data management and sharing look like, which is cause for optimism. However, there is a sense that this is the area in which AI has the furthest to go, as currently available datasets fall far short of the standards their work recommends. Nonetheless, the rise of Trusted Research Environments, Data Safe Havens and other methods of transparency enables more AI innovators to disclose their sources without placing any of the personal information they use at further risk, as <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/ai-series-3.html">discussed previously in the AI series</a>. Leadership on ethical standards for data sharing may yet help to improve the <a href="https://oecd.ai/en/dashboards/ai-principles/P8">robustness, security, safety, and fairness of AI tools</a>, which the OECD advocates as key principles for AI.</p>
</section>
<section id="openness-explainability-and-the-scope-to-challenge-ai-decisions" class="level2">
<h2 class="anchored" data-anchor-id="openness-explainability-and-the-scope-to-challenge-ai-decisions">Openness, explainability, and the scope to challenge AI decisions</h2>
<p>A principle that many data science communities have been working towards is ensuring the transparency and explainability of AI (<a href="https://oecd.ai/en/dashboards/ai-principles/P7">OECD AI Principle</a>). In OECD parlance, that is in part “to ensure that people understand when they are engaging with [artificial intelligence] and can challenge outcomes.” Acknowledging that some AI applications make this disclosure harder and less appealing, the OECD suggests that the fact that AI is in use should be disclosed “with proportion to the importance of the outcome … so that consumers, for example, can make more informed choices”. The OECD emphasises the importance of the “explainability” of the algorithms, which it defines as “enabling people affected by the outcome of an AI system to understand how it was arrived at. … notably – to the extent practicable – the factors and logic that led to an outcome.”</p>
<p>The <a href="https://www.oii.ox.ac.uk/research/projects/a-fairwork-foundation-towards-fair-work-in-the-platform-economy/">tens of millions of digital ‘platform workers’</a> now living all over the world are a case in point for where explainability is needed. They perform short-term, freelance, or temporary work through digital platforms or apps in the “gig economy”. There is little transparency about how algorithms and AI influence outcomes for gig workers, or about whether platform algorithms contribute systematically to unfair outcomes. <a href="https://www.oii.ox.ac.uk/research/projects/a-fairwork-foundation-towards-fair-work-in-the-platform-economy/">Platform workers themselves have come together</a> to share their data and understand more about the outcomes of the algorithms, or AI, that are shaping their lives.</p>
<p>It follows that where the use of an AI system does not affect outcomes for people, there may be less of a demand to publicly justify how AI arrived at its outcomes. For example, where AI is used to simulate something, or to research a decision, rather than to make a decision, there could be less weight placed on explaining the model publicly.</p>
<div id="Aerial_view_of_Silion_Valley991" class="quarto-figure quarto-figure-center anchored">
<figure class="figure">
<p><img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/images/Aerial_view_of_Silicon_Valley991.jpg" class="img-fluid figure-img" alt="Aerial view of tech cluster in Silicon Valley, taken on 29 March 2013, courtesy of Patrick Nouhailler CC-BY-3.0"></p>
<figcaption>Aerial view of tech cluster in Silicon Valley, taken on 29 March 2013, courtesy of <a href="https://www.flickr.com/photos/patrick_nouhailler/">Patrick Nouhailler</a> CC-BY-3.0</figcaption>
</figure>
</div>
<p>François Candelon, Theodoros Evgeniou, and David Martens, writing for the Harvard Business Review have outlined that their preference is for <a href="https://hbr.org/2023/05/ai-can-be-both-accurate-and-transparent">accuracy as well as explainability</a>. Often, to strike this balance, they will prefer ‘white box’ models which are transparent and interpretable. But not always. “In [complex] applications such as face-detection for cameras, vision systems in autonomous vehicles, facial recognition, image-based medical diagnostic devices, illegal/toxic content detection, and most recently, generative AI tools like ChatGPT and DALL-E, a black box approach may be advantageous or even the only feasible option.”</p>
<p>Even where an algorithm is too large and complicated to be interpretable, work like that conducted by the Alan Turing Institute in <a href="https://www.turing.ac.uk/research/research-projects/project-explain">Project ExplAIn</a> finds ways of extracting some form of explanation, for instance by building explanatory layers into the model’s code. The case for opening up AI in this way has to be balanced against concerns for intellectual property, information security and privacy: there can be <a href="https://www.tripwire.com/state-of-security/ai-transparency-why-explainable-ai-essential-modern-cybersecurity">cybersecurity issues</a> with making the different layers of an AI model more open to interrogation. Nonetheless, experiments with transparent and explainable models enable developers to advance their understanding of AI, and to consider whether its use for decision-making is ethically sound. The OECD principles make clear that AI must not elude human insight, checks and balances. As Andrew Ng highlighted in the <a href="https://youtu.be/nIIPMmZaK-s?si=T6ahpP6R1QjuUIsq">RSS fireside chat in 2021</a>: “AI is increasing concentration of power like never before…governments and regulators need to look at that and think of what to do.”</p>
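<p>To make the idea of extracting explanations more concrete, below is a minimal sketch of one widely used post-hoc technique: fitting a small, interpretable “white box” surrogate model to mimic a black-box model’s predictions. The dataset, models and parameters here are illustrative assumptions, not the methods of Project ExplAIn or of the authors cited above.</p>
<pre><code># A minimal "global surrogate" sketch: approximate a black-box model with a
# shallow decision tree trained on the black box's *predictions*, then read
# the tree as a human-interpretable summary of the model's logic.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))   # fit to predictions, not true labels

# Fidelity: how often the surrogate agrees with the black box.
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"Surrogate fidelity: {fidelity:.2%}")
print(export_text(surrogate, feature_names=list(X.columns)))</code></pre>
<p>A high-fidelity surrogate offers affected people and reviewers a tractable, if approximate, account of the factors and logic behind an outcome; low fidelity signals that any simple explanation of the black box may be misleading.</p>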
</section>
<section id="appropriate-human-centred-governance" class="level2">
<h2 class="anchored" data-anchor-id="appropriate-human-centred-governance">Appropriate, human-centred governance</h2>
<p>When school exams in England were cancelled during the Covid-19 pandemic, the government’s Department for Education decided that an algorithm should be used to allot grades to A-Level students, partly as a measure to counter grade inflation (a trend in which the grades awarded for the same standard of work tend to rise year on year). Algorithms had been used in previous years to adjust the marks awarded for exams and coursework. Here, instead of exams and coursework, the input data was drawn from Ofqual’s historical records of how each school’s pupils had performed in previous years, together with grade estimates from teachers. Efforts had been made at transparency about how the new algorithm would arrive at its decisions (it was a relatively simple, white-box algorithm), but ‘outliers’ were acknowledged in the model even prior to deployment. Coupled with the widespread downgrading of teacher-estimated grades to fit a curve that would avoid grade inflation, there was no clear process by which students and schools could appeal their grades. <a href="https://www.theguardian.com/education/2021/feb/18/the-student-and-the-algorithm-how-the-exam-results-fiasco-threatened-one-pupils-future">Dissatisfaction with the grades awarded</a> in the absence of exams or coursework was rife, as young people regarded as academically talented by their schools fell short of the grades their teachers had predicted, and lost university places.</p>
<p>In the resulting furore, the Department for Education determined that its original policy was wrong and adopted the teacher-estimated grades, with an appeal process in place. The incident demonstrates that functional transparency of an algorithm is only one step in due process. A policy may be controversial precisely because it uses an algorithm to apportion losses across a population (e.g.&nbsp;to try to reduce grade inflation) in ways that individuals find abhorrent.</p>
<p>Vested interests also surfaced during the investigation of an algorithm brought into use to <a href="https://orientblackswan.com/details?id=9789352875429#:~:text=Dissent%20on%20Aadhaar%20argues%20that,surveillance%20and%20commercial%20data%2Dmining.">tackle fraud in India’s welfare system</a>. From 2014 to 2019, the government of Telangana “<a href="https://www.aljazeera.com/economy/2024/1/24/how-an-algorithm-denied-food-to-thousands-of-poor-in-indias-telangana">cancelled more than 1.86 million existing food security cards</a> and rejected 142,086 fresh applications without any notice,” Al Jazeera reported in January 2024. Despite the government’s initial claims that the cancelled food security cards were fraudulent, critical data scholarship in India and elsewhere has established discrepancies and errors in the algorithms used, such as confusing the records of a valid claimant with those of a car-owning citizen of the same name. (Under the government’s policies, SUV owners cannot receive food aid.) Further investigation revealed that at least 7.5 per cent of the food security cards were wrongly cancelled. The investigations highlight a common problem: a focus on reducing the costs of welfare programmes tends to lead services to generate false positives (genuine claimants wrongly flagged as fraudulent) rather than to worry about false negatives. Efforts to correct sloppy data may therefore meet resistance if they lead to fewer “frauds” being identified, even when citizens bring evidence to challenge a decision.</p>
<p>There is a similar example in the <a href="https://www.computerweekly.com/feature/Post-Office-Horizon-scandal-explained-everything-you-need-to-know">UK’s Post Office scandal</a>, in which many sub-postmasters were wrongfully prosecuted for false accounting after the Post Office adopted accounting software that contained significant bugs, which were covered up for many years. It likewise shows how far organisations can pursue wrongful judgements, and how life-changing the consequences can be.</p>
<p>The <a href="https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai">EU’s new AI Act</a> advocates a risk-based approach, to balance the desire to minimise the burden of compliance while ensuring the safety of people who may be affected by the implementation of AI algorithms. Systems assessed as <a href="https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai">high risk according to specific criteria</a> are then “subject to strict obligations before they can be put on the market”.</p>
<p>Governments across the industrialised world have raised their hopes for AI that will help drive increases in productivity, and do so safely: fairly constructed, making use of legitimate data sources, and with fair outcomes for society. The work of data scientists is integral to the foundations on which AI can be used for social good, from establishing protocols for data management and sharing, to understanding the workings of complex algorithms and the use of large, unstructured data sources. Data scientists and researchers are getting closer to understanding what good looks like, not just in terms of the ethical values to uphold but in the technicalities of the code and data involved. However, upholding the ideal of ‘AI ethics’ takes sustained effort beyond data work alone: support for well-established ethical and legal rights and principles, meaningful involvement of the people affected by AI use in the relevant policies, and development of data governance and infrastructure. It is always possible that, in working on AI ethics, we find fairer and more ethical approaches that should precede the use of AI.</p>
<p>“AI development raises a range of ethical questions for data practitioners, whether they are data scientists, econometricians, analysts, or statisticians,” Daniel Gibbons, Vice Chair of the Royal Statistical Society’s Data Ethics and Governance Section, told <em>Real World Data Science</em>. Today, many data scientists would urge that ethical considerations precede the development of an AI algorithm and inform its design and use, particularly for processes that significantly affect people, to ensure it does not propagate errors and injustices.</p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Olivia Varley-Winter</strong> Olivia is an experienced policy manager who has worked for the Royal Statistical Society, the Open Data Institute, Open Data Charter, the Nuffield Foundation, and the Alan Turing Institute. She was part of the Ada Lovelace Institute’s founding team from 2018 to 2020 and has since supported the development of other policy-related programmes and partnerships relating to data, AI and ethics. She is presently working for Smart Data Research UK on matters pertaining to ethics and responsible data governance. She has an MSc in Nature, Society, and Environmental Policy from the University of Oxford.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. <!-- Add thumbnail image credit and any licence terms here --></p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Varley-Winter, O. 2024. “On AI ethics - influencing its use in the delivery of public good.” Real World Data Science, May 14, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/ai-series-2.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Grother, P., Ngan, M. &amp; Hanaoka K. Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects NISTIR 8280 (2019) https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8280.pdf↩︎</p></li>
<li id="fn2"><p>Participatory data stewardship (2021) Ada Lovelace Institute https://www.adalovelaceinstitute.org/wp-content/uploads/2021/11/ADA_Participatory-Data-Stewardship.pdf ↩︎</p></li>
<li id="fn3"><p>Jo, E. S. &amp; Gebru T. Lessons from Archives: Strategies for Collecting SocioculturalData in Machine Learning Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020) https://dl.acm.org/doi/epdf/10.1145/3351095.3372829↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>AI ethics</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/ai-series-2.html</guid>
  <pubDate>Tue, 14 May 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/14/images/Eleanor_Roosevelt-991724.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>AI series: Healthy datasets for optimised AI performance</title>
  <dc:creator>Fatemeh Torabi, Lewis Hotchkiss, Emma Squires, Prof. Simon E. Thompson and Prof. Ronan A. Lyons</dc:creator>
  <link>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/ai-series-3.html</link>
  <description><![CDATA[ 





<p>In Charles Babbage’s <a href="https://www.gutenberg.org/files/57532/57532-h/57532-h.htm">Passages from the Life of a Philosopher</a>, he recalls two incidents in which he was asked, “Pray, Mr.&nbsp;Babbage, if you put into the machine wrong figures, will the right answers come out?” Reflecting on these incidents he comments, “I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” Accurate, clean data is likewise at the core of a functional AI model. However, ensuring the accuracy of input data, so that no “wrong figures” creep into the datasets used to train AI models, demands meticulous attention during data wrangling, cleaning, and curation.</p>
<p>This necessity is particularly pronounced when dealing with the vast datasets used to train the machine learning algorithms at the core of AI models. The predictive power of these models is highly dependent on the quality of the training data. <sup>1</sup> The most obvious errors, which often require meticulous attention at the data processing stages, are duplication, missingness and data imbalance (Figure&nbsp;1). Any of these errors can have multifaceted impacts in both the training and testing stages of the machine learning algorithms at the core of an AI model, and the challenge does not end there. The provenance, content, format and structure of the data require attention as well: even data that is essentially correct may be “wrong” for a particular dataset.</p>
<section id="obviating-the-obvious-errors" class="level2">
<h2 class="anchored" data-anchor-id="obviating-the-obvious-errors">Obviating the obvious errors</h2>
<p>Duplicated records can mask existing diversities within the data, diminishing the representativeness of important subgroups and leading to a biased training set and model outcomes. If duplication originates from data labelling issues, it can lead to fundamental challenges during the training of supervised models. <sup>2</sup> In healthcare data, this issue can arise when linking data across multiple sources where each source holds different labels for the same data. <sup>3</sup></p>
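<p>As a concrete illustration, here is a minimal Python sketch of deduplication during data curation. The records and column names are hypothetical, and real cross-source record linkage, where the same person carries different labels in different systems, needs dedicated tooling.</p>
<pre><code># Toy deduplication with pandas: exact duplicates first, then duplicates
# hiding behind inconsistently formatted labels.
import pandas as pd

records = pd.DataFrame({
    "patient_id": [101, 101, 102, 103],
    "name":       ["A. Jones", "a jones", "B. Patel", "C. Evans"],
    "diagnosis":  ["stroke", "stroke", "diabetes", "asthma"],
})

# Exact duplicates are the easy case (there are none in this toy data):
deduped = records.drop_duplicates()

# Near-duplicates need a normalised key before dropping:
records["name_key"] = records["name"].str.lower().str.replace(r"\W+", "", regex=True)
deduped = records.drop_duplicates(subset=["patient_id", "name_key", "diagnosis"])
print(deduped)</code></pre>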
<p>Missingness directly causes a loss of the information needed to train algorithms on various real-world scenarios. It is typically addressed via two primary routes: deletion of the rows containing missing values, or imputation. Deleting rows reduces the sample size and can introduce bias. When it comes to health data, for instance, using electronic medical records from a single provider, such as general practice, may give rise to a lot of missingness in other aspects of an individual’s health, such as their hospital records, pathology testing data or medical imaging. On the one hand, structured missingness can serve as an informative feature to explore within the data. On the other, where missing data pixelates the comprehensive health picture we are attempting to construct, it often conceals an underlying narrative. <sup>4</sup> For instance, the COVID-19 response involved many initiatives across the AI community; however, during the early stages of the pandemic, the partial availability of data pixelated the picture and hampered models’ predictive ability, resulting in minimal improvement according to a report from the Alan Turing Institute, the UK’s national institute for data science and AI. <sup>5</sup></p>
<p>Imputing missing values can preserve the whole sample. However, noise introduced during the imputation process may compromise the quality of fitted models, depending on the proportion of missing records. One way to offset this may be to use ever-larger datasets that present a fuller training picture to the algorithm, just as adding dots to a pixelated image makes what is depicted clearer.</p>
<p>It is often claimed that some models, particularly deep learning models such as neural networks, can handle missing values themselves, without explicit imputation or deletion. <sup>6</sup> Is this really true in a real-world scenario? Are these models advanced enough to achieve optimised performance even when data quality is not optimal? <sup>7</sup> Köse et al.&nbsp;(2020) investigated the effect of two conventional imputation approaches on the performance of deep learning models: multiple imputation by chained equations (MICE) <sup>8</sup> and factor analysis of mixed data (FAMD). <sup>9</sup> Their study endorsed the use of such explicit imputation approaches, showing an enhancement in model performance. <sup>10</sup></p>
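<p>By way of illustration, the following minimal sketch applies scikit-learn’s IterativeImputer, a chained-equations (MICE-style) imputer, to toy data with values knocked out at random. The data and parameters are our own assumptions, not those of the studies cited.</p>
<pre><code># Explicit imputation with a MICE-style imputer before model training.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]          # correlation gives the imputer signal to use

mask = rng.random(X.shape) &lt; 0.2  # remove ~20% of values at random
X_missing = X.copy()
X_missing[mask] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

print("RMSE of imputed values:",
      np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2)))</code></pre>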
<p>Data imbalance arises within datasets that hold a disproportionate amount of information about a specific aspect. When imbalanced data, rich in focused information, is used to train an AI model, the model becomes adept at learning about that specific aspect but may struggle to generalise its findings to diverse scenarios. This fosters overfitting, where the model predicts accurately on the training data but loses accuracy on any new test dataset.</p>
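<p>The sketch below illustrates one simple counter-measure on synthetic imbalanced data: class weighting, which penalises errors on the minority class more heavily. The 85/15 split loosely echoes the stroke-type imbalance discussed below; resampling methods are another common option.</p>
<pre><code># Counteracting class imbalance with class weighting in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency,
# so the minority class is not drowned out during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))</code></pre>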
<div id="fig-1" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Stages involved in AI model development" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/images/Stages-of-model-development-724.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Stages involved in AI model development">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Stages involved in AI model development.
</figcaption>
</figure>
</div>
<p>Overfitting severely undermines the predictive performance of AI models on data beyond their training set, defeating the primary purpose of these models. For instance, of all strokes that occur, approximately 85% are ischaemic, caused by blockage of the blood supply to part of the brain, and 15% are haemorrhagic, caused by bleeding into the brain. The development of machine-learning and AI-based stroke prediction models is therefore affected by this natural imbalance between the two types of stroke. <sup>11</sup> The same kind of imbalance exists in population-wide studies, where stroke itself is present in only a small subsection of a healthy population. <sup>12</sup></p>
<p>The circumstances of data collection can lead to bias, so care is needed at the early stages to ensure that datasets are representative of the real world. These types of error can be picked up at an initial quality assurance (QA) stage, conducted to reveal any unexpected errors in data used by AI models. QA checks often involve basic checks on the values presented, to ensure they are of the right data type, fall within the expected range, and have the expected temporal coverage.</p>
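<p>To make this concrete, here is a minimal sketch of such QA checks with pandas; the columns, expected range, and study window are hypothetical.</p>
<pre><code># Basic QA: data types, value ranges, and temporal coverage.
import pandas as pd

# Hypothetical cohort extract; in practice this would come from source data.
df = pd.DataFrame({
    "age": [34, 152, 47],  # 152 should trip the range check
    "visit_date": pd.to_datetime(["2019-05-01", "2021-07-14", "1998-03-02"]),
})

problems = []
if not pd.api.types.is_numeric_dtype(df["age"]):
    problems.append("age is not numeric")
if not df["age"].between(0, 120).all():
    problems.append("age outside expected range 0-120")
window = (pd.Timestamp("2005-01-01"), pd.Timestamp("2024-12-31"))
if df["visit_date"].min() &lt; window[0] or df["visit_date"].max() &gt; window[1]:
    problems.append("visit_date outside expected study window")

print("QA issues:", problems or "none")</code></pre>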
<p>Finally, the choice of features included in a dataset requires consideration, since this too can have implications for an algorithm. Taking another example from the COVID-19 pandemic, a group of researchers trained an algorithm on radiological imaging of COVID-19 patients in which the position of the patient during radiology was present as a feature. Because more severe cases were lying down and less severe cases were standing up, this feature produced an algorithm that predicted COVID-19 risk from the position of the patient. Here, although the data included was correct, its inclusion in the dataset proved to be “wrong”.</p>
</section>
<section id="things-get-complicated" class="level2">
<h2 class="anchored" data-anchor-id="things-get-complicated">Things get complicated</h2>
<p>When AI algorithms encounter complex, unstructured data, the task of quality assurance suddenly balloons beyond tackling the three main errors highlighted above. Such circumstances require some kind of quality enhancement procedure, in which datasets in the form of images or unstructured text go through a curation process that enhances their quality and standardises their format to the level required for integration into AI algorithms. This standardisation of data is paramount across various domains, especially in healthcare, where complex, unstructured health data holds transformative potential for AI-driven advances in the diagnosis, treatment, and prognosis of diseases. From electronic health records to magnetic resonance imaging (MRI) scans and genetic sequences, this data offers a wealth of insights for AI models to learn from. Adopting standardised formats not only facilitates the seamless integration of diverse datasets but also streamlines the development and deployment of AI models. Unlocking this potential, however, requires a strong foundation of high-quality, processed data, and that begins with standardisation.</p>
<p>One category of complex health data is neuroimaging, of which a prime example is MRI. Different institutions often employ different acquisition protocols and different ways of collecting and storing neuroimaging data. Not only can this make the data very difficult to integrate into existing workflows and processing pipelines, it also makes it challenging to understand, compare and combine with other datasets. To address these challenges, the neuroimaging community has adopted the Brain Imaging Data Structure (BIDS) <sup>13</sup> – a standardised format for organising and naming MRI data which allows compliant data to integrate smoothly with existing workflows and AI models, streamlining processing and analysis. By embracing standardisation, we pave the way for common processing tools that can generate AI-ready data.</p>
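<p>To give a flavour of what naming standardisation involves, here is a simplified Python sketch that checks anatomical MRI filenames against a BIDS-like pattern. The regex is a toy assumption covering only a few entities; the official bids-validator tool does the real job.</p>
<pre><code># Toy check of BIDS-style anatomical filenames, e.g. sub-01_ses-01_T1w.nii.gz
import re

BIDS_ANAT = re.compile(
    r"^sub-[a-zA-Z0-9]+"          # subject label (required)
    r"(_ses-[a-zA-Z0-9]+)?"       # optional session
    r"(_acq-[a-zA-Z0-9]+)?"       # optional acquisition label
    r"_(T1w|T2w|FLAIR)"           # modality suffix
    r"\.nii(\.gz)?$"              # NIfTI extension
)

for name in ["sub-01_ses-01_T1w.nii.gz", "patient1_scan.nii.gz"]:
    status = "OK" if BIDS_ANAT.match(name) else "not BIDS-compliant"
    print(f"{name}: {status}")</code></pre>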
<p>Next comes pre-processing. Sticking with the neuroimaging example, MRI scans are susceptible to various forms of noise and artifacts, which can appear as blurring or distortions and which, without proper processing, can be misinterpreted by AI models. Pre-processing typically includes steps for spatial normalisation and image registration: aligning brain images from different individuals into a common reference model, and aligning different images of the same subject to a common template. This standardisation facilitates inter-subject and inter-study comparisons, enabling AI models to generalise effectively across diverse datasets. However, the multi-layer nature of this process means that aligning data to a common template depends on the choice of template, which can itself introduce bias if the template brain does not accurately reflect the patient’s anatomy (due to age, ethnicity, or disease, for example).</p>
<div id="fig-2" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Neuroimaging data" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/images/neuroimaging.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Neuroimaging data">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Neuroimaging data.
</figcaption>
</figure>
</div>
<p>Once pre-processing is complete, you may want to combine datasets to increase the sample size for your AI model. This is where harmonisation techniques <sup>14</sup> <sup>15</sup> come in, to deal with inconsistencies and variations in acquisition which can add noise and bias to a model. A typical technique for harmonisation in neuroimaging, known as ComBat, <sup>16</sup> works by modelling the data using a hierarchical Bayesian model, followed by empirical Bayes estimation to infer the distribution of the unknown site effects. The method is borrowed from genomics but is applicable wherever multiple features of the same type are collected for each participant, whether expression levels for genes or imaging-derived measures such as volumes of different brain regions. This is a crucial step when combining datasets: it enables AI models to focus on learning the actual relationships within the data rather than struggling with inconsistencies across datasets, and it leads to models that generalise better on unseen data.</p>
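<p>To convey the intuition, below is a deliberately simplified sketch of the location-and-scale adjustment at the heart of ComBat. Full ComBat additionally pools estimates across features with empirical Bayes and preserves covariates of interest, so a dedicated implementation should be used for real analyses.</p>
<pre><code># Naive per-site harmonisation: remove each site's (batch's) shift in mean
# and scale for every feature, mapping all sites onto the pooled distribution.
# This omits ComBat's empirical Bayes pooling and covariate preservation.
import numpy as np

def naive_harmonise(data: np.ndarray, batch: np.ndarray) -&gt; np.ndarray:
    """data: (n_subjects, n_features); batch: (n_subjects,) site labels."""
    out = data.astype(float).copy()
    grand_mean = data.mean(axis=0)
    grand_std = data.std(axis=0)
    for b in np.unique(batch):
        rows = batch == b
        site = data[rows]
        # Standardise within the site, then re-express on the pooled scale:
        out[rows] = (site - site.mean(axis=0)) / site.std(axis=0)
        out[rows] = out[rows] * grand_std + grand_mean
    return out</code></pre>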
</section>
<section id="feeding-hungry-algorithms" class="level2">
<h2 class="anchored" data-anchor-id="feeding-hungry-algorithms">Feeding hungry algorithms</h2>
<p>The public good is at the heart of AI-driven approaches: the aim is to develop models with optimised predictive ability that can be generalised to many scenarios. Achieving this requires a large and diverse training source, which is why these are often referred to as data-hungry algorithms. To provide large amounts of enriched training data for optimised model development, two main approaches have been explored: federated analytics and synthetic data.</p>
<p>Federation is when data from multiple sources is made available for the training and analysis of models designed to run on data that is not held in a single place, known as distributed models. It provides the opportunity to test algorithms in different populations and settings to ensure generalisability. In the context of patient-level health data, the data is often held institutionally. Enabling federation and trustworthy sharing of these datasets requires extensive attention to governance models, and a common governance model between the organisations involved is a known catalyst of this process. <sup>17</sup> <sup>18</sup></p>
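<p>A minimal sketch of the idea, under the assumption of three hypothetical data holders: each site fits a model locally and shares only its coefficients, which a coordinating server combines as a size-weighted average. Real federated systems iterate this over many rounds and add privacy safeguards on top.</p>
<pre><code># One-shot federated averaging of locally trained model coefficients.
# Raw records never leave a site; only fitted parameters are shared.
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_fit(X, y):
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return np.hstack([model.coef_.ravel(), model.intercept_]), len(y)

rng = np.random.default_rng(0)
true_beta = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
sites = []
for n in (300, 500, 200):                    # three hypothetical data holders
    X = rng.normal(size=(n, 5))
    y = (X @ true_beta + rng.normal(size=n) &gt; 0).astype(int)
    sites.append(local_fit(X, y))

weights = np.array([n for _, n in sites], dtype=float)
params = np.array([p for p, _ in sites])
global_params = (params * weights[:, None]).sum(axis=0) / weights.sum()
print("Federated coefficients:", global_params.round(2))</code></pre>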
<p>Generating synthetic data <sup>19</sup> from original data sources is a resource-intensive process. It requires developing models on the real data to learn the patterns, formats, and statistical properties within it, from which synthetic versions of the data can then be generated. When working with sensitive data such as health records, keeping patient data safe and secure is covered by information governance; depending on how close the synthetic data is to the original, the same governance level may still apply when bringing individual or patient data from multiple sources together. A suggested solution to the governance challenges of synthetic data is to use a <a href="https://www.adruk.org/news-publications/news-blogs/accelerating-public-policy-research-with-easier-safer-synthetic-data/">low-fidelity version</a> of the original data, meaning that a level of bias has been added during the synthesis process to ensure the safety and security of individual-level data. <sup>20</sup> While low-fidelity data sources are generated from real data, it is worth noting that the rise of generative AI also raises a concern about data pollution, particularly where AI tools such as Gretel.ai <sup>21</sup> are used to generate synthetic data which may in turn be used to train AI models – the problematic case of AI training AI!</p>
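<p>As a toy illustration of what “low fidelity” can mean, the sketch below resamples each column of a (hypothetical) real dataset independently: marginal distributions are preserved while cross-column correlations are deliberately broken, one crude way of trading fidelity for lower disclosure risk. Production synthetic-data generators use far more sophisticated models.</p>
<pre><code># Low-fidelity synthesis: independent resampling of each column's values.
import numpy as np
import pandas as pd

def low_fidelity_synthetic(real: pd.DataFrame, n: int, seed: int = 0) -&gt; pd.DataFrame:
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        col: rng.choice(real[col].to_numpy(), size=n, replace=True)
        for col in real.columns
    })

real = pd.DataFrame({"age": [34, 71, 52, 45, 60],
                     "sbp": [118, 150, 135, 122, 141]})
synth = low_fidelity_synthetic(real, n=1000)
print(synth.describe())        # marginals resemble the real data
print(synth.corr().round(2))   # correlation structure is destroyed</code></pre>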
<p>When using patients’ sensitive health data, a further layer is in place to ensure secure access: Trusted Research Environments (<a href="https://www.hdruk.ac.uk/access-to-health-data/trusted-research-environments/">TREs</a>), secure technology infrastructures which play a crucial role in consolidating disparate data collections into a centralised repository and facilitating researcher access to data for scientific exploration. However, integrating data from various sources into AI models poses challenges, because differences in data collection methods and formats hinder computational analysis. In response, the FAIR (Findable, Accessible, Interoperable, Reusable) data principles were introduced in 2016 to enhance the reusability of scientific datasets by humans and machines. <sup>22</sup> Adopting FAIR principles within TREs ensures well-documented, curated, and harmonised datasets, addressing issues raised above such as duplicated records and missing data. <sup>23</sup> Additionally, preprocessing pipelines within TREs streamline data standardisation, creating “AI research-ready” datasets. <sup>24</sup></p>
<p>Access to real-world healthcare data remains challenging, prompting the development of AI models on open-source or synthetic datasets. However, these models often exhibit performance discrepancies when applied to real-world data. <sup>25</sup> It is therefore imperative to provide researchers with secure access to real-world healthcare data within TREs, bolstered by robust governance and support mechanisms. Initiatives like the GRAIMATTER study <sup>26</sup> and AI risk evaluation workshops <sup>27</sup> exemplify efforts to facilitate AI model development and translation from TREs to clinical settings. By establishing governance guidelines and promoting FAIR datasets, TREs aim to become important resources for the AI research community. Providing standardised, curated, data-rich repositories on which AI models can be developed is a top priority for UK TREs. Given their well-defined and secure governance environments, TREs may also provide the basis for federated data analysis, allowing researchers to combine datasets across TREs and data environments. In this way they can provide the large numbers that data-hungry algorithms require, while avoiding the wide-ranging and myriad ways that data in a specific dataset can be “wrong”.</p>
</section>
<section id="also-in-the-ai-series" class="level2">
<h2 class="anchored" data-anchor-id="also-in-the-ai-series">Also in the AI series:</h2>
<p><a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/04/22/ai-series-1.html">What is AI? Shedding light on the method and madness in these algorithms</a> <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/04/29/gen-ai-human-intel.html">Generative AI models and the quest for human-level artificial intelligence</a></p>
<div class="article-btn">
<p><a href="../../../../../foundation-frontiers/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Fatemeh Torabi</strong> is Senior Research Officer and Data Scientist, at Swansea University and works on Health Data Science and Population Data Science for the Dementias Platform UK.
</dd>
<dd>
<strong>Lewis Hotchkiss</strong> is a Research Officer in Neuroimaging at Swansea University and works on Population Data Science for the Dementias Platform UK.
</dd>
<dd>
<strong>Emma Squires</strong> is the Data Project Manager for Dementias Platform UK, based at Swansea University, and works on Population Data Science.
</dd>
<dd>
<strong>Prof.&nbsp;Simon E. Thompson</strong> is Deputy Associate Director of the Dementias Platform UK.
</dd>
<dd>
<strong>Prof.&nbsp;Ronan A. Lyons</strong> is the Associate Director of the Dementias Platform UK based at Swansea University and works on Population Data Science.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Royal Statistical Society
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. <!-- Add thumbnail image credit and any licence terms here --></p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Torabi, Fatemeh, Hotchkiss, Lewis, Squires, Emma, Thompson, Simon E. and Lyons, Ronan A. 2024. “Getting the data right for optimised AI performance.” Real World Data Science, May 7, 2024. <a href="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/ai-series-3.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Li, P. et al.&nbsp;CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]↩︎</p></li>
<li id="fn2"><p>Azeroual, O. et al.&nbsp;A Record Linkage-Based Data Deduplication Framework with DataCleaner Extension. Multimodal Technol. Interact. 2022, Vol. 6, Page 27 6, 27 (2022).↩︎</p></li>
<li id="fn3"><p>Rajpurkar, P., Chen, E., Banerjee, O. &amp; Topol, E. J. AI in health and medicine. Nat. Med. 2022 281 28, 31–38 (2022).↩︎</p></li>
<li id="fn4"><p>Mitra, R. et al.&nbsp;Learning from data with structured missingness. (2023).↩︎</p></li>
<li id="fn5"><p>Alan Turing Institution. Data science and AI in the age of COVID-19. 2020 https://www.turing.ac.uk/sites/default/files/2021-06/data-science-and-ai-in-the-age-of-covid_full-report_2.pdf↩︎</p></li>
<li id="fn6"><p>Han, J. &amp; Kang, S. Dynamic imputation for improved training of neural network with missing values. Expert Syst. Appl. 194, 116508 (2022).↩︎</p></li>
<li id="fn7"><p>Köse, T. et al.&nbsp;Effect of Missing Data Imputation on Deep Learning Prediction Performance for Vesicoureteral Reflux and Recurrent Urinary Tract Infection Clinical Study. Biomed Res. Int. 2020, (2020).↩︎</p></li>
<li id="fn8"><p>Azur, M. J., Stuart, E. A., Frangakis, C. &amp; Leaf, P. J. Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20, 40–49 (2011).↩︎</p></li>
<li id="fn9"><p>Audigier, V., Husson, F. &amp; Josse, J. A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 10, 5–26 (2016).↩︎</p></li>
<li id="fn10"><p>Köse, T. et al.&nbsp;Effect of Missing Data Imputation on Deep Learning Prediction Performance for Vesicoureteral Reflux and Recurrent Urinary Tract Infection Clinical Study. Biomed Res. Int. 2020, (2020).↩︎</p></li>
<li id="fn11"><p>Liu, T., Fan, W. &amp; Wu, C. A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset. Artif. Intell. Med. 101, 101723 (2019).↩︎</p></li>
<li id="fn12"><p>Kokkotis, C. et al.&nbsp;An Explainable Machine Learning Pipeline for Stroke Prediction on Imbalanced Data. Diagnostics 2022, Vol. 12, Page 2392 12, 2392 (2022).↩︎</p></li>
<li id="fn13"><p>Gorgolewski, K. J. et al.&nbsp;BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods. PLoS Comput. Biol. 13, (2017).↩︎</p></li>
<li id="fn14"><p>Bauermeister, S. et al.&nbsp;Research-ready data: the C-Surv data model. Eur. J. Epidemiol. 38, 179–187 (2023).↩︎</p></li>
<li id="fn15"><p>Abbasizanjani, H. et al.&nbsp;Harmonising electronic health records for reproducible research: challenges, solutions and recommendations from a UK-wide COVID-19 research collaboration. BMC Med. Inform. Decis. Mak. 23, 1–15 (2023).↩︎</p></li>
<li id="fn16"><p>Orlhac, F. et al.&nbsp;A Guide to ComBat Harmonization of Imaging Biomarkers in Multicenter Studies. J. Nucl. Med. 63, 172 (2022).↩︎</p></li>
<li id="fn17"><p>Toga, A. W. et al.&nbsp;The pursuit of approaches to federate data to accelerate Alzheimer’s disease and related dementia research: GAAIN, DPUK, and ADDI. Front. Neuroinform. 17, 1175689 (2023).↩︎</p></li>
<li id="fn18"><p>Torabi, F. et al.&nbsp;A common framework for health data governance standards. Nat. Med. 2024 1–4 (2024) doi:10.1038/s41591-023-02686-w.↩︎</p></li>
<li id="fn19"><p>Tucker, A., Wang, Z., Rotalinti, Y. &amp; Myles, P. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 2020 31 3, 1–13 (2020)↩︎</p></li>
<li id="fn20"><p>Tucker, A., Wang, Z., Rotalinti, Y. &amp; Myles, P. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 2020 31 3, 1–13 (2020)↩︎</p></li>
<li id="fn21"><p>Noruzman, A. H., Ghani, N. A. &amp; Zulkifli, N. S. A. Gretel.ai: Open-Source Artificial Intelligence Tool To Generate New Synthetic Data. MALAYSIAN J. Innov. Eng. Appl. Soc. Sci. 1, 15–22 (2021).↩︎</p></li>
<li id="fn22"><p>Wilkinson, M. D. et al.&nbsp;The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016 31 3, 1–9 (2016).↩︎</p></li>
<li id="fn23"><p>Chen, Y. et al.&nbsp;A FAIR and AI-ready Higgs boson decay dataset. Sci. Data 9, (2021).↩︎</p></li>
<li id="fn24"><p>Esteban, O. et al.&nbsp;fMRIPrep: a robust preprocessing pipeline for functional MRI. Nat. Methods 16, 111–116 (2018).↩︎</p></li>
<li id="fn25"><p>Alkhalifah, T., Wang, H. &amp; Ovcharenko, O. MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning. Artif. Intell. Geosci. 3, 101–114 (2022).↩︎</p></li>
<li id="fn26"><p>Jefferson, E. et al.&nbsp;GRAIMATTER Green Paper: Recommendations for disclosure control of trained Machine Learning (ML) models from Trusted Research Environments (TREs). doi:10.5281/ZENODO.7089491.↩︎</p></li>
<li id="fn27"><p>DARE UK Community Working Group - DARE UK. https://dareuk.org.uk/dare-uk-community-working-groups/dare-uk-community-working-group-ai-risk-evaluation-working-group/.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>Data management</category>
  <category>Data science education</category>
  <guid>https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/ai-series-3.html</guid>
  <pubDate>Tue, 07 May 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/foundation-frontiers/posts/2024/05/07/images/Stages-of-model-development-724.png" medium="image" type="image/png" height="105" width="144"/>
</item>
</channel>
</rss>
