The point of this article is to persuade you, in a mildly entertaining way, that developing a sovereign foundation AI model should be a priority for the NHS, professional bodies and patients, but that we need to get the research right.
Risk is personal
How do we move from treating disease to preventing disease? The traditional approach has been to publicise well-evidenced public health interventions: don’t smoke, drink less, eat vegetables, exercise, vaccinate, wear sunscreen. This is all very good advice at the population level, but for the individual it’s hard to know what to worry about and what to prioritise. I, being a clumsy man with bad ankles and a lack of spatial awareness, am at risk of going to A&E with (another) concussion. You will be different.
A little bit of history
Individualised risk models in healthcare are not new. Traditional statistical approaches have used tabular data to predict healthcare events and have done a good job. These models are converted into questionnaires that clinicians can use to make decisions based on your risk. If you have had the NHS Health Check, a clinician will have measured your blood pressure, cholesterol, height and weight, and asked a few questions about your medical history. They then feed this into a model, and the output is your risk of having a heart attack or stroke over the next ten years (1). There are also automated approaches built into the systems your GP uses that help stratify the population based on individual risk of things like frailty (2).
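To make that concrete, here is a minimal sketch of how a tabular risk model is applied: a handful of measurements go in, a single probability comes out. The variables and coefficients below are invented for illustration; this is not the published QRISK algorithm or any validated clinical tool.

```python
# A minimal sketch of applying a tabular, logistic-style risk model.
# All coefficients and variables are illustrative, not a validated tool.
import math

def ten_year_risk(age, systolic_bp, chol_hdl_ratio, smoker, diabetic):
    """Return an illustrative 10-year event risk from a logistic-style score."""
    log_odds = (
        -7.0                      # hypothetical baseline
        + 0.06 * age
        + 0.01 * systolic_bp
        + 0.20 * chol_hdl_ratio
        + 0.70 * smoker           # 1 if current smoker, else 0
        + 0.60 * diabetic         # 1 if diabetic, else 0
    )
    return 1.0 / (1.0 + math.exp(-log_odds))  # logistic link: log-odds to probability

print(f"Illustrative 10-year risk: {ten_year_risk(55, 140, 4.5, smoker=1, diabetic=0):.1%}")
```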
These kinds of models are usually based on a snapshot of data and require bespoke data pipelines and engineering to massage the data into the right shape for the model. This is wonderful news for data scientists and statisticians, as it leads to a proliferation of finely tuned models that can keep us in gainful employment for many years. However, each one has significant costs to develop, test, validate, deploy and integrate into clinical practice.
Another issue with these traditional models is that they squash a medical history into a single row of data for each patient, losing the chronology of a person’s health. Intuitively, we would expect the sequence of events to matter in predicting healthcare outcomes, and traditional approaches struggle to capture this.
Using a sequence of events to predict a sequence of events
Sequences of events are easier for data engineers too. It’s much simpler to join all the data together into a chronological sequence than to perform a series of complex aggregations and transformations for every model. The simpler the data engineering needed to create the inputs, the easier it is to scale, as you are making fewer assumptions about the data.
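As a rough sketch, with made-up table names, columns and codes, the engineering can be little more than stacking the event tables and sorting by date:

```python
# A minimal sketch of building per-patient event sequences from separate tables.
# Table names, columns and codes are made up for illustration.
import pandas as pd

diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "date": pd.to_datetime(["2019-03-01", "2021-06-12", "2020-01-09"]),
    "code": ["E11", "I10", "J45"],            # ICD-10-style diagnosis codes
})
prescriptions = pd.DataFrame({
    "patient_id": [1, 2],
    "date": pd.to_datetime(["2019-03-02", "2020-01-10"]),
    "code": ["metformin", "salbutamol"],
})

# No per-variable aggregation into one row per patient: just stack and sort by time.
events = pd.concat([diagnoses, prescriptions]).sort_values(["patient_id", "date"])
sequences = events.groupby("patient_id")["code"].agg(list)
print(sequences)
# patient 1 -> [E11, metformin, I10]; patient 2 -> [J45, salbutamol]
```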
So, if they are easier to engineer and they capture more information, why are sequences not the standard way of predicting health outcomes? Because modelling sequences is harder than modelling a row of data. As the model works through a sequence, it has to hold what it has seen so far in some form of memory so that it can accumulate the appropriate information. Models that could do this started appearing in the machine learning literature in the early 1990s (3), but for a long time we had neither the data, the computing power, nor quite the right kind of algorithms to make them useful. Today they have become feasible in healthcare thanks to the spread of electronic healthcare records, standardised codes for classifying events, and the arrival of the transformer model. Transformer models combine the ability to hold an internal “memory” of the sequence with the capacity to pay attention to different parts of the sequence, which basically makes them magic.
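For readers who want to see the shape of the thing, here is a minimal sketch in PyTorch of that idea: embed each coded event, let self-attention look back across the whole history, and score what comes next. The vocabulary, dimensions and architecture are toy choices for illustration, not a reproduction of any of the published EHR models mentioned below.

```python
# A toy next-event transformer over coded healthcare events (illustrative only).
import torch
import torch.nn as nn

VOCAB = 1000       # number of distinct event codes (diagnoses, drugs, procedures, ...)

class NextEventTransformer(nn.Module):
    def __init__(self, vocab=VOCAB, d_model=64, max_len=512):
        super().__init__()
        self.event_emb = nn.Embedding(vocab, d_model)   # one vector per event code
        self.pos_emb = nn.Embedding(max_len, d_model)   # where the event sits in the timeline
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)           # scores for the next event

    def forward(self, codes):                           # codes: (batch, seq_len) of event ids
        positions = torch.arange(codes.size(1), device=codes.device)
        x = self.event_emb(codes) + self.pos_emb(positions)
        # Causal mask: each position may only attend to events that came before it.
        mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1))
        return self.head(self.encoder(x, mask=mask))

model = NextEventTransformer()
history = torch.randint(0, VOCAB, (1, 10))   # a toy patient history of 10 coded events
logits = model(history)                      # (1, 10, VOCAB): next-event scores at each step
print(logits.shape)
```

Trained at scale on real coded histories, that same next-event objective is what allows one model to be reused across many prediction tasks.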
These models have demonstrated state-of-the-art accuracy in predicting future events from electronic patient histories. Examples for those interested in reading more include BEHRT (4), Med-BERT (5), TransformEHR (6) and the more recent generative transformer model ETHOS (7). These can be used for a range of healthcare prediction tasks whilst delivering state-of-the-art predictive accuracy. Again, magic.
A recent preprint (8) from Microsoft has also demonstrated that these EHR models behave in a similar way to the large language models behind tools like ChatGPT: their performance scales predictably with processing power, data and the size of the model. This means that more data will probably lead to a better model, and that we can optimise model performance for a given computational budget.
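To show what “optimising to a budget” looks like in practice, here is a toy illustration of fitting a scaling curve. The numbers are synthetic, not results from the preprint, and the functional form is simply the familiar power-law-plus-offset shape:

```python
# Toy scaling-law fit: loss ~ a * n^(-b) + c, then extrapolate to more data.
# All numbers are synthetic, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

n_patients = np.array([1e4, 3e4, 1e5, 3e5, 1e6])        # pilot training-set sizes (made up)
val_loss   = np.array([1.90, 1.62, 1.41, 1.28, 1.19])   # validation loss of each pilot run (made up)

(a, b, c), _ = curve_fit(power_law, n_patients, val_loss, p0=[10.0, 0.3, 1.0])
print(f"Extrapolated loss at 10 million patients: {power_law(1e7, a, b, c):.2f}")
```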
So what?
Why should you care about this? If we can take these architectures and train them on data at the scale of the NHS, then each individual patient could have a relatively accurate prediction of their most likely next healthcare events (9). It would be your medical history projected forward, providing a narrative that is easier to understand than a page of risk scores. It’s your potential medical future. This could help with changing behaviour to reduce future risk, something we all struggle with. I think of it as the medical version of the Ghost of Christmas Future, but using a chain of events rather than clinking ghost chains.
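Mechanically, “projecting forward” is just repeated next-event prediction: take the history, score the next event, sample one, append it and go again. The sketch below reuses the toy model and history from the transformer sketch above and is, again, purely illustrative rather than a clinical tool.

```python
# Sampling a possible future trajectory from the toy model defined earlier.
import torch

def project_forward(model, history, steps=5):
    """Append `steps` sampled future events to a (1, seq_len) tensor of event ids."""
    trajectory = history.clone()
    for _ in range(steps):
        logits = model(trajectory)[:, -1, :]                   # scores for the next event only
        probs = torch.softmax(logits, dim=-1)
        next_event = torch.multinomial(probs, num_samples=1)   # one plausible next event
        trajectory = torch.cat([trajectory, next_event], dim=1)
    return trajectory

future = project_forward(model, history, steps=5)   # `model` and `history` from the sketch above
print(future.tolist())
```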
We are already seeing heavy usage of publicly available large language models for healthcare. In a representative sample of Australians, 10% had used ChatGPT for medical advice, rising to 26% among 25-34 year olds (10); I assume the UK is similar. It seems that the public is much more ready than the health system to use these models, and regulation is struggling to keep up. For good reason: they may not actually help.
The underwhelming evidence
As of August 2024, the FDA had approved 950 AI-enabled medical devices, a significant proportion of them for clinical decision support, but only 2.4% of these are supported by randomised controlled trials (11).
This is important, as what works on a machine learning researcher’s infrastructure may not work in a clinical setting. In 2018, a comprehensive health economic evaluation of a risk prediction model for identifying people at risk of hospital admission found that those in the treatment arm had higher healthcare costs and that there was no significant impact on the number of people admitted to hospital, despite accurate predictions (12). Some prediction models even cause harmful self-fulfilling prophecies when used for decision making (the paper is well worth a read) (13).
The prize
The UK government is clear about its ambition for the country to be an “AI maker”, not an “AI taker”. Given the expected improvement in accuracy from scaling these EHR models, there is an opportunity for the UK to leverage what should be one of its greatest data assets (decades of longitudinal electronic healthcare records, from cradle to grave) and create a sovereign foundational model that supports patient care. These are being developed now in the US and elsewhere. A meta-analysis in 2023 found over 80 foundational healthcare models; there are many more today, and there is concern that at some point it will be cheaper for the NHS to bring one in and pay for it than to train its own.
Foresight
Fortunately, we have made some progress in the UK with NHS data. Foresight (14), a transformer model developed in London on data from 1.4 million patients, has demonstrated impressive results. The model has since been taken forward for COVID-19 research, to see whether the same approach can better predict disease and COVID-19 onset, hospitalisation and death for all individuals, across all backgrounds and diseases, using national data made available during the pandemic specifically for COVID-19 research. This is being done through the British Heart Foundation Data Science Centre’s collaboration with NHS England’s secure data environment (15).
However, just because we can do this does not mean that we should. Researchers need to be careful to stay within the bounds of their project and make extraordinary efforts to engage with the public. We have to ensure that our data is not being exploited inappropriately for commercial gain. The Royal College of General Practitioners has raised concerns that this model goes beyond what they agreed to. Professor Kamila Hawthorne, Chair of the Royal College of GPs, said: “As data controllers, GPs take the management of their patients’ medical data very seriously, and we want to be sure data isn’t being used beyond its scope, in this case to train an AI programme.” The project has been paused for the time being, despite being approved and specifically targeted at COVID-19 research.
The best model for predicting outcomes from COVID-19, or the risk factors involved, is likely to be a population-scale generative transformer model. This research will determine whether that hypothesis is true and whether this kind of data could provide more accurate predictions for patients. The NHS data and the model are kept inside a secure data environment with personal identifiers stripped out. No patient details are passed to researchers, and no data or code leaves that environment without explicit permission. This research seems like something we should do.
Despite the potential of AI-assisted clinicians for differential diagnosis (with recent evidence that they perform better than both clinicians alone and clinicians using search (16)), and the attractiveness of having your medical history and your medical future in your pocket, we are some way off this reality. The gap between research and demonstrating the cost-effectiveness of AI solutions in the real world is significant, but all the component parts needed to close it exist: the data, the models, the research capability and the political will.
We will get there. Foundational models in healthcare are no longer a theoretical possibility, but an imminent reality. The UK has a rare opportunity to lead, not follow, by building a sovereign AI model trained on NHS data to accelerate the transition from treating disease to preventing disease. To get there, we must confront hard questions about patient engagement and real-world benefit. But to stop research based solely on the sophistication of the method is to misunderstand the moment. I think patients expect us to do better.
References
1. Hippisley-Cox, J., Coupland, C.A.C., Bafadhel, M. et al. Development and validation of a new algorithm for improved cardiovascular risk prediction. Nat Med 30, 1440–1447 (2024). https://doi.org/10.1038/s41591-024-02905-y
2. Clegg A, Bates C, Young J, Ryan R, Nichols L, Ann Teale E, Mohammed MA, Parry J, Marshall T. Development and validation of an electronic frailty index using routine primary care electronic health record data. Age Ageing 45(3), 353–360 (2016). https://doi.org/10.1093/ageing/afw039
3. Elman, J.L. Finding structure in time. Cognitive Science 14, 179–211 (1990). https://doi.org/10.1016/0364-0213(90)90002-E
4. Li, Y., Rao, S., Solares, J.R.A. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020). https://doi.org/10.1038/s41598-020-62922-y
5. Rasmy, L., Xiang, Y., Xie, Z. et al. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 4, 86 (2021). https://doi.org/10.1038/s41746-021-00455-y
6. Yang, Z., Mitra, A., Liu, W. et al. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat Commun 14, 7857 (2023). https://doi.org/10.1038/s41467-023-43715-z
7. Renc, P., Jia, Y., Samir, A.E. et al. Zero shot health trajectory prediction using transformer. npj Digit. Med. 7, 256 (2024). https://doi.org/10.1038/s41746-024-01235-0
8. Zhang, S. et al. Exploring Scaling Laws for EHR Foundation Models (2025). arXiv:2505.22964v1
9. Grout R, Gupta R, Bryant R, Elmahgoub MA, Li Y, Irfanullah K, Patel RF, Fawkes J, Inness C. Predicting disease onset from electronic health records for population health management: a scalable and explainable Deep Learning approach. Front Artif Intell 6, 1287541 (2024). https://doi.org/10.3389/frai.2023.1287541
10. Ayre J, Cvejic E, McCaffery KJ. Use of ChatGPT to obtain health information in Australia, 2024: insights from a nationally representative survey. Med J Aust (2025). https://doi.org/10.5694/mja2.52598
11. Windecker D, Baj G, Shiri I, Kazaj PM, Kaesmacher J, Gräni C, Siontis GCM. Generalizability of FDA-Approved AI-Enabled Medical Devices for Clinical Use. JAMA Netw Open 8(4), e258052 (2025). doi: 10.1001
12. Snooks H et al. Predictive risk stratification model: a randomised stepped-wedge trial in primary care (PRISMATIC). Southampton (UK): NIHR Journals Library (2018). PMID: 29356470
13. van Amsterdam WAC, van Geloven N, Krijthe JH, Ranganath R, Cinà G. When accurate prediction models yield harmful self-fulfilling prophecies. Patterns 6(4), 101229 (2025). https://doi.org/10.1016/j.patter.2025.101229
14. Kraljevic Z et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit Health 6(4), e281–e290 (2024).
15. CVD-COVID-UK/COVID-IMPACT, Project CCU078: Foresight: a generative AI model of patient trajectories across the COVID-19 pandemic. https://bhfdatasciencecentre.org/projects/ccu078/
16. McDuff, D., Schaekermann, M., Tu, T. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025). https://doi.org/10.1038/s41586-025-08869-4
About the author
Will Browne is co-founder of healthcare technology company Emrys Health, where he works on the development of infrastructure for transformative, equitable and accessible healthcare. He is Events Secretary of the RSS Data Science and AI Section and a member of the RSS AI Taskforce.
Copyright and licence
© 2025 Royal Statistical Society
Thumbnail image by Tugce Gungormezler on Unsplash.
This article is licensed under a Creative Commons Attribution (CC BY 4.0) International licence.