Why 95% of AI Projects Fail and How to Change the Odds

Drawing on two decades of work building AI systems for medicine discovery, Lee Clewley argues that the success of AI projects rests on three principles.

Viewpoints · AI

Author: Lee Clewley
Published: January 12, 2026

Artificial intelligence is now capable of performing substantive work across scientific, medical, industrial and economic domains, yet organisational experience remains uneven. Most large firms have experimented with AI; very few report material gains. MIT’s NANDA study of enterprise generative AI estimates that only 5 percent of custom tools reach production with measurable impact on profit and loss (1). Early analysis from MIT’s Iceberg project points in the same direction at task level: current systems could already support far more work than they do today, but observed use remains shallow, concentrated in a narrow set of roles and often confined to standalone ‘copilot’ tools rather than embedded in core workflows (2).

For anyone who has sat through AI vendor demonstrations, the pattern is familiar: a procession of polished prototypes that rarely change how important decisions are made. As one Chief Information Officer put it (1): ‘We’ve seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.’

Two caveats matter here. Many pilots are exploratory by design, so failure to reach production is not necessarily a failure in a scientific sense. Profit-based metrics also miss scientific and operational learning, which often matters more in research-intensive organisations (3); (4); (5); (6). Even allowing for those points, the evidence across independent surveys is remarkably consistent: most organisations struggle to turn AI model capability into repeated value, and with an estimated 95% failure rate, the question becomes: how do we change the odds? (1); (3); (4); (5); (6).

Three Reasons for Failure

There are three important reasons why so many projects fail that are often overlooked in the literature. First, the problem is often mis-specified: it is framed by technologists or vendors rather than co-owned by the domain experts who understand the decision and bear the consequences. Second, leadership expectations are frequently misaligned: short time horizons and demands for certainty collide with a technology that improves through iteration and organisational learning. Third, many deployments are brittle: they assume stability in a domain defined by rapid model change and rising user expectations, when what is needed is an engineered system designed to adapt.

This article draws on two decades of work building AI systems for medicine discovery at GSK (7) and at Tangram (8) to argue that success rests on three principles, in roughly this order:

  • integrated subject matter expertise;
  • patient and informed executive leadership;
  • and building AI systems that learn with the organisation.

The sections that follow develop each element in the context of drug discovery and show how an AI platform can help move real projects into the small minority that deliver value.

Essential Contexts Outside Our Focus

Before turning to the three principles developed below, it is worth acknowledging four adjacent domains that I will not treat in detail here, each of which now has a substantial literature of its own. First, organisational scholars have long shown that new information systems reshape power, status and discretion, making the politics of implementation as important as the technology itself (9); (10). Second, governance, regulation and responsible AI practice, covering everything from model documentation and auditability to privacy, robustness and safety, have become central determinants of what can be deployed in practice, especially in regulated sectors (11); (12); (13). Third, there is an emerging body of work on workforce transformation: how AI complements or displaces skills, how hybrid human–AI roles are designed, and how training, trust and professional bodies mediate adoption (2); (3); (4); (5). Finally, the question of how to measure value and learn at portfolio scale with existing legacy IT systems (through experimentation, counterfactuals and disciplined comparisons between use cases) is itself a rich field that extends well beyond any single organisation (3); (4); (5); (6); (14). Each of these strands is critical to understanding why AI succeeds, stalls or remains at a proof-of-concept level.

To maintain focus, I concentrate on three overlooked questions arising from direct experience: how to organise subject matter expertise such that the enterprise owns its AI; how to cultivate genuine leadership ownership; and how to engineer systems that learn and adapt rather than remain isolated demonstrations.

Principle 1: The importance of building with subject matter experts

The hardest part of building an AI platform is not the models or the engineers but assembling the subject matter experts (SMEs) who will frame and judge the work. Most commentary treats SMEs as validators, brought in at the end to bless a prototype. It rarely explains how to organise a molecular biologist, a clinician and a chemist so that they can state, in plain terms, what counts as an acceptable outcome.

Most commentators are not operators. They observe patterns across organisations but do not live with the consequences of poor SME integration. This distance between writing and practice shows up in the surveys. Foundry’s 2024 State of the CIO, summarised in MIT Sloan Management Review, reports that 85% of IT leaders see the CIO role as a driver of change, yet only 28% list leading transformation as their top priority (15). The people commenting on AI often sit with strategy decks rather than with the unglamorous work of managing technological change and cross-functional coordination.

Drug discovery starkly exposes the gap. The relevant team is wide and requires exceptional coordination. Business and portfolio leaders understand how projects absorb capital and create value, whereas molecular biologists and geneticists judge whether a gene is plausibly causal for a disease. Clinicians think through trial design and patient risk. Chemists know what can be made and delivered. Statisticians, AI engineers and data scientists understand models, data pipelines, experimental design and evaluation. This diversity is a strength, but it demands more from the leaders of such teams. When these groups work as separate silos, the result is a generic set of tools whose outputs are not trusted by the users and whose inputs are irrelevant. When these experts operate as a single team, the conversation starts with a simple set of questions. Which decisions are we trying to improve? How will we know if we have succeeded? What data and statistical methods count as acceptable evidence? Which risks are we prepared to take, and which are not negotiable?

At Tangram, that joint framing often collapses into one critical choice: which disease do we want to target for drug development, and which gene is driving it? That decision already embeds genetics, hepatocyte biology, chemistry, clinical feasibility and commercial context. The role of AI and engineering is then precise. It is to help the group search the vast hypothesis space, structure the evidence and quantify uncertainty, while leaving the final judgement with experts who feel they own the AI platform, can see why the AI has come to the conclusions it has, and also own the consequences.

Principle 2: Patient and strategic executive leadership

The second element is executive patience. Research from MIT Sloan shows that firms gaining value from AI tend to run more projects, over more years, with a sustained focus on learning how people and AI work together (3). Leaders in these organisations accept that early returns are small and uneven. They invest in a pipeline of use cases rather than a single bet. They resist what researchers have called the “last mile problem”: AI projects that reach technical proof of concept but never change how work is done (14). In life sciences this is acute. Discovery timelines are long, data are messy and early signals are faint. Leaders who expect quick, clean returns tend to cycle through pilots without ever building an asset that scientists trust.

Patience does not mean passivity; actually, the opposite is true. It means choosing a small number of important decisions, funding cross-functional teams to attack them, and holding the bar for quality high. It requires senior sponsorship to unblock data access, align incentives across discovery, clinical and commercial groups, and shield long-term work from quarterly fashion cycles. When those conditions are in place, AI stops being a sequence of demonstrations and starts to become part of how the organisation thinks: the informed leader knows the difference.

Principle 3: Building AI systems that learn with the organisation

The third element is the AI engineering. The MIT findings on the 95 percent figure are instructive: most generative AI projects fail not because the models are weak but because the systems around them are brittle (1); (11). Foundation models are dropped into existing workflows with minimal adaptation. There is limited monitoring. Data quality is assumed rather than measured. When something breaks (as it inevitably will in non-deterministic systems) teams revert to manual work.

Modern AI engineering starts from the opposite assumption. Models and tools will change quickly. The surrounding stack must absorb that change without being rebuilt each time. The strategy should be designed so that news of a better tool or an improved LLM is always a good day for the product director, never a reason to start over. One way to keep the cost of change low is to code against stable internal contracts rather than vendor specifics, as in the sketch below.
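As a minimal illustration of that idea (and not a description of any production system at Tangram or GSK), the Python sketch below has workflow code depend on a stable internal contract, with a vendor-specific adapter behind it; every class and function name here is hypothetical.

```python
# Sketch: application code depends on a stable internal contract, not on any
# one vendor SDK, so swapping or upgrading a model does not force a rewrite.
# All names are illustrative.
from typing import Protocol


class EvidenceSummariser(Protocol):
    """The contract the rest of the platform codes against."""

    def summarise(self, question: str, documents: list[str]) -> str:
        ...


class VendorALLM:
    """Adapter wrapping one vendor's model behind the internal contract."""

    def summarise(self, question: str, documents: list[str]) -> str:
        # A real adapter would translate this call into the vendor's API.
        return f"[vendor A summary of {len(documents)} documents for: {question}]"


def triage_report(model: EvidenceSummariser, question: str, docs: list[str]) -> str:
    # Workflow code never names a vendor; replacing the model is a one-line
    # change at the composition root, not a rewrite of the pipeline.
    return model.summarise(question, docs)


if __name__ == "__main__":
    answer = triage_report(
        VendorALLM(),
        "Is knockdown of GENE1 plausibly protective in dyslipidaemia?",
        ["paper 1", "internal assay summary"],
    )
    print(answer)
```

The design choice is the point, not the code: the contract changes slowly because the team owns it, while the adapters behind it can turn over as quickly as the model landscape does.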

Build only what you must.

The sensible principle is to build only the components where your domain expertise creates defensible value. Everything else should be bought. But buying is not effortless. Integration, monitoring and vendor management drain teams unless you are staffed for them. Organisations that succeed at scale partner with vendors offering systems that learn and adapt; they focus on workflow integration; they deploy tools where process alignment is easiest.

Assuming you have the right staff working together, supportive leaders and good vendor relationships, the next problem is how to build the platform itself.

A useful pattern for a modern AI platform has four parts:

  • First, the data stack. Discovery teams need a small number of trusted stores for data with clear provenance and at least basic quality checks. For early target selection this means human genetics, expression data, interaction networks, preclinical phenotypes, clinical outcomes and internal experiments. Traceability and reproducibility are critical here. Any claim a model makes about a gene and disease pair should be traceable back to specific pieces of evidence.
  • Second, the platform needs to be built on modular services rather than monoliths. Each service has a single responsibility and can be swapped when a better service appears. This keeps the cost of change low and allows teams to combine external tools with internal components in a controlled way.
  • Third, the system needs continuous evaluation. Every component that answers questions is tested on held-out tasks, with simple metrics for accuracy, faithfulness and recall, and monitored continually. There should be repeat measures and other tests of robustness (16). There is no reason not to report error bars in AI, and yet they are rarely part of AI publications. This matters most where the non-determinism inherent in large language models meets the user. A good medical AI assistant should give consistent answers even when questions are phrased differently. It should also say it does not know when the information is unclear or incomplete (16); (17); (18). A minimal sketch of this kind of evaluation follows the list.
  • Fourth, include memory and reinforcement learning so that the system learns. This is the most difficult component to implement and the one most often deferred. A system that cannot learn from use will make the same mistake repeatedly, and even the most forgiving users will lose trust. But building memory into production systems, where the model retains context across sessions and improves from feedback, requires specialist expertise in reinforcement learning, retrieval-augmented generation with persistent stores, and the infrastructure to support online learning without catastrophic forgetting (19). These skills are in high demand and short supply. The alternative is a system that feels promising in demonstrations but frustrates users in daily work.
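To make the third item in the list concrete, here is a minimal sketch of repeat-measures evaluation with error bars. It assumes a hypothetical `ask_system` stand-in for whatever question-answering service the platform exposes; the questions, accepted answers and thresholds are illustrative, not drawn from any real evaluation set.

```python
# Sketch: run each evaluation question (and paraphrases of it) several times
# against a non-deterministic system, then report accuracy with a bootstrap
# interval rather than a single point estimate.
import random


def ask_system(question: str) -> str:
    # Placeholder for an LLM-backed service whose answers vary between calls.
    return random.choice(["gene knockdown lowers LDL", "unclear"])


EVAL_SET = [
    # (paraphrases of the same question, accepted answer)
    (
        [
            "Does knocking down GENE1 reduce LDL?",
            "Is LDL reduced when GENE1 is silenced?",
        ],
        "gene knockdown lowers LDL",
    ),
]


def accuracy_once() -> float:
    hits, total = 0, 0
    for paraphrases, accepted in EVAL_SET:
        for question in paraphrases:
            hits += int(ask_system(question) == accepted)
            total += 1
    return hits / total


def bootstrap_ci(samples: list[float], n_boot: int = 2000) -> tuple[float, float]:
    # Percentile bootstrap over the repeated runs.
    means = sorted(
        sum(random.choices(samples, k=len(samples))) / len(samples)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]


if __name__ == "__main__":
    runs = [accuracy_once() for _ in range(30)]  # repeat measures
    mean_acc = sum(runs) / len(runs)
    low, high = bootstrap_ci(runs)
    print(f"accuracy = {mean_acc:.2f} (95% bootstrap interval {low:.2f} to {high:.2f})")
```

The specific metric matters less than the habit: every reported number comes with an interval, and consistency under paraphrase is measured rather than assumed.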

For this to work, engineering teams need to stay in constant contact with biologists, geneticists, clinicians, chemists and portfolio managers. Together they decide what error rate is acceptable for a triage tool, what form of uncertainty estimate a portfolio board will respect and where human review is mandatory, for example before a new target enters serious preclinical work. Work on human–AI interaction design reinforces this point: systems should explain what they can and cannot do, expose their confidence and make it easy for users to correct them. The hardest part is that the first version is almost never right. Cross-functional teams need patience and ownership. They contribute real examples, refine prompts and evaluation sets, and expect the system to learn from its mistakes. The AI platform must be useful enough, early enough, that experts are willing to spend scarce attention improving it.

A worked example: target-indication pairing in siRNA

At Tangram we built LLibra OS, an internal system designed to surface and assess new small interfering RNA (siRNA) targets (8). siRNA refers to short double-stranded RNA molecules that can silence a specific gene via the RNA interference pathway. The purpose is narrow: the AI needs to help scientists identify medicines worth taking forward.

In early discovery, hypothesis generation often reduces to a single question: which target and disease pair should we move into the siRNA pipeline? The question sounds simple. It is not.

A purely data-driven approach can surface millions of candidates. Genome-wide association, expression atlases, and protein interaction networks will produce statistical associations at scale. But association is not mechanism. Two things can correlate because they share upstream causes, because they sit in the same pathway without being rate-limiting, or because of confounding in the data.
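To see the confounding point in miniature (with entirely synthetic numbers, not discovery data), two readouts can correlate strongly simply because they share an upstream driver, even though neither affects the other:

```python
# Toy simulation: gene_a and disease_score are both driven by the same
# upstream factor, so they correlate strongly despite no direct causal link.
import random

random.seed(0)

n = 5000
upstream = [random.gauss(0, 1) for _ in range(n)]
gene_a = [u + random.gauss(0, 0.5) for u in upstream]
disease_score = [u + random.gauss(0, 0.5) for u in upstream]


def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    sx = (sum((a - mx) ** 2 for a in x) / len(x)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / len(y)) ** 0.5
    return cov / (sx * sy)


# Roughly 0.8 here, with no mechanism connecting gene_a to the disease score.
print(f"corr(gene_a, disease_score) = {corr(gene_a, disease_score):.2f}")
```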

Plausibility requires a different kind of evidence. If we modulate this target, what functional change should we observe at the cellular or tissue level? Does that functional phenotype connect credibly to the disease we care about? For our purposes, the chain of reasoning must pass through liver biology: does knockdown of this gene alter a measurable secretory or metabolic function, and does that function relate to the clinical phenotype we wish to treat? And is there a genuine unmet need for patients (20)?

The AI assists in answering these questions. It helps the team hold multiple threads of conditional evidence in view simultaneously. It retrieves, reasons and summarises over tens of millions of papers in the literature, joins and flags inconsistencies between thousands of data sources, and quantifies uncertainty where the evidence is thin. Whilst the AI platform does the work of thousands of researchers and continually learns on the job, the expert judgement remains with the SMEs. The SME stays central and is shown how and why the AI surfaced each piece of evidence. The AI structures and expands the space in which that judgement operates. When it works well, the AI platform uncovers non-obvious connections that researchers may never have found.
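As one hedged illustration of quantifying uncertainty where the evidence is thin (and not a description of how LLibra OS actually scores targets), treating each independent evidence source for a target and disease pair as a noisy vote and summarising it with a Beta posterior makes thinly evidenced pairs visibly uncertain; the counts and threshold below are invented:

```python
# Sketch: summarise supporting vs conflicting evidence sources for each
# target-disease pair with a Beta posterior, so sparse evidence shows up as
# a wide (high standard deviation) estimate that can be flagged for review.
from math import sqrt


def beta_summary(supporting: int, conflicting: int) -> tuple[float, float]:
    """Posterior mean and standard deviation of the 'support' probability
    under a uniform Beta(1, 1) prior."""
    a, b = 1 + supporting, 1 + conflicting
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)


# Invented counts: (target, disease) -> (supporting sources, conflicting sources)
pairs = {
    ("GENE1", "dyslipidaemia"): (12, 2),  # comparatively well evidenced
    ("GENE2", "NASH"): (2, 1),            # thin evidence
}

for (target, disease), (support, conflict) in pairs.items():
    mean, sd = beta_summary(support, conflict)
    verdict = "flag for expert review" if sd > 0.1 else "evidence reasonably consistent"
    print(f"{target} / {disease}: support ~ {mean:.2f} +/- {sd:.2f} ({verdict})")
```

In a real platform the aggregation is far richer, but the principle is the same: the system reports how sure it is, not just what it found, and the widest intervals are exactly where expert attention should go.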

Conclusions

Seen from this angle, the 95 percent figure is not a verdict on AI technology but a statistic about organisational design: how rarely good questions, high quality data, diverse experts and committed leaders are brought together at the same time. The systems described in this essay matter, but they are secondary. The primary determinant of value is whether biologists, clinicians, chemists, data scientists and portfolio leaders sit together, own the same objectives and are backed by executives willing to invest over years rather than quarters. Where that integrated team is absent, even elegant architectures will fail. Where they are present, imperfect tools still move the needle.

Much commentary on AI spends its energy on model choice, technical detail and tooling. This article has argued that the more important work is organisational: deciding which decisions to improve, agreeing what counts as acceptable evidence, and creating cross-functional teams that can live with the consequences. The few organisations that succeed treat AI as an experiment in decision making rather than a procurement exercise. They expect the stack to change and the vendors to turn over, but they hold fast to the team, the questions and the discipline.

For Real World Data Science readers, the implication is direct. AI projects fail when nobody owns the estimand, the counterfactual and the error bars. AI doesn’t need to be perfect, but it needs to be good enough. As the statistician George E. P. Box famously observed: “All models are wrong, but some are useful”. Usefulness here depends on design, discipline and humility as much as model choice. Statisticians, data scientists and methodologists can reclaim the narrative by insisting not only that every AI project begins with a clear question, a credible experiment and a plan to learn, but also that these are held collectively by an integrated team with visible executive backing. That is how more organisations move into the 5 percent.


About the author:
Lee Clewley (PhD) is VP of Applied AI & Informatics at Tangram Therapeutics, where he led the design and deployment of LLibra, a multi-LLM, agentic system for early discovery. Formerly Head of Applied AI at GSK, he is a member of the Real World Data Science editorial board.

Copyright and licence: © 2026 Lee Clewley
This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.

How to cite:
Clewley, Lee. 2026. “Why 95% of AI Projects Fail and How to Change the Odds.” Real World Data Science, 2026. URL

References

1. MIT NANDA. The GenAI divide: State of AI in business 2025. Massachusetts Institute of Technology; 2025.
2. Chopra A, et al. Measuring skills-centered exposure in the AI economy. MIT Project Iceberg; Oak Ridge National Laboratory; 2025.
3. Ransbotham S, Khodabandeh S, Kiron D, Candelon F, Chu M, LaFountain B. Expanding AI’s impact with organizational learning. MIT Sloan Management Review. 2020.
4. de Bellefonds N, et al. Where’s the value in AI? Boston Consulting Group; 2024.
5. Deloitte AI Institute. The state of generative AI in the enterprise: Now decides next. Deloitte; 2024.
6. Schlegel D, Schuler K, Westenberger J. Failure factors of AI projects: Results from expert interviews. International Journal of Information Systems and Project Management. 2023;11(3):25–40.
7. GSK.ai. JulesOS: GSK’s agent-based operating system [Internet]. 2024. Available from: https://www.gsk.ai
8. Tangram Therapeutics. LLibra OS: Identifying the right targets. 2025.
9. Markus ML. Power, politics, and MIS implementation. Communications of the ACM. 1983;26(6):430–44.
10. Orlikowski WJ. The duality of technology: Rethinking the concept of technology in organizations. Organization Science. 1992;3(3):398–427.
11. Paleyes A, Urma RG, Lawrence ND. Challenges in deploying machine learning: A survey of case studies. ACM Computing Surveys. 2022;55(6).
12. EY. How responsible AI translates investment into impact. Ernst & Young; 2025.
13. Capgemini Research Institute. Generative AI in organizations 2024. Capgemini; 2024.
14. Davenport TH, Ronanki R. Artificial intelligence for the real world. Harvard Business Review. 2018;96(1):108–16.
15. Foundry. State of the CIO survey 2024. Foundry; 2024.
16. Bolton WJ, Poyiadzi R, Morrell ER, van Bergen Gonzalez Bueno G, Goetz L. RAmBLA: A framework for evaluating the reliability of LLMs as assistants in the biomedical domain. arXiv preprint; 2024.
17. Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12).
18. GSK.ai. RAmBLA: Evaluating the reliability of LLMs as biomedical assistants. 2024.
19. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems. 2022. p. 27730–44.
20. Crooke ST, Liang XH, Baker BF, Crooke RM. Antisense technology: A review. Journal of Biological Chemistry. 2021;296:100416.