What AI Adoption in Science Tells Us About Trust
Why AI in scientific software wins or loses on the parts that aren't AI — verification, reproducibility, and the rigour scientists already apply to everything else.
Professionalised Scepticism: What AI-Native Actually Means in Scientific Software
Deloitte surveyed 104 R&D executives across biopharma in early 2025. The headline finding was near-universal optimism about AI in the lab. Investments flowing into automation, analytics platforms, AI tools, robotics, the lot. The same survey, in the very next breath, was clear-eyed about why most of it isn’t paying off yet: data foundations and governance.
Industry surveys of biotech R&D leaders sharpen the picture. Adoption appears to be fastest where outputs can be verified: literature review, structure prediction, scientific reporting. It falls off where they can’t be: generative design, say, or biomarker analysis.
Most analysis frames this as a maturity gap. I think it’s a pattern, and the lesson for anyone building scientific software in the next decade is in that pattern.
The pattern, not the gap
The interesting thing is that this verifiability divide isn’t a niche observation. Dig deeper and it appears to be playing out across the biggest deals of the last eighteen months.
- Eli Lilly committed $2.75 billion to Insilico Medicine in March 2026 for AI-discovered drug candidates, sitting on top of two earlier collaborations and a software licence.
- GSK paid $50 million upfront in January 2026 to anchor-license Noetik’s virtual cell foundation models, with milestones and annual subscription fees on top.
- Pfizer disclosed it was adding more than 1,200 GPUs to its data centres on its Q4 2025 earnings call.
- Lilly and NVIDIA opened a $1 billion AI factory in South San Francisco running on a thousand-plus Blackwell Ultra GPUs.
- (and outside life sciences) Siemens and NVIDIA used CES 2026 to announce an Industrial AI Operating System, with Siemens’s Erlangen factory as the first fully AI-driven adaptive site.
Importantly, the pattern isn’t unique to drug discovery. It’s how AI is being deployed across regulated, high-stakes industries, from big pharma to industrial manufacturing.
The part most commentary misses?
Every one of these works because there’s something downstream verifying the model’s output. Wet-lab confirmation. Virtual cell simulation. Physics-based engineering tolerances. Audit logs that survive a regulatory inspection. The deals aren’t bets on the model. They’re bets on the verifier the model gets paired with.
Why the verifiability pattern isn’t an accident
Science is professionalised scepticism. Peer review, replication, retraction, error bars, confidence intervals. These aren’t bureaucracy. They’re the trust architecture, and they exist because every working scientist knows that confident-sounding answers are the easy part. The hard part is establishing which of those answers survives a determined attempt to break it.
In that frame, a language model that produces a fluent, plausible answer to a scientific question without any way to verify it isn’t really a scientific tool. It’s a literature-shaped object. Useful in narrow contexts, dangerous in broad ones, and increasingly easy to spot once you know what you’re looking for.
The encouraging part is that the architectural answer to this problem already exists. Berkeley AI Research called it compound AI systems in early 2024. State-of-the-art results, they argued, are increasingly produced by systems with multiple components rather than monolithic models. Cross-verification, retrieval, tool use, deterministic checks. The model is one piece of the system, not the whole of it.
That consensus has been absorbed by serious AI engineering teams across legal, financial services, and code generation. It hasn’t fully landed in scientific software yet. That’s the gap, and in my view it’s a strategic one, not a maturity one.
AI-native doesn’t mean LLM-native. It means starting from the trust architecture, not the model.
The architecture
The most important part of an AI product is the part that isn’t AI.
Sounds backwards, doesn’t it? Let’s explore further.
Across every domain where AI has to produce outputs that survive scrutiny, I keep noticing the same pattern. A model proposes; a deterministic engine verifies; the system flags what’s unknowable and routes the hard cases for review. Propose, verify, revise.
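Here’s that loop stripped to its skeleton. A minimal sketch in Python, with every name (propose, verify, Verdict) invented for illustration rather than taken from any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    passed: bool
    evidence: list = field(default_factory=list)   # what backs the answer up
    failures: list = field(default_factory=list)   # what the check rejected, and why

def propose_verify_revise(question, propose, verify, max_rounds=3):
    """propose(question, feedback) -> candidate; verify(candidate) -> Verdict."""
    feedback = []
    for _ in range(max_rounds):
        candidate = propose(question, feedback)   # generative step: the model proposes
        verdict = verify(candidate)               # deterministic step: the engine checks
        if verdict.passed:
            return {"status": "verified", "answer": candidate, "evidence": verdict.evidence}
        feedback = verdict.failures               # feed the failures back for revision
    # Whatever survives max_rounds unverified gets flagged, not guessed at.
    return {"status": "needs_review", "answer": None, "failures": feedback}
```

The loop itself is almost boring, which is rather the point: all the domain knowledge lives in the deterministic verify step.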
The pattern is already in production across very different industries.
- Harvey, the prominent legal AI company, pairs its language model with a deterministic citation pipeline that confirms every cited case actually exists, supports the claim, and is still good law.
- AlphaFold predicts protein structures and, importantly, scores its own confidence in each residue (a per-residue pLDDT value), telling scientists which parts of the structure not to trust. (A minimal filtering sketch follows this list.)
- Our new friends at Siemens have NX, a generative industrial design tool that proposes part geometries and then hands each candidate to Simcenter, a physics simulation engine that verifies it against real-world behaviour like stress, vibration, and thermal response before anything gets manufactured.
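The AlphaFold case is concrete enough to show. A minimal sketch using Biopython, assuming a standard AlphaFold PDB output (the ranked_0.pdb filename and the pLDDT 70 cut-off are common defaults, not anything the argument depends on):

```python
from Bio.PDB import PDBParser  # pip install biopython

# AlphaFold writes per-residue pLDDT (0-100) into the B-factor column
# of its output PDB files; ranked_0.pdb is a typical output filename.
parser = PDBParser(QUIET=True)
structure = parser.get_structure("prediction", "ranked_0.pdb")

low_confidence = []
for residue in structure.get_residues():
    atoms = list(residue.get_atoms())
    if not atoms:
        continue
    plddt = atoms[0].get_bfactor()   # same pLDDT on every atom of the residue
    if plddt < 70:                   # below ~70 is usually treated as low confidence
        low_confidence.append((residue.get_parent().id, residue.id[1], plddt))

print(f"{len(low_confidence)} residues below pLDDT 70 -- treat these regions with caution")
```

Because the confidence lives in the structure file itself, the check needs nothing beyond the model’s own output.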
Three industries, three vocabularies, the same architecture. A generative model proposes; a deterministic engine verifies; the loop iterates. Replace “part geometry” with “molecular structure,” “case law” with “experimental evidence,” and the architecture doesn’t change. Hard to read this as a coincidence. It looks more like the pattern that emerges anywhere AI has to produce outputs that survive contact with the physical world.
The architecture isn’t a forecast. It’s already here.
What this means for what we build
A few things follow from all of this for anyone building scientific software.
Adoption follows verification. The question isn’t where you can add an LLM. It’s where the answer is checkable, what the verifier looks like, and how much of the user’s existing trust the verifier inherits. Get that ordering right and adoption is fast. Get it backwards and you build something that demos well and dies in production.
Auditability isn’t a feature. It’s the product. In any domain that touches GxP, the EMA, peer review, or a clinical filing, the audit trail isn’t a compliance tax: it’s the reason the software gets used. Builders who treat the audit trail as an afterthought are, in my view, building toys.
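What does that look like in practice? At minimum, something like the record below: every proposal and every verdict written down with enough metadata to reproduce the step later. A minimal sketch, with all names my own rather than any standard’s:

```python
import hashlib, json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """One verifiable step: what went in, what the model proposed,
    what the verifier said, and enough metadata to reproduce it."""
    input_hash: str      # SHA-256 of the exact input payload
    model_id: str        # model name and version used for the proposal
    proposal: str        # the model's raw output, stored verbatim
    verifier_id: str     # deterministic check and its version
    verdict: str         # "pass", "fail", or "needs_review"
    evidence: list       # citations, file paths, or simulation IDs backing the verdict
    timestamp: str

def record_step(input_payload: bytes, model_id, proposal, verifier_id, verdict, evidence):
    rec = AuditRecord(
        input_hash=hashlib.sha256(input_payload).hexdigest(),
        model_id=model_id,
        proposal=proposal,
        verifier_id=verifier_id,
        verdict=verdict,
        evidence=evidence,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    # Append-only JSONL: inspectors read it line by line, nothing is rewritten.
    with open("audit_log.jsonl", "a") as f:
        f.write(json.dumps(asdict(rec)) + "\n")
    return rec
```

The schema isn’t the point. The point is that the verifier’s verdict and its evidence are first-class data, not log noise.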
The integration substrate is settling. The Model Context Protocol (MCP) went from Anthropic’s launch in late 2024 to Linux Foundation governance in thirteen months, with developer adoption quickly reaching into the millions. The plumbing is being commoditised, and what runs on top of it appears to be where the value moves. Agentic workflows are twelve to twenty-four months from being standard in enterprise deployments, and I’d expect them to favour vendors offering verified computation rather than raw data access.
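To make “verified computation rather than raw data access” concrete: instead of handing an agent a raw measurement table to freestyle over, the tool runs a pinned, versioned computation server-side and returns the answer with its provenance. A hypothetical sketch (not the MCP SDK; every name here is invented):

```python
import hashlib
import numpy as np

def gate_positive_fraction(events: np.ndarray, channel: int, threshold: float) -> dict:
    """Hypothetical agent-facing tool: return a computed result plus provenance,
    never the raw event matrix itself."""
    values = events[:, channel]
    fraction = float((values > threshold).mean())
    return {
        "result": {"positive_fraction": fraction, "n_events": int(values.size)},
        "provenance": {
            "method": "fixed-threshold-gate/1.0",   # pinned, versioned computation
            "parameters": {"channel": channel, "threshold": threshold},
            "data_sha256": hashlib.sha256(events.tobytes()).hexdigest(),
        },
    }
```

Whatever protocol carries it, the agent gets an answer it can cite and the audit trail gets a record it can keep.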
The EVE Online Project Discovery initiative is one of my favourite signals here. Hundreds of thousands of players have spent their gaming time drawing polygons around cell clusters in real flow cytometry data, building a consensus dataset that AI gating algorithms train on. My colleague and friend Ryan Brinkman is the scientific partner on the project, and his academic work on automated flow cytometry gating predates this entire wave of AI by two decades. The verifier became valuable enough that the AI got built around it, not the other way round. That’s what a moat looks like in this architecture.
There’s a counter-argument worth taking seriously. The dominant view in foundation model labs is that general methods that scale with compute consistently beat domain-specific engineering over the long run. The implication: as models get good enough, the verifier becomes legacy code.
I’d push back on that. Even if a future model is right 99% of the time, science still has to know which 1% is wrong, why, and how to reproduce the result tomorrow on a different machine. A system that can’t be audited can’t be cited. The verifier isn’t legacy code. It’s the part that makes the model usable in a domain where being right most of the time isn’t the same as being trusted.
Coda
The companies that win in scientific software post-AI won’t be the ones that bolt a language model onto existing workflows. My bet is they’ll be the ones that treat AI with the same rigour scientists apply to everything else: measurable fitness, reproducible outputs, and the intellectual honesty to flag what’s unknowable.
Science is professionalised scepticism. That’s exactly the mindset this space needs more of.
If you’re building in this space and thinking about the verifier rather than the model, I’d love to hear from you. I’d genuinely enjoy comparing notes, especially with anyone who has counter-evidence on this.