Artificial intelligence is already having a major impact on fields like language processing and computer vision, but biology is emerging as one of the next frontiers.
That shift comes at a moment when genomic data is growing faster than many traditional tools can handle. Sequencing technology has become cheaper and more widespread over the past two decades, producing vast collections of biological data that researchers can read but still struggle to interpret in meaningful ways.
The challenge is no longer gathering genetic information, but understanding how different sequences interact and influence real-world outcomes.
Enter Living Models
Living Models is part of a growing group of companies attempting to tackle that gap using transformer-based architectures, the same underlying approach that powered the recent wave of large language models.
Instead of predicting the next word in a sentence, these systems analyze patterns across biological sequences, aiming to uncover structural relationships that traditional statistical tools often miss.
The company’s first model family focuses on plant biology, an area where genetic data is widely available and where faster insight could directly affect crop development and climate resilience.
The idea reflects a bigger shift in how researchers think about biology itself, moving from static catalogs of genetic parts toward systems that can interpret how those parts work together.
"Every living thing on Earth runs on the same programming language: DNA codes for RNA codes for proteins codes for phenotype," said Bertrand Gakière, VP Biology at Living Models. "We're not building another chatbot. We're building a model that can read and interpret that code, which is infinitely more useful than predicting the next word in a sentence."
I wanted to understand what that transition could mean in practice, so I spoke to Living Models CEO and co-founder Cyril Véran about why biology is becoming an information problem — and why plants are the starting point.
- Living Models wants to build foundation models for biology. But why? Can we draw parallels with the race, back in the 1990s, to decode the human genome?
The Human Genome Project gives us a useful before-and-after. Before 2003, we could not read the code at all. The project's achievement was monumental — a complete parts list for human biology.
But a parts list is not understanding. After twenty years of remarkable work — genome-wide association studies, CRISPR screens, QTL mapping, genomic selection — we have accumulated enormous amounts of genomic data and produced real results.
What we have not produced, at scale, is generalisation. The tools that exist today are fundamentally correlative: they learn that certain marker combinations tend to co-occur with certain phenotypes, within a given population, in a given environment.
They do not learn why. Ask them to extrapolate to a novel genetic combination, a different environment, or a related species, and the statistical associations break down. That is the wall the industry has been hitting for twenty years.
What changed is the same thing that changed natural language processing: transformer architecture. When applied to text, transformers stopped memorising words and started learning the structural relationships between them — grammar, context, long-range dependencies. That shift is now happening in biology.
The question is not whether DNA has 'intention' in the way human language does. It does not. But it does have structure — regulatory grammar, conserved motifs, epistatic interactions between distant genomic regions — and that structure can be learned from sequence data alone, at scale, without requiring every relationship to be manually annotated.
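To make that concrete, here is a minimal sketch of how a DNA sequence can be prepared for the kind of self-supervised training described above: the sequence is split into overlapping k-mer tokens, some tokens are masked, and a transformer's training objective is to recover them from context. The tokenization scheme and mask placeholder are illustrative assumptions, not Living Models' actual pipeline.

```python
# Illustrative sketch only: k-mer tokenization and masking for
# self-supervised training on DNA, analogous to masked language modeling.
# Token size and [MASK] convention are assumptions for this example.

def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens: list[str], positions: set[int]) -> list[str]:
    """Replace selected tokens with a placeholder; the model's training
    objective is to recover the original k-mers from their context."""
    return ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]

seq = "ATGGCGTACGTT"
tokens = kmer_tokenize(seq)          # 10 overlapping 3-mers
masked = mask_tokens(tokens, {2, 5}) # two positions hidden from the model
```

Because the training signal comes from the sequence itself, no manual annotation of regulatory elements or motifs is required, which is the point the interview makes about learning structure at scale.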
That is the race we are in. Not to sequence more genomes — we have plenty. To build a model that reads them with sufficient comprehension that a breeder, a researcher, or a biotech company can ask a meaningful question and get a biologically grounded answer.
The HGP was the Apollo Programme: it proved we could get there. What we are building is the infrastructure that makes the journey routine.
- Why plants and why not the other two major domains? I assume this is on your roadmap given your name is Living Models.
There is a strategic answer and a scientific one, and they point in the same direction.
The question people usually ask is: why not start with human health, where the funding is deeper and the clinical outcomes are more visible? There are four concrete reasons we went the other way.
First: data access. Every plant genome we trained on is fully public. No HIPAA, no GDPR, no patient consent frameworks, no biobank access negotiations, no institutional review boards. We assembled training data covering thousands of plant genomes without a single legal dependency.
In human genomics, building an equivalent dataset would require years of regulatory navigation before the first model is trained. That asymmetry is not a footnote — it is a fundamental structural advantage that let us move at a speed that would have been impossible in a clinical context.
Second: regulatory friction. Deploying a genomic model in human medicine means navigating the FDA, the EMA, and their equivalents across every market. The evidentiary bar is rightly very high — and very slow.
In agriculture, the path from model output to field application is governed by plant variety registration frameworks that, while meaningful, operate on a fundamentally different timescale. We can iterate, validate, and deploy in years, not decades.
Third: experimental velocity. In human biology, a failed prediction has consequences that extend far beyond the experiment.
In plant biology, we can design a trial, grow it out, and measure the result in a single season. If a variant we predicted to confer drought tolerance turns out to be irrelevant, we learn that in months, not years, and at a cost measured in field plots rather than clinical trials.
The feedback loop that improves the model is dramatically faster. Nobody regulates what happens to a crop that underperforms.
Fourth, and perhaps most important: urgency. Agriculture is the industry most directly, most immediately, and most irreversibly affected by climate change. Growing seasons are shifting. Drought and heat stress events that were once rare are becoming baseline conditions in the world's breadbaskets.
The varieties that will feed ten billion people by 2050 need to be bred for a climate that does not yet exist at scale — which means we cannot wait for twenty years of field trials to identify which genomic combinations are relevant.
The need for exactly what BOTANIC does — predicting biological function in conditions outside the historical training distribution — is not a future use case in agriculture. It is the defining problem of the sector right now.
As for fungi, microbiome, and the rest: Living Models is not a plant company. We are a foundation model company for living systems. Plants are where the structural advantages are highest and the urgency is greatest. The architecture generalises. The name was chosen deliberately.
- What prevents Bayer CropScience, Corteva, Syngenta, BASF, and Limagrain from emulating what you're doing? And how did you match much larger teams — are you the DeepSeek of your category?
DeepSeek is a reasonable reference point, with one important clarification: what made DeepSeek significant was not that it was cheap — it was that it was architecturally efficient in ways that larger, better-resourced teams had not prioritised.
The lesson is that in deep learning, the team closest to the problem often moves faster than the team with the most capital. The same dynamic applies here.
The large agrochemical groups are extraordinary organisations. They run global breeding programmes, navigate complex regulatory environments across dozens of markets, and manage supply chains of staggering scale.
What they are structurally not built to do is frontier AI research — the kind that requires hiring researchers from Huawei Noah's Ark Lab, Mila, Owkin, and the École Normale Supérieure, and giving them the autonomy to redesign training pipelines from scratch. That is a different institutional mode.
You do not acquire it by redirecting an IT budget. You build it over years, or you partner with someone who already has it. We expect many of the largest seed companies to do the latter.
On the IP question: we released BOTANIC as open weights deliberately, and the logic is worth explaining precisely. The model weights are a snapshot. The durable competitive asset is the flywheel that generates the next, better snapshot: proprietary fine-tuning data accumulated through each customer partnership, the feedback loops from real breeding programmes, and the architectural improvements that compound over time.
Every partnership we close with a major seed group produces training signal that no competitor can replicate, because that phenotypic data — decades of field trials, trait measurements, environment interactions — was never public to begin with. Open weights accelerate the first step of adoption. Proprietary data pipelines create the moat that follows.
As for acquisition: it is a real strategic option for the incumbents, and we are aware of it. What it would confirm is that the capability cannot be built internally at the pace required. That is itself a form of validation.
- What could be the consequences of biological hallucinations, and what barriers do you have to mitigate any risks?
I want to be precise here rather than reassuring, because the question deserves precision.
BOTANIC operates as a hypothesis engine, not a decision system. When the model scores genomic variants for their likely contribution to drought tolerance, it is prioritising a candidate list for experimental validation — not issuing a planting instruction.
In a research setting, the consequence of an incorrect prediction is a wasted experiment, typically weeks to months of work. That is a real cost, and we take it seriously.
The more significant risk operates at the industrial scale: a seed company that allocates its R&D programme on the basis of systematically biased predictions could misallocate resources over a multi-year breeding cycle before the error surfaces in field data.
Plant breeding runs on timescales of four to eight years from genomic hypothesis to commercial variety. That is the error propagation window we design against.
Concretely, we do three things. First, uncertainty quantification is built into every model output — predictions come with calibrated confidence distributions, not point estimates, and we validate that calibration against held-out genomic benchmarks documented in our bioRxiv technical report.
Second, we explicitly flag low-coverage regions of genomic space where the training distribution is thin and model confidence should be treated sceptically.
Third, our commercial deployments are integrated into existing breeding workflows where domain experts make the consequential decisions — BOTANIC accelerates the hypothesis generation step, it does not replace the agronomist or the field trial.
The structural safeguard is the nature of the domain itself. Unlike a software system where a model error can propagate at machine speed through millions of decisions, agricultural biology has human experts and multi-season validation cycles built into every step. We design for that reality rather than trying to substitute it.
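The calibration check mentioned above can be illustrated with a small sketch: bin predictions by stated confidence and compare each bin's average confidence against its empirical hit rate on held-out data. The data below is invented for illustration; real validation would run against genomic benchmark sets like those in the company's technical report.

```python
# Hedged sketch of a calibration check (reliability-diagram style).
# Inputs are toy values, not real model outputs.

def calibration_gap(confidences, outcomes, n_bins=5):
    """Return per-bin (mean stated confidence, observed accuracy) pairs.
    A well-calibrated model has the two values close in every bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    report = []
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            report.append((round(mean_conf, 2), round(accuracy, 2)))
    return report

preds = [0.15, 0.35, 0.55, 0.65, 0.92, 0.97]  # stated confidences
hits  = [0,    0,    1,    0,    1,    1]     # validated outcomes
report = calibration_gap(preds, hits)
```

A large gap between stated confidence and observed accuracy in any bin is exactly the signal that would trigger the "treat this region of genomic space sceptically" flag described above.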
- Can you give a real application? Would scientists talk to it like ChatGPT? Can companies combine BOTANIC with proprietary data?
Concrete example: a wheat breeder wants to develop varieties resilient to the kind of drought that devastated harvests in southern Europe in 2022. The traditional approach means crossing thousands of candidate lines, growing them through multiple seasons, and measuring which ones survive.
That is a process of five to twelve years from hypothesis to commercial variety, with most candidates failing late.
The existing computational toolkit — genomic selection models like GBLUP or BayesC — already helps narrow that funnel. But those models work by learning statistical correlations between marker combinations and measured phenotypes within a specific training population.
They require hundreds to thousands of phenotyped individuals per trait, they degrade when you move to a different environment or genetic background, and they are blind to biological mechanism.
They will tell you that a particular haplotype block tends to co-occur with drought tolerance in your historical data. They cannot tell you why, or whether it will hold in a genetic background they have never seen.
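The correlative nature of those classical models can be seen in a toy sketch of GBLUP-style ridge regression: marker genotypes are regressed on phenotypes, producing per-marker effect estimates with no mechanistic content. The data here is invented, and this is a simplified stand-in for real genomic-selection software, not any production implementation.

```python
# Toy GBLUP-style genomic selection: ridge regression of marker
# genotypes on a phenotype. Purely correlative, as the interview notes.
import numpy as np

def ridge_marker_effects(X, y, lam=1.0):
    """Estimate per-marker effects: beta = (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Invented data: 4 breeding lines x 3 markers (allele counts 0/1/2),
# with a drought-tolerance score per line.
X = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [1, 2, 2]], dtype=float)
y = np.array([1.0, 0.4, 0.8, 1.6])

beta = ridge_marker_effects(X, y)
# Predictions for a new genetic background simply extrapolate the same
# linear associations; there is no biological mechanism to fall back on
# when the background or environment shifts.
pred = X @ beta
```

When the target population drifts away from the training population, these linear marker effects are all the model has, which is the degradation the interview describes.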
BOTANIC approaches the same problem from a different direction. Because it is trained on raw genomic sequence across 1,600 plant genomes — not on phenotype-marker associations — it learns the underlying biological structure: regulatory grammar, conserved functional motifs, the long-range epistatic interactions that classical models treat as noise.
When applied to the breeder's candidate lines, it can prioritise variants that are biologically coherent, not just statistically associated — including novel combinations absent from any historical training set. The experimental programme then targets a far smaller, better-grounded set of candidates.
The breeding cycle does not disappear, but its front end becomes dramatically more efficient, and its predictions hold up further from the training distribution.
On the interface question: the primary environment is the computational workflow that genomics researchers already use — sequence files, annotation tracks, variant call formats. That is where the value is highest and the integration is cleanest.
On hybrid deployment: yes, and this is the architecture we run with enterprise customers. A major seed group typically holds decades of proprietary phenotypic data — field trial results, trait measurements, environment-specific performance records — that have never been combined with a model capable of reasoning over the underlying genomics.
We fine-tune BOTANIC on that dataset in a private deployment: the customer's data does not leave their environment, the resulting model weights remain their property, and what they get back is a model that combines general biological knowledge from 1,600 plant genomes with deep specificity to their crops, environments, and breeding objectives.
The difference between that and a genomic database query is the difference between a fluent domain expert and a search engine.