The Genome Is the Largest Unread Library in the Universe. We're Building the Reader.
A manifesto from Living Models
Evolution has been running experiments for four billion years.
Every drought survived. Every pathogen defeated. Every adaptation to cold, heat, salt, shade, flood - written down. Archived. Stored in sequence, across millions of species, in the most stable medium that physics allows. Four nucleotides. Billions of positions. And somewhere in that combinatorial space: solutions to problems we haven’t even named yet.
We cannot read it.
Not really. We can sequence it - faster and cheaper every year. We can annotate fragments of it. We can search it for patterns we already know to look for. But the deep grammar - the non-linear, long-range, cross-species logic that makes a genome mean something - that has been beyond us. We’ve been staring at the largest library in the universe and reading individual letters.
That changes now.
Something Happened in 2017
A paper dropped. Eight authors. Eleven pages. Title: Attention Is All You Need.
Nobody outside a small corner of machine learning paid much attention. The authors proposed a new architecture for sequence modeling - transformers - built around a mechanism called self-attention. Their claim was that this architecture could learn the underlying grammar of sequential data better than anything before it, given enough data and enough compute.
They were more right than they knew.
Within five years, that architecture had consumed language. Then code. Then protein structure - AlphaFold2 nearly solved a fifty-year-old biology problem in a single pass. Then images, audio, video. Wherever you could frame a problem as sequence modeling, transformers arrived and rewrote what was possible.
Here is the thing that most people working in AI understand but haven’t fully followed to its conclusion:
DNA is a sequence.
Not metaphorically. Literally. A string of tokens - A, T, C, G - with statistical structure, long-range dependencies, functional grammar, and four billion years of supervised signal baked in by evolution. The genome is not like a language. It is a language. And we bet that the architecture that broke language open will be the same one that breaks genomics open.
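How literally? Literally enough to fit in ten lines. A minimal sketch of the framing - single-nucleotide tokens, purely illustrative; production tokenizers often use k-mers or learned vocabularies instead:

```python
# Map each nucleotide to an integer token, exactly the way words or
# subwords are mapped for a language model. Illustrative only.
VOCAB = {"A": 0, "T": 1, "C": 2, "G": 3}

def tokenize(dna: str) -> list[int]:
    """Turn a DNA string into the token ids a transformer consumes."""
    return [VOCAB[base] for base in dna.upper()]

print(tokenize("ATCGGATTACA"))  # [0, 1, 2, 3, 3, 0, 1, 1, 0, 2, 0]
```

Everything downstream of that mapping - attention, pretraining, scale - is the same machinery that consumed language.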
We are not at the beginning of the AI revolution. We are at the beginning of its biological phase.
The Crime Scene
Here is what bothers us.
Humanity has been sequencing genomes at scale for twenty years. We have accumulated petabytes of biological sequence data - plant, animal, microbial, human. We have phenotyped millions of lines. We have run thousands of genome-wide association studies. We have mapped QTLs, annotated genes, built databases that took decades of collective scientific labor to assemble.
And then we left most of it on the table.
The tools we built to analyze genomic data are, almost without exception, tools of reduction. They look for single variants. They test one gene at a time. They assume additive effects in a system that is profoundly non-additive. Most were designed for a world that no longer exists, a world where compute was scarce and sequence data was scarcer.
Meanwhile, the methods that cracked language - self-supervised pretraining on massive sequence corpora, emergent representations that capture meaning without being told what meaning is - have barely touched genomics. The field is still running BLAST and linear mixed models. Not because those tools are good enough. Because nobody has built what comes next.
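For readers who have not seen it up close, the pretraining recipe is almost embarrassingly simple: hide a fraction of the tokens, train the model to recover them from context, repeat over everything you have. A hedged sketch in PyTorch - the encoder is a stand-in and the hyperparameters are invented, not BOTANIC's configuration:

```python
import torch
import torch.nn as nn

# Illustrative constants - not BOTANIC's actual vocabulary or schedule.
VOCAB_SIZE = 6     # A, T, C, G, plus [MASK] and [PAD]
MASK_ID = 4
MASK_PROB = 0.15   # fraction of positions hidden from the model

def mask_tokens(tokens: torch.Tensor):
    """Hide a random subset of nucleotides; the model must recover them."""
    masked = torch.rand(tokens.shape) < MASK_PROB
    labels = tokens.clone()
    labels[~masked] = -100                 # loss only at masked positions
    corrupted = tokens.clone()
    corrupted[masked] = MASK_ID
    return corrupted, labels

# A stand-in encoder: any transformer mapping token ids to per-position logits.
model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, 64),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    ),
    nn.Linear(64, VOCAB_SIZE),
)

tokens = torch.randint(0, 4, (8, 512))     # a batch of 8 sequences, 512 bp each
corrupted, labels = mask_tokens(tokens)
logits = model(corrupted)                  # (8, 512, VOCAB_SIZE)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1)
)
loss.backward()
```

No labels. No annotations. No curated gold standard. The sequence supervises itself - which is exactly why petabytes of unannotated genomes stop being dead weight.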
This is the crime scene. Petabytes of biological intelligence, accumulated by evolution and by science, functionally unreadable. The library exists. The librarian doesn’t.
A foundation model trained on biological sequences does not learn a catalog. It learns a prior - a compressed representation of what is normal, what is surprising, what is functionally coherent, across the entire distribution of life it has seen.
When you show it a new sequence, it does not look it up. It understands it.
This is categorically different from everything that came before. BLAST finds known sequences. Variant effect predictors find conserved positions. Both are bounded by what has already been characterized. They cannot tell you anything genuinely new.
A foundation model can.
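What does understanding mean, operationally? Here is one concrete instantiation - the function names are ours and hypothetical, not a published API. A trained masked model can assign a likelihood to any sequence, seen or unseen, by masking each position and asking how probable the true nucleotide is. Score a variant by how much it shifts that likelihood, and you have a zero-shot effect predictor. No lookup. No prior characterization:

```python
import torch
import torch.nn as nn

VOCAB = {"A": 0, "T": 1, "C": 2, "G": 3}
MASK_ID = 4

def pseudo_log_likelihood(model: nn.Module, dna: str) -> float:
    """Mask each position in turn and sum the log-probability the model
    assigns to the true nucleotide. Higher means less surprising."""
    tokens = torch.tensor([[VOCAB[b] for b in dna.upper()]])
    total = 0.0
    with torch.no_grad():
        for i in range(tokens.shape[1]):
            corrupted = tokens.clone()
            corrupted[0, i] = MASK_ID
            log_probs = model(corrupted).log_softmax(dim=-1)
            total += log_probs[0, i, tokens[0, i]].item()
    return total

def variant_effect(model: nn.Module, ref: str, alt: str) -> float:
    """How much does the variant surprise the prior, relative to reference?"""
    return pseudo_log_likelihood(model, alt) - pseudo_log_likelihood(model, ref)

# Usage, with any masked model like the stand-in sketched earlier:
#   variant_effect(model, ref="ATCGGATTACA", alt="ATCGGCTTACA")
```

The score exists for sequences no database has ever held. That is the difference in kind.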
The Rosetta Stone didn’t translate one inscription. It unlocked the ability to read an entire civilization - including texts that had never been seen before. Foundation models for biology are that key. Not a better dictionary. Not a faster search engine. A grammar.
We started with plants - the most genomically complex organisms on earth - because the hardest proof is the most convincing one. BOTANIC is that proof. What it demonstrates is not specific to plants. It is specific to life.
We are not building a catalog of what biology has done. We are building a model of what biology can do.
From retrieval to understanding. From lookup to reasoning. From reading individual letters to reading the library.
We Started With Plants. Here Is the Real Reason.
Plants are the base layer.
Primary producers. First converters of solar energy into biological form. The foundation of every food chain, every terrestrial ecosystem, every agricultural system that has ever fed a human civilization. If you want to understand what climate change will actually do to the biosphere - not in models, in reality - the answer is being written in plant genomes right now, in real time, under selection pressure that has no precedent in the last ten thousand years.
But the honest reason we started with plants is more specific than that.
Plant genomics has properties that human genomics doesn’t: the data is open, the feedback loop is fast, and the stakes are high enough to matter.
No HIPAA. No consent frameworks. No fragmented biobank access negotiations. Thousands of fully sequenced genomes, publicly available, ready to train on - right now. When we train a model, we can validate its predictions in a single growing season. Cross two lines, grow the population, measure the phenotype, check the prediction. Six months. In human health, that validation cycle runs for years. The ability to iterate against biological ground truth at that speed is not a minor convenience. It is what separates research from engineering.
And the market pressure is not theoretical. Resistance genes that held for a decade are being broken by viruses in two seasons. Drought patterns have moved outside the historical range that fifty years of breeding data was calibrated for. The wild genetic diversity that plant breeders have drawn from for a century is contracting as native habitats disappear. The seed industry - $65 billion in annual revenue, the quiet infrastructure of global food security - is running out of time with the tools it has.
We are not here because plants are easy. We are here because plants are where we can prove the thing that matters most: that foundation models make the genome legible, in ways that change what biology can do in the real world. Fast enough to know it’s working. At stakes high enough to justify the effort.
The Thing Nobody Wants to Say Out Loud
Here it is.
The genomic data that seed companies, agricultural research institutes, and plant biology labs have accumulated over the last thirty years is, by any reasonable measure, more richly structured than most of the datasets used to train the most powerful AI systems on earth.
It has been collected under rigorous experimental protocols. It has known ground truth - phenotypic outcomes, measured in real field conditions, across populations large enough to be statistically meaningful. It spans thousands of genotypes, hundreds of environments, decades of selection history. It is labeled, in the sense that matters most to machine learning: the experiments have already been run, the outcomes already recorded.
And almost none of it has ever been used to train a serious foundation model.
Not because the companies holding it don’t understand its value. Because the tools to use it didn’t exist.
They exist now. We are building them. And the organizations that understand this first - that move first to pair their biological data with foundation model infrastructure - will not just have better R&D pipelines. They will have a compounding advantage that widens every year, as the model learns more from their data and their data reveals more about what the model can do.
This is not hype. This is what happened in every other domain where foundation models arrived. The early movers compounded. Everyone else played catch-up.
What We Are
We are Living Models.
A team of over ten PhDs - from Huawei Noah’s Ark Lab, Owkin, Mila, École Normale Supérieure, Datadog - who left comfortable positions to build the thing that should exist and doesn’t yet.
Our first model family, BOTANIC-0, is not yet the strongest plant genomics model in existence. It is the proof that we can build one - at scale, at a billion parameters, competitive on every existing benchmark. And it is the foundation from which we build what comes next.
The roadmap runs from DNA to RNA. From genomics to epigenomics. From plants to microbial systems to mammalian biology. From a model that understands plant sequence grammar to one that understands the grammar of living systems at large. Not because we’re trying to do everything at once. Because the architecture is the same, the approach is the same, and the beachhead we’ve chosen gives us the fastest path to proving that it works.
We are based in Paris and Berkeley. We have raised our first round. We have a technical report on bioRxiv that says what we actually built and how it actually performs. Our open-weight models are available on Hugging Face for research purposes.
We are not waiting for the moment to be obvious. The window to build the foundational infrastructure of biological AI - before it becomes crowded, before the large labs fully redirect their attention here, before the first-mover advantages are locked in - is open right now. We are building through it.
An Invitation
If you are a researcher and you think the things we are describing should exist - we want to hear from you. Not to sell you something. Because the people who will push this field forward are the ones who understand the problem well enough to be frustrated by the current state of the tools.
If you are a builder and you are looking for the next genuinely hard problem in applied AI - one where the data is real, the stakes are existential, and the methods that work will be non-obvious - welcome.
If you are sitting on biological data that has never been fully understood, in a plant lab or a seed program or a research institute anywhere in the world - the model that makes it legible is being built. We would rather build it with you than for you.
And if you simply believe that the intelligence evolution has encoded in living systems deserves the same quality of attention we’ve been giving to the intelligence we’re encoding in silicon -
You’re in the right place.
Living Models - Foundation models for living systems.