Press Release: Genome Modelling and Design Across All Domains of Life with Evo 2

Posted on March 06, 2026 by Admin

Researchers at the Arc Institute describe the development and capabilities of “Evo 2”, a novel biological foundation model trained on approximately 9 trillion DNA base pairs and implemented in models containing up to 40 billion parameters.

Evo 2 represents a significant advance over previous artificial intelligence (AI) models for genomics. Its training data span all domains of life, from bacteria to humans, and its genomic context window can reach up to 1 million nucleotides, allowing it to analyze the intricate, long-range dependencies that govern gene function.

The reported results show strong performance across multiple genomic prediction tasks, including predicting the functional impact of coding and non-coding variants and of splice-related changes. The model can also design novel DNA sequences at genome scale, for example mitochondrial genomes and large microbial or eukaryotic genomic segments.

Study

In the present study, researchers report the development and testing of “Evo 2”, a novel generalist biological foundation model. The model was trained on a large, curated dataset called OpenGenome2, comprising approximately 8.8 trillion nucleotides from bacteria, archaea, eukaryotes, and bacteriophages. Viruses that infect eukaryotic hosts were intentionally excluded for biosafety reasons.

Unlike previous models in the field, Evo 2’s architecture, called “StripedHyena 2”, uses a hybrid computational design that combines convolutional and attention mechanisms, greatly expanding its “context window”. The research team likens this to holding hundreds of pages of the genomic “novel” in memory at once, whereas previous models could at best analyze only paragraphs at a time.

Evo 2’s performance was evaluated on two primary fronts: (1) prediction, determining whether a specific DNA mutation or other genetic variant can result in disease or loss of function, and (2) generation, the guided de novo design of synthetic DNA sequences.

Results

Study findings revealed that, in scenarios where Evo 2 was required to predict outcomes without specific training on that task (“zero-shot” tasks), the model successfully identified pathogenic (disease-causing) mutations in humans.
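The article does not say how Evo 2 scores variants in the zero-shot setting. One common approach for sequence models, assumed here, is to compare the model's log-likelihood of the variant sequence with that of the reference sequence. The sketch below illustrates that idea with a toy k-mer Markov model; `ToyKmerModel` and `zero_shot_score` are hypothetical names for illustration and are not part of any Evo 2 API.

```python
import math


class ToyKmerModel:
    """Hypothetical stand-in for a genomic language model (order-k Markov chain)."""

    def __init__(self, k=2, pseudocount=1.0):
        self.k = k
        self.pseudocount = pseudocount
        self.counts = {}  # context string -> {next base -> count}

    def fit(self, sequences):
        # Count how often each base follows each k-base context.
        for seq in sequences:
            for i in range(self.k, len(seq)):
                context = seq[i - self.k:i]
                by_base = self.counts.setdefault(context, {})
                by_base[seq[i]] = by_base.get(seq[i], 0) + 1
        return self

    def log_likelihood(self, seq):
        # Sum of log P(base | previous k bases), with additive smoothing
        # over the four DNA bases.
        total = 0.0
        for i in range(self.k, len(seq)):
            by_base = self.counts.get(seq[i - self.k:i], {})
            numerator = by_base.get(seq[i], 0) + self.pseudocount
            denominator = sum(by_base.values()) + 4 * self.pseudocount
            total += math.log(numerator / denominator)
        return total


def zero_shot_score(model, reference, position, alt_base):
    """Log-likelihood of the variant minus that of the reference.

    Assumption: more negative scores mean the model finds the variant
    less plausible than the reference. No task-specific training is used.
    """
    variant = reference[:position] + alt_base + reference[position + 1:]
    return model.log_likelihood(variant) - model.log_likelihood(reference)


# Toy usage: the "genome" is a repeating ACGT pattern (made-up data).
model = ToyKmerModel(k=2).fit(["ACGT" * 20])
reference = "ACGTACGTACGT"
score = zero_shot_score(model, reference, position=5, alt_base="T")
```

In this toy setting, a substitution that breaks the learned pattern receives a negative score, while substituting the reference base itself scores exactly zero. A real evaluation would use the trained model's likelihoods and labelled clinical variants.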

Encouragingly, when analyzing the breast cancer-linked BRCA1 gene, the model’s internal representations could be used to train a classifier that reached an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.95, outperforming the base model's zero-shot predictions. The model also accurately predicted the effects of non-single-nucleotide variants (complex mutations such as insertions and deletions), outperforming other tested models on these mutation classes in benchmark evaluations.
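For readers unfamiliar with the AUROC metric quoted above: it is the probability that a randomly chosen positive example (here, a pathogenic variant) receives a higher score than a randomly chosen negative one (a benign variant), with ties counted as one half. The minimal sketch below computes it directly from that definition; the labels and scores are made up for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for positive (e.g. pathogenic), 0 for negative (e.g. benign).
    scores: higher means "more likely positive".
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up example: three of the four positive/negative pairs are ordered correctly.
example = auroc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])  # 0.75
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes. The pairwise loop is quadratic in the number of examples, which is fine for illustration; library implementations use a sort-based method instead.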

Analyses of Evo 2’s generative capabilities revealed that the model could generate complete mitochondrial genomes and sequences resembling bacterial and yeast chromosomes that maintained natural biological architecture in silico, although such generated sequences do not necessarily represent replication-competent genomes.

Furthermore, when coupled with guidance from external predictive models and search algorithms, Evo 2 could design DNA sequences that folded into specific physical shapes in mouse cells and even encoded Morse code messages ("LO", "ARC", "EVO2") in the DNA's physical accessibility patterns. These designs were experimentally validated in mouse embryonic stem cells using chromatin accessibility assays (AUROC of 0.92 to 0.95), demonstrating that the generated DNA functioned as intended within living cells.
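The article does not name the search algorithm or the external predictors used. As a rough illustration of the general pattern (a generator proposes candidates, an external scorer ranks them, and the best are kept and extended), the sketch below uses beam search. Everything in it is an assumption for illustration: `propose_chunks` is a random stand-in for sampling from a generative model, and `toy_score` is a stand-in predictor that simply rewards GC-rich chunks where the target pattern is 1 and AT-rich chunks where it is 0.

```python
import random

BASES = "ACGT"


def propose_chunks(prefix, n, chunk_len, rng):
    """Hypothetical stand-in for sampling continuations from a generative model.

    A real model would condition on `prefix`; this toy version ignores it
    and returns uniformly random chunks.
    """
    return ["".join(rng.choice(BASES) for _ in range(chunk_len)) for _ in range(n)]


def toy_score(seq, target, chunk_len):
    """Hypothetical stand-in for an external predictor.

    target[i] == 1 asks for a GC-rich chunk i, 0 for an AT-rich one.
    Returns the negative total error, so higher is better and 0 is perfect.
    """
    error = 0.0
    for i in range(len(seq) // chunk_len):
        chunk = seq[i * chunk_len:(i + 1) * chunk_len]
        gc_fraction = sum(base in "GC" for base in chunk) / chunk_len
        error += abs(gc_fraction - target[i])
    return -error


def guided_design(target, chunk_len=8, samples=16, beam_width=4, seed=0):
    """Beam search: extend each kept prefix, score, keep the top candidates."""
    rng = random.Random(seed)
    beam = [""]
    for _ in target:
        candidates = [
            prefix + chunk
            for prefix in beam
            for chunk in propose_chunks(prefix, samples, chunk_len, rng)
        ]
        candidates.sort(key=lambda s: toy_score(s, target, chunk_len), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


# Toy target pattern (made up); a message could be mapped to such a pattern.
target_pattern = [1, 0, 1, 1, 0]
designed = guided_design(target_pattern)
```

Keeping several candidates per step (`beam_width`) rather than only the single best lets the search recover from a locally good but globally poor choice, at the cost of more scorer calls.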

Finally, interpretability tools revealed that specific artificial neurons in Evo 2 had spontaneously learned to recognize biological features. Evo 2 generated candidate regulatory regions that showed a statistically significant enrichment of transcription factor motifs (P = 3.6 × 10⁻⁷), confirming the model was capturing biologically meaningful regulatory patterns rather than producing random sequences.
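The article does not state which statistical test produced the enrichment P value above. One standard choice for this kind of question, assumed here purely for illustration, is a one-sided hypergeometric test: given a background set in which some sequences contain a motif, how likely is it that a sample of a given size contains at least the observed number of motif hits by chance? The function name and all numbers below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """One-sided P(X >= observed) for X ~ Hypergeometric.

    population: total number of sequences in the background set
    successes:  how many of those contain the motif
    draws:      number of sequences in the sample being tested
    observed:   how many sampled sequences contain the motif
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Made-up numbers: 1000 background sequences, 100 with the motif;
# 50 sampled sequences, 20 of which contain the motif.
p_value = hypergeom_enrichment_p(1000, 100, 50, 20)
```

A small P value means the observed number of motif hits would be unlikely if the sampled sequences were drawn at random from the background.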

Conclusion

Evo 2 represents a paradigm shift from analyzing isolated biological components to modeling the holistic complexity of genomes. Its extensive context window and mechanistic advancements enable it to elucidate universal patterns of evolution and generalize from single-celled organisms to humans.

To foster innovation, the researchers have made the model parameters, code, and dataset fully open source, thereby democratizing access to this cutting-edge technology. Evo 2’s development marks a significant step toward a future where biology is not just studied, but programmable.

Source:

https://www.news-medical.net/news/20260305/AI-trained-on-9-trillion-DNA-letters-predicts-harmful-mutations-and-designs-new-genomes.aspx