“Arc Institute researchers have developed a machine learning model called Evo 2 that is trained on the DNA of over 100,000 species across the entire tree of life. Its deep understanding of biological code means that Evo 2 can identify patterns in gene sequences across disparate organisms that experimental researchers would need years to uncover. The model can accurately identify disease-causing mutations in human genes and is capable of designing new genomes that are as long as the genomes of simple bacteria. Evo 2’s developers—made up of scientists from Arc Institute and NVIDIA, convening collaborators across Stanford University, UC Berkeley, and UC San Francisco—will post details about the model as a preprint on February 19, 2025, accompanied by a user-friendly interface called Evo Designer. The Evo 2 code is publicly accessible from Arc’s GitHub, and is also integrated into the NVIDIA BioNeMo framework, as part of a collaboration between Arc Institute and NVIDIA to accelerate scientific research. Arc Institute also worked with AI research lab Goodfire to develop a mechanistic interpretability visualizer that uncovers the key biological features and patterns the model learns to recognize in genomic sequences. The Evo team is sharing its training data, training and inference code, and model weights to release the largest-scale, fully open source AI model to date. Building on its predecessor Evo 1, which was trained entirely on single-cell genomes, Evo 2 is the largest artificial intelligence model in biology to date, trained on over 9.3 trillion nucleotides—the building blocks that make up DNA or RNA—from over 128,000 whole genomes as well as metagenomic data. In addition to an expanded collection of bacterial, archaeal, and phage genomes, Evo 2 includes information from humans, plants, and other single-celled and multi-cellular species in the eukaryotic domain of life. “Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology, as the models have enabled machines to read, write, and think in the language of nucleotides,” says Patrick Hsu (@pdhsu), Arc Institute Co-Founder, Arc Core Investigator, an Assistant Professor of Bioengineering and Deb Faculty Fellow at University of California, Berkeley, and a co-senior author on the Evo 2 preprint. “Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life. We’re excited to see what the research community builds on top of these foundation models.”
Sorry, comments are closed for this post.