Research scientists at Meta AI have revealed that the ESM2 language model generalises beyond natural proteins and allows the programmable generation of complex and modular protein structures. They explain this in detail in two new research papers.
The team at Meta AI Research comprised research scientists Robert Verkuil, Tom Sercu, Ori Kabeli, Alex Rives and many others. They collaborated with Sergey Ovchinnikov of Harvard University, Yilun Du of MIT, and Basile Wicky, Lukas Milles, Justas Dauparas and renowned biochemist David Baker of the University of Washington for the project.
ESM2 learns the design principles of proteins. In collaboration with the University of Washington's Institute for Protein Design, the team experimentally validated 152 ESM2 designs, including de novo generations beyond natural proteins (<20% sequence identity to known proteins). The team also implemented a high-level programming language for generative protein design with ESM2, which can generate large proteins and complexes with complicated modular structures.
Tom Sercu of the same team took to Twitter to explain how language models can generalise beyond natural proteins to design completely new ones from scratch.
The scientists experimentally evaluated 228 proteins, with an overall success rate of 67%. In the first step, the team designed sequences for fixed target backbones. Using only the LM, they produced successful designs for all targets: despite the LM being trained solely on sequences, LM-based designs succeeded in 19/20 cases, compared with 1/20 for designs made without the LM.
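To make the fixed backbone step concrete, here is a minimal sketch of the kind of annealed MCMC search it suggests: propose point mutations and accept or reject them by how well a language-model-derived energy scores the mutated sequence against the target backbone. The `energy_fn` interface and the toy scoring function below are assumptions for illustration; they stand in for an actual ESM2-based score, which the article does not detail.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def design_fixed_backbone(target_backbone, length, energy_fn,
                          steps=10000, t_start=1.0, t_end=0.01):
    """Simulated-annealing search over sequences for one fixed backbone.
    energy_fn(sequence, backbone) -> float, lower meaning the model judges
    the sequence more compatible with the target backbone."""
    seq = [random.choice(AMINO_ACIDS) for _ in range(length)]
    energy = energy_fn("".join(seq), target_backbone)
    for step in range(steps):
        # Geometric annealing schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        # Propose a single-position mutation.
        pos = random.randrange(length)
        old = seq[pos]
        seq[pos] = random.choice(AMINO_ACIDS)
        new_energy = energy_fn("".join(seq), target_backbone)
        # Metropolis rule: keep improvements, sometimes keep worse moves.
        if new_energy <= energy or random.random() < math.exp((energy - new_energy) / t):
            energy = new_energy
        else:
            seq[pos] = old  # revert the mutation
    return "".join(seq), energy

# Toy stand-in energy for demonstration only (it simply favors alanine);
# a real run would plug in an ESM2-derived backbone-compatibility score.
seq, e = design_fixed_backbone(None, length=30,
                               energy_fn=lambda s, b: -s.count("A"),
                               steps=2000)
print(seq, e)
```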
In the second phase, unconstrained generation, the scientists proposed a new method to sample (sequence, structure) pairs from the energy landscape specified by the LM. The method explores a variety of topologies with a good experimental success rate (71/129, or 55%). The team compared their generated protein sequences to sequence databases that include all known natural proteins to prove that the LM generalises beyond natural proteins. For many designs there were no strong matches: both the sequences and their predicted structures depart from natural proteins.
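A hedged sketch of what joint sampling of (sequence, structure) pairs could look like: each proposal mutates the sequence, re-predicts the structure, and accepts or rejects the pair as a unit under the LM-specified energy. `predict_structure` and `energy_fn` are placeholder interfaces, not the paper's actual implementation.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def sample_sequence_structure(energy_fn, predict_structure, length,
                              steps=5000, temperature=0.1):
    """Markov-chain sampling of (sequence, structure) pairs under an
    LM-specified energy. Assumed interfaces: energy_fn(seq, struct) -> float
    (lower is better) and predict_structure(seq) -> structure."""
    seq = [random.choice(ALPHABET) for _ in range(length)]
    struct = predict_structure("".join(seq))
    energy = energy_fn("".join(seq), struct)
    for _ in range(steps):
        pos = random.randrange(length)
        old = seq[pos]
        seq[pos] = random.choice(ALPHABET)
        new_struct = predict_structure("".join(seq))
        new_energy = energy_fn("".join(seq), new_struct)
        # Accept or reject the proposed (sequence, structure) pair jointly.
        if new_energy <= energy or random.random() < math.exp((energy - new_energy) / temperature):
            struct, energy = new_struct, new_energy
        else:
            seq[pos] = old  # revert; the old structure stays current
    return "".join(seq), struct, energy

# Toy stand-ins for demonstration: the 'structure' is just a string tag and
# the energy rewards adjacent residues being different, unlike a real LM energy.
toy_structure = lambda s: "predicted:" + s[:3]
toy_energy = lambda s, st: -sum(s[i] != s[i + 1] for i in range(len(s) - 1))
print(sample_sequence_structure(toy_energy, toy_structure, length=20, steps=500))
```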
Of the 152 experimentally successful designs, 35 show no appreciable sequence similarity to any known natural protein. For the remaining 117 designs, sequence identity to the closest match is a median of 27%, below 20% for six designs, and as low as 18% for three. The language model produced a successful design for each of the eight de novo fixed backbone targets that were experimentally evaluated, and the unconstrained generations span many topologies and secondary structure compositions.
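To make the identity figures concrete, percent sequence identity between a design and a candidate match can be computed from a global alignment, as in the self-contained Needleman-Wunsch sketch below. The paper's comparisons against databases of all known natural proteins rely on dedicated search tools, so this is only an illustration of the metric.

```python
def percent_identity(a: str, b: str, match=1, mismatch=-1, gap=-1) -> float:
    """Global (Needleman-Wunsch) alignment, then the percentage of aligned
    columns where the two residues are identical."""
    n, m = len(a), len(b)
    # Dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback, counting identical aligned positions.
    i, j, same, cols = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return 100.0 * same / cols

# Example: a designed fragment vs. a hypothetical closest natural match.
print(round(percent_identity("MKTAYIAKQR", "MKSAYIARQR"), 1))  # 80.0
```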
The designs reflect deep patterns linking sequence and structure, including motifs seen in related natural structures as well as motifs not seen in analogous structural contexts in well-known protein families. The findings demonstrate that, despite being trained only on sequences, language models learn a deep grammar of protein structure and can be used to design de novo (new) proteins beyond the design space that nature has explored.
Top-down design of proteins is challenging due to biological complexity; hence, most protein design has followed a manual, bottom-up strategy using components derived from nature. In their second paper, the group detailed how generative artificial intelligence can achieve the long-sought modularity and programmability of protein design. Advanced protein language models show emergent learning of protein design principles and atomic-resolution structure, and the team leveraged these developments to enable the programmable design of highly complex de novo protein sequences and structures.
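The articles do not spell out the syntax of the high-level programming language, but the idea of composing constraints into an energy that a generative sampler (like the MCMC loops sketched above) minimises can be illustrated as follows. Every name and constraint here is hypothetical, invented for illustration only.

```python
from typing import Callable, List

# A constraint maps a candidate protein (sequence plus predicted structure)
# to a penalty; zero means fully satisfied. This is a hypothetical picture
# of composable design constraints, not the paper's actual API.
Constraint = Callable[[str, object], float]

def symmetry(n_fold: int) -> Constraint:
    """Penalize sequences whose n_fold repeats disagree (a toy proxy for
    structural symmetry)."""
    def penalty(seq: str, struct: object) -> float:
        k = len(seq) // n_fold
        repeats = [seq[i * k:(i + 1) * k] for i in range(n_fold)]
        return sum(a != b for rep in repeats[1:] for a, b in zip(repeats[0], rep))
    return penalty

def contains_motif(motif: str) -> Constraint:
    """Require a functional motif somewhere in the sequence."""
    return lambda seq, struct: 0.0 if motif in seq else float(len(motif))

def total_energy(constraints: List[Constraint], seq: str, struct: object) -> float:
    """A 'program' is just a sum of constraint penalties that a generative
    sampler can minimise."""
    return sum(c(seq, struct) for c in constraints)

# Compose a program: a 3-fold symmetric protein carrying a binding motif.
program = [symmetry(3), contains_motif("HIS")]
print(total_energy(program, "HISAGHISAGHISAG", struct=None))  # 0.0
```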