Introduction
Protein design, the creation of proteins tailored to specific applications, is central to addressing a range of biomedical and environmental challenges. Recent advances in Transformer-based architectures have produced language models capable of generating human-like text. Because protein sequences, like human language, are information-complete (their order encodes structure and function), the authors hypothesized that NLP methods could be repurposed for protein design. Despite the differences between the two domains, this analogy has motivated the application of NLP methods to protein research for decades. Supervised NLP methods, which require labeled data, have been used for tasks such as structural similarity detection and stability prediction. Unsupervised learning, particularly with Transformer-based models (BERT-like architectures and autoregressive models), has more recently emerged as a powerful alternative. Models such as TCR-BERT, epiBERTope, ESM, ProtTrans, and ProteinBERT rely on unsupervised learning and focus mostly on sequence embeddings. Autoregressive models, exemplified by the GPT-x series, excel at generating long, coherent text and have inspired protein autoregressive language models such as ProGen, RITA, and DARK. This study set out to train a generative model that effectively learns the protein language, generates fit and stable proteins, and explores previously unseen regions of protein space.
Literature Review
The field of natural language processing (NLP) has advanced rapidly thanks to large pre-trained language models, which now power chatbots, smart assistants, and machine translation. The analogy between protein sequences and human language has been explored before: both are information-complete, with structure and function encoded efficiently in the order of the sequence. Supervised NLP methods have been applied to protein research problems, including structure and stability prediction, and the BioSeq-BLM platform collects supervised language models for biomolecules. More recently, unsupervised learning with Transformer-based architectures has gained prominence: models such as TCR-BERT, epiBERTope, ESM, ProtTrans, and ProteinBERT achieve competitive performance with BERT-like architectures trained by denoising autoencoding. Autoregressive models, such as the GPT-x series, have proven highly effective at generating long, coherent text, inspiring analogous approaches in protein design (ProGen, RITA, DARK).
Methodology
ProtGPT2, a decoder-only autoregressive Transformer with 738 million parameters, was trained in an unsupervised fashion. The training dataset consisted of approximately 50 million sequences from UniRef50 (version 2021_04), the UniProt database clustered at 50% sequence identity, chosen because such clustering has been shown to improve generalization and performance. Sequences were tokenized with the Byte Pair Encoding (BPE) algorithm, and the model was trained by minimizing the negative log-likelihood of each token given its preceding context, thereby learning the relationships between amino acids within sequences. Different decoding strategies (greedy search, beam search, and random sampling) were compared for sequence generation; random sampling restricted to the top k = 950 tokens with a repetition penalty of 1.2 produced the sequences that best resemble natural ones. Several extrinsic tests were developed to assess the quality of the generated sequences: amino acid propensity analysis, disorder prediction with IUPred3, secondary structure prediction with PSIPRED, homology detection with HHblits against Uniclust30, AlphaFold2 structure prediction, Rosetta Relax scoring, and molecular dynamics (MD) simulations. Similarity networks built from HMM profiles with HHsearch were used to place the generated sequences in the context of known protein space.
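For illustration, the sampling step described above can be reproduced with the Hugging Face transformers pipeline. The sketch below assumes the publicly released checkpoint (Hub model id nferruz/ProtGPT2) and simply mirrors the reported settings (top-k = 950, repetition penalty = 1.2); it approximates the generation step rather than reproducing the authors' exact pipeline.

    from transformers import pipeline

    # Load the released ProtGPT2 checkpoint from the Hugging Face Hub (assumed model id).
    protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

    # Random sampling with the settings reported above: top-k = 950, repetition penalty = 1.2.
    sequences = protgpt2(
        "<|endoftext|>",          # start token so generation begins a fresh sequence
        max_length=100,           # length in BPE tokens (one token spans roughly 4 residues on average)
        do_sample=True,           # random sampling rather than greedy or beam search
        top_k=950,
        repetition_penalty=1.2,
        num_return_sequences=10,  # number of candidate sequences to draw
        eos_token_id=0,           # id of "<|endoftext|>" in the released tokenizer
    )

    for s in sequences:
        print(s["generated_text"])

Because the BPE vocabulary groups several residues per token, max_length is specified in tokens rather than amino acids.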
Key Findings
ProtGPT2 effectively learned the protein language, generating sequences whose amino acid composition and disorder propensities are comparable to those of natural sequences. The model produces predominantly globular proteins (88%), mirroring the distribution in natural proteins. HHblits searches against Uniclust30 showed that ProtGPT2 sequences are only distantly related to known proteins, indicating genuine generation rather than memorization of the training set. AlphaFold predictions revealed well-folded structures with a mean pLDDT of 63.2 (best-scoring structure per sequence). Rosetta Relax calculations yielded average Rosetta energies comparable to those of natural proteins, suggesting similar stability, and MD simulations showed similar dynamic properties. Importantly, network analysis revealed that ProtGPT2 sequences explore previously uncharted regions of protein space, expanding natural superfamilies and producing topologies absent from current databases. Finally, visual inspection of structural superpositions identified cases in which ligand-binding hotspots were conserved in ProtGPT2 sequences despite low sequence identity, suggesting preservation of functional determinants.
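As a minimal sketch of how a per-sequence confidence score such as the mean pLDDT reported above can be extracted, the snippet below averages the per-residue pLDDT values that AlphaFold2 writes into the B-factor column of its predicted PDB files; the file names are hypothetical and this is not the authors' evaluation code.

    from pathlib import Path

    def mean_plddt(pdb_path):
        """Average per-residue pLDDT over C-alpha atoms (AlphaFold stores pLDDT in the B-factor field)."""
        scores = [
            float(line[60:66])                       # PDB columns 61-66: B-factor / pLDDT
            for line in Path(pdb_path).read_text().splitlines()
            if line.startswith("ATOM") and line[12:16].strip() == "CA"
        ]
        return sum(scores) / len(scores)

    # Keep the best-scoring model per sequence, as in the reported per-sequence average.
    models = ["seq_001_rank1.pdb", "seq_001_rank2.pdb"]   # hypothetical file names
    best = max(mean_plddt(p) for p in models)
    print(f"best mean pLDDT: {best:.1f}")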
Discussion
ProtGPT2 represents a significant advancement in de novo protein design, leveraging the power of unsupervised language models. The model's ability to generate sequences with properties similar to natural proteins, while exploring novel regions of protein space, opens up new avenues for protein engineering. The generation of complex, non-idealized structures—including challenging folds such as all-β structures and membrane proteins—is a notable achievement. The preservation of functional determinants, as observed in the ligand-binding hotspots, further emphasizes the model's potential for targeted protein design. The speed and ease of use of ProtGPT2 make it a valuable tool for high-throughput protein engineering. Further development could include incorporating conditional tags to enable more specific and controlled generation of proteins with desired properties.
Conclusion
ProtGPT2 demonstrates the effectiveness of large language models in protein design. It generates novel, stable, structurally complex proteins that resemble natural ones while exploring largely uncharted regions of protein space. Its ability to preserve functional motifs despite low sequence identity opens new opportunities for protein engineering. Future research could incorporate functional information so that ProtGPT2 can be directed toward designing proteins with specific, or even novel, functions.
Limitations
While ProtGPT2 generates sequences with many properties resembling those of natural proteins, experimental validation is still required to confirm the predicted structures and functions. The accuracy of the in silico assessments also depends on the underlying training data and on the limitations of the prediction tools used (AlphaFold, Rosetta, etc.). Furthermore, the model does not yet incorporate explicit functional information; future work could address this by including data that links sequences to functions.