Introduction
The demand for novel enzymes, for applications ranging from chemical production to pharmaceuticals, exceeds the supply of naturally occurring ones. Traditional directed evolution is inefficient because a high percentage of mutations are nonfunctional. Computational protein design offers an alternative: it can generate diverse sequences and potentially reduce the number of experiments required. Several classes of generative model exist, including deep neural networks (generative adversarial networks, or GANs; variational autoencoders, or VAEs; and language models) and statistical methods such as ancestral sequence reconstruction (ASR) and direct coupling analysis (DCA). However, comparing their effectiveness and validating the computational metrics used to predict functionality remain significant challenges. Existing evaluations often use different experimental systems, which makes direct comparison difficult, and common computational metrics lack experimental validation. This study addresses these gaps by experimentally evaluating a wide range of computational metrics for predicting the in vitro activity of enzyme sequences generated by three different models.
Literature Review
The literature highlights both the challenges and the successes of generating functional proteins computationally. Directed evolution, while effective, suffers from low success rates because random mutations are largely deleterious. Generative models aim to overcome this by learning the distribution of functional sequences from large datasets such as UniProt. Proposed approaches include deep learning models (GANs, VAEs, language models) and statistical methods (ASR, DCA). However, a major limitation in the field is the lack of consistent experimental validation, both of these models and of the computational metrics used to assess them. Existing studies often employ different experimental settings, which hinders meaningful comparison. Current evaluations of generative models often rely on alignment-based scores such as sequence identity, which are cheap to compute but may not fully capture functional aspects, especially epistatic interactions. Alignment-free and structure-based metrics offer potential advantages, but their predictive power requires experimental validation.
Methodology
The study employed three generative models: ESM-MSA (a transformer-based language model), ProteinGAN (a convolutional neural network GAN), and ASR. Two enzyme families, malate dehydrogenase (MDH) and copper superoxide dismutase (CuSOD), were chosen for their sequence diversity, availability of structures, and assayable activity. The study was conducted in three rounds. Round 1 involved naive sequence generation and evaluation of 144 sequences (18 per family from each of the three models and from natural controls), revealing a low success rate. Analysis indicated that overtruncation during data preparation negatively impacted activity. Round 2 addressed this by using full-length sequences and increasing the sequence identity threshold for generated sequences. A broader set of 20 computational metrics was evaluated, including alignment-based metrics (sequence identity, BLOSUM62 scores, ESM-MSA probabilities), alignment-free metrics (CARP-640M probabilities, ESM-1v probabilities, net charge), and structure-based metrics (Rosetta energies, AlphaFold2 pLDDT scores, ProteinMPNN, ESM-IF, MIF-ST). Round 3 focused on validating a composite computational filter (COMPSS), which combines ESM-1v and ProteinMPNN metrics, as a way to select sequences for experimental testing. The filter included automated quality checks for sequence features (start methionine, long repeats, transmembrane domains). Approximately 200 sequences were selected for each model and family based on the filter. Additionally, the study validated COMPSS using publicly available datasets from six additional enzyme families and two different models. Protein expression and purification were performed in *E. coli*, followed by in vitro enzymatic assays to measure activity. Statistical analysis, including AUC-ROC and Spearman correlation, was used to evaluate the predictive power of the metrics and the filter’s effectiveness.
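To make the staged selection idea concrete, the following Python sketch applies cheap sequence checks first, then a sequence-based score cutoff, then a structure-based score cutoff. It is an illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the threshold arguments, the convention that higher scores are better, and the reading of "long repeats" as runs of a single residue are all hypothetical, and the transmembrane-domain check is left out because it needs an external predictor. Scores are assumed to have been computed beforehand.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    sequence: str
    esm1v_score: float  # hypothetical precomputed score, higher = better
    mpnn_score: float   # hypothetical precomputed score, higher = better


def passes_quality_checks(seq: str, max_repeat: int = 4) -> bool:
    """Cheap checks: start methionine and no long single-residue runs.

    "Long repeat" is assumed here to mean one residue repeated more than
    max_repeat times in a row; the study may define it differently.
    """
    if not seq.startswith("M"):
        return False
    # (.)\1{n,} matches a residue followed by n or more copies of itself,
    # i.e. a run of at least n + 1 identical residues.
    pattern = r"(.)\1{" + str(max_repeat) + r",}"
    return re.search(pattern, seq) is None


def staged_filter(
    candidates: List[Candidate], esm1v_min: float, mpnn_min: float
) -> List[Candidate]:
    """Keep candidates that pass every stage, cheapest stage first."""
    selected = []
    for cand in candidates:
        if not passes_quality_checks(cand.sequence):
            continue  # fails sequence-level quality checks
        if cand.esm1v_score < esm1v_min:
            continue  # fails the sequence-based score cutoff
        if cand.mpnn_score < mpnn_min:
            continue  # fails the structure-based score cutoff
        selected.append(cand)
    return selected
```

In a real pipeline the structure-based score would be computed only for candidates that survive the earlier stages, since it is the expensive step; here both scores are passed in to keep the sketch self-contained.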
Key Findings
Round 1 demonstrated low activity in generated sequences, largely attributed to overtruncation. Round 2, with improved data curation and a wider range of metrics, showed higher activity rates, particularly for ASR. Inverse folding metrics showed the best predictive power for activity (AUC-ROC of 0.72). Round 3 validated COMPSS, a filter combining ESM-1v and ProteinMPNN, which significantly increased the rate of active sequences (up to a twofold increase). The success rate for sequences passing the COMPSS filter was significantly higher than for those failing it (74% vs. 38%). Furthermore, the vast majority of active generated sequences selected by COMPSS exhibited activity levels comparable to wild-type controls. COMPSS was also applied to six additional enzyme families from published datasets, where it consistently enriched for active enzymes. Analysis indicated that ESM-1v and ProteinMPNN provide orthogonal information for predicting enzyme activity: success rates were highest among sequences scoring high on both metrics (the upper-right quadrant when one score is plotted against the other). For the lysozyme data, ProteinMPNN-based filtering also improved success rates.
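For readers unfamiliar with the statistics quoted above, here is a small Python sketch of how an AUC-ROC value and a fold enrichment can be computed. It is a generic illustration, not the study's analysis code; the function names are hypothetical. AUC-ROC is computed through its rank interpretation: the probability that a randomly chosen active sequence scores higher than a randomly chosen inactive one, with ties counted as one half.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC via pairwise comparison of active vs. inactive scores.

    labels: 1 (or True) for active, 0 (or False) for inactive.
    Quadratic in the number of sequences, which is fine for small sets.
    """
    active = [s for s, y in zip(scores, labels) if y]
    inactive = [s for s, y in zip(scores, labels) if not y]
    if not active or not inactive:
        raise ValueError("need at least one active and one inactive sequence")
    wins = 0.0
    for a in active:
        for i in inactive:
            if a > i:
                wins += 1.0   # active sequence ranked above inactive one
            elif a == i:
                wins += 0.5   # ties count as half
    return wins / (len(active) * len(inactive))


def fold_enrichment(pass_rate: float, fail_rate: float) -> float:
    """Ratio of the success rate among passing vs. failing sequences."""
    return pass_rate / fail_rate
```

An AUC-ROC of 0.5 corresponds to a metric with no predictive power and 1.0 to perfect separation, so the reported 0.72 indicates moderate predictive power. Taking the two reported success rates at face value, `fold_enrichment(0.74, 0.38)` is roughly 1.9, which is in line with the "up to twofold" description.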
Discussion
The study demonstrates the critical importance of data curation for effective protein sequence generation. Improved data preparation significantly enhanced the performance of the generative models. The development and validation of COMPSS provide a valuable tool for improving the efficiency of enzyme discovery from generated sequences. The combination of fast sequence-based filters (ESM-1v and quality checks) with more computationally expensive structure-based filtering (ProteinMPNN) represents a computationally efficient strategy for identifying active enzyme variants. The results suggest that while deep learning models have significant potential, careful selection of sequences based on a comprehensive set of metrics is essential to achieve high success rates in experimental validation. The findings highlight the interplay between generative models, computational metrics, and biological considerations specific to the enzyme family being studied.
Conclusion
This research establishes a robust framework, built around the COMPSS filter, for generating and selecting functional enzymes. The three-step workflow (data curation, sequence generation, and sequence selection using COMPSS) significantly improves success rates for generating active enzymes. COMPSS, which combines sequence-based and structure-based metrics, is particularly effective at identifying active sequences and can be applied to different enzyme families and generative models. This work advances protein engineering by providing a practical, experimentally validated computational tool for accelerating the discovery of novel enzymes.
Limitations
The study focused on two enzyme families, and the generalizability to all enzyme families needs further investigation. Although COMPSS improved success rates significantly, it is not a perfect predictor and some inactive sequences may still pass the filter. The computational cost of structure prediction (using AlphaFold2) limits the scalability of the method for extremely large datasets. The study is limited by the specific expression system (*E. coli*) used; success rates might vary in other systems.