
Biology
Computational scoring and experimental evaluation of enzymes generated by neural networks
S. R. Johnson, X. Fu, et al.
Sean R. Johnson and colleagues systematically evaluate 20 metrics for scoring enzyme sequence quality and introduce COMPSS, a computational filter that raises experimental success rates by 50–150%. The work establishes a benchmark for generative protein models and offers a practical pipeline for protein engineering.
~3 min • Beginner • English
Introduction
The study addresses the challenge of predicting which computationally generated protein sequences will express, fold and retain enzymatic activity. While generative models can sample beyond natural sequence space, many mutations yield nonfunctional proteins, and experimental validation is costly. Existing evaluations often rely on sequence similarity (for example, identity to closest natural sequence) and lack standardized, experimentally validated metrics across models and protein families. The purpose here is to systematically assess diverse computational metrics—alignment-based, alignment-free, and structure-informed—for their ability to predict activity of generated enzymes, and to develop a composite selection framework (COMPSS) that enriches for active sequences, thereby improving efficiency of protein engineering campaigns.
Literature Review
Generative protein models, including GANs, VAEs, language models, ASR, and DCA-based methods, learn from natural sequences under evolutionary constraints to propose functional variants. Prior reports compared generated sequences to natural controls via sequence alignment-derived scores; limited biological assays and varying systems hinder cross-study comparison. Alignment-based metrics (identity, BLOSUM62) capture homology but ignore epistasis and positional importance. Alignment-free language model likelihoods can detect deleterious sequence patterns and correlate with pathogenicity and evolutionary dynamics. Structure-based approaches (Rosetta energies, AlphaFold2 confidence, inverse folding models like ProteinMPNN, ESM-IF, MIF-ST) can reflect biophysical plausibility but are computationally expensive at scale. There has been no broad experimental validation establishing which metrics best predict activity of generated proteins, and no standard benchmark across generative approaches.
Methodology
Study design comprised three iterative rounds. Targets were two enzyme families with physiological relevance and tractable assays: malate dehydrogenase (MDH; 300–350 aa) and copper superoxide dismutase (CuSOD; 150–250 aa), both multimeric, with spectrophotometric activity assays. Three generative models were compared: (1) ESM-MSA (a masked language model used for iterative sampling), (2) ProteinGAN (a convolutional GAN with attention), and (3) ASR (phylogeny-based ancestral sequence reconstruction).

Round 1 (naive generation) trained on UniProt-derived sets filtered for correct Pfam domains and truncated to domain boundaries. More than 30,000 sequences were generated; 18 per model, plus natural test sequences with 70–80% identity, were selected for expression and in vitro activity assays in E. coli. Round 1 analysis diagnosed failure modes, including overtruncation (particularly affecting the CuSOD dimer interface) and the presence of signal peptides or transmembrane domains in natural sequences. A CuSOD pretest added eukaryotic/viral/bacterial selections with Phobius-predicted signal peptide handling and FeSOD controls.

Round 2 (calibration) improved curation by using full-length sequences, removing predicted signal peptides and transmembrane domains, shifting selection to 80–90% identity, restricting CuSOD to eukaryotic/viral sequences, and refining ESM-MSA sampling (masking one training-like sequence at a time within a nearest-neighbors MSA). Experimentally, 18 sequences each from ASR, GAN, and ESM-MSA plus 13 natural controls were tested.

Computational metrics spanned three classes: alignment-based (identity; BLOSUM62; PFASUM15; average phmmer score over the top 30 hits; ESM-MSA likelihood); alignment-free, sequence-only (ESM-1v; CARP-640M; net charge; absolute net charge; charged fraction); and structure-based. Structures were predicted with AlphaFold2, and structure metrics were then computed: Rosetta-relax energies; solvent-accessible surface areas (total, polar, apolar, percent polar); and inverse folding log-likelihoods (ProteinMPNN, ESM-IF, MIF-ST).
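Two of the alignment-free metrics above (net charge, absolute net charge, charged fraction) are simple enough to sketch directly. A minimal stdlib sketch, assuming a neutral-pH approximation in which Lys/Arg count +1 and Asp/Glu count −1 (histidine ignored); this is an illustration, not the paper's exact implementation:

```python
def charge_metrics(seq: str) -> dict:
    """Approximate charge-based metrics for a protein sequence.

    Neutral-pH approximation (an assumption of this sketch):
    Lys/Arg contribute +1, Asp/Glu contribute -1, His is ignored.
    """
    pos = sum(seq.count(aa) for aa in "KR")
    neg = sum(seq.count(aa) for aa in "DE")
    net = pos - neg
    return {
        "net_charge": net,
        "abs_net_charge": abs(net),
        # Fraction of residues that are charged at all.
        "charged_fraction": (pos + neg) / len(seq) if seq else 0.0,
    }
```

For example, `charge_metrics("MKRDE")` gives a net charge of 0 (one K and one R against one D and one E) with 4 of 5 residues charged.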
Correlations and AUC-ROC against activity were evaluated across models and families.

Round 3 (validation) implemented COMPSS: initial automated quality checks (sequence starts with methionine; no single-amino-acid run longer than 3 and no two-residue motif repeated more than 4 times; no predicted transmembrane domain; 50–80% identity to the closest natural sequence), then an ESM-1v threshold set at the top 10th percentile of natural sequence scores (more stringent than the empirically optimal ~20th percentile from round 2), then AlphaFold2 prediction and ProteinMPNN scoring on 200 passing sequences per model–family pair; 18 of the top 40 by ProteinMPNN score were randomly chosen for testing. For each selected sequence, an identity-matched control (within 1%) that failed the ESM-1v filter was included.

Assays: expression in E. coli BL21(DE3); purification by metal affinity; SDS–PAGE for expression/solubility; MDH activity by NADH oxidation at 340 nm; SOD activity by inhibition of WST-1 formazan formation; significance at P ≤ 0.05. Additional external validation applied COMPSS to published datasets of ProGen-generated lysozymes (five families) and bmDCA-designed chorismate mutases, adjusting filters to omit the identity and 'starts with M' checks where incompatible.

Data curation specifics: domain-based sequence gathering from UniProt, MGnify, and NCBI TSA; removal of transmembrane domains and signal peptides (Phobius); deduplication with CD-HIT; train/test splits; MSAs with MAFFT/MUSCLE; phylogenetic trees with FastTree; ASR with GRASP. Computational details: identity via ggsearch36 with BLOSUM62; ESM-1v and CARP-640M scores as average log-likelihood; ESM-MSA scores via the phmmer-nearest 31 training sequences and masked passes; inverse folding scores from AlphaFold2-predicted structures; Rosetta-relax on predicted structures; SASA via FreeSASA; repeat scoring by longest n-mer counts.
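The initial automated quality checks can be sketched as a simple sequence filter. A minimal illustration, assuming percent identity to the closest natural sequence is precomputed externally (e.g. with ggsearch36) and omitting the transmembrane prediction step, which requires an external tool:

```python
import re

def passes_quality_checks(seq: str, identity: float) -> bool:
    """Sketch of the COMPSS-style automated quality checks.

    `identity` is percent identity to the closest natural training
    sequence, assumed precomputed; transmembrane prediction is omitted.
    """
    # Must start with methionine.
    if not seq.startswith("M"):
        return False
    # Reject a run of one amino acid longer than 3 (e.g. "AAAA").
    if re.search(r"(.)\1{3}", seq):
        return False
    # Reject a two-residue motif repeated more than 4 times (e.g. "QNQNQNQNQN").
    if re.search(r"(..)\1{4}", seq):
        return False
    # Keep sequences in the 50-80% identity window.
    return 50.0 <= identity <= 80.0
```

Backreferences (`\1`) make the repeat checks compact: `(.)\1{3}` matches four identical residues in a row, and `(..)\1{4}` matches five consecutive copies of the same pair.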
Key Findings
Round 1: overall, only 19% of tested sequences were active. ASR produced many actives (CuSOD: 9/18; MDH: 10/18), while ESM-MSA often failed for CuSOD and the GAN for MDH. Overtruncation (removal of dimer-interface residues) and inclusion of signal peptides or transmembrane segments in natural sequences explained many failures; truncating known positive controls abolished activity, supporting this diagnosis. A CuSOD pretest with corrected handling yielded activity in 8/14 CuSODs and both FeSOD controls.

Round 2: with curated training data and improved ESM-MSA sampling, activity rates increased markedly: 66% of natural controls were active, and generated sequences were ≥50% active for every model–family combination except GAN–MDH (2/18). Metrics calibration across models and families showed that inverse folding metrics best predicted activity on average (AUC-ROC ≈ 0.72); ESM-1v was the strongest alignment-free metric (average AUC-ROC ≈ 0.68). AlphaFold2 pLDDT predicted activity for CuSOD (Wilcoxon P = 5 × 10⁻⁷) but not for MDH. Sequence identity did not predict activity within the 70–90% identity window. Structure-based metrics were mutually correlated; ProteinMPNN offered strong predictive power at far lower computational cost than Rosetta.

Round 3: the COMPSS filter (quality checks + ESM-1v threshold + ProteinMPNN) greatly enriched actives among sequences with 50–80% identity to the closest natural training sequence. Selected sets showed high activity rates: ESM-MSA CuSOD 17/18 (94%) and MDH 18/18 (100%) active. Pooled across models and families, 74% of COMPSS-selected sequences were active versus a significantly lower rate in identity-matched controls that failed the ESM-1v filter, a 77% higher success rate (Fisher exact P = 0.00018). Moreover, 83% (44/53) of active generated sequences had activities within 10× of wild-type controls.
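The enrichment comparison above rests on a Fisher exact test over the 2×2 table of active/inactive counts for filter-passing versus control sequences. A stdlib-only sketch of the one-sided (enrichment) version via the hypergeometric tail, with purely illustrative counts rather than the paper's actual table:

```python
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric distribution with the
    table margins fixed (tests enrichment of the top-left cell).
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    return sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    ) / denom

# Illustrative counts only (not the paper's data):
# 15/20 filter-passing sequences active vs 6/20 controls active.
p = fisher_exact_greater(15, 5, 6, 14)
```

Libraries like SciPy provide a two-sided version (`scipy.stats.fisher_exact`); the one-sided tail shown here is the simplest to implement from scratch.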
External datasets: applying COMPSS to five lysozyme families (ProGen) and to chorismate mutases (bmDCA) increased the fraction of functional enzymes among filter-passing sequences in 5/6 families; ProteinMPNN AUC-ROCs ranged from 0.6 to 1.0. Among lysozymes already prefiltered by language models, additional ProteinMPNN filtering raised the success rate by 38% (39/46 vs 27/44). Trends held even below 70% identity for chorismate mutase (AUC-ROC ≈ 0.91–0.92 for both ProteinMPNN and ESM-1v) and across lysozyme datasets (ProteinMPNN AUC ≈ 0.67).
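The AUC-ROC values reported throughout have a direct interpretation: the probability that a randomly chosen active sequence outscores a randomly chosen inactive one. A minimal stdlib sketch (equivalent to a normalized Mann–Whitney U statistic, with ties counted as half):

```python
def auc_roc(scores: list, labels: list) -> float:
    """AUC-ROC of `scores` against binary `labels` (1 = active).

    Equals P(score of an active > score of an inactive), counting
    tied scores as 0.5 -- the rank-based Mann-Whitney formulation.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both active and inactive examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A metric that perfectly separates actives from inactives scores 1.0; a metric no better than chance scores 0.5, which is the baseline against which values like 0.67–0.72 should be read.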
Discussion
The work demonstrates that computational metrics can prospectively predict activity of generated enzymes when applied in a composite manner. Identity-based homology is insufficient within 70–90% identity ranges to distinguish functional from nonfunctional variants. AlphaFold2 confidence (pLDDT) is not a universal predictor across families, despite accurate structure prediction; it correlated with activity for CuSOD but not for MDH. In contrast, language model likelihoods (ESM-1v) and inverse folding probabilities (ProteinMPNN) capture complementary properties: sequence plausibility under evolutionary constraints and compatibility with a folded structure, respectively. Their moderate correlation indicates orthogonality; combining them in COMPSS addresses multiple failure modes (folding/stability, expression, structural compatibility) and improves experimental efficiency while controlling compute via staged filtering. Curation of training data (removing signal peptides and transmembrane domains, avoiding overtruncation, ensuring full-length constructs) was crucial, turning poor round 1 outcomes into high success in rounds 2–3. ASR consistently generated functional sequences even under naive settings, highlighting that deep generative models still lag behind but can be effective with curation and filtering. External validations across unrelated models and families support the generalizability of COMPSS, though family-specific adjustments are advised. The findings provide a benchmark dataset and a practical pipeline for selecting candidates from generative models, reducing wet-lab burden and accelerating protein engineering.
Conclusion
This study introduces and experimentally validates COMPSS, a composite selection framework that integrates fast sequence-level quality checks and language model scores (ESM-1v) with structure-informed inverse folding scores (ProteinMPNN) to prioritize generated enzyme sequences. Applied prospectively, COMPSS achieved up to 100% activity in selected sets and a 50–150% improvement in success rates over unfiltered baselines, with most active sequences near wild-type activity levels. Identity alone and AlphaFold2 pLDDT are inadequate as universal predictors within close homology ranges, whereas ESM-1v and ProteinMPNN jointly provide robust, complementary signals. The work offers a curated dataset of >500 tested enzymes, code, and notebooks to benchmark future generative models and guide sequence selection. Future directions include extending COMPSS to additional protein families and functions, refining family-specific quality filters and thresholds, integrating additional orthogonal metrics (e.g., dynamics, oligomerization interfaces, expression predictors), reducing computation with improved prefilters, and expanding experimental datasets to disentangle model–family–metric interactions and mitigate potential overfitting.
Limitations
Performance and metric predictiveness vary by protein family and generative model, reflecting distinct failure modes. Some metrics share modeling assumptions with the generative models themselves, risking overfitting. Despite testing over 2,200 variants across eight families (including literature datasets), coverage of the vast sequence space remains limited. AlphaFold2 pLDDT is not consistently predictive of activity; high-confidence structures can still be inactive. The ESM-1v thresholding introduced phylogenetic bias, particularly for MDH. Structure-based steps add computational cost; although ProteinMPNN is efficient, structure prediction at scale remains expensive. Early failures highlighted sensitivity to data curation (overtruncation, presence of signal peptides and transmembrane domains), and results depend on expression in E. coli and in vitro assay conditions, which may not generalize to other hosts or to in vivo function. ProGen lysozyme results suggest model- and family-specific fine-tuning may be required for both generation and selection.