Biology
Machine learning differentiates enzymatic and non-enzymatic metals in proteins
R. Feehan, M. W. Franklin, et al.
The study addresses the longstanding challenge of distinguishing enzymatic from non-enzymatic metal-binding sites in proteins. Enzymes accelerate reactions dramatically and are cataloged by EC numbers; however, despite extensive sequence and structural data, accurate identification of catalytic sites remains difficult, particularly for metalloproteins where catalytic and non-catalytic metal sites share similar physicochemical environments. Prior work has identified descriptive features of catalytic residues (e.g., deviations in theoretical titration curves, location in large surface clefts near the protein centroid, and networks enabling energy transfer), but these descriptors have limited predictive power. Sequence-based machine learning approaches that predict EC numbers often miss active-site subtleties and can fail when residues near the active site are mutated. Structure-based catalytic residue predictors face severe class imbalance and may conflate pocket detection with catalysis. This work proposes a structure-based, site-focused machine learning approach centered on metal ions in proteins to mitigate differences in local environments and class imbalance. The authors construct a large, curated structural dataset of enzymatic and non-enzymatic metal-binding sites and develop MAHOMES, a decision-tree ensemble model that leverages physicochemical features to distinguish enzymatic from non-enzymatic metal sites with high precision and recall.
The paper situates its contribution within several strands of prior research: (1) Electrostatics-based indicators of catalysis, where deviations in theoretical titration curves (THEMATICS) identify residues involved in Brønsted acid–base chemistry. (2) Geometric descriptors showing catalytic residues often reside in the largest accessible clefts near the molecular centroid. (3) Dynamic and network-based views highlighting energy transfer pathways among active-site residues. (4) Sequence-based EC number predictors (e.g., DeepEC, DEEPre, EFICAz2.5) that perform hierarchical classification but can overlook active-site context and may struggle to distinguish native from perturbed active sites. (5) Catalytic residue predictors combining sequence conservation and structural features, which suffer from extreme class imbalance (~1:124) leading to low precision and potential bias toward pocket detection rather than catalytic function. The authors also note the prevalence of metalloproteins (e.g., MetalPDB annotations) and that many metals occupy pockets with unusual electrostatics, suggesting metal sites as a promising focus for improved classification of catalytic activity.
Data collection and curation: Crystal structures containing the metals Fe, Cu, Zn, Mn, Mg, Mo, Ni, or Co (various residue codes) were retrieved from the RCSB PDB. Filters included: no nucleotide macromolecules; resolution ≤3.5 Å; ≤40 sites per structure; protein chains ≥20 residues; UniProtKB accession present. Metal atoms within 5 Å of one another on the same chain were grouped into a single site; potential multichain sites (metal within 4 Å of another chain) and sites with >4 metals (e.g., storage proteins) were removed. Sites within 10 Å of mutated residues were excluded. Structures deposited before 2018 formed the dataset; those from 2018 onward formed a temporal holdout test-set (T-metal-site).
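The structure-level filters and the temporal split described above can be sketched as a single predicate. This is a minimal illustration, not the authors' pipeline: the `Candidate` record and its field names are hypothetical, and the site-level steps (multichain removal, >4 metals, proximity to mutations) are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical summary of one protein chain in a PDB entry (illustrative fields)."""
    resolution: float            # crystallographic resolution in angstroms
    n_sites_in_structure: int    # number of metal sites in the whole structure
    chain_length: int            # residues in this protein chain
    has_nucleic_acid: bool       # structure contains a nucleotide macromolecule
    has_uniprot_accession: bool
    deposit_year: int


def assign_split(c: Candidate) -> Optional[str]:
    """Return 'dataset', 'test', or None when any filter from the text fails."""
    passes = (
        not c.has_nucleic_acid
        and c.resolution <= 3.5
        and c.n_sites_in_structure <= 40
        and c.chain_length >= 20
        and c.has_uniprot_accession
    )
    if not passes:
        return None
    # Temporal holdout: pre-2018 depositions train, 2018 onward test.
    return "dataset" if c.deposit_year < 2018 else "test"
```

Splitting by deposition date rather than at random means the test-set resembles the structures a user would submit after the model was built.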
Computational labeling: Starting from M-CSA entries (manually adjusted to exclude problematic cases such as heme-dependent or obligate oligomeric mechanisms), PHMMER identified PDB homologs (E-value ≤1e-6). Homologs of excluded M-CSA entries were removed. To reduce pseudoenzymes, homologs were required to have at least one of: associated EC number, “ase” in PDB classification, or “ase” in macromolecule name. Homologs were aligned to their M-CSA representatives via TM-align; those with TM-score ≥0.40 and with metal sites within 5 Å of annotated catalytic residues were labeled enzymatic. Other sites on chains containing enzymatic sites were labeled non-enzymatic. Sites lacking M-CSA homology and all enzymatic evidence (no EC and no “ase”) were labeled non-enzymatic. Ambiguous sites were discarded.
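The labeling rules above amount to a small decision procedure. The sketch below is one reading of those rules. The `SiteEvidence` record and its field names are hypothetical, and the upstream PHMMER and TM-align steps are assumed to have already produced the values it consumes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteEvidence:
    """Hypothetical per-site evidence record (field names are illustrative)."""
    has_mcsa_homolog: bool                       # PHMMER hit to an M-CSA entry
    tm_score: Optional[float]                    # TM-align score to the M-CSA representative
    dist_to_catalytic_residue: Optional[float]   # angstroms, after alignment
    has_ec_number: bool
    has_ase_annotation: bool                     # "ase" in classification or macromolecule name
    chain_has_enzymatic_site: bool = False       # another site on this chain was labeled enzymatic


def label_site(e: SiteEvidence) -> Optional[str]:
    """Return 'enzymatic', 'non-enzymatic', or None for ambiguous (discarded) sites."""
    enzymatic_evidence = e.has_ec_number or e.has_ase_annotation
    if (
        e.has_mcsa_homolog
        and enzymatic_evidence
        and e.tm_score is not None and e.tm_score >= 0.40
        and e.dist_to_catalytic_residue is not None
        and e.dist_to_catalytic_residue <= 5.0
    ):
        return "enzymatic"
    if e.chain_has_enzymatic_site:
        return "non-enzymatic"
    if not e.has_mcsa_homolog and not enzymatic_evidence:
        return "non-enzymatic"
    return None  # ambiguous: discarded
```

In practice `chain_has_enzymatic_site` would require a second pass over each chain once the enzymatic sites are known. Here it is simply an input.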
Redundancy reduction: Within each set (dataset and test-set), sequence-identical chains were collapsed; sequence homologs were clustered (PHMMER, E-value ≤1e-20). Within clusters, chains were aligned (TM-align) and sites aligning within 5 Å with TM-score ≥0.50 were compared for local similarity based on the Jaccard index of residue identities within 3.75 Å. Sites exceeding 0.80 local similarity were pruned, retaining the best representative (prioritizing catalytic sites, absence of other ligands, site isolation, no mutations, and resolution). To ensure non-overlap between dataset and test-set, an all-against-all Jaccard comparison of residue identities within 6.0 Å around metals removed cross-set similarities >0.80.
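The local-similarity pruning can be illustrated with a Jaccard index over residue sets. This sketch assumes each site is represented as a set of residue labels within the cutoff radius and that candidates arrive already sorted best-first by the priorities listed above. The greedy strategy and the function names are illustrative, not taken from the paper.

```python
from typing import FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard index |A and B| / |A or B| of two residue-label sets."""
    if not a and not b:
        return 1.0  # assumption: two empty environments count as identical
    return len(a & b) / len(a | b)


def prune_redundant(sites: List[FrozenSet[str]], threshold: float = 0.80) -> List[FrozenSet[str]]:
    """Greedy pruning: keep a site only if it is not too similar to any kept site.

    `sites` is assumed to be ordered best representative first.
    """
    kept: List[FrozenSet[str]] = []
    for site in sites:
        if all(jaccard(site, k) <= threshold for k in kept):
            kept.append(site)
    return kept


# Toy example with made-up residue labels.
a = frozenset({"H94", "H96", "H119", "T199", "E106", "W209"})
b = frozenset({"H94", "H96", "H119", "T199", "E106"})   # 5/6 overlap with a
c = frozenset({"C10", "C13", "C30", "C33"})             # unrelated site
print(prune_redundant([a, b, c]) == [a, c])  # True: b exceeds 0.80 similarity to a
```

The same `jaccard` function, applied all-against-all across the dataset and test-set with the 6.0 Å residue sets, would implement the cross-set overlap check.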
Additional preprocessing: Due to labeling inconsistencies, Mg sites within 10 Å of nucleic acid phosphates labeled as non-enzymatic (e.g., ATPases/GTPases) were removed. Structures were relaxed in Rosetta; sites where metals moved >3 Å (indicative of weak/artefactual binding) were excluded. Sites failing feature computation were removed.
Features: Five categories were computed on Rosetta-relaxed structures and scaled using scikit-learn’s RobustScaler (fit on 20th–80th percentiles): (1) Rosetta per-residue energy terms (beta_nov16) aggregated as mean or sum over shells (0–3.5, 3.5–5, 5–7.5, 7.5–9 Å) or spheres (0–3.5, 0–5, 0–7.5, 0–9 Å); 84 features per aggregation, with 294 unique features used across models. (2) Pocket geometry via Rosetta pocket_measure: pocket volume, depth, opening distance, and 2D slice shape descriptors (20 features). (3) Pocket lining: counts and volumes of lining residues (backbone/sidechain), hydrophobicity statistics (Eisenberg, Kyte-Doolittle), occluding volumes, and pocket wall SASA (16 features). (4) Electrostatics: theoretical titration curve moments (using bluues; generalized Born approach) for ionizable residues in first and second shells; deviations from Henderson–Hasselbalch behavior captured via moments of the derivative. (5) Coordination geometry features. In total, 391 features were available; 67 feature subsets (sizes 4–181) were evaluated during model selection.
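The difference between shell and sphere aggregation of per-residue energy terms can be shown in a few lines. This is a sketch under stated assumptions: residues are given as hypothetical (distance to metal, energy) pairs, bin boundaries are treated as lower-inclusive, and an empty bin yields a mean of 0.0. None of these conventions is specified in the text.

```python
from statistics import mean
from typing import List, Tuple

# Distance bins in angstroms, as listed in the text.
SHELLS = [(0.0, 3.5), (3.5, 5.0), (5.0, 7.5), (7.5, 9.0)]
SPHERES = [(0.0, 3.5), (0.0, 5.0), (0.0, 7.5), (0.0, 9.0)]


def aggregate(residues: List[Tuple[float, float]],
              bins: List[Tuple[float, float]],
              how: str) -> List[float]:
    """Aggregate one energy term over distance bins, using 'sum' or 'mean'."""
    out = []
    for lo, hi in bins:
        vals = [energy for dist, energy in residues if lo <= dist < hi]
        if how == "sum":
            out.append(sum(vals))
        else:
            out.append(mean(vals) if vals else 0.0)  # assumption: empty bin -> 0.0
    return out


# Made-up (distance, energy) values for five residues around a metal.
residues = [(2.1, -1.0), (2.3, -3.0), (4.0, 2.0), (6.0, 0.5), (8.0, -0.5)]
print(aggregate(residues, SHELLS, "sum"))    # [-4.0, 2.0, 0.5, -0.5]
print(aggregate(residues, SPHERES, "sum"))   # [-4.0, -2.0, -1.5, -2.0]
```

Shells isolate each distance band, whereas spheres are cumulative, and the innermost bin is identical under both schemes.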
Model training and selection: Fourteen scikit-learn classifiers were considered (SVMs with various kernels, decision-tree ensembles: extra trees, random forest, gradient boosting; naive Bayes; nearest neighbors; neural network; linear and quadratic discriminant analyses; logistic/ridge/passive-aggressive linear models). Nested cross-validation on the pre-2018 dataset employed inner CV for hyperparameter optimization using one of four scoring objectives (accuracy, precision, MCC, or a multi-score combination) and outer CV for unbiased performance estimation. In total, 3752 models (14 algorithms × 67 feature sets × 4 objectives) were trained. To avoid optimistic bias, outer CV was re-run restricting to the most frequently selected hyperparameters for each model; models with large deviations were discarded, yielding 1668 robust models. Performance was examined versus algorithm family, optimization metric, Rosetta aggregation method, feature exclusions, and set size. The top model, an Extra Trees classifier using all feature categories with mean-sphere Rosetta aggregations (termed MAHOMES), was selected based on MCC and precision and demonstrated stable hyperparameter selection across folds.
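The nested cross-validation scheme can be sketched with scikit-learn as follows. This is a minimal sketch, not the authors' code: the data are synthetic stand-ins, the hyperparameter grid and fold counts are arbitrary choices for illustration, and only the MCC objective is shown. The `quantile_range=(20.0, 80.0)` setting mirrors the scaling described in the Features paragraph.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (NOT the paper's features), roughly 24% positives.
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.76, 0.24], random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
pipeline = Pipeline([
    ("scale", RobustScaler(quantile_range=(20.0, 80.0))),
    ("clf", ExtraTreesClassifier(random_state=0)),
])

# Illustrative grid; the actual search space is not given in the text.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_features": ["sqrt", None],
}

mcc = make_scorer(matthews_corrcoef)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: hyperparameter search. Outer loop: performance estimate.
search = GridSearchCV(pipeline, param_grid, scoring=mcc, cv=inner_cv)
outer_scores = cross_val_score(search, X, y, scoring=mcc, cv=outer_cv)
print("outer-fold MCC:", outer_scores)
```

Because the search is re-run inside every outer fold, the outer scores are not biased by hyperparameter selection. The follow-up step in the text (re-running outer CV with the most frequently selected hyperparameters) would fix `param_grid` to a single setting.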
- Datasets: The final ML dataset comprised 3465 sites from 2626 structures (24% enzymatic). Manual spot-checking of labeling (50 enzymatic, 50 non-enzymatic) estimated 97% balanced accuracy; three non-enzymatic labels were actually catalytic. The temporal holdout T-metal-site test-set contained 520 sites from 404 structures (31% enzymatic). Covariate shift between dataset and test-set was assessed and not detected.
- MAHOMES performance: On the T-metal-site test-set, after the label corrections from manual curation, MAHOMES achieved 94.2% accuracy, 92.2% precision, and 90.1% recall. Outer CV metrics were similar, indicating good generalization.
- Error analysis and relabeling: Among 27 false positives and 17 false negatives on the original test-set, manual curation revealed 10 of the 27 false positives were actually true enzymatic sites missed by the labeling pipeline (often lacking M-CSA homologs/EC/“ase”; includes sonic hedgehog Zn sites now recognized as metalloproteases). Four false positives remained uncertain. Metrics were recalculated after correcting these labels.
- Feature importance: Despite pocket volume being the most quantitatively different feature between classes (39% similarity between enzymatic and non-enzymatic distributions), it ranked only 13th in importance. The top feature was an electrostatic descriptor: average second moment of the theoretical titration curve derivative for second-shell ionizable residues (3.5–9 Å from the metal). Additional important electrostatic features included the third and fourth moments (ranks 10 and 19). Pocket lining features (counts and total volumes of lining residues) occupied ranks 2–5, suggesting surface/lining characteristics are more discriminative than raw volume. Distance from active-site center to protein centroid ranked 9th despite moderate similarity (57%).
- Robustness to feature ablations: Excluding the electrostatics category increased outer CV errors by only six across 3465 predictions (3 enzymatic, 3 non-enzymatic). Excluding pocket lining improved true negatives by two but reduced true positives by fourteen, underscoring their importance for enzyme detection.
- Benchmarking against sequence-based EC predictors (on T-metal-seq, 400 sequences): DeepEC (accuracy 69.9%, precision 59.6%, recall 90.5%) over-predicted enzymes; DEEPre (accuracy 90.1%, precision 81.3%, recall 100.0%) achieved perfect recall with reduced precision; EFICAz2.5 (accuracy 90.8%, precision 88.4%, recall 90.0%) was strong overall. MAHOMES exceeded these in precision and accuracy while maintaining high recall for metalloproteins.
- Comparison to catalytic residue predictors: Using comparable recall metrics from the T-124 benchmark, catalytic residue predictors reported recalls of 48.8–62.2%, whereas MAHOMES attained 90.1% recall, correctly identifying about 1.4 times as many catalytic sites as the best of these methods (90.1 / 62.2 ≈ 1.45).
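The headline metrics in the list above follow from a confusion matrix. The counts below are not reported in the text. They are back-calculated under the assumption that the ten relabeled false positives became true positives and the four uncertain sites were dropped, leaving 516 of the 520 test sites. They are shown only because they reproduce the stated percentages.

```python
from math import sqrt

# ASSUMED counts, back-calculated from the reported percentages (not from the source).
tp, fp, fn, tn = 154, 13, 17, 332

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Matthews correlation coefficient, one of the model-selection objectives.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"accuracy={accuracy:.1%} precision={precision:.1%} recall={recall:.1%}")
# accuracy=94.2% precision=92.2% recall=90.1%
print(f"MCC={mcc:.3f}")
```

Precision penalizes false positives and recall penalizes false negatives. MCC uses all four cells, which makes it a reasonable selection criterion when only about a quarter to a third of sites are enzymatic.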
The study demonstrates that a structure-based, metal site–focused ML approach can effectively differentiate enzymatic from non-enzymatic metalloprotein sites, addressing the limitations of both sequence-based EC predictors (which may overlook active-site context) and residue-level catalytic site predictors (which suffer from severe class imbalance and may conflate pocket detection with catalysis). Key determinants are electrostatic properties beyond the first coordination shell, captured via moments of theoretical titration curves, and characteristics of the pocket lining (counts and volumes of lining residues), which together outperform pocket volume alone. The prominence of second-shell electrostatics aligns with theories of coupled networks and promoting motions in catalysis and suggests that catalytic environments are distinguished by extended electrostatic perturbations rather than solely by first-shell coordination. The finding that pocket volume differs strongly between classes yet contributes less to classification underscores that more nuanced features (e.g., lining flexibility/surface area proxies and centroid proximity) are more predictive of catalytic function. MAHOMES' superior precision and competitive recall versus EC predictors, achieved with far less training data, highlight the value of structural physicochemical features. The model's ability to expose labeling errors (identifying true catalytic metal sites mislabeled by the homology-based pipeline) further supports its utility. Potential applications include prioritizing de novo metalloenzyme designs and detecting loss-of-function mutations affecting catalysis, especially where homology-based approaches are inadequate.
This work introduces MAHOMES, a decision-tree ensemble model that distinguishes enzymatic from non-enzymatic metal-binding sites with high precision and recall, trained on the largest curated structural dataset of metalloprotein sites to date. By leveraging structural physicochemical features, particularly second-shell electrostatic descriptors and pocket lining characteristics, MAHOMES outperforms state-of-the-art sequence-based enzyme function predictors in precision and accuracy and compares favorably to catalytic residue predictors in recall. The analysis reveals that while pocket volume differs strongly between classes, features reflecting electrostatic perturbations and pocket lining are more informative for classification. The approach enables correction of mislabeled data and provides actionable insight into determinants of metalloenzyme activity. Future directions include refining electrostatic moment usage (investigating why second moments outperform higher moments here), disentangling the relative contributions of pocket surface area versus flexibility/lining as proxies for catalytic efficiency, expanding to broader metal sets and multi-chain sites, and integrating the method into protein design and catalytic-site prediction pipelines as more structural data become available.
- Labeling relies on homology to M-CSA and heuristics (EC number or “ase” annotations), which led to some misclassifications detected upon manual curation (e.g., sonic hedgehog Zn sites). Ambiguously annotated sites were discarded, potentially reducing coverage.
- The method requires high-quality 3D structures; structural data availability and preprocessing filters (e.g., removal of multichain sites, loosely bound metals moving >3 Å upon relaxation) limit applicability and may exclude genuine biological sites.
- Metal identity was not used as a feature to avoid bias, which could limit discrimination in cases where metal-specific chemistry is informative.
- Direct benchmarking against catalytic residue predictors was limited by lack of accessible implementations; comparisons rely on published recalls from different test-sets.
- The approach focuses on metalloproteins and does not generalize to non-metal enzymes; coordination environments requiring cofactors (e.g., heme) or obligate oligomers were excluded during curation.
- Potential covariate shifts over time were assessed and not detected, but residual dataset biases cannot be fully excluded.