Biology

Machine learning differentiates enzymatic and non-enzymatic metals in proteins

R. Feehan, M. W. Franklin, et al.

This study by Ryan Feehan, Meghan W. Franklin, and Joanna S. G. Slusky presents a machine learning model that distinguishes enzymatic from non-enzymatic metal-binding sites with high accuracy. Its findings on which site properties matter most may help in identifying new enzymatic mechanisms.

Introduction

Metalloenzymes, representing 40% of all enzymes, catalyze a wide range of biochemical reactions. However, distinguishing their active sites from inactive metal-binding sites is challenging due to physicochemical similarities. This difficulty hinders the identification of both naturally occurring and engineered enzymes. While several physicochemical properties have been linked to catalytic function (e.g., deviations in theoretical titration curves, location of catalytic residues within protein clefts, amino acid networks for energy transfer), their predictive power remains limited. Machine learning offers a powerful alternative, capable of generalizing trends from training data to predict characteristics in novel proteins. Existing machine learning approaches for enzyme prediction primarily fall into two categories: enzyme function predictors (using sequence data to predict Enzyme Commission (EC) numbers) and catalytic residue predictors (using structure data to identify catalytic residues). Sequence-based methods often overlook crucial active site information, while residue-based methods struggle with imbalanced datasets and low precision. This study focuses on metalloproteins, aiming to overcome these limitations by leveraging the distinct physicochemical properties of metal-binding sites and employing a machine learning approach to differentiate between enzymatic and non-enzymatic sites.

Literature Review

Several studies have explored the use of machine learning to predict enzyme function or identify catalytic residues. Enzyme function predictors often utilize sequence data to predict EC numbers, typically employing layered models to progressively refine predictions from broad enzyme classes to specific subclasses. However, these methods often lack the resolution to distinguish between native and mutated sequences with altered active sites. Alternatively, methods focused on catalytic residue prediction use structure-based features to identify specific residues responsible for catalysis. These methods often combine sequence and structural features, but suffer from imbalanced datasets (many more non-catalytic than catalytic residues) leading to low precision. Previous research has highlighted the importance of physicochemical properties such as electrostatic properties and pocket geometry in characterizing both enzymatic and non-enzymatic metal-binding sites. The abundance of metalloprotein structures in databases like RCSB PDB and MetalPDB provides a valuable opportunity to create a large and relatively balanced dataset for machine learning.
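The link between class imbalance and low precision can be made concrete with a short calculation. The sketch below applies Bayes' rule to a hypothetical classifier; the function name and all numbers are illustrative assumptions, not values from the study.

```python
def expected_precision(prevalence, recall, specificity):
    """Precision implied by Bayes' rule for a classifier with the given
    recall and specificity when a fraction `prevalence` of items is positive."""
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical numbers, not taken from the study: the same classifier
# applied to a rare-positive problem and to a more balanced one.
rare = expected_precision(prevalence=0.01, recall=0.9, specificity=0.9)
balanced = expected_precision(prevalence=0.25, recall=0.9, specificity=0.9)
print(f"rare positives: {rare:.3f}, balanced: {balanced:.3f}")
```

With identical recall and specificity, precision is about 0.08 when 1% of items are positive and 0.75 when 25% are. This is why a dataset built from whole metal-binding sites, rather than individual residues, is easier to classify precisely.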

Methodology

The researchers developed a homology-based pipeline to identify and label metalloprotein sites as enzymatic or non-enzymatic. The pipeline used the Mechanism and Catalytic Site Atlas (M-CSA) database and homology searches to find metalloproteins homologous to known enzymes. Structural homology and proximity to catalytic residues were then used to validate enzymatic labels, and sites lacking such evidence were classified as non-enzymatic. To reduce bias and overfitting, the pipeline also removed sites that were redundant in structure or sequence.

The resulting dataset consisted of 3465 metal-binding sites from 2626 proteins (24% enzymatic). A separate holdout test set contained 520 sites from 404 proteins (31% enzymatic). The holdout set was drawn from structures deposited in or after 2018, ensuring temporal independence from the training data. A covariate shift analysis confirmed no significant differences between the two datasets.

Feature calculations covered five categories: Rosetta energy terms (sum and mean values in shells and spheres), pocket geometry, pocket lining residues (hydrophobicity, volume), electrostatics (theoretical titration curves), and coordination geometry. A nested cross-validation strategy was used to optimize and select from 14 machine learning algorithms across 67 feature combinations. An extra-trees algorithm performed best and became the final model, dubbed MAHOMES.

The Jaccard index and kernel density estimation were used to quantify feature similarity between enzymatic and non-enzymatic sites. MAHOMES was evaluated using the Matthews Correlation Coefficient (MCC), precision, recall, and accuracy. Misclassified sites were also inspected manually to assess the accuracy of both the labeling pipeline and the model.
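To illustrate how a Jaccard-style overlap between two kernel density estimates can measure feature similarity, here is a minimal sketch. The summary does not say how the study combined the two techniques, so this is one plausible reading: the bandwidth, grid, function names, and sample values are all assumptions made for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples`
    using Gaussian kernels of the given bandwidth."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def density_jaccard(a, b, bandwidth=1.0, grid_points=200):
    """Jaccard-style similarity of two samples: area under the pointwise
    minimum of their density estimates divided by the area under the
    pointwise maximum, approximated on an evenly spaced grid."""
    lo = min(min(a), min(b)) - 3.0 * bandwidth
    hi = max(max(a), max(b)) + 3.0 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    f, g = gaussian_kde(a, bandwidth), gaussian_kde(b, bandwidth)
    intersection = union = 0.0
    for i in range(grid_points):
        x = lo + i * step
        fx, gx = f(x), g(x)
        intersection += min(fx, gx)
        union += max(fx, gx)
    return intersection / union


# Hypothetical feature values for two groups of sites (not study data).
enzymatic = [4.0, 5.0, 5.5, 6.0, 7.0]
non_enzymatic = [1.0, 2.0, 2.5, 3.0, 5.0]
similarity = density_jaccard(enzymatic, non_enzymatic)
print(f"similarity: {similarity:.3f}")
```

A value of 1 means the two distributions coincide and a value near 0 means they barely overlap, so lower scores flag features whose values differ most between the two groups.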

Key Findings

On the holdout test set (T-metal-site), MAHOMES reached 94.2% accuracy, 92.2% precision, and 90.1% recall. Manual inspection revealed that MAHOMES correctly identified some enzymatic sites mislabeled by the homology-based pipeline, particularly those lacking EC numbers or "ase" suffixes. Analysis of feature importance showed that electrostatic features (specifically, the average second moment of the theoretical titration curve derivative for second-shell residues) and pocket lining features (number and volume of lining amino acids) were most influential for classification, even though pocket volume was the most quantitatively different feature between enzymatic and non-enzymatic sites. A comparison with three enzyme function prediction tools (DeepEC, DEEPre, and EFICAz2.5) on a sequence-based test set (T-metal-seq) showed that MAHOMES outperforms these methods in accuracy and precision while exhibiting similar recall. Though less directly comparable due to imbalanced datasets, MAHOMES also outperformed catalytic residue predictors (PREvaIL, CRHunter, CRpred) in recall, suggesting a stronger ability to identify catalytic sites.
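For readers unfamiliar with the reported metrics, the sketch below computes accuracy, precision, recall, and MCC from a confusion matrix using their standard definitions. The counts are hypothetical and chosen for easy arithmetic; they are not the study's confusion matrix, and the function name is an assumption.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from counts of
    true positives, false positives, false negatives, and true negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, not the study's results.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=290)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a more informative single score when one class is much smaller than the other.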

Discussion

The superior performance of MAHOMES over existing methods highlights the effectiveness of using structure-based physicochemical features to identify enzymatic metal-binding sites. The model's prioritization of electrostatic and pocket lining features over pocket volume suggests that subtle differences in these properties are crucial for distinguishing enzymatic activity. These features are consistent with previous findings linking electrostatic interactions and amino acid networks to enzyme catalysis. The discrepancy in feature importance between volume and other features could be attributed to the dynamic nature of enzyme active sites and the limitations of static volume measurements. The comparison against enzyme function predictors demonstrates the advantage of using structural information, which can capture subtle active site variations overlooked by sequence-based methods. The superior performance of MAHOMES, despite using substantially less training data, underscores the power of structural features in predicting enzymatic activity.

Conclusion

This study presents MAHOMES, a highly accurate machine learning model for identifying enzymatic metal-binding sites in proteins. MAHOMES leverages structure-based physicochemical features, outperforming existing methods in both accuracy and precision. The model's findings regarding feature importance shed light on the critical role of electrostatic interactions and pocket lining in enzymatic activity. Future work should focus on expanding the dataset, incorporating additional features, and exploring applications in de novo enzyme design and mutation analysis.

Limitations

The study's reliance on the M-CSA database for enzymatic labeling introduces a potential bias, as the database itself is not exhaustive. The manual curation of the M-CSA data could also introduce subjectivity. Moreover, the model's performance might be affected by the quality of protein structures in the PDB, and the assumption that every identified metal-binding site is either purely enzymatic or purely non-enzymatic might oversimplify biological reality. Accounting for other factors, such as dynamic protein flexibility, might improve the model’s predictive capabilities.
