Introduction
The enzyme turnover number (kcat), the maximal number of substrate molecules an enzyme converts to product per active site per unit time, is vital for understanding cellular physiology and resource allocation. Experimental determination of kcat is laborious and expensive, so kcat values remain unknown for the vast majority of enzymatic reactions. This scarcity hinders accurate simulations of cellular processes with metabolic models, which rely on kcat values to predict enzyme production costs and cellular resource allocation. Existing computational approaches, such as those combining flux balance analysis with proteomic data, are often limited to well-studied organisms like *E. coli*, and even there provide estimates for only a small fraction of reactions. Previous machine learning models for kcat prediction either rely on organism-specific data or generalize poorly to enzymes dissimilar to those in their training sets. This study aims to overcome these limitations by developing a generalizable, organism-independent model that predicts kcat for wild-type enzymes.
Literature Review
Prior work on kcat prediction has explored various approaches. Heckmann et al. (2018) developed a model for *E. coli* enzymes using detailed features such as active site properties, metabolite concentrations, and reaction fluxes. However, because the model depends on such comprehensive data, it cannot be applied where those data are unavailable. Li et al. (2022) presented DLKcat, a deep learning model that uses enzyme amino acid sequences and substrate information. Although it can in principle be applied to any enzyme, DLKcat's predictive power diminishes significantly for enzymes dissimilar to those in its training dataset. These models highlight the need for a more robust and generalizable approach to kcat prediction that handles a broader range of enzymes and reactions.
Methodology
The researchers developed TurNuP, a machine learning model for predicting kcat. The model utilizes two key input representations:
1. **Enzyme Representation:** Employing a fine-tuned state-of-the-art Transformer Network (ESM-1b) trained on millions of protein sequences to generate a numerical representation of each enzyme from its amino acid sequence.
2. **Reaction Representation:** Utilizing differential reaction fingerprints (DRFPs) to numerically encode the entire chemical reaction, including all substrates and products, ensuring a consistent representation regardless of the number of reactants. The DRFPs are directly calculated from the reaction's substructures using hash functions, thereby capturing more holistic reaction information than previous methods.
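The idea behind a hashed, fixed-length reaction fingerprint can be illustrated with a minimal sketch. This is not the DRFP implementation: it assumes character n-grams of SMILES strings as a crude stand-in for chemically meaningful substructures, and the function names `ngrams` and `toy_reaction_fingerprint` are hypothetical. It keeps only the fragments that occur on one side of the reaction (the symmetric difference), then hashes them into a bit vector whose length does not depend on the number of reactants.

```python
import hashlib


def ngrams(smiles: str, max_len: int = 3) -> set:
    """Toy substitute for substructure extraction: all substrings up to max_len."""
    return {
        smiles[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(smiles) - n + 1)
    }


def toy_reaction_fingerprint(substrates, products, n_bits: int = 2048) -> list:
    """Hash the fragments that differ between both sides into a fixed-length bit vector."""
    left = set().union(*(ngrams(s) for s in substrates))
    right = set().union(*(ngrams(p) for p in products))
    changed = left ^ right  # fragments present on exactly one side of the reaction
    bits = [0] * n_bits
    for fragment in changed:
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % n_bits] = 1
    return bits


# Illustrative SMILES strings only; not a balanced biochemical reaction.
fingerprint = toy_reaction_fingerprint(["CCO", "O=O"], ["CC=O", "OO"])
```

Because the fragments are pooled into sets before hashing, the result is independent of the order and number of substrates and products, which is the property the text emphasizes.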
The enzyme and reaction vectors are concatenated and then fed into a gradient boosting model for training. Three types of reaction fingerprints (structural, difference, and differential) were evaluated, with DRFPs showing superior performance. The dataset for training and testing was compiled from the BRENDA, UniProt, and SABIO-RK databases and preprocessed to include only wild-type enzymes and natural reactions, with outliers and redundant entries removed. The dataset was split into training (80%) and test (20%) sets such that no enzyme sequence appeared in both. Hyperparameters were optimized by five-fold cross-validation with a random grid search. For comparison, several other machine learning models (linear regression, random forest, and a neural network) were trained, but the gradient boosting model consistently outperformed them. The impact of additional features, such as Michaelis constants (Km) and reaction fluxes, was also investigated. A web server was developed for easy access to the TurNuP model.
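A sequence-disjoint split of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout (dicts with a `sequence` key), the function names, and the choice to hold out roughly 20% of unique sequences (rather than exactly 20% of data points) are all hypothetical.

```python
import random


def split_by_enzyme(records, test_fraction=0.2, seed=0):
    """Split records so that every enzyme sequence lands in exactly one subset.

    `records` is assumed to be a list of dicts with a 'sequence' key.
    The fraction applies to unique sequences, not to individual data points.
    """
    sequences = sorted({r["sequence"] for r in records})
    random.Random(seed).shuffle(sequences)
    n_test = max(1, round(test_fraction * len(sequences)))
    test_sequences = set(sequences[:n_test])
    train = [r for r in records if r["sequence"] not in test_sequences]
    test = [r for r in records if r["sequence"] in test_sequences]
    return train, test


def build_input(enzyme_vector, reaction_fingerprint):
    """Concatenate the enzyme and reaction representations into one feature vector."""
    return list(enzyme_vector) + list(reaction_fingerprint)
```

Splitting on sequences rather than on individual measurements prevents the same enzyme from contributing to both training and evaluation, which would otherwise inflate the apparent test performance.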
Key Findings
TurNuP significantly outperforms previous kcat prediction models, exhibiting superior generalization capabilities. Key findings include:
* **Improved Accuracy:** TurNuP achieves a coefficient of determination (R²) of 0.44 on the test set, substantially higher than previous models. The mean absolute deviation of predicted kcat values from experimental measurements is 0.69 on a log₁₀ scale (4.8-fold deviation). Although the model tends to overestimate lower kcat values, this might be attributed to regression dilution due to noisy data and variations in experimental conditions across different studies.
* **Generalizability:** TurNuP shows good performance even for enzymes with low sequence identity (<40%) to those in the training set, a significant advantage over previous models which struggle with dissimilar enzymes. The model also performs well on reactions not seen during training, with predictive power correlated with the reaction similarity to the training reactions (Figure 6).
* **Superiority over DLKcat:** A direct comparison with DLKcat reveals that TurNuP consistently outperforms DLKcat across various levels of enzyme sequence similarity to the training data. The improved performance of TurNuP is statistically significant for all the subsets of the test set considered (Figure 5). The superior performance is attributed to the use of state-of-the-art enzyme representations (ESM-1b vectors) and the consideration of the whole chemical reaction in the input vector, as opposed to only one substrate in DLKcat.
* **Improved Metabolic Model Predictions:** Integrating TurNuP-predicted kcat values into enzyme-constrained genome-scale metabolic models leads to a significant improvement in proteome allocation predictions for multiple yeast species and diverse growth conditions (Figure 5b). This indicates TurNuP's practical utility in refining metabolic modeling.
* **Additional Input Features:** Adding further input features such as Km and reaction fluxes did not noticeably improve model performance, possibly because they are redundant with information already captured by the existing features. This suggests that the model extracts the relevant information from its two fundamental inputs, the enzyme sequence and the reaction, without needing additional data.
* **Web Server:** A user-friendly web server was implemented, providing easy access to the TurNuP model, eliminating the need for programming skills or specialized software installation.
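The accuracy metrics quoted in the findings above can be computed as in the following sketch. The function names are hypothetical and the code is not taken from the study; it only shows the standard definitions of R² and of the mean absolute error on a log₁₀ scale, and how a log₁₀ error translates into a fold deviation.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_abs_log10_error(kcat_true, kcat_pred):
    """Mean absolute difference between log10-transformed measured and predicted kcat."""
    errors = [
        abs(math.log10(p) - math.log10(t))
        for t, p in zip(kcat_true, kcat_pred)
    ]
    return sum(errors) / len(errors)


def fold_deviation(log10_error):
    """Convert an error on the log10 scale into a multiplicative (fold) deviation."""
    return 10 ** log10_error
```

An error of 1.0 on the log₁₀ scale corresponds to a tenfold deviation. Because reported values are rounded, converting a rounded log₁₀ error may differ in the last digit from a fold value computed before rounding.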
Discussion
TurNuP addresses a critical gap in predicting kcat values for a wide range of enzymes. Its superior performance and generalizability stem from the combined use of advanced enzyme and reaction representations, which capture information about both enzyme characteristics and reaction properties. The improved predictions obtained by incorporating TurNuP into genome-scale metabolic models demonstrate its practical impact on systems biology research. The observed overestimation of lower kcat values, likely due to data noise and variations in experimental conditions, highlights limitations of current datasets and the potential for improvement with more standardized and comprehensive experimental data. Although additional input features did not improve performance in this study, the findings emphasize the power of well-designed core features (the enzyme sequence and the complete reaction information) in obtaining robust and generalizable predictions.
Conclusion
TurNuP represents a significant advance in kcat prediction, offering a robust and generalizable model for predicting turnover numbers of enzymes. Its superior performance compared to existing methods, coupled with its easy accessibility via a web server, makes it a valuable tool for researchers in biochemistry, systems biology, and metabolic engineering. Future research could focus on improving data quality, incorporating experimental conditions into the model, and exploring the integration of TurNuP with other predictive models to achieve even more comprehensive predictions of enzyme kinetics.
Limitations
The accuracy of TurNuP is limited by the quality and availability of the training data. The dataset, while extensive, still represents a fraction of all known enzymatic reactions, and inconsistencies across different databases and experimental conditions introduce noise that limits predictive accuracy. Further limitations include the model's current inability to handle non-natural enzyme variants or reactions. While the model includes enzyme sequence and reaction information as its input, it does not directly incorporate other experimental parameters that might significantly influence kcat, such as pH and temperature; however, the relatively high predictive accuracy despite this omission emphasizes the model's robustness.