Biology

Generative machine learning produces kinetic models that accurately characterize intracellular metabolic states

S. Choudhury, B. Narayanan, et al.

Unlocking the secrets of metabolic states in *Escherichia coli* just got easier! Researchers Subham Choudhury, Bharath Narayanan, Michael Moret, Vassily Hatzimanikatis, and Ljubisa Miskovic introduce RENAISSANCE, a groundbreaking machine learning framework that harnesses the power of omics data to accurately characterize metabolic processes. This study offers a powerful new tool for those exploring metabolic variations in health and biotechnology.... show more

Introduction

The integration of heterogeneous high-throughput omics data to derive coherent insights into cellular metabolism remains a central challenge. Constraint-based genome-scale models have enabled data integration by enforcing physicochemical and genetic constraints, such as thermodynamic inequalities linking fluxes and metabolite levels. However, these approaches often produce broad uncertainty in intracellular metabolic states, complicating precise determination of metabolite concentrations and reaction rates. Kinetic models promise to alleviate these issues by incorporating enzyme kinetics and regulatory mechanisms to explicitly couple metabolite concentrations, fluxes, and enzyme abundances, and to capture time-dependent metabolic responses relevant to biomedical and biotechnological phenomena. Despite their advantages, kinetic models are underutilized due to the difficulty of determining in vivo kinetic parameters and the substantial computational and expert effort required, limiting their scalability across conditions and cohorts. Recent parameterization and machine learning efforts have improved efficiency but still face long runtimes and dependencies on training data from traditional kinetic modeling. To address these limitations, the authors introduce RENAISSANCE (REconstruction of dyNAmic models through Stratified Sampling using Artificial Neural networks and Concepts of Evolution strategies), a generative machine learning framework that efficiently parameterizes kinetic models without requiring pre-generated training datasets, aiming to construct biologically relevant models whose dynamic properties match experimental observations.

Literature Review

Prior work in constraint-based modeling integrates omics data by imposing physicochemical constraints (for example, thermodynamics-based flux analysis) to link fluxes and metabolite concentrations, but leaves substantial uncertainty in intracellular states. Kinetic models offer a richer representation by including enzyme kinetics and regulation, enabling explicit coupling of metabolites, fluxes, and enzymes and capturing dynamic behavior relevant to disease, drug metabolism, and metabolic engineering. However, the lack of in vivo kinetic parameters and the complexity of parameter estimation have limited widespread adoption. Recent methods such as iSCHRUNK and REKINDLE leveraged machine learning to accelerate model generation; REKINDLE notably used GANs to greatly improve efficiency but required training data produced by traditional kinetic modeling, maintaining a data-generation bottleneck. The ORACLE framework previously introduced unbiased sampling of kinetic parameters around steady states to construct valid kinetic models. RENAISSANCE builds on these developments by employing natural evolution strategies to train generator networks using only a scoring function tied to biological dynamic constraints, removing the need for training datasets and enabling stratified sampling biased toward desired dynamic properties.

Methodology

Overview and inputs: RENAISSANCE parameterizes kinetic models to be consistent with experimentally observed steady states and dynamic timescales. The input is one or more steady-state profiles (metabolite concentrations and reaction fluxes) computed by integrating structural network information (stoichiometry, regulation, rate laws) with available data (metabolomics, fluxomics, thermodynamics, and optionally transcriptomics/proteomics) using thermodynamics-based flux balance analysis (pyTFA). For the Escherichia coli W3110 trpD9923 case study, 5,000 steady states were sampled; a single profile (index 1712) was used for model generation, and all 5,000 were used for intracellular state characterization.

Generator architecture and optimization: RENAISSANCE uses feed-forward neural networks (generators) that map multivariate Gaussian noise to kinetic parameter vectors consistent with the model structure. Generators are optimized via Natural Evolution Strategies (NES), iterating four steps: (I) initialize or recreate a population of generators by perturbing a parent generator’s weights; (II) sample parameter sets from each generator and instantiate kinetic models; (III) evaluate model dynamics via the Jacobian eigenvalues to determine validity and assign rewards; and (IV) aggregate normalized rewards to estimate a gradient-like update to the parent generator’s weights, then mutate to produce the next generation. The design objective is to maximize the incidence of valid models. When no valid models are initially produced, a sigmoidal reward term biases optimization toward generators producing models closer to the validity threshold.

Dynamic validity criterion: For each parameterized model, the Jacobian at the steady state is computed; local stability requires negative real parts for all eigenvalues. The dominant time constant is defined as −60/Re(λmax) (minutes). For the studied E. coli strain (doubling time 134 min), to ensure perturbations settle well before division, models must have dominant time constants faster than 24 min, corresponding to Re(λmax) ≤ −2.5. Parameter sets meeting this condition are labeled valid.

Rewards and fitness in NES: The primary objective function is the incidence I(G) of valid models generated by a generator G. The reward R(G) equals I(G) if I(G) > 0, otherwise a sigmoidal term r = 1 + 0.01/(1 + exp(λfastest − λpartition)) (with λpartition = −2.5) rewards generators producing faster (more negative) λmax. For large-scale intracellular state characterization, the reward r = 0.5 exp(−0.1 λmean) is used, where λmean is the mean of the 10 fastest λmax among 100 samples from a generator, favoring faster dynamics.

Hyperparameters and network details: NES hyperparameters tuned via grid search: population size n = 20, noise level σ = 10^2, learning rate α = 10^−3, learning rate decay d = 5%. Generator network: three dense layers with dropout (0.5): 256 units → 512 units → 1,024 units; total weights ~1,076,352; implemented in Python (v3.6) with TensorFlow (v2.3.0). Generated KM values were constrained to [1.3×10^−3, 20], consistent with BRENDA ranges.

E. coli model and data integration: The E. coli W3110 trpD9923 kinetic model (adapted from Narayanan et al.) includes 113 ODEs, 123 reactions, and initially 507 kinetic parameters (384 KM and 123 Vmax; elsewhere 502 parameters are referenced after tailoring). Core pathways modeled: glycolysis, PPP, TCA, anaplerotic reactions, shikimate pathway, glutamine synthesis, and a lumped growth reaction. Context-specific constraints were imposed using literature data: growth rate lower bound 0.26 h−1, anthranilate secretion rate 0.14 mmol gDW−1 h−1, glucose uptake adjusted accordingly; extracellular secretion bounds; intracellular metabolite concentrations constrained within twofold of reported values; thermodynamics constraints via group contribution methods to enforce the second law. 5,000 steady states consistent with these constraints were sampled using pyTFA. Transcriptomics/proteomics can be integrated via REMI and TEX-FBA, or enzyme capacity constraints v ≤ kcat·E; such data were not available and thus not used in this study.

Robustness tests and bioreactor simulations: Robustness was assessed by perturbing steady-state metabolite concentrations up to ±50% and simulating whether models return to steady state within specified timescales. Nonlinear dynamic bioreactor simulations (mimicking experimental conditions) were performed to compare time courses of biomass, glucose, and anthranilate with experimental data; 13 models closely matched experimental fermentations.

Integration of experimental kinetic parameters: Experimentally measured KM values were retrieved from BRENDA. For parameters with multiple measurements, the geometric mean KM,exp was used with ±20% error bounds; if only one value existed, the same ±20% was applied. Integrated KM values were sampled uniformly within KM,exp ±20% during training and generation. Analyses included local integration around aconitase (four KM values) and subsystem-level integration (randomly integrating 9 of 17 TCA KMs across 10 combinations), generating thousands of valid models to assess how integrated KMs constrain and improve estimates of non-integrated KMs and Vmax across the network. Dimensionality reduction (PCA followed by t-SNE with a polar transform) visualized steady-state landscapes and dynamic properties.

Key Findings

Efficient generation of valid kinetic models: Over 50 NES generations, the mean incidence of models satisfying the dynamic constraint (Re(λmax) ≤ −2.5; dominant time constant ≤ 24 min) converged to ~92%, with some repeats reaching 100% valid incidence.
Dynamics shift with optimization: Distributions of λmax shifted from slower (>-2.5) to faster (<-2.5) dynamics over generations, matching experimentally observed timescales.
Robustness to perturbations: For 1,000 valid models (from a top-performing generator), after ±50% perturbations in metabolite concentrations, biomass returned to steady state within 24 min for 100% of cases. Key metabolites NADH, ATP, and NADPH returned within 24 min for 99.9%, 99.9%, and 100% of models, respectively. Across all cytosolic metabolites, 75.4% of models returned within 24 min and 93.1% within 34 min.
Bioreactor validation: In dynamic bioreactor simulations, 13 models closely fit experimental time courses of biomass, glucose consumption (halted), and anthranilate production (saturating around 20 h), reproducing exponential and stationary phases without explicit training on fermentation curves.
Intracellular state characterization and uncertainty reduction: Among 5,000 steady-state profiles, 918 (18.4%) had λmax > −2.5 and are inconsistent with observed dynamics. Local neighborhoods in the reduced-state space (t-SNE) enriched for fast dynamics yielded additional sampled states with similarly fast dynamics, enabling exclusion of regions yielding slow dynamics and constraining feasible metabolite concentration ranges. For example, relevant dynamics were obtained only when 3-phosphoglyceric acid concentration was below ~0.002 mM; sampling with 30 metabolites constrained to non-supporting ranges predominantly produced models lacking relevant dynamics.
Propagation of integrated kinetic data: Integrating four aconitase KM values constrained local KM estimates (e.g., KACONTa) and propagated constraints to correlated Vmax and downstream KMs (e.g., ICDHyr and AKGDH), demonstrating network-wide effects from local data.
Subsystem-wide improvements from KM integration: Integrating subsets of experimental TCA KM values improved the overlap score (OS) between predicted and experimental ranges for 16 of 17 TCA KMs and increased the average subsystem OS. Similar improvements were observed across other subsystems (PPP, glycolysis, anaplerotic reactions, shikimate, pyruvate metabolism). Outside TCA, 85 of 91 KMs improved upon TCA KM integration, with the largest gains often in the shikimate pathway, glycolysis, and anaplerotic reactions.
Global parameter estimate shifts: PCA of 276 experimentally unverified KMs showed notable shifts and increased similarity of predicted parameter distributions when experimental KM data were integrated, indicating reduced uncertainty and more coherent estimates.
Computational efficiency: RENAISSANCE trains generators on a standard workstation in ~3–20 minutes and can sample ~1 million models in 15–20 seconds, achieving orders-of-magnitude speedups over traditional sampling-based kinetic modeling.

Discussion

The study demonstrates that RENAISSANCE efficiently parameterizes large-scale kinetic models without requiring training datasets from traditional modeling, directly addressing a major bottleneck in kinetic model construction. By optimizing generator networks with NES toward biologically relevant dynamic timescales, the approach yields high incidences of valid models whose dynamics align with experimental observations and are robust to substantial perturbations. The framework also enables dynamic characterization of intracellular states, facilitating the exclusion of states that cannot reproduce observed timescales and constraining feasible ranges of specific metabolite concentrations. Integrating sparse experimental kinetic data (KM) improves estimates for numerous non-integrated parameters within and beyond the subsystem of integration, propagating constraints across the network and reducing uncertainty. Compared with previous machine-learning-based approaches (e.g., GAN-based REKINDLE), RENAISSANCE preserves generation efficiency while removing the dependency on curated training datasets. Its flexibility allows targeting diverse design objectives and incorporating heterogeneous data types, suggesting broad applicability in biomedical and biotechnological settings where accurate dynamic metabolic descriptions are needed.

Conclusion

RENAISSANCE is a fast, generative machine learning framework that parameterizes biologically relevant kinetic models by aligning model dynamics with experimental constraints, without requiring prior training datasets. Applied to E. coli W3110 trpD9923, it generated large populations of valid, robust models, accurately reproduced bioreactor behaviors, and enabled dynamic-based reduction of uncertainty in intracellular states. The framework effectively integrates sparse experimental kinetic data to refine and reconcile unknown parameters across the network, reducing uncertainty and improving prediction accuracy. With training times of minutes and rapid model generation, RENAISSANCE substantially advances the scalability of kinetic modeling. Future work can leverage the framework’s agnosticism to parameter types and objectives to incorporate additional data (e.g., transcriptomics, proteomics, kcat), enforce alternative biological design criteria (e.g., knockout phenotypes, drug response time series), and extend to other organisms and conditions to further enhance predictive capabilities.

Limitations

Transcriptomics and proteomics datasets were not available for integration in the present study, limiting multi-omics constraints to metabolomics, fluxomics, and thermodynamics at steady state. Many kinetic parameters lacked direct experimental measurements (276 of 384 KM values), necessitating estimation and reconciliation within the framework. Validation focused on an E. coli W3110 trpD9923 context and primarily on matching linearized dynamic timescales around steady states.

Related Publications

Explore these studies to deepen your understanding of the subject.

Computer Science

The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications

S. H. Snyder, P. A. Vignaux, et al.

Medicine and Health

Permutation-based identification of important biomarkers for complex diseases via machine learning models

X. Mi, B. Zou, et al.

Earth Sciences

Training machine learning models on climate model output yields skillful interpretable seasonal precipitation forecasts

P. B. Gibson, W. E. Chapman, et al.

Psychology

Building machine learning prediction models for well-being using predictors from the exposome and genome in a population cohort

D. H. M. Pelt, P. C. Habets, et al.

Listen, Learn & Level Up

Over 10,000 hours of research content in 25+ fields, available in 12+ languages.

No more digging through PDFs, just hit play and absorb the world's latest research in your language, on your time.

listen to research audio papers with researchbunny