Chemistry

Developing a machine learning model for accurate nucleoside hydrogels prediction based on descriptors

W. Li, Y. Wen, et al.

Weiqi Li, Yinghui Wen, Kaichao Wang, Zihan Ding, Lingfeng Wang, Qianming Chen, Liang Xie, Hao Xu, and Hang Zhao develop predictive machine learning models for hydrogel-forming nucleoside derivatives. The best model reached 71% accuracy and led to the discovery of two novel cation-independent nucleoside hydrogels, one of which was used to detect Ag+ and cysteine.

Introduction

The study addresses the longstanding challenge of predicting whether nucleoside derivatives will form supramolecular hydrogels in aqueous environments. Although nucleoside-based hydrogels exhibit excellent biocompatibility and have been explored for applications such as drug delivery, biosensing, and tissue engineering, the discovery of gelators has largely been serendipitous due to limited understanding of structure–property relationships. The authors propose that machine learning (ML), by learning from high-dimensional molecular descriptors, can capture the complex relationships governing hydrogel formation in nucleoside derivatives. The objective is to construct, optimize, and validate an ML model capable of accurately predicting hydrogel-forming ability, thereby accelerating the discovery of new gelators and enabling rational design. The significance lies in providing a systematic, data-driven approach for nucleoside hydrogel prediction and uncovering new cation-independent gel systems with potential practical applications.

Literature Review

The paper situates its work within a century of research on nucleoside-based gels, beginning with early observations that guanylic acid could form gels. Subsequent advances demonstrated stable guanosine-based hydrogels often requiring metal cations (e.g., K+, borate systems) and their utility in controlled release, wound healing, cancer therapy, and biosensing. Despite practical successes, a predictive framework has been lacking, and gelators are frequently identified through trial-and-error or modification of known systems. In parallel, ML has shown promise in related gelation problems: models have been developed for dipeptide and peptide-like gelators using algorithms such as random forest, gradient boosting, and logistic regression, and descriptor-based approaches have been explored for supramolecular gelation prediction. However, complexities of nucleoside self-assembly (e.g., G-quartet vs G-ribbon formation, cation dependence) have impeded the development of accurate predictors specifically for nucleoside-derived hydrogels. This gap motivates the current work.

Methodology

Dataset curation: A systematic literature review (MeSH-guided searches in Medline, Web of Science, and SciFinder) identified 882 articles. After screening and applying inclusion criteria (clear gelator/non-gelator definition via tube-inversion or rheology; exact chemical structure; aqueous or pure water solvent; nucleosides and derivatives only), 18 articles with 71 unique nucleoside derivatives remained (gelators n=38, non-gelators n=33). Structures were redrawn in ChemDraw, converted to SMILES, and used to compute molecular descriptors.

Descriptor computation and filtering: Using alvaDesc (via alvaDescCLIWrapper), 5666 descriptors were calculated; 1491 with missing values were removed, yielding 4175 descriptors for analysis.

Feature selection (three-step): (1) A univariate rank-sum test identified 144 descriptors with significant differences between gelators and non-gelators (P<0.05). (2) Spearman correlation filtering removed collinear pairs with |ρ|>0.8, leaving 40 descriptors. (3) Recursive feature elimination (RFE) was applied with four classifiers, logistic regression (LR), decision tree (DT), random forest (RF), and XGBoost, to determine optimal subsets: XGBoost n=16, LR n=24, DT n=30, RF n=37.

Modeling and validation: Models (LR, RF, XGBoost, DT; implemented in scikit-learn) were trained and evaluated using stratified 5-fold cross-validation, repeated 10 times. Hyperparameters were optimized via Bayesian optimization (Optuna). The main metrics were test accuracy and AUC; precision, recall, and F1 were auxiliary metrics. For the sensitivity analysis, an independent test set was created by cluster-stratified sampling: 80% of the compounds (n=56) were used for training with 5-fold CV, and 20% (n=15) were held out to assess generalization.

Feature importance: LASSO regression across the 4175 descriptors identified 70 non-zero features. For the final LR-RFE model, the regression coefficients of the 24 descriptors served as importance scores. Permutation feature importance (PFI; 1000 permutations) quantified the mean accuracy decrease per descriptor. The highest-importance descriptors related to hydrogen bonding, polarity, and lipophilicity.

External application and experimental validation: Using PubChem3D, 11,406 nucleoside-like structures were retrieved based on 3D similarity (shape-Tanimoto ST≥0.80, color-Tanimoto CT≥0.50) to five base nucleosides; after deduplication, 7,257 remained. The optimal LR model (24 descriptors) scored these candidates, and predictions were ranked by probability. Twelve compounds from the top 10% and twelve from the bottom 10% were selected, taking availability and cost into account. Hydrogel formation was tested via tube-inversion under various aqueous conditions (including Tris/H3BO3, NaB(OH)4, KB(OH)4, NaCl, KCl, AgNO3). For newly identified hydrogels, rheology (frequency sweep, self-healing), SEM, AFM, VT-SAXS, PXRD, fluorescence assays (ThT, ARS), and NMR (1H, 11B, NOE, titrations) were performed to elucidate structure and mechanism. A single-crystal X-ray structure was obtained for 8-aminoguanosine (compound 6), and DFT calculations assessed conformational preferences and G-ribbon vs G-quartet energetics.

Experimental conditions: Typical nucleoside concentrations were 50–100 mM; salts (NaCl, KCl, AgNO3) were used at 0.2 M; borate species (H3BO3, NaB(OH)4, KB(OH)4) and Tris were used at 0.5 molar equivalents relative to the nucleoside. Rheology used plate-plate geometry (25 mm, 1 mm gap) with time, frequency, and strain sweeps; microscopy applied standard SEM and AFM sample preparation; VT-SAXS recorded temperature-dependent nanostructure evolution; CD/UV spectra probed chiroptical features and g-factors; fluorescence with Rho123 assessed dye–hydrogel interactions; ion response tests screened Li+, Na+, K+, Cs+, Rb+, Ag+, Ca2+, Mg2+, Ba2+, Zn2+, Cu2+, Cr3+, Al3+.

Code and data: All code (GitHub/Zenodo), datasets (including the 71-compound set and the external screening set), and crystallographic data (CCDC 2253566) are publicly available.
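
To make the second feature-selection step more concrete, the following is a minimal sketch of a Spearman collinearity filter with the |ρ|>0.8 cutoff quoted above. It is not the authors' code. The summary does not say which member of a collinear pair was discarded, so the greedy rule used here (keep the descriptor that comes first in a priority ordering) is an assumption, and the descriptor names and values are invented toy data.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant descriptor carries no rank information
    return cov / sqrt(var_x * var_y)


def drop_collinear(descriptors, threshold=0.8):
    """Greedily keep descriptors whose |rho| with every kept one is <= threshold.

    `descriptors` maps name -> list of values and is assumed to be ordered by
    priority (hypothetical choice: most significant descriptor first).
    """
    kept = []
    for name, values in descriptors.items():
        if all(abs(spearman_rho(values, descriptors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data, invented for illustration only (not taken from the study).
toy = {
    "h_bond_donors": [1, 2, 3, 4, 5, 6],
    "polar_surface": [10, 19, 33, 41, 48, 65],  # monotone with the first, rho = 1
    "log_p": [0.5, -1.2, 2.0, 0.1, -0.4, 1.1],
}
selected = drop_collinear(toy)
print(selected)  # ['h_bond_donors', 'log_p']
```

In this toy case `polar_surface` is dropped because it is perfectly rank-correlated with `h_bond_donors`, while `log_p` (ρ ≈ 0.14 against `h_bond_donors`) is kept. With a different priority ordering the surviving member of a collinear pair would change, which is why the ordering rule matters in practice.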

Key Findings

- Optimal model: Logistic regression with 24 RFE-selected descriptors achieved a test accuracy of 0.71 (95% CI 0.69–0.73) and AUC 0.84±0.02. It also yielded high recall (0.95±0.01) and F1 score (0.78±0.01).
- Sensitivity analysis (independent test set, n=15): LR-RFE achieved accuracy 0.67 and AUC 0.81. On the training split, 5-fold CV gave validation accuracy 0.70±0.02 and AUC 0.84±0.02, consistent with the full cross-validation results.
- Feature importance: Among the 24 descriptors, four had regression coefficients >0.1 and related to hydrogen bonding capacity, polarity, and lipophilicity, properties consistent with known gelator behavior.
- External validation: From 7,257 PubChem-derived nucleoside-like molecules, 24 were experimentally tested (12 high-probability, 12 low-probability). Of these, 20/24 were correctly predicted (83.33% accuracy). Among the top 12, 10 formed hydrogels (83.33%); among the bottom 12, 10 did not (83.33%). Eight of the gelators (compounds 1 and 6–12, that is, all gelators other than 3 and 4) had not been previously reported.
- Discovery of cation-independent hydrogels: Two gelators, 8-aminoguanosine (6) and 8-hydroxyguanosine (8), formed stable, long-lived hydrogels (8AG-T and 8OHG-T) in Tris/H3BO3 without added cations, maintaining stability for 6 months and exhibiting excellent self-healing.
- Mechanical and structural properties: 8AG-T and 8OHG-T exhibited G′>G″ over frequency sweeps, higher G′ than their Na+/K+ counterparts, and robust self-healing under cyclic strain. SEM/AFM revealed porous networks comprising intertwined nanofibers (8AG-T) or rod-like structures (8OHG-T). VT-SAXS: 8AG-T nanowire diameter ~12 nm at 25 °C; 8OHG-T rod diameter ~23 nm at 25 °C, transitioning to ~2 nm nanowires at 85 °C. PXRD showed peaks at 2θ≈28° (d≈3.3 Å), indicating π–π stacking.
- Mechanistic insights: Borate diester formation (confirmed by 11B NMR and ARS assays) under Tris/H3BO3 promotes cation-independent gelation. ThT and CD/UV data suggest absence of G-quartets and formation of G-ribbons. Single-crystal data for 8-aminoguanosine (6) revealed an anti glycosidic preference with six intramolecular H-bonds; DFT favored anti over syn and G-ribbon over G-quartet (ΔE≈+6.2 kcal/mol for quartet vs ribbon), supporting a G-ribbon assembly mechanism.
- Application: 8OHG-T hydrogel enabled rapid visual detection of Ag+ (collapse and fluorescence restoration of Rho123 at 10 μM within 10 min) and subsequent cysteine detection (re-quenching at 10 μM within 10 min), suggesting utility in simple, portable sensing.
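
The external-validation percentages above follow directly from the reported counts. As a small worked check (an illustrative sketch, not part of the study's code), the 24 tested compounds can be written as a confusion matrix, treating "top 10%" as a predicted gelator and "bottom 10%" as a predicted non-gelator; the function name `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts reported for the 24 experimentally tested compounds.
tp, fp = 10, 2  # top 12: 10 formed hydrogels, 2 did not
tn, fn = 10, 2  # bottom 12: 10 did not gel, 2 did

metrics = classification_metrics(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8333
# precision: 0.8333
# recall: 0.8333
# f1: 0.8333
```

Because the errors are symmetric (two false positives and two false negatives), all four metrics coincide at 20/24 for this set. Note that the 24 compounds were drawn from the extremes of the ranking, so these figures describe performance on high-confidence predictions rather than on the full candidate pool.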

Discussion

The study demonstrates that a carefully engineered ML workflow (large-scale descriptor computation, stringent three-step feature selection, and Bayesian hyperparameter optimization) can yield a useful predictor of nucleoside hydrogelation despite small datasets and complex self-assembly mechanisms. The logistic regression model with 24 descriptors performed best among the four classifiers tested (accuracy 0.71, AUC 0.84) and generalizes to an independent test split and an external screening set, where it guided the successful discovery of multiple new gelators. The identification of two cation-independent hydrogels (8AG-T and 8OHG-T) is particularly significant, addressing the practical constraint that many guanosine-based hydrogels require metal cations for stability. Mechanistic studies support a borate-diester-mediated, G-ribbon–based self-assembly pathway leading to robust, self-healing networks. The functional demonstration in Ag+ and cysteine detection highlights how the new gel systems can be harnessed for rapid, instrument-light sensing. Collectively, the findings validate ML as a valuable tool for accelerating nucleoside hydrogel discovery and provide mechanistic insight that may guide rational design toward application-relevant properties.

Conclusion

An optimal ML model (logistic regression with 24 RFE-selected descriptors) was developed to predict nucleoside hydrogel-forming ability with 71% accuracy (95% CI 0.69–0.73) and AUC ~0.84. External validation on 24 predicted candidates achieved 83.33% accuracy and led to the discovery of two rare cation-independent hydrogels (8AG-T and 8OHG-T). Mechanistic studies revealed that dynamic borate diesters and G-ribbon assembly, rather than G-quartets, underpin their robust, self-healing networks. The 8OHG-T hydrogel enabled rapid visual detection of Ag+ and cysteine, illustrating application potential. Future work should expand and diversify the training dataset, refine and interpret descriptor sets, incorporate broader solvent and condition spaces, and explore transferability to other supramolecular systems to further improve predictive power and applicability.

Limitations

- Dataset size and diversity: Only 71 labeled nucleoside derivatives were available, limiting model capacity and generalizability; non-gelator cases are underreported, potentially biasing the dataset.
- Cross-validation constraints: Although a sensitivity analysis with a held-out test split was performed, reliance on cross-validation can still overestimate performance relative to large independent benchmarks.
- Descriptor interpretability: Many of the most predictive descriptors are complex and not straightforward to interpret chemically, constraining mechanistic insight.
- Condition specificity: Labels reflect gelation in pure water or aqueous solutions under particular conditions; predictions may not transfer to other solvents or experimental setups without retraining.
- External validation scope: Experimental verification covered 24 candidates; broader validation across chemical space and conditions would further assess robustness.
