Transferring chemical and energetic knowledge between molecular systems with machine learning

S. Heydari, S. Raniolo, et al.

This summary covers a machine learning methodology for transferring knowledge between molecular systems, developed by Sajjad Heydari, Stefano Raniolo, Lorenzo Livi, and Vittorio Limongelli. The approach classifies conformations as high or low free-energy and reaches an AUC of up to 0.92 when applied zero-shot to a larger target system.

Introduction

Molecular simulations are fundamental in applied sciences to predict system properties, but sampling limitations hinder their application to large biomolecules and long timescales. While machine learning (ML) has been applied to cheminformatics and, more recently, to molecular simulations for predicting atomistic properties and identifying free-energy states and slow degrees of freedom, ML applications remain constrained by data scarcity and the difficulty of adequately sampling phase space. Enhanced sampling and coarse-grained representations have improved sampling but are still often insufficient for complex systems. An appealing strategy is transfer learning—leveraging knowledge from simpler, computationally tractable systems to predict properties of more complex targets. Here, the authors propose a methodology to learn free-energy-related information on a given molecular system and transfer it to a different, larger system to classify low vs. high free-energy conformations. They introduce a hypergraph representation that captures multi-atom interactions (bonds, non-bonds, angles, dihedrals) and a neural network capable of processing variable-size hypergraph inputs via novel message-passing and pooling layers. They demonstrate transfer from alanine dipeptide to tri-alanine, and from tri-alanine to deca-alanine—whose most probable conformations (ϕ, ψ torsions) differ from the source systems—achieving high AUC. They further show unsupervised grouping of deca-alanine secondary structures with similar predicted free-energy. This provides a proof of concept that ML-enabled transfer can help overcome sampling issues for large systems by predicting structural and energetic properties from data on smaller, chemically related molecules.

Literature Review

The paper situates the work within prior applications of ML to chemistry and materials, including prediction of molecular properties and atomistic/quantum-level properties. It notes recent ML efforts for identifying free-energy states and slow modes, and longstanding challenges with sampling free-energy landscapes of biomolecular systems despite enhanced sampling and coarse-graining advances. Transfer learning has seen limited application in computational chemistry, primarily for approximating quantum-mechanical calculations and inferring material/molecular properties, with little exploration in conformational free-energy estimation and classification for molecular simulations. Existing hypergraph and simplicial complex neural methods have limitations (e.g., scalar hyperedge features, lack of pooling for variable-size inputs), motivating the new hypergraph representation and processing layers introduced here.

Methodology

- Molecular representation: Each molecular conformation is encoded as a hypergraph H(V, E, X, W), where vertices V are atoms and hyperedges E capture multi-atom interactions: |e|=2 (bonds, non-bonds/Coulomb, van der Waals), |e|=3 (angles), and |e|=4 (dihedrals). Vertex features include atomic properties (e.g., mass/radius or atomic number/partial charge). Hyperedge features form a 5-element vector encoding interaction type and corresponding value (e.g., Coulomb force, van der Waals force, angle, dihedral). The representation is invariant to rotations, translations, and atom index permutations by relying on internal coordinates and interaction-derived features.
- Energetic context: Potential energy Ep is classically decomposed into bond, non-bond, angle, and dihedral terms, motivating higher-order (≥3-body) interactions in the representation. Free-energy F(s) is defined via metadynamics bias V(x) over collective variables s(x), enabling identification of low-energy conformations; these computed free-energies label source-system conformations.
- Neural architecture: A novel hypergraph message-passing neural network (HNN) processes variable-size hypergraphs. Message passing uses learnable linear transforms with sigmoid activations, sum aggregation, and updates for both node and hyperedge embeddings. Two consecutive hypergraph message-passing layers are followed by a pooling mechanism that compares the input hypergraph's learned representation to a set of points of interest (e.g., distinct conformations identified via K-means on internal representations). For each point, a sliding-window cosine similarity produces min/mean/max statistics concatenated into a fixed-size vector, fed to a fully connected layer outputting the probability that the conformation is low free-energy.
- Transfer learning: Zero-shot transfer is employed. The model is trained on a source system's structures and free-energies, then directly applied to the target system without target free-energy supervision. Training includes standard binary cross-entropy with L2 weight regularization and an additional regularization term that partially processes a representative structure of the target distribution through the message-passing layers to encourage target-relevant representation learning without labels.
- Systems and simulation data: Source systems: alanine dipeptide (100 ns metadynamics in vacuum; CVs ϕ and ψ; converged free-energy) and tri-alanine (400 ns metadynamics; more complex CV setup). Target system: deca-alanine (~700 ns in vacuum), sampled via metadynamics using RMSD of Cα atoms as CV to generate diverse conformations (not for free-energy calculation). Free-energy threshold for low vs. high energy was set at 8 kJ/mol based on the alanine dipeptide free-energy surface. Data splits: alanine dipeptide (temporal selection per 5-conformation windows: 20% train, 40% validation, 40% test); tri-alanine (100,000 examples; 20% train, 20% validation, 60% test). ROC/AUC used for threshold-invariant evaluation.
- Unsupervised analysis: Deca-alanine conformations were clustered (10 clusters) by backbone RMSD into families (helix: clusters 1,2,4,9; hairpin-like: 5,6; extended: 0,3,7,8). From each cluster representative, 1,000 additional structures were generated with backbone RMSD constraints (10,000 total). Predicted free-energy distributions across clusters were compared using Wilcoxon signed-rank tests (p<0.01) to assess similarity/dissimilarity consistent with secondary structure families.
- Computational considerations: Message passing scales O(e+n) per layer; pooling scales approximately quadratically with molecule size times number of interest points (assumed small).
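To make the representation and message-passing description above more concrete, here is a minimal Python sketch of one simplified update round on a toy four-atom hypergraph. It is an illustration under stated assumptions, not the authors' implementation: the `Hypergraph` class, the slot layout of the 5-element hyperedge vector, the random untrained weights, and the exact update rule (sum member atoms into each hyperedge, then sum incident hyperedges into each atom) are all hypothetical choices that only follow the summary's mention of linear transforms, sigmoid activations, and sum aggregation.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Hypergraph:
    """One conformation: atoms as vertices, interactions as hyperedges."""
    node_feats: list   # one feature vector per atom
    hyperedges: list   # tuples of atom indices with 2, 3 or 4 members
    edge_feats: list   # one feature vector per hyperedge


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def vec_sum(vectors, dim):
    """Element-wise sum; returns zeros when there is nothing to sum."""
    out = [0.0] * dim
    for vec in vectors:
        for i, v in enumerate(vec):
            out[i] += v
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def message_passing_step(graph, w_edge, w_node):
    """One simplified round: atoms -> hyperedges, then hyperedges -> atoms."""
    node_dim = len(graph.node_feats[0])
    hidden = len(w_edge)
    # Hyperedge update: sum member-atom features, append the edge's own features.
    new_edges = []
    for members, feat in zip(graph.hyperedges, graph.edge_feats):
        pooled = vec_sum([graph.node_feats[i] for i in members], node_dim)
        new_edges.append([sigmoid(z) for z in linear(w_edge, pooled + feat)])
    # Node update: sum the updated embeddings of all incident hyperedges.
    new_nodes = []
    for i, feat in enumerate(graph.node_feats):
        incident = [e for members, e in zip(graph.hyperedges, new_edges) if i in members]
        pooled = vec_sum(incident, hidden)
        new_nodes.append([sigmoid(z) for z in linear(w_node, feat + pooled)])
    return Hypergraph(new_nodes, graph.hyperedges, new_edges)


# Toy 4-atom chain. Node features: [mass, partial charge] (illustrative values).
# Hypothetical edge slots: [bond, Coulomb, van der Waals, angle, dihedral].
toy = Hypergraph(
    node_feats=[[12.0, -0.1], [12.0, 0.3], [14.0, -0.4], [12.0, 0.2]],
    hyperedges=[(0, 1), (1, 2), (2, 3), (0, 1, 2), (1, 2, 3), (0, 1, 2, 3)],
    edge_feats=[
        [1.53, 0.0, 0.0, 0.0, 0.0],
        [1.47, 0.0, 0.0, 0.0, 0.0],
        [1.52, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.91, 0.0],
        [0.0, 0.0, 0.0, 1.94, 0.0],
        [0.0, 0.0, 0.0, 0.0, -1.05],
    ],
)

rng = random.Random(0)
hidden_dim = 8
w_edge = random_matrix(hidden_dim, 2 + 5, rng)           # atom dim + edge dim
w_node = random_matrix(hidden_dim, 2 + hidden_dim, rng)  # atom dim + hidden dim
out = message_passing_step(toy, w_edge, w_node)
print(len(out.node_feats), len(out.node_feats[0]))
```

Because both aggregations are sums over set members, the result does not depend on the order in which atoms appear inside a hyperedge. A second layer would repeat the same step with weight matrices sized for the new embedding dimensions.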

Key Findings

- Transfer from alanine dipeptide to tri-alanine: HNN trained on alanine dipeptide free-energy labels and tested zero-shot on tri-alanine achieves AUC = 0.89 for classifying low vs. high free-energy conformations (threshold for low energy ≤8 kJ/mol).
- Transfer from tri-alanine to deca-alanine: HNN trained on tri-alanine and tested zero-shot on deca-alanine achieves AUC = 0.92. Classification precision/recall vary with decision threshold, e.g.:
  - threshold 0.45: Low class (P=0.956, R=0.836), High class (P=0.979, R=0.624)
  - threshold 0.50: Low (P=0.923, R=0.950), High (P=0.921, R=0.880)
  - threshold 0.55: Low (P=0.907, R=0.999), High (P=0.877, R=0.894)
- Robust zero-shot generalization despite differing preferred conformations: Poly-proline-like is a minimum in tri-alanine but disfavored in deca-alanine, which stabilizes via intra-molecular H-bonds not present in the training set; the model still distinguishes low/high free-energy conformations effectively.
- Unsupervised secondary structure recognition: Using only predictions on deca-alanine, HNN identifies energetic similarities/differences largely consistent with helix, hairpin-like, and extended families. Statistical tests (Wilcoxon, p<0.01) support correct grouping of many clusters; examples include distinguishing fully extended β (cluster 0) from β-hairpin clusters (5,6) despite similar torsion angle regimes, and sensitivity to subtle helix-end unfolding differences (e.g., clusters 1,4,9 vs. 2). Some outliers are noted (e.g., 0 vs. 3/8; 2 vs. 4/9), reflecting fine-grained distinctions learned by the model.
- Overall, the hypergraph representation capturing 2-, 3-, and 4-body interactions, combined with the novel HNN and pooling, enables effective transfer of chemical and energetic knowledge across systems of increasing size and complexity.
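The metrics reported above can be reproduced in form (not in value) with a few lines of Python. The sketch below labels conformations as low free-energy when F ≤ 8 kJ/mol, computes AUC as the probability that a randomly chosen low-energy conformation receives a higher score than a randomly chosen high-energy one, and evaluates per-class precision and recall at a chosen decision threshold. The free energies and scores are made-up toy values, and the function names are hypothetical; none of this is the paper's data or code.

```python
LOW_ENERGY_CUTOFF = 8.0  # kJ/mol, the cutoff quoted in the summary


def label_low(free_energy):
    """1 for a low free-energy conformation, 0 for a high one."""
    return 1 if free_energy <= LOW_ENERGY_CUTOFF else 0


def auc(scores, labels):
    """Probability that a positive outranks a negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(scores, labels, threshold, positive=1):
    """Precision and recall for one class; 'low' is predicted when score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == positive and y == positive)
    fp = sum(1 for p, y in zip(preds, labels) if p == positive and y != positive)
    fn = sum(1 for p, y in zip(preds, labels) if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: reference free energies (kJ/mol) and predicted P(low free-energy).
free_energies = [1.2, 3.0, 5.5, 12.0, 7.9, 8.0, 15.0, 20.1, 6.3, 30.0]
scores = [0.9, 0.8, 0.7, 0.6, 0.52, 0.48, 0.4, 0.3, 0.2, 0.1]
labels = [label_low(f) for f in free_energies]

print("AUC:", auc(scores, labels))
for threshold in (0.45, 0.50, 0.55):
    low = precision_recall(scores, labels, threshold, positive=1)
    high = precision_recall(scores, labels, threshold, positive=0)
    print(threshold, "low (P, R):", low, "high (P, R):", high)
```

AUC depends only on how the scores rank the conformations, which is why it stays fixed while precision and recall shift as the decision threshold moves.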

Discussion

The study addresses the challenge of limited sampling and data scarcity in molecular simulations by transferring free-energy knowledge from smaller to larger systems. By embedding multi-atom interactions in a hypergraph representation and processing them with a tailored HNN and pooling, the model generalizes from alanine dipeptide and tri-alanine to deca-alanine, accurately classifying low vs. high free-energy conformations in a zero-shot setting (AUC up to 0.92). The unsupervised analyses further indicate that predicted free-energy distributions reflect secondary structural similarities, even detecting subtle variations (e.g., helix termini unfolding) beyond simple RMSD-based clustering. This suggests the learned representations capture physically meaningful, higher-order interactions relevant to stability. The approach could substantially reduce reliance on lengthy enhanced sampling for large systems by providing rapid, ML-based energetic assessments and may be integrated as a run-time or post-processing aid in MD to guide sampling toward low/high free-energy regions. The work expands the application of transfer learning in computational chemistry, showing feasibility for conformational free-energy classification across systems sharing chemical building blocks but differing in size and secondary structure repertoire.
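The cluster-level comparison of predicted free-energy distributions mentioned above relies on Wilcoxon signed-rank tests. As a rough illustration of what such a comparison involves, the following sketch implements the two-sided test with a normal approximation and no tie or continuity correction. The summary does not say how samples from two clusters were paired, so pairing by index is an assumption here, and the two score lists are synthetic stand-ins rather than model predictions.

```python
import math
import random


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W+, z, p). Zero differences are dropped; tied absolute
    differences share their average rank; no continuity correction.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p


# Synthetic scores for two hypothetical clusters, paired by index (assumption).
rng = random.Random(1)
helix_like = [rng.gauss(0.8, 0.05) for _ in range(50)]
extended_like = [rng.gauss(0.3, 0.05) for _ in range(50)]

w_plus, z, p = wilcoxon_signed_rank(helix_like, extended_like)
print("significantly different at p < 0.01:", p < 0.01)
```

A library routine with exact small-sample handling and tie corrections would be preferable in practice; the hand-written version is shown only to expose the steps.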

Conclusion

The paper introduces a hypergraph-based molecular representation and a bespoke hypergraph neural network with pooling that together enable zero-shot transfer of free-energy knowledge from simpler to more complex molecular systems. Demonstrations on alanine dipeptide→tri-alanine and tri-alanine→deca-alanine show strong classification performance (AUC 0.89 and 0.92) and the capacity to capture secondary-structure-related energetic similarities in an unsupervised setting. This establishes a proof of concept for using transfer learning to overcome sampling bottlenecks in molecular simulations, offering a route to rapid energetic characterization of larger, biologically relevant systems. Future research directions include improving regression accuracy for quantitative free-energy prediction, extending to heterogeneous systems (e.g., diverse amino acid sequences, tertiary structures, protein–ligand complexes), refining hypergraph features and pooling mechanisms, and integrating the method into MD workflows to actively steer sampling.

Limitations

- Ground truth free-energy landscapes for complex targets (e.g., deca-alanine) were not computed; deca-alanine metadynamics was used to generate diverse conformations rather than converged free-energies, necessitating indirect/unsupervised validation.
- Zero-shot transfer relied on training data that did not include intra-molecular hydrogen bonds characteristic of helices/β-sheets in deca-alanine, challenging generalization despite good results.
- Some cluster-level predictions deviated from expectations based on structural similarity (outliers), indicating sensitivity to subtle features and potential misclassifications.
- The regression setting for direct free-energy value prediction performed poorly (results not shown), limiting quantitative use; current strengths are in classification.
- Scalability: pooling has quadratic dependence on molecule size (per number of interest points), which may impact very large systems.
- Generalizability to more complex, heterogeneous biomolecular systems (multiple amino acids, tertiary structures, protein–ligand complexes) remains to be demonstrated; model and representation may require further development.
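To see where the pooling cost in the scalability point above comes from, here is one possible reading of the sliding-window cosine-similarity pooling, sketched in Python. The summary does not define the window precisely, so several things are assumptions: embeddings are treated as an ordered list of per-atom vectors, each point of interest is itself a short list of such vectors, the shorter of the two is slid along the longer, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def flatten(rows):
    return [x for row in rows for x in row]


def sliding_window_pool(embedding, points_of_interest):
    """Map a variable-size embedding to a vector of length 3 * number of points."""
    features = []
    for point in points_of_interest:
        # Slide the shorter sequence over the longer one (assumed convention).
        if len(embedding) >= len(point):
            longer, shorter = embedding, point
        else:
            longer, shorter = point, embedding
        target = flatten(shorter)
        width = len(shorter)
        sims = [
            cosine(flatten(longer[start:start + width]), target)
            for start in range(len(longer) - width + 1)
        ]
        features += [min(sims), sum(sims) / len(sims), max(sims)]
    return features


# Two hypothetical points of interest, each 2 atoms x 2 embedding dimensions.
points = [
    [[0.2, 0.9], [0.8, 0.1]],
    [[0.5, 0.5], [0.4, 0.6]],
]
small = [[0.1, 0.7], [0.2, 0.9], [0.8, 0.1]]                   # 3 atoms
large = small + [[0.3, 0.3], [0.9, 0.2], [0.5, 0.5], [0.4, 0.6]]  # 7 atoms

print(len(sliding_window_pool(small, points)), len(sliding_window_pool(large, points)))
```

The output length depends only on the number of points of interest, which is what lets a fixed-size fully connected layer follow inputs of any size. Each point costs roughly (number of windows) × (window length) operations, so when the window grows with the molecule the cost approaches quadratic in molecule size, consistent with the limitation noted above.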
