Introduction
Molecular representation learning is crucial for advances in drug discovery and chemical analysis. Machine learning, particularly deep learning, has significantly impacted this field, with methods applied to tasks such as property prediction, molecule generation, and molecule optimization. However, building comprehensive and effective molecular representations remains challenging. Existing approaches often rely on string-based representations (SMILES, SELFIES) processed by RNNs or Transformers, or on graph neural networks (GNNs) that may ignore crucial chemical structure embedded within molecular motifs. Self-supervised learning (SSL) frameworks have gained traction because unlabeled molecular data are abundant. This work addresses the limitations of existing methods by proposing HiMol, which leverages hierarchical information within molecular graphs and employs multi-level self-supervised pre-training to learn more robust and informative molecular representations. The core challenge is threefold: capturing and preserving the complete molecular structure, including crucial motifs; integrating comprehensive information into the representations; and designing effective self-supervised pretext tasks for pre-training.
Literature Review
Numerous methods leverage language models to process string-based molecular representations such as SMILES and SELFIES, encoding them with RNNs (GRU, LSTM) or Transformers, typically under supervised learning for property prediction. Self-supervised learning (SSL) has also been applied to these strings, with pretext tasks such as reconstructing the original string or recovering masked tokens. However, these methods neglect the topological information inherent in 2D molecular graphs, so many researchers have shifted from 1D molecular strings to 2D graphs. Recent graph SSL pre-training frameworks capture the topology of 2D graphs but often overlook structural properties unique to molecules, such as rings and functional groups. Some works incorporate molecular graph characteristics, for example by modeling clustering problems to learn molecular motifs or by designing self-supervised pre-training frameworks for motif prediction. Other methods use graph augmentation to create different views for contrastive learning, or exploit chemical reaction relationships. Despite these advances, challenges remain in adequately capturing molecular structure, fusing comprehensive information, and designing effective pretext tasks for self-supervised pre-training.
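To make the string-based pretext task concrete, below is a minimal Python sketch of masked-token recovery on a SMILES string. The character-level tokenization, 15% mask rate, and `[MASK]` placeholder are illustrative assumptions, not the recipe of any specific method cited above.

```python
# A generic sketch of the masked-token pretext task on SMILES strings.
# Tokenization and mask rate are illustrative assumptions.
import random

def mask_smiles(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly replace tokens; the model is trained to recover the originals."""
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append((i, tok))   # (position, original token) to predict
        else:
            masked.append(tok)
    return masked, targets

tokens = list("CC(=O)Oc1ccccc1C(=O)O")   # naive char-level tokenization of aspirin
print(mask_smiles(tokens))
```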
Methodology
HiMol addresses these challenges with a two-component framework: a Hierarchical Molecular Graph Neural Network (HMGNN) and Multi-level Self-supervised Pre-training (MSP).

HMGNN encodes hierarchical molecular representations. It first decomposes the molecular graph into motifs (substructures with distinct chemical characteristics) following BRICS-based rules, so that chemically meaningful substructures are not broken apart. Each motif is added to the graph as a new node connected to its constituent atoms, and a single graph-level node is added and connected to every motif node; this node enables bidirectional information flow between local and global features. The augmented graph, with its node-motif and motif-graph edges, is processed by GNNs to produce atom-level, motif-level, and graph-level representations.

MSP defines multi-level pretext tasks: three generative tasks predict bond links, atom types, and bond types from atom representations, and two predictive tasks predict the number of atoms and the number of bonds from the graph-level representation. Atom-level tasks are trained with cross-entropy loss and molecule-level tasks with smooth L1 loss. For downstream property prediction, the pre-trained GNN weights are transferred, the graph-level representation is passed through a 2-layer MLP, and the entire model is fine-tuned. The GNN backbone is the Graph Isomorphism Network (GIN), instantiated at two scales (SMALL, 3 layers; LARGE, 5 layers) to investigate the effect of depth.
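As a concrete illustration of the HMGNN graph augmentation, the following Python sketch uses RDKit's BRICS utilities to partition a molecule into motifs and then wires up the node-motif and motif-graph edges. The node indexing scheme and the fallback for molecules with no BRICS bonds are assumptions made for illustration; HiMol's actual decomposition rules extend BRICS and are not reproduced exactly here.

```python
# A minimal sketch of HiMol-style hierarchical graph construction with RDKit.
from rdkit import Chem
from rdkit.Chem import BRICS

def build_hierarchical_graph(smiles):
    """Return node count and edge lists for an augmented hierarchical graph."""
    mol = Chem.MolFromSmiles(smiles)
    n_atoms = mol.GetNumAtoms()

    # Atom-atom edges come straight from the chemical bonds.
    atom_edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]

    # Partition atoms into motifs by breaking BRICS bonds.
    brics = [mol.GetBondBetweenAtoms(a1, a2).GetIdx()
             for (a1, a2), _ in BRICS.FindBRICSBonds(mol)]
    if brics:
        pieces = Chem.FragmentOnBonds(mol, brics, addDummies=False)
        motifs = Chem.GetMolFrags(pieces)          # tuples of atom indices
    else:
        motifs = [tuple(range(n_atoms))]           # whole molecule = one motif

    # Motif nodes are indexed after the atoms; one graph-level node comes last.
    motif_ids = range(n_atoms, n_atoms + len(motifs))
    graph_id = n_atoms + len(motifs)
    atom_motif_edges = [(a, m) for m, atoms in zip(motif_ids, motifs) for a in atoms]
    motif_graph_edges = [(m, graph_id) for m in motif_ids]

    return {
        "num_nodes": graph_id + 1,
        "atom_edges": atom_edges,
        "atom_motif_edges": atom_motif_edges,
        "motif_graph_edges": motif_graph_edges,
    }

print(build_hierarchical_graph("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```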
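And a hedged PyTorch sketch of how the five MSP pretext losses could be wired together. The paper specifies cross-entropy for the atom-level generative tasks and smooth L1 for the molecule-level predictive tasks; the head shapes, the candidate-pair formulation of link prediction, and the unweighted sum of terms are assumptions of this sketch.

```python
# Sketch of MSP loss wiring; head names and equal weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSPHeads(nn.Module):
    def __init__(self, dim, n_atom_types, n_bond_types):
        super().__init__()
        self.atom_type = nn.Linear(dim, n_atom_types)      # generative: atom types
        self.bond_type = nn.Linear(2 * dim, n_bond_types)  # generative: bond types
        self.bond_link = nn.Linear(2 * dim, 2)             # generative: link exists?
        self.n_atoms = nn.Linear(dim, 1)                   # predictive: atom count
        self.n_bonds = nn.Linear(dim, 1)                   # predictive: bond count

def msp_loss(heads, h_atom, h_graph, pairs, y):
    """h_atom: [N, dim] atom reps; h_graph: [B, dim] graph-level node reps;
    pairs: [P, 2] candidate atom pairs; y: dict of pretext targets."""
    h_pair = torch.cat([h_atom[pairs[:, 0]], h_atom[pairs[:, 1]]], dim=-1)
    return (
        F.cross_entropy(heads.atom_type(h_atom), y["atom_type"])
        + F.cross_entropy(heads.bond_type(h_pair[y["is_bond"]]), y["bond_type"])
        + F.cross_entropy(heads.bond_link(h_pair), y["is_bond"].long())
        + F.smooth_l1_loss(heads.n_atoms(h_graph).squeeze(-1), y["n_atoms"])
        + F.smooth_l1_loss(heads.n_bonds(h_graph).squeeze(-1), y["n_bonds"])
    )
```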
Key Findings
HiMol demonstrates superior performance on both classification and regression tasks compared to state-of-the-art methods across MoleculeNet datasets. In classification (Table 1), HiMol achieves the best performance on four of six datasets and competitive results on the remaining two, with an average improvement of 2.4% over the best baseline; HiMol LARGE generally outperforms HiMol SMALL. The results highlight the importance of incorporating motif structure: motif-based baselines generally outperform the others, and HiMol, which encodes motifs directly within the GNN backbone rather than using them only as pretext tasks, performs best of all. In regression (Table 2), HiMol outperforms the baselines on five of six datasets, including a notable 55.5% MAE reduction on the challenging QM9 dataset.

t-SNE visualizations (Figure 2 and Supplementary Figure 1) show that HiMol's learned representations cluster molecules by their properties, and that pre-training substantially improves representation quality over a model trained from scratch (Supplementary Figure 2). Molecular representation similarity ranking (Figure 3 and Supplementary Figure 3) demonstrates HiMol's ability to capture chemical semantics, retrieving molecules that are similar in structure and functional groups. Comparing fine-tuning strategies (Table 3 and Figure 4), full fine-tuning outperforms freezing the HMGNN parameters, underscoring the importance of the HMGNN architecture; the full HiMol model also converges faster and trains more stably than models without pre-training. Ablation studies (Figure 5) confirm the contribution of both HMGNN components (motif-level and graph-level nodes) and of MSP's multi-level self-supervised tasks.
Discussion
HiMol's superior performance on both classification and regression tasks validates its effectiveness in learning informative molecular representations. The incorporation of hierarchical information (atom, motif, and graph levels) within the HMGNN architecture and the design of multi-level self-supervised tasks in MSP are the key contributors to this success. The results emphasize the importance of explicitly encoding motifs within the representation learning framework rather than using them solely as pretext tasks, and HiMol's graph encoding, which routes bidirectional information flow through a graph-level node, outperforms simpler alternatives such as graph pooling. These findings advance the broader understanding of molecular representation learning and highlight the potential of hierarchical models and multi-level self-supervised learning for improving the accuracy and robustness of molecular property prediction. The visualizations further show that the learned representations carry meaningful chemical semantics.
Conclusion
This research presents HiMol, a novel self-supervised learning framework for molecular property prediction that outperforms state-of-the-art methods. HiMol leverages hierarchical graph neural networks and a multi-level self-supervised pre-training strategy to learn informative and chemically meaningful molecular representations. Future work could explore the application of HiMol to larger and more diverse datasets, investigate alternative GNN architectures, and extend the framework to handle more complex molecular tasks such as reaction prediction and de novo drug design.
Limitations
While HiMol demonstrates significant improvements, limitations remain. Performance depends on the quality of the motif decomposition; better motif identification could further improve accuracy. The computational cost of pre-training may be high for extremely large datasets. Finally, effectiveness may vary with the dataset and downstream task, so further investigation is needed to assess HiMol's performance on datasets with different structural properties and chemical diversity.