Chemistry
A molecular video-derived foundation model for scientific drug discovery
H. Xiang, L. Zeng, et al.
VideoMol is a molecular video-based foundation model pretrained on 120 million frames rendered from 2 million drug-like and bioactive molecules. Developed by Hongxin Xiang and colleagues, it delivers strong performance on molecular property and target binding prediction while remaining interpretable through attention-based visualizations.
~3 min • Beginner • English
Introduction
Drug discovery is complex and time-consuming, spanning target identification, compound design/synthesis, and multi-stage efficacy/safety testing. Traditional discovery relies on domain expertise and iterative validation in cellular or animal models. Computational and AI-enabled approaches promise to accelerate this pipeline by leveraging large-scale biological and chemical datasets and foundation models to identify targets, generate candidate molecules, and evaluate their properties more efficiently. A central challenge is learning accurate, generalizable molecular representations across hundreds of millions of existing and novel compounds. Hand-crafted fingerprints (e.g., pharmacophoric or read-across) are constrained by domain knowledge and may lack generalizability. Recent advances in deep and self-supervised learning enable automated representation learning from molecular sequences and images at scale, leading to substantial gains across drug discovery tasks. Building on progress in video representation learning, we hypothesize that representing molecules as dynamic 3D videos can better capture conformational and physicochemical information, improving target binding and molecular property predictions. We introduce VideoMol, a molecular video-based foundation model that leverages dynamic and physicochemical awareness via self-supervised pretraining on 2 million molecular videos (60 frames each; 120 million frames total) to learn transferable representations that can be fine-tuned across diverse drug discovery tasks.
Literature Review
Traditional molecular representations use hand-crafted descriptors and fingerprints (e.g., pharmacophoric fingerprints, read-across fingerprints), which can be subjective and limited in generalization. As datasets grew, self-supervised and deep learning approaches emerged, learning molecular representations from sequences (e.g., SMILES, InChI) and 2D images, demonstrating improved performance across activities and properties. Parallel advances in self-supervised video representation learning (e.g., contrastive learning, masked modeling, transition ranking) suggest that temporal and viewpoint dynamics can encode rich 3D structure without manual features. Prior image-based molecular models (e.g., ImageMol) improved upon graph or sequence-only approaches but remain limited to static 2D views. VideoMol extends this trajectory by treating molecules as dynamic 3D videos to exploit multi-view and temporal cues, integrating multiple self-supervised objectives to inject physicochemical awareness and inter/intra-video discrimination into learned embeddings.
Methodology
Overview and data: VideoMol treats each molecule as a 60-frame video capturing 3D conformational structure via controlled rotations. The model is pretrained in a self-supervised manner on 2 million unlabeled molecules (120 million frames) sampled from PCQM4M-v2. Each molecule's conformers are rendered into frames by rotating them around the x, y, and z axes and capturing a snapshot at each step to form a video. Rendering uses PyMOL (six-still mode), generating 640×480 images that are padded and resized to 224×224 RGB frames, so each video is a tensor V ∈ R^{60×224×224×3} (60 frames per molecule).
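To make the frame-generation step concrete, here is a minimal sketch of a rotation schedule for a single conformer. It assumes RDKit for conformer generation and 20 rotation steps per axis (3 × 20 = 60 frames), and leaves rendering each rotated conformer to a 224×224 RGB image (e.g., with PyMOL) as a placeholder rather than reproducing the authors' exact pipeline.

```python
# Sketch only: build a 60-frame rotation schedule for one molecule.
# Assumes RDKit; the rendering step (e.g., PyMOL) is not shown.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolTransforms

def rotation_matrix(axis: str, angle_rad: float) -> np.ndarray:
    """4x4 homogeneous rotation matrix about a coordinate axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    m = np.eye(4)
    if axis == "x":
        m[1:3, 1:3] = [[c, -s], [s, c]]
    elif axis == "y":
        m[0, 0], m[0, 2], m[2, 0], m[2, 2] = c, s, -s, c
    else:  # "z"
        m[0:2, 0:2] = [[c, -s], [s, c]]
    return m

def video_frames(smiles: str, frames_per_axis: int = 20):
    """Yield rotated conformers: 3 axes x 20 steps = 60 frames per molecule (assumed split)."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=0)               # one 3D conformer
    step = 2 * np.pi / frames_per_axis
    for axis in ("x", "y", "z"):
        for i in range(frames_per_axis):
            frame_mol = Chem.Mol(mol)                      # copy so rotations don't accumulate
            rdMolTransforms.TransformConformer(frame_mol.GetConformer(),
                                               rotation_matrix(axis, i * step))
            yield frame_mol                                # render this pose -> 224x224 RGB frame
```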
Video encoder: A 12-layer Vision Transformer (ViT) serves as the video encoder, processing each frame independently by splitting into 16×16 patches and extracting 384-dimensional features per frame. Frame-wise features are aggregated downstream.
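The described encoder (12 layers, 16×16 patches, 384-dimensional output on 224×224 inputs) matches a ViT-Small configuration. The sketch below uses timm as an assumed convenience, not necessarily the authors' implementation, and mean-pools frame features only as an illustration of downstream aggregation.

```python
# Hedged sketch: encode each frame of one molecular video with a ViT-Small backbone.
import torch
import timm

encoder = timm.create_model("vit_small_patch16_224", num_classes=0)  # 384-d frame features

video = torch.randn(60, 3, 224, 224)       # one molecular video: 60 RGB frames
with torch.no_grad():
    frame_feats = encoder(video)           # (60, 384), one embedding per frame
video_feat = frame_feats.mean(dim=0)       # simple frame aggregation (illustrative assumption)
```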
Self-supervised pretraining tasks:
- Video-aware pretraining (VAP): Contrastive learning (InfoNCE) to maximize similarity among frames from the same video (intra-video) and minimize similarity across different videos (inter-video), stabilizing multi-view representations (a minimal InfoNCE sketch follows this list).
- Direction-aware pretraining (DAP): Predicts geometric relationships between frame pairs via three MLP classifiers: (i) rotation axis (x/y/z), (ii) rotation direction (clockwise/anticlockwise), and (iii) rotation angle (discrete 1–19). Inputs are residual features H′ from paired frame features, optimized by cross-entropy losses.
- Chemical-aware pretraining (CAP via MSCS): Multi-Channel Semantic Clustering (MSCS) constructs fingerprints by concatenating 21 physicochemical fingerprints and applying PCA-based dimensionality reduction to build 6000-D descriptors. K-means (k=100) provides pseudo-labels; an MLP predicts cluster assignments from latent features, injecting physicochemical awareness.
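A minimal sketch of the intra-/inter-video InfoNCE objective referenced in the VAP item above; the temperature, batching, and positive-averaging details are assumptions rather than the authors' exact formulation.

```python
# Sketch of an InfoNCE-style contrastive loss over frame embeddings.
import torch
import torch.nn.functional as F

def vap_infonce(frame_feats: torch.Tensor, video_ids: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """frame_feats: (N, D) frame embeddings; video_ids: (N,) index of each frame's source video."""
    z = F.normalize(frame_feats, dim=1)
    sim = z @ z.t() / temperature                          # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-pairs
    pos = (video_ids[:, None] == video_ids[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # maximize log-likelihood of intra-video positives, contrasted against inter-video frames
    per_frame = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return per_frame.mean()
```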
Multi-task optimization: Losses from VAP, DAP, and CAP are combined via weighted multi-objective optimization with adaptive weights λ_k = L_k / Σ_i L_i. The final objective is L_weighted = Σ_k λ_k L_k, balancing tasks during training.
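A short sketch of the adaptive weighting just described, where each task's weight λ_k is its share of the current total loss; treating the weights as constants (detached from the gradient) is an assumption about the implementation.

```python
# Hedged sketch of the weighted multi-objective loss L_weighted = sum_k lambda_k * L_k.
import torch

def weighted_multitask_loss(task_losses: dict) -> torch.Tensor:
    """task_losses: mapping of task name -> scalar loss tensor."""
    total = sum(l.detach() for l in task_losses.values())          # lambda_k = L_k / sum_i L_i
    return sum((l.detach() / total) * l for l in task_losses.values())

# usage: loss = weighted_multitask_loss({"vap": l_vap, "dap": l_dap, "cap": l_cap})
```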
Data augmentation and normalization: Augmentations avoid geometry-altering transforms and include CenterCrop(224,224), RandomGrayScale(p=0.3), ColorJitter(brightness/contrast/saturation (0.6,1.4), p=0.3), and GaussianBlur(kernel=3, sigma 0.1–2.0, p=0.3). Frames are normalized with ImageNet mean (0.485, 0.456, 0.406) and std (0.229, 0.224, 0.225).
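The augmentation and normalization recipe above maps directly onto torchvision transforms; wrapping ColorJitter and GaussianBlur in RandomApply to realize p = 0.3 is an assumption about how the probabilities are applied.

```python
# Sketch of the frame augmentation/normalization pipeline (torchvision).
from torchvision import transforms

frame_transform = transforms.Compose([
    transforms.CenterCrop((224, 224)),
    transforms.RandomGrayscale(p=0.3),
    transforms.RandomApply([transforms.ColorJitter(brightness=(0.6, 1.4),
                                                   contrast=(0.6, 1.4),
                                                   saturation=(0.6, 1.4))], p=0.3),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=3,
                                                    sigma=(0.1, 2.0))], p=0.3),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
```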
Pretraining protocol: 90% of molecular videos are used for pretraining and 10% for validation. Hyperparameters and logging details are in supplementary materials. Uncertainty intervals are computed using BCa bootstrap for select evaluations.
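For the BCa bootstrap intervals mentioned above, a minimal sketch using SciPy; the scores below are illustrative placeholders, and using scipy.stats.bootstrap is an assumed implementation choice rather than the authors' exact code.

```python
# Sketch: 95% BCa bootstrap confidence interval for a performance metric (e.g., ROC-AUC per seed).
import numpy as np
from scipy.stats import bootstrap

scores = np.array([0.91, 0.88, 0.93, 0.90, 0.92, 0.89, 0.94, 0.90])   # placeholder values
res = bootstrap((scores,), np.mean, confidence_level=0.95,
                method="BCa", n_resamples=10_000, random_state=0)
print(res.confidence_interval)   # (low, high) of the 95% BCa interval
```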
Fine-tuning: For downstream tasks, an external MLP head is added to the encoder. Each frame’s logits are computed, then averaged across frames to yield sample-level predictions. Classification uses cross-entropy; regression uses MSE or Smooth L1.
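A hedged sketch of the fine-tuning setup: an MLP head on frame embeddings with logits averaged across frames to give a sample-level prediction. The hidden width and two-layer head are illustrative assumptions.

```python
# Sketch of fine-tuning with frame-averaged logits (classification case; regression swaps the loss).
import torch
import torch.nn as nn

class VideoMolClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int = 384, n_classes: int = 2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_classes))

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        """video: (n_frames, 3, 224, 224) -> class logits averaged over frames."""
        frame_logits = self.head(self.encoder(video))   # (n_frames, n_classes)
        return frame_logits.mean(dim=0)                 # sample-level prediction
```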
Downstream tasks, splits, and metrics:
- Binding activity prediction: 10 kinases (classification, ROC-AUC) and 10 GPCRs (regression, RMSE/MAE), using balanced scaffold splits (80/10/10 train/val/test), aligned with ImageMol protocols.
- Molecular property prediction: 12 MoleculeNet benchmarks: classification (BBBP, Tox21, HIV, BACE, SIDER, ToxCast; ROC-AUC) and regression (FreeSolv, ESOL, Lipophilicity, QM7, QM8, QM9; RMSE/MAE) with scaffold splits.
- Anti-SARS-CoV-2 activity prediction: 11 assays aligned with REDIAL-2020/ImageMol splits. Means/SDs reported across multiple seeds.
Additional analyses:
- Ablations: Evaluate contributions of individual and paired pretraining tasks; assess sensitivity to video generation platforms (OpenBabel, DeepChem, RDKit) and to number of frames per video; analyze conformer sensitivity via cosine similarity vs RMSD.
- Interpretability: Grad-CAM visualizations highlight substructures driving predictions; t-SNE analyses of frame- and molecule-level embeddings, and intra-/inter-video similarity distributions (an illustrative sketch of these embedding analyses follows this list).
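An illustrative sketch (not the authors' analysis code) of the embedding analyses named above: a t-SNE projection of molecule-level embeddings and a comparison of intra- versus inter-video cosine similarity. All arrays here are random placeholders.

```python
# Sketch: t-SNE of molecule embeddings and intra-/inter-video similarity comparison.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

mol_embeddings = np.random.randn(500, 384)               # placeholder molecule-level embeddings
coords = TSNE(n_components=2, random_state=0).fit_transform(mol_embeddings)  # 2-D points to plot

frame_embeddings = np.random.randn(600, 384)             # placeholder: 10 videos x 60 frames
video_ids = np.repeat(np.arange(10), 60)
sim = cosine_similarity(frame_embeddings)
same = video_ids[:, None] == video_ids[None, :]          # True for frames from the same video
off_diag = ~np.eye(len(video_ids), dtype=bool)
print("intra-video mean:", sim[same & off_diag].mean(),
      "inter-video mean:", sim[~same].mean())
```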
Key Findings
- Overall representation advantage: The proposed molecular video representation showed a 39.8% improvement over existing video representations on basic attributes (Supplementary Tables).
- Kinase binding (10 datasets, classification): VideoMol achieved superior ROC-AUC across BBTK (0.816 ± 0.023), CDK4-cyclinD3 (0.972 ± 0.039), EGFR (0.905 ± 0.017), FGFR1 (0.848 ± 0.027), FGFR2 (0.985 ± 0.017), FGFR3 (0.896 ± 0.039), FGFR4 (0.582 ± 0.801), FLT3 (0.951 ± 0.026), KPCD3 (0.867 ± 0.036), MEF (0.963 ± 0.026), yielding an average AUC improvement of about 5.9% (range ~1.8%–20.3%) over baselines.
- GPCR binding (10 datasets, regression): VideoMol outperformed ImageMol and MoLCIR, with average improvements of 4.5%–10.2% in RMSE and 6.2%–12.0% in MAE (reported average gains include 6.7% over ImageMol and 20.1% over MoLCIR).
- Molecular property prediction (12 benchmarks): On regression tasks, VideoMol achieved lower errors, e.g., FreeSolv RMSE ≈ 1.725 ± 0.053 and ESOL RMSE ≈ 0.866 ± 0.017, with improved performance across Lipophilicity and QM7–QM9 (e.g., QM9 MAE ≈ 0.9066 ± 0.0003), outperforming competitive 2D-graph, graph-based, and image-based baselines (see Fig. 2 and Supplementary Tables).
- Anti-SARS-CoV-2 activity (11 assays, classification): VideoMol achieved higher ROC-AUC than REDIAL-2020 and ImageMol across assays, e.g., A2RE 0.759 ± 0.025, RYTCOVA 0.765 ± 0.003, MERS-PtSE 0.835 ± 0.027, MERS-PF 0.814 ± 0.004, CPIE 0.747 ± 0.013, CvI-PF 0.836 ± 0.029, CoVI-PF 0.737 ± 0.007, Cytotoxic 0.761 ± 0.002, AlphaLSA 0.841 ± 0.004, Truth 0.862 ± 0.002, with average AUC improvements of ~3.9% over ImageMol and ~8.1% over REDIAL-2020.
- External target generalization (ChEMBL, 4 targets): High ROC-AUCs on validation/test sets for BACE1 (0.897/0.893), COX-1 (0.849/0.901), COX-2 (0.810/0.907), EP4 (0.773/0.899), improving over ImageMol by ~6.4% (validation) and ~4.1% (test). t-SNE embeddings showed clear separability of inhibitors vs non-inhibitors.
- Virtual screening success: VideoMol re-identified known inhibitors with high precision: BACE1 15/16 (93.8%), COX-1 8/22 (36.4%), COX-2 11/35 (31.4%), EP4 8/8 (100%). The overall average precision improvement over ImageMol was ~38% (range 12.5%–75.0%). For BACE1 drug repurposing from DrugBank, 10/20 top VideoMol predictions had literature support (55% success) vs 25% for ImageMol; docking to BACE1 (PDB 4IVS) indicated favorable grid scores (DockX 1.0).
- Uncertainty quantification: BCa bootstrap 95% intervals showed VideoMol’s average improvement of ~5.44%–10.07% vs ImageMol on ligand binding and SARS-CoV-2 datasets.
- Ablations and robustness: Pretraining tasks improved downstream performance (average RMSE ↓27.1%, MAE ↓31.0% vs no pretraining). VideoMol’s learned features (VideoMolFea) surpassed a 21-fingerprint ensemble (EnsembleFP) by ~17.0% RMSE and ~19.6% MAE on GPCR datasets. Performance improved with more frames (e.g., MAE gains from 5→10, 10→20, 20→30 frames). Video generation source had minimal impact (comparable RMSE/MAE across OpenBabel, DeepChem, RDKit). Feature similarity across conformers decreased with increasing RMSD, indicating sensitivity to conformational changes.
- Interpretability: Grad-CAM highlighted chemically meaningful substructures and consistent attention on recurring motifs across frames; intra-video similarities were higher than inter-video, and cluster structure in t-SNE reflected physicochemical pseudo-labels.
Discussion
Representing molecules as dynamic videos enables VideoMol to capture 3D structural cues and conformational dynamics that are difficult to encode from static 2D images or hand-crafted fingerprints. By integrating video-aware contrastive learning, direction-aware geometric reasoning, and chemical-aware clustering objectives, VideoMol learns embeddings that generalize across diverse drug discovery tasks. Empirically, VideoMol improves target binding predictions (kinases, GPCRs), molecular property predictions (pharmacology, biophysics, physical and quantum chemistry), and antiviral activity prediction (SARS-CoV-2 assays), outperforming strong sequence-, graph-, and image-based baselines. The model generalizes to external targets (BACE1, COX-1, COX-2, EP4) with improved ROC-AUC and demonstrates practical value in virtual screening and repurposing (e.g., BACE1), corroborated by docking grid scores and literature evidence. VideoMol is robust to class imbalance and data scarcity, aided by multi-view aggregation and self-supervised pretraining, and provides interpretable attention maps that localize key substructures even under occlusion or changing viewpoints. Collectively, these findings support the hypothesis that video-based molecular representations can address limitations of static or hand-crafted approaches, improving predictive performance and offering actionable insights for computational drug discovery.
Conclusion
This work introduces VideoMol, a self-supervised, video-based foundation model for molecular representation learning that leverages dynamic and physicochemical awareness to learn transferable features from 2 million unlabeled molecular videos. Across binding, property, and antiviral activity prediction tasks, VideoMol consistently outperforms state-of-the-art baselines, generalizes to external targets, and demonstrates practical utility in virtual screening and repurposing (e.g., BACE1). Its Grad-CAM-based interpretability enables localization of key substructures, offering transparency in predictions. Future research directions include training larger VideoMol variants on broader biomedical data; reducing computational overhead through pruning and efficient architectures; distilling VideoMol into smaller sequence- or graph-based student models; improving video processing and ensemble strategies to fuse multi-view information; and explicitly modeling conformer diversity and molecular dynamics to further exploit 3D variability. These avenues can further enhance VideoMol's accuracy, efficiency, and applicability in AI-driven drug discovery.
Limitations
- Computational cost: Rendering and processing 60-frame molecular videos increase storage and compute demands versus 2D or graph inputs.
- View selection: Choice and coverage of viewpoints can influence representation quality; optimal view scheduling remains open.
- Conformer diversity: Current approach renders fixed conformers from multiple angles but does not fully model the distribution of conformations or dynamic transitions between them.
- Data and scalability: Scaling to larger chemical spaces and targets requires substantial resources; strategies like pruning and distillation are needed for efficiency.
- Potential sensitivity to rendering/augmentation choices: Although robustness across video generation platforms is observed, different rendering settings or augmentations may still affect performance in edge cases.