Leveraging molecular structure and bioactivity with chemical language models for de novo drug design

M. Moret, I. P. Angona, et al.

An exploration of hybrid chemical language models for drug design, from work led by Michael Moret and colleagues. The authors built a virtual library of novel PI3Kγ ligands and identified potent compounds that inhibit PI3K/AKT signaling in tumor cells.

Introduction

The study addresses how chemical language models (CLMs) can be leveraged not only for de novo molecular generation but also for activity-focused design by integrating bioactivity information of known ligands. Computational methods complement high-throughput screening by enabling focused virtual libraries to explore chemical space efficiently, yet extremely large libraries raise concerns about practicality and false positives. CLMs, trained on SMILES representations, have shown success in generating bioactive molecules and focused libraries. The typical CLM workflow involves pretraining on large chemical corpora, transfer learning (fine-tuning) on target-relevant molecules, and sampling new structures. Here, the authors develop a hybrid pipeline for PI3Kγ ligand design that combines a generative CLM with a CLM-based bioactivity classifier, aiming to generate and prioritize molecules with desired activity, and to validate hits rapidly using commercially available compounds and synthesized designs.

Literature Review

Prior work has demonstrated CLMs and other data-driven approaches can generate focused chemical libraries and de novo bioactive molecules, including VEGFR-2 inhibitors and nuclear hormone receptor modulators. Earlier CLMs were typically pretrained on ChEMBL bioactive molecules (<1 µM), which capture features of bioactive compounds but may not reflect properties of approved drugs. Sampling strategies (temperature vs nucleus) affect validity, uniqueness, novelty, and alignment with training distributions (e.g., Fréchet ChemNet Distance). For property prediction, traditional chemoinformatics often uses descriptors and ML, while recent advances in language-model pretraining (e.g., ELECTRA, BERT, ULMFiT) inform better representation learning. Ensemble methods and uncertainty estimation improve robustness, especially for out-of-domain predictions, a known issue in QSAR modeling. Scaffold analysis (Bemis–Murcko and atom/graph scaffolds) and fingerprint similarity (Morgan/Tanimoto) guide assessment of novelty and similarity principles.
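
The fingerprint similarity mentioned above can be illustrated with a minimal sketch of the Tanimoto coefficient. It assumes fingerprints are given as sets of "on" bit indices; the bit sets and the function names `tanimoto` and `nearest_neighbor_similarity` are hypothetical and chosen for illustration. In practice the bits would come from a cheminformatics toolkit such as RDKit rather than being written by hand.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints (assumption)
    return len(a & b) / len(union)


def nearest_neighbor_similarity(query_bits, reference_fingerprints):
    """Highest Tanimoto similarity of a query to any reference fingerprint."""
    return max(tanimoto(query_bits, ref) for ref in reference_fingerprints)


# Hypothetical bit sets, for illustration only.
known_actives = [{1, 2, 3, 8}, {2, 3, 4}]
candidate = {2, 3, 9}
print(nearest_neighbor_similarity(candidate, known_actives))
```

A low nearest-neighbor value indicates a candidate that is structurally distant from every known active, which is one way novelty is typically assessed.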

Methodology

Target selection: PI3Kγ was chosen from Drug Target Commons (DTC) because of its abundant annotated data. After filtering on assay parameters (e.g., biochemical functional assays with competitive inhibitor type and specified units/conditions), 198 molecules remained, with conflicting duplicates removed.

Training data and preprocessing: Molecules were represented as canonical SMILES (RDKit), limited to ≤90 characters, with salts removed and duplicates filtered. The pretraining data comprised 839,674 unique patented pharmaceutical molecules (US patent database). The fine-tuning set consisted of PI3Kγ inhibitors with activity ≤100 nM from DTC; 46 were initially identified and 43 were retained after processing.

Generative CLM: Implemented as an LSTM-based recurrent neural network (Keras/TensorFlow). Architecture: BatchNormalization → 1024 LSTM → 256 LSTM → BatchNormalization (≈5.82M parameters). Training: Adam (learning rate 1e-3), categorical cross-entropy, 40 epochs over the pretraining data. Transfer learning: the parameters of the first layer were frozen and the second layer was trained at a learning rate of 1e-4.

Sampling strategies: Temperature sampling draws from a softmax scaled by temperature T. Nucleus (top-p) sampling restricts sampling to the smallest set of most probable characters whose cumulative probability reaches the threshold p (with T=1). The two strategies were compared empirically using validity, uniqueness, novelty, and Fréchet ChemNet Distance (FCD). For PI3Kγ-focused generation during transfer learning, nucleus sampling with a threshold of 0.85 was used. Sampling protocol: 5,000 SMILES per epoch over 50 epochs, repeated 10 times (2,500,000 total); 1,121,735 were valid, unique, and novel relative to the pretraining and fine-tuning sets.

Hybrid CLM classifier (bioactivity prediction): Activity was framed as an ordinal classification with three classes: inactive (pIC50 ≤4.0), moderately active (4.0<pIC50≤6.5), and highly active (pIC50>6.5).
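
The temperature and nucleus sampling strategies described in the Methodology above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the toy four-character vocabulary, and the logits are all assumptions made for the example. Only the threshold p = 0.85 and T = 1 come from the text.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, p=0.85):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, vocab, p=0.85, temperature=1.0, rng=random):
    """Draw one token from the nucleus of the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = nucleus_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return vocab[rng.choices(indices, weights=weights, k=1)[0]]


# Hypothetical next-character distribution over a toy SMILES vocabulary.
vocab = ["C", "c", "N", "("]
logits = [math.log(x) for x in (0.6, 0.3, 0.07, 0.03)]
print(sample_token(logits, vocab, p=0.85))
```

With these toy values, only "C" and "c" fall inside the nucleus, so the low-probability tail is never sampled. Temperature sampling with T = 1 would instead leave every character reachable.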
Two pretraining strategies were compared: (1) autoregressive pretraining (the same as for the generative CLM), and (2) ELECTRA-style pretraining (E-CLM), adapted to SMILES, in which an LSTM encoder learns to discriminate original from corrupted tokens in SMILES strings.

Classifier architecture: A pretrained encoder (CLM or E-CLM) plus an added feedforward dense layer with three sigmoid units for ordinal classification. To address class imbalance, oversampling was applied to the underrepresented classes (“inactive” and “highly active”).

Cross-validation: k-means clustering (k=5) on Morgan fingerprints partitioned the fine-tuning molecules (four folds for training/validation, one for testing); the best settings were selected on the single-use test set (oversampling: +40 molecules to the two minority classes; sigmoid threshold 0.4; 200 transfer-learning epochs).

Final ensemble: 100 E-CLM models were trained on all data with Adam (learning rate 1e-4), binary cross-entropy, and 200 epochs; architecture: BatchNorm → 1024 LSTM → 256 LSTM → BatchNorm → Dropout → Dense(3, sigmoid) with ~5.65M parameters. Prediction confidence was defined as the number of ensemble votes classifying a molecule as “highly active”.

Commercial library screening baseline: Similarity-based virtual screening using Morgan fingerprints and Tanimoto similarity to the fine-tuning molecules; fusion via reciprocal sum of ranks (score S).

Docking: The PI3Kγ protein structure (PDB 3ENE) was prepared in MOE 2019.0102 (QuickPrep, Protonate3D, water deletion beyond 4.5 Å, receptor tethering/refinement). Docking was run with GOLD v5.2.2 (default efficiency, 80 poses/compound, induced fit). Poses were refined with MOE GBVI/WSA ΔG and the top-scoring pose was selected. GoldScore/GBVI-WSA estimated binding energies; redocking of the co-crystal ligand yielded RMSD 0.448 Å and ΔG ≈ -33.93 kJ/mol.

Biochemical assay: PI3Kγ binding (KINOMEscan) was measured by Eurofins Discovery using an 11-point, 3-fold serial dilution in DMSO (final 1% DMSO). Ki was estimated via a Hill equation fit (Hill slope fixed at -1) using nonlinear least squares (Levenberg–Marquardt).

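
The ordinal classifier and the ensemble confidence described above can be illustrated with a small sketch. The summary does not say how the three sigmoid outputs map to classes, so the cumulative encoding used here ([1,0,0] inactive, [1,1,0] moderately active, [1,1,1] highly active) is an assumption, as are all function names and the example outputs. The pIC50 cut-offs and the 0.4 threshold are taken from the text.

```python
CLASSES = ("inactive", "moderately active", "highly active")


def pic50_to_class(pic50):
    """Map a pIC50 value to a class index using the cut-offs from the text."""
    if pic50 <= 4.0:
        return 0
    if pic50 <= 6.5:
        return 1
    return 2


def ordinal_target(class_index):
    """ASSUMED cumulative encoding: class k switches on units 0..k."""
    return [1.0 if unit <= class_index else 0.0 for unit in range(len(CLASSES))]


def decode(outputs, threshold=0.4):
    """Count leading sigmoid outputs at or above the threshold (assumed rule)."""
    rank = 0
    for value in outputs:
        if value < threshold:
            break
        rank += 1
    return max(rank - 1, 0)


def ensemble_votes(per_model_outputs, threshold=0.4):
    """Confidence = number of models that predict 'highly active'."""
    highly_active = len(CLASSES) - 1
    return sum(decode(o, threshold) == highly_active for o in per_model_outputs)


# Hypothetical sigmoid outputs from three ensemble members for one molecule.
outputs = [[0.9, 0.9, 0.9], [0.9, 0.9, 0.1], [0.9, 0.5, 0.41]]
print(ensemble_votes(outputs), "of", len(outputs), "votes for highly active")
```

In the study the same count is taken over 100 models, so a molecule with 99 votes is one that nearly every ensemble member places in the top class.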
Cell-based assays: HD-MB03 medulloblastoma cells cultured in RPMI + 10% FBS, antibiotic, GlutaMAX. For signaling: serum-starved 48 h, pretreated 3 h with DMSO, 100 nM copanlisib (control), or compounds, then stimulated 15 min with 10 ng/mL EGF. Western blot probed p-AKT (S473), total AKT, and GAPDH. Viability: CellTiter-Glo after 72 h exposure to increasing concentrations in FBS-containing medium; nonlinear fit for IC50 estimation.
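
The dose-response fits mentioned in the assay descriptions above can be sketched as follows. This is an illustration under stated assumptions, not the vendor's or the authors' procedure: the functional form is a standard Hill curve written so that a slope of -1 gives a response that falls with concentration, the top concentration of 30,000 nM is hypothetical, and a plain log-spaced grid search stands in for the Levenberg–Marquardt optimizer named in the text.

```python
def dilution_series(top_conc, points=11, factor=3.0):
    """Concentrations for an 11-point, 3-fold serial dilution."""
    return [top_conc / factor**i for i in range(points)]


def hill_response(conc, k, slope=-1.0, bottom=0.0, top=1.0):
    """Assumed Hill form: the response is halfway between top and bottom at conc == k."""
    return bottom + (top - bottom) / (1.0 + (k / conc) ** slope)


def fit_k(concs, responses, k_min=1e-1, k_max=1e5, steps=2000):
    """Least-squares estimate of k by grid search over a log-spaced range.
    (A simple stand-in for Levenberg-Marquardt, used here to avoid dependencies.)"""
    best_k, best_sse = None, float("inf")
    for i in range(steps + 1):
        k = k_min * (k_max / k_min) ** (i / steps)
        sse = sum((hill_response(c, k) - r) ** 2 for c, r in zip(concs, responses))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k


# Synthetic, noise-free data generated with k = 640 nM (hypothetical top dose).
concs_nm = dilution_series(30000.0)
observed = [hill_response(c, 640.0) for c in concs_nm]
print(round(fit_k(concs_nm, observed)))
```

Fixing the slope leaves a single free parameter, which is why a one-dimensional search is enough for this sketch. Real data would also need the top and bottom plateaus estimated or fixed from controls.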

Key Findings

- Nucleus sampling improved the quality of generated SMILES during transfer learning: at a threshold of 0.85 it yielded lower FCD than temperature sampling and higher novelty. From 2.5 million sampled SMILES, 1,121,735 were valid, unique, and novel relative to the training/fine-tuning data.
- For ordinal bioactivity classification, the ELECTRA-pretrained E-CLM outperformed the autoregressive CLM in identifying highly active molecules with fewer false positives. At a 0.4 threshold and 71.3% TPR, E-CLM achieved a 10.0% FPR versus 46.7% for CLM.
- An ensemble of 100 E-CLM classifiers increased prediction confidence. As the vote threshold increased, fewer molecules were predicted highly active. Forty-seven de novo designs received 99/100 votes; among these, 64% had new atom scaffolds and 62% new graph scaffolds relative to the fine-tuning set.
- Commercially available designs: 16 generated molecules were purchased and tested. One hit (compound 1) showed submicromolar binding to PI3Kγ, with a binding constant of ≈ 0.6–0.7 µM (N=2: 670 nM and 620 nM), consistent with the Ki ≈ 640 nM reported in Fig. 4. All lower-ranked molecules were inactive. Ensemble confidence correctly prioritized active compound 1 over inactive close analogs with the same core but different substituents.
- De novo synthesis and testing: Two top-ranked designs (17 and 20; 99/100 votes) and derivatives (18, 19, 21, 22) were synthesized. Biochemical binding (Ki; mean of N=2) showed nanomolar potency: 17 (63 nM), 18 (52 nM), 19 (160 nM), 20 (120 nM), 21 (290 nM), 22 (13 nM). These potencies are consistent with the higher ensemble confidence of the top-ranked designs compared with compound 1 (80/100 votes).
- Docking suggested key hinge hydrogen bonds (Glu880, Val882) for 4-aminopyrazolopyrimidines (17–22) and generally lower estimated binding energies than compound 1; however, docking failed to correctly rank 17–22 by Ki.
- Cellular efficacy: In HD-MB03 cells, compounds 18 and 22 at 100 nM prevented EGF-induced AKT S473 phosphorylation, reducing p-AKT by ~70–90%, comparable to 100 nM copanlisib. Neither compound reduced viability at 100 nM. Estimated CellTiter-Glo IC50 values: ~172 nM (18), ~188 nM (22), and ~19 nM (copanlisib).
- The pipeline enabled scaffold hopping: compound 1 introduced a new atom scaffold relative to known PI3Kγ actives (ChEMBL v28), with a nearest Tanimoto similarity of ~34%. Similarity-based screening ranked it 25,693rd, underscoring the advantage of the CLM+E-CLM approach.
- Generative design reproduced a known generic scaffold (S1) with novel sidechains; target-prediction (TIGER) scores supported selecting between top-ranked designs when ensemble scores were indistinguishable.
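
To relate the measured potencies above to the classifier's activity classes, the nanomolar values can be converted to a negative log scale. The sketch below is a worked arithmetic example only. It assumes that measured Ki values may be placed on the same scale as the pIC50 cut-offs used for training, which is a simplification, since Ki and IC50 are not strictly interchangeable. The dictionary and function names are hypothetical.

```python
import math

# Potencies in nM as listed in the findings above (compound 1 uses the ~640 nM value).
MEASURED_NM = {"1": 640, "17": 63, "18": 52, "19": 160, "20": 120, "21": 290, "22": 13}


def nm_to_pk(value_nm):
    """-log10 of a molar constant given in nM: pK = 9 - log10(value in nM)."""
    return 9.0 - math.log10(value_nm)


def activity_class(pk):
    """Class cut-offs from the Methodology, applied to pK as an assumption."""
    if pk <= 4.0:
        return "inactive"
    if pk <= 6.5:
        return "moderately active"
    return "highly active"


summary = {
    name: (round(nm_to_pk(nm), 2), activity_class(nm_to_pk(nm)))
    for name, nm in MEASURED_NM.items()
}
for name, (pk, label) in summary.items():
    print(f"compound {name}: pK = {pk}, {label}")
```

Under this simplification, compounds 17–22 all land above the 6.5 cut-off, while compound 1 falls just below it at roughly 6.2.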

Discussion

The work demonstrates that hybrid chemical language modeling can address both de novo molecular generation and activity-focused prioritization. By combining LSTM-based CLMs for structure generation with an ELECTRA-pretrained classifier fine-tuned on ordinal activity labels, the pipeline effectively narrows a large design space to candidates with a higher likelihood of activity. Nucleus sampling improved sampling quality and alignment with pretraining distributions during transfer learning, leading to a sizable set of valid, unique, and novel molecules focused on PI3Kγ. E-CLM’s lower false positive rates at comparable true positive rates indicate that discriminative pretraining tailored to detecting subtle token replacements is better suited for classifying small structural variations—a key requirement in ligand series where small changes can cause large activity shifts (activity cliffs). Ensemble voting provided a practical means to quantify prediction confidence and correlated with increased similarity to fine-tuning ligands, consistent with similarity–activity principles. The approach enabled both scaffold hopping (compound 1) and potent analog generation (17–22), bridging hit finding and hit-to-lead expansion. Biochemical and cell-based validations confirmed that top-ranked designs bound PI3Kγ with nanomolar potency and repressed downstream AKT signaling in tumor cells without acute cytotoxicity at active concentrations. While docking rationalized hinge-binding interactions for 4-aminopyrazolopyrimidines, its quantitative ranking limitations highlight the benefit of data-driven design complemented by, but not dependent on, docking scores. Integrating external target-prediction tools (TIGER) helped differentiate among top-ranked de novo designs with similar ensemble confidence, suggesting a multi-scorer strategy can further refine selection.

Conclusion

This study introduces a twin-CLM pipeline that integrates generative molecular design with an ELECTRA-pretrained, ordinal bioactivity classifier to generate and prioritize PI3Kγ ligands. Methodological advances—nucleus sampling for transfer learning and discriminative pretraining for classification—yielded a focused, novel chemical library and accurate activity ranking with uncertainty estimates via deep ensembles. Prospective validation identified a new scaffold with submicromolar activity from commercially available compounds and led to synthesized de novo designs with medium-to-low nanomolar potency that repressed PI3K/AKT signaling in tumor cells. Future work should systematically assess the impact of different pretraining corpora and methods on downstream generation vs classification, improve resolution among closely related potent analogs, and extend the pipeline to multi-parameter optimization (e.g., permeability, solubility, off-target profiles). A practical strategy is to first validate de novo designs against physically available libraries before committing to synthesis, potentially accelerating the design–make–test–analyze cycle across diverse target families.

Limitations

- Inconsistent counts of fine-tuning molecules (46 in Results vs 43 in Methods after processing) highlight dataset curation sensitivities.
- The hypothesized advantage of patent-based pretraining for eventual drug-likeness could not be conclusively evaluated within the study timeframe.
- The classifier struggled to differentiate among structurally very similar, highly potent analogs; external target-prediction support (TIGER) was needed to prioritize among top-equal ensemble scores.
- Docking provided qualitative rationale but failed to quantitatively rank 4-aminopyrazolopyrimidines by experimental Ki, underscoring limits of physics-based scoring for this series.
- Validation focused on in vitro biochemical binding and a single cell model; broader pharmacology, ADME, and off-target effects were not addressed.
- Generalizability to other targets and superiority over alternative methods remain unproven and require additional prospective studies.
