Autonomous molecule generation using reinforcement learning and docking to develop potential novel inhibitors

W. Jeon and D. Kim

This research introduces MORLD, a computational method by Woosung Jeon and Dongsup Kim that accelerates drug discovery by autonomously generating and optimizing lead compounds. MORLD combines reinforcement learning with efficient docking simulations, modifying leads to improve predicted binding affinity in under two days.

Introduction
The study addresses the challenge of autonomous, de novo drug design using artificial intelligence without relying on target-specific experimental data, which are often unavailable for new targets. Prior AI-driven approaches (e.g., QSAR, molecular simulations, docking) and recent deep generative models (VAEs, GANs, RL) have shown promise, including rapid DDR1 inhibitor discovery by GENTRL. However, many methods require substantial experimental training data and may produce molecules similar to known actives. Ultra-large virtual screening (e.g., for D4DR) can find new chemotypes but demands significant expertise and computation. The purpose here is to develop a method that, given only a target protein 3D structure, autonomously modifies or generates ligands to improve predicted binding affinity while maintaining synthesizability and drug-likeness, operating efficiently on moderate hardware. The significance lies in enabling rapid design for novel targets without prior activity data, potentially complementing or replacing large-scale virtual screening.
Literature Review
The paper situates MORLD among AI-driven de novo design and optimization methods. Generative models (VAEs, GANs, RL) have been used to design drug-like compounds; GENTRL demonstrated fast DDR1 inhibitor discovery but relied on activity data and produced molecules similar to the training set. Ultra-large library docking identified D4DR agonists but is computationally intensive. Fragment- and GA-based tools (GANDI, MoleGear, LigBuilder, AutoGrow4) optimize by combining predefined fragments and reactions. De novo DOCK builds ligands via anchor-and-grow. In contrast, MORLD is atom-based and integrates reinforcement learning with docking, requiring only the target structure. It also contrasts with MolDQN by incorporating docking-derived rewards at terminal steps and additional constraints (SA, QED) during intermediate steps.
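To make this reward structure concrete, here is a minimal Python sketch of MORLD-style reward shaping, assuming RDKit for QED and the synthetic accessibility scorer from RDKit's Contrib directory. The step-weighting factor (t/T) and the run_quickvina2 wrapper are illustrative assumptions rather than the authors' implementation, and the SA normalization simply maps the 1–10 RDKit score to [0, 1] with higher meaning easier to synthesize.

```python
import os
import sys

from rdkit import Chem, RDConfig
from rdkit.Chem import QED

# RDKit ships the synthetic accessibility scorer in its Contrib directory.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # noqa: E402


def normalized_sa(mol):
    """Map the raw SA score (1 = easy ... 10 = hard) to [0, 1], higher = easier."""
    return (10.0 - sascorer.calculateScore(mol)) / 9.0


def intermediate_reward(smiles, t, T, w_sa=1.0, w_qed=1.0):
    """Reward for non-terminal steps (t < T): weighted SA + QED, emphasized later in the episode."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0
    step_weight = t / float(T)  # assumption: a simple factor that grows with the step index
    return step_weight * (w_sa * normalized_sa(mol) + w_qed * QED.qed(mol))


def terminal_reward(smiles, run_quickvina2):
    """Reward at t = T: negative predicted binding energy from docking."""
    binding_energy = run_quickvina2(smiles)  # hypothetical wrapper around QuickVina 2
    return -binding_energy  # more negative energy (stronger binding) -> larger reward
```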
Methodology
MORLD performs iterative molecule-optimization episodes composed of T modification steps. At each step, MolDQN proposes chemically valid actions (adding or removing an atom or bond) constrained by valence rules checked with RDKit; allowed atom types are user-specified (e.g., C, N, O, F). Action selection follows a decaying epsilon-greedy policy with Double Q-learning and bootstrapped DQN to balance exploration and exploitation. For non-terminal states (t < T), molecules are scored by synthetic accessibility (SA, normalized so that higher means easier to synthesize) and the quantitative estimate of drug-likeness (QED); a weighted sum of SA and QED (both weights set to 1 in this study), scaled by a factor that emphasizes later steps, provides the intermediate reward. At the terminal state (t = T), the ligand is docked against the target with QuickVina 2, and the negative docking score (predicted binding energy) is the reward. Across episodes, the Q-function is updated to reduce the gap between predicted and observed returns, so MORLD learns modification policies that improve final docking scores while maintaining synthesizability and drug-likeness.

Docking setup: Ligand 3D conformations are generated from SMILES with Open Babel and protonated at pH 7. Target proteins are prepared from PDB structures, protonated with PDB2PQR (pH 7), and converted back to PDB format; binding sites are derived from the native ligands. QuickVina 2 computes predicted binding energies within predefined grids. For evaluation outside the training loop, additional docking programs (AutoDock Vina, rDock, LeDock) provide consensus and robustness checks.

Reward design: The intermediate reward r(s, t) combines SA and QED with step weighting that emphasizes later steps; the terminal reward is the negative QuickVina 2 docking score. The SA and QED weights were set to 1 in the experiments but are user-configurable. Docking is performed only at terminal steps to reduce computational cost.

Experimental design: Targets included DDR1 (PDB 3ZOS) and D4DR (PDB 5WIU). Three scenarios were tested: (i) starting from a known active (e.g., ponatinib for DDR1), (ii) starting from a small-scale virtual-screening hit (e.g., ZINC12114041 for DDR1; ZINC12203131 for D4DR), and (iii) de novo generation without an initial lead (for D4DR). Docking grids were set from the co-crystal binding sites (3ZOS: grid center −7.5, 2.5, −40; size 24×20×20 Å; 5WIU: grid center −17, 15, −18; size 24×12×24 Å). Hyperparameters varied by task (e.g., 20–48 steps; 7,000–20,000 episodes; atom types C, N, O, F; SA and QED weights of 1). Performance was benchmarked against a random-action control and, for PARP-1, against AutoGrow4 using comparable leads and docking conditions.

Sampling/selection: From the many generated molecules, candidates were first sorted by QuickVina 2 docking score, then filtered for acceptable SA and QED, and finally inspected visually for chemical reasonableness (a sketch of this step follows below). The web server returns all three scores to facilitate user selection.

Hardware and runtime: Training and docking ran on a single NVIDIA RTX 2080 Ti GPU and Intel Xeon CPUs (20–24 cores). A typical optimization run (e.g., starting from ponatinib) completed in under two days.
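As an illustration of the sampling/selection step, the sketch below ranks generated molecules by their QuickVina 2 docking score and then filters on SA and QED before visual inspection. The Candidate fields and the top_n, sa_min, and qed_min thresholds are assumptions chosen for illustration, not values prescribed by the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    smiles: str
    docking_score: float  # QuickVina 2 binding energy in kcal/mol (more negative = better)
    sa: float             # normalized synthetic accessibility, higher = easier
    qed: float            # drug-likeness in [0, 1]


def select_candidates(candidates: List[Candidate],
                      top_n: int = 100,
                      sa_min: float = 0.5,
                      qed_min: float = 0.4) -> List[Candidate]:
    """Sort by docking score, then keep only molecules with acceptable SA and QED."""
    ranked = sorted(candidates, key=lambda c: c.docking_score)  # ascending binding energy
    return [c for c in ranked[:top_n] if c.sa >= sa_min and c.qed >= qed_min]
```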
Key Findings
- Learning and property optimization: Compared to a random-action control, MORLD progressively improved terminal docking scores, SA, and QED over episodes for DDR1. Early episodes resembled random search because of high exploration; later, MORLD consistently outperformed the control as epsilon decayed and the policy converged.
- Docking score improvements: For DDR1 starting from ponatinib, optimized molecules achieved docking scores around −15.9 kcal/mol (QuickVina 2), improving on the initial −12.7 kcal/mol. The average docking score approached about −16 kcal/mol by the end of training, roughly 3 kcal/mol better than the lead.
- Diversity and redundancy: The number of unique compounds per 100 episodes initially increased but decreased after ~3,000 episodes; by ~7,000 episodes, ~80% of generated compounds were redundant, reflecting policy convergence. Nevertheless, non-redundant molecules showed a broad similarity distribution to the lead (Tanimoto ECFP similarity ~0.5–0.8; mean 0.644 vs 0.587 for random; Wilcoxon rank-sum p < 1e−10).
- DDR1 inhibitors: Using the “Parent structure” from Zhavoronkov et al. or a VS-derived lead (ZINC12114041), MORLD produced molecules whose AutoDock Vina, rDock, and LeDock scores were better than or comparable to those of experimentally validated compounds from Zhavoronkov et al. For example, L_Sample1–3 and V_Sample1–3 showed Vina scores around −12.4 to −13.1 kcal/mol, outperforming several benchmarks; docking preserved the expected ranking across compounds (active > moderately active > inactive), supporting the plausibility of the docking scores.
- D4DR agonists: MORLD generated candidates from scratch (the “None” scenario) and from a VS hit (ZINC12203131). Some de novo and optimized molecules displayed docking scores comparable to or better than those of active compounds reported by Lyu et al. According to multiple docking programs and SA/QED assessments, MORLD could propose potent agonists even without initial leads.
- Pose and interaction analysis: Optimized DDR1 molecules (e.g., P_sample1) preserved key interactions observed in the ponatinib co-crystal (e.g., residues A653, K655, L679) and introduced new favorable interactions (e.g., E672, F762, R789), rationalizing the improved docking scores. Molecules derived from non-native leads (e.g., V_sample1) showed newly formed interactions (e.g., I684, I701, I782) compared to the initial VS hits.
- Benchmark vs AutoGrow4 (PARP-1): Starting from the same E7449 fragment and docking with the same QuickVina 2 settings, a MORLD sample (A_sample1) achieved −14.6 kcal/mol vs −12.2 kcal/mol for AutoGrow4’s Compound 3 (under MORLD’s docking protocol), with better SA (0.65 vs 0.54) and QED (0.56 vs 0.51), while preserving known key interactions (π-π with Y907/Y896; H-bond with G863).
- Efficiency: Full design cycles completed in under two days on moderate hardware, significantly faster than prior reports (e.g., the 21-day generation phase in GENTRL).
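The similarity figures reported above can be reproduced in outline with RDKit. A minimal sketch, assuming ECFP-style Morgan fingerprints of radius 2 (roughly ECFP4) on 2048 bits; the SMILES strings in the example are placeholders, not molecules from the study.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def ecfp_tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto similarity between two molecules using ECFP-style Morgan fingerprints."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


# Example with placeholder SMILES (a lead and a hypothetical modified molecule).
print(ecfp_tanimoto("CCOc1ccccc1", "CCOc1ccccc1N"))
```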
Discussion
The findings demonstrate that integrating reinforcement learning with docking as terminal rewards enables autonomous optimization toward improved predicted binding affinity while balancing synthesizability and drug-likeness. MORLD learns modification policies that outperform random search, reliably enhancing docking scores and maintaining acceptable SA/QED. Across DDR1 and D4DR, generated molecules were competitive with experimentally validated actives by multiple docking programs, suggesting generality across targets and task setups (lead optimization, VS-hit optimization, and de novo generation). The observed reduction in unique outputs over time reflects policy convergence; while it limits diversity, it indicates stable discovery of high-scoring chemotypes. Pose analyses support that improvements are chemically plausible, preserving or introducing favorable interactions in target binding sites. Compared to fragment-based GA methods and ultra-large virtual screening, MORLD requires fewer resources and no target-specific activity data, potentially serving as a practical alternative or complement for early-stage discovery on novel targets. Nonetheless, reliance on docking scores and 2D fingerprints introduces uncertainties, and practical deployment should incorporate orthogonal validation and diversity controls.
Conclusion
MORLD is an autonomous molecule generation and optimization framework that, given only a target 3D structure, modifies or builds ligands to improve predicted binding affinity, while encouraging synthesizability and drug-likeness. It efficiently discovers candidate inhibitors or agonists without target-specific training data, as shown for DDR1 and D4DR, and can operate across scenarios ranging from lead optimization to de novo design. Benchmarks indicate improved docking scores and favorable SA/QED compared to controls and some existing methods, with rapid runtimes on modest hardware and accessible deployment via a public web server. Future work could address limitations by integrating more physics-informed or ML-based scoring, incorporating 3D-aware molecular representations in the policy network, adding explicit diversity-promoting objectives, and constraining chemistry via reaction rules or medicinal chemistry filters to avoid undesirable substructures.
Limitations
- Docking limitations: Docking scores correlate imperfectly with true binding affinity, so improved scores do not guarantee activity. Consensus scoring helps but does not fully mitigate this.
- Target structure dependence: MORLD requires a 3D structure with a druggable pocket; it is not applicable to targets lacking structures or to intrinsically disordered regions.
- Representation constraints: Q-values are based on ECFP (2D) fingerprints, which do not fully capture 3D conformations or stereochemistry, potentially limiting policy learning.
- Diversity collapse: The RL policy tends to converge to a single optimum, reducing output diversity and increasing redundancy in later episodes.
- Search space and chemistry: Atom-based edits face combinatorial explosion, and outcomes depend on the initial search direction and episode count. Without stricter chemistry constraints, the model may generate chemically questionable substructures; SA and QED are imperfect proxies.
- Generalizability: Performance was assessed via docking and property proxies; experimental validation is needed to confirm activity and developability.