
Chemistry
Identifying opportunities for late-stage C-H alkylation with high-throughput experimentation and in silico reaction screening
D. F. Nippa, K. Atz, et al.
This study combines high-throughput experimentation with graph-neural-network-based in silico reaction screening to identify substrates suited to late-stage C–H alkylation, leading to the synthesis of 30 novel molecules.
Introduction
The synthesis of novel compounds is often the rate-limiting step in small-molecule drug discovery. Late-stage functionalization (LSF) enables rapid diversification by introducing new functional groups directly into complex drug-like molecules without de novo synthesis, facilitating SAR exploration and optimization of ADME properties at lower synthetic cost. Predicting which molecules are amenable to specific LSF reactions remains challenging due to functional group incompatibilities and complex electronic/steric environments in advanced substrates. Minisci-type alkylations provide a valuable route to introduce sp3-rich alkyl groups into electron-deficient heteroarenes using radical intermediates generated from carboxylic acids, broadening medicinal chemistry space with increased 3D character. However, practical prediction of Minisci reactivity and scope across complex substrates is difficult, and generating sufficiently rich training data at conventional scales is resource-intensive. High-throughput experimentation (HTE) allows systematic miniaturized reaction screening and, when combined with rigorous data curation, can generate machine-learning-ready datasets. Graph neural networks (GNNs), particularly those leveraging 3D molecular information, have shown promise for reaction prediction and LSF tasks. In this study, the authors develop and apply GNNs trained on curated Minisci reaction data from literature, HTE, and decoy reactions to virtually screen thousands of heterocyclic building blocks against sp3-rich carboxylic acids, guiding experimental selection and scale-up to identify new opportunities for late-stage C–H alkylation.
Literature Review
A systematic literature analysis of Minisci-type alkylations surveyed 45 publications to identify conditions suitable for miniaturized, parallel HTE without specialized photo- or electrochemical equipment. Sutherland et al. reported a metal-, photocatalyst-, and light-free protocol using carboxylic acids as radical precursors that met criteria for robustness and adaptability. Reaction data from the literature were manually curated and standardized into the SURF (Simple User-friendly Reaction Format), enabling direct machine learning ingestion without further curation. The literature set informed the design of a 24-well screening plate featuring 23 sp3-rich carboxylic acids (n-alkyls, cyclic alkanes, O- and N-heterocycles) relevant to drug discovery.
Methodology
HTE reaction screening and data generation: The Sutherland Minisci protocol was downscaled 300-fold from 150 µmol to 0.5 µmol in a 24-well parallel plate format. Reactions were performed under nitrogen in a glovebox, with DMSO stock solutions for all components to ensure accurate nanomole dosing and mixing. Optimization showed the best conversion at 40 °C; higher temperatures favored di-alkylation. Doubling the alkyl carboxylic acid (to 20 equiv) and oxidant (to 6 equiv) improved conversion by 1.2–1.5×. A fixed reference reaction (quinoline 1 + acid e in well B4) served as a plate quality control. Each substrate was tested against 23 carboxylic acids emphasizing compact sp3 ring systems. A reaction was labeled successful if LCMS detected ≥5% mono- or di-alkylation (yields of mono/di regioisomers were combined) and unsuccessful otherwise. The initial experimental set comprised drugs and fragments from an LSF informer library plus additional fragments, yielding 691 reactions (379 successful, 312 unsuccessful). A separate decoy set of 368 unsuccessful reactions from 16 non-reactive substrates (lacking suitable aromatic/heteroaromatic motifs) was curated to balance the predominantly positive literature/experimental data.
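The labeling rule described above (combined mono- and di-alkylation of at least 5% counts as a success) can be sketched in a few lines. This is only an illustration: the `WellResult` class, its field names, and the example yields are hypothetical and are not taken from the authors' data or code.

```python
from dataclasses import dataclass

SUCCESS_THRESHOLD = 0.05  # 5% combined LCMS conversion, as stated in the text


@dataclass
class WellResult:
    """One well of a screening plate (hypothetical record layout)."""
    well: str          # plate position, e.g. "B4"
    mono_yield: float  # fraction (0-1) of mono-alkylated product
    di_yield: float    # fraction (0-1) of di-alkylated product


def combined_yield(result: WellResult) -> float:
    """Add mono- and di-alkylation yields, mirroring the combined-yield rule."""
    return result.mono_yield + result.di_yield


def is_successful(result: WellResult) -> bool:
    """Label a reaction successful when the combined yield reaches the threshold."""
    return combined_yield(result) >= SUCCESS_THRESHOLD


# Made-up example values, not measured data.
plate = [
    WellResult("A1", 0.03, 0.01),
    WellResult("A2", 0.04, 0.02),
    WellResult("B4", 0.30, 0.05),
]
labels = {r.well: is_successful(r) for r in plate}
print(labels)
```

Keeping the threshold as a named constant makes it easy to test how sensitive the success/failure balance is to the 5% cut-off.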
In silico model development: A graph transformer neural network (GTNN) based on an E(3)-equivariant architecture was adapted to take two variable molecular graph inputs (N-heteroarene and carboxylic acid) plus encoded reaction conditions. 3D conformers (UFF) defined the graph nodes (atoms with embeddings for element, ring membership, aromaticity, hybridization) and edges (neighbors within 4 Å), with interatomic distances encoded via Fourier features. Three message-passing layers produced atomic features, which a graph multiset transformer pooled into molecular representations. Each reactant was processed by a separate GNN module (only the initial embeddings were shared), and the two molecular representations were concatenated with a learned reaction-condition embedding (one-hot for components; real-valued scalars for equivalents, solvent fractions, temperature, time, concentration). Final MLP heads addressed two prediction tasks: reaction yield (0–1) and binary outcome (0/1). Adding partial charges gave no performance gain, so 3D graphs without electronic features were used.
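The graph construction described above (edges between atoms within 4 Å, distances expanded into Fourier features) can be illustrated with a minimal pure-Python sketch. The exact form of the Fourier encoding used by the authors is not given here, so the sin/cos expansion, the `num_freqs` parameter, and the example coordinates below are assumptions for illustration only.

```python
import math
from itertools import combinations

CUTOFF = 4.0  # Å, neighbor cutoff stated in the text


def radius_edges(coords, cutoff=CUTOFF):
    """Return (i, j, distance) for every atom pair no farther apart than the cutoff."""
    edges = []
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])
        if d <= cutoff:
            edges.append((i, j, d))
    return edges


def fourier_features(distance, num_freqs=4, cutoff=CUTOFF):
    """Encode a distance as sin/cos pairs at increasing frequencies (assumed form)."""
    feats = []
    for k in range(1, num_freqs + 1):
        angle = k * math.pi * distance / cutoff
        feats.extend((math.sin(angle), math.cos(angle)))
    return feats


# Three made-up atom positions on a line (in Å); only the first pair is within 4 Å.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (6.0, 0.0, 0.0)]
edges = radius_edges(coords)
edge_features = [fourier_features(d) for _, _, d in edges]
print(edges, len(edge_features[0]))
```

A distance-cutoff graph of this kind connects atoms that are close in space even when they are not bonded, which is how 3D conformer information enters the message-passing layers.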
Training data and validation: Initial training used 621 reactions (368 decoys, 45 literature, 207 HTE). For applications, six models were trained (three yield, three binary). Validation on the complete 691-reaction experimental dataset (random split) yielded MAE 18.7±0.2% and Pearson r 0.687±0.006 for yields; yield-category accuracy 55.7±0.7% across four bins; binary accuracy 81±1% and F1 82.7±0.6%.
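For readers less familiar with the reported metrics, the following sketch shows how MAE, Pearson r, binary accuracy, and F1 are computed from predictions and labels. It uses the standard textbook definitions; the example numbers are invented, and nothing here reproduces the authors' evaluation code or data splits.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted yields."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def accuracy_and_f1(labels, preds):
    """Binary accuracy and F1 score, with 1 = successful and 0 = unsuccessful."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    tn = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    accuracy = (tp + tn) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Invented toy values, expressed as yield fractions (0-1) and binary outcomes.
true_yields = [0.0, 0.2, 0.4, 0.6]
pred_yields = [0.1, 0.2, 0.3, 0.7]
mae = mean_absolute_error(true_yields, pred_yields)
r = pearson_r(true_yields, pred_yields)
acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(mae, r, acc, f1)
```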
Virtual screening and experimental validation: The trained ensemble virtually screened 3180 Roche heterocyclic building blocks against 23 carboxylic acids. Each substrate’s score averaged outputs from the six models (yield and outcome predictions), with uncertainty estimated from model variance. Agglomerative clustering on ECFP4 Jaccard similarity produced eight clusters; two were discarded due to structural unsuitability (no free C–H). From the remaining six clusters, three top-scoring molecules each were selected (18 total) for HTE validation. Automated HTE generated 414 reaction data points, with 276 successful alkylations across the 18 substrates (17 of 18 had ≥10 successful transformations; 94% selection success). Reactivity trends across acids and N-heteroarenes were analyzed.
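The selection logic above (average the six model outputs per substrate, use their variance as an uncertainty estimate, then take the top-scoring molecules in each cluster) can be sketched as follows. This is a simplified illustration: it assumes each substrate already has one value per model (how the authors aggregated across the 23 acids is not specified here), it takes cluster labels as given rather than running agglomerative clustering, and all substrate names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def jaccard(bits_a, bits_b):
    """Jaccard similarity between two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


def ensemble_score(predictions):
    """Return (mean, variance) of the per-model predictions for one substrate."""
    return mean(predictions), pvariance(predictions)


def top_per_cluster(scores, clusters, n=3):
    """Pick the n highest-scoring substrates within each cluster."""
    grouped = defaultdict(list)
    for substrate, score in scores.items():
        grouped[clusters[substrate]].append((score, substrate))
    return {
        cluster: [name for _, name in sorted(members, reverse=True)[:n]]
        for cluster, members in grouped.items()
    }


# Hypothetical outputs: three yield models (0-1) followed by three binary models (0/1).
model_outputs = {
    "het_A": [0.50, 0.50, 0.50, 1.0, 1.0, 1.0],
    "het_B": [0.25, 0.25, 0.25, 0.0, 0.0, 0.0],
    "het_C": [0.50, 0.50, 0.50, 0.0, 1.0, 1.0],
    "het_D": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
clusters = {"het_A": 0, "het_B": 0, "het_C": 0, "het_D": 1}  # assumed cluster labels

scored = {name: ensemble_score(preds) for name, preds in model_outputs.items()}
scores = {name: s for name, (s, _) in scored.items()}
selected = top_per_cluster(scores, clusters, n=2)
print(selected)
```

Selecting within clusters rather than from one global ranking keeps the chosen substrates structurally diverse, which matches the stated goal of picking three molecules from each of the six retained clusters.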
Scale-up: Selected hits (based on >40% conversion by UV) were scaled from microgram to milligram scale under nitrogen, purified by flash or RP-HPLC, and characterized by NMR and HRMS. Thirty novel alkylated products were isolated, including late-stage derivatives of Loratadine and Nevirapine and diverse fragment analogs. Regioselectivity generally followed Minisci guidelines, with notable exceptions in densely functionalized substrates and a sulfur-centered reaction (38e) yielding a thioether instead of pyridine alkylation.
Data and code: SURF-formatted datasets (literature, experimental, decoy) are provided as TSV files in Supplementary Data; reference implementation available at https://github.com/ETHmodlab/minisci.
Key Findings
- Reaction miniaturization: Minisci protocol successfully downscaled to 0.5 µmol in 24-well plates; best at 40 °C; increasing carboxylic acid to 20 equiv and oxidant to 6 equiv improved conversion by 1.2–1.5×.
- Data generated: Balanced experimental dataset of 691 reactions (379 successful, 312 unsuccessful). Initial model training used 621 reactions (368 decoys, 45 literature, 207 experimental).
- Virtual screening: 3180 heterocyclic building blocks screened against 23 sp3-rich carboxylic acids using an ensemble of six GNN models; clustering guided selection of 18 substrates.
- Experimental validation: 414 HTE reaction points yielded 276 successful Minisci alkylations; 17 of 18 substrates produced ≥10 successful reactions (94% selection success). Median yield of successful reactions was 26%.
- Model performance: Yield prediction MAE 18.7±0.2% with Pearson r 0.687±0.006; correct yield-category prediction in 55.7±0.7% of cases. Binary outcome accuracy 81±1% and F1 82.7±0.6%.
- Reactivity trends: Cyclic ethers and alkanes (e.g., acids u, s, a, b, e, g) showed higher success; cyclic Boc-protected amines (o, p, q, r, v) and amides (d) gave low yields (often 5–20%). Meta-unsubstituted pyridines outperformed meta-substituted analogs; electron-rich meta-substituents (amine/methoxy) depressed yields. Five-membered N-heterocycles (e.g., 2, 4, 9) exhibited very low conversion (≤4% average).
- Scale-up and products: 30 novel, fully characterized molecules synthesized, including multiple Loratadine derivatives and one Nevirapine derivative, and diverse fragment analogs (e.g., cyclohexyl, cyclobutyl, cyclic ether insertions). Regioselectivity generally aligned with Minisci rules, with exceptions in densely substituted pyridines and a sulfur-centered alkylation (38e).
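Several of the figures above follow from simple arithmetic, which the short check below makes explicit. All inputs are numbers quoted in this summary; the per-reaction hit rate of roughly 67% is derived here and is not a figure stated by the authors.

```python
# Counts quoted in the summary.
n_substrates, n_acids = 18, 23
n_successful_reactions = 276
n_successful_substrates = 17

n_reactions = n_substrates * n_acids                         # validation plate size
reaction_hit_rate = n_successful_reactions / n_reactions     # derived, about 0.667
substrate_success = n_successful_substrates / n_substrates   # about 0.944
downscale_factor = 150 / 0.5                                 # µmol before / after
experimental_total = 379 + 312                               # successful + unsuccessful

print(n_reactions, round(reaction_hit_rate, 3), round(substrate_success, 3))
print(downscale_factor, experimental_total)
```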
Discussion
The study addresses the challenge of prospectively identifying drug-like substrates suitable for Minisci-type late-stage C–H alkylation by combining nanomole-scale HTE with GNN-based in silico screening. Training on a compact, curated reaction set augmented with negative decoys enabled models to discriminate reactive from non-reactive combinations and prioritize substrates across a large library. The pipeline achieved high hit rates upon experimental validation (94% of selected substrates affording broad alkylation scope), demonstrating that limited yet representative datasets can guide meaningful reaction predictions. The model’s yield prediction correlated moderately with experiment, while binary outcome prediction was strong, suitable for triaging candidates. Observed reactivity trends—higher performance with certain sp3-rich acids, detrimental effects from meta-substitution and electron-donating groups on pyridines, and low reactivity of five-membered N-heterocycles—were captured sufficiently to steer substrate selection (e.g., de-prioritizing five-membered systems). The approach translated to practical synthesis, enabling mg-scale preparation and full characterization of 30 novel derivatives relevant for SAR exploration, thereby expanding chemical diversity toward higher sp3 content. These results validate the feasibility of integrating HTE-generated FAIR data with geometric deep learning to accelerate LSF opportunity identification and reduce experimental burden.
Conclusion
This work presents an end-to-end framework coupling miniaturized HTE with 3D GNN-based virtual reaction screening to identify opportunities for late-stage Minisci-type C–H alkylation on complex, drug-like heterocycles. The authors downscaled the reaction, curated SURF-formatted data, trained dual-input GTNNs, and prospectively screened 3180 building blocks to select 18 substrates, which yielded 276 successful alkylations and enabled isolation of 30 novel molecules. The models delivered robust binary outcome predictions and moderate yield predictions, sufficient for prioritization and scale-up. Future work should broaden reaction condition space (oxidants, solvents), incorporate photoredox and electrochemistry, diversify radical precursors, and extend substrate scope (particularly five-membered heterocycles). Incorporating richer electronic or transition-state features may further improve regioselectivity and yield predictions, and multi-output modeling could capture mono/di-alkylation distributions. Continuous data generation within the SURF framework will iteratively enhance model performance and applicability in medicinal chemistry.
Limitations
- Reaction scope limitations: Five-membered N-heterocycles showed very low reactivity, limiting generalizability across heterocycle classes. Densely functionalized pyridines exhibited regioselectivity deviations from literature guidelines.
- Functional group tolerance: Cyclic Boc-protected amines and amides often gave low yields; the method’s compatibility with broader functionalities remains constrained.
- Data scale and composition: Training relied on a relatively small dataset enriched with decoy negatives and literature positives; prediction performance, especially yield categorization (55.7% accuracy), leaves room for improvement.
- Experimental setup dependence: Optimal outcomes required glovebox operation under nitrogen and specific reagent equivalents; translation to different labs or conditions may affect reproducibility.
- Electronic descriptors: Models did not benefit from partial charges; absence of detailed quantum descriptors may limit fine-grained prediction of regioselectivity and yields.