Introduction
The synthesis of novel, complex chemical structures is crucial for establishing structure-activity relationships (SAR) in medicinal chemistry. SAR models guide drug discovery by pointing to modifications that improve the pharmacological activity and physicochemical properties of drug candidates. Synthesis is often the bottleneck in the design-make-test-analyze cycle, so efficient routes matter. Late-stage functionalization (LSF) methods, which modify C-H bonds in advanced drug molecules, offer a solution. Among these, C-H borylation is particularly versatile because organoboron species can be transformed into a wide array of functional groups, enabling extensive SAR studies. However, LSF is still applied only sparingly in drug discovery, and often with a single reaction type. Drug molecules contain multiple functional groups and C-H bonds of varying strengths and environments, which makes reactivity and selectivity hard to predict and necessitates extensive experimentation.
High-throughput experimentation (HTE) accelerates reaction optimization through miniaturized, parallel screening. Combining HTE with FAIR (findable, accessible, interoperable, reusable) data documentation generates high-quality datasets suitable for advanced data analysis and machine learning. Graph neural networks (GNNs) have proven effective in molecular feature extraction and property prediction, and have been used successfully for retrosynthesis planning, regioselectivity prediction, and reaction product prediction. Alternatives such as transformers and fingerprint-based approaches also exist. Applying GNNs to complex drug-like molecules and larger datasets, however, remains a challenge.
This study introduces a geometric deep learning approach, applied to automated LSF borylation screening, for identifying late-stage hits. It addresses the limitations of previous approaches by training on both high-quality literature data and newly generated experimental data, so that the models can handle the complexity of drug-like molecules.
Literature Review
The authors conducted a thorough literature review, identifying 38 publications detailing relevant borylation methods. From these, they manually curated a high-quality dataset containing 1301 chemical transformations. This dataset served as a foundation for the study's computational models and informed the design of the high-throughput experimentation (HTE) process. The review highlighted the existing limitations in applying late-stage functionalization (LSF) to drug discovery, particularly the difficulty of predicting reactivity and selectivity in complex molecules and the need for extensive experimentation. Existing machine learning approaches using GNNs, while showing promise, were often limited to smaller datasets and simpler molecules. This study aimed to overcome these limitations by utilizing a more comprehensive dataset and advanced geometric deep learning techniques.
Methodology
The study involved a multifaceted methodology combining literature analysis, the creation of an LSF informer library, high-throughput experimentation (HTE), and geometric deep learning.
**Literature Analysis:** A systematic analysis of chemical transformations (SACT) was performed on data extracted from 38 publications, resulting in a manually curated dataset of 1301 borylation reactions. This informed the HTE plate design.
**LSF Informer Library:** An LSF informer library was developed using a clustering method applied to 1174 approved small-molecule drugs, resulting in 8 structurally diverse groups. Three molecules were selected from each group, supplemented by 12 fragments relevant to Roche's chemical space, and five idealized substrates, creating a diverse set for screening. This library aimed to represent essential chemical motifs relevant in drug discovery.
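The cluster-then-select idea behind the informer library can be illustrated with a small sketch. The summary does not name the clustering method or the molecular descriptors, so everything here is an assumption: molecules are represented by hypothetical sets of "on" fingerprint bits, similarity is Tanimoto, and the grouping is a simple greedy leader clustering rather than the authors' procedure.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(fps: Dict[str, FrozenSet[int]], threshold: float) -> List[List[str]]:
    """Greedy leader clustering (an assumed stand-in for the real method).

    A molecule joins the first cluster whose leader (first member) is at
    least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for name, fp in fps.items():
        for cluster in clusters:
            if tanimoto(fp, fps[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def pick_informers(clusters: List[List[str]], per_cluster: int = 3) -> List[str]:
    """Take up to `per_cluster` members from every cluster."""
    return [name for cluster in clusters for name in cluster[:per_cluster]]


# Hypothetical toy fingerprints; names and bit sets are made up.
toy_fps = {
    "drug_a": frozenset({1, 2, 3, 4}),
    "drug_b": frozenset({1, 2, 3, 5}),
    "drug_c": frozenset({1, 2, 4, 5}),
    "drug_d": frozenset({1, 2, 3, 4, 5}),
    "drug_e": frozenset({10, 11, 12}),
    "drug_f": frozenset({10, 11, 13}),
}
clusters = leader_cluster(toy_fps, threshold=0.5)
informers = pick_informers(clusters, per_cluster=3)
print(clusters)
print(informers)
```

Selecting a fixed number of members per cluster keeps the screening set small while still sampling each structural group.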
**Screening Plate Design:** A 24-well borylation screening plate was designed using a meta-analysis of the literature data to determine optimal reaction conditions (temperature, time, concentration, and scale). The plate included variations in ligands and solvents to comprehensively test reaction parameters.
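As a rough illustration of how a 24-well layout can enumerate ligand and solvent combinations, the sketch below builds a full grid. The summary does not state how the 24 wells are divided, so the 6 x 4 split and the placeholder names `ligand_1` to `ligand_6` and `solvent_1` to `solvent_4` are assumptions, not the study's actual conditions.

```python
from itertools import product
from string import ascii_uppercase
from typing import Dict, List


def build_plate(ligands: List[str], solvents: List[str]) -> Dict[str, Dict[str, str]]:
    """Map each well ID to one (solvent, ligand) pair.

    Rows (A, B, ...) index solvents and columns (1, 2, ...) index ligands.
    """
    plate = {}
    for (row, solvent), (col, ligand) in product(enumerate(solvents), enumerate(ligands)):
        well = f"{ascii_uppercase[row]}{col + 1}"
        plate[well] = {"solvent": solvent, "ligand": ligand}
    return plate


# Placeholder condition names (hypothetical, for illustration only).
ligands = [f"ligand_{i}" for i in range(1, 7)]
solvents = [f"solvent_{i}" for i in range(1, 5)]

plate = build_plate(ligands, solvents)
print(len(plate), plate["A1"], plate["D6"])
```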
**HTE Borylation Screening:** The LSF informer library compounds underwent HTE screening using the designed 24-well plates in a glove box under nitrogen. Reactions were automated, and products were analyzed by LCMS. A standardized reaction data output format (SURF) was created to facilitate data sharing and analysis.
**Scaled-Up Reactions:** Selected reactions showing significant conversion were scaled up using the most promising conditions for further analysis. Products were purified, and structural elucidation was performed using NMR and HRMS.
**Deep Learning:** Three GNN architectures were developed: GNN, GTNN, and aGNN. Each was trained on four input molecular graph representations (2D, 2DQM, 3D, and 3DQM), where 3D adds steric information and QM adds electronic information. Atomic properties (atom type, ring type, aromaticity, hybridization) and DFT-level Mulliken partial charges were encoded as features; 3D conformers were generated with RDKit and energy-minimized with the universal force field (UFF). All architectures used E(3)-invariant message passing and differed in their pooling operation: GNN used sum pooling, GTNN used graph multiset transformer-based pooling, and aGNN used no pooling. Models were implemented in PyTorch and PyTorch Geometric, optimized with Adam, and evaluated using mean absolute error (MAE), balanced accuracy, area under the curve (AUC), and F-score. An ECFP4NN model served as a baseline. The impact of steric and electronic effects was assessed by comparing model performance across the graph representations.
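To make the idea of E(3)-invariant message passing concrete, the following toy sketch passes messages that depend only on interatomic distances, which do not change when a molecule is rotated or translated. It is not the study's architecture: the scalar atom features, the fixed `exp(-d)` weighting, and the cutoff are assumptions chosen for illustration, whereas the real models use learned networks in PyTorch Geometric.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]


def message_pass(features: List[float], coords: List[Vec], cutoff: float) -> List[float]:
    """One message-passing round using distances only.

    Each atom adds its neighbours' features, weighted by exp(-distance).
    Distances are unchanged by rotation and translation, so the result is too.
    """
    updated = []
    for i, (h_i, x_i) in enumerate(zip(features, coords)):
        aggregated = 0.0
        for j, (h_j, x_j) in enumerate(zip(features, coords)):
            if i == j:
                continue
            d = math.dist(x_i, x_j)
            if d <= cutoff:
                aggregated += h_j * math.exp(-d)
        updated.append(h_i + aggregated)
    return updated


def sum_pool(features: List[float]) -> float:
    """Sum pooling: collapse per-atom values into one molecule-level value."""
    return sum(features)


def rotate_z_and_shift(coords: List[Vec], angle: float, shift: Vec) -> List[Vec]:
    """Rotate about the z-axis, then translate (a rigid motion)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


# Hypothetical four-atom toy system; the last atom lies outside the cutoff.
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.0, 1.4, 0.0), (3.0, 3.0, 0.0)]
features = [1.0, 0.5, 0.5, 2.0]

per_atom = message_pass(features, coords, cutoff=2.5)  # kept as-is when no pooling is used
pooled = sum_pool(per_atom)                            # molecule-level readout

moved = rotate_z_and_shift(coords, angle=0.7, shift=(5.0, -2.0, 1.0))
pooled_moved = sum_pool(message_pass(features, moved, cutoff=2.5))
print(math.isclose(pooled, pooled_moved, rel_tol=1e-9))
```

Keeping the per-atom values instead of pooling them is what makes atom-level predictions such as regioselectivity possible, while a pooled value suits molecule-level targets such as yield.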
Key Findings
The study demonstrated that the developed platform effectively predicts reaction outcomes, yields, and regioselectivity in late-stage borylation reactions of drug-like molecules.
**Reaction Yield Prediction:** The best-performing model (GTNN3DQM) achieved a mean absolute error (MAE) of 4.23 ± 0.08% for the experimental dataset, indicating highly accurate yield prediction.
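A figure of the form "MAE ± spread" can be computed as sketched below. The summary does not say how the ± value was obtained; this sketch assumes it is the standard deviation of the MAE over repeated training runs, and all yields shown are made-up numbers.

```python
from statistics import mean, stdev
from typing import List, Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted yields."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical measured yields (%) and predictions from three training runs.
measured = [0.0, 12.0, 45.0, 80.0]
runs: List[List[float]] = [
    [2.0, 10.0, 50.0, 76.0],
    [1.0, 15.0, 41.0, 83.0],
    [0.0, 9.0, 47.0, 85.0],
]

maes = [mean_absolute_error(measured, run) for run in runs]
mae_mean, mae_std = mean(maes), stdev(maes)
print(f"MAE = {mae_mean:.2f} +/- {mae_std:.2f} percentage points")
```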
**Binary Reaction Outcome Prediction:** The GTNN3DQM model reached an area under the curve (AUC) of 94.5% at a 1% yield threshold on the experimental dataset, but only 67 ± 2% for novel substrates, indicating strong performance on known substrates and markedly weaker generalization to unseen ones.
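The thresholding and AUC evaluation described here can be sketched as follows. Whether the study counts a yield of exactly 1% as a success is not stated, so the `>=` comparison is an assumption, and the yields and model scores are hypothetical.

```python
from typing import List, Sequence


def binarize(yields: Sequence[float], threshold: float = 1.0) -> List[int]:
    """Label a reaction 1 (success) if its yield in % is at or above the threshold."""
    return [1 if y >= threshold else 0 for y in yields]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical measured yields (%) and predicted success probabilities.
yields = [0.0, 0.5, 3.0, 25.0, 0.0, 60.0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

labels = binarize(yields, threshold=1.0)
auc = roc_auc(labels, scores)
print(labels, round(auc, 3))
```

The pairwise formulation is quadratic in the number of reactions, which is fine for a sketch; a rank-based implementation would scale better.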
**Regioselectivity Prediction:** The best-performing model (aGNN3DQM) achieved an F-score of 60 ± 4% on the literature dataset and an accuracy of 90 ± 1% in predicting whether borylation occurs at non-quaternary carbons. On unseen molecules, it predicted the regioselectivity of borylation with around 70% accuracy. Steric information contributes substantially to these predictions.
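For atom-level predictions, an F-score compares the set of atoms predicted to be borylated with the set actually borylated. The sketch below computes it for a single molecule; how the study aggregates the score across molecules is not stated here, and the atom indices are hypothetical.

```python
from typing import Set


def f_score(true_sites: Set[int], predicted_sites: Set[int]) -> float:
    """F1 score: harmonic mean of precision and recall over atom indices."""
    true_positives = len(true_sites & predicted_sites)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_sites)
    recall = true_positives / len(true_sites)
    return 2 * precision * recall / (precision + recall)


# Hypothetical molecule: atoms 3 and 7 are borylated experimentally,
# while the model flags atoms 3, 5 and 7.
observed = {3, 7}
predicted = {3, 5, 7}

score = f_score(observed, predicted)
print(round(score, 2))
```

The F-score penalizes both missed sites and spurious ones, which matters because borylated atoms are a small minority of all atoms in a molecule.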
**Influence of 3D Structure:** The incorporation of 3D information significantly improved the performance of all prediction tasks, underscoring the importance of steric information in borylation reactions. The inclusion of DFT-calculated partial charges did not yield further improvements in the prediction accuracy.
**LSF Informer Library:** The LSF informer library, consisting of 23 diverse drugs, 12 fragments, and 5 idealized substrates, provided comprehensive coverage of the functional groups relevant to drug discovery. Screening it showed that the success of borylation depends on the functional groups present: some enhance the reaction, and others suppress it. Analysis of the screening results also provided insights into the influence of ligands and solvents.
**SURF Format:** The introduction of a Simple User-Friendly Reaction Format (SURF) streamlined reaction data capture and sharing, promoting FAIR data principles.
Discussion
This study addressed the challenge of predicting the outcome of late-stage functionalization reactions, specifically C-H borylation, in complex drug-like molecules. Integrating geometric deep learning with high-throughput experimentation (HTE) proved effective: accurate yield and binary outcome predictions reduce the need for extensive, time-consuming experimentation, and regioselectivity predictions let chemists focus on the sites most likely to react. The use of 3D structural information was crucial for model performance, highlighting the significant role of sterics in borylation reactions. The SURF format supports FAIR data practices by promoting data sharing and reusability within the field. The models remain limited in predicting regioselectivity for molecules that differ significantly from those in the training dataset, which suggests that further data augmentation is needed to improve accuracy and generalizability. Even so, the current results demonstrate the platform's practical applicability in drug discovery projects.
Conclusion
This research demonstrates a robust geometric deep learning platform for predicting the outcome, yield, and regioselectivity of late-stage borylation reactions. This platform significantly accelerates drug diversification efforts by enabling *in silico* assessment of borylation opportunities. The high accuracy and efficiency of the platform, coupled with the introduction of the SURF data format, represent a significant advance in the field. Future work will focus on expanding the reaction conditions explored and augmenting the LSF informer library to further enhance the models' predictive power and generalizability.
Limitations
While the models showed high accuracy in predicting reaction outcomes and regioselectivity, some limitations exist. The regioselectivity model's performance is influenced by the chemical space covered in the training data. Molecules significantly different from those in the training set (e.g., those with sp3-carbon borylation or multiple ring systems) might not be accurately predicted. The experimental dataset, while standardized, covers a more limited reaction parameter space than the literature dataset. Further data generation and model refinement are necessary to expand the platform's applicability across a broader chemical space and improve the accuracy of predictions in edge cases. The accuracy of the predictions relies heavily on the quality and consistency of the experimental data, highlighting the importance of standardized experimental procedures.