Biocatalysed synthesis planning using data-driven learning

D. Probst, M. Manica, et al.

This paper introduces forward and backward prediction models based on the Molecular Transformer, designed to predict enzymatic activity and enzyme selectivity on substrates that have not yet been reported. The work, by Daniel Probst, Matteo Manica, Yves Gaetan Nana Teukam, Alessandro Castrogiovanni, Federico Paratore, and Teodoro Laino, trains and evaluates these models on the newly compiled ECREACT dataset of enzyme-catalysed reactions.

Introduction

Chemical synthesis must become more sustainable, and enzymes offer highly selective, reusable catalysts operating under mild, green conditions. Yet, widespread adoption of biocatalysis in synthesis planning is hindered by limited substrate scope in databases, difficulty predicting enzyme activity on unseen substrates, and enzyme-specific stereo- and regioselectivity. Existing computer-aided synthesis planning for biocatalysis is largely rule-based, and early data-driven methods have focused on forward prediction. This study aims to generalize transformer-based reaction prediction to biocatalysis by encoding enzyme information via EC (enzyme commission) numbers and training both forward and backward (retrosynthetic) models to enable template-free biocatalysed synthesis planning.

Literature Review

Prior approaches to biocatalytic synthesis planning predominantly use expert-curated reaction templates and rules, such as ATLAS of Biochemistry, RetroRules, and RetroBioCat, which facilitate pathway design but require manual curation and struggle with long-range substituent effects. A recent data-driven effort (Kreutter et al.) used the Molecular Transformer with enzyme names as tokens to predict forward enzymatic reactions, achieving up to 62% accuracy with detailed enzyme descriptors, but lacked a backward model for retrosynthesis and faced challenges learning across differently named but related enzymes. Transformer-based models have shown strong performance in traditional organic chemistry with learned reaction grammars and retrosynthesis via hyper-graph search, motivating extension to enzymatic reactions using standardized EC-number tokens for better generalization across enzyme families.

Methodology

Data: Enzymatic reactions with EC numbers were aggregated from Rhea (n=8,659), BRENDA (n=11,130), PathBank (n=31,047) and MetaNetX (n=34,485). Processing removed products that also appear as reactants, common cofactors and byproducts in multi-product reactions, product molecules with fewer than 4 heavy atoms, and reactions with no reactants or without exactly one product. The resulting ECREACT dataset contains 62,222 unique reaction–EC combinations and is provided in five token schemes: EC0 (no EC; n=55,115), EC1 (level 1; n=55,707), EC2 (levels 1–2; n=56,222), EC3 (levels 1–3; n=56,579), and EC4 (levels 1–4; n=62,222). EC3 balances specificity and sample size and is the main focus for modeling. Reaction SMILES were extended to include EC numbers (enzymatic reaction SMILES). Tokenization treated EC levels 1–3 as dedicated tokens reflecting the EC hierarchy, prefixed (v/u/t) and bracketed to avoid conflicts with SMILES syntax.

Modeling: Molecular Transformer models (forward and backward) were trained using multi-task transfer learning on USPTO (~1M reactions) and ECREACT, with a 9:1 convex weighting (USPTO:ECREACT).

Architecture and training: transformer encoder/decoder, 6 layers; embedding size 512; gradient accumulation 8; Adam (β1=0.9, β2=0.998); batch size 4096 (token-based); learning rate 2.0 with Noam decay; dropout 0.1; label smoothing 0.1; positional encoding enabled. Data splits used 90/5/5 train/val/test with a strict constraint that no product in the test set appears as a product in training or validation, to reduce memorization.

Evaluation: Tasks included forward prediction (products from substrates + EC), backward prediction (precursors + EC from product), round-trip (consistency of chained forward and backward predictions), and EC-only prediction. Performance was stratified by EC class and by EC-level-3 subclass size. Additional tests randomized EC tokens within and across classes to probe the utility of EC tokens.

Retrosynthesis: The hyper-graph exploration method was adapted to include EC tokens. At each step, the backward model proposes disconnections and EC classes; candidates are scored using forward-model confidence reweighted by the SCScore of the precursors, and explored via beam search until commercially available starting materials are reached.
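
As an illustration of the EC tokenization described under Data, the following is a minimal sketch rather than the authors' implementation. The function names, the simplified SMILES regular expression, the `|` separator between reactants and EC tokens, and the exact token spelling (`[v3]`, `[u5]`, `[t1]`) are assumptions made for this example; only the v/u/t prefixes and the bracketing come from the summary above.

```python
from __future__ import annotations

import re

# Simplified atom-level SMILES tokenizer (assumption: covers common SMILES only).
SMILES_TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9]"
)

# Prefixes for EC levels 1, 2 and 3, as named in the text.
EC_LEVEL_PREFIXES = ("v", "u", "t")


def tokenize_ec(ec_number: str, levels: int = 3) -> list[str]:
    """Turn an EC number such as '3.5.1.4' into bracketed, prefixed level tokens."""
    parts = ec_number.split(".")[:levels]
    return [f"[{prefix}{part}]" for prefix, part in zip(EC_LEVEL_PREFIXES, parts)]


def to_model_source(reactants: str, ec_number: str, levels: int = 3) -> str:
    """Build a space-separated token string: reactant tokens, '|', then EC tokens.

    The '|' separator is a hypothetical choice for this sketch.
    """
    smiles_tokens = SMILES_TOKEN_PATTERN.findall(reactants)
    if "".join(smiles_tokens) != reactants:
        raise ValueError(f"Could not tokenize SMILES completely: {reactants!r}")
    return " ".join(smiles_tokens + ["|"] + tokenize_ec(ec_number, levels))


if __name__ == "__main__":
    # Acetamide hydrolysis substrates with an amidase EC number, truncated to level 3.
    print(to_model_source("CC(N)=O.O", "3.5.1.4"))
```

Because the EC tokens are bracketed, a bracket-aware SMILES tokenizer treats each one as a single symbol, which is one way to read the summary's remark about avoiding conflicts with SMILES syntax. Lowering `levels` gives the coarser EC1 and EC2 schemes in this sketch.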

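The product-disjoint train/val/test split mentioned in the methodology can be approximated by grouping reactions by product and assigning whole groups to a split. The sketch below is one possible way to do this, not the authors' procedure. `product_disjoint_split` is a hypothetical helper, products are assumed to be canonicalized strings already, and the result is slightly stricter than the stated constraint because it keeps products disjoint across all three splits.

```python
import random
from collections import defaultdict


def product_disjoint_split(reactions, fractions=(0.9, 0.05, 0.05), seed=0):
    """Split (source, product) pairs so that no product occurs in more than one split.

    `fractions` holds the train and validation targets; the test split receives
    whatever remains, so the third entry is informational only.
    """
    groups = defaultdict(list)
    for source, product in reactions:
        groups[product].append((source, product))

    # Shuffle the distinct products reproducibly, then fill the splits in order.
    products = sorted(groups)
    random.Random(seed).shuffle(products)

    total = len(reactions)
    train_target = fractions[0] * total
    val_target = fractions[1] * total

    train, val, test = [], [], []
    for product in products:
        if len(train) < train_target:
            bucket = train
        elif len(val) < val_target:
            bucket = val
        else:
            bucket = test
        bucket.extend(groups[product])  # the whole product group goes to one split
    return train, val, test
```

Assigning groups rather than individual reactions means the realized split sizes only approximate 90/5/5 when some products have many associated reactions.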
Key Findings
  • Incorporating EC-number tokens enables data-driven learning of enzymatic reaction patterns; attention analyses show EC tokens align with reaction centers and nucleophiles.
  • Forward model (EC3): top-1 accuracy 49.6%, top-5 63.5%, top-10 68.8%. Randomizing EC tokens reduces performance to 41.3% (within-class) and 38.3% (across-class), demonstrating utility of EC information, especially for smaller subclasses and heterogeneous chemical spaces.
  • Backward model: top-1 accuracy ~60% overall, with strong performance for transferases (class 2) driven by large EC-level-3 subclasses (e.g., 2.3.1.x and 2.7.8.x) and lower performance for oxidoreductases (class 1) due to many small subclasses.
  • Round-trip (single-step) accuracy: top-1 39.6%, top-5 42.3%, top-10 42.6% on EC3.
  • Class-wise effects: Accuracy correlates strongly with training sample sizes per EC-level-3 subclass. Transferase, lyase, and hydrolase substrates/products often occupy class-homogeneous chemical space, reducing reliance on EC tokens; other classes are heterogeneous and benefit more from EC tokens.
  • Stereochemistry remains a major challenge (notably for isomerases, class 5); in the authors' analysis, removing stereochemical labels doubles accuracy for class 5.
  • EC token scheme comparison: EC3 offers the best balance of specificity and performance; EC4 underperforms due to sparse subclasses; EC1 aggregates too broadly for retrosynthesis.
  • Retrosynthesis use-cases demonstrate viable enzymatic routes under mild conditions for targets including aminoalcohols, homoaspartate, 4-hydroxy-L-glutamic acid, α-ketoacids, and (S)-norlaudanosoline, often differing from rule-based RetroBioCat routes and reflecting biosynthetic bias of training data.
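
The top-1, top-5, and top-10 figures above are top-k accuracies: a prediction counts as correct if the reference appears among the model's k highest-ranked outputs. A minimal sketch of that metric follows. `top_k_accuracy` is a hypothetical helper, and it assumes predictions and references have already been normalized (for example, canonicalized SMILES) so that exact string comparison is meaningful.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of examples whose target appears in the first k ranked predictions.

    ranked_predictions: list of lists, each ordered from most to least likely.
    targets: list of reference strings, aligned with ranked_predictions.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("ranked_predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    hits = sum(
        target in ranked[:k]
        for ranked, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


if __name__ == "__main__":
    predictions = [["A", "B"], ["C", "D"], ["E", "F"]]
    references = ["A", "D", "X"]
    for k in (1, 2):
        print(k, top_k_accuracy(predictions, references, k))
```

Top-k accuracy can only stay the same or increase as k grows, which matches the pattern in the reported forward and round-trip numbers.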
Discussion

The study demonstrates that transformer-based models can learn enzymatic reactivity patterns when enzyme information is encoded via EC-number tokens. By training both forward and backward models and integrating them into a hyper-graph search, the approach enables template-free biocatalysed synthesis planning. Performance depends heavily on data availability and class balance; large, homogeneous subclasses perform best. EC tokens provide crucial explicit information for classes where substrates/products populate heterogeneous chemical spaces. Attention analyses indicate that models use EC tokens to localize reaction centers, supporting mechanistic relevance. While stereochemical prediction and sparsely populated classes limit accuracy in some areas, the models already generate practical, greener synthetic suggestions and generalize beyond seen substrates by associating enzyme classes with transformation centers.

Conclusion

This work introduces the first template-free, transformer-based framework for biocatalysed synthesis planning. It encodes enzyme information as EC-number tokens and trains forward and backward Molecular Transformer models on a curated, publicly available enzymatic reaction dataset (ECREACT), using multi-task transfer learning from USPTO. The forward model reaches 49.6% top-1 accuracy on EC3, the backward model about 60% top-1 accuracy overall, and single-step round-trip top-1 accuracy is 39.6% on EC3; together the models enable retrosynthetic route generation with proposed enzyme classes. The open release of data and models facilitates broader adoption and future improvements. Future research should expand and balance publicly available enzymatic data (especially for underrepresented classes and stereochemical cases), incorporate higher-quality stereochemical annotations, fine-tune on specialized biocatalysis datasets, and experimentally validate and iteratively refine predicted routes to improve generalizability and accuracy.

Limitations
  • Data imbalance across EC-level-3 subclasses, with many small subclasses (notably in oxidoreductases and isomerases), reduces accuracy and generalizability.
  • Sparse data for EC4 (full specificity) limits usefulness of highly detailed enzyme tokens; EC4 categories often have very few examples.
  • Stereochemistry prediction is challenging due to limited and inconsistent stereochemical annotations; performance drops notably for isomerases.
  • Dataset bias toward biosynthetic reactions and metabolites (limited public data on biocatalysed synthesis of non-natural products) leads models to favor natural product-like substrates.
  • Limited records for certain classes (e.g., translocases) reduce statistical significance; these classes were excluded from parts of the detailed analysis.
  • Conflicting or duplicate records (same substrates and EC leading to multiple products) and occasional data errors can cause apparent mispredictions.
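
The conflicting records mentioned in the last limitation (same substrates and EC number mapped to more than one product) can be surfaced with a simple grouping pass. The sketch below is an illustration under assumptions, not part of the published pipeline. `find_conflicting_records` is a hypothetical helper, records are assumed to be `(reactants, ec_number, product)` tuples with dot-separated, already canonicalized SMILES, and reactant order is ignored.

```python
from collections import defaultdict


def find_conflicting_records(records):
    """Return {(sorted reactants, EC number): set of products} for keys with >1 product."""
    products_by_key = defaultdict(set)
    for reactants, ec_number, product in records:
        # Sort the reactant molecules so 'A.B' and 'B.A' map to the same key.
        key = (tuple(sorted(reactants.split("."))), ec_number)
        products_by_key[key].add(product)
    return {
        key: products
        for key, products in products_by_key.items()
        if len(products) > 1
    }


if __name__ == "__main__":
    example = [
        ("A.B", "1.1.1", "C"),
        ("B.A", "1.1.1", "D"),  # same substrates and EC, different product
        ("E", "1.1.1", "F"),
        ("E", "1.1.1", "F"),    # exact duplicate, not a conflict
    ]
    print(find_conflicting_records(example))
```

Exact duplicates collapse into a single product and are not reported, so only genuinely ambiguous substrate and EC combinations are flagged.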