
Chemistry
Inferring experimental procedures from text-based representations of chemical reactions
A. C. Vaucher, P. Schwaller, et al.
In this work, Alain C. Vaucher, Philippe Schwaller, Joppe Geluykens, Vishnu H. Nair, Anna Iuliano, and Teodoro Laino present data-driven models that predict synthesis steps directly from chemical equations. Trained on a dataset of 693,517 reactions with sequence-to-sequence models such as the Transformer and BART, more than 50% of the predicted action sequences are judged adequate for execution without human intervention.
Introduction
The study addresses the challenge of converting chemical equations into executable laboratory procedures—a task typically reliant on expert intuition, literature search, and trial-and-error. While AI has advanced retrosynthetic planning and reaction prediction, these do not specify operational steps (e.g., additions, stirring, filtration, temperature control). The research question is whether data-driven models can infer complete experimental action sequences directly from text-based representations of chemical equations (SMILES), thereby enabling automation and reducing manual effort. The authors argue that such models could facilitate automated synthesis platforms by generating stepwise instructions suitable for human or robotic execution.
Literature Review
Prior AI work has focused on predicting parts of reaction conditions rather than full procedures. Walker et al. predicted solvents for selected reaction classes; Maser et al. formulated multiclass prediction of reaction conditions (e.g., metal, ligand, base, solvent, additive, temperature, CO atmosphere) for cross-couplings; Nicolaou et al. coupled retrosynthesis with nearest-neighbor search to suggest procedures; Gao et al. predicted reagents, solvents, catalysts, and temperatures across reaction classes, but procedures still needed manual revision, especially for robotic execution. Limitations included domain complexity and insufficiently curated data for predicting end-to-end procedures. The authors position their work (Smiles2Actions) as the first to convert chemical equations to fully explicit action sequences for batch organic synthesis, building on NLP extraction of procedures from patents and modern sequence-to-sequence architectures (Transformer, BART) and reaction fingerprints.
Methodology
Task formulation: Predict a sequence of synthesis actions from a reaction represented as SMILES, treating all precursors (reactants + reagents) and products without distinguishing roles. Actions use the Vaucher et al. action schema with types and properties covering common batch operations.
Input/output preprocessing: (1) Replace explicit compound names in actions with positional tokens referencing their order in the input reaction (allowing only additional compounds from a fixed list of common reagents). (2) Tokenize temperatures and durations into predefined ranges to mitigate noise (e.g., 'overnight', broad temperature intervals). Quantities (masses/volumes) were removed due to inconsistent coverage and scale effects; the resulting procedures are effectively averaged across scales. Reaction SMILES are canonicalized (RDKit), duplicates removed, and tokenized for language models.
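The range tokenization can be sketched as follows; the bin edges and token names below are illustrative stand-ins, not the exact ranges used in the paper.

```python
def tokenize_temperature(celsius: float) -> str:
    """Map a raw temperature to a coarse range token.

    The bin edges here are illustrative; the paper's exact ranges differ.
    """
    bins = [
        (-1000, -10, "TEMP_BELOW_-10"),
        (-10, 10, "TEMP_-10_10"),
        (10, 40, "TEMP_10_40"),      # roughly 'room temperature'
        (40, 100, "TEMP_40_100"),
        (100, 10000, "TEMP_ABOVE_100"),
    ]
    for lo, hi, token in bins:
        if lo <= celsius < hi:
            return token
    raise ValueError(f"temperature out of range: {celsius}")


def tokenize_duration(hours: float) -> str:
    """Map a raw duration to a coarse range token ('overnight' falls in 8-18 h)."""
    if hours < 1:
        return "DUR_UNDER_1H"
    if hours < 8:
        return "DUR_1_8H"
    if hours < 18:
        return "DUR_8_18H"
    return "DUR_OVER_18H"
```

Binning like this turns noisy reported values ('overnight', 'about 50–60 °C') into a small vocabulary the sequence models can predict reliably.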
Dataset creation: Source = Pistachio v3.0 patent reactions. Starting from 8,377,878 records, entries without procedure text and duplicates were removed, leaving 3,464,664 reactions. The Paragraph2Actions NLP model extracted action sequences from the procedure texts. Actions were then postprocessed and standardized: consecutive actions merged, the retained phase inferred for filtrations, 'same temperature' references resolved, initial MakeSolution steps replaced with Add steps, temperature/duration/pH values tokenized into ranges, repetitions ignored for Extract/Wash, and atmospheres ignored except 'vacuum' for DrySolid. Extracted compound names were mapped to SMILES via name-to-SMILES/SMILES-to-name dictionaries and reaction-specific mappings; names were retained only if they matched reaction molecules or a curated list of common reagents. Records failing mapping or quality checks (e.g., InvalidAction, temperature/duration parsing errors, multi-step indications, too-short sequences) were discarded, and identical reaction SMILES were deduplicated, retaining one instance each.
Final dataset: 693,517 reaction SMILES with associated action sequences (about 20% of 3,464,664 intermediate records). Reaction class coverage compared between original and final sets showed broadly similar distribution with acceptable deviations; most classes change by less than 50% in prevalence; 66 rare classes (fewer than 100 examples) disappear. Train/val/test split: 554,813 / 69,352 / 69,352.
Models: (1) Nearest-neighbor using rxnfp reaction fingerprints; FAISS search among training reactions constrained to same number of precursors; adapt neighbor’s action sequence. (2) Transformer encoder–decoder (OpenNMT-py) translating tokenized reaction SMILES to action sequences; 8 attention heads; reduced model size (4 layers, hidden size 256, embeddings 256); label_smoothing 0; other training hyperparameters as specified. (3) BART sequence-to-sequence (fairseq) fine-tuned for the task.
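As a minimal illustration of the nearest-neighbor baseline, the sketch below retrieves the action sequence of the most similar training reaction with the same precursor count, using toy numeric fingerprints and cosine similarity in place of rxnfp embeddings and a FAISS index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_neighbor(query_fp, n_precursors, train):
    """Return the action sequence of the most similar training reaction
    that has the same number of precursors (the paper's search constraint).

    `train` is a list of (fingerprint, n_precursors, actions) tuples;
    a real system would use rxnfp reaction fingerprints and a FAISS index.
    """
    candidates = [(cosine(query_fp, fp), actions)
                  for fp, n, actions in train if n == n_precursors]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t[0])[1]
```

The precursor-count constraint matters because the retrieved sequence's positional compound tokens must be re-mapped onto the query's precursors.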
Evaluation: Metrics on the test set include validity (syntactic correctness and referencing all input molecules), BLEU score, and exact or thresholded normalized Levenshtein similarities (100%, 90%, 75%, 50%). Two random baselines (global random; compatible pattern by precursor/product counts) included. Additional analyses: distribution of predicted sequence lengths; categorization of differences vs. ground truth; approximate single-action accuracy under independence assumption; and a blind human assessment by a trained chemist on 500 reactions comparing ground truth vs. Transformer predictions.
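The thresholded similarity metric can be sketched as a normalized Levenshtein distance over action sequences; applying it at the level of whole action tokens, as below, is an assumption about the exact granularity used in the paper.

```python
def levenshtein(a, b):
    """Edit distance between two action sequences (lists of action strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def normalized_similarity(pred, truth):
    """1 - distance / max length; 1.0 means an exact match.

    A prediction counts toward the '>=75%' metric, for instance,
    when this value is at least 0.75.
    """
    if not pred and not truth:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / max(len(pred), len(truth))
```

For example, a prediction that matches a 4-action ground truth except for one extra action scores 0.75, so it clears the 75% and 50% thresholds but not 90%.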
Key Findings
- Dataset and preprocessing: From 8.38M patent reactions to 693,517 standardized reaction–procedure pairs; 326,929 duplicate reaction SMILES found across 871,112 records, with 47,299 having non-identical action sequences, reflecting multiple valid procedural variants and reporting noise.
- Model performance (test set, 69,352 reactions; values in %):
• Nearest neighbor: validity 99.6; BLEU 53.2; 100% acc 6.65; ≥90% 12.50; ≥75% 20.30; ≥50% 55.46.
• Transformer: validity 99.7; BLEU 54.7; 100% acc 3.60; ≥90% 10.10; ≥75% 24.74; ≥50% 68.73.
• BART: validity 99.6; BLEU 54.5; 100% acc 0.98; ≥90% 5.00; ≥75% 17.57; ≥50% 66.04.
• Random baselines much lower (e.g., compatible pattern: 100% acc 0.01; ≥50% 30.01).
- Although exact sequence matches are low (e.g., 3.6% for Transformer), many predictions are close to ground truth: ≥50% similarity for 68.73% (Transformer). Under an independence approximation, the average single-action correctness is estimated at ~72.7%.
- Difference analysis (Transformer vs. ground truth): ~0.9% differ only by action order; 5.4% differ only in properties (e.g., durations/temperatures) of a single action; swaps between similar actions (e.g., Stir vs. Reflux) occur, suggesting procedural flexibility; extra/missing actions often concern work-up/purification. Multiple missing/extra actions account for 18.8%; remaining 57% involve combined differences.
- Sequence length behavior: The nearest-neighbor model mirrors ground-truth lengths; the Transformer favors shorter sequences; BART is biased toward mid-length sequences. Nevertheless, the Transformer achieves 100% matches across a broad range of lengths, with short sequences only slightly overrepresented among its exact matches.
- Human expert assessment (500 reactions, blind, ground truth and Transformer prediction shown in random order): predictions were judged adequate in 313/500 (62.6%) cases (191 where both were adequate; 122 where the prediction was adequate but the ground truth was not) and inadequate in 187/500 (108 where only the ground truth was adequate; 79 where both were inadequate). Overall, the predicted procedures were judged at least as adequate as the ground truth, with slightly fewer inadequate predictions than inadequate ground-truth procedures (187 vs. 201).
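The independence approximation mentioned in the findings above treats each action as correct with probability p independently, so an exact match on an n-action sequence occurs with probability roughly p^n. The sketch below back-solves the average sequence length implied by the reported figures; the power-law model itself is an illustrative simplification, not the paper's derivation.

```python
import math

# Reported figures for the Transformer model: exact-match rate and the
# estimated per-action correctness under the independence assumption.
exact_match = 0.036
per_action = 0.727

# If every action is independently correct with probability p, then
# p ** n == exact_match, so the implied average sequence length is:
n = math.log(exact_match) / math.log(per_action)  # ≈ 10.4 actions
```

This shows the two reported numbers are mutually consistent for procedures averaging roughly ten actions, which is why a low exact-match rate can coexist with high per-action accuracy.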
Discussion
The models successfully learn to infer experimentally meaningful action sequences from reaction SMILES, addressing the core problem of translating symbolic reactions into laboratory-executable protocols. High validity (~99.6–99.7%) indicates syntactically correct, molecule-referencing outputs. Despite low exact-match rates—expected due to noisy/ambiguous ground truth and multiple equivalent procedural variants—Transformer and BART achieve strong similarity metrics and outperform random baselines substantially. The nearest-neighbor approach attains higher exact matches in some cases by leveraging close analogs from training but is less general. Transformer-based models capture broader chemical context (transformation type, functional groups) and generalize beyond nearest analogs, making them more suitable for automated procedure generation.
Human evaluation underscores that many predicted procedures are practically adequate—even occasionally preferable to the extracted ground truth—highlighting the limitations of patent-derived training data and rigid string-based metrics. The findings indicate that improved data curation (cleaner mappings, better extraction of actions, handling multi-step texts) should directly enhance model performance. The work suggests that, with better-quality datasets and extended inputs (e.g., quantities, states, atmospheres), such models can facilitate automation by generating code-like procedures for robotic execution and reduce trial-and-error in conventional labs.
Conclusion
This work introduces Smiles2Actions, the first end-to-end AI framework to convert reaction SMILES into explicit, lab-ready action sequences for batch organic synthesis. Using a large, standardized dataset (693k reactions) derived from patent procedures, the authors trained and compared three approaches (nearest-neighbor, Transformer, BART). All models produce highly valid procedures; Transformer-based models provide the best balance of generalization and accuracy, and human assessment confirms that more than half of predictions are adequate without human intervention.
Main contributions: (1) creation and release-on-request of a large standardized reaction–procedure dataset; (2) formulation of procedure prediction as sequence-to-sequence translation with tokenized ranges and molecule placeholders; (3) comprehensive benchmarking of nearest-neighbor vs. modern transformers; (4) expert validation of procedural adequacy across diverse reaction classes.
Future directions: improve data curation to reduce ground-truth noise; incorporate additional experimental context (state, concentration, quantities, atmospheres, scale); enhance handling of multi-step syntheses; and integrate with retrosynthesis and condition prediction to form fully automated planning-to-execution pipelines for robotic platforms, with appropriate safety checks.
Limitations
- Training data derived from patents contain noise and inconsistencies: duplicate reactions with differing procedures, OCR/name variants, ambiguous or incomplete mappings, and occasional incorrect reaction equations.
- Ground-truth procedures may not be unique; multiple valid ways to run a reaction penalize exact-match metrics and complicate evaluation.
- Quantities (masses/volumes), concentrations, state of matter, and atmospheres (except vacuum for DrySolid) were removed or ignored, limiting scale- and context-specific accuracy; hydrogenation steps lacking atmosphere details can be incomplete.
- Durations and temperatures are tokenized into ranges; exact values are not learned, reflecting reporting variability but limiting precision.
- Action extraction errors (InvalidAction, parsing failures) and heuristic filters excluded many records; the final dataset is about 8% of the initial database and excludes non-English texts, potentially introducing sampling bias and reducing coverage of rare classes.
- Current models assume reaction inputs include all necessary reagents/solvents; omission impairs predictions and necessitates separate completion algorithms.
- Evaluation relies on string similarity to a single ground truth, which can underrate chemically equivalent predictions.