DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications

K. Rajan, H. O. Brinkhaus, et al.

Discover DECIMER.ai, an innovative open-source platform that automates the extraction and interpretation of chemical structures from scientific literature! Developed by Kohulan Rajan, Henning Otto Brinkhaus, M. Isabel Agea, Achim Zielesny, and Christoph Steinbeck, this powerful tool leverages advanced segmentation and classification technologies for superior chemical structure recognition. Explore its impressive capabilities and publicly available resources!

Introduction
The paper addresses the challenge that most chemical information in the scientific literature is published as human-readable text and images rather than structured, machine-readable data, making manual extraction time-consuming and error-prone. Optical chemical structure recognition (OCSR) translates chemical structure depictions into machine-readable formats. Historically, OCSR tools were largely proprietary or rule-based, performing best on clean images but struggling with distortions. Recent advances in deep learning, computer vision, and NLP have enabled data-driven OCSR systems. The purpose of this study is to present DECIMER.ai, an open-source platform that integrates segmentation, classification, and image-to-structure translation to automate the extraction of chemical structures from literature, thereby reducing manual curation workload and facilitating large-scale data generation for research and databases.
Literature Review
Prior work in OCSR spans proprietary algorithms and open-source, rule-based tools such as OSRA, Imago, and MolVec, which tend to perform well on clean images but degrade with distortions. More recent deep-learning approaches (e.g., Img2Mol, SwinOCSR, MolScribe, ChemGrapher, Image2SMILES, MolMiner) show marked improvements. However, at the time of this work, few open-source solutions combined segmentation, classification, and recognition in a single workflow, and several high-performing systems, including object detection-based approaches that performed well on benchmarks, were not openly available, which limited comparative evaluation. DECIMER.ai aims to fill this gap by offering an end-to-end, open-source platform with state-of-the-art performance.
Methodology
DECIMER.ai comprises four components: (1) DECIMER Segmentation (Mask R-CNN-based) to detect and segment chemical structures from document pages; (2) DECIMER Image Classifier (EfficientNet-B0) to identify whether an image contains a chemical structure; (3) DECIMER Image Transformer, a deep-learning OCSR engine based on an encoder-decoder architecture that maps images directly to SMILES; and (4) a web application orchestrating the workflow.
Data generation: Training molecules were sourced from PubChem. After filtering (molecules above 1500 Da removed, explicit hydrogens removed, stereochemistry retained, length limits applied), 108,541,884 molecules were selected (pubchem_1). Markush structures were generated synthetically by replacing 1–3 atoms with R/X/Z variables (with indices 0–20) and processed via the CDK, yielding a combined set of 126,702,705 molecules (pubchem_2, pubchem_3). Depictions covered diverse styles from CDK, RDKit, Indigo, and PIKAChU, generated with the RanDepict toolkit and extensively augmented (rotation, shearing, noise, pixelation, reaction/curved arrows, labels). Images were primarily generated at 512×512 pixels (pubchem_3), with earlier sets at 299×299 pixels (pubchem_2). In total, more than 450 million depictions were created; in addition, 127.5 million hand-drawn-like synthetic depictions were generated with RanDepict 1.1.4. Data were stored as TFRecords for TPU training.
Tokenization: SMILES were tokenized with Keras rules, splitting at atoms, brackets, bond symbols, special characters, and single-digit numbers, adding <start>/<end> tokens, padding with <pad>, and using <unk> for unknown tokens. For Markush structures, to avoid ambiguity between digits used in ring syntax and digits used as R-group indices, digits following R-group labels were replaced with special characters during training and reinserted for evaluation. This modification improved the rate of valid predictions (+3.4%), the Tanimoto-1.0 rate (+2.2%), and the mean Tanimoto similarity (+0.04) over unmodified SMILES.
DECIMER Image Transformer architecture and training: The encoder is an EfficientNet-V2 network (B3/S/M variants tested; EfficientNet-V2-M used for 512×512 input), producing feature maps for a Transformer with 4 encoder and 4 decoder blocks, 8 attention heads, attention dimension 512, and feed-forward dimension 2048, with approximately 52M parameters in the encoder and 59M in the Transformer. Models were trained on Google Cloud TPUs (a V3-32 pod slice for pubchem_1/2, V3-256 for pubchem_3) using Adam, a custom learning-rate scheduler, dropout of 0.1, and sparse categorical cross-entropy, with a per-node batch size of 48 for 512×512 input. Training durations ranged from hours to days per epoch depending on TPU size and encoder variant.
Segmentation: DECIMER Segmentation uses Mask R-CNN with custom post-processing and was trained on manually annotated data (TensorFlow 2.3.0 initially, updated to 2.10.0). It is packaged for PyPI and available on GitHub/Zenodo.
Image classification: The DECIMER Image Classifier (EfficientNet-B0) was trained on 10,905,114 images, validated on 2,179,798, and tested on 544,946, using augmentations (flips, rotations, contrast, zoom). With a batch size of 650, training took about 52 h on a Tesla V100 32 GB GPU. Negative examples included millions of images from diverse non-chemical datasets and random graphs.
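To make the classification step concrete, the following is a minimal Keras sketch of an EfficientNet-B0 binary classifier with the augmentation types mentioned above (flips, rotations, contrast, zoom). The input resolution, augmentation strengths, and training settings are illustrative assumptions and do not reproduce the published configuration.

import tensorflow as tf

# Sketch of a binary "chemical structure vs. other image" classifier built on
# EfficientNet-B0, in the spirit of the DECIMER Image Classifier. All numeric
# settings below are assumptions for illustration only.

IMG_SIZE = 224  # assumed input resolution

augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomContrast(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

backbone = tf.keras.applications.EfficientNetB0(
    include_top=False,           # drop the ImageNet classification head
    weights=None,                # train from scratch on structure/non-structure data
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    pooling="avg",
)

inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = augmentation(inputs)         # flips, rotations, contrast and zoom, active only during training
x = backbone(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(image depicts a chemical structure)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")],
)
model.summary()
# Training would follow with model.fit(train_ds, validation_data=val_ds, ...)
# on tf.data pipelines of labelled structure / non-structure images.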
Evaluation: Test sets comprised 250,000 molecules depicted with and without augmentations at 512×512 pixels, plus 250,000 Markush depictions with and without augmentation. Metrics included the proportion of identical predictions (via canonical Isomeric CX SMILES string comparison, accounting for Markush labels), the average Tanimoto similarity (PubChem fingerprints), the validity rate, and BLEU scores for Markush SMILES similarity. Benchmarks used standard datasets (USPTO, UOB, CLEF, JPO, Indigo, USPTO_big, the Img2Mol test set, hand-drawn structures, and RanDepict250k clean/augmented), additionally evaluated with mild distortions (shearing/rotation). Other tools (OSRA, MolVec, Imago, Img2Mol, SwinOCSR, MolScribe) were run with their recommended or contributed pipelines (some via Docker/Singularity or custom scripts) for comparison.
Web application: The web application is implemented in Laravel 8 and deployed via Docker (nginx, UI/back-end communication, model servers). Pages of uploaded PDFs are distributed to preloaded socket-server instances running segmentation; the resulting segments are sent to classifier and transformer instances; the predicted SMILES are displayed and editable in an embedded Ketcher editor, with downloadable MOL and SMILES files and optional IUPAC naming via STOUT. The public instance is limited to 10 pages and 20 structures per document; code and deployment instructions are open-source.
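For readers who prefer a scripted workflow over the web application, the sketch below approximates the segmentation-then-recognition steps locally using the openly released Python packages (decimer-segmentation and decimer on PyPI). The function names follow those packages' documentation at the time of writing; the classification step is omitted and the file handling is simplified, so treat this as illustrative rather than definitive.

from pathlib import Path

from PIL import Image
from decimer_segmentation import segment_chemical_structures_from_file
from DECIMER import predict_SMILES


def extract_structures(document_path: str, out_dir: str = "segments") -> list[str]:
    """Segment structure depictions from a page image or PDF and translate
    each segment into a SMILES string."""
    Path(out_dir).mkdir(exist_ok=True)
    # Mask R-CNN-based segmentation; returns a list of image arrays (one per depiction).
    segments = segment_chemical_structures_from_file(document_path)
    smiles_list = []
    for i, segment in enumerate(segments):
        segment_path = Path(out_dir) / f"segment_{i}.png"
        Image.fromarray(segment).save(segment_path)   # the OCSR engine expects an image path
        smiles_list.append(predict_SMILES(str(segment_path)))
    return smiles_list


if __name__ == "__main__":
    for smiles in extract_structures("page_1.png"):
        print(smiles)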
Key Findings
- DECIMER Image Transformer achieved state-of-the-art performance across public OCSR benchmarks, providing the highest proportion of identical predictions (P_i) and the highest average Tanimoto similarity (T) among the compared open tools in many settings (a brief computation sketch for these metrics follows this list). Performance remained robust on distorted datasets (e.g., JPO_dist: P_i 62%, T 0.93; CLEF_dist: P_i 73%, T 0.96; USPTO_dist: P_i 61%, T 0.99; UOB_dist: P_i 86%, T 0.99; USPTO_big_dist: P_i 57%, T 0.96; Indigo_dist: P_i 51%, T 0.97; DECIMER-Test augmented: P_i 90%, T 0.99).
- Severe and catastrophic failure rates (Tanimoto ≤ 0.3, and Tanimoto = 0 or invalid prediction, respectively) were very low for DECIMER across benchmarks, including the distorted variants (typically 1–4% catastrophic failures on several datasets), indicating robust predictions.
- DECIMER Image Transformer V2 markedly outperformed the earlier V1 on clean benchmarks: JPO P_i 64%, T 0.93 (vs. V1: 0.22%, 0.33); CLEF 72%, 0.96 (vs. 0.30%, 0.37); USPTO 61%, 0.97 (vs. 0.10%, 0.33); UOB 88%, 0.98 (vs. 5.16%, 0.47).
- Training at higher image resolution (512×512) improved perfect recognition of Markush structures, especially by better resolving subscript R-group indices; BLEU scores for Markush predictions were consistently above 0.9, reflecting high sequence-level similarity.
- The training-data tweak replacing R-group digits with special symbols improved Markush performance: +3.4% valid SMILES, +2.2% Tanimoto-1.0 rate, and +0.04 mean Tanimoto versus unmodified SMILES.
- The DECIMER Image Classifier achieved an AUC of 0.99 on its test set and performed robustly across diverse real-world and public datasets (e.g., no electron-microscopy images were misclassified as chemical structures; minor reductions were observed on ChEBI owing to isolated ions).
- DECIMER Segmentation and the DECIMER Image Classifier are the only openly available tools in their respective categories, and the DECIMER web app uniquely integrates segmentation, classification, and OCSR in an open-source, end-to-end workflow accessible via web and mobile devices.
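As a rough illustration of how the identical-prediction (P_i) and Tanimoto (T) figures above are obtained, the sketch below compares predicted and reference SMILES. The paper compares canonical Isomeric CX SMILES and uses PubChem fingerprints via the CDK; here RDKit canonical SMILES and Morgan fingerprints serve as stand-ins, so absolute values would differ slightly.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def evaluate(references: list[str], predictions: list[str]) -> dict:
    """Compute identical-prediction, validity and Tanimoto-based statistics
    for a list of predicted SMILES against reference SMILES."""
    identical = valid = 0
    tanimotos = []
    for ref_smi, pred_smi in zip(references, predictions):
        ref = Chem.MolFromSmiles(ref_smi)
        pred = Chem.MolFromSmiles(pred_smi)
        if ref is None or pred is None:
            tanimotos.append(0.0)  # invalid prediction counted as similarity 0
            continue
        valid += 1
        # Identical prediction: canonical isomeric SMILES strings match exactly.
        if Chem.MolToSmiles(ref) == Chem.MolToSmiles(pred):
            identical += 1
        fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
        fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
        tanimotos.append(DataStructs.TanimotoSimilarity(fp_ref, fp_pred))
    n = len(references)
    return {
        "identical_fraction": identical / n,
        "valid_fraction": valid / n,
        "mean_tanimoto": sum(tanimotos) / n,
        "tanimoto_1_fraction": sum(t == 1.0 for t in tanimotos) / n,
    }


if __name__ == "__main__":
    refs = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1O"]
    preds = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1N"]  # second prediction is wrong
    print(evaluate(refs, preds))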
Discussion
The findings demonstrate that DECIMER Image Transformer, trained on extremely large, diverse, and augmented datasets, reliably translates complex chemical structure depictions—including Markush structures and superatom abbreviations—into machine-readable SMILES with high similarity. Performance remains strong under distortions and generalizes surprisingly well to hand-drawn images despite no hand-drawn structures in training, suggesting potential for mining older literature. The system’s low catastrophic and severe failure rates reduce manual curation effort, as near-correct predictions can be minimally edited rather than redrawn. Resolution significantly impacts recognition of fine details (e.g., R-group indices), with 512×512 inputs improving perfect Markush recognition. Compared with alternative tools, DECIMER generally performs best or on par across benchmarks; however, MolScribe achieves strong results on several datasets despite a far smaller training set, likely due to architectural and post-processing differences. Object detection-based OCSR approaches also show promise but were not available for benchmarking. Overall, the data-driven DECIMER approach benefits from large-scale, diverse training to achieve robust predictive accuracy and generalization, and the integrated, open-source platform increases practical utility for automated chemical data extraction.
Conclusion
DECIMER.ai delivers a comprehensive, open-source platform for automated identification, segmentation, and recognition of chemical structure depictions in scientific literature. The DECIMER Image Transformer establishes state-of-the-art performance across multiple benchmarks with robust accuracy under diverse depiction styles and distortions, while the Image Classifier achieves high discrimination (AUC 0.99). The public web app integrates all components and enables immediate inspection and correction via an embedded editor, accessible even on mobile devices. The work contributes open code, models, and datasets under permissive licenses, fostering community-driven improvements and integration. Future directions include deeper integration with text-mining tools such as ChemDataExtractor to link recognized structures with extracted textual information from PDFs, further optimization to improve performance on challenging cases (e.g., fine indices at lower resolutions, unusual depiction styles), and exploration of complementary architectures (e.g., object detection-based systems) to further enhance efficiency and accuracy.
Limitations
- Dependence on very large-scale synthetic training data and substantial compute resources (TPU pods) for optimal performance; smaller training regimes may underperform relative to approaches with specialized post-processing.
- Recognition of fine-grained elements (e.g., small R-group indices) can degrade at lower image resolutions; Markush perfect-match rates are lower than for standard structures, though high similarity is maintained.
- Some benchmark comparisons suggest that alternative architectures (e.g., MolScribe) can outperform DECIMER when trained on smaller datasets, indicating different efficiency trade-offs.
- Image-classifier performance is slightly reduced for certain edge cases (e.g., isolated ions misclassified as non-chemical).
- The public web instance imposes processing limits (pages and structures per document), though these can be lifted in local deployments.