An end-to-end deep learning framework for translating mass spectra to de-novo molecules

Chemistry

E. E. Litsa, V. Chenthamarakshan, et al.

Discover how Eleni E. Litsa and her team developed Spec2Mol, a deep learning model that translates tandem mass spectra directly into candidate molecular structures. This end-to-end approach complements traditional database-driven methods, paving the way for identifying novel molecules and advancing chemical research.

Introduction
The study addresses the challenge of identifying chemical structures from mass spectrometry data, especially for novel compounds absent from spectral or structural databases. Conventional workflows rely on database matching or in silico fragmentation and fingerprint prediction, which fail when the target molecule is unknown or underrepresented. The research question is whether an end-to-end deep learning model can directly translate tandem MS spectra into plausible, de novo molecular structures (SMILES) without relying solely on database retrieval. This is important for metabolomics, drug development, forensics, and biomarker discovery, where the coverage of experimental spectra is limited and a large fraction of detected features remains unidentified.
Literature Review
Existing methods fall into five groups:
(1) Database matching against reference spectra (e.g., HMDB, METLIN, MassBank, mzCloud), which suffers from limited coverage of real metabolomes.
(2) In silico fragmentation, which expands spectral libraries using rule-based, combinatorial/optimization, and machine learning predictors; effective for peptides but challenging for small molecules due to their stochastic fragmentation.
(3) Two-step approaches that predict molecular fingerprints from spectra and then query structural databases; ineffective for truly novel molecules absent from libraries.
(4) Fragmentation trees computed from MS/MS, which elucidate dependencies between peaks and aid formula/fingerprint prediction and clustering, but are computationally expensive.
(5) Machine learning methods: earlier SVM/RF and kernel-based models operating on spectra or fragmentation trees, and more recently deep learning for spectra embeddings, similarity scoring (e.g., Spec2Vec), clustering and classification (CNNs, NLP-inspired models), and transfer learning/hybrid strategies.
In proteomics, end-to-end deep learning (e.g., DeepNovo) performs de novo peptide sequencing from spectra; small-molecule structure elucidation is harder, however, because fragmentation is stochastic and molecules are graph-structured rather than sequential. These limitations motivate an end-to-end model that can recommend de novo molecules directly from raw MS/MS spectra.
Methodology
Spec2Mol is an encoder–decoder framework that translates MS/MS spectra into molecular SMILES sequences. The decoder is pre-trained on large unlabeled molecule datasets; the spectra encoder is subsequently trained to align spectra embeddings with the decoder's SMILES embedding space. Key components:
- Datasets:
  - Molecule dataset for pre-training: ~135 million molecules from PubChem and ZINC, represented as canonical SMILES without stereochemistry (stereochemistry is excluded because it is difficult or impossible to recover reliably from MS/MS alone).
  - Spectral dataset for training the spectra encoder: NIST 2020 Tandem MS library (>1M spectra from >30K compounds), predominantly metabolites and drugs, acquired on diverse instruments and under diverse conditions. After filtering to standardize conditions, the working dataset comprises ~23K molecules with up to four selected MS/MS spectra each.
- Spectra selection and representation (a minimal binning sketch follows this list):
  - For each molecule, up to four MS/MS spectra are used: two precursor types ([M+H]+ and [M−H]−), each at low (35% NCE) and high (130% NCE) collision energies. Missing spectra are allowed and represented as empty (all-zero) channels to promote robustness.
  - Each spectrum is discretized into an intensity vector over m/z bins and normalized by the spectrum's maximum intensity.
- Data augmentation: For training, additional instances are created by replacing each of the four spectra with the spectrum at the closest available collision energy, keeping precursor type and instrument fixed, exploiting the similarity of spectra acquired at nearby energies.
- Data partitioning: The ~23K molecules are split into training, validation, and test sets (~1K molecules each in validation and test). The test set is kept disjoint from train/validation by fingerprint (Tanimoto) similarity to avoid leakage.
- Model architectures and training (an illustrative alignment-training sketch follows this list):
  - Pre-trained SMILES autoencoder (AE): sequence-to-sequence GRU encoder and GRU decoder trained to translate randomized input SMILES into canonical SMILES, learning a continuous latent embedding of molecules. Training minimizes a character-level reconstruction loss; stereochemistry is not modeled.
  - Spectra encoder: a 1-D CNN with four input channels (the four spectral conditions), followed by fully connected layers that output a latent vector. It is trained in a supervised fashion to minimize the distance (RMSE) between spectra embeddings and the corresponding pre-trained SMILES embeddings, aligning both modalities in a shared latent space.
- Inference and molecule recommendation (a combined sketch follows this list):
  - Two strategies are used and combined:
    1) Direct generation: decode the spectra embedding with the pre-trained GRU decoder using sampling to generate multiple SMILES; retain only syntactically valid SMILES.
    2) Indirect retrieval: find the nearest embeddings among the pre-training molecule pool in latent space and decode those embeddings into SMILES.
  - Ranking: candidates from both strategies are combined and ranked by the absolute discrepancy between their molecular weight and the molecular weight inferred from the MS data; the top-20 predictions are returned.
- Implementation details, hyperparameters, and additional preprocessing are described in the supplementary material. Code is available at https://github.com/KavrakiLab/Spec2Mol. The NIST spectra used for training and evaluation are commercial and not publicly distributable.
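As a concrete illustration of the spectra representation described above, here is a minimal sketch, not the released implementation: it discretizes one peak list into a fixed-length, max-normalized intensity vector and stacks the four condition channels. The 0.1 Da bin width and 0–1000 m/z range (the N_BINS and BIN_WIDTH constants) are assumptions for illustration; the actual binning parameters are given in the paper's supplementary material.

```python
import numpy as np

N_BINS = 10000      # assumed: 0-1000 m/z at 0.1 Da resolution (illustrative)
BIN_WIDTH = 0.1

def bin_spectrum(mz, intensity):
    """Discretize one (m/z, intensity) peak list into a max-normalized vector."""
    vec = np.zeros(N_BINS, dtype=np.float32)
    for m, i in zip(mz, intensity):
        idx = int(m / BIN_WIDTH)
        if 0 <= idx < N_BINS:
            vec[idx] = max(vec[idx], i)      # keep the strongest peak per bin
    peak = vec.max()
    return vec / peak if peak > 0 else vec   # normalize by the maximum intensity

def stack_channels(spectra):
    """Stack the four condition channels ([M+H]+/[M-H]- at low/high energy)
    into a (4, N_BINS) array; missing spectra become empty (all-zero) channels."""
    return np.stack([bin_spectrum(*s) if s is not None
                     else np.zeros(N_BINS, dtype=np.float32)
                     for s in spectra])
```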
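The alignment training of the spectra encoder can be sketched in PyTorch as below. The four-channel 1-D CNN, the fully connected head, and the RMSE objective against frozen SMILES-autoencoder embeddings follow the description above; the layer sizes, kernel widths, the 512-dimensional latent space, and the names SpectraEncoder and train_step are assumptions, not the published hyperparameters.

```python
import torch
import torch.nn as nn

class SpectraEncoder(nn.Module):
    """1-D CNN over four spectrum channels, followed by fully connected layers."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(64),          # fixed-length output regardless of n_bins
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, x):                      # x: (batch, 4, n_bins)
        return self.fc(self.conv(x))

def train_step(encoder, optimizer, spectra, smiles_embeddings):
    """One alignment step: pull spectra embeddings toward the pre-trained
    SMILES embeddings (RMSE loss); the SMILES autoencoder stays frozen."""
    optimizer.zero_grad()
    pred = encoder(spectra)
    loss = torch.sqrt(nn.functional.mse_loss(pred, smiles_embeddings))
    loss.backward()
    optimizer.step()
    return loss.item()
```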
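Finally, a hedged sketch of the recommendation step, combining direct generation, indirect latent-space retrieval, and ranking by molecular-weight discrepancy. Here decoder.sample, pool_embeddings, and pool_smiles are hypothetical placeholders for the pre-trained GRU decoder and the pre-training molecule pool; RDKit is used only for validity checks and molecular weights, and the indirect branch is simplified to return the neighbors' own SMILES rather than decoding their embeddings.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def recommend(spec_embedding, decoder, pool_embeddings, pool_smiles,
              expected_mw, n_samples=100, n_neighbors=20, top_k=20):
    # Direct generation: sample SMILES from the pre-trained decoder
    # conditioned on the spectra embedding (hypothetical API).
    candidates = set(decoder.sample(spec_embedding, n_samples))
    # Indirect retrieval: nearest pre-training molecules in the latent space.
    # (The paper decodes the neighbor embeddings; taking their SMILES directly
    # is a simplification for this sketch.)
    dists = np.linalg.norm(pool_embeddings - spec_embedding, axis=1)
    candidates.update(pool_smiles[i] for i in np.argsort(dists)[:n_neighbors])
    # Keep only syntactically valid SMILES, then rank by |MW - expected MW|.
    scored = []
    for smi in candidates:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            scored.append((abs(Descriptors.ExactMolWt(mol) - expected_mw), smi))
    return [smi for _, smi in sorted(scored)[:top_k]]
```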
Key Findings
- AE reconstruction: The pre-trained SMILES AE correctly reconstructed canonical SMILES for ~93.3% of NIST molecules (vs. 94.95% on a held-out pre-training test set), indicating strong coverage of diverse structures.
- Overall Spec2Mol performance (full 1000-case test set, combined generation strategies):
  - Exact structure recovery: 7.0% of cases.
  - Exact molecular formula recovery: 39.3% of cases.
  - Molecular weight discrepancy (relative to average reference MW): DMW_min 2.3; DMW_avg 6.3.
  - Molecular formula distance (normalized by average heavy atom count): DMF_min 9.2; DMF_avg 21.7.
  - Fingerprint cosine similarity (Morgan radius 2, 1024 bits; see the sketch after this list): max 0.53; avg 0.36.
  - Maximum common substructure (MCS) similarity (see the second sketch after this list): MCSratio max 0.68, avg 0.51; MCStan max 0.55, avg 0.38; MCScoef max 0.71, avg 0.54.
- Effect of missing spectra (Table 1): Performance is stable with 3–4 spectra but degrades with ≤2 spectra. With 3 spectra: exact structure 15.2%, exact formula 46.9%, fingerprint cosine max 0.57, MCSratio max 0.72. With 1 spectrum: exact structure 5.1%, exact formula 20.5%.
- Generation strategy ablation (Table 2):
  - Direct-only: exact structure 0.8%, exact formula 26.1%, fingerprint cosine max 0.46.
  - Indirect-only: exact structure 6.9%, exact formula 28.0%, fingerprint cosine max 0.53.
  - Combined: exact structure 7.0%, exact formula 39.3%, fingerprint cosine max 0.53.
  The indirect strategy contributes most; combining both yields the best overall metrics.
- Atom-species detection (Table 3): Sensitivity is high for common atoms (O 0.94, N 0.86) but specificity is lower for O (0.50). Rare atoms show high specificity (e.g., S 0.96, Cl 0.91, F 0.92, Br 0.98, P 0.99, I 0.99) but lower sensitivity (≥0.5 for all).
- Effect of molecular weight and heteroatoms (Table 4):
  - MW < 300 Da: exact structure 8.5%, exact formula 45.4% (n=668).
  - MW ≥ 300 Da: exact structure 3.9%, exact formula 27.1% (n=332).
  - Only C and O: exact structure 9.8%, exact formula 63.6% (n=184). N, S, and halogens reduce identification rates.
- Comparative evaluation vs. SIRIUS/CSI:FingerID:
  - On 947 positive-mode cases, SIRIUS recovered correct formulas for ~98% and exact structures for ~67%; SIRIUS was trained on NIST v17 and uses fragmentation trees.
  - On the 307 cases where SIRIUS failed to find the exact structure, Spec2Mol's recommended structures were at least as relevant by structural similarity: fingerprint cosine max 0.49 (both); averages 0.34 (Spec2Mol) vs. 0.33 (SIRIUS). MCSratio max 0.66 vs. 0.65; MCScoef max 0.69 vs. 0.66; MCStan max 0.53 vs. 0.54; average MCS metrics slightly favor Spec2Mol (ratio 0.50 vs. 0.49; coef 0.53 vs. 0.49).
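For reference, the fingerprint similarity metric reported above (cosine similarity between 1024-bit Morgan fingerprints of radius 2) can be reproduced with RDKit roughly as follows; this mirrors the stated metric, not the authors' evaluation script.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint_cosine(smiles_a, smiles_b, radius=2, n_bits=1024):
    """Cosine similarity between Morgan fingerprints (radius 2, 1024 bits)."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.CosineSimilarity(fp_a, fp_b)
```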
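Similarly, the three MCS-based scores can be sketched with RDKit, assuming the common definitions (MCSratio = |MCS| / |reference|, MCStan = |MCS| / (|prediction| + |reference| − |MCS|), MCScoef = |MCS| / min(|prediction|, |reference|), with sizes counted in heavy atoms); the paper's exact definitions may differ and are given in its supplementary material.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

def mcs_scores(pred_smiles, ref_smiles):
    """MCS-based similarity scores under the assumed definitions above."""
    pred = Chem.MolFromSmiles(pred_smiles)
    ref = Chem.MolFromSmiles(ref_smiles)
    mcs = rdFMCS.FindMCS([pred, ref]).numAtoms     # size of the common substructure
    n_pred, n_ref = pred.GetNumHeavyAtoms(), ref.GetNumHeavyAtoms()
    return {
        "MCSratio": mcs / n_ref,
        "MCStan": mcs / (n_pred + n_ref - mcs),
        "MCScoef": mcs / min(n_pred, n_ref),
    }
```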
Discussion
The results demonstrate that an end-to-end deep learning approach can map raw MS/MS spectra into a molecular embedding space that supports de novo SMILES generation and retrieval of structurally relevant candidates, even without fragmentation trees or prior structural databases. Spec2Mol reliably infers coarse physicochemical attributes (notably molecular weight and elemental composition) and captures substantial substructure overlap with reference molecules (common substructures approaching ~70% of the reference size for the best predictions). Performance depends on input completeness (3–4 spectra preferred), molecular size (better for MW < 300 Da), and atom-type prevalence (better for common atoms like O and N). Combining direct generation with nearest-neighbor retrieval in latent space yields the best accuracy, suggesting the learned embedding organizes chemical space meaningfully for both generation and retrieval tasks. In comparison to a state-of-the-art database-retrieval approach (CSI:FingerID via SIRIUS), Spec2Mol provides recommendations of comparable structural relevance specifically for cases where exact matches are absent, indicating its utility for novel compound discovery and as a complementary tool when fragmentation-tree-based methods fail or are inapplicable.
Conclusion
Spec2Mol introduces an end-to-end framework that translates MS/MS spectra into molecular structures using a pre-trained SMILES autoencoder and a CNN-based spectra encoder aligned in a shared embedding space. It enables both de novo generation and latent-space retrieval of candidate molecules, achieves meaningful structural similarity to reference molecules, recovers exact molecular formulas in a substantial fraction of cases, and performs on par with fragmentation-tree-dependent methods when exact database matches are unavailable. The approach can complement existing tools by proposing plausible candidates for novel molecules and guiding downstream analyses. Future work includes integrating prior knowledge (e.g., fragmentation trees) to further improve accuracy, handling stereochemistry when feasible, expanding training coverage of underrepresented chemotypes (e.g., large-ring systems, rarer heteroatoms), and leveraging the indirect strategy for efficient library searches via substructure constraints or embedding-based nearest-neighbor search.
Limitations
- The method does not model stereochemistry and cannot recover stereochemical details from MS/MS.
- Performance degrades with fewer than three input spectra and for higher-molecular-weight compounds.
- Underrepresentation of certain chemotypes (e.g., large rings) and rare heteroatoms in pre-training impairs accuracy; generating valid SMILES for very large rings is challenging.
- Sensitivity/specificity for atom presence varies: oxygen shows lower specificity (more false positives), while rare atoms exhibit lower sensitivity.
- Ranking of candidates relies on a known or inferred molecular weight.
- Training uses a commercial spectra dataset (NIST 2020) that is not publicly available; generalization to other instruments and conditions may vary despite normalization and filtering.
- The quality of the input spectra (e.g., very few peaks) limits reconstruction capability.