Introduction
Identifying the chemical compounds in a sample is fundamental to chemical analysis, with applications including metabolomics (identifying molecules in biological samples for biomarker discovery and for understanding human physiology and disease), identification of plant-based bioactive compounds for drug and supplement development, quality control and impurity detection in product development (pharmaceuticals and agrochemicals), safety studies (identifying potential human metabolites), and forensic analysis. Structure elucidation, typically performed with mass spectrometry (MS) and nuclear magnetic resonance (NMR), is time-consuming and laborious. MS is used more frequently because of its sensitivity and specificity. In MS, molecules are separated (often by liquid chromatography (LC) or gas chromatography (GC)), ionized (by sources such as electron ionization (EI), chemical ionization (CI), or electrospray ionization (ESI)), and fragmented; the mass-to-charge (m/z) ratios of the resulting ions are recorded as a mass spectrum (m/z on the horizontal axis, relative abundance on the vertical axis). Tandem mass spectrometry (MS/MS) applies sequential fragmentation to obtain more detailed information, producing MS2, MS3, etc. spectra. The resulting spectra depend on the molecule's structure and on the experimental conditions (instrument, collision energy, precursor ion, ionization mode). Once obtained, a mass spectrum is typically matched against spectral databases (e.g., Human Metabolome Database, METLIN, MassBank, mzCloud) to identify the structure. However, these databases cover only a small percentage of existing metabolites; for human metabolites, an estimated <10% have experimental reference mass spectra. This limitation leads to a low identification rate in untargeted metabolomics studies (often <2%).
Literature Review
To address the scarcity of experimental spectral data, in silico fragmentation predicts spectra for known molecular structures that lack experimental spectra. While successful for peptides, this is challenging for small molecules because of the stochastic nature of their fragmentation. Alternative approaches map spectra to intermediate representations (e.g., molecular fingerprints) and then search structural databases, but these also fail for novel molecules that are absent from the databases. Rule-based methods that associate spectral features with substructures have been explored and offer partial structure identification. Fragmentation trees, computationally derived from tandem mass spectra, provide complementary information by showing dependencies between mass peaks. They are used for predicting molecular fingerprints but are computationally expensive. Early machine learning (ML) approaches used shallow models (SVMs, Random Forests) on mass spectra features or fragmentation trees. More recent work explores deep learning (DL) for structure elucidation, including learning spectra embeddings for similarity searches and DL-based clustering. However, the limited data necessitates workarounds such as hybrid approaches and transfer learning. In proteomics, DeepNovo, an end-to-end DL architecture, successfully performs de novo peptide sequencing from mass spectra. Structure elucidation of small molecules is harder because of the stochastic fragmentation and because small molecule structures are graph-like, whereas peptide sequences are linear.
Methodology
This paper presents Spec2Mol, an end-to-end deep learning architecture for translating MS/MS spectra to molecular structures (represented as SMILES sequences). It uses an encoder-decoder architecture: an encoder learns MS/MS spectrum embeddings, and a decoder generates SMILES sequences of recommended chemical structures. Because spectra data are limited, unsupervised pre-training is used. The decoder is pre-trained as part of an autoencoder (AE) architecture, trained to reconstruct molecules from their SMILES sequences. The encoder is then trained to match the AE's learned embeddings. The NIST Tandem Mass Spectral Library 2020, containing over 1 million spectra from more than 30,000 compounds, provides the mass spectra data. A large dataset of 135 million molecules from the PubChem and ZINC databases, represented using SMILES notation (without stereochemistry information), is used to pre-train the autoencoder. To minimize variations due to experimental conditions, only spectra from the most common precursor ions ([M+H]+ and [M-H]−) and two energy levels (low and high collision energy) are used, giving up to four input spectra per compound. Missing spectra are represented as empty spectra to make the model robust to missing data. Each spectrum is represented as a vector whose entries correspond to m/z bins and hold the peak intensities (normalized by the maximum intensity). Data augmentation is performed by slightly perturbing collision energy in the spectra. The dataset is split into training, validation, and test sets. The autoencoder (with GRU-based encoder and decoder) is trained on a SMILES translation task (randomized input SMILES to canonical SMILES). The spectra encoder (with 1D CNN layers and fully connected layers) is trained to minimize the distance between its latent representation and the SMILES encoder's representation. During inference, Spec2Mol uses two strategies: direct generation of SMILES from the MS/MS embedding, and retrieval of structures from the pre-training dataset based on embedding distance.
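The spectrum representation described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `vectorize_spectrum`, the 1000 m/z upper limit, the 1.0 bin width, and the choice to keep the strongest peak per bin are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def vectorize_spectrum(
    peaks: Optional[Sequence[Peak]],
    max_mz: float = 1000.0,   # assumed upper m/z limit
    bin_width: float = 1.0,   # assumed bin width
) -> List[float]:
    """Bin peaks by m/z and scale intensities so the largest bin equals 1."""
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    if not peaks:
        # A missing spectrum is encoded as an empty (all-zero) vector.
        return vector
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            # Assumption: keep the strongest peak that falls in each bin.
            vector[index] = max(vector[index], intensity)
    strongest = max(vector)
    if strongest > 0:
        vector = [value / strongest for value in vector]
    return vector


# Hypothetical input: one slot per precursor ion and energy level.
spectra: Dict[str, Optional[List[Peak]]] = {
    "[M+H]+ low": [(100.2, 50.0), (100.7, 20.0), (250.0, 100.0)],
    "[M+H]+ high": [(58.1, 80.0), (100.2, 40.0)],
    "[M-H]- low": None,   # not measured
    "[M-H]- high": None,  # not measured
}
model_input = {name: vectorize_spectrum(peaks) for name, peaks in spectra.items()}
```

Encoding an unmeasured slot as an all-zero vector keeps the input shape fixed regardless of how many of the four spectra are available.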
Predictions are ranked by the discrepancy between their molecular weight and that of the reference, and the top 20 are returned.
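The ranking step can be illustrated as follows. This is a sketch under stated assumptions: the `Candidate` type and `rank_by_weight` function are hypothetical names, molecular weights are assumed to be precomputed (for example by a cheminformatics toolkit), and a smaller absolute discrepancy is assumed to rank higher.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    smiles: str
    mol_weight: float  # assumed to be precomputed, in Da


def rank_by_weight(
    candidates: List[Candidate],
    reference_weight: float,
    top_k: int = 20,
) -> List[Candidate]:
    """Sort candidates by absolute molecular weight discrepancy, keep top_k."""
    ordered = sorted(
        candidates,
        key=lambda c: abs(c.mol_weight - reference_weight),
    )
    return ordered[:top_k]


# Illustrative candidates: ethanol, acetic acid, 1-propanol.
candidates = [
    Candidate("CCO", 46.07),
    Candidate("CC(=O)O", 60.05),
    Candidate("CCCO", 60.10),
]
ranked = rank_by_weight(candidates, reference_weight=60.05)
```

Because `sorted` is stable, candidates with equal discrepancy keep their original order, so any prior ordering (for instance by embedding distance) acts as a tie-breaker.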
Key Findings
The pre-trained autoencoder correctly reconstructed SMILES sequences for approximately 93.3% of the NIST molecules, demonstrating its ability to handle diverse molecules. Spec2Mol's performance was evaluated by assessing the similarity between generated structures and reference structures. A coarse-level comparison used physicochemical properties (molecular weight, element composition). Substructure-level comparisons used fingerprint similarity (cosine coefficient) and maximum common substructure (MCS) metrics (MCS ratio, MCS Tanimoto, overlap coefficient). The model correctly predicted the exact structure for 7% and the exact formula for 39.3% of the test cases. Performance degraded when fewer than three of the four input spectra were available. The indirect approach (retrieving molecules from the pre-training dataset) contributed more to effective structure generation than the direct approach. Predicted structures had molecular weights close to those of the reference compounds. The model's ability to identify atom species varied, correlating with their frequency in the training dataset: high sensitivity was observed for common atoms (N, O) and high specificity for rare atoms. The model identified the exact structure or formula for almost half of the molecules with MW < 300 Da and for more than 60% of molecules containing only C and O. Higher molecular weight and under-represented atoms decreased identification accuracy. The model successfully identified key substructures (rings, long chains, functional groups, rare atoms) in several cases. Difficulties were observed with molecules containing large rings (under-represented in the training data) and molecules with poor-quality spectra. A comparative evaluation with SIRIUS 4, a state-of-the-art method using fragmentation trees and database retrieval (CSI:FingerID), was conducted on the cases where SIRIUS failed to identify the exact structure.
Spec2Mol showed comparable or slightly better performance in terms of fingerprint similarity and MCS metrics, despite not using fragmentation tree information.
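For readers unfamiliar with the similarity metrics named above, the sketch below shows their standard textbook forms. It assumes binary fingerprints stored as sets of on-bit indices and assumes the MCS atom count has already been computed by a cheminformatics toolkit; the function names are hypothetical, and the paper's exact definitions (in particular of the MCS ratio, which is omitted here) may differ.

```python
import math
from typing import Set


def cosine_similarity(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Cosine coefficient of two binary fingerprints given as on-bit sets."""
    if not bits_a or not bits_b:
        return 0.0
    return len(bits_a & bits_b) / math.sqrt(len(bits_a) * len(bits_b))


def mcs_tanimoto(mcs_atoms: int, atoms_a: int, atoms_b: int) -> float:
    """Tanimoto coefficient: shared atoms over the union of atoms."""
    union = atoms_a + atoms_b - mcs_atoms
    return mcs_atoms / union if union else 0.0


def mcs_overlap(mcs_atoms: int, atoms_a: int, atoms_b: int) -> float:
    """Overlap coefficient: shared atoms over the smaller molecule's atoms."""
    smaller = min(atoms_a, atoms_b)
    return mcs_atoms / smaller if smaller else 0.0


# Illustrative values: two fingerprints sharing 2 of 4 on-bits each,
# and an MCS of 6 atoms between molecules of 8 and 10 heavy atoms.
fp_similarity = cosine_similarity({1, 2, 3, 4}, {3, 4, 5, 6})
tanimoto = mcs_tanimoto(6, 8, 10)
overlap = mcs_overlap(6, 8, 10)
```

The overlap coefficient reaches 1 whenever the smaller molecule is fully contained in the larger one, whereas the Tanimoto coefficient also penalizes the unmatched atoms, so the two are informative together.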
Discussion
Spec2Mol addresses the challenge of identifying novel molecules, which existing database-retrieval methods cannot handle. The results indicate that Spec2Mol learns associations between spectra and structural features. Its comparable or slightly better performance relative to SIRIUS, a method that uses additional fragmentation tree information, on the cases where SIRIUS did not identify the exact structure highlights the model's ability to extract relevant structural information directly from spectra. The capability of identifying key substructures even when the exact structure is not found is valuable in metabolomics and other fields dealing with novel or poorly characterized molecules. The reliance on raw MS/MS spectra as input makes Spec2Mol particularly useful when fragmentation tree data are unavailable or computationally prohibitive. The study contributes to the field by offering a novel approach for de novo structure prediction.
Conclusion
Spec2Mol provides a significant advancement in structure elucidation by offering an end-to-end deep learning framework for generating molecular structures directly from MS/MS spectra. Its capacity to handle novel molecules and comparable performance to state-of-the-art methods using additional information establishes its value as a complementary tool in chemical analysis. Future work could incorporate prior knowledge (e.g., fragmentation trees) to potentially improve performance further. The indirect method could be extended to identify molecules from a relevant library using substructure-constrained similarity or nearest neighbor search.
Limitations
The study's reliance on a commercial dataset for mass spectra limits the generalizability of findings. The absence of stereochemistry information in the SMILES representation is a constraint; recovering such information from mass spectra alone remains challenging. Performance degradation with fewer available spectra indicates a need for further improvement in handling incomplete data. Difficulties with large rings suggest potential improvements through data augmentation or architectural modifications.