MSNovelist: de novo structure generation from mass spectra

M. A. Stravs, K. Dührkop, et al.

MSNovelist, developed by Michael A. Stravs and collaborators, generates candidate molecular structures de novo from tandem mass spectrometry (MS²) spectra. It is aimed at cases where database search falls short, such as compounds that are novel or poorly represented in existing databases.

Introduction
A major challenge in mass spectrometry, particularly in metabolomics and non-targeted analysis, is feature annotation—assigning chemical identities to unknown signals based on their exact mass and MS² spectra. Existing methods, such as library matching and in silico structural database searches, have limitations. Library matching is restricted by the availability of standards and curated data, while in silico methods face challenges in accurately simulating spectra and fragmentation patterns or predicting high-dimensional molecular descriptors. Critically, neither approach effectively identifies truly novel compounds, such as unknown natural products or drug metabolites. The ideal database-independent approach would involve determining the molecular formula, enumerating all possible candidates, and scoring them against experimental data; however, the combinatorial explosion of possible structures renders this impractical. Recent strategies have focused on expanding databases using reaction rules, utilizing spectral networking, employing hybrid search methods, or using machine learning for in silico chemical class assignment. In computational drug design, deep learning has emerged for de novo molecule generation, offering the possibility of querying a vast chemical space without explicit enumeration. While some approaches have utilized molecule generation for creating candidate libraries based on mass and collision cross-section or specific compound classes, these methods do not directly incorporate structural information from MS² spectra, requiring further filtering. The ability to directly use MS² spectra for targeted structure generation would overcome the combinatorial bottleneck in de novo structure elucidation. However, the limited availability of training data has hindered this approach. 
This paper introduces MSNovelist, which addresses the challenge of generating structures directly from MS² spectral information. It combines CSI:FingerID for spectrum-to-fingerprint prediction with an RNN generative model for fingerprint-to-structure conversion. This workflow allows de novo structure elucidation independent of spectral libraries, making it particularly suitable for identifying poorly represented or novel compounds.
Literature Review
The authors review existing methods for small molecule structure elucidation, highlighting the limitations of library-based approaches and in silico methods. They discuss the challenges of de novo structure prediction due to the combinatorial explosion of possible structures. The literature review also covers recent advances in using machine learning and deep learning for molecule generation in the context of drug design. The authors cite studies using variational autoencoders and recurrent neural networks to generate molecules from SMILES strings or graphs. Finally, they address the specific limitations of applying these methods directly to mass spectrometry data due to limited training datasets, noting the use of simulated MS² data based on class-specific fragmentation rules, which are unsuitable for training diverse generative models.
Methodology
MSNovelist performs de novo structure elucidation in two steps. First, it uses SIRIUS and CSI:FingerID to predict a molecular formula and a high-dimensional molecular fingerprint from the MS² spectrum; the fingerprint represents the likelihood of specific structural characteristics. If the molecular formula is already known, the user can specify it and bypass the SIRIUS prediction. Second, an encoder-decoder recurrent neural network (RNN) model predicts structures in SMILES format from the fingerprint, constrained by the molecular formula. The model learns to represent fingerprint features in SMILES strings. For each query, the model returns a list of candidate structures ranked by the RNN model score, which is the probability of the generated sequence. Generated SMILES are validated and dereplicated, and candidates are then re-ranked using a modified Platt score, which measures the match to the query fingerprint.
Training the RNN model does not depend on spectral libraries: fingerprints can be computed for any structure, so the amount of training data is not constrained by the limited availability of MS² spectra. The model was trained on a dataset of 1,232,184 chemical structures from the HMDB, COCONUT, and DSSTox databases, along with 14,047 predicted fingerprints to introduce error into the input. Structures used for fingerprint simulation or evaluation were removed from the training set to ensure an unbiased evaluation of the model's ability to identify unseen structures.
The encoder uses three hidden layers to generate a latent representation of the molecule. The decoder is a three-layer LSTM RNN that predicts SMILES characters sequentially. Auxiliary information, including the remaining atom counts per element and the number of open brackets, is incorporated to help the model generate valid SMILES and molecules that match the specified formula. An auxiliary LSTM predicts hydrogen atom counts.
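The auxiliary information mentioned above (remaining atom counts per element and the number of open brackets) can be illustrated with a small Python sketch. This is not the authors' implementation: the tokenizer, the element sets, and the function names `track_state` and `satisfies_formula` are assumptions made for illustration, hydrogens are ignored, and only round brackets are counted.

```python
import re
from collections import Counter

# Hypothetical tokenizer: two-letter halogens first, then any single character.
TOKEN_RE = re.compile(r"Cl|Br|.")
HEAVY_ATOMS = {"C", "N", "O", "S", "P", "F", "I", "Cl", "Br"}
# Aromatic (lowercase) atoms count toward the same element.
AROMATIC = {"c": "C", "n": "N", "o": "O", "s": "S"}


def track_state(smiles, formula):
    """Yield (token, remaining heavy-atom counts, open brackets) after each token."""
    remaining = Counter(formula)
    open_brackets = 0
    for token in TOKEN_RE.findall(smiles):
        element = AROMATIC.get(token, token)
        if element in HEAVY_ATOMS:
            remaining[element] -= 1  # may go negative if the formula is exceeded
        elif token == "(":
            open_brackets += 1
        elif token == ")":
            open_brackets -= 1
        yield token, dict(remaining), open_brackets


def satisfies_formula(smiles, formula):
    """True if a non-empty SMILES uses up the heavy atoms exactly and closes all brackets."""
    *_, (_, remaining, open_brackets) = track_state(smiles, formula)
    return open_brackets == 0 and all(count == 0 for count in remaining.values())


if __name__ == "__main__":
    # Acetic acid, heavy atoms only: C2O2
    for token, remaining, brackets in track_state("CC(=O)O", {"C": 2, "O": 2}):
        print(token, remaining, brackets)
```

In a generative setting, a state of this kind would be supplied to the decoder at every step so that it can learn to stop emitting an element once its count reaches zero and to close brackets before the sequence ends; here it is only computed after the fact for a finished string.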
The encoder layers use batch normalization and ReLU activation functions. The model was trained with the Adam optimizer using teacher forcing, in which the decoder receives the true previous character, rather than its own prediction, at each step.
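The initial ranking described in this section, where each generated SMILES is scored by the probability of its sequence and duplicates are removed, can be sketched as follows. This is a minimal illustration under stated assumptions: the per-token probabilities are made-up inputs, the function names are hypothetical, and duplicates are detected by plain string equality, whereas a real pipeline would compare canonicalized structures. The modified Platt re-ranking is not shown because its formula is not given here.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a sequence: the sum of its per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def rank_candidates(candidates):
    """Rank unique SMILES by descending sequence log-probability.

    candidates: iterable of (smiles, per-token probabilities).
    If the same SMILES occurs more than once, its best score is kept.
    """
    best = {}
    for smiles, probs in candidates:
        score = sequence_log_prob(probs)
        if smiles not in best or score > best[smiles]:
            best[smiles] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up token probabilities, for illustration only.
    generated = [
        ("CCO", [0.9, 0.8, 0.5]),
        ("CCN", [0.9, 0.9, 0.9]),
        ("CCO", [0.9, 0.8, 0.7]),
    ]
    for smiles, score in rank_candidates(generated):
        print(smiles, round(score, 3))
```

Working with log-probabilities rather than raw products avoids numerical underflow for long sequences and leaves the ranking unchanged, because the logarithm is monotonic.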
Key Findings
MSNovelist's performance was evaluated using two MS² datasets: 3,863 spectra from GNPS and 127 spectra from the CASMI 2016 competition. On the GNPS dataset, MSNovelist ranked the correct structure first for 25% of the spectra and retrieved the correct structure anywhere in the candidate list in 45% of cases. It reproduced 61% of correct database annotations. A database search with CSI:FingerID, using the same fingerprint, achieved a top-ranked accuracy of 39%. In a subset of high-quality GNPS spectra, MSNovelist achieved 68% retrieval and 61% top-ranked accuracy. Incorrect predictions frequently involved structural isomers or molecules with minor structural differences. Similar results were observed for the CASMI 2016 dataset.
An ablation study using a model without fingerprint input showed markedly reduced performance (31% retrieval and 17% top-ranked accuracy on the GNPS dataset), demonstrating the importance of the structural information carried by the fingerprint. Further analysis revealed that the best incorrect de novo predictions had higher Tanimoto similarity scores (median 0.80) than the best training set candidates (0.76), indicating the generation of novel chemical features. The modified Platt scores for the best MSNovelist candidates were comparable to those of database compounds and higher than those of training set compounds. The RNN model alone, without re-ranking by the modified Platt score, gave 19% and 39% correct top-ranked identifications in the GNPS and GNPS-OK datasets, respectively. This indicates that the raw RNN score is informative, but re-ranking enhances performance.
Applied to a bryophyte LC-MS dataset, MSNovelist identified seven novel chemical structures for which the de novo predictions outperformed database candidates based on modified Platt scores. The analysis of these spectra revealed that in four cases the de novo structure explained the MS² spectrum significantly better than the top database candidates; in one case both methods showed similar performance.
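For readers unfamiliar with the Tanimoto similarity used in these comparisons, the following Python sketch shows the standard definition for binary fingerprints: the number of shared on-bits divided by the number of bits set in either fingerprint. Representing a fingerprint as a set of on-bit indices is an assumption made for brevity, and the example bit sets are invented; the fingerprint type used in the evaluation is not specified here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two binary fingerprints.

    Each fingerprint is given as a collection of the indices of its on-bits.
    Returns 0.0 when both fingerprints are empty (a convention chosen here).
    """
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Invented on-bit indices, for illustration only.
    predicted = {1, 2, 3, 4}
    reference = {3, 4, 5}
    print(tanimoto(predicted, reference))  # 2 shared bits / 5 bits in the union
```

A value of 1.0 means the two fingerprints are identical, and values near 0 mean they share almost no structural features.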
Discussion
MSNovelist demonstrates the feasibility of de novo structure generation from MS² spectra without relying on structural databases, challenging the assumption that small molecule chemical space complexity prevents such approaches. It represents the first direct application of a chemical generative model to mass spectrometry data, unlike previous methods which used generative models to create candidate libraries for subsequent identification. The success of MSNovelist is attributed to three key factors: the use of structural fingerprint predictions from MS² spectra, the decoupling of MS² interpretation and structure generation enabling large-scale training, and the successful application of the image captioning analogy to the fingerprint-to-structure translation task. While re-ranking with the modified Platt score improves results, the raw RNN score provides valuable initial information. Alternative approaches, such as treating de novo annotation as an optimization task with a predefined scoring function, could be explored, potentially allowing for the integration of orthogonal information such as retention time. The model’s performance is comparable to expectations from related benchmarks, despite the large space of possible molecules. Even though the training set limits the ability to predict compounds extremely different from known molecules, the application to a bryophyte dataset demonstrated success in finding plausible novel molecules.
Conclusion
MSNovelist successfully demonstrates de novo molecular structure generation from MS² spectra, independent of a structural database. This is a significant advance enabled by utilizing predicted molecular fingerprints, separating structure generation from MS² interpretation, and leveraging the analogy to image captioning. Future work could explore alternative approaches, such as optimization-based methods and the integration of orthogonal information. The model shows promise for identifying novel compounds in biological datasets.
Limitations
The current model is trained only on positive-mode data and requires spectra with at least three fragment ions for optimal performance. The accuracy of formula determination using SIRIUS, especially for high-m/z compounds, can impact results. The model tends to generate structures similar to those in the training set, which can limit the discovery of radically new chemistry. Further, the model's performance is affected by the accuracy of the predicted fingerprint and the quality of the MS² spectra.