Prediction of glycopeptide fragment mass spectra by deep learning

Y. Yang and Q. Fang

DeepGlyco is a deep learning method developed by Yi Yang and Qun Fang for predicting the fragment mass spectra of intact glycopeptides. Beyond spectrum prediction, the approach offers model explainability and supports the differentiation of glycan structural isomers, both of which are relevant to glycoproteomics.

Introduction

Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is the standard method for proteomics and glycoproteomics. Peptide identification relies on matching fragment spectra to theoretical or experimental spectra. This is often done by database searching, which scores matches based on the presence of fragment ions but largely ignores their intensities. Spectral library searching, a complementary approach, correlates fragment ion intensity patterns with library spectra. It provides more discriminative scores than database searching for data-dependent acquisition (DDA) and enables deeper proteome coverage in data-independent acquisition (DIA) experiments across various proteomic fields. However, limited library coverage restricts identification capacity.

Computational methods for predicting peptide spectral libraries are gaining attention, and deep learning has proven beneficial in various aspects of the MS-based proteomics workflow, including the prediction of peptide properties such as detectability, retention times, collisional cross sections, and fragment ion intensities. Deep learning models such as pDeep, DeepMass, Prism, Prosit, and AlphaPeptDeep represent the state of the art for peptide fragment intensity prediction. These predictions improve DDA peptide identification and benefit DIA analysis by enabling predicted spectral libraries to be generated directly from protein sequence databases.

Deep learning models have also been specialized for post-translational modifications (PTMs), but current methods cannot predict fragment spectra for intact glycopeptides. Unlike deglycosylated peptides, intact glycopeptides retain the peptide-glycan link and therefore provide information on peptide sequence, glycan structure, and glycosite. The complexity of glycan structures, which are composed of various monosaccharides and linkages, challenges existing models built for linear sequences. In addition, the fragmentation behavior of intact glycopeptides in MS/MS differs from that of non-glycosylated peptides, so novel architectures are required to learn glycan structures and their relevance to fragmentation. This research addresses the gap by presenting DeepGlyco, a deep learning framework for predicting MS/MS spectra of intact glycopeptides.

Literature Review

Existing tools for peptide property prediction primarily use long short-term memory (LSTM) networks, gated recurrent units, or transformer-based models. These models process linear peptide sequences effectively, but they are not suited to the non-linear structure of glycans. The fragmentation behavior of glycopeptides under higher-energy collisional dissociation (HCD) with stepped collision energy (CE) also differs from that of non-glycosylated peptides. HCD produces sequential cleavages of both glycan and peptide bonds, resulting in spectra that contain peptide fragments (b/y ions) and glycan fragments (B/Y ions). Existing peptide fragment prediction models do not account for these glycan fragments. Graph neural networks (GNNs), recently employed for de novo glycan sequencing, offer a suitable architecture for learning glycan structures and their relationship to fragment ions. Several studies have demonstrated the success of deep learning in predicting peptide properties and MS/MS spectra, improving the sensitivity and specificity of peptide identification and DIA analysis. However, these models are not designed to handle the complexities of intact glycopeptide fragmentation.

Methodology

DeepGlyco is a deep learning framework that processes the peptide and glycan moieties separately and then integrates the information. Peptide sequences, represented by one-hot indicators and element compositions, are processed by linear LSTMs. Glycan structures are represented as trees with monosaccharides as nodes and linkages as edges, and are processed by tree LSTMs in a bottom-up traversal (from the non-reducing end to the reducing end). A sub-node summing variant of the tree LSTM is used, which ignores the order of glycan branches because this information is not readily available from conventional HCD MS/MS glycopeptide identification. Peptide and glycan features are then fused, and the precursor charge state is also included.

Another linear LSTM processes the peptide features to predict peptide b/y fragment intensities. A second tree LSTM processes the glycan features in a top-down direction, so that each node contains information from all parts of the glycan structure. Potential cleavage sites are identified, and the features of lost and retained nodes are aggregated with an attention mechanism to predict Y ion intensities. Isomeric fragments are combined by summing their intensities. For complex and hybrid glycans, a separate module uses a similar approach to predict B ion intensities from glycan branches at the non-reducing end. Peptide and glycan fragment intensities are merged according to a predicted ratio to form the final glycopeptide spectrum. The model has approximately 7 million parameters.

The model was trained and validated on datasets from diverse organisms acquired on Orbitrap mass spectrometers. Redundant spectra were combined into consensus spectra. Spectral angle (SA) loss was used as the objective function, with SA values computed for the peptide fragments, the glycan fragments, and the whole spectrum, and multi-task learning optimized these objectives simultaneously. Model performance was evaluated on hold-out datasets and across different instrument settings.
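The order-invariant, bottom-up pass over a glycan tree described in this section can be illustrated with a minimal Python sketch. This is not DeepGlyco's tree LSTM: the LSTM gates and learned weights are replaced by a fixed `child_weight`, and the monosaccharide alphabet, the tuple-based tree encoding, and the function names are all assumptions made for illustration. The sketch only shows why summing child states makes the result independent of branch order.

```python
# Illustrative monosaccharide alphabet (an assumption, not the paper's encoding).
MONOSACCHARIDES = ("Hex", "HexNAc", "Fuc", "NeuAc")


def one_hot(mono):
    """One-hot vector for a monosaccharide name."""
    return [1.0 if m == mono else 0.0 for m in MONOSACCHARIDES]


def bottom_up_state(node, child_weight=0.5):
    """Compute a feature vector for the subtree rooted at `node`.

    `node` is a hypothetical (monosaccharide, [children]) tuple. Children
    (toward the non-reducing end) are visited first and their states are
    summed, so the order of branches cannot affect the result. The fixed
    `child_weight` stands in for the learned gates of a real tree LSTM.
    """
    mono, children = node
    summed = [0.0] * len(MONOSACCHARIDES)
    for child in children:
        child_state = bottom_up_state(child, child_weight)
        summed = [s + c for s, c in zip(summed, child_state)]
    return [own + child_weight * s for own, s in zip(one_hot(mono), summed)]


# Two encodings of the same toy structure with the branches listed in
# opposite order.
branch_a = ("HexNAc", [("Fuc", []), ("Hex", [])])
branch_b = ("HexNAc", [("Hex", []), ("Fuc", [])])
same_state = bottom_up_state(branch_a) == bottom_up_state(branch_b)
```

Here `same_state` is true because the child states are summed before they are combined with the parent's own features, which mirrors the motivation given above for ignoring branch order.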
Fine-tuning was performed on datasets with different CE settings. The model incorporating branch fragments was trained on datasets excluding high-mannose glycopeptides. To differentiate structural isomers, predicted spectra were used in spectral library searching, and candidates were ranked by a combined similarity score of Y and B ions. To assess the use of DeepGlyco for DIA analysis, models were trained to predict indexed retention time (iRT) values along with MS/MS spectra. The predicted spectral libraries were benchmarked against experimental libraries for the detection of both glycopeptide precursors and site-specific glycans. Libraries with extended coverage were also evaluated. In addition, the false positive rate was assessed by entrapment analysis, and quantitative precision was examined.
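The candidate ranking step can be sketched as follows. The summary does not state how the Y and B ion similarities are combined or which similarity measure is used, so this sketch assumes cosine similarity and a weighted average controlled by a hypothetical `b_weight` parameter. The fragment labels, candidate names, and intensities are invented toy values, not data from the study.

```python
import math


def cosine(observed, predicted):
    """Cosine similarity between two {fragment label: intensity} dicts."""
    keys = set(observed) | set(predicted)
    dot = sum(observed.get(k, 0.0) * predicted.get(k, 0.0) for k in keys)
    norm_o = math.sqrt(sum(v * v for v in observed.values()))
    norm_p = math.sqrt(sum(v * v for v in predicted.values()))
    if norm_o == 0.0 or norm_p == 0.0:
        return 0.0
    return dot / (norm_o * norm_p)


def rank_candidates(observed_y, observed_b, candidates, b_weight=0.5):
    """Rank candidate structures by a combined Y/B similarity (best first).

    `candidates` maps a candidate name to (predicted_y, predicted_b).
    The weighted average is an assumed way of combining the two scores.
    """
    scored = []
    for name, (pred_y, pred_b) in candidates.items():
        score = ((1.0 - b_weight) * cosine(observed_y, pred_y)
                 + b_weight * cosine(observed_b, pred_b))
        scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy observed spectrum and two hypothetical isomer candidates.
observed_y = {"Y-HexNAc1": 1.0, "Y-HexNAc1Fuc1": 0.8}
observed_b = {"B-HexNAc1Hex1": 1.0}
candidates = {
    "core_fucosylated": (
        {"Y-HexNAc1": 1.0, "Y-HexNAc1Fuc1": 0.7},
        {"B-HexNAc1Hex1": 1.0},
    ),
    "antenna_fucosylated": (
        {"Y-HexNAc1": 1.0},
        {"B-HexNAc1Hex1": 0.4, "B-HexNAc1Hex1Fuc1": 1.0},
    ),
}
ranking = rank_candidates(observed_y, observed_b, candidates)
best_name = ranking[0][0]
```

In this toy case the core-fucosylated candidate ranks first, because its predicted Y ions include the fucosylated core fragment seen in the observed spectrum and its predicted B ions match the unfucosylated antenna fragment.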

Key Findings

DeepGlyco achieved high accuracy in predicting glycopeptide MS/MS spectra when trained on data from the same organism and instrument settings. Benchmarking on the Mouse 1 and Human 1 datasets showed high spectral similarity (median SA values below 0.16 for whole spectra, corresponding to dot product values above 0.97). Cross-organism and cross-instrument predictions showed lower accuracy, indicating limited generalization capabilities. Fine-tuning improved prediction accuracy when the model was applied to new instrument configurations.

The model successfully differentiated the MS/MS spectra of glycopeptide structural isomers, and spectral library searching based on predicted spectra effectively ranked candidate glycan structures. In experiments differentiating structural isomers, the correct identity was ranked among the top three candidates in 92–95% of cases. Recognition of core fucosylation and bisecting HexNAc achieved high accuracy (93–97% and 88–95%, respectively). Analysis of mouse brain datasets with endoglycosidase treatment further validated the model's ability to identify specific glycan modifications.

Predicted spectral libraries proved compatible with DIA analysis. Predicted libraries identified slightly fewer glycopeptides than experimental libraries while achieving comparable data completeness; however, libraries with expanded coverage increased the number of detected glycopeptides and site-specific glycans. In the analysis of a human serum dataset, using predicted fragment spectra resulted in a gain of 7% in precursors and 10% in site-specific glycans compared with the experimental library. Analysis of mixed-organism samples showed that DeepGlyco's quantification accuracy was close to, and sometimes better than, that of the experimental library. The attention weights within the model provided insights into glycan fragmentation pathways and facilitated the differentiation of structural isomers.
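The correspondence between SA and dot product quoted in the accuracy results can be checked with a short calculation. The summary does not define SA, so this sketch assumes the commonly used normalized spectral angle distance, SA = 2·arccos(DP)/π, where DP is the normalized dot product of two intensity vectors and lower SA means higher similarity. The function names are illustrative.

```python
import math


def normalized_dot_product(a, b):
    """Normalized dot product (cosine) of two equal-length intensity lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dot_product_to_sa(dp):
    """Assumed SA definition: 2 * arccos(DP) / pi, so 0 means identical."""
    dp = max(-1.0, min(1.0, dp))  # guard against rounding outside [-1, 1]
    return 2.0 * math.acos(dp) / math.pi


def sa_to_dot_product(sa):
    """Inverse of dot_product_to_sa."""
    return math.cos(sa * math.pi / 2.0)


# Dot product implied by an SA of 0.16 under the assumed definition.
dp_at_threshold = sa_to_dot_product(0.16)
```

Under this assumption an SA of 0.16 corresponds to a dot product of about 0.969, which agrees with the rounded value of 0.97 reported above.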

Discussion

DeepGlyco's success in predicting glycopeptide MS/MS spectra stems from its ability to handle the non-linear nature of glycan structures using tree LSTMs and graph neural networks. The model's explainability, which comes from its attention mechanisms, allows fragmentation pathways to be investigated and isomers to be differentiated. This capability extends spectral library searching to glycan structure differentiation, improving structural annotation. Although the model's generalization across different organisms and instrument settings is limited, fine-tuning improves its adaptability. The application of DeepGlyco to DIA analysis demonstrates its utility in enhancing glycoproteome coverage and data quality, though realizing its full potential depends on future improvements in DIA data analysis algorithms. DIA analysis currently needs a starting glycopeptide list to manage the search space; this limitation could be addressed through further research on deep learning scoring models that are compatible with proteome-scale predicted libraries.

Conclusion

DeepGlyco is a novel deep learning method for predicting intact glycopeptide MS/MS spectra. Its use of tree-structured LSTMs and graph neural networks effectively models glycan structures and fragmentation pathways. The model exhibits high prediction accuracy, facilitates structural isomer differentiation through spectral library searching, and is suitable for DIA glycoproteomics, improving both depth and quality of identifications. Future work could focus on improving generalization ability by incorporating spectral metadata and adapting the model to other glycopeptide types and fragmentation techniques.

Limitations

The model's generalization ability across diverse organisms and instrument settings is limited, although fine-tuning helps mitigate this. The reliance on a starting glycopeptide list in DIA analysis restricts the full utilization of predicted spectral libraries. The accuracy of glycan structure identification from spectral library searching alone remains a challenge, though it offers valuable support to other identification methods. The study does not encompass all possible glycan structures or fragmentation pathways; limited training data currently restrict the comprehensive inclusion of diverse glycans.
