Prediction of glycopeptide fragment mass spectra by deep learning
Y. Yang and Q. Fang
DeepGlyco is a deep learning method developed by Yi Yang and Qun Fang for glycoproteomics. It predicts fragment spectra of intact glycopeptides, improves model explainability, and supports the differentiation of glycan structural isomers.
Introduction
The study addresses the challenge of predicting MS/MS fragment ion intensities for intact glycopeptides, which contain a peptide backbone covalently linked to complex, non-linear glycans. Existing deep learning predictors developed for peptides assume linear sequences and do not model glycan branching, nor do they predict glycan-derived fragment ions (B/Y ions) characteristic of stepped-HCD fragmentation of N-glycopeptides. The authors propose DeepGlyco, a model that integrates peptide-sequence encoders with tree-structured LSTMs for glycans and a graph neural framework for glycan fragmentation pathways, aiming to accurately predict complete glycopeptide spectra (peptide b/y and glycan B/Y ions), improve explainability, differentiate structural glycan isomers, and enable construction of predicted spectral libraries to enhance DIA glycoproteomics analyses.
Literature Review
Spectral library searching complements database searching by leveraging fragment intensity patterns and often improves identification in DDA and DIA workflows. Deep learning has been widely adopted in proteomics to predict peptide properties including detectability, LC retention times, ion mobility CCS, and MS/MS fragment intensities (e.g., pDeep, DeepMass, Prosit, AlphaPeptDeep). Predicted peptide spectra have boosted identification sensitivity/specificity and enabled proteome-scale DIA using in silico libraries. Specialized models have addressed PTMs (e.g., DeepPhospho, DeepFLR). However, prior models process linear peptide inputs and treat simple PTMs as tags; they cannot represent and learn the nonlinear tree topology of glycans nor predict glycan fragment intensities present in intact glycopeptide HCD spectra. Furthermore, glycopeptide fragmentation behaviors differ from non-glycosylated peptides, necessitating architectures (tree/graph models) that can encode glycan structures and fragmentation pathways.
Methodology
Model architecture: DeepGlyco splits each glycopeptide into peptide and glycan moieties. Peptides are represented by one-hot amino acid vectors with PTMs encoded as elemental composition embeddings and processed by stacked bidirectional LSTMs (hidden size 256) with dropout. Glycans are represented as trees with one-hot monosaccharide nodes (Hex, HexNAc, NeuAc, NeuGc, Fuc) and linkages as edges. A bottom-up tree LSTM (hidden size 256) summarizes subtree information; peptide and glycan features are fused bidirectionally (glycan-root feature added to glycosite position in peptide features; peptide-sequence feature added to glycan root). Precursor charge state is concatenated to both streams.
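The glycan tree input and the bottom-up pass described above can be illustrated with a minimal sketch. It is not the DeepGlyco implementation: the class name GlycanNode and the function bottom_up are hypothetical, and the gated tree-LSTM cell is replaced by a plain child-sum of one-hot vectors, so the root state here is simply the monosaccharide composition of the whole glycan.

```python
from dataclasses import dataclass, field
from typing import List

# Monosaccharide vocabulary as listed in the text.
MONOSACCHARIDES = ["Hex", "HexNAc", "NeuAc", "NeuGc", "Fuc"]


@dataclass
class GlycanNode:
    """One monosaccharide; children are residues further from the peptide."""
    mono: str
    children: List["GlycanNode"] = field(default_factory=list)


def one_hot(mono: str) -> List[int]:
    return [1 if m == mono else 0 for m in MONOSACCHARIDES]


def bottom_up(node: GlycanNode) -> List[int]:
    """Post-order pass: a node's state is its one-hot vector plus the sum of
    its children's states (a stand-in for a gated tree-LSTM update)."""
    state = one_hot(node.mono)
    for child in node.children:
        child_state = bottom_up(child)
        state = [a + b for a, b in zip(state, child_state)]
    return state


# Core-fucosylated N-glycan core: HexNAc(Fuc)-HexNAc-Hex(Hex)(Hex)
core = GlycanNode("HexNAc", [
    GlycanNode("Fuc"),
    GlycanNode("HexNAc", [
        GlycanNode("Hex", [GlycanNode("Hex"), GlycanNode("Hex")]),
    ]),
])

root_state = bottom_up(core)  # counts per monosaccharide for the whole tree
```

In a learned model the summation would be replaced by trainable gates, so the root feature could encode topology as well as composition. The sketch only shows the traversal order and the tree data structure.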
A second, top-down tree LSTM updates glycan node features so that each node incorporates context from all directions. Each potential cleavage site partitions the glycan into lost (non-reducing end) and retained (reducing end) nodes; features from the lost and retained nodes are aggregated with attention to form cleavage features. A tripartite fragmentation graph links cleavage nodes to structure-specific fragments and to composition-level fragments (isomer sums). LSTM-based and attention-based aggregations output Y-ion intensities (charge states 1+ to 3+). Isomeric fragments of identical composition are summed. For complex/hybrid glycans, an optional branch module predicts B ions (branch losses and their subfragments, including branch plus adjacent core mannose) using analogous cleavage aggregation; oligomannose B ions with multiple mannoses are excluded. Peptide b/y-ion intensities (charge states 1+ to 2+, including bare b/y, b/y+HexNAc, and b$/y$ from HexNAc cross-ring cleavage) are predicted via an additional peptide LSTM and dense layers. Peptide and glycan features are further pooled with attention and passed to a sigmoid output that predicts the peptide-to-total intensity ratio, which is used to merge peptide and glycan predictions into a full spectrum. The model has about 7M parameters in total (~5M peptide, ~2M glycan).
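The final merging step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name merge_spectrum and the ion labels are hypothetical, and it assumes each part is normalized to unit total intensity before being scaled by the predicted peptide-to-total ratio.

```python
from typing import Dict


def merge_spectrum(peptide: Dict[str, float],
                   glycan: Dict[str, float],
                   ratio: float) -> Dict[str, float]:
    """Combine peptide and glycan predictions into one spectrum.

    `ratio` is the predicted share of total intensity carried by peptide
    b/y ions (assumed to lie in [0, 1], as a sigmoid output would).
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")

    def normalize(ions: Dict[str, float]) -> Dict[str, float]:
        total = sum(ions.values())
        if total <= 0:
            return {name: 0.0 for name in ions}
        return {name: value / total for name, value in ions.items()}

    merged = {name: ratio * v for name, v in normalize(peptide).items()}
    merged.update({name: (1.0 - ratio) * v
                   for name, v in normalize(glycan).items()})
    return merged


# Hypothetical ion labels and intensities, for illustration only.
peptide_pred = {"b2+": 0.2, "y3+": 0.6, "y4+": 0.2}
glycan_pred = {"Y1+": 1.0, "Y2+": 0.5, "Y3+": 0.5}
spectrum = merge_spectrum(peptide_pred, glycan_pred, ratio=0.25)
peptide_share = sum(v for k, v in spectrum.items() if k in peptide_pred)
```

With a ratio of 0.25, the peptide ions carry a quarter of the merged intensity and the glycan Y ions carry the rest, regardless of how each part was scaled on input.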
Datasets and preprocessing: HCD stepped CE datasets from mouse, human, and yeast on Orbitrap instruments were collected. For datasets with individual CEs (e.g., 20 and 33), intensities were averaged to form pseudo-stepped spectra. GPSMs were filtered to retain spectra with ≥5 peptide b/y and ≥5 glycan Y ions; for B-ion models, ≥2 B ions were required. Replicate spectra per precursor were merged to consensus spectra. Datasets were split into train (3/5), validation (1/5), and holdout (1/5).
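The preprocessing steps above can be sketched in a few lines. The function names are hypothetical, and several details are assumptions not stated in the text: fragments missing from one collision-energy spectrum are treated as zero intensity when averaging, and the 3/5, 1/5, 1/5 split is applied to shuffled precursor identifiers.

```python
import random
from typing import Dict, List, Sequence, Tuple


def pseudo_stepped(spectra: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average per-fragment intensities across individual-CE spectra."""
    ions = set().union(*(s.keys() for s in spectra))
    return {ion: sum(s.get(ion, 0.0) for s in spectra) / len(spectra)
            for ion in ions}


def passes_filter(n_by: int, n_y: int, n_b: int = 0,
                  b_ion_model: bool = False) -> bool:
    """Thresholds from the text: >=5 peptide b/y ions and >=5 glycan Y ions,
    plus >=2 B ions when training a B-ion model."""
    if n_by < 5 or n_y < 5:
        return False
    return n_b >= 2 if b_ion_model else True


def split_3_1_1(keys: Sequence[str],
                seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle reproducibly, then cut into train / validation / holdout."""
    items = sorted(keys)
    random.Random(seed).shuffle(items)
    a, b = len(items) * 3 // 5, len(items) * 4 // 5
    return items[:a], items[a:b], items[b:]


# Hypothetical spectra acquired at two collision energies.
averaged = pseudo_stepped([{"Y1": 1.0, "b2": 0.5}, {"Y1": 0.5}])
train, valid, holdout = split_3_1_1([f"precursor_{i}" for i in range(10)])
```

Splitting on precursor identifiers rather than on individual spectra keeps replicate spectra of one precursor in the same partition, which is one way to avoid leakage between training and holdout sets.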
Training: Multi-task learning optimized four objectives: spectral angle (SA) losses for the peptide part, the glycan part, and the whole spectrum, plus an MSE loss for the peptide-to-total intensity ratio. Dynamic weight averaging adjusted subtask weights according to their convergence rates. Optimization used Adam with batch size 16 and a warmup followed by cosine annealing with warm restarts. Peptide-only modules were pre-trained on a large set of HeLa peptide spectra; initial glycopeptide training froze the first two peptide BiLSTM layers. For the B-ion model, layers for b/y and Y ions were frozen and an additional SA term for B ions was added (lower learning rate with warm restarts). Fine-tuning used a higher initial learning rate without warmup and reduced the learning rate on plateau.
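The two loss-related ingredients can be sketched in plain Python. Both definitions are assumptions, since the summary does not spell them out: the spectral angle is taken as SA = 2·arccos(DP)/π, where DP is the normalized dot product, and dynamic weight averaging follows its commonly used form, in which each task weight is a softmax over the ratio of the two most recent losses, scaled by the number of tasks. The temperature value of 2.0 is likewise an assumption.

```python
import math
from typing import List, Sequence


def spectral_angle(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Assumed definition: SA = 2 * arccos(DP) / pi, with DP the normalized
    dot product. 0 means identical direction, 1 means orthogonal spectra."""
    dot = sum(p * o for p, o in zip(pred, obs))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(o * o for o in obs)))
    if norm == 0:
        return 1.0
    dp = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return 2.0 * math.acos(dp) / math.pi


def dwa_weights(prev: Sequence[float], prev_prev: Sequence[float],
                temperature: float = 2.0) -> List[float]:
    """Dynamic weight averaging, common formulation (an assumption here):
    tasks whose loss is decreasing more slowly receive larger weights."""
    rates = [a / b for a, b in zip(prev, prev_prev)]
    exps = [math.exp(r / temperature) for r in rates]
    total = sum(exps)
    return [len(rates) * e / total for e in exps]


sa_same = spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
sa_orth = spectral_angle([1.0, 0.0], [0.0, 1.0])
# Four subtasks: peptide SA, glycan SA, whole-spectrum SA, ratio MSE.
weights = dwa_weights(prev=[0.30, 0.10, 0.20, 0.05],
                      prev_prev=[0.30, 0.20, 0.25, 0.10])
```

Under this definition a dot product of 0.94 corresponds to a spectral angle of about 0.22, which agrees with the paired SA and DP values reported in the findings. In the example, the first task has not improved between the two epochs, so it receives the largest weight.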
Isomer differentiation by spectral library searching: For each query GPSM, structural isomer candidates were generated by replacing the glycan with structural isomers from pGlyco glycan databases (human: 2922; mouse: 7878), augmented with StrucGP-unique structures. Experimental peaks were matched to theoretical fragments of candidates; peaks absent in all candidates were discarded. Similarity scores combined Y- and B-ion SA: Score = 0.5 × (1−SA_Y) + 0.5 × (1−SA_B); candidates were ranked by score.
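The scoring and ranking step can be sketched directly from the formula above. The candidate names and SA values are invented for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def isomer_score(sa_y: float, sa_b: float) -> float:
    """Score = 0.5 * (1 - SA_Y) + 0.5 * (1 - SA_B), as given in the text."""
    return 0.5 * (1.0 - sa_y) + 0.5 * (1.0 - sa_b)


def rank_candidates(
        candidates: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from best to worst."""
    scored = [(name, isomer_score(sa_y, sa_b))
              for name, (sa_y, sa_b) in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical (SA_Y, SA_B) values for three structural isomers.
candidates = {
    "isomer_1": (0.10, 0.20),
    "isomer_2": (0.30, 0.25),
    "isomer_3": (0.12, 0.40),
}
ranking = rank_candidates(candidates)
ranked_names = [name for name, _ in ranking]
```

Because SA is a distance, lower SA values give higher scores, and the equal weights mean a candidate must match both the Y-ion and B-ion patterns to rank first.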
DIA spectral library prediction and analysis: Predicted libraries were built either with predicted MS/MS and experimental iRTs (PredMS2) or with both predicted MS/MS and predicted iRTs (PredLib). Models for iRT prediction achieved Pearson r > 0.97 on holdouts. DIA data were analyzed with GproDIA (transition filtering, RT calibration, decoy generation, feature extraction, scoring, and two-dimensional statistical control over peptide and glycan). For complex serum, glycoform inference was enabled. Extended predicted libraries (PredExt) increased glycopeptide coverage (~5k to ~10k precursors), using prior glycopeptide lists to constrain combinatorics. Statistical control was evaluated with entrapment experiments, in which absent glycans were added as false targets at sizes comparable to the original libraries to estimate relative false-positive rates (entrapment percentages). Quantification precision was assessed via the CV across replicates, and accuracy via fold-changes in mixed-organism samples.
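The precision metric can be made concrete with a short sketch. It assumes the CV is the sample standard deviation divided by the mean, expressed as a percentage, and that the reported figure is the median over all precursors (or site-specific glycans) quantified in at least two replicates. The function names and intensity values are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def cv_percent(values: Sequence[float]) -> float:
    """Coefficient of variation in percent (sample standard deviation)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("mean intensity is zero")
    return 100.0 * statistics.stdev(values) / mean


def median_cv(quant: Dict[str, Sequence[float]]) -> float:
    """Median CV over entries quantified in at least two replicates."""
    cvs = [cv_percent(v) for v in quant.values() if len(v) >= 2]
    return statistics.median(cvs)


# Hypothetical replicate intensities for three precursors.
quant = {
    "precursor_A": [90.0, 100.0, 110.0],
    "precursor_B": [200.0, 200.0, 200.0],
    "precursor_C": [50.0],  # single replicate, skipped
}
cv_a = cv_percent(quant["precursor_A"])
overall = median_cv(quant)
```

Entries seen in only one replicate have no defined CV and are skipped, so data completeness and quantitative precision need to be read together.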
Key Findings
Spectrum prediction accuracy: On Mouse1 and Human1 holdouts, median SA for peptide part was 0.22–0.16 (DP 0.94–0.96), glycan part <0.11 (DP >0.98), whole spectrum <0.16 (DP >0.97), with similar training vs. holdout metrics (limited overfitting). Across mouse datasets (trained on Mouse1, tested on Mouse2–4): peptide SA 0.28–0.26 (DP 0.90–0.91), glycan SA 0.22–0.18 (DP 0.94–0.96), whole SA 0.24–0.21 (DP 0.93–0.95). Across human datasets (trained on Human1, tested on Human2–4), peptide prediction was more CE-sensitive (SA 0.38–0.32; DP 0.82–0.88). Fine-tuning improved performance: e.g., Mouse3 peptide SA improved 0.26→0.21 (DP 0.91→0.94); Human4 peptide SA 0.39→0.24 (DP 0.82→0.93). Combining datasets improved generalization.
Branch (B) ions: Retrained models (excluding high-mannose) achieved high B-ion similarity, median SA 0.16–0.06 (DP 0.97–0.99) while maintaining strong Y-ion prediction, yielding whole-spectrum median SA 0.20–0.10 (DP 0.95–0.99). Reported similarities were computed against annotated b/y and B/Y ions constituting 26–34% of raw spectral intensity.
Isomer differentiation: Predicted spectral library searching ranked the correct structure among the top three candidates in 92–95% of cases across standard and biological datasets; recognition accuracy for bisecting HexNAc was 88–95%. Core fucosylation recognition achieved 93–97% accuracy, 94–99% precision, and 96–98% recall. In enzyme-treated mouse brain data, 96% (448/466) of spectra annotated as neither bisecting nor terminal HexNAc remained negative after rescoring; 75% (46/61) of terminal HexNAc and 13% (21/158) of bisecting HexNAc annotations became dual negatives. In Fut8 knockout vs wild-type mouse brain, after rescoring, core-fucosylated cases in knockout samples dropped from 12% (262/2229) to 5% (110/2229); in wild type, 97% (3058/3165) of core-fucosylated and 92% (301/326) of negatives were retained.
DIA with predicted libraries: Yeast dataset: predicted libraries (PredMS2/PredLib) yielded up to ~10% fewer detected precursors and site-specific glycans than DDALib but slightly improved data completeness. Serum dataset (avg over 3 runs): PredMS2 detected 859±3 precursors and 594±2 site-glycans per run; PredLib 798±3 precursors and 553±1; DDALib 799±14 precursors and 539±2. Across 3 runs, PredMS2 found 956 total precursors (647 site-glycans), with 76% and 81% shared across all runs, respectively. Considering IDs present in ≥2/3 runs, predicted spectra increased IDs: +7% precursors (893 vs 835) and +10% site-glycans (613 vs 556) compared to DDALib; replacing RT values caused −6% precursors and −7% site-glycans. Extended predicted libraries improved coverage: PredExt (~5k) yielded 991±5 precursors and 691±2 site-glycans per run; IDs in ≥2/3 runs increased by 24% (precursors) and 28% (site-glycans) over DDALib. Entrapment analyses: Yeast PredLib entrapment 1.4% vs DDALib 1.0%; Serum PredLib 2.3% vs DDALib 2.5%; PredExt 7k ~2.3%; PredExt 10k substantially higher entrapment, indicating large libraries challenge error control. Quantification: median CVs ~12% (precursor) and 13% (site-glycan) with PredExt comparable to DDALib; fold-change accuracy in mixed human/yeast samples was close to DDALib (slight overestimation for human, close or better for yeast).
Discussion
DeepGlyco brings tree-structured LSTMs and fragmentation-pathway-aware graph modeling to intact glycopeptide spectrum prediction, enabling accurate intensity predictions for peptide b/y and glycan Y (and B) ions under stepped-HCD. The attention mechanism provides interpretability by highlighting cleavage steps contributing to specific fragments, supporting rationalization of spectral differences among structural isomers. Performance is highest when training and application match organism and instrument conditions; fine-tuning and multi-dataset training improve generalization. Predicted spectral libraries are effective for DIA glycoproteomics: they can correct or surpass lower-quality spectra in experimental libraries and, when judiciously extended, substantially increase identification coverage with comparable precision and quantitative stability. However, excessively large predicted libraries inflate false positives under current DIA statistical control paradigms, necessitating constrained search spaces or improved scoring/control strategies. Overall, the approach advances both identification and structural differentiation in glycoproteomics and paves the way for broader applications in isomer ranking and DIA analysis.
Conclusion
This work introduces DeepGlyco, an explainable deep learning framework that models peptide sequences and non-linear glycan structures to predict intact glycopeptide MS/MS spectra with high accuracy across fragment types. The model differentiates structural isomers (e.g., core fucosylation, bisecting HexNAc) using predicted spectra and attention-derived fragmentation pathways, and generates predicted spectral libraries that enable DIA glycoproteomics with performance comparable to or better than experimental libraries, especially when extending coverage. Future directions include: training on larger, more diverse datasets to enhance generalization; incorporating explicit encoders for instrument and collision energy metadata; extending to O-glycopeptides, alternative dissociation methods (e.g., ETD), and multi-glycan peptides; improving glycopeptide iRT calibration standards; and developing DIA analysis/scoring methods compatible with proteome/glycome-scale predicted libraries for rigorous error control.
Limitations
Generalization degrades across organisms and instrument/CE settings without fine-tuning, reflecting limited availability of large, diverse high-quality glycopeptide datasets. The current model ignores the ordering of glycan branches, which can affect predictions for isomeric structures and contribute to replicate variability. Oligomannose branch (multi-mannose) B ions were not modeled; xylose-containing glycans are not supported due to lack of training data. Only HCD stepped-CE spectra were considered; other fragmentation methods and ion types are not yet supported. Glycopeptide iRT prediction is limited by the absence of standardized glycopeptide RT calibration kits. In DIA, very large predicted libraries increase entrapment (false-positive) rates under current statistical control frameworks, requiring constrained search spaces.