Introduction
Liquid chromatography-mass spectrometry (LC-MS) is a widely used technique in proteomics and metabolomics for analyzing complex biological samples. However, variations in retention time (RT) of analytes across different LC-MS runs pose a significant challenge. These RT shifts can be caused by various factors, including matrix effects and instrument variations, making accurate alignment of features across multiple samples essential for reliable quantitative, comparative, and statistical analyses. This process, known as RT alignment or correspondence, aims to identify the same compound across different samples. While peptide identification can facilitate correspondence in proteomics, a substantial portion of precursors often remain unidentified, particularly in data-dependent acquisition (DDA) mode. Even in data-independent acquisition (DIA), numerous unidentified precursors remain. Existing tools for DDA and DIA data analysis, such as MaxQuant, PANDA, MSFragger, and DIA-NN, often use the match between runs (MBR) function for RT alignment, relying on identified peptides. This approach limits the application of RT alignment in clinical proteomics for biomarker discovery from unidentified precursors. In metabolomics, feature alignment is crucial for accurate identification and quantification, with RT alignment particularly important due to the high accuracy of m/z measurements.
Current computational methods for RT alignment can be broadly categorized into warping methods and direct matching methods. Warping methods, employed in tools like XCMS, MZmine, and OpenMS, correct RT shifts using linear or non-linear warping functions. However, they cannot handle non-monotonic RT shifts. Direct matching methods, used in tools such as RTAlign, MassUntangler, and Peakmatch, directly compare signals across runs without warping functions, but they often underperform warping methods due to the inherent uncertainty in MS signals.
The inability of these existing tools to handle both monotonic and non-monotonic RT shifts simultaneously necessitates the exploration of machine learning and deep learning techniques. While some studies have applied Siamese networks for peak alignment in gas chromatography-MS data, a deep learning-based solution for LC-MS data analysis was lacking. This study presents DeepRTAlign, a novel deep learning-based tool designed to address this limitation. By combining coarse alignment and a deep learning model, DeepRTAlign aims to accurately handle both monotonic and non-monotonic RT shifts, enhancing accuracy and sensitivity compared to existing methods.
Literature Review
The existing literature extensively covers methods for retention time alignment in LC-MS data. Warping functions are a popular approach, with tools like XCMS, MZmine, and OpenMS using different algorithms to model and correct retention time shifts. These methods generally assume monotonicity, which limits their applicability to datasets with non-monotonic shifts. Direct matching methods are an alternative approach in which feature similarity across runs is used to establish correspondences. However, they often achieve lower accuracy than warping methods, especially in the presence of noise and missing data. Machine learning techniques have become increasingly common in alignment, with some studies exploring Siamese networks and other neural network architectures to improve the accuracy and robustness of peak alignment in both targeted and untargeted metabolomics. However, the application of deep learning to alignment of LC-MS data remained relatively unexplored until now. This paper addresses this gap by introducing DeepRTAlign, a deep learning-based tool that offers a novel approach to overcome the limitations of existing techniques.
Methodology
DeepRTAlign's workflow is divided into a training phase and an application phase. The training phase consists of eight steps:
(1) Precursor detection and feature extraction using the in-house tool XICFinder.
(2) Coarse alignment: RTs are linearly scaled and, for each m/z, the highest-intensity feature is selected as the reference for alignment.
(3) Binning and filtering: features are grouped by m/z, optionally retaining only the most intense feature within each m/z bin.
(4) Input vector construction from the RT and m/z values of adjacent features.
(5) Training of a deep neural network (DNN) with three hidden layers on 400,000 feature pairs (200,000 positive and 200,000 negative) sourced from the HCC-T dataset and labeled using Mascot identification results. Training uses the BCELoss function, a sigmoid activation function, the Adam optimizer, and an initial learning rate of 0.001.
(6) Hyperparameter optimization through 10-fold cross-validation.
(7) Model evaluation on independent test datasets.
(8) Quality control (QC) of alignment results via decoy sample analysis to calculate the false discovery rate (FDR).
In the application phase, input features (from Dinosaur, MaxQuant, OpenMS, or XICFinder) undergo coarse alignment and input vector construction and are then fed through the trained DNN model for alignment. Several datasets are used for benchmarking against existing tools: HCC-T (training); HCC-N, HCC-R, UPS2-M, UPS2-Y, EC-H, AT, and SC (test sets); and additional proteomic and metabolomic datasets (NCC19, SM1100, MM, SO, GUS, MI, CD) for generalizability assessment. Simulated datasets are generated to evaluate the algorithm's tolerance to various RT shift distributions. Other machine learning models (RF, KNN, SVM, LR) are also trained and compared with the DNN model.
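The coarse alignment in step (2) can be illustrated with a minimal Python sketch. It assumes each feature is a dict with the hypothetical keys `mz`, `rt`, and `intensity`. The 80-minute target range, the 1-minute RT window, the two-decimal m/z rounding, and the per-window mean shift are illustrative assumptions rather than details stated above, and the function names are not part of DeepRTAlign's interface.

```python
from collections import defaultdict
from statistics import mean


def scale_rt(features, target_max=80.0):
    """Linearly rescale the RTs of one run so that the largest RT equals target_max."""
    rt_max = max(f["rt"] for f in features)
    return [{**f, "rt": f["rt"] * target_max / rt_max} for f in features]


def top_feature_per_mz(features, mz_decimals=2):
    """For each rounded m/z value, keep only the most intense feature."""
    best = {}
    for f in features:
        key = round(f["mz"], mz_decimals)
        if key not in best or f["intensity"] > best[key]["intensity"]:
            best[key] = f
    return best


def coarse_align(run, anchor, window=1.0, target_max=80.0):
    """Shift every RT window of `run` by its mean RT offset from `anchor`.

    Assumption: the offset of a window is the mean RT difference between the
    top-intensity features that share an m/z key in both runs.
    """
    run = scale_rt(run, target_max)
    anchor = scale_rt(anchor, target_max)
    run_top = top_feature_per_mz(run)
    anchor_top = top_feature_per_mz(anchor)

    # Collect RT offsets per window, using only m/z keys present in both runs.
    offsets = defaultdict(list)
    for key, f in run_top.items():
        if key in anchor_top:
            offsets[int(f["rt"] // window)].append(anchor_top[key]["rt"] - f["rt"])
    shift = {w: mean(v) for w, v in offsets.items()}

    # Windows without any shared feature are left unshifted.
    return [{**f, "rt": f["rt"] + shift.get(int(f["rt"] // window), 0.0)} for f in run]
```

Calling `coarse_align(run, anchor)` returns a copy of `run` with corrected RTs. A piecewise shift of this kind only removes the bulk drift between runs; the residual, possibly non-monotonic differences are what the DNN is meant to resolve.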
Key Findings
DeepRTAlign outperforms existing RT alignment tools such as MZmine 2 and OpenMS across multiple proteomic and metabolomic datasets, achieving higher precision and recall and aligning more features accurately. Ablation analysis confirms that both the coarse alignment step and the RT-related features in the DNN input are needed for optimal performance. Compared with tools that use MS/MS information (Quandenser) or identification-based alignment (MaxQuant, MSFragger, and DIA-NN with MBR), the ID-free DeepRTAlign aligns significantly more features without compromising quantification accuracy. It is also robust across feature extraction methods. Evaluation on simulated datasets shows that its performance is sensitive to the standard deviation of RT shifts, so the RT shift distribution needs to be controlled in practical applications. In HCC early recurrence prediction, a classifier built on MS features aligned by DeepRTAlign achieves a significantly higher AUC (0.998) than classifiers based on identified peptides (0.931) or proteins (0.757). Even a classifier using only the top 15 features achieves an AUC of 0.833 on an independent test set, highlighting the potential of DeepRTAlign for biomarker discovery.
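To make the precision and recall comparison concrete, the following Python sketch scores a set of aligned feature pairs against a ground truth. It assumes that ground-truth pairs are available, for instance from peptide identifications. The pair labels and the function name are hypothetical and are not taken from the study.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) of predicted feature pairs against ground truth.

    Each pair is a tuple (feature_id_in_run_A, feature_id_in_run_B).
    """
    predicted, truth = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: four true correspondences, three predicted, one of them wrong.
truth = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")]
predicted = [("A1", "B1"), ("A2", "B2"), ("A3", "B4")]
precision, recall = precision_recall(predicted, truth)
```

Here two of the three predicted pairs are correct and two of the four true pairs are recovered, so an aligner that reports more pairs can raise recall only if the extra pairs are not mostly wrong.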
Discussion
DeepRTAlign's superior performance in RT alignment stems from its ability to handle both monotonic and non-monotonic RT shifts simultaneously. The combination of coarse alignment and a deep learning model enables accurate alignment of features, even in complex datasets. The ID-free nature of DeepRTAlign allows the alignment of all precursors, maximizing information utilization and potentially uncovering novel biomarkers not detectable through traditional identification-based methods. The application in HCC early recurrence prediction demonstrates the practical utility of DeepRTAlign in downstream biological analyses. The higher predictive power of the DeepRTAlign-aligned MS feature-based classifier suggests the presence of hidden information in MS features that is not fully captured by peptide and protein quantification. This suggests that DeepRTAlign is especially effective at detecting subtle variations in LC-MS data. Future work could focus on improving the feature extraction process to further enhance the quantification accuracy, possibly by optimizing the alignment and extraction algorithms together.
Conclusion
DeepRTAlign offers a significant advancement in RT alignment for large cohort LC-MS data analysis. Its deep learning-based approach surpasses the performance of existing methods in accuracy, sensitivity, and generalizability. The ID-free nature and compatibility with multiple feature extraction tools enhance its usability and applicability across diverse research areas. The successful application in HCC recurrence prediction demonstrates its potential in biomarker discovery and precision medicine. Future work should explore strategies to improve quantification accuracy and investigate the applicability of DeepRTAlign to even more complex datasets with higher RT variations.
Limitations
While DeepRTAlign demonstrates significant improvements in RT alignment, its performance on simulated data is sensitive to the standard deviation of RT shifts. Extreme RT variations can reduce its effectiveness, so such datasets call for careful consideration and possibly additional pre-processing. Performance on datasets containing many features with very close m/z values may need further investigation and improvement. Additionally, although DeepRTAlign is compatible with multiple feature extraction tools, the choice of feature extraction method still influences the final alignment results. Finally, the study focuses primarily on alignment; future research could explore integration with advanced feature extraction techniques to improve overall performance in quantitative proteomics and metabolomics.
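The sensitivity to the standard deviation of RT shifts can be explored with a small simulation. The Python sketch below assumes that each feature's shift is drawn independently from a normal distribution with mean `mu` and standard deviation `sigma`. This is an illustrative model, not necessarily the one used to build the simulated datasets in the study. It counts how many neighboring features swap elution order, which is the non-monotonic behavior that a warping function cannot correct.

```python
import random


def simulate_shifted_run(rts, mu, sigma, seed=0):
    """Add an independent normal RT shift (mean mu, std sigma) to every feature."""
    rng = random.Random(seed)
    return [rt + rng.gauss(mu, sigma) for rt in rts]


def count_order_swaps(original, shifted):
    """Count neighboring features (by original RT) whose elution order is reversed."""
    order = sorted(range(len(original)), key=original.__getitem__)
    return sum(1 for i, j in zip(order, order[1:]) if shifted[i] > shifted[j])


# Hypothetical run: 200 features eluting every 0.5 minutes.
rts = [0.5 * i for i in range(200)]
swaps_by_sigma = {
    sigma: count_order_swaps(rts, simulate_shifted_run(rts, mu=5.0, sigma=sigma))
    for sigma in (0.0, 0.1, 0.5, 2.0)
}
```

With `sigma` set to zero every feature moves by the same amount and no swaps occur. As `sigma` approaches the spacing between neighboring features, swaps become frequent and the shift is no longer monotonic.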