Introduction
Accurate and rapid clinical diagnosis is crucial for effective treatment. Mass spectrometry (MS) offers a rapid, sensitive, and reproducible method for obtaining non-targeted molecular information, making it valuable for tasks like cancer tissue identification and microbial pathogen identification. Traditional machine learning (ML) methods, such as support vector machines (SVM), random forests (RF), and linear discriminant analysis (LDA), require extensive preprocessing steps (peak alignment, intensity calibration/normalization, peak detection, dimensionality reduction) to correct for artifacts in MS data, which hinders rapid analysis. These preprocessing steps are complex, dataset-specific, and lack standardized parameters, reducing reproducibility and reliability. Convolutional neural networks (CNNs) offer a promising alternative because they can learn representations directly from raw data, eliminating the need for extensive preprocessing. However, CNN performance deteriorates significantly with limited training data, a common problem in medical applications where sample availability is often restricted. Transfer learning, where a model trained on a large dataset is adapted to a smaller dataset, can mitigate this issue. This study explores transfer learning using 1D-CNNs for MS data classification and proposes a novel cumulative learning approach to further enhance performance when transfer learning alone is insufficient. The cumulative learning approach involves sequentially training a single model on multiple small datasets, accumulating knowledge to improve representation learning and classification accuracy. This study uses rat brain data as an initial training dataset and investigates the impact of this approach across several clinical MS datasets representing different biological contexts, organisms, and instruments.
Literature Review
Many studies have used machine learning (ML) models such as support vector machines (SVM) [4], random forests (RF) [5], and linear discriminant analysis (LDA) [6] for MS-based classification tasks in cancer diagnosis and microbial identification. However, these methods rely heavily on extensive preprocessing steps, making them unsuitable for rapid applications, and the inconsistency in preprocessing strategies hinders the comparison of results across studies. Several preprocessing methods, including peak alignment [13, 14, 15, 16], intensity calibration/normalization [17, 18], peak detection [19], and dimensionality reduction [20], are necessary to ensure data quality and improve the performance of conventional ML algorithms. The choice of preprocessing parameters has been shown to greatly affect classification accuracy [21, 22]. Convolutional neural networks (CNNs) have emerged as a powerful alternative, offering high accuracy and the ability to learn from raw data, eliminating the need for extensive preprocessing [25]. However, the effectiveness of CNNs decreases with limited training data, which is a limitation in many biomedical applications [26, 27]. Transfer learning has been successfully applied in various fields [28, 29], and its potential for MS data analysis has been explored with 2D imaging data [30, 31]. However, few studies have explored its use with 1D spectral data [25, 33, 34, 35]. This study aims to address the challenges of limited sample sizes in MS-based clinical classification using CNNs by combining transfer learning with a novel cumulative learning approach.
Methodology
This study uses six datasets: two small clinical datasets (canine sarcoma and microorganisms), a large MALDI-MSI rat brain dataset, a beef liver dataset, and two publicly available human ovary datasets (high and low resolution). The canine sarcoma dataset consists of spectra from healthy tissue and 11 sarcoma types, acquired using a Synapt G2-S Q-TOF instrument. The microorganism dataset contains spectra from five human pathogens (two Gram-negative bacteria, two Gram-positive bacteria, and one yeast), also acquired using a Synapt G2-S Q-TOF instrument. The rat brain dataset comprises spectra of gray and white matter obtained using a Rapiflex MALDI-TOF instrument. The beef liver dataset contains spectra from healthy beef liver samples, acquired with a Synapt G2-S Q-TOF instrument. The two human ovary datasets contain spectra from healthy and cancerous samples acquired via SELDI-TOF technology (high-resolution dataset) and a Protein Biological System II (PBSII) SELDI-TOF instrument (low-resolution dataset). Spectra are binned to create intensity matrices, with SpiderMass spectra binned to 0.1 Da and other datasets binned to match dimensions so that transfer learning is possible. Three CNN architectures (variant-Lecun, variant-LeNet, and variant-VGG9) are evaluated. For each architecture, classification performance is measured with several metrics: classification accuracy (over all classes), sensitivity, specificity, and the confusion matrix. Experiments evaluate CNNs trained from scratch, with transfer learning, and with the proposed cumulative learning approach. The cumulative learning approach involves two scenarios. Scenario A trains on the rat brain dataset, then fine-tunes on the beef liver dataset, and finally trains on the target clinical dataset. Scenario B adds an extra fine-tuning step on the microorganism dataset before training on the target dataset.
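The binning step described above can be illustrated with a minimal sketch. It assumes each spectrum is a list of (m/z, intensity) pairs and that intensities falling in the same bin are summed; the summary does not state the aggregation rule, and the function names and m/z range arguments are hypothetical, not taken from the study.

```python
from typing import List, Sequence, Tuple

Spectrum = Sequence[Tuple[float, float]]  # (m/z, intensity) pairs


def bin_spectrum(
    peaks: Spectrum, mz_min: float, mz_max: float, bin_width: float = 0.1
) -> List[float]:
    """Sum intensities into fixed-width m/z bins over [mz_min, mz_max)."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    binned = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # Clamp to the last bin to guard against floating-point overshoot.
            index = min(int((mz - mz_min) / bin_width), n_bins - 1)
            binned[index] += intensity  # assumption: sum within a bin
    return binned


def build_intensity_matrix(
    spectra: Sequence[Spectrum],
    mz_min: float,
    mz_max: float,
    bin_width: float = 0.1,
) -> List[List[float]]:
    """One row per spectrum, one column per m/z bin."""
    return [bin_spectrum(s, mz_min, mz_max, bin_width) for s in spectra]
```

Using the same m/z range and bin width for every dataset gives every row the same length, which is one way to obtain the matching input dimensions that reusing a trained network requires.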
The performance of the best CNN models is compared to SVM, RF, and LDA, applied to both raw and preprocessed data (using a five-step preprocessing pipeline including log-intensity transformation, baseline subtraction, normalization, alignment, and peak detection).
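The evaluation metrics named in the methodology can be derived from a single confusion matrix. The sketch below assumes a one-vs-rest definition of sensitivity and specificity for the multi-class case, which the summary does not specify; the function names are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def confusion_matrix(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    labels: Sequence[Hashable],
) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def summarize(matrix: List[List[int]]) -> Dict[str, object]:
    """Overall accuracy plus per-class sensitivity and specificity."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    sensitivity, specificity = [], []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # class k predicted as another class
        fp = sum(row[k] for row in matrix) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        sensitivity.append(tp / (tp + fn) if tp + fn else 0.0)
        specificity.append(tn / (tn + fp) if tn + fp else 0.0)
    return {
        "accuracy": correct / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }
```

For a 12-class problem such as canine sarcoma (healthy tissue plus 11 sarcoma types), the per-class lists show which classes drive the errors, which overall accuracy alone does not.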
Key Findings
The results demonstrate that all three CNN architectures perform poorly when trained from scratch on the small clinical datasets. However, transfer learning significantly improves classification accuracy compared to training from scratch. Variant-Lecun consistently outperforms the other architectures under transfer learning. The cumulative learning approach further enhances performance, particularly for canine sarcoma multi-class classification. Scenario B (rat brain → beef liver → microorganisms → canine sarcoma) yields the highest accuracy. The best results are obtained by freezing the convolutional layers and retraining only the fully connected layers. The cumulative learning strategy maintains generalization capability, as demonstrated by testing the final model on the datasets used during training. The CNN models consistently outperform SVM, RF, and LDA on both raw and preprocessed data, highlighting the advantage of learning directly from raw spectra without preprocessing. For the two human ovary datasets, transfer learning improves performance on the high-resolution dataset from 0.78 to 0.98, while cumulative learning improves performance on the low-resolution dataset from 0.80 to 0.99. Data augmentation attempts were unsuccessful, suggesting that effective augmentation strategies require a deeper understanding of biological variability.
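The training order in Scenarios A and B can be sketched as a simple loop over datasets. This is an orchestration outline only, not the authors' implementation. The dataset names, `cumulative_stages`, `run_cumulative`, and the `train_stage` callback are all hypothetical. Freezing the convolutional layers only on the final stage is an assumption, since the summary does not say at which stage freezing was applied.

```python
from typing import Any, Callable, Dict, List

# Source datasets per scenario, in training order (names are illustrative).
SCENARIOS: Dict[str, List[str]] = {
    "A": ["rat_brain", "beef_liver"],
    "B": ["rat_brain", "beef_liver", "microorganisms"],
}


def cumulative_stages(scenario: str, target: str) -> List[str]:
    """Ordered dataset names: the scenario's sources, then the target."""
    return SCENARIOS[scenario] + [target]


def run_cumulative(
    stages: List[str],
    train_stage: Callable[..., Any],
    model: Any = None,
    freeze_on_last: bool = True,
) -> Any:
    """Train one model on each dataset in turn, carrying weights forward.

    `train_stage(model, dataset, freeze_features=...)` is a hypothetical
    callback. It is expected to swap the classification head for the new
    dataset's classes, fine-tune, and return the updated model.
    """
    for position, dataset in enumerate(stages):
        is_last = position == len(stages) - 1
        # Assumption: convolutional layers are frozen only on the target.
        model = train_stage(
            model, dataset, freeze_features=is_last and freeze_on_last
        )
    return model
```

A caller would pass a framework-specific `train_stage`. For example, `run_cumulative(cumulative_stages("B", "canine_sarcoma"), train_stage)` visits rat brain, beef liver, microorganisms, and canine sarcoma in that order.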
Discussion
This study demonstrates the effectiveness of CNNs, particularly when combined with transfer learning and a novel cumulative learning approach, for classifying mass spectrometry data with limited training samples. The ability of CNNs to learn directly from raw data, bypassing time-consuming preprocessing steps, offers a significant advantage over traditional ML methods. The cumulative learning approach, by sequentially training a model on multiple datasets, effectively accumulates knowledge and improves the model's ability to generalize to new, unseen data. This approach is particularly valuable in medical applications where data scarcity is a common challenge. The results suggest that a 'generic' representation of MS data for classification tasks might exist, and that this representation is robust to variations in organisms, tissue phenotypes, and instruments. The superior performance of CNNs over traditional ML methods, even on preprocessed data, underscores the advantage of an end-to-end trainable system. The findings highlight the potential of CNNs to facilitate rapid, efficient, and reliable clinical diagnosis without the need for extensive preprocessing expertise.
Conclusion
This research introduces a cumulative learning strategy for training CNNs to classify mass spectrometry data, particularly useful in scenarios with limited training data. The cumulative learning approach significantly enhances classification accuracy compared to both training from scratch and transfer learning alone. This approach provides a robust and efficient solution for various biological contexts and instrument types. Future work will focus on interpreting the models to identify key spectral regions and extending this methodology to other spectroscopic techniques.
Limitations
While the cumulative learning approach proves highly effective, the study's limitations include the specific datasets used and the potential for bias in the selection of these datasets. The limited number of datasets may restrict the generalizability of findings. Further research is needed to validate the approach with a broader range of datasets. The study also acknowledges the complexity of MS data preprocessing, highlighting the need for further investigation into optimized and standardized preprocessing methods.