Introduction
Cancer's intratumoral heterogeneity leads to variable patient responses to therapy. Personalized medicine aims to identify biomarkers predicting treatment efficacy. While protein and genetic biomarkers have shown some success (e.g., HER2 and estrogen receptor in breast cancer), many drugs still lack reliable predictors. For instance, midostaurin and alpelisib show significant response variability despite targeting specific mutations. Current companion diagnostics often focus on single biomarkers, neglecting potential synergy or compensation from other drugs. Machine learning (ML) offers a powerful tool to integrate multi-omics data and predict drug responses more accurately. Previous studies have explored genomic features and gene expression, but large-scale proteomics and phosphoproteomics data remain under-explored despite their potential. Limitations such as low sample throughput and reliance on labeling methods have hindered their application in ML models. However, advancements in label-free LC-MS/MS and the availability of comprehensive drug response profiles now make large-scale proteomic analysis for ML feasible. This paper presents DRUML, designed to integrate proteomics and phosphoproteomics data to rank drugs based on their efficacy in reducing cancer cell proliferation. A key feature is its ability to predict drug rankings without needing comparisons to reference samples, crucial for clinical implementation.
Literature Review
The field of personalized oncology relies on identifying biomarkers predictive of drug response to optimize treatment strategies. While genetic markers and gene expression profiling have been utilized, their predictive power remains limited, especially for drugs with complex mechanisms of action and considerable response variability in patients with the same genetic profile. This study builds on previous efforts to predict drug responses using machine learning methods. However, the authors highlight the underutilization of large-scale proteomics and phosphoproteomics datasets, emphasizing the need for a systematic investigation. Previous limitations, such as the low throughput of traditional proteomics methods, which heavily rely on chemical labeling, have hindered their application in large-scale machine learning models. This study leverages recent advances in label-free LC-MS/MS technology to overcome such limitations and systematically assess the predictive power of large-scale proteomics data.
Methodology
DRUML, an ensemble of machine learning models, ranks drugs based on their predicted efficacy in reducing cancer cell proliferation. The method utilizes proteomic and phosphoproteomic data as input. To reduce noise and improve model robustness, DRUML employs a dimensionality reduction technique. It calculates empirical markers of drug responses (EMDRs) from the training dataset. EMDRs represent consistently significant changes in protein or phosphosite expression levels between drug-sensitive and drug-resistant cell lines. These are obtained through repeated resampling and statistical analysis (Limma package). A distance metric, D, is computed using the EMDRs. This metric quantifies the difference between the average expressions of sensitivity markers and resistance markers within a sample. This internal normalization eliminates the need for external control samples during prediction. Various machine learning algorithms (random forest, deep learning, neural networks, etc.) are trained on the D values to predict drug responses.

The methodology involves multiple stages:

1. **Data Acquisition:** Proteomic and phosphoproteomic data were acquired via label-free LC-MS/MS from 48 cell lines (26 AML, 10 esophageal, and 12 hepatocellular carcinoma) in triplicate. Drug response data (area above the curve, AAC) was obtained from PharmacoDB.
2. **Dimensionality Reduction:** EMDRs were identified using 80% of the samples in the training set. The distance metric (D) was calculated, representing the difference between sensitivity and resistance markers.
3. **Model Training and Validation:** Different ML algorithms were applied for model construction and hyperparameter tuning using 10-fold cross-validation and RMSE as the loss function. Model performance was evaluated based on mean squared error and Spearman's rank correlation. Models were separately built for AML and solid tumor samples.
4. **Independent Verification:** The trained models were tested against independent proteomics and phosphoproteomics datasets from other laboratories. This included a colorectal cancer dataset (Piersma et al.) and a diverse solid tumor dataset (Jarnuczak et al.) encompassing data from 11 studies.
5. **Clinical Relevance Assessment:** DRUML's predictive ability was assessed using a clinical AML phosphoproteomics dataset (Casado et al.) to predict cytarabine responses and their correlation with patient survival (overall survival, OS). Kaplan-Meier survival curves and log-rank tests were used for survival analysis.
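As an illustration of the internally normalized distance metric D described in the methodology, the following is a minimal sketch rather than the authors' implementation. It assumes D is the mean level of a drug's sensitivity markers minus the mean level of its resistance markers within one sample; the sign convention, the function name `distance_metric`, the marker IDs, and the intensity values are all hypothetical.

```python
from statistics import mean


def distance_metric(sample, sensitivity_markers, resistance_markers):
    """Return D for one sample and one drug.

    Assumed definition (not confirmed by the source): mean level of the
    sensitivity markers minus mean level of the resistance markers.
    Markers that were not quantified in the sample are skipped.
    """
    sens = [sample[m] for m in sensitivity_markers if m in sample]
    res = [sample[m] for m in resistance_markers if m in sample]
    if not sens or not res:
        raise ValueError("sample has no quantified markers in one of the two sets")
    return mean(sens) - mean(res)


# Hypothetical log-scale intensities for a single sample.
sample = {"P1": 2.0, "P2": 4.0, "P3": 1.0, "P4": 3.0}

# Hypothetical EMDR sets per drug: (sensitivity markers, resistance markers).
emdrs = {
    "drugA": (["P1", "P2"], ["P3"]),
    "drugB": (["P3"], ["P2", "P4"]),
}

# One D value per drug, computed only from values inside this sample.
d_values = {
    drug: distance_metric(sample, sens, res)
    for drug, (sens, res) in emdrs.items()
}
print(d_values)
```

Because both averages come from the same sample, the value does not depend on a reference or control sample, which is the property the summary highlights as important for clinical use. In the described workflow, D values like these would then serve as the input features for the regression models.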
Key Findings
DRUML consistently shows high accuracy in predicting drug responses across various cancers and independent datasets.

* **Training and Validation:** Deep learning (DL) models trained on phosphoproteomics data showed the lowest validation errors (RMSE < 0.1) for both AML and solid tumor datasets. DL consistently outperformed other machine learning algorithms in the validation sets.
* **Independent Verification:** In the Piersma et al. (colorectal cancer) dataset, Random Forest (RF) models performed best, achieving a mean Spearman rho of 0.70 and >85% of predictions within 0.15 AAC units of measured values.
* **Diverse Solid Tumors:** In the Jarnuczak et al. dataset (47 cell lines, 8 cancer types), RF models showed high correlation (mean Spearman rho of 0.64) and low mean square errors (MSE < 0.1). Over 85% of predictions had absolute errors < 0.15 AAC units.
* **Clinical Significance:** In the clinical AML dataset, DRUML's predictions of cytarabine sensitivity significantly correlated with patient overall survival (OS). Patients with high predicted cytarabine responses had significantly longer OS compared to those with low predicted responses (Log-rank p < 0.005). The correlation of predicted response with OS was statistically significant in both the complete sample cohort (p=0.044) and in patients who underwent complete remission (CR) and received consolidation therapy (p=0.0049). This demonstrates the clinical relevance of DRUML's predictions.
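The verification figures above rest on three quantities: Spearman's rank correlation, mean squared error, and the fraction of predictions within 0.15 AAC units of the measured value. The sketch below shows how these could be computed for one sample's drug panel. It is an illustrative example under stated assumptions, not the study's evaluation code: the helper names and the AAC values are hypothetical, and Spearman's rho is computed as the Pearson correlation of average ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mse(predicted, measured):
    """Mean squared error between predicted and measured values."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)


def fraction_within(predicted, measured, tol=0.15):
    """Share of predictions whose absolute error is below tol."""
    hits = sum(abs(p - m) < tol for p, m in zip(predicted, measured))
    return hits / len(measured)


# Hypothetical AAC values for four drugs in one sample.
measured = [0.10, 0.40, 0.35, 0.80]
predicted = [0.15, 0.30, 0.45, 0.70]

rho = spearman_rho(predicted, measured)
error = mse(predicted, measured)
within = fraction_within(predicted, measured)
print(rho, error, within)
```

Rank correlation matters here because the stated goal is to order drugs by efficacy within a sample, so a model can be useful even when its absolute AAC estimates are slightly off, while the error-based measures capture how close those estimates are.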
Discussion
This study demonstrates the effectiveness of integrating large-scale proteomics and phosphoproteomics data into machine learning models to predict drug responses in cancer. DRUML successfully ranks drugs by predicted efficacy, showing high accuracy and generalizability across various cancer types and independent datasets. The use of internally normalized distance metrics (D), computed from EMDRs, enhances robustness and reduces noise, addressing challenges associated with high-dimensional omics data. The observed correlation between DRUML's cytarabine sensitivity predictions and patient survival in an independent AML cohort highlights the clinical relevance of this approach. This method provides a powerful tool for personalized medicine, guiding treatment decisions by prioritizing drugs based on their predicted effectiveness.
Conclusion
DRUML provides a robust and accurate method for ranking anti-cancer drugs based on predicted efficacy, utilizing large-scale proteomics and phosphoproteomics data. Its high performance across diverse cancer types and independent datasets, along with its demonstration of clinical relevance, positions it as a valuable tool for drug prioritization in personalized oncology. Future work could expand the drug library, refine model algorithms, and explore the integration of additional omics data and clinical parameters to further enhance predictive power.
Limitations
The current version of DRUML is limited to drugs included in existing drug response databases. The findings are based on data from cancer cell lines, which might not perfectly recapitulate the in vivo tumor microenvironment. The relatively small sample size used for training might also impact the generalizability of the model. While the study demonstrates a significant correlation between predicted drug sensitivity and clinical outcomes in AML patients, further validation in larger and more diverse patient cohorts is necessary to confirm these findings.