An integrated network representation of multiple cancer-specific data for graph-based machine learning

L. Pu, M. Singha, et al.

This study by Limeng Pu, Manali Singha, Hsiao-Chun Wu, Costas Busch, J. Ramanujam, and Michal Brylinski presents a method for predicting how cancer cell lines respond to drug treatments from genomic data. A knowledge-based graph reduction algorithm produces a compact, feature-rich graph representation, and the results indicate that representing cancer-specific data in non-Euclidean (graph) form improves prediction accuracy in cancer pharmacotherapy.

Introduction

Cancer is a complex, system-level phenomenon involving malfunctions in cellular signal transduction. Biological networks, particularly protein-protein interaction (PPI) networks, are valuable tools for studying these alterations. Early network-based methods built on the observation that genes associated with similar diseases exhibit similar topological characteristics in PPI networks. For example, Vavien prioritizes candidate disease genes by their topological similarity to known disease genes. Another study analyzed FDA-approved anticancer drugs, categorized them by mechanism of action (cytotoxic vs. target-based), and revealed novel drug-cancer associations. Together, these studies demonstrate the potential of biological networks in cancer research, particularly for predicting anticancer drug efficacy.

Machine learning applied to graph-structured data offers further improvements in information extraction and induction from biological networks. Graph-based approaches include graph kernels (such as the Weisfeiler-Lehman algorithm) and spectral methods that use the graph Laplacian. Recent applications in biology include a method that combines multiple kernels in a drug-target-disease interaction space to infer new drug-target interactions, and a framework that identifies robust drug biomarkers from pharmacogenomic data in 3D organoid models.

Two key challenges remain: achieving a high signal-to-noise ratio in biological network data and managing the size of large networks. This study addresses both by developing a procedure that constructs compact, information-rich, cancer-specific graphs through the integration of heterogeneous data and a knowledge-based graph reduction algorithm.

Literature Review

Existing research uses biological networks, particularly PPI networks, to analyze how oncogenic changes alter information flow in cancer cells. Methods such as Vavien use network information flow to identify drug targets, prioritizing genes by their topological similarity to known disease-associated genes. Studies of FDA-approved anticancer drugs reveal the complexity of drug mechanisms and the need for more sophisticated predictive models. Graph-based machine learning methods, including graph kernels (e.g., Weisfeiler-Lehman) and spectral methods based on the graph Laplacian, have emerged as powerful tools for analyzing biological networks, with recent applications in predicting drug-target interactions and identifying drug biomarkers. Two challenges remain: managing the size and complexity of biological networks, and ensuring a signal-to-noise ratio sufficient for effective machine learning.

Methodology

This study integrates heterogeneous cancer-specific data (differential gene expression, disease-gene association scores, and kinase inhibitor profiling) onto a human PPI network. The integration process, illustrated in Figure 1, maps up- and down-regulated genes, disease association scores, and pIC50 values onto the network. The resulting graph contains kinase and non-kinase nodes whose features are gene expression values, pIC50 values (kinases only), and disease association scores.

Full-size networks present three problems: the graph topology is identical across instances, the graphs are sparse, and a high proportion of the features are uninformative. To address these, a knowledge-based graph reduction algorithm was developed. The algorithm applies edge contraction, merging the two nodes incident to an edge only when both are non-kinase proteins, share the same differential gene expression, and belong to the same biological process cluster. Biological process clusters are obtained by grouping nodes according to Gene Ontology (GO) term similarity computed with GOGO, a method for calculating semantic similarities between GO terms. Figure 2 shows that GOGO similarities are highest for first-order neighbors and decrease with distance. Hierarchical clustering analysis (HCA) was then used to partition proteins into clusters (HCA-30, HCA-100, HCA-300).

Figure 3 illustrates the graph reduction process, showing how edge contraction reduces network size while preserving important features. The reduced graphs have a higher ratio of kinase to non-kinase proteins, increased density, and higher clustering coefficients, indicating improved information exchange. Figure 4 shows the information gain/loss after reduction, indicating that HCA-30 provides the best balance between increasing feature entropy and preserving graph-feature entropy.

To evaluate prediction performance, tissue-level cross-validation was performed using nine tissue groups.
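
The edge-contraction rule described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the node names, the `(is_kinase, dge, cluster)` attribute tuple, and the function `reduce_graph` are hypothetical, and the aggregation of features for merged nodes is left out because the summary does not specify it. Since merged nodes share the attributes that the rule tests, repeatedly contracting eligible edges yields the same grouping as taking connected components over eligible edges, which the sketch computes with a union-find structure.

```python
from collections import defaultdict


def reduce_graph(edges, nodes):
    """Merge nodes joined by 'contractible' edges (hypothetical helper).

    nodes: dict mapping node name -> (is_kinase, dge, cluster)
    edges: iterable of (u, v) pairs
    Returns (groups, reduced_edges), where groups maps each representative
    node to the list of original nodes merged into it.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    def contractible(u, v):
        # The three conditions stated in the text: both non-kinase,
        # same differential gene expression, same biological process cluster.
        kinase_u, dge_u, cluster_u = nodes[u]
        kinase_v, dge_v, cluster_v = nodes[v]
        return (not kinase_u and not kinase_v
                and dge_u == dge_v and cluster_u == cluster_v)

    for u, v in edges:
        if contractible(u, v):
            ru, rv = find(u), find(v)
            if ru != rv:
                # Keep the smaller name as representative (arbitrary choice).
                parent[max(ru, rv)] = min(ru, rv)

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)

    # Rebuild edges between representatives, dropping contracted edges.
    reduced_edges = set()
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            reduced_edges.add((min(ru, rv), max(ru, rv)))
    return dict(groups), reduced_edges


# Toy input (made up for illustration): one kinase and four non-kinase nodes.
toy_nodes = {
    "K1": (True, "up", 0),
    "A": (False, "up", 1),
    "B": (False, "up", 1),
    "C": (False, "down", 1),
    "D": (False, "up", 2),
}
toy_edges = [("K1", "A"), ("A", "B"), ("B", "C"), ("B", "D"), ("K1", "C")]
groups, reduced_edges = reduce_graph(toy_edges, toy_nodes)
```

In this toy graph only the edge A-B satisfies all three conditions, so A and B merge into one node while the kinase K1 and the nodes C and D stay separate.
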
Matrix-based methods (MLP, SVM-PCA, RF-PCA) were compared with a graph-based approach using the Weisfeiler-Lehman (WL) Tree kernel. The Materials and Methods section describes the data sources (STRING for the PPI network, CCLE for gene expression, Team-SKI for kinase inhibitor profiling, DISEASES and DisGeNET for disease-gene associations, and LINCS for growth rate inhibition data) and their integration. It also defines the graph statistics (average degree, density, diameter, clustering coefficient, betweenness centrality) and graph-feature entropy, and details the matrix-based and graph-based machine learning approaches, including the specific models and features used.
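
To give a sense of how a Weisfeiler-Lehman kernel compares two graphs, the sketch below implements the standard WL subtree relabeling on small labeled graphs. It is an illustration only: it assumes discrete string node labels ("K" for kinase, "N" for non-kinase, chosen here for the example), whereas the study's graphs carry richer features, and the function names `wl_features` and `wl_kernel` are hypothetical rather than taken from the paper. Each iteration replaces a node's label with the pair (own label, sorted neighbor labels), and the kernel value is the dot product of the label counts collected over all iterations.

```python
from collections import Counter


def wl_features(adj, labels, iterations=2):
    """Count WL labels over all refinement rounds.

    adj: dict node -> list of neighbors; labels: dict node -> string label.
    """
    current = dict(labels)
    features = Counter((0, lab) for lab in current.values())
    for it in range(1, iterations + 1):
        refined = {}
        for node in adj:
            # New label = own label plus the sorted multiset of neighbor labels.
            neighbor_labels = tuple(sorted(current[n] for n in adj[node]))
            refined[node] = (current[node], neighbor_labels)
        current = refined
        features.update((it, lab) for lab in current.values())
    return features


def wl_kernel(g1, g2, iterations=2):
    """Dot product of the WL label counts of two (adj, labels) graphs."""
    f1 = wl_features(*g1, iterations=iterations)
    f2 = wl_features(*g2, iterations=iterations)
    return sum(count * f2[key] for key, count in f1.items())


# Two toy graphs (made up): a three-node path and a triangle.
g_path = ({0: [1], 1: [0, 2], 2: [1]}, {0: "K", 1: "N", 2: "N"})
g_tri = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "K", 1: "N", 2: "N"})

k_cross = wl_kernel(g_path, g_tri, iterations=1)
k_path = wl_kernel(g_path, g_path, iterations=1)
```

Nested tuples are used as refined labels so that identical neighborhoods in different graphs receive identical labels without a shared lookup table; practical implementations usually compress these labels to integers for speed.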

Key Findings

The graph reduction algorithm successfully reduced the network size while enhancing the information content. The reduced networks showed improved graph statistics compared to full-size networks, including increased density, clustering coefficient, and average betweenness centrality. These changes indicate that information is more efficiently propagated through the reduced graphs. The information gain/loss analysis, depicted in Figure 4, demonstrated that using HCA-30 for clustering yielded the highest information gain for features while maintaining graph-feature information. The tissue-level cross-validation revealed that the graph-based approach using the WL Tree kernel significantly outperformed the matrix-based methods in predicting drug efficacy. Specifically, the graph-based method achieved an accuracy of 0.68, while the best-performing matrix-based method (MLP with DGE, KIP, and DGA features) reached an accuracy of only 0.60. Table 2 summarizes the performance of different algorithms, showing that the graph-based method also had better Matthews correlation coefficient (MCC) and F-score values compared to the matrix-based methods. These findings support the hypothesis that non-Euclidean representation of cancer-specific data significantly improves machine learning performance in predicting drug response.
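
For readers unfamiliar with the reported metrics, the sketch below computes accuracy, F-score, and MCC from the four cells of a binary confusion matrix. It assumes the F-score is the usual F1 (the harmonic mean of precision and recall), which the summary does not state explicitly, and the counts in the example are invented for illustration; they are not results from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1, and Matthews correlation coefficient from raw counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # MCC uses all four cells, so it stays informative for imbalanced classes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical counts, not taken from the paper.
example_metrics = binary_metrics(tp=40, fp=10, tn=30, fn=20)
```

Unlike accuracy, MCC ranges from -1 to 1 and equals 0 for a predictor no better than chance, which is why it is often reported alongside accuracy.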

Discussion

The superior performance of the graph-based approach highlights the importance of considering the topological structure of biological networks when predicting drug efficacy. The graph reduction algorithm effectively addresses the challenges posed by the high dimensionality and sparsity of biological data, leading to a more efficient and informative representation. The tissue-level cross-validation strategy effectively mitigates biases and ensures robust evaluation. The findings suggest that integrating multiple data sources and leveraging graph-based machine learning methods can significantly improve the accuracy of cancer drug response prediction. This could have important implications for personalized medicine, enabling more tailored and effective treatment strategies for cancer patients. Future research could explore more sophisticated graph neural network architectures to further enhance predictive performance and expand the approach to other cancer types and drug modalities.
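
One plausible reading of tissue-level cross-validation is a leave-one-group-out scheme in which every cell line from one tissue is held out together, so the model is never tested on a tissue it saw during training. The sketch below shows that splitting logic under this assumption; the function name `leave_one_tissue_out` and the tissue labels are hypothetical, and the study's exact protocol may differ.

```python
def leave_one_tissue_out(tissues):
    """Yield (held_out_tissue, train_indices, test_indices) per tissue group.

    tissues: list with one tissue label per sample, in sample order.
    """
    for held_out in sorted(set(tissues)):
        train = [i for i, t in enumerate(tissues) if t != held_out]
        test = [i for i, t in enumerate(tissues) if t == held_out]
        yield held_out, train, test


# Made-up tissue labels for five samples.
sample_tissues = ["breast", "lung", "breast", "skin", "lung"]
splits = list(leave_one_tissue_out(sample_tissues))
```

Each sample appears in exactly one test fold, and no tissue is shared between the training and test sets of a fold.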

Conclusion

This study presents a novel approach for integrating multiple cancer-specific data sources into a unified graph structure for improved machine learning performance in drug efficacy prediction. The knowledge-based graph reduction algorithm and the use of graph-based machine learning significantly improved prediction accuracy compared to traditional matrix-based methods. This approach offers a powerful tool for advancing cancer research and personalized medicine. Future work could involve exploring more advanced graph neural network architectures, incorporating additional data types, and validating the findings on larger and more diverse datasets.

Limitations

The study focused on a specific set of cancer cell lines and drugs, limiting the generalizability of the findings. The performance of the graph-based method relies on the quality of the integrated data and the accuracy of the graph reduction algorithm. Future studies should validate these findings on a broader range of cancers and drugs. Furthermore, the study assumes the availability of comprehensive and high-quality data for integration. Limitations in the completeness or accuracy of input data could affect the performance of the predictive models.
