
An integrated network representation of multiple cancer-specific data for graph-based machine learning
L. Pu, M. Singha, et al.
This study by Limeng Pu, Manali Singha, Hsiao-Chun Wu, Costas Busch, J. Ramanujam, and Michal Brylinski presents a method for predicting how cancer cell lines respond to drug treatment from genomic data. A knowledge-based graph reduction algorithm produces compact, feature-rich graphs, and graph-based learning on these non-Euclidean representations predicts drug response more accurately than matrix-based baselines.
Introduction
Cancer is a systems-level disease characterized by dysregulated cell signaling and complex phenotypes. Biological networks, especially protein–protein interaction (PPI) networks, capture alterations in information flow arising from oncogenic changes and can help relate gene products to disease mechanisms. While prior network-based studies and graph algorithms (e.g., diffusion/flow-based prioritization, network analyses of drug–cancer–target relationships) have shown promise, accurately predicting drug response from genomic features alone remains challenging due to cancer’s heterogeneity and multifactorial nature. This work addresses the problem by integrating heterogeneous, cancer-specific data—differential gene expression, kinase inhibitor profiling, protein interactions, and disease–gene associations—into unified, cancer-specific graphs. The goal is to construct compact yet information-rich network representations and evaluate whether graph-based machine learning can improve prediction of cell-line drug response compared with traditional matrix-based approaches.
Literature Review
The paper reviews network-based methods linking gene products to diseases (e.g., Vavien for prioritizing disease genes based on topological similarity) and studies mapping FDA-approved anticancer drugs into cancer–drug–target networks to reveal new associations. It outlines two broad classes of graph learning: graph kernels (e.g., Weisfeiler–Lehman, WL) that generate node/graph features through iterative label refinement, and spectral methods using the graph Laplacian for clustering and classification. PageRank is cited as a random-walk method for node importance. Recent biology-oriented graph ML includes integrating multiple kernels into a tripartite drug–target–disease network to infer interactions, and network-based biomarker discovery from pharmacogenomic organoid data predictive of patient response. These works establish the utility of network structure and motivate advanced graph ML for drug response prediction.
Methodology
Data sources and preprocessing: A human PPI network was built from STRING (confidence score ≥ 500), yielding 19,144 proteins and 685,198 interactions after removing disconnected and small components. Differential gene expression (DGE) for 18,022 genes across 1,035 cancer cell lines was obtained from CCLE via Harmonizome; each gene is categorized as up-regulated, down-regulated, or normally regulated relative to healthy cells. Kinase inhibitor profiling (KIP) data curated by Team-SKI provide pIC50 values for 49,348 small molecules against 411 kinases (cutoff pIC50 ≥ 6.3). Disease–gene association (DGA) scores came from DISEASES (scores 1–10) and DisGeNET (0.01–1). Kinases in the PPI were identified via BLAST against known human kinases (95% similarity), giving 508 kinases in-network; Team-SKI provided pIC50 values for 411 of these, and 29 small molecules overlapped with the available growth inhibition data. Cell-line disease identifiers (DOID/Concept IDs) were mapped via Cellosaurus to annotate DGA per cell line. Missing node feature values were imputed with the median of first-order neighbors. The final dataset comprises annotated graphs for 3,549 cell line–drug combinations across 359 cell lines and 29 drugs.
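The neighbor-median imputation step can be illustrated with a minimal sketch. The function name, the dictionary-based graph layout, and the choice to leave a node unfilled when none of its neighbors has an observed value are assumptions made for illustration; the summary does not describe how the authors implemented this step.

```python
from statistics import median

def impute_from_neighbors(adjacency, values):
    """Fill missing node values with the median of observed first-order neighbors.

    adjacency: dict mapping node -> set of neighboring nodes (hypothetical layout)
    values:    dict mapping node -> float, or None when the value is missing
    """
    imputed = dict(values)
    for node, value in values.items():
        if value is not None:
            continue
        # Use only originally observed neighbor values, so the result does not
        # depend on the order in which nodes are visited.
        observed = [values[n] for n in adjacency.get(node, ())
                    if values.get(n) is not None]
        if observed:
            imputed[node] = median(observed)
        # Assumption: a node with no observed neighbors stays missing.
    return imputed

# Toy graph: A has three annotated neighbors, E is isolated.
adjacency = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}, "E": set()}
values = {"A": None, "B": 2.0, "C": 5.0, "D": 9.0, "E": None}
result = impute_from_neighbors(adjacency, values)
```

In this toy case node A receives the median of its three neighbors, while the isolated node E remains unannotated.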
Graph construction and reduction: For each cell line–drug pair, features were mapped to the common PPI topology: nodes carry DGE status; some nodes (both kinase and non-kinase) carry DGA scores; kinase nodes may also carry pIC50 for the drug’s targets. To address sparsity and uniform topology across instances, a knowledge-based edge contraction (graph reduction) was developed. An edge is contractible only if both endpoints are non-kinases, share the same DGE status, and belong to the same biological process cluster. Biological process clusters were obtained by computing GO BP semantic similarities with GOGO and performing agglomerative hierarchical clustering (HCA) into 30, 100, or 300 clusters. HCA-30 was selected based on entropy analysis and alignment with the 30 level-1 GO BP categories.
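The contraction rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and variable names are hypothetical, and a union-find structure is swapped in to merge the endpoints of contractible edges. Because all three conditions compare node attributes that a merged node inherits unchanged, merging whole groups at once should give the same result as contracting edges one at a time.

```python
def contract_graph(edges, is_kinase, dge, cluster):
    """Merge endpoints of every contractible edge.

    edges:     iterable of (u, v) pairs from the PPI network
    is_kinase: dict node -> bool
    dge:       dict node -> expression status ("up", "down", "normal")
    cluster:   dict node -> biological process cluster id
    Returns (groups, reduced_edges): the merged node groups and the edges
    that remain between different groups.
    """
    nodes = set(is_kinase)
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    def contractible(u, v):
        # Both endpoints non-kinase, same DGE status, same BP cluster.
        return (not is_kinase[u] and not is_kinase[v]
                and dge[u] == dge[v]
                and cluster[u] == cluster[v])

    for u, v in edges:
        if contractible(u, v):
            parent[find(u)] = find(v)

    reduced_edges = set()
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # drop edges that collapsed inside one group
            reduced_edges.add(frozenset((ru, rv)))

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values()), reduced_edges

# Toy example: P1-P3 share status and cluster; K1 is a kinase; P4 differs in DGE.
is_kinase = {"P1": False, "P2": False, "P3": False, "K1": True, "P4": False}
dge = {"P1": "up", "P2": "up", "P3": "up", "K1": "up", "P4": "down"}
cluster = {n: 7 for n in is_kinase}
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "K1"), ("P3", "P4"), ("K1", "P4")]
groups, reduced_edges = contract_graph(edges, is_kinase, dge, cluster)
```

Here P1, P2, and P3 collapse into a single node, while the kinase K1 and the differently regulated P4 are preserved, which is why reduced graphs differ between cell lines even though the starting topology is shared.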
Graph statistics and entropy: The study computed standard graph metrics (average degree, density, diameter, clustering coefficient, maximum/average betweenness) pre- and post-reduction. Information gain/loss from reduction was quantified via Shannon entropy on features alone and a graph-feature entropy that filters feature vectors through the graph Laplacian, combining topology and features. Relative entropy change δ = (S_reduced − S_original)/S_original was evaluated for different reduction schemes.
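As a rough illustration of the feature-only part of this analysis, the sketch below computes Shannon entropy over categorical node labels and the relative change δ. The base-2 logarithm and the toy label lists are assumptions; the Laplacian-filtered graph-feature entropy is not reproduced here because the summary does not define it in enough detail.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def relative_change(s_reduced, s_original):
    """delta = (S_reduced - S_original) / S_original, as defined in the text."""
    return (s_reduced - s_original) / s_original

# Hypothetical node labels: reduction merges many redundant "normal" nodes,
# so the remaining label distribution is more balanced.
original = ["normal"] * 6 + ["up"] + ["down"]
reduced = ["normal"] * 2 + ["up"] + ["down"]

s_original = shannon_entropy(original)
s_reduced = shannon_entropy(reduced)
delta = relative_change(s_reduced, s_original)
```

A positive δ in this toy case reflects the same intuition as the reported feature-only entropy gains: collapsing runs of identically labeled nodes leaves a graph whose features are less dominated by one category.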
Drug response data and labeling: Growth rate inhibition metrics (GRmax) from multiple LINCS/partner datasets were used. Cell line–drug combinations with negative GRmax were labeled positive (cytotoxic response; n=2,124), and those with positive GRmax were labeled negative (cytostatic response; n=1,425).
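The labeling rule amounts to thresholding GRmax at zero, as in the short sketch below. The function name and the handling of a GRmax of exactly zero (returned as unlabeled) are assumptions; the summary does not say how such a case was treated.

```python
def label_response(gr_max):
    """Map a GRmax value to a binary class label.

    Returns 1 (positive class, cytotoxic) for negative GRmax,
    0 (negative class, cytostatic) for positive GRmax,
    and None for exactly zero (assumption: left unlabeled).
    """
    if gr_max < 0:
        return 1
    if gr_max > 0:
        return 0
    return None

# Hypothetical GRmax values for three cell line-drug pairs.
labels = [label_response(g) for g in (-0.35, 0.42, 0.0)]

# Sanity check on the class sizes reported in the text.
total_pairs = 2124 + 1425
```

The two class sizes add up to the 3,549 cell line–drug combinations in the final dataset.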
Evaluation protocol and models: A 9-fold cross-validation at the tissue level (digestive, respiratory, hematopoietic/lymphoid, breast, female reproductive, skin, nervous, excretory, others) was employed, holding out each tissue as validation to minimize overlap in expression and disease-association patterns between train/validation.
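A leave-one-tissue-out split of this kind can be sketched in a few lines. The function name and the toy tissue list are hypothetical; only the idea of holding out every sample from one tissue per fold comes from the text.

```python
def leave_one_tissue_out(tissues):
    """Yield (held_out_tissue, train_indices, validation_indices) per fold.

    tissues: list with the tissue label of each cell line-drug sample.
    """
    for held_out in sorted(set(tissues)):
        train = [i for i, t in enumerate(tissues) if t != held_out]
        valid = [i for i, t in enumerate(tissues) if t == held_out]
        yield held_out, train, valid

# Toy example with three tissues; the study uses nine tissue groups.
tissues = ["breast", "skin", "breast", "nervous", "skin"]
folds = list(leave_one_tissue_out(tissues))
```

Splitting by group rather than at random keeps all samples of a tissue on the same side of the split, which is the property the protocol relies on to limit leakage between training and validation.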
- Matrix-based baselines: (i) DGE flattened (19,144) concatenated with 300-dim Mol2vec ligand embeddings (LE) for the drug (total 19,444 features) and (ii) flattened matrix of DGE, KIP, DGA (19,144 × 3 = 57,432 features). Both used an MLP with layers: input, 1024, 512, 256, and 2-unit output (effective vs ineffective). Additional classifiers included SVM-PCA and RF-PCA on the same feature sets.
- Graph-based model: Weisfeiler–Lehman (WL) graph kernel (WL Tree) applied to reduced graphs with DGE, KIP, and DGA node labels/features to generate graph-level representations for classification.
All performance metrics (ACC, PPV, TPR, MCC, F-score) were reported as tissue-level cross-validated results.
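To make the graph-based model more concrete, the following is a minimal sketch of Weisfeiler–Lehman label refinement and the resulting subtree-style kernel between two labeled graphs. It assumes a single categorical label per node and uses readable string signatures instead of compressed labels; how the study encodes DGE, KIP, and DGA together as node labels, and which WL implementation it uses, is not stated in the summary.

```python
from collections import Counter

def wl_features(adjacency, labels, iterations=1):
    """Count node labels over the initial labeling and each WL refinement round.

    adjacency: dict node -> iterable of neighbors
    labels:    dict node -> initial string label
    """
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for node, neighbors in adjacency.items():
            # New label = own label plus the sorted multiset of neighbor labels.
            neighborhood = sorted(current[n] for n in neighbors)
            refined[node] = current[node] + "(" + ",".join(neighborhood) + ")"
        current = refined
        features.update(current.values())
    return features

def wl_kernel(features_a, features_b):
    """Dot product of two label-count vectors (missing labels count as 0)."""
    return sum(count * features_b[label] for label, count in features_a.items())

# Two three-node path graphs with the same topology but different labels.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
g1 = wl_features(path, {"A": "u", "B": "n", "C": "u"})
g3 = wl_features(path, {"A": "u", "B": "n", "C": "d"})

k_same = wl_kernel(g1, g1)
k_diff = wl_kernel(g1, g3)
```

The resulting kernel values can be passed to any kernel-based classifier. The example also shows why node features matter when topology is shared: the two graphs have identical edges, yet their kernel value drops once one label differs.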
Key Findings
- Graph reduction outcomes: Reduction decreased the number of nodes from 19,144 (full PPI) to an average of 1,349 ± 80, with 12,613 ± 608 edges. Density increased from 0.004 to 0.014 ± 0.0009; diameter decreased from 8 to 4.073 ± 0.26; clustering coefficient increased from 0.287 to 0.659 ± 0.006. Maximum betweenness centrality increased from 0.021 to 0.596 ± 0.011; average betweenness rose from 1.11×10^-4 to 7.88×10^-4 ± 4.49×10^-6, indicating more efficient information flow and richer local structure.
- Entropy analysis: A simple reduction requiring common GO BP terms increased feature-only entropy by 2.3 ± 0.6 but decreased graph-feature entropy by 0.4 ± 0.04. HCA-based reductions increased feature-only entropy while preserving or slightly increasing graph-feature entropy; HCA-30 achieved the highest feature-only entropy gain of 3.5 ± 0.9 with a slight increase in graph-feature entropy.
- Predictive performance (tissue-level CV):
• Matrix MLP (DGE+LE): ACC 0.55, PPV 0.63, TPR 0.64, MCC 0.27, F-score 0.55.
• Matrix MLP (DGE+KIP+DGA): ACC 0.60, PPV 0.60, TPR 0.60, MCC 0.20, F-score 0.60.
• SVM-PCA (DGE+KIP+DGA): ACC 0.62, PPV 0.72, TPR 0.53, MCC 0.16, F-score 0.45.
• RF-PCA (DGE+KIP+DGA): ACC 0.44, PPV 0.56, TPR 0.53, MCC 0.09, F-score 0.39.
• Graph WL Tree (reduced graphs; DGE+KIP+DGA): ACC 0.68, PPV 0.67, TPR 0.65, MCC 0.32, F-score 0.65.
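The graph-based WL Tree approach on reduced, integrated graphs achieved the highest ACC, TPR, MCC, and F-score of all models tested; only SVM-PCA reported a higher PPV (0.72 versus 0.67), at the cost of lower TPR (0.53) and MCC (0.16). Overall, the results demonstrate the benefit of non-Euclidean representations and knowledge-based reduction for drug response prediction.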
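For reference, the five reported metrics can all be derived from a binary confusion matrix, as in the sketch below. The confusion-matrix counts are hypothetical and are not taken from the study; the function also assumes that no denominator is zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, PPV, TPR, F-score, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    ppv = tp / (tp + fp)   # precision
    tpr = tp / (tp + fn)   # recall / sensitivity
    f_score = 2 * ppv * tpr / (ppv + tpr)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ACC": acc, "PPV": ppv, "TPR": tpr, "F": f_score, "MCC": mcc}

# Hypothetical counts for one validation fold.
metrics = binary_metrics(tp=60, fp=30, tn=70, fn=40)
```

MCC is the most conservative of the five because it uses all four cells of the confusion matrix, which makes it a useful tie-breaker when accuracy and precision point in different directions, as they do for SVM-PCA above.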
Discussion
By integrating heterogeneous molecular and clinical-relevance features (DGE, KIP, DGA) onto a common PPI backbone and applying a biologically constrained graph reduction, the study increases graph compactness and diversity across cell lines while preserving critical biological context. This transformation mitigates the feature sparsity and identical-topology issues inherent to full PPI graphs, enriching local structure and improving information propagation, as evidenced by increased density, clustering, and betweenness metrics. Entropy analyses indicate that HCA-30 maximizes feature information without sacrificing topology–feature coherence. Under tissue-level cross-validation, the WL kernel on reduced graphs improves on matrix-based models using the same features (ACC 0.68 versus at most 0.62, MCC 0.32 versus at most 0.20, F-score 0.65 versus at most 0.60), indicating that graph representations more effectively capture relational and pathway-level dependencies underlying drug response. These results directly address the challenge of predicting pharmacotherapy effects from complex cancer data by leveraging system-level structure and graph-based learning.
Conclusion
The study introduces an integrated, cancer-specific graph representation combining PPI topology with differential expression, kinase inhibition, and disease–gene associations, alongside a knowledge-driven edge contraction procedure grounded in GO BP clustering. The resulting reduced graphs are compact yet information-rich and diverse across cell lines. Applying the WL graph kernel to these graphs outperforms matrix-based baselines in predicting cytotoxic versus cytostatic responses under rigorous tissue-level cross-validation. The generated datasets are publicly available, facilitating broader application.
Potential future directions include: expanding the drug and cell line coverage to improve generalizability; incorporating additional omics layers (e.g., mutations, copy-number, phosphoproteomics); exploring end-to-end graph neural networks and attention mechanisms on the reduced graphs; and assessing transferability to patient-derived models and clinical outcomes.
Limitations
- Limited drug coverage: Only 29 small molecules overlapped between kinase profiling and GR datasets, potentially constraining generalizability across chemotypes and mechanisms.
- Feature sparsity and imputation: Missing node features were imputed from first-order neighbor medians, which may introduce bias for sparsely annotated regions of the network.
- Dependency on GO BP clustering: The reduction relies on GO BP semantic similarity and HCA-30; performance may vary with ontology completeness, annotation quality, or alternative clustering choices.
- Uniform base topology: All instances share the same underlying PPI topology; improvements stem from feature mapping and reduction rather than topology changes at the protein–protein edge level.
- Algorithmic scope: Evaluation focused on WL Tree for graph learning; other graph learning paradigms (e.g., GNNs) were not benchmarked here.
- Practical constraints: The choice to avoid sparse matrix formats reflects current library limitations and may influence baseline matrix-model efficiency comparisons.