MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer

X. Wang, M. Duan, et al.

MarsGT is a deep learning model for identifying rare cell populations that matter for understanding disease progression and therapy response. Applied to several datasets, it reveals distinct subpopulations and points to potential avenues for early detection and therapeutic intervention. This research was conducted by Xiaoying Wang, Maoteng Duan, Jingxian Li, Anjun Ma, Gang Xin, Dong Xu, Zihai Li, Bingqiang Liu, and Qin Ma.

Introduction
Rare cell populations, despite their low abundance, play significant roles in various biological processes, including neoplastic progression and therapeutic response. Their identification is challenging due to their scarcity and transient expression. While single-cell RNA sequencing (scRNA-seq) has advanced our ability to identify cell types, current tools struggle with accurately identifying rare populations, often yielding high false positives or failing with complex samples. The integration of scATAC-seq data offers additional regulatory information which, when combined with scRNA-seq, can improve the identification of rare cells through gene regulatory network construction. Graph neural networks, particularly heterogeneous graph transformers (HGTs), have proven effective in analyzing complex biological data, making them suitable for integrating multi-omics data and improving rare cell identification. This paper introduces MarsGT, a novel method designed to overcome the limitations of existing techniques.
Literature Review
Existing methods for rare cell identification, such as FIRE, GapClust, TooManyCells, GiniClust, RaceID, and SCMER, face several challenges. These include high false-positive rates, limited performance in complex samples (e.g., tumor biopsies), inability to simultaneously identify both major and rare cell types, and compromised accuracy with ultra-rare cell types. These limitations often stem from insufficient representation of rare cells in the data, leading to inaccurate grouping with more prevalent populations when relying solely on gene expression data. The advancements in single-cell ATAC sequencing (scATAC-seq) provide a complementary data source, offering insights into gene regulatory mechanisms and further enhancing the accuracy of rare cell identification. Graph neural networks have shown promise in analyzing complex biological data, particularly scMulti-omics data; however, a dedicated tool designed specifically for efficient and accurate rare cell population identification using this approach has been lacking.
Methodology
MarsGT is an end-to-end deep learning model that uses a probability-based heterogeneous graph transformer to identify rare cell populations from scMulti-omics data. It integrates scRNA-seq and scATAC-seq data to construct a heterogeneous graph whose nodes are cells, genes, and enhancers (scATAC-seq peaks). A probability-based subgraph sampling technique favors genes that are highly expressed, and peaks that are highly accessible, in a target cell but not in other cells. This sampling improves efficiency while highlighting rare cell-related features. A multi-head attention mechanism updates the joint embeddings of cells, genes, and peaks. During training, the model iteratively determines peak-gene relationships and rare cell populations at the same time, which improves accuracy and reduces false positives. Regularization terms are included to avoid diminishing the features of major cell populations. After training, the model is applied to the entire heterogeneous graph, and a transcription factor (TF) database is integrated to construct cell cluster enhancer gene regulatory networks (eGRNs). Performance was evaluated with four metrics (F1 score, NMI, Purity, Entropy) across extensive simulations and real datasets, including mouse retina, human lymph node, and melanoma data. MarsGT was benchmarked against existing state-of-the-art methods (CellSIUS, FIRE, GapClust, GiniClust, RaceID, and SCMER).
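The idea behind probability-based feature sampling can be illustrated with a small sketch. This is not the MarsGT implementation: the scoring rule (a feature's value in the target cell relative to its mean in all other cells) and the function names `sampling_probabilities` and `sample_features` are assumptions chosen only to show how features that are high in one cell and low elsewhere can be sampled preferentially.

```python
import random


def sampling_probabilities(matrix, cell, eps=1e-9):
    """Return one sampling probability per feature for a target cell.

    `matrix` is a list of rows (cells) of non-negative feature values
    (for example gene expression or peak accessibility) with at least
    two rows. The scoring rule is an illustrative assumption, not the
    formula used by MarsGT.
    """
    n_cells = len(matrix)
    n_features = len(matrix[0])
    target = matrix[cell]
    scores = []
    for j in range(n_features):
        column_sum = sum(row[j] for row in matrix)
        others_mean = (column_sum - target[j]) / (n_cells - 1)
        # Close to 1 when the feature is specific to the target cell,
        # exactly 0 when the target cell does not carry the feature.
        scores.append(target[j] / (target[j] + others_mean + eps))
    total = sum(scores)
    if total == 0:
        # Nothing is detected in this cell: fall back to uniform sampling.
        return [1.0 / n_features] * n_features
    return [s / total for s in scores]


def sample_features(matrix, cell, k, rng):
    """Draw up to k distinct feature indices, weighted by the probabilities."""
    probs = sampling_probabilities(matrix, cell)
    candidates = [j for j, p in enumerate(probs) if p > 0]
    chosen = []
    while candidates and len(chosen) < k:
        weights = [probs[j] for j in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


# Toy example: feature 0 is specific to cell 0, feature 2 is shared by all cells.
toy = [[5, 0, 1], [0, 4, 1], [0, 4, 1]]
print(sampling_probabilities(toy, 0))
print(sample_features(toy, 0, 2, random.Random(0)))
```

In this toy matrix the cell-specific feature receives a higher probability than the shared one, and a feature absent from the target cell is never drawn. A real pipeline would apply the same principle to sparse matrices and to genes and peaks separately.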
Key Findings
MarsGT demonstrated superior performance in identifying rare and major cell populations compared to existing methods across 550 simulated datasets and four real human datasets. In mouse retina data, MarsGT identified unique subpopulations of rare bipolar cells and a Müller glia cell subpopulation, providing novel biological insights. In human lymph node data, MarsGT detected an intermediate B cell population (BLS1) potentially acting as lymphoma precursors, suggesting potential avenues for early detection and intervention. In human melanoma data, MarsGT identified a rare MAIT-like population affected by a high IFN-I response, clarifying the mechanism of immunotherapy response and predicting patient survival based on interferon response capacity (IRC). Across all datasets, MarsGT consistently outperformed existing methods in F1 score and NMI metrics, demonstrating its robustness and accuracy, particularly in identifying ultra-rare cell types.
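For readers unfamiliar with these metrics, the sketch below shows how an F1 score for a single rare label, NMI, and purity can be computed from true and predicted labels. It is a minimal illustration, not the evaluation code used in the study. It assumes that predicted cluster labels have already been mapped to cell type names, and it normalizes NMI by the arithmetic mean of the two label entropies, which is one common convention among several.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    t_counts, p_counts = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * math.log(c * n / (t_counts[t] * p_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(truth) + entropy(pred)) / 2
    # Both labelings constant: treat them as trivially identical.
    return mi / denom if denom > 0 else 1.0


def purity(truth, pred):
    """Fraction of cells that belong to the majority true type of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)


def f1_for_label(truth, pred, label):
    """F1 score for one label, e.g. a rare cell type."""
    tp = sum(t == label and p == label for t, p in zip(truth, pred))
    fp = sum(t != label and p == label for t, p in zip(truth, pred))
    fn = sum(t == label and p != label for t, p in zip(truth, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: ten cells, two of them rare, one rare cell missed.
truth = ["major"] * 8 + ["rare"] * 2
pred = ["major"] * 8 + ["rare", "major"]
print(f1_for_label(truth, pred, "rare"), nmi(truth, pred), purity(truth, pred))
```

The toy example shows why a per-label F1 score is informative for rare populations: missing one of two rare cells still leaves purity at 0.9, while the F1 score for the rare label drops to about 0.67.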
Discussion
MarsGT's superior performance across diverse datasets highlights its potential as a powerful tool for rare cell population identification. The ability to integrate scRNA-seq and scATAC-seq data, coupled with the effectiveness of the heterogeneous graph transformer and probability-based subgraph sampling, significantly improves the accuracy and efficiency of rare cell detection. The discovery of novel rare cell populations in multiple datasets underscores the potential of MarsGT to uncover previously unknown biological insights relevant to disease progression and treatment response. The findings on melanoma datasets, for example, provide a deeper understanding of the role of MAIT-like cells and the mechanism of immunotherapy response based on IRC, offering new strategies for improving treatment outcomes. The identification of a potential precursor state for B-cell lymphoma highlights MarsGT’s potential for early disease detection and development of preventative strategies.
Conclusion
MarsGT offers a significant advancement in rare cell population identification, outperforming existing methods in accuracy and efficiency. Its ability to integrate multi-omics data and leverage the power of graph transformers enables the discovery of novel cell subpopulations and mechanistic insights relevant to diverse diseases. Future research should focus on developing more sophisticated significance testing methods, addressing challenges posed by highly heterogeneous rare cell types, incorporating advanced batch correction techniques, and improving model efficiency to enhance reproducibility and accessibility.
Limitations
While MarsGT performs well, it has several limitations. Statistical significance testing for rare cell identification requires further refinement. The model's reliance on GPU computation might limit reproducibility for researchers with limited computational resources. The regularization term used to balance the identification of rare and major cell populations slightly compromises the accuracy of major cell population identification. Finally, highly heterogeneous rare cell types, such as senescent cells, remain a challenge that requires further investigation and methodological improvement.