MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer

X. Wang, M. Duan, et al.

MarsGT is a deep learning model for identifying rare cell populations that are important for understanding disease progression and therapy response. Applied to several single-cell multi-omics datasets, it resolves rare subpopulations that point to potential avenues for early detection and therapeutic intervention. This research was conducted by Xiaoying Wang, Maoteng Duan, Jingxian Li, Anjun Ma, Gang Xin, Dong Xu, Zihai Li, Bingqiang Liu, and Qin Ma.

Introduction
Identifying rare cell populations is critical for understanding tumor microenvironments, immunotherapy response, and disease progression, yet these populations are difficult to detect with standard single-cell analysis because they are scarce and their expression is transient. While single-cell RNA-seq provides high-resolution molecular profiles, relying on expression alone can cause rare cells to be misclassified as more prevalent types. Integrating scATAC-seq adds regulatory context (e.g., enhancer activity) that helps preserve cell identity signals. Existing rare-cell tools often suffer from high false-positive rates, poor performance on complex tissues, and difficulty detecting ultra-rare populations (<1%). The study proposes MarsGT, a multi-omics, graph-transformer-based framework that jointly models cells, genes, and regulatory elements to identify both major and rare cell populations and their regulatory programs.
Literature Review
Prior work on rare cell identification spans FIRE, GapClust, TooManyCells, GiniClust, RaceID, and SCMER, which often exhibit limitations such as high false positives, difficulty in complex or tumor biopsies, inability to simultaneously identify major and rare types, and reduced accuracy for ultra-rare populations. The growing availability of scATAC-seq and joint scRNA-seq/scATAC-seq enables construction of gene regulatory networks to better characterize rare populations. Graph neural networks and heterogeneous graph transformers have shown strong performance in complex biological data integration and single-cell analyses (e.g., DeepMAPS), motivating a unified framework to integrate diverse single-cell modalities, capture cross-modal relationships, and leverage attention-based message passing for improved clustering and regulatory inference.
Methodology
Data preprocessing: MarsGT takes matched scRNA-seq and scATAC-seq count matrices as input. Rows are genes (RNA) or peaks (ATAC), and columns are cells. Rows and columns with <0.1% non-zero values are removed. Standard QC (e.g., total reads, mitochondrial ratio) is performed with Seurat v3. A regulatory potential matrix between peaks and genes is computed following MAESTRO: by default, potential decays with genomic distance (half-decay 10 kb); peaks >150 kb from a gene have zero potential; peaks overlapping exons use exon length; peaks overlapping other nearby genes are excluded.
Multi-dataset integration: For multi-sample scRNA-seq, Harmony is used for batch correction. For scATAC-seq, peaks are binned at 5 kb across samples, and counts within bins are aggregated to form a unified matrix.
Heterogeneous graph construction: A tripartite heterogeneous graph is built with node types for cells, genes, and peaks. Unweighted edges link cell–gene and cell–peak pairs when expression/accessibility is non-zero, and gene–peak relations are included conceptually for regulatory potential. Initial node features are derived from the corresponding matrices, then linearly projected to a common embedding space (dimension 256).
Probability-based subgraph sampling: To highlight rare cell signals and improve efficiency, MarsGT selects for each target cell those genes/peaks that are highly expressed/accessible in that cell but low in others. Genes/peaks are first split into high and low groups by a cell-specific first-quartile threshold. Selection probabilities favor features with a high within-cell value relative to their total across all cells, enriching for rare-related features. By default, up to 20 genes and 20 peaks per cell are selected. Each subgraph comprises 30 randomly selected cells and their chosen neighbors, and multiple subgraphs form mini-batches for training.
Heterogeneous graph transformer and joint embeddings: A multi-head attention heterogeneous transformer passes messages across cell, gene, and peak nodes.
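To make the message-passing idea concrete, here is a minimal single-head sketch of type-aware attention in plain Python. It assumes the simplest form of type awareness, namely one set of query/key/value matrices per node type. The names `typed_attention` and `proj` and the toy 2-dimensional features are illustrative assumptions; MarsGT's actual layer (multi-head, 256-dimensional, with learned weights) is not reproduced here.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def typed_attention(target, neighbors, proj):
    """One attention head updating `target` from its `neighbors`.

    `target` and each neighbor are (node_type, feature_vector) pairs.
    `proj` maps node_type -> {"q": W, "k": W, "v": W}, so every node type
    is projected with its own matrices (the assumed type-aware part).
    """
    t_type, t_feat = target
    q = matvec(proj[t_type]["q"], t_feat)
    scale = math.sqrt(len(q))
    scores, values = [], []
    for n_type, n_feat in neighbors:
        k = matvec(proj[n_type]["k"], n_feat)
        scores.append(sum(a * b for a, b in zip(q, k)) / scale)
        values.append(matvec(proj[n_type]["v"], n_feat))
    weights = softmax(scores)
    # The updated embedding is the attention-weighted sum of neighbor values.
    updated = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(q))]
    return updated, weights

# Toy example: hand-picked (not learned) projection matrices.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]
proj = {
    "cell": {"q": IDENTITY, "k": IDENTITY, "v": IDENTITY},
    "gene": {"q": IDENTITY, "k": IDENTITY, "v": IDENTITY},
    "peak": {"q": IDENTITY, "k": SWAP, "v": IDENTITY},
}
cell = ("cell", [1.0, 0.0])
neighbors = [("gene", [1.0, 0.0]), ("peak", [1.0, 0.0])]
updated, weights = typed_attention(cell, neighbors, proj)
```

In this toy setup the gene and peak neighbors carry identical raw features, yet the gene receives the larger attention weight because the peak type uses a different key projection. That is the sense in which attention here depends on node type and not only on feature values.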
For each layer, query/key/value projections are computed per node, and attention is type-aware to handle heterogeneous relations. The model iteratively updates joint embeddings for all node types over sampled subgraphs.
Cell assignment and regulatory link prediction: From the learned embeddings, MarsGT predicts (1) a cell–cluster assignment probability matrix, in which each cell is assigned to the cluster for which its probability is highest; and (2) a peak–gene link assignment probability matrix per pseudo-cluster, capturing cluster-specific regulatory links. Training alternates between refining cell clusters and peak–gene links until convergence. Regularization terms preserve information pertinent to major cell types to avoid overemphasizing rare signals.
Whole-graph prediction and eGRN inference: The trained model is applied to the full graph to obtain final cell clusters and cluster-specific peak–gene links. Peak–gene link scores combine gene expression, peak accessibility, and regulatory potential. eGRNs are inferred by intersecting predicted peaks with TF binding sites (from JASPAR; p<0.05) to derive TF–peak–gene relationships per cluster.
Evaluation design: Benchmarks include 550 simulated datasets: 100 cell-line–based (homogeneous) with 500 cells and 2–3 types; 300 PBMC-based (heterogeneous) with 500 cells and 2–3 types; and 150 PBMC-based with 5,000 cells and 5–15 types. Additional gradient tests vary the rare-cell proportion (0.5%, 1%, 2%, 3%) across datasets, and a negative control set without rare cells tests for false positives. Real-data benchmarks comprise four PBMC datasets with ground-truth labels (three training, one independent test). Metrics are F1, Precision, and Recall for rare-cell identification, and NMI, Purity, and Entropy for clustering.
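The evaluation metrics named above can be illustrated with a short standard-library sketch. It uses the textbook definitions of precision, recall, and F1 for a single "rare" label, cluster purity, and a cluster-size-weighted label entropy in bits. The entropy weighting, the function names, and the toy labels are assumptions made for illustration; the paper's exact formulas may differ.

```python
import math
from collections import Counter, defaultdict

def rare_cell_scores(true_labels, pred_labels, rare="rare"):
    """Precision, recall, and F1 for calling cells as `rare`."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == rare and p == rare for t, p in pairs)
    fp = sum(t != rare and p == rare for t, p in pairs)
    fn = sum(t == rare and p != rare for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def purity_and_entropy(true_labels, clusters):
    """Cluster purity (higher is better) and weighted label entropy (lower is better)."""
    members = defaultdict(list)
    for label, cluster in zip(true_labels, clusters):
        members[cluster].append(label)
    n = len(true_labels)
    # Purity: fraction of cells that carry their cluster's majority label.
    purity = sum(max(Counter(m).values()) for m in members.values()) / n
    # Entropy: label entropy inside each cluster, weighted by cluster size.
    entropy = 0.0
    for m in members.values():
        h = -sum((c / len(m)) * math.log2(c / len(m))
                 for c in Counter(m).values())
        entropy += len(m) / n * h
    return purity, entropy

# Toy data: 6 major cells and 2 rare cells, with two mistakes.
truth = ["major"] * 6 + ["rare"] * 2
calls = ["major"] * 5 + ["rare", "rare", "major"]
precision, recall, f1 = rare_cell_scores(truth, calls)
purity, entropy = purity_and_entropy(truth, calls)
```

On this toy input one rare cell is found, one is missed, and one major cell is wrongly called rare, so precision, recall, and F1 all equal 0.5 and purity is 0.75.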
Downstream case studies cover mouse retina (snRNA/snATAC), human lymph node lymphoma (scRNA/scATAC), and multi-sample melanoma PBMC data, analyzed with pseudotime inference (Slingshot), gene signature scoring, CellChat-based cell–cell communication, and pathway enrichment.
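Returning to the probability-based subgraph sampling step described earlier in this section, the sketch below shows one way the per-cell feature selection could work. Several details are assumptions, not taken from the source: the first quartile is computed over the cell's non-zero values, the selection weight is exactly the within-cell value divided by the feature's total across cells, and the draw without replacement uses the Efraimidis-Spirakis key trick. The function name `select_features` and the toy matrix are hypothetical.

```python
import random
import statistics

def select_features(matrix, cell, k=20, rng=None):
    """Pick up to k features (rows) for one cell (column) of a count matrix.

    Features at or above the cell's first quartile are candidates. Each is
    weighted by value_in_cell / total_across_cells, so features concentrated
    in this cell are favored over ubiquitous ones. Counts must be non-negative.
    """
    if rng is None:
        rng = random.Random(0)
    in_cell = [row[cell] for row in matrix]
    nonzero = [v for v in in_cell if v > 0]
    if not nonzero:
        return []
    if len(nonzero) >= 2:
        threshold = statistics.quantiles(nonzero, n=4, method="inclusive")[0]
    else:
        threshold = nonzero[0]
    keyed = []
    for feature, value in enumerate(in_cell):
        if value > 0 and value >= threshold:
            weight = value / sum(matrix[feature])
            # Weighted sampling without replacement: keep the k largest keys.
            keyed.append((rng.random() ** (1.0 / weight), feature))
    keyed.sort(reverse=True)
    return [feature for _, feature in keyed[:k]]

# Toy matrix: rows are genes, columns are cells.
counts = [
    [5, 5, 5],  # gene 0: expressed everywhere
    [9, 0, 0],  # gene 1: specific to cell 0
    [0, 4, 4],  # gene 2: absent from cell 0
    [1, 8, 8],  # gene 3: low in cell 0, high elsewhere
]
selected = select_features(counts, cell=0, k=2)
```

For cell 0, gene 2 is never a candidate because it is not expressed, and gene 3 falls below the quartile threshold, so only genes 0 and 1 can be selected. When a single feature is drawn (`k=1`), gene 1 is picked more often than gene 0 because all of its counts sit in that one cell.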
Key Findings
Performance on simulations and real data: Across 350 simulated datasets (Sim-CL 1–2; Sim-PBMC 1–5), MarsGT outperformed CellSIUS, FIRE, and GapClust in F1, Precision, and Recall for rare-cell identification. On larger, more complex simulations (Sim-PBMC 7–9; 150 datasets), MarsGT also surpassed clustering-like tools (GiniClust, RaceID, SCMER) in Purity and Entropy, with NMI comparable to or better than GiniClust. In gradient tests with rare-cell proportions of 0.5–3%, MarsGT achieved F1 scores 11.56%–143.49% higher than the next-best method. On simulated datasets without rare cells (Sim-PBMC 6; 50 datasets), MarsGT did not force rare clusters, indicating a low false-positive rate. On real PBMC datasets (three training, one independent test), MarsGT achieved the best overall performance; in the independent test (PBMC-test), F1 improvements over the second-best tool (GiniClust) were 100% (1% rare) and 63.71% (3% rare), and NMI was 7.14% higher.
Mouse retina (9,383 cells, matched snRNA/snATAC): MarsGT identified 18 clusters (AC, 8 BC subtypes, Cone, HC, 3 MG, RGC, 3 Rod), including 12 rare populations (8 with 95% confidence by scPower). It resolved eight BC subpopulations with distinct marker profiles and pathway enrichments (e.g., neuron migration in BC1B; extracellular ligand-gated ion channel activity enriched in OFF types). Cell–cell communication analysis suggested non-canonical Wnt signaling from RBC to BC3/BC6, consistent with prior literature. MarsGT distinguished a rare MG subpopulation (MG-2; 127 cells) from MG-1 with functional differences: MG-1 was enriched in sprouting angiogenesis and MG-2 in structural constituent of eye lens. eGRNs revealed distinct peak–gene regulatory networks underpinning MG-1 vs MG-2.
Human lymph node with B-cell lymphoma (14,566 cells, matched scRNA/scATAC): MarsGT resolved 14 clusters and four B-cell subpopulations, including a rare B lymphoma intermediate state (BLS1; 95% confidence) not detected by Seurat or DeepMAPS. Pseudotime suggested the progression Normal B → BLS1 → BLS2 → BLS3. Gene signatures (anti-apoptosis, metastatic, PD-PDL1) and eGRN scores increased along this trajectory; regulatory activity on STAT1 and HIF1A for PDL1, and on BCL2, increased towards BLS3. MarsGT highlighted MEF2C as a unique TF for BLS1 and identified switch-enhancer TFs (POU2F2, FOXP1, SPI1, NFIC). In silico TF knockouts shifted lymphoma states toward normal B, suggesting potential intervention points and that BLS1 may act as a precursor state.
Melanoma PBMC multi-sample (10 matched samples: 2 healthy, 8 patients): MarsGT identified 13 clusters, including two rare MAIT-like CD8+ subpopulations (Clusters 9 and 12) marked by ZBTB16 and SLC4A10. eGRNs showed shared and unique enhancers and target genes across CD8+ subsets, with distinct pathway enrichments: MAIT-like 1 was enriched for positive regulation of cytokine production, IL-12 signaling, and type I IFN production; MAIT-like 2 was enriched for the MAPK cascade. ZBTB16 regulons differed mainly in regulatory relations rather than in expression/accessibility levels, supporting the value of regulatory information in defining rare states.
Immunotherapy mechanism (Interferon-I response capacity, IRC): Among MAIT-like cells, low-IRC patients contributed the majority (MAIT-like 1: 83.57%; MAIT-like 2: 70.18%). Despite the IRC group definitions, ISG expression was higher in low-IRC patients. Low-IRC eGRNs uniquely contained TCF1 and BCL6, consistent with maintaining T-cell stemness. MAIT-like 1 cells exhibited effective signatures in low-IRC patients and exhaustion signatures in high-IRC patients. Mechanistically, high IRC was associated with elevated DC IL10 and reduced IL15/IL18 costimulation, dampening MAIT activation. In low-IRC patients, the IFN-I, IL-15, and IL-18 pathways were relayed more robustly via TYK2–NFκB, JAK–STAT, and JNK–FOS/JUN to drive IFNG, GZMB, and PRF1, yielding more positive regulatory relations and an enhanced cytotoxic response. This may explain the better outcomes after PD-1 blockade in low-IRC patients.
Discussion
The study addresses the challenge of detecting rare cell populations by integrating transcriptomic and chromatin accessibility data within a heterogeneous graph transformer that prioritizes rare-associated features via probability-based subgraph sampling. This design enhances the signal-to-noise ratio in the presence of single-cell dropout, reduces false positives, and simultaneously infers cell clusters and their regulatory programs. Benchmarking across extensive simulations and real datasets shows robust improvements over state-of-the-art tools, including performance on ultra-rare populations and avoidance of false positives when rare cells are absent. Case studies demonstrate biological utility: identification of rare neuronal and glial subtypes in mouse retina with distinct regulatory networks; discovery of a rare intermediate B lymphoma state (BLS1) with testable TF dependencies and potential as a precursor for disease progression; and resolution of MAIT-like rare subpopulations in melanoma PBMCs with regulatory differences tied to IFN-I responsiveness and immunotherapy outcomes. Collectively, MarsGT advances rare-cell detection and mechanistic interpretation, providing actionable insights for early detection and immunotherapy strategy development.
Conclusion
MarsGT introduces an end-to-end, multi-omics heterogeneous graph transformer that jointly identifies major and rare cell populations and their enhancer–gene regulatory networks. It consistently outperforms existing methods across diverse simulations and real datasets, uncovers biologically meaningful rare populations missed by other tools, and elucidates regulatory mechanisms in cancer and immunotherapy contexts. Future work should incorporate formal statistical significance testing for rare-cell calls, integrate batch correction within model training, improve computational efficiency and reproducibility beyond GPU constraints, and further refine regularization to better balance performance on major and rare populations. MarsGT sets a foundation for precision medicine applications by enabling discovery of disease-associated rare populations and their regulatory drivers.
Limitations
- Lack of a dedicated statistical significance framework for rare-cell identification; current use of scPower provides confidence but a built-in method is desirable.
- Challenges with highly heterogeneous rare states (e.g., senescent cells) not fully addressed.
- Need for integrated batch correction during training for multi-sample analyses; current workflow uses pre-processing tools (Harmony, binning).
- GPU dependence may affect reproducibility across runs, though variance was reported negligible; small rare-cell counts can still limit stability.
- Regularization prioritizing rare-cell detection slightly sacrifices major-cell clustering performance; more sophisticated balancing is needed.