DUBStepR is a scalable correlation-based feature selection method for accurately clustering single-cell data

B. Ranjan, W. Sun, et al.

DUBStepR is a feature selection algorithm that uses gene-gene correlations to improve the accuracy of single-cell data clustering. It outperforms existing feature selection methods in the reported benchmarks, deconvolves cell heterogeneity in rheumatoid arthritis patient data, and scales to large datasets across multiple data types.

Introduction

The study addresses the challenge of selecting informative features for clustering in single-cell RNA-seq analyses. Conventional pipelines rely on steps including quality control, normalization, feature selection, dimensionality reduction, clustering, and annotation. However, existing feature selection methods (e.g., highly variable genes, dropout-based approaches) can be inconsistent and do not exploit gene-gene correlation structure. The authors hypothesize that leveraging gene-gene correlations and optimizing the number of features based on the distribution of cells in feature space will improve cluster separation and marker gene selection. They introduce DUBStepR, which uses correlation-based candidate selection, stepwise regression to identify minimally redundant seed features, and a Density Index to optimize feature set size, aiming to enhance clustering accuracy and scalability across diverse datasets.

Literature Review

Prior approaches to single-cell feature selection include mean-variance modeling to identify highly variable genes (HVGs), PCA loading-based selection of high-loading genes (HLG), dropout-based methods such as M3Drop/DANB, and GiniClust for rare cell types. These methods typically evaluate each gene independently and do not account for gene-gene correlations. Benchmarking studies of such methods have been limited and sometimes show inconsistent performance, with some methods producing worse clustering than using all genes. Furthermore, existing strategies for choosing the size of the feature set are not directly tied to optimizing cell-type separation in feature space.

Methodology

DUBStepR is a correlation-based feature selection workflow for single-cell data. Key steps:

  • Gene filtering: remove mitochondrial genes, ribosomal genes, spike-ins, and pseudogenes; retain genes expressed in at least 5% of cells by default (threshold adjustable).
  • Construct a gene-gene correlation (GGC) matrix from normalized expression data. For each gene, compute a correlation range: the strongest positive correlation minus the strongest negative correlation, with the negative term down-weighted by a factor of 0.75. The positive term uses the second-largest non-self correlation rather than the largest, to avoid artifacts from overlapping exons. To control for technical noise, genes are binned by mean expression and the correlation range is z-scored within each bin. Genes with z-score ≤ 0.7 are filtered out.
  • Stepwise regression on the GGC matrix to identify minimally redundant "seed" genes: iteratively regress out the gene explaining the most variance in the residual correlation structure. An efficient computation uses a single G^T G multiplication to score variance explained for all genes per step. Perform steps until an elbow point in the scree plot of variance explained is reached; the genes regressed out up to the elbow form the seed set. For efficiency, the procedure explicitly computes 30 steps and extrapolates to 100 steps if needed, as variance contributions stabilize.
  • Guilt-by-association expansion: iteratively add genes most strongly correlated (Pearson) to the current feature set, expanding from seeds through the GGC network to prioritize genes that best represent coherent expression signatures, until the target feature set size is reached.
  • Density Index (DI) for selecting feature set size: For each candidate feature set size, compute PCA (default top 20 PCs) to obtain cell embeddings. Define DI as the ratio of the global scale of the embedding (the root-mean-squared distance over all cell pairs, computed via a Frobenius norm formulation) to its local scale (k-nearest-neighbor distances). DI quantifies the inhomogeneity (clumpiness) of cells in feature space; a higher DI indicates better clustering potential. Choose the feature set size that maximizes DI.
  • Scalability: Sparse-matrix implementations, an optimized elbow computation, and efficient matrix operations enable processing of datasets with over 1 million cells. The method integrates with Seurat for downstream PCA, clustering (e.g., Louvain), and visualization (e.g., UMAP).
  • Benchmarks: Compared against HVG variants (HVGDisp, HVGVST, trendVar), deviance-based selection (devianceFS), HLG, M3Drop/DANB, and GiniClust using silhouette index (SI) for cluster separation and AUROC for DE vs non-DE gene discrimination, across 7 datasets spanning 4 scRNA-seq protocols. Scalability benchmarked on a mouse organogenesis atlas with up to ~1M cells. Application demonstrated on RA PBMC T/NK cells and generalization tested on scATAC-seq peaks.
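
The correlation range step in the list above can be illustrated with a minimal Python sketch. It assumes the score is the second-largest non-self Pearson correlation minus 0.75 times the most negative correlation, and it omits the expression binning and z-scoring. The function names (`pearson`, `correlation_range`) and the toy data are hypothetical and are not taken from the DUBStepR package.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_range(expr, neg_weight=0.75):
    """Score each gene by its correlation range.

    expr maps gene name -> list of normalized expression values (one per cell).
    Assumed formula: second-largest non-self correlation minus
    neg_weight times the most negative correlation. Needs >= 3 genes.
    """
    genes = list(expr)
    scores = {}
    for g in genes:
        # Correlations of g with every other gene, sorted ascending.
        rho = sorted(pearson(expr[g], expr[h]) for h in genes if h != g)
        second_largest = rho[-2]
        most_negative = rho[0]
        scores[g] = second_largest - neg_weight * most_negative
    return scores

# Toy data: four hypothetical genes measured in five cells.
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [5, 4, 3, 2, 1],
    "g4": [1, 3, 2, 5, 4],
}
scores = correlation_range(expr)
```

In this toy example `g1` and `g2` receive the highest scores because each has both strongly correlated and strongly anti-correlated partners, while `g3`, which is anti-correlated with every other gene, scores lowest.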

Key Findings

  • Correlation structure and DE genes: Genes with high correlation range scores are significantly enriched for cell-type-specific DE genes; correlation range predicts fold-change magnitude and DE significance.
  • Cluster separation performance: Across feature set sizes from 50 to 4000, DUBStepR achieved the highest mean scaled silhouette index (SI) among tested methods, typically peaking at 200–300 features. It ranked top on 5 of 7 datasets.
  • Marker gene detection: DUBStepR achieved AUROC > 0.97 for distinguishing DE vs non-DE genes on all 7 datasets, indicating near-perfect discrimination. devianceFS exceeded 0.97 on 4/7 datasets; other methods performed lower.
  • Optimal feature set size via DI: DI correlates positively with SI and often peaks at similar feature set sizes. In 5 of 7 datasets, the DI-maximizing feature set also maximized SI; in the remaining 2 datasets, DI guidance improved upon the maximum SI observed across tested sizes.
  • Scalability: On a ~1M-cell mouse organogenesis dataset, DUBStepR processed the full dataset, ranking 4th (of 8) in memory and compute time. Memory usage scaled by ~6.4x from 100k to 1M cells, versus ~12.5x for several other methods, indicating favorable memory scalability.
  • Complex primary samples (RA PBMCs): In 5329 T/NK cells from 4 RA patients, DUBStepR identified 10 subtypes, including rare populations largely missed by other methods: red blood cells (1.8%), proliferating cells (2%), platelet–T doublets (3.4%), and platelet–NK doublets (3%). It also resolved a dichotomy within CD4+ T, CD8+ T, and NK cells marked by coordinated expression (e.g., SET, C1orf56, C16orf54, CDC42SE1, HNRNPH1) linked to a latently infected T-cell program.
  • Generalization to scATAC-seq: Using DUBStepR-selected peaks improved trajectory analysis in human hematopoietic scATAC-seq data, recovering a lineage topology (lymphoid, myeloid, megakaryocyte/erythroid) consistent with known differentiation hierarchies and outperforming both no feature selection and highly variable feature (HVF) selection.
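
The Density Index referred to in the findings above can be sketched as follows. This is a simplified illustration rather than the package's implementation: it assumes the global term is the root-mean-squared distance over all cell pairs and the local term is the mean distance from each cell to its k nearest neighbors, and it uses brute-force distances instead of a Frobenius norm shortcut. The function name `density_index` and the toy coordinates are hypothetical.

```python
import math

def density_index(cells, k=2):
    """Ratio of global RMS pairwise distance to mean k-nearest-neighbor distance.

    cells: list of equal-length coordinate lists (e.g. PCA embeddings).
    """
    n = len(cells)
    dist = [[math.dist(a, b) for b in cells] for a in cells]

    # Global scale: root-mean-squared distance over all distinct cell pairs.
    squared = [dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)]
    global_rms = math.sqrt(sum(squared) / len(squared))

    # Local scale: mean distance from each cell to its k nearest neighbors.
    local = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        local.append(sum(nearest) / k)
    local_mean = sum(local) / n

    return global_rms / local_mean

# Two tight clumps versus an evenly spaced grid (toy 2-D embeddings).
clumpy = [[0, 0], [0, 0.1], [0.1, 0], [10, 10], [10, 10.1], [10.1, 10]]
uniform = [[0, 0], [5, 0], [10, 0], [0, 5], [5, 5], [10, 5]]

di_clumpy = density_index(clumpy)
di_uniform = density_index(uniform)
```

The clumped embedding yields a much larger value than the evenly spaced one. A feature-set-size search would compute this quantity for each candidate size and keep the size with the largest value.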

Discussion

The findings support the hypothesis that leveraging gene-gene correlations enhances feature selection for single-cell clustering. By identifying minimally redundant seeds and expanding via guilt-by-association, DUBStepR captures coherent expression programs that maximize separation among biologically distinct cell types. The Density Index provides a practical, label-free proxy for cluster separation, enabling data-driven selection of feature set size, which correlates with and often recapitulates the silhouette index maximum. DUBStepR consistently improves both downstream cluster separability and marker gene selection accuracy across multiple platforms and datasets, and it scales effectively to very large datasets. Its utility extends to complex primary samples, where it resolves both rare populations and subtle functional states that other methods miss, and to scATAC-seq, where correlation-based selection overcomes sparsity limitations of variance-based methods. Collectively, these results underscore the value of correlation-aware, DI-optimized feature selection as a foundation for robust single-cell clustering and downstream biological discovery.
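
The guilt-by-association expansion mentioned above can be sketched as a greedy loop. The sketch assumes each candidate gene is scored by its strongest Pearson correlation to any gene already selected; the summary does not spell out the exact scoring rule, so this is one plausible reading. `expand_features`, `symmetric`, and the toy correlation values are hypothetical.

```python
def expand_features(corr, seeds, target_size):
    """Greedily grow a feature set from seed genes.

    corr: dict gene -> dict gene -> Pearson correlation (symmetric).
    seeds: genes to start from. target_size: desired number of features.
    """
    selected = list(seeds)
    candidates = set(corr) - set(selected)
    while candidates and len(selected) < target_size:
        # Pick the candidate most strongly correlated with any selected gene.
        # Sorting first makes tie-breaking deterministic.
        best = max(
            sorted(candidates),
            key=lambda g: max(corr[g][s] for s in selected),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

def symmetric(pairs):
    """Build a symmetric gene -> gene -> correlation lookup from pair values."""
    corr = {}
    for (a, b), r in pairs.items():
        corr.setdefault(a, {})[b] = r
        corr.setdefault(b, {})[a] = r
    return corr

# Toy correlations between four hypothetical genes.
corr = symmetric({
    ("A", "B"): 0.9, ("A", "C"): 0.2, ("A", "D"): 0.1,
    ("B", "C"): 0.8, ("B", "D"): 0.1, ("C", "D"): 0.1,
})
features = expand_features(corr, seeds=["A"], target_size=3)
print(features)  # ['A', 'B', 'C']
```

Gene `C` is only weakly correlated with the seed `A`, but it is pulled in through `B`, which is how expansion through the correlation network can recover a coherent expression signature from a single seed.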

Conclusion

DUBStepR introduces a correlation-based, stepwise regression framework for feature selection coupled with a Density Index to optimize feature set size. It outperforms existing methods in cluster separation and DE gene identification, robustly detects rare and cryptic cell states, scales to million-cell datasets, and generalizes to scATAC-seq. The approach suggests that relatively small, optimally chosen feature sets (often 200–500 genes) can suffice for high-quality clustering. Future work could extend the DI concept to optimize other pipeline stages (e.g., choosing the number of PCs, evaluating normalization strategies) and further explore applications to other single-cell omics (e.g., scChIP-seq, single-cell methylomes).

Limitations

  • The optimal feature set size varies across datasets; while DI correlates with SI, it does not always exactly match the SI-optimal size (agreement in 5/7 datasets).
  • In complex primary samples (e.g., RA PBMCs), absence of definitive ground-truth labels necessitated qualitative comparisons, limiting absolute performance quantification.
  • Although scalable and efficient, DUBStepR ranked fourth in memory and compute time among tested methods on the 1M-cell benchmark, indicating room for further optimization.