Introduction
Single-cell RNA sequencing (scRNA-seq) data analysis often begins with clustering cells based on gene expression to identify distinct cell types or states. This process typically involves several steps: quality control, normalization, feature selection, dimensionality reduction (often PCA), clustering, and cell type annotation. Feature selection, the process of choosing the subset of genes most informative for clustering, is critical but inconsistently performed. Existing methods, such as those based on mean-variance modeling (highly variable genes, HVGs), principal component loadings (high-loading genes, HLGs), or dropout rates (M3Drop), typically score each gene individually, neglecting the information carried by gene-gene correlations. These methods also lack a principled way to determine the optimal number of features. This paper introduces DUBStepR, a feature selection method designed to address both limitations by exploiting gene-gene correlations and by introducing a new metric for choosing the feature set size.
Literature Review
Several feature selection methods have been proposed for scRNA-seq data. Mean-variance modeling identifies highly variable genes (HVGs), assuming that genes with high variance across cells are important for distinguishing cell types. Other approaches select genes with high loadings on principal components (HLGs) or genes with higher-than-expected dropout rates (M3Drop). GiniClust uses a modified Gini index to identify genes highly expressed in a small subset of cells, focusing on rare cell types. However, these methods generally treat genes independently, ignoring the information embedded in gene-gene correlation networks. Existing benchmarking studies of feature selection methods have been limited, lacking comprehensive evaluation across diverse datasets and clustering metrics. A robust feature selection method should select cell-type-specific marker genes, i.e., differentially expressed (DE) genes, while maximizing the separation between distinct cell clusters.
Methodology
DUBStepR (Determining the Underlying Basis using Stepwise Regression) is a correlation-based feature selection algorithm. It starts by filtering out low-quality genes (e.g., mitochondrial genes, ribosomal genes, and genes expressed in fewer than 5% of cells). Next, it calculates a correlation range score for each gene, reflecting the difference between its strongest positive and negative correlations. Genes with high correlation range scores are considered candidate features, as highly correlated genes are likely to be related to the same cell type or subtype.
DUBStepR then uses stepwise regression on the gene-gene correlation matrix to identify a minimally redundant subset of seed genes that collectively represent diverse patterns of cell-type-specific expression. The number of seed genes is determined using an elbow point in the scree plot of explained variance. Finally, DUBStepR expands the seed gene set iteratively by adding correlated genes using a guilt-by-association approach. This prioritizes genes that strongly represent the same expression signature as the seed genes.
The optimal size of the final feature set is determined by maximizing a novel metric called the Density Index (DI), which quantifies the inhomogeneity or "clumpiness" of the cell distribution in the feature space. The DI is calculated as the root-mean-squared distance between all cell pairs divided by the mean distance between a cell and its k nearest neighbors. A highly clustered distribution, indicative of well-separated cell types, leads to a high DI. DUBStepR uses sparse matrix computations to efficiently handle large datasets.
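The two scores described above can be illustrated with a minimal sketch. It assumes plain Pearson correlation and Euclidean distance and uses only the Python standard library; it is not the DUBStepR implementation. The function names, the default `k`, and the choice to average each cell's k nearest-neighbor distances before averaging over cells are assumptions made for illustration, and the published correlation range score may use a more robust variant of the simple largest-minus-smallest difference shown here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def correlation_range(expression):
    """Assumed form of the score: largest minus smallest correlation of each
    gene with every other gene. `expression` maps gene name -> per-cell values."""
    if len(expression) < 2:
        raise ValueError("need at least two genes")
    scores = {}
    for gene, values in expression.items():
        rhos = [pearson(values, other_values)
                for other, other_values in expression.items() if other != gene]
        scores[gene] = max(rhos) - min(rhos)
    return scores


def density_index(cells, k=5):
    """RMS distance over all cell pairs divided by the mean distance from a
    cell to its k nearest neighbors. `cells` is a list of coordinate tuples."""
    n = len(cells)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of cells")
    dist = [[math.dist(a, b) for b in cells] for a in cells]

    # Root-mean-squared distance over all unordered cell pairs.
    squared = [dist[i][j] ** 2 for i, j in combinations(range(n), 2)]
    rms = math.sqrt(sum(squared) / len(squared))

    # Mean over cells of the average distance to the k nearest other cells.
    knn_total = 0.0
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        knn_total += sum(others[:k]) / k
    mean_knn = knn_total / n
    if mean_knn == 0:
        raise ValueError("all nearest-neighbor distances are zero")
    return rms / mean_knn


# Toy data: two tight clumps versus an evenly spaced grid.
clustered = [(0, 0), (0, 0.1), (0.1, 0), (10, 10), (10, 10.1), (10.1, 10)]
grid = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
print(density_index(clustered, k=2), density_index(grid, k=2))
```

On this toy data the clumped points yield a much larger DI than the evenly spaced grid, which matches the behavior described in the text. The dense all-pairs distance matrix is used only for readability; it would not scale to the dataset sizes discussed in the paper.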
Key Findings
DUBStepR was benchmarked against seven commonly used feature selection methods (HVGDisp, HVGVST, trendVar, devianceFS, HLG, M3Drop/DANB, and GiniClust) using seven diverse scRNA-seq datasets covering four different protocols (10x Genomics, Drop-Seq, CEL-Seq2, and Smart-Seq2). The evaluation included two metrics: the Silhouette index (SI), a measure of cluster separation, and the area under the receiver operating characteristic curve (AUROC), a measure of the ability to distinguish differentially expressed (DE) genes from non-DE genes. DUBStepR consistently outperformed all other methods in both metrics across a range of feature set sizes, demonstrating significantly better cluster separation and accurate identification of cell-type-specific marker genes.
Notably, DUBStepR's performance peaked at a relatively small number of features (200-300), suggesting that optimizing feature selection can reduce the need for a large number of features. The Density Index (DI) was found to be a good predictor of the optimal feature set size, and DUBStepR's use of DI for feature set size selection improved the SI in several cases. Furthermore, DUBStepR demonstrated robust scalability to over one million cells.
In an analysis of PBMCs from rheumatoid arthritis patients, DUBStepR uniquely identified several rare cell populations (e.g., RBCs, platelet-containing doublets, and proliferating T cells) and subtle cell states within T and NK cell populations, highlighting its ability to detect subtle heterogeneity. Finally, applying DUBStepR to single-cell ATAC-seq data showed that it improves the resolution of cell type inference and reveals more coherent hematopoietic differentiation trajectories.
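The two benchmark metrics named above, SI and AUROC, can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's actual code: the silhouette uses Euclidean distance, the AUROC is computed as the fraction of (DE, non-DE) gene pairs ranked correctly, and `cells`, `labels`, `scores`, and `de_genes` are hypothetical inputs, since this summary does not describe how cluster labels or per-gene scores were obtained.

```python
import math


def silhouette_index(cells, labels):
    """Mean silhouette width over all cells, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(cells[i], cells[j]) for j in members) / len(members)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # A cell alone in its cluster gets width 0 by convention.
            widths.append(0.0)
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


def auroc(scores, de_genes):
    """Probability that a random DE gene outscores a random non-DE gene
    (ties count half). `scores` maps gene name -> feature-ranking score."""
    pos = [s for gene, s in scores.items() if gene in de_genes]
    neg = [s for gene, s in scores.items() if gene not in de_genes]
    if not pos or not neg:
        raise ValueError("need at least one DE and one non-DE gene")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: two well-separated clumps with correct cluster labels.
cells = [(0, 0), (0, 0.1), (0.1, 0), (10, 10), (10, 10.1), (10.1, 10)]
print(silhouette_index(cells, [0, 0, 0, 1, 1, 1]))
print(auroc({"g1": 0.9, "g2": 0.3, "g3": 0.8, "g4": 0.1}, {"g1", "g2"}))
```

An SI near 1 indicates tight, well-separated clusters, and an AUROC of 0.5 corresponds to a ranking that is no better than chance at placing DE genes above non-DE genes.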
Discussion
DUBStepR's superior performance stems from its use of gene-gene correlations, a source of information ignored by most existing feature selection methods. The method’s effectiveness is further demonstrated by its robust scalability and ability to identify rare cell types and subtle cell states. The Density Index (DI) provides a novel and efficient way to determine the optimal feature set size, improving cluster separation. The application of DUBStepR to single-cell ATAC-seq data opens new avenues for feature selection in epigenomic studies. These results suggest that integrating gene-gene correlations into feature selection significantly improves single-cell data clustering accuracy and is crucial for analyzing large and complex datasets.
Conclusion
DUBStepR offers a powerful, scalable, and versatile solution for feature selection in single-cell data analysis. Its superior performance in separating cell types, accurately identifying marker genes, and detecting rare populations, combined with its scalability to large datasets, makes it a valuable tool for scRNA-seq and scATAC-seq data analysis. Future work could explore applying the Density Index (DI) to other stages of single-cell data analysis and extending DUBStepR to other single-cell omics data types.
Limitations
While DUBStepR demonstrates significant improvements over existing methods, several limitations exist. The algorithm's performance depends on the quality of the input data and the accuracy of the gene-gene correlation estimates. The DI, while effective in most cases, may not always correlate perfectly with the Silhouette index. The algorithm's computational complexity might still be a limiting factor for extremely large datasets, although it scales significantly better than most existing feature selection methods. Further research is needed to investigate its performance across a broader range of datasets and experimental conditions.