Computer Science
Fast and precise single-cell data analysis using a hierarchical autoencoder
D. Tran, H. Nguyen, et al.
Discover scDHA (single-cell Decomposition using Hierarchical Autoencoder), a framework developed by Duc Tran, Hung Nguyen, Bang Tran, Carlo La Vecchia, and Hung N. Luu. This tool enhances scRNA-seq data analysis by filtering out noise and projecting the informative signal into a lower-dimensional space, surpassing existing techniques in clustering, visualization, classification, and trajectory inference.
~3 min • Beginner • English
Introduction
Advances in microfluidics and sequencing have enabled single-cell resolution profiling, opening opportunities in developmental biology and clinical research. However, scRNA-seq analyses are challenged by massive cell numbers, technical noise, and high dropout rates, which degrade accuracy and scalability. Numerous computational tools exist for unsupervised cell segregation (e.g., SC3, SEURAT, SINCERA, CIDR, SCANPY), visualization of high-dimensional transcriptomes (Isomap, Diffusion Map, t-SNE, UMAP), supervised cell classification (XGBoost, Random Forest, Deep Learning, GBM), and pseudo-time inference (Monocle, TSCAN, Slingshot, SCANPY). To more effectively separate noise from informative biological signals and to improve performance across these tasks, the authors introduce scDHA, a hierarchical autoencoder framework comprising a non-negative kernel autoencoder for denoising/feature selection and a stacked Bayesian autoencoder for low-dimensional embedding, enabling accurate and scalable downstream analyses within a single pipeline.
Literature Review
The paper outlines the landscape of scRNA-seq analysis methods across key tasks. For unsupervised clustering (cell segregation), widely used methods include SC3, SEURAT, SINCERA, CIDR, and SCANPY; these underpin atlas-building efforts across organisms and developmental stages. For visualization and dimensionality reduction, non-linear methods such as Isomap, Diffusion Map, t-SNE, and UMAP aim to preserve local structure and avoid overcrowding; t-SNE is the most common, while UMAP and SCANPY are more recent. After subpopulations are validated, supervised classifiers such as XGBoost, Random Forest, deep learning, and Gradient Boosting Machine can label new cells. For trajectory (pseudo-time) inference, which models processes such as proliferation and differentiation, Monocle, TSCAN, Slingshot, and SCANPY are cited as state-of-the-art. As scRNA-seq datasets grow exponentially, these tools face mounting challenges from noise and sparsity, motivating integrated, robust approaches like scDHA.
Methodology
Data and preprocessing: The study evaluates 34 scRNA-seq datasets spanning diverse tissues, protocols, and sizes. Ambiguously labeled cells were removed. If an input matrix range exceeded 100, log2 transformation was applied. For scDHA, per-cell min–max scaling transforms each cell’s expression to [0,1] to reduce variance and suppress outliers.
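As a concrete reference, a minimal NumPy sketch of these two preprocessing steps might look as follows; the range test and per-cell scaling follow the description above, while the function name and toy input are illustrative assumptions.

```python
import numpy as np

def preprocess(expr):
    """expr: cells x genes raw expression matrix."""
    # Log-transform when the value range exceeds 100, as described above.
    if expr.max() - expr.min() > 100:
        expr = np.log2(expr + 1)          # pseudo-count avoids log2(0)
    # Per-cell min-max scaling to [0, 1] to damp variance and outliers.
    lo = expr.min(axis=1, keepdims=True)
    hi = expr.max(axis=1, keepdims=True)
    return (expr - lo) / np.maximum(hi - lo, 1e-12)

X = preprocess(np.random.poisson(5.0, size=(100, 2000)).astype(float))
```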
Pipeline overview: scDHA has two core modules. Module 1 is a non-negative kernel autoencoder yielding a non-negative, part-based representation. Encoder weights are constrained to be non-negative, shrinking less-informative features toward zero. Genes/components with high weight variance are retained (by default ~5000 genes; user-adjustable). The bottleneck layer uses 50 nodes; varying this had minimal impact.
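A hedged PyTorch sketch of this module is shown below: encoder weights are clamped to be non-negative during training, and genes are then ranked by the variance of their encoder weights. The layer sizes, training loop, and toy data are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class NonNegativeAE(nn.Module):
    def __init__(self, n_genes, bottleneck=50):
        super().__init__()
        self.encoder = nn.Linear(n_genes, bottleneck)
        self.decoder = nn.Linear(bottleneck, n_genes)

    def forward(self, x):
        # Clamp encoder weights so the representation stays part-based.
        with torch.no_grad():
            self.encoder.weight.clamp_(min=0.0)
        z = torch.relu(self.encoder(x))
        return torch.sigmoid(self.decoder(z))

def select_genes(model, top_k=5000):
    # Rank genes by the variance of their (non-negative) encoder weights;
    # low-variance columns are the features shrunk toward zero.
    w = model.encoder.weight.detach()          # bottleneck x n_genes
    var = w.var(dim=0)                         # weight variance per gene
    return torch.topk(var, min(top_k, w.shape[1])).indices

model = NonNegativeAE(n_genes=2000)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.rand(100, 2000)                      # scaled expression in [0, 1]
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)
    loss.backward()
    opt.step()
keep = select_genes(model)                     # indices of retained genes
```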
Module 2 is a stacked Bayesian autoencoder, a modified variational autoencoder (VAE) with multiple latent spaces and explicit generation of multiple realizations of latent variables via the reparameterization trick z = μ + σ·N(0,1). The latent dimensionality is small (m ≈ 15) to avoid overfitting and keep latent variables disentangled; varying m between 10 and 20 did not alter results. Training uses AdamW optimization with a two-stage schedule: warm-up using reconstruction loss only, followed by a VAE stage adding KL divergence to regularize latent distributions. Scaled exponential linear unit activations are used. The encoder provides compressed representations for downstream tasks.
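The following PyTorch sketch illustrates the core ideas under simplified assumptions (layer sizes, epoch counts, and the 30-epoch warm-up boundary are invented for illustration): SELU activations, the reparameterization trick with several latent realizations, AdamW, and a schedule that adds the KL term only after a reconstruction-only warm-up.

```python
import torch
import torch.nn as nn

class BayesianAE(nn.Module):
    def __init__(self, n_in, m=15):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 64), nn.SELU())
        self.mu = nn.Linear(64, m)
        self.logvar = nn.Linear(64, m)
        self.dec = nn.Sequential(nn.Linear(m, 64), nn.SELU(),
                                 nn.Linear(64, n_in), nn.Sigmoid())

    def forward(self, x, n_samples=1):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Draw multiple realizations of the latent variables:
        # z = mu + sigma * N(0, 1)  (reparameterization trick).
        recons = [self.dec(mu + torch.exp(0.5 * logvar) * torch.randn_like(mu))
                  for _ in range(n_samples)]
        return recons, mu, logvar

x = torch.rand(100, 2000)
model = BayesianAE(n_in=x.shape[1])
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for epoch in range(100):
    opt.zero_grad()
    recons, mu, logvar = model(x, n_samples=3)
    rec = sum(nn.functional.mse_loss(r, x) for r in recons) / len(recons)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Stage 1 (warm-up): reconstruction only; stage 2: add KL divergence.
    loss = rec if epoch < 30 else rec + kl
    loss.backward()
    opt.step()
emb = model.mu(model.enc(x)).detach()   # compressed representation
```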
Clustering: The number of clusters is estimated via two indices computed over k-nearest-neighbor spectral clustering (k-nn SC) across candidate cluster counts: (i) the ratio of between-cluster sum of squares to total sum of squares, and (ii) the increase in within-cluster sum of squares as clusters are added. k-nn SC builds a sparse affinity matrix from Pearson correlations among k-nearest neighbors, computes the normalized Laplacian, selects the eigenvectors associated with the smallest eigenvalues, and applies k-means to those eigenvectors; a sketch of this core step follows. To improve robustness, consensus clustering aggregates multiple projection/clustering runs using wMetaC (weighted meta-clustering) with a final voting scheme. For large datasets, a voting procedure clusters a subset of cells and assigns the remaining cells via k-NN, reducing runtime without compromising accuracy.
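The sketch below shows the central spectral step only, with toy parameters; it omits the cluster-count indices, consensus aggregation, and the large-data voting scheme, and the neighborhood size k is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def knn_spectral(emb, n_clusters, k=10):
    corr = np.corrcoef(emb)                    # Pearson similarity, cells x cells
    aff = np.zeros_like(corr)
    for i in range(corr.shape[0]):
        nn_idx = np.argsort(-corr[i])[1:k + 1] # k nearest neighbors (skip self)
        aff[i, nn_idx] = corr[i, nn_idx]
    aff = np.clip(np.maximum(aff, aff.T), 0, None)  # symmetric, non-negative
    d = aff.sum(axis=1)
    d_inv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    lap = np.eye(len(d)) - d_inv @ aff @ d_inv # normalized Laplacian
    vals, vecs = np.linalg.eigh(lap)
    u = vecs[:, :n_clusters]                   # eigenvectors of smallest eigenvalues
    return KMeans(n_clusters, n_init=10).fit_predict(u)

emb = np.random.rand(200, 15)                  # compressed representation
labels = knn_spectral(emb, n_clusters=4)
```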
Visualization: From the compressed data (10–15 dimensions), a distance matrix is computed, then log- and z-transformations are applied to normalize the neighbor distance distributions. Probabilities pij proportional to exp(−Dij), so that nearby cells receive higher neighbor probability, are computed in the high-dimensional space. A neural network projects the data to 2D (or 3D), where probabilities qij are recomputed, and the Kullback–Leibler divergence KL(P||Q) is minimized to align neighborhood structure between the two spaces.
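A simplified PyTorch sketch of this scheme follows; the exact transforms and network shape in scDHA may differ, so the details here should be read as assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def neighbor_probs(x, eps=1e-9):
    # Pairwise Euclidean distances (stabilized so gradients stay finite).
    sq = (x * x).sum(dim=1, keepdim=True)
    d = torch.sqrt(torch.clamp(sq + sq.T - 2.0 * x @ x.T, min=0.0) + eps)
    d = torch.log1p(d)                          # log transform
    d = (d - d.mean()) / (d.std() + eps)        # z transform
    p = torch.exp(-d) * (1.0 - torch.eye(x.shape[0]))  # drop self-pairs
    return p / (p.sum() + eps)

emb = torch.rand(200, 15)                       # compressed representation
P = neighbor_probs(emb).detach()                # fixed high-dimensional probs
net = nn.Sequential(nn.Linear(15, 32), nn.SELU(), nn.Linear(32, 2))
opt = torch.optim.AdamW(net.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    Q = neighbor_probs(net(emb))                # probs in the 2-D space
    kl = torch.sum(P * torch.log((P + 1e-9) / (Q + 1e-9)))  # KL(P || Q)
    kl.backward()
    opt.step()
coords2d = net(emb).detach()                    # 2-D coordinates for plotting
```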
Classification: Training and test datasets (same tissue) are concatenated on common genes, jointly normalized and compressed using scDHA. A similarity matrix (Pearson correlation) is computed, and test labels are assigned using k-nearest neighbors. Joint processing leverages denoising, dimension reduction, and batch effect mitigation by scDHA.
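In code, the final assignment step might look like the sketch below, where the neighborhood size k and the toy embeddings and labels are assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(train_emb, train_labels, test_emb, k=10):
    preds = []
    for cell in test_emb:
        # Pearson correlation between this test cell and each training cell.
        sims = np.array([np.corrcoef(cell, t)[0, 1] for t in train_emb])
        top = np.argsort(-sims)[:k]             # k most similar training cells
        preds.append(Counter(train_labels[top]).most_common(1)[0][0])
    return np.array(preds)

train_emb = np.random.rand(150, 15)             # jointly compressed training cells
test_emb = np.random.rand(30, 15)               # jointly compressed test cells
train_labels = np.random.choice(["alpha", "beta", "delta"], size=150)
pred = knn_classify(train_emb, train_labels, test_emb)
```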
Pseudo-time inference: For non-branching trajectories, given a user-specified starting cell, scDHA computes Pearson correlation distances on compressed data, constructs a graph, derives a minimum spanning tree, and defines pseudo-time as the distance along the MST from the starting point.
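A compact SciPy sketch of this procedure, with a toy embedding and an arbitrary starting cell standing in for the user's choice:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

emb = np.random.rand(100, 15)                  # compressed representation
dist = 1.0 - np.corrcoef(emb)                  # Pearson correlation distance
mst = minimum_spanning_tree(dist)              # sparse MST over the cell graph
start = 0                                      # user-specified starting cell
# Pseudo-time: path length from the start along the (undirected) MST.
pseudo_time = shortest_path(mst, directed=False, indices=start)
```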
Software comparisons: Competing tools were run per authors’ instructions with default parameters and ample memory (200 GB limit). Specific versions of SC3, SEURAT, SINCERA, CIDR, SCANPY, PCA, t-SNE, UMAP, H2O (for XGB, RF, DL, GBM), Monocle3, TSCAN, and Slingshot were used. Reproducibility is supported via a Docker image and fixed random seeds.
Key Findings
Clustering across 34 datasets with known labels: scDHA achieved the highest accuracy and consistency. Average adjusted Rand index (ARI) was 0.81 with low variability, compared to 0.50 for the second-best method (CIDR). A one-sided Wilcoxon test showed scDHA’s ARIs were significantly higher than others (p = 2.2 × 10^-16). scDHA and SCANPY were the fastest; on the 44,808-cell Macosko dataset, scDHA finished in under 5 minutes, while CIDR required 3312 minutes (>2 days). Simulations with Splatter (25 datasets, up to 100,000 cells and varying sparsity) likewise showed highest ARIs for scDHA and fastest runtimes alongside SCANPY. Re-normalization experiments indicated a slight performance advantage for TPM over CPM and RPKM when using scDHA.
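For readers reproducing such comparisons, both reported quantities are available in standard libraries; the label vectors and per-dataset scores below are toy assumptions, not the paper's data.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from scipy.stats import wilcoxon

# Adjusted Rand index between true and predicted cluster labels.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]
ari = adjusted_rand_score(true_labels, pred_labels)

# One-sided paired Wilcoxon test: are method A's per-dataset ARIs
# significantly greater than method B's?
ari_a = np.array([0.81, 0.78, 0.90, 0.85, 0.75])
ari_b = np.array([0.50, 0.55, 0.60, 0.52, 0.48])
stat, p = wilcoxon(ari_a, ari_b, alternative="greater")
```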
Visualization on the same 34 datasets: scDHA produced clearer separations of true cell types and reduced batch-driven splits compared to PCA, t-SNE, UMAP, and SCANPY. Silhouette index (SI) values were consistently highest for scDHA across datasets and platforms, with significance by one-sided Wilcoxon test (p = 1.7 × 10^-6). Examples include Kolodziejczyk (mouse ESCs) and Segerstolpe (human pancreas), where scDHA avoided unnecessary splitting and overcrowding seen with t-SNE/UMAP/SCANPY.
Classification across five human pancreas datasets (Baron, Segerstolpe, Muraro, Xin, Wang) in 20 train-test combinations: scDHA achieved accuracies from 0.88 to 1.00, with an average of 0.96, outperforming XGB (0.77), RF (0.69), DL (0.43), and GBM (0.72). Performance remained high even with large train-test size mismatches (e.g., training on small Wang, testing on large Baron: scDHA 0.93 accuracy). Accuracy differences were significant (one-sided Wilcoxon p = 2.1 × 10^-8). Average runtime was ~2 minutes per analysis.
Pseudo-time inference on embryo development datasets: scDHA reconstructed trajectories aligned with known stages and achieved higher R^2 values than Monocle, TSCAN, Slingshot, and SCANPY. For Yan: scDHA R^2 = 0.93 vs Monocle R^2 = 0.22; Goolam: 0.79 vs 0.60; Deng: 0.93 vs 0.84. Monocle failed to estimate pseudo-time for many 8-cell, 16-cell, and blast cells in Goolam and Deng (assigned infinity), while scDHA provided complete, stage-consistent orderings. Overall, scDHA had the highest R^2 in every analysis.
Discussion
The study addresses core challenges in scRNA-seq—noise, sparsity, and scale—by introducing scDHA, which couples denoising/feature selection via a non-negative kernel autoencoder with robust compression through a stacked Bayesian (modified VAE) architecture. By shrinking uninformative features and learning stable, low-dimensional representations through latent perturbations and careful training, scDHA enables accurate, scalable analyses across clustering, visualization, classification, and pseudo-time. Extensive benchmarking on 34 diverse datasets and multiple platforms shows significant gains in clustering accuracy (ARI), visualization quality (silhouette), classification accuracy, and trajectory fidelity (R^2), often with superior runtimes. Compared with related autoencoder frameworks (e.g., scVI), scDHA provides an end-to-end pipeline encompassing feature selection, embedding, and downstream tasks, and its hierarchical design with added denoising and multiple latent realizations improves robustness. The findings support scDHA as a practical, user-friendly tool that enhances the extraction of biological signal from large single-cell datasets and can serve as a foundation for a range of downstream analyses.
Conclusion
scDHA is a unified, hierarchical autoencoder framework for scRNA-seq that integrates non-negative denoising/feature selection with a stacked Bayesian autoencoder to produce robust low-dimensional embeddings. Across large, heterogeneous benchmarks, scDHA consistently outperforms state-of-the-art methods in unsupervised clustering, visualization, supervised classification, and pseudo-time inference, while maintaining favorable computational efficiency. The framework is packaged for end-to-end use without additional dependencies for core analyses and is broadly applicable to diverse high-throughput datasets beyond single-cell transcriptomics.
Limitations
The implemented pseudo-time procedure infers non-branching trajectories and requires a user-specified starting point, which may limit applications involving complex branching lineages. Default design choices (e.g., selecting ~5000 genes, fixed latent dimensionality) are user-adjustable but may influence results if not tuned for specific datasets. Comparative evaluations used default parameters for competing tools, which, while standardizing comparisons, might not reflect their optimal configurations.