HIDDEN: a machine learning method for detection of disease-relevant populations in case-control single-cell transcriptomics data

A. Goeva, M. Dolan, et al.

HIDDEN is a computational method developed by Aleksandrina Goeva and colleagues at the Broad Institute for case-control single-cell RNA-seq studies. By identifying which cells in a case sample are actually affected, it improves detection of malignant cells in early-stage samples and reveals transcriptional signatures of blood-brain barrier dysfunction.

Introduction

High-dimensional single-cell transcriptional profiling enables characterization of cellular responses to perturbations such as disease, drugs, or gene knockouts. Existing computational strategies (differential expression tests, differential abundance tests, and condition-agnostic approaches) often assume that sample-level condition labels accurately reflect cell-level perturbation status, that the perturbation signal dominates the latent space, or that confounders are known and removable. In practice, perturbation effects may be subtle relative to biological heterogeneity and technical noise, and only a subset of case cells may be affected, making sample-level labels inaccurate at the single-cell level. Standard label-agnostic workflows (dimensionality reduction and clustering, followed by within-cluster comparisons) can thus fail to isolate affected cells and their markers, particularly under class imbalance or weak effects. To address these challenges, the authors developed HIDDEN, a statistical framework that refines sample-level labels into per-cell perturbation scores and binary affected/unaffected labels, thereby enhancing detection of perturbation-associated signals and marker genes.

Literature Review

The paper situates HIDDEN among recent methods for perturbation analysis in single-cell data. CNA provides cluster-free detection of regions with uneven mixing of condition labels; MELD estimates a continuous perturbation gradient by diffusing labels across a graph; Milo tests differential abundance in overlapping neighborhoods of a k-nearest-neighbor graph; Mixscape separates successfully perturbed cells from unsuccessfully perturbed cells in pooled CRISPR screens after removing known confounders. These approaches rely on assumptions that may fail when cell-level labels are inaccurate, perturbation signals are not dominant in latent space, or confounders are unknown. The authors argue that these limitations motivate a method that refines labels to capture true per-cell perturbation status and can, in turn, improve downstream tools that depend on reliable cell-level labels.

Methodology

HIDDEN consists of three main components: (1) latent feature construction, (2) cell-level continuous perturbation scoring via supervised prediction using sample-level labels, and (3) binarization of case-cell scores to refine affected/unaffected labels.

- Latent space: From the gene expression matrix X (N cells by M genes), compute a reduced representation Z (N by K) using principal component analysis (PCA). Alternative dimensionality reduction methods (e.g., NMF, autoencoders via scVI) are supported; empirical comparisons showed PCA performed as well or better with greater efficiency.
- Continuous perturbation score: Train a binary classifier h on Z and sample-level labels Y (0 control, 1 case) to estimate p = P(Y=1|Z). The implementation uses logistic regression to balance simplicity, interpretability, and computational efficiency, though other classifiers are possible.
- Refined binary labels: For cells from the case condition (Y=1), cluster their continuous scores p into two groups (k-means with k=2 by default; Gaussian mixture models were also evaluated). Assign Ŷ=1 to the higher-score group (affected) and Ŷ=0 to the lower-score group (unaffected). The method outputs continuous scores and refined labels for downstream analyses.
- Selecting K (number of latent dimensions): Two data-driven heuristics are provided: (a) choose K maximizing the number of DE genes defined by HIDDEN-refined labels; (b) choose the smallest K that maximizes the two-sample Kolmogorov–Smirnov statistic comparing p-distributions of predicted affected vs unaffected case cells. Both heuristics performed robustly across datasets.
- Simulations: Semi-simulated ground-truth datasets were created from human PBMC Naive B and Memory B cells. The case sample contained a controlled fraction of perturbed cells (true Memory B or Memory-Naive hybrid profiles of varying perturbation strength), while controls contained only Naive B cells. Standard workflows (PCA/graph clustering) were evaluated across hyperparameters; HIDDEN was trained with scanpy preprocessing, PCA, logistic regression (sklearn), and k-means or GMM binarization. Performance metrics included AUROC for continuous scores and recall/precision/F1 for DE marker recovery.
- Comparisons: HIDDEN was compared to CNA, MELD, Milo, and Mixscape by deriving both continuous perturbation scores and binarized labels, then assessing agreement with ground truth (AUROC, F1). The authors also evaluated these methods when supplied with HIDDEN-refined labels as inputs to test whether improved labels enhance their performance.
- Human plasma cell dataset: Single-cell RNA-seq of plasma cells from NBM (healthy), MGUS, SMM, and MM across patients, with strong batch effects. Two training strategies were used: batch-sensitive (train on all NBM, one precursor sample, and all MM at a time) and batch-agnostic (train on all samples together). Logistic regression was fit on PCA of all genes; K was selected by maximizing DE output. HIDDEN scores and labels were compared against manual cell-level malignancy annotations and Bayesian tumor purity estimates (using a Beta–Binomial framework to assess the significance of differences).
- Mouse demyelination dataset: snRNA-seq of endothelial cells from PBS control vs LPC-treated mice at 3, 7, 12, and 18 dpi. After standard preprocessing and PCA, logistic regression was trained on PBS/LPC labels; K=5 was selected by the DE-maximization heuristic. At 3 dpi, bimodal case scores were split into LPC1 (affected) and LPC0 (unaffected) via k-means. Differential expression (Wilcoxon, adjusted p<0.05) identified markers unique to HIDDEN-derived labels. Functional interpretation used g:Profiler and ReviGO; ligand–receptor changes were analyzed with a targeted hypothesis testing framework.
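The three core steps described above (PCA, logistic regression on sample-level labels, and k-means on case-cell scores) can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function name `hidden_refine`, the synthetic data, the choice of K=5, and all hyperparameters are hypothetical, and real data would first go through the usual scanpy-style normalization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def hidden_refine(X, y, n_components, seed=0):
    """Hypothetical helper: return (scores, refined_labels) for cells in X.

    X: (N cells, M genes) preprocessed expression matrix.
    y: (N,) integer sample-level labels, 0 = control, 1 = case.
    """
    # (1) Latent space: reduce X to Z with K principal components.
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X)

    # (2) Continuous score: estimate p = P(Y=1 | Z) with logistic regression
    #     trained on the (possibly inaccurate) sample-level labels.
    clf = LogisticRegression(max_iter=1000).fit(Z, y)
    scores = clf.predict_proba(Z)[:, 1]

    # (3) Refined labels: split case-cell scores into two groups with k-means
    #     and call the higher-scoring group "affected". Controls stay 0.
    case = y == 1
    km = KMeans(n_clusters=2, n_init=10, random_state=seed)
    groups = km.fit_predict(scores[case].reshape(-1, 1))
    high = int(np.argmax(km.cluster_centers_.ravel()))
    refined = np.zeros_like(y)
    refined[case] = (groups == high).astype(y.dtype)
    return scores, refined


# Synthetic demo (assumed data): 200 control cells, 200 case cells,
# of which only the last 60 carry a shift in the first 5 of 20 "genes".
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]
truth = np.zeros(400, dtype=int)
truth[340:] = 1
X[340:, :5] += 3.0

scores, refined = hidden_refine(X, y, n_components=5)

# AUROC of the continuous scores against ground truth, within case cells only.
auc = roc_auc_score(truth[y == 1], scores[y == 1])
print(f"AUROC among case cells: {auc:.3f}")
print(f"cells labeled affected: {int(refined.sum())}")
```

In this sketch the classifier never sees the ground-truth labels; it only learns to tell case from control, and the unaffected case cells end up with scores close to those of controls because they are transcriptionally indistinguishable from them.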

Key Findings

- Simulations of Naive B/Memory B mixtures:
  - Standard clustering failed to isolate perturbed (Memory B) cells when perturbed fractions were small (e.g., 5%) or perturbation strength was mild; clusters had mixed identities despite attempts to tune PCs, resolution, or feature sets.
  - HIDDEN produced accurate continuous scores separating perturbed from unperturbed cells even when only 5% of case cells were slightly perturbed, with high AUROC across perturbation strengths.
  - With fewer than 20% perturbed cells, DE using sample-level labels retrieved few to no ground-truth markers; HIDDEN-refined labels substantially improved marker recovery (higher recall, F1). In some cases, HIDDEN identified additional genes consistent with true markers beyond ground-truth annotations, suggesting corrective power.
  - Across ground-truth datasets, HIDDEN's continuous scores and binarized labels outperformed CNA, MELD, Milo, and Mixscape for recovering perturbation labels; providing HIDDEN-refined labels as inputs improved the performance of these other methods.
  - Model robustness: The single parameter K was not especially influential; two heuristics for selecting K yielded stable, accurate results.
- Human bone marrow plasma cells (NBM, MGUS, SMM, MM):
  - In mixed precursor samples (3 MGUS, 5 SMM), HIDDEN's continuous scores agreed strongly with manual malignant/healthy annotations (high AUROC across samples).
  - Sample purity estimates from HIDDEN-refined labels closely matched Bayesian point estimates and, in most cases, better aligned than manual estimates (significance via Beta–Binomial tests with Bonferroni adjustment).
  - In three low-purity MGUS samples where manual annotation found no neoplastic cells, HIDDEN identified malignant cells missed previously; validation showed strong differential enrichment of known normal vs malignant plasma cell signatures between HIDDEN-defined groups.
  - Differential expression across precursor samples using HIDDEN labels identified 8,208 genes, with 2,400 overlapping known NBM-vs-MM malignancy markers (hypergeometric p=3.066e-31) and 5,808 uniquely found using HIDDEN in precursor states, highlighting early disease markers.
- Mouse demyelination model (endothelial cells):
  - Standard workflows showed modest effects in ECs and mixed PBS/LPC identities; no perturbation-enriched cluster was found.
  - HIDDEN revealed a bimodal distribution of case scores at 3 dpi and defined LPC1 (affected) vs LPC0 (unaffected) EC subpopulations.
  - LPC1 exhibited 28 unique marker genes; in situ hybridization confirmed lesion-specific expression (e.g., Lgals1, S100a6) at 3 dpi.
  - Gene set enrichment indicated processes consistent with early demyelination: increased angiogenesis, extracellular matrix remodeling, and blood–brain barrier breakdown. Ligand–receptor analysis showed decreased anti-angiogenic interactions (e.g., Flt1 with Vegfa/Vegfb), increased collagen–integrin interactions, reduced tight-junction-supporting interactions (Jam2/Jam3 with integrins), and increased Vcam1 expression, implicating immune cell recruitment.
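The overlap between HIDDEN-derived DE genes and known malignancy markers is assessed with a hypergeometric test. The sketch below shows how such a one-sided test is typically computed with `scipy.stats.hypergeom`. The counts 8,208 and 2,400 come from the text; the gene-universe size and the size of the known marker set are not stated there, so the values used here are placeholders and the printed p-value is not expected to reproduce the reported one.

```python
from scipy.stats import hypergeom


def overlap_pvalue(n_overlap, n_universe, n_known, n_found):
    """P(overlap >= n_overlap) when n_found genes are drawn at random from
    a universe of n_universe genes that contains n_known known markers."""
    # sf(k - 1) equals P(X >= k) for a discrete distribution.
    return hypergeom.sf(n_overlap - 1, n_universe, n_known, n_found)


# Counts reported in the text.
n_found = 8208    # DE genes identified with HIDDEN-refined labels
n_overlap = 2400  # of those, genes that are also known NBM-vs-MM markers

# HYPOTHETICAL values, not given in the text; replace with the real ones.
n_universe = 20000  # assumed number of genes tested
n_known = 5000      # assumed size of the known marker set

p_value = overlap_pvalue(n_overlap, n_universe, n_known, n_found)
print(f"one-sided hypergeometric p-value: {p_value:.3g}")
```

The p-value depends strongly on the assumed universe and marker-set sizes, so these two inputs need to match the genes actually tested before the result is interpreted.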

Discussion

The study addresses the challenge that case-control labels in single-cell studies often misrepresent cell-level perturbation status, especially when perturbations are subtle or only affect a minority of cells. HIDDEN leverages sample-level labels and gene expression features to compute per-cell perturbation scores and refined labels, thereby improving detection of affected subpopulations and their markers. In simulations, HIDDEN consistently identified subtle effects and recovered ground-truth markers where standard workflows failed, outperforming CNA, MELD, Milo, and Mixscape on the target task. In real datasets, HIDDEN matched or exceeded manual annotation for malignancy detection and purity estimation in plasma cells and uncovered previously missed malignant cells in early MGUS. In a demyelination model, HIDDEN uncovered a temporally specific EC subpopulation (LPC1) with lesion-validated markers and pathway changes consistent with angiogenesis and BBB dysfunction, offering new biological insights. The method scales efficiently (simple PCA + logistic regression), tolerates batch effects without complex alignment, and can enhance other perturbation-focused methods by supplying improved per-cell labels. These findings demonstrate that refining cell-level perturbation labels is crucial to harnessing the full potential of single-cell datasets to detect subtle, heterogeneous biological effects.

Conclusion

HIDDEN provides a robust, scalable framework to refine case-control labels into accurate per-cell perturbation scores and binary affected/unaffected labels, boosting sensitivity to subtle transcriptional changes and enabling discovery of affected subpopulations and markers missed by standard workflows. It outperforms related methods on the label-refinement task and can improve their performance when used in combination. Applications to human plasma cell malignancy and a mouse demyelination model highlight its ability to recapitulate expert annotations, discover early disease signatures, and reveal functionally relevant cell subsets. Future extensions include adapting the framework to spatial and multi-omics data, modeling multi-stage or continuous perturbations, and focusing latent spaces on specific distinctions such as genotype effects (eQTLs) or sexual dimorphism.

Limitations

- Cell-type specificity: Because perturbation effects likely differ by cell type, HIDDEN must be applied one cell type at a time.
- Within-affected heterogeneity: Identifying an affected subpopulation does not imply a homogeneous response; further analyses are needed to resolve potentially multiple perturbation responses among affected cells.
- Binary vs continuous effects: The current framework does not provide a statistical test to distinguish binary from continuous perturbation effects; researchers must interpret score distributions to decide on binarization or alternative modeling.
- Dependence on latent space construction: While PCA performed well, the approach assumes the latent space retains relevant perturbation signal; poor embeddings or extreme batch effects could reduce performance, though heuristics for K selection mitigate overfitting.
