Biology
Transformer for One-Stop Interpretable Cell Type Annotation
J. Chen, H. Xu, et al.
The study addresses the need for consistent, reproducible transfer of cell type annotations from reference to query single-cell RNA-seq datasets. Traditional workflows require multiple steps and manual curation, leading to variability in labels across studies, especially for subtypes defined by limited marker genes. As datasets are generated across batches and time, there is a growing need for automated, scalable annotation transfer methods that maintain a consistent standard. Existing deep learning tools, particularly autoencoders, can handle large datasets but often sacrifice interpretability due to non-linear feature aggregation and dimensionality reduction that obscure links to input genes and technical factors. The Transformer framework, which avoids forced dimensionality reduction and computes attention between tokens, allows tracing attention back to original features, enabling interpretability. The authors propose TOSICA (Transformer for One-Stop Interpretable Cell-type Annotation) to provide accurate, interpretable, and batch-insensitive annotation that maps gene expression to biologically meaningful tokens (pathways/regulons).
The paper reviews unsupervised clustering and manual annotation workflows in scRNA-seq, noting their time cost and inconsistency across studies. It highlights AI-based annotators and autoencoder-based methods that, while powerful, learn abstract latent spaces through deep non-linear aggregation, which reduces interpretability and traceability to genes and technical factors. The authors point to Transformer architectures as advantageous because they do not enforce dimensionality reduction and their attention mechanisms remain traceable, thus supporting interpretability. They benchmark TOSICA against 18 existing annotators (e.g., Seurat, SingleCellNet, SingleR, SciBet, ACTINN, CELLBLAST, CHETAH) and integration methods (e.g., Seurat, scGen) using standardized metrics (accuracy, runtime, scIB batch removal and biological conservation), situating their approach within the landscape of existing tools.
TOSICA is a supervised Transformer-based architecture with three components: (1) Cell Embedding, (2) Multi-head Self-Attention, and (3) a Cell-Type Classifier. Gene expression for each cell (n genes) is linearly projected into k tokens representing biological entities (e.g., pathways/regulons) using a learnable weight matrix W that is element-wise multiplied by a binary mask M derived from expert knowledge (e.g., GSEA gene sets). Only connections between a gene and the gene sets it belongs to are retained, so each token aggregates information from one biologically defined gene set. This embedding is replicated m times in parallel (default m=48) and concatenated to form a token matrix T (k×m). A learnable class token (CLS) is prepended to T to form the input matrix I. In the multi-head self-attention layer, Q, K, and V are computed via linear projections of I. Attention scores A=softmax(QK^T/√d_k) quantify relationships among tokens; critically, the attention between CLS and each pathway token indicates the importance of that pathway to classification. Attention is computed H times with separate projections (W^Q, W^K, W^V), and the H outputs are concatenated and linearly transformed. The CLS output vector feeds a fully connected classifier followed by softmax to produce a probability distribution over cell types. Residual connections and additional fully connected layers are included to enhance learning and mitigate overfitting. Training uses a supervised cross-entropy loss with an SGD optimizer and cosine learning rate decay; convergence typically occurs within ~20 epochs. Masks: Knowledge-based masks are constructed from GSEA gene sets (e.g., Reactome c2.cp.reactome.v7.5.1, c3 regulons), with optional limits on gene set size and number (default 300 each). Random masks with sparse connectivity (e.g., 1% or 5% connections) are also evaluated to test robustness.
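The masked embedding and the CLS-to-pathway attention described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy sizes, the random weights, the single attention head, and the helper names `embed` and `cls_attention` are all assumptions made for illustration, and the residual connections, classifier, and training loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed); the text's default embedding depth is m = 48.
n_genes, k_tokens, m_dim = 6, 3, 4

# Binary mask M (n_genes x k_tokens): 1 where a gene belongs to a gene set.
M = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def embed(x, W, M):
    """Masked linear projection repeated m times; returns T of shape (k, m)."""
    # W has shape (m, n_genes, k_tokens); W * M zeroes non-member connections.
    return np.einsum("g,mgk->km", x, W * M)


def cls_attention(T, cls, Wq, Wk, Wv):
    """Single-head self-attention over [CLS; T]; returns (output, A)."""
    I = np.vstack([cls, T])                      # input matrix, (k + 1, m)
    Q, K, V = I @ Wq, I @ Wk, I @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # softmax(QK^T / sqrt(d_k))
    return A @ V, A


x = rng.random(n_genes)                          # one cell's expression vector
W = rng.normal(size=(m_dim, n_genes, k_tokens))  # stand-in for learned weights
cls = rng.normal(size=(1, m_dim))                # stand-in for the CLS token
Wq, Wk, Wv = (rng.normal(size=(m_dim, m_dim)) for _ in range(3))

T = embed(x, W, M)
out, A = cls_attention(T, cls, Wq, Wk, Wv)

# Row 0 of A is the CLS query; columns 1..k are the pathway tokens.
pathway_importance = A[0, 1:]

# Gene importance per token: mean absolute masked weight across the m copies.
gene_importance = np.abs(W * M).mean(axis=0)     # (n_genes, k_tokens)
```

Because the mask is applied before the projection, a gene outside a gene set cannot influence that set's token, which is what makes the attention scores traceable back to named pathways and then to their member genes.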
Data splitting: Training/validation/test splits use different studies, platforms, subjects, time points, or disease states to simulate real-world reference-to-query transfer. Evaluation: Accuracy is defined as the fraction of cells whose predicted label is correct. Additional dataset characteristics are quantified: log size, number of types, Shannon entropy of the type distribution, and KL divergence (D_KL) between the reference and query type distributions. Integration benchmarking employs scIB to assess batch effect removal (e.g., batch ASW, kBET, graph connectivity) and biological conservation (e.g., NMI, ARI, ASW, isolated label F1). Attention embeddings are normalized (library size to 10,000) and PCA-transformed; neighbor graphs are then built and UMAP/diffusion maps computed for visualization and trajectory inference. Signature attentions are identified by Wilcoxon tests with BH correction, and subclusters by Louvain clustering (resolution=0.3). Gene importance to tokens comes from the average absolute weights in the masked linear layer. Trajectories are inferred via diffusion maps and PAGA, with dynamic attentions identified using generalized additive models (|coef|>0.5, FDR<0.01).
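The dataset-level quantities named above (accuracy, Shannon entropy of the type distribution, and D_KL between reference and query) are simple to compute. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, natural logarithms are used, the divergence is taken as D_KL(reference || query), and query proportions are clipped at a small `eps` so that types absent from the query do not yield an infinite value. The summary does not specify the direction or the log base used in the paper.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def type_distribution(labels, classes):
    """Proportion of cells in each class, in the order given by `classes`."""
    labels = np.asarray(labels)
    counts = np.array([(labels == c).sum() for c in classes], dtype=float)
    return counts / counts.sum()


def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats; q is clipped at eps (an assumption) to stay finite."""
    q = np.clip(q, eps, None)
    keep = p > 0
    return float((p[keep] * np.log(p[keep] / q[keep])).sum())


# Toy labels, purely illustrative.
classes = ["alpha", "beta"]
reference = ["alpha", "alpha", "beta", "beta"]
query = ["alpha", "alpha", "alpha", "beta"]

p_ref = type_distribution(reference, classes)
p_query = type_distribution(query, classes)

entropy_ref = shannon_entropy(p_ref)
d_kl = kl_divergence(p_ref, p_query)
acc = accuracy(query, ["alpha", "alpha", "beta", "beta"])
```

A divergence of zero means the reference and query have identical type proportions; larger values indicate the kind of mismatch that the benchmark associates with lower accuracy.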
- Across six benchmark datasets (human artery, human bone, human pancreas, mouse brain, mouse pancreas, mouse atlas), TOSICA ranks within the top 6 on each dataset and achieves the highest mean accuracy among 19 methods tested: 86.69%. On easy datasets (hArtery, hPancreas), TOSICA attains 93.75% and 95.76% accuracy, close to top methods (e.g., Seurat 96.37% on hArtery; SingleCellNet 97.53% on hPancreas). On challenging datasets (hBone, mPancreas, mAtlas), TOSICA ranks top-2; on mAtlas (largest, most cell types), TOSICA achieves 81.06% vs. 79.57% for ACTINN, with stable runtime (fourth shortest) scaling to large data.
- Dataset characteristic analysis shows annotation accuracy is most negatively impacted by distribution mismatch between reference and query cell types: Pearson correlation between accuracy and D_KL ≈ -0.9. Despite this, TOSICA outperforms SingleR and SciBet on five highly unbalanced hBone cell types (e.g., 76.47% vs. 63.23%/68.18% mean accuracy on selected classes).
- Mask robustness: Random sparse masks (1% or 5% connectivity) can yield similar accuracies to knowledge-based masks but require more epochs to converge and, in some cases (e.g., mPancreas), converge to slightly lower accuracy.
- New cell type discovery: When a major class (alpha cells) is removed from the hPancreas reference, TOSICA clusters the alpha cells in the query together and labels 76% of them as 'Unknown' (cells whose maximum predicted probability is below the 0.95 cutoff), assigning the remainder to a related endocrine class (PP). Other high-accuracy annotators (SingleR, SciBet, ACTINN) mislabel alpha cells as existing known types; CELLBLAST and CHETAH only partially detect them as new.
- High-resolution annotation and interpretability: In mPancreas, TOSICA’s regulon attention distinguishes mature vs. proliferative acinar cells and identifies an intermediate state (MP) closer to proliferative acinar cells, driven by MIR-29B-3P regulon attention and implicating Sparc among key genes. Hierarchical clustering and PCA support these transitions.
- Dynamic trajectories: Attention-based diffusion maps recapitulate OA-associated chondrocyte transitions and reveal regulatory shifts (NF1 to CEBP dominance) associated with OA onset.
- Batch insensitivity and integration: Without using batch labels, TOSICA generates batch-insensitive embeddings and high annotation accuracy. In scIB benchmarks across multiple datasets, TOSICA ranks in the top tier for batch removal (batch ASW) and biological conservation (NMI), excelling on large, multi-batch datasets where some methods (scGen, Seurat variants) fail to run.
- Pan-cancer myeloid cells: TOSICA ranks second overall (scIB combined score) among 11 methods and uncovers pathway-level distinctions among cDC subsets (e.g., NOD1/2 signaling, Toll receptor cascades) and trajectories (origins of cDC3_LAMP3), as well as cancer-type-specific states in LYVE1+ RTMs (e.g., ESCA enriched cytokine and insulin signaling). It finds associations with disease stage and aging: FGFR signaling increases with ESCA stage in LYVE1+ RTMs (RCC≈0.29, p≈2.28e-24), while in CD14+ monocytes innate immune system activity declines with age (RCC≈-0.26, p≈2.68e-177) and interferon signaling slightly increases (RCC≈0.14, p≈2.0e-47). It also identifies novel monocyte subtypes with distinct tissue distributions.
- Pan-cancer T cells: TOSICA ranks second overall among 10 methods with the shortest runtime (minutes; scGen ~5 days). Attention-based embeddings reveal CD4+ T-cell differentiation paths and that GZMK+ Tex cells act as common endpoints along two CD8+ T-cell trajectories, revising prior assumptions.
- COVID-19 atlas: Using healthy PBMCs as reference and COVID-19 patients as query (~1.41M cells), TOSICA identifies de novo cell types (e.g., DC_LAMP3, epithelial, mast) and ranks first among 13 methods by scIB combined score. In monocytes, it resolves 7 subtypes (1 CD16+, 6 CD14+); subtype C3 decreases and C4 increases from healthy to moderate to severe COVID-19. TF regulon attentions (e.g., AP2_Q6, FOXO4_01 down; AP4_01, MIR3617_5P, NFKB_Q6, ATF3_Q6 up) track disease progression with target gene expression corroboration.
- Transfer to SLE: A COVID-trained model maps SLE PBMCs and captures IFN-β-induced state shifts, including upregulated SREBP activity and downregulated FOXO1/3, consistent with known interferon-driven lipogenesis.
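The 'Unknown' labelling rule from the new-cell-type-discovery result above (maximum predicted probability below 0.95) can be sketched as a post-processing step on the classifier's softmax output. The function name `annotate_with_unknown` and the toy probabilities are assumptions for illustration only, not part of the published toolkit.

```python
import numpy as np


def annotate_with_unknown(probs, class_names, cutoff=0.95):
    """Assign the arg-max class, or 'Unknown' when max probability < cutoff."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= cutoff
    return [class_names[i] if ok else "Unknown"
            for i, ok in zip(best, confident)]


# Toy softmax outputs for three cells over three reference classes.
probs = [
    [0.98, 0.01, 0.01],   # confident: keeps its predicted label
    [0.60, 0.35, 0.05],   # below the cutoff: flagged as Unknown
    [0.02, 0.96, 0.02],   # confident: keeps its predicted label
]
labels = annotate_with_unknown(probs, ["beta", "PP", "delta"])
```

A high cutoff trades some recall on known types for the ability to flag cells that resemble no reference class, which is why a fraction of removed alpha cells can still be assigned to a related class such as PP.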
TOSICA directly addresses the challenge of consistent, scalable, and interpretable cell type annotation across heterogeneous single-cell datasets. By masking gene-to-token projections with biologically grounded gene sets and leveraging multi-head self-attention, TOSICA focuses on meaningful pathway/regulon interactions, improving accuracy and robustness while enabling traceable interpretability from pathways to genes. The method maintains high performance under batch heterogeneity without explicit batch labels, supports discovery of unseen or rare cell types, and elucidates dynamic trajectories with regulatory insights. Extensive benchmarks against numerous state-of-the-art annotators and integrators demonstrate superior or top-tier accuracy, integration quality, computational efficiency, and scalability, particularly on large multi-batch datasets. Case studies in cancer immunology and infectious/autoimmune disease further show that TOSICA uncovers biologically coherent pathway activities and state transitions that align with or extend prior knowledge, illustrating its utility for hypothesis generation and cross-study annotation transfer.
The study introduces TOSICA, an interpretable Transformer-based framework for one-stop cell type annotation and data integration in single-cell RNA-seq. It achieves state-of-the-art mean accuracy across diverse datasets, robustly handles batch heterogeneity without batch labels, discovers novel and rare cell types, resolves fine-grained subtypes, and reveals regulatory trajectories. Its hierarchical interpretability links pathway/regulon-level attentions to contributing genes, facilitating biological insight and validation. The open-source toolkit and demonstrated scalability position TOSICA as a practical solution for large-scale, reproducible annotation transfer and integrative single-cell analyses.
- Performance depends on the distribution alignment between reference and query sets; large divergence (high D_KL) negatively impacts accuracy across methods, though TOSICA remains comparatively strong.
- While robust to mask choice, random masks typically require more training epochs and may converge to slightly lower accuracy on some datasets (e.g., mPancreas) compared to knowledge-based masks.
- On some easier datasets, TOSICA is slightly below the top method despite high absolute accuracy (e.g., hArtery, hPancreas).
- The approach relies on curated gene set resources for optimal interpretability and convergence; the choice and quality of gene sets can influence results.