Transformer for One-Stop Interpretable Cell Type Annotation

J. Chen, H. Xu, et al.

TOSICA, developed by Jiawei Chen and colleagues, is a Transformer-based model for cell type annotation in single-cell research. It aims for fast, accurate identification with improved interpretability, and it sheds light on rare cell types and their behavior, particularly in tumor-infiltrating immune cells and COVID-19 monocytes.

Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research, providing unprecedented resolution for studying cellular processes and diseases. A key step in scRNA-seq analysis is cell type annotation, which involves identifying and classifying different cell populations. Traditional annotation methods are often laborious and time-consuming, involving multiple steps such as preprocessing, dimensionality reduction, clustering, differential expression analysis, and manual annotation based on marker genes. Inconsistencies can arise due to manual annotation and variations in marker gene selection across different studies. This lack of standardization hinders reproducibility and comparability. The need for consistent and reproducible annotation transfer between reference and query datasets has become increasingly important. While AI-based tools can handle large datasets, many use deep architectures with complex non-linear transformations, making the learned features abstract and difficult to interpret. The resulting loss of information and feature resolution limits biological insight. In contrast, the Transformer framework, with its self-attention mechanism, retains traceability to input features, maintaining interpretability. This paper introduces TOSICA (Transformer for One-Stop Interpretable Cell-type Annotation), a novel AI-based tool for cell type label transfer, leveraging the advantages of the Transformer architecture to address the challenges of accuracy, speed, and interpretability in cell type annotation.

Literature Review

Existing AI-based methods for single-cell data analysis are powerful but often lack interpretability because of the complexity of their deep learning architectures. Many rely on autoencoders, which perform dimensionality reduction and feature extraction but obscure the relationship between input features (genes) and the learned representation, making it hard to understand the biological basis of the annotations. Earlier attempts to address this issue did not yield a robust, universally applicable solution. The need for a method that balances accuracy, speed, and interpretability prompted the development of TOSICA. Its self-attention mechanism maintains a clear connection between input features and the final annotation, unlike the complex transformations found in other deep learning models, which supports a deeper understanding of the biological mechanisms driving cellular behavior.

Methodology

TOSICA is a multi-head self-attention network designed for interpretable cell type annotation and simultaneous data integration. It comprises three main components: a Cell Embedding layer, a Multi-head Self-attention layer, and a Cell-Type Classifier. The Cell Embedding layer transforms gene expression data into tokens using a learnable transformation matrix that is masked according to prior biological knowledge (e.g., pathway membership). The mask ensures that each token captures information only from the genes within a given pathway. Multiple token vectors are generated in parallel, merged, and then concatenated with a class token (CLS), a trainable parameter that abstracts information across pathways.
This combined matrix is fed into the Multi-head Self-attention layer, where query (Q), key (K), and value (V) matrices are linearly projected from the input. The self-attention mechanism calculates attention scores (A) that represent the relationships between pathways. The attention scores between the CLS token and the pathway tokens reflect the importance of each pathway for cell type classification, and they are what downstream analyses of pathway importance use. This is similar to how Vision Transformers use attention to assess the importance of image regions for image classification. The output matrix (O) incorporates these interaction scores, and the CLS token in O, which summarizes each cell, is fed into the Cell-Type Classifier to predict the cell type probabilities for that cell.
The model is trained with supervised learning on labeled scRNA-seq data. It learns the mapping from gene expression to cell type and converts high-dimensional sparse expression data to a low-dimensional dense feature space.
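The forward pass described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head for brevity, random untrained weights, and a hypothetical toy mask of six genes in three pathways. All names (`embed_cell`, `cls_attention`, `Wq`, `Wk`, `Wv`, `Wc`) and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_pathways, d_model, n_types = 6, 3, 4, 2  # toy sizes (assumed)

# Hypothetical binary mask: mask[g, p] = 1 if gene g belongs to pathway p.
mask = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def embed_cell(expr, weights, mask):
    """Masked embedding: one d_model-dimensional token per pathway.

    weights has shape (d_model, n_genes, n_pathways). Each of the d_model
    slices is masked, so a pathway token only sees its member genes.
    """
    masked = weights * mask  # mask broadcasts over the d_model slices
    # tokens[p, k] = sum over genes g of expr[g] * masked[k, g, p]
    return np.einsum("g,kgp->pk", expr, masked)


def cls_attention(tokens, cls, Wq, Wk, Wv):
    """Single-head self-attention over [CLS] + pathway tokens."""
    x = np.vstack([cls, tokens])             # (1 + n_pathways, d_model)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    O = A @ V
    # Row 0 belongs to CLS; A[0, 1:] is the CLS-to-pathway attention.
    return O[0], A[0, 1:]


expr = rng.poisson(2.0, size=n_genes).astype(float)  # toy expression vector
weights = rng.normal(size=(d_model, n_genes, n_pathways))
cls = rng.normal(size=(1, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Wc = rng.normal(size=(d_model, n_types))             # linear classifier head

tokens = embed_cell(expr, weights, mask)
cls_out, pathway_attention = cls_attention(tokens, cls, Wq, Wk, Wv)
probs = softmax(cls_out @ Wc)                        # cell type probabilities
```

Here `pathway_attention` holds one score per pathway and excludes the CLS token's attention to itself, so it need not sum to one. A multi-head version would repeat the attention step with separate projections per head and combine the results.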
The model incorporates residual connections and additional fully connected layers to improve performance. Knowledge-based mask matrices, derived from gene set resources such as Reactome or the GSEA gene set collections, define the connections between genes and pathways and so steer the model toward biologically relevant information. Training uses cross-entropy loss, stochastic gradient descent (SGD), and cosine learning rate decay. The training and testing datasets are chosen to reflect diverse biological states and experimental conditions, with a portion of the training data reserved for validation. Performance is evaluated using accuracy, runtime, and other metrics, including those that assess batch effect removal and conservation of biological variation.
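The training recipe named above (cross-entropy loss, SGD, cosine learning rate decay) can be illustrated on a toy problem. The sketch below trains a plain softmax classifier on synthetic data rather than the TOSICA network. The base learning rate, step count, batch size, and the exact form of the schedule (no warm-up, decay to zero) are assumptions, since the summary does not specify them.

```python
import numpy as np


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    cos = 0.5 * (1.0 + np.cos(np.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # synthetic "cells" x "features"
y = np.argmax(X @ rng.normal(size=(5, 3)), axis=1)  # synthetic labels

W = np.zeros((5, 3))
total_steps, base_lr, batch = 300, 0.5, 32    # hypothetical hyperparameters
initial_loss = cross_entropy(softmax(X @ W), y)

for step in range(total_steps):
    idx = rng.choice(len(X), size=batch, replace=False)   # mini-batch
    probs = softmax(X[idx] @ W)
    # Gradient of cross-entropy w.r.t. W for a softmax classifier.
    grad = X[idx].T @ (probs - np.eye(3)[y[idx]]) / batch
    W -= cosine_lr(step, total_steps, base_lr) * grad     # SGD update

final_loss = cross_entropy(softmax(X @ W), y)
```

The learning rate starts at `base_lr` and shrinks smoothly toward `min_lr`, so early steps move quickly and late steps make small refinements.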

Key Findings

TOSICA outperforms 18 other cell type annotation methods in accuracy and efficiency across six diverse datasets (human artery, bone, pancreas; mouse brain, pancreas, atlas). Its mean accuracy of 86.69% is the highest among the compared methods. Even on challenging datasets with uneven cell type distributions, TOSICA consistently ranks among the top performers. Its runtime remains relatively stable as dataset size increases, unlike that of many other methods. Experiments with different masks (including random masks) show that TOSICA is robust to the choice of prior biological knowledge, while expert knowledge accelerates convergence to optimal performance. An uneven distribution of cell types between reference and query sets reduces annotation accuracy, as shown by the Pearson correlation coefficient (PCC) of -0.9 between accuracy and Kullback-Leibler divergence. TOSICA handles rare cell types well, including cell types absent from training. When a high-percentage cell type is removed from the reference set, TOSICA labels most of the corresponding cells in the query set as 'Unknown', unlike other methods. TOSICA can also detect previously unknown cell types, which reflects its high resolution. In an analysis of mouse pancreas data, TOSICA distinguishes mature from proliferative acinar cells by identifying distinguishing regulons (MIR-6382 and MIR-29B-3P) and associated genes (e.g., Sparc), providing insights into cellular development. TOSICA also enables interpretable dynamic trajectory analysis that reveals key pathways in biological processes. For example, in an osteoarthritis study, TOSICA's regulon attention-based trajectories highlight the role of the NF1 and CEBP regulons in chondrocyte differentiation.
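The imbalance measure mentioned above, the Kullback-Leibler divergence between reference and query cell type distributions, can be computed as in the sketch below. The label lists are invented toy data, the direction of the divergence (query relative to reference) and the small smoothing constant `eps` are assumptions, and the `pearson` helper only shows how such a correlation would be calculated. It does not reproduce the reported value.

```python
import numpy as np
from collections import Counter


def type_proportions(labels, categories, eps=1e-9):
    """Cell type proportions, smoothed so absent types do not give log(0)."""
    counts = Counter(labels)
    p = np.array([counts.get(c, 0) for c in categories], dtype=float) + eps
    return p / p.sum()


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same categories."""
    return float(np.sum(p * np.log(p / q)))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


categories = ["alpha", "beta", "delta"]                  # toy cell types
reference = ["alpha"] * 50 + ["beta"] * 30 + ["delta"] * 20
query_balanced = ["alpha"] * 5 + ["beta"] * 3 + ["delta"] * 2
query_skewed = ["alpha"] * 1 + ["beta"] * 1 + ["delta"] * 8

ref_p = type_proportions(reference, categories)
kl_balanced = kl_divergence(type_proportions(query_balanced, categories), ref_p)
kl_skewed = kl_divergence(type_proportions(query_skewed, categories), ref_p)
```

A query set with the same composition as the reference gives a divergence near zero, while the skewed query gives a clearly positive value. Given per-dataset accuracies and divergences, `pearson(accuracies, divergences)` would yield the kind of coefficient quoted in the text.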
Furthermore, TOSICA is insensitive to batch effects, consistently predicting cell types accurately even when reference and query datasets originate from different batches and experimental conditions. This is a notable advantage over other methods that require batch information for data integration. Benchmarking using scIB confirms TOSICA's effectiveness in batch effect removal and biological conservation. TOSICA's interpretability is hierarchical, providing insights at both pathway and gene levels, unlike other gene-based methods. Analysis of tumor-infiltrating myeloid and T cells demonstrates TOSICA's ability to reveal cell type heterogeneity and dynamic trajectories, providing functional and biological insights into the tumor microenvironment. TOSICA identifies pathways associated with cancer type and disease progression (e.g., FGFR signaling in ESCA). Finally, in a COVID-19 and SLE analysis, TOSICA identifies novel monocyte subtypes and key transcription factors associated with disease progression and response to interferon treatment, demonstrating its large-scale applicability.
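The 'Unknown' labeling described in this section can be illustrated with a simple confidence threshold on the classifier output. The summary does not state the rule TOSICA actually uses, so the thresholding approach, the `cutoff` value, and the function name `assign_labels` are all assumptions made for illustration.

```python
import numpy as np


def assign_labels(probs, type_names, cutoff=0.5, unknown="Unknown"):
    """Pick the most probable type per cell, or `unknown` if confidence is low.

    probs: array-like of shape (n_cells, n_types) with rows summing to 1.
    cutoff: hypothetical minimum probability required to accept a label.
    """
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= cutoff
    return [type_names[i] if ok else unknown for i, ok in zip(best, confident)]


# Toy probabilities for three cells over three cell types.
probs = [
    [0.90, 0.05, 0.05],   # confidently the first type
    [0.40, 0.35, 0.25],   # no type reaches the cutoff
    [0.10, 0.80, 0.10],   # confidently the second type
]
labels = assign_labels(probs, ["acinar", "ductal", "beta"])
```

With these inputs the second cell falls below the cutoff and is labeled 'Unknown', which is the behavior one would want for cells whose true type is missing from the reference set.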

Discussion

TOSICA advances single-cell analysis by combining high accuracy, computational efficiency, and strong interpretability. The use of the Transformer architecture with biologically informed masking allows the identification of biologically relevant patterns while mitigating the effects of noise and batch variation. TOSICA's ability to handle diverse datasets, reveal rare cell types, and discover dynamic trajectories provides valuable insights into cellular behavior in health and disease. The hierarchical nature of its interpretability adds to the tool's utility by providing biological insights at multiple levels of granularity. Its batch insensitivity is particularly useful for integrating data from diverse sources, which supports collaborative research and large-scale data analysis. The success of TOSICA highlights the potential of Transformer architectures for complex biological data analysis and points toward more interpretable, biologically relevant findings.

Conclusion

TOSICA is a robust and versatile tool for single-cell data analysis, providing accurate, high-resolution, and interpretable cell type annotation. Its superior performance across various datasets and tasks, coupled with its batch insensitivity and hierarchical interpretability, makes it a valuable asset for single-cell research. Future research could explore the application of TOSICA to other omics datasets and refine the masking strategies for improved performance. The availability of TOSICA as an open-source tool will further contribute to the advancement of single-cell genomics research.

Limitations

While TOSICA demonstrates significant improvements over existing methods, certain limitations should be considered. The accuracy of the annotation depends on the quality and quantity of the reference dataset. The performance might be affected by the choice of the mask matrix, requiring careful selection based on the specific biological context. Although TOSICA shows robustness to batch effects, extreme batch effects might still pose challenges. Further validation on a wider range of datasets and biological systems is necessary to assess the generalizability of the findings.
