Introduction
The study of somatic variation, the low-abundance mutations that accumulate throughout an organism's life, has gained significant attention with the advent of high-throughput single-cell sequencing. This type of variation, often understudied, reveals cell-to-cell heterogeneity and provides insights into normal development and disease progression, particularly in cancer. Estimates of the rate of somatic single nucleotide variants (SNVs) per genome position per cell division vary widely, highlighting the complexity of this genomic mosaic. Accurately profiling genetic variation in single cells is crucial for lineage tracing in development and for understanding the evolutionary dynamics of cancer. Single-cell sequencing requires whole genome amplification (WGA) methods, such as multiple displacement amplification (MDA), to achieve sufficient genome coverage for SNV identification. However, MDA introduces amplification errors and allelic bias, violating the assumptions of variant callers designed for bulk sequencing data. Existing single-cell variant callers often use global rates to model uneven allelic coverage and amplification errors, neglecting site-specific and cell-specific variation. This limitation motivates the development of a more accurate and robust variant caller that accounts for these biases.
Literature Review
Several state-of-the-art single-cell SNV callers exist, but each has limitations. MonoVar and SCcaller employ global false-positive error rates, ignoring the local sequence context dependency of the ϕ29 polymerase used in MDA. SCIPhI assumes a global rate for allele dropout, failing to account for its variability across the genome and between cells. While SCcaller and SCAN-SNV model allelic amplification bias more variably, they still rely on fixed global error rates or heuristic filtering, respectively. No existing method models local variation in both bias and errors, and none provides statistically sound false discovery rate (FDR) control for SNV calling and genotyping in single cells. This gap necessitates a new approach.
Methodology
ProSolo, a novel variant caller, is presented to address the shortcomings of existing tools. ProSolo uses a probabilistic model that accounts for both the amplification bias and the amplification errors introduced by MDA. This is achieved through two key innovations: (1) a mechanistically motivated, empirically derived model of differential allele amplification, and (2) the joint modeling of single-cell samples with a bulk sequencing sample from the same cell population. The bulk sample acts as an unbiased reference, providing crucial information for evaluating amplification errors and biases. The model uses beta-binomial distributions to capture the amplification bias, with parameters that vary with the coverage at each site. This allows for accurate modeling of allele dropout. The joint modeling with the bulk sample enhances the accuracy and sensitivity of SNV calls and enables the calculation of posterior probabilities for various single-cell events (e.g., homozygous reference, heterozygous, homozygous alternative, allele dropout, amplification errors). ProSolo's implementation in the Varlociraptor library facilitates efficient computation and flexible FDR control. The posterior probabilities can be used to control the FDR and to impute genotypes at sites with insufficient single-cell coverage. Imputation uses the bulk sample as a reference, which makes it a biologically relevant approach.
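The role of the beta-binomial distribution in modeling allelic amplification bias can be illustrated with a minimal sketch. The shape parameters below (alpha = beta = 0.5) are hypothetical values chosen for illustration; they are not ProSolo's empirically derived, coverage-dependent parameters, and the function names are not part of any ProSolo or Varlociraptor API. The sketch compares how likely it is to observe zero alternative reads at a truly heterozygous site under an unbiased binomial model and under an overdispersed beta-binomial model.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(k alt reads out of n) when the alt allele fraction ~ Beta(alpha, beta)."""
    return exp(log_choose(n, k) + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def binomial_pmf(k, n, p):
    """P(k alt reads out of n) for a fixed alt allele fraction p (0 < p < 1)."""
    return exp(log_choose(n, k) + k * log(p) + (n - k) * log(1 - p))


# Hypothetical heterozygous site covered by 20 reads.
depth = 20
# Unbiased amplification: both alleles equally represented.
p_zero_alt_unbiased = binomial_pmf(0, depth, 0.5)
# Biased amplification: U-shaped Beta(0.5, 0.5) puts mass near 0 and 1 (illustrative values).
p_zero_alt_biased = beta_binomial_pmf(0, depth, 0.5, 0.5)

print(f"P(0 alt reads | het, binomial)      = {p_zero_alt_unbiased:.2e}")
print(f"P(0 alt reads | het, beta-binomial) = {p_zero_alt_biased:.2e}")
```

Under the unbiased binomial model, a heterozygous site with no alternative reads is nearly impossible at this depth, so a caller built on that assumption would confidently report homozygous reference. The overdispersed model assigns this observation a much larger probability, which is what allows a dropout event to remain a plausible explanation.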
Key Findings
ProSolo demonstrates superior performance compared to state-of-the-art tools (MonoVar, SCAN-SNV, SCcaller, SCIPhI) across three benchmark datasets: a whole-genome cell line dataset, a whole-exome granulocyte dataset, and a whole-exome triple-negative breast cancer (TNBC) dataset. In the whole-genome dataset, ProSolo achieves higher recall than the other tools at high precision. In the whole-exome granulocyte dataset, ProSolo achieves a 20% increase in recall compared to SCIPhI and SCcaller while maintaining high precision. In the TNBC dataset, ProSolo is the only tool achieving precision above 0.99 on tumor cells. ProSolo's allele dropout rate estimates are consistent with previously published rates. The software is implemented in an extendable framework and is computationally efficient. ProSolo's FDR control is flexible and reliable, allowing users to adjust the balance between precision and recall. Moreover, joint modeling with a bulk sample allows for biologically relevant imputation of genotypes at sites with low coverage in single cells.
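How posterior probabilities translate into an adjustable precision-recall trade-off can be sketched with a generic Bayesian FDR procedure: rank calls by posterior probability and keep the largest set whose mean posterior error probability stays at or below the chosen threshold. This is a common textbook approach shown under that assumption; it is not claimed to be the exact procedure implemented in Varlociraptor, and `control_fdr` and the example posteriors are hypothetical.

```python
def control_fdr(posteriors, alpha):
    """Return indices of calls kept so that the expected FDR is <= alpha.

    posteriors: posterior probability that each call is a true variant.
    The expected FDR of a set of calls is the mean of (1 - posterior) over the set.
    """
    # Most confident calls first, so the running mean error never decreases.
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    kept = []
    error_sum = 0.0
    for i in order:
        error = 1.0 - posteriors[i]  # posterior error probability of this call
        if (error_sum + error) / (len(kept) + 1) > alpha:
            break  # adding this call (or any less confident one) exceeds alpha
        error_sum += error
        kept.append(i)
    return kept


# Hypothetical posterior probabilities for five candidate SNV calls.
example_posteriors = [0.99, 0.95, 0.60, 0.999, 0.80]
strict = control_fdr(example_posteriors, alpha=0.05)
lenient = control_fdr(example_posteriors, alpha=0.10)
print("kept at FDR 0.05:", strict)
print("kept at FDR 0.10:", lenient)
```

Raising `alpha` admits less confident calls (higher recall, lower precision), while lowering it does the opposite.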
Discussion
ProSolo's superior accuracy in SNV calling stems from its comprehensive modeling of MDA-related biases and errors. The joint modeling of single-cell and bulk data improves sensitivity and specificity, surpassing the performance of existing methods that rely on consensus rules or phylogenetic inference. The flexible FDR control lets researchers tailor their analysis to specific needs, balancing precision and recall. ProSolo's ability to impute missing genotypes based on the bulk sample offers a practical advantage for downstream analyses. The modular implementation in the Varlociraptor library provides a scalable and extensible platform for future development. Although the bulk sample improves performance, this benefit depends on sufficient bulk coverage, which is a limitation to consider when designing experiments. The framework can be extended to other types of variants, such as insertions, deletions, and multi-nucleotide variants (MNVs).
Conclusion
ProSolo represents a significant advancement in single-cell variant calling. Its comprehensive modeling of MDA biases, coupled with efficient computation and flexible FDR control, yields highly accurate and scalable results. Integrating bulk sequencing data enhances the reliability of SNV calls and enables biologically relevant genotype imputation. Future work will focus on refining the empirical models of amplification bias, extending the model to other variant types, and potentially incorporating copy number profiles. ProSolo is a valuable tool for researchers working with single-cell DNA sequencing data, enabling more robust and informative analyses.
Limitations
While ProSolo significantly improves accuracy and scalability, some limitations remain. The accuracy of allele dropout rate estimations can vary slightly depending on the dataset, suggesting the need for refinement of the empirical distributions used. The model's performance is influenced by the coverage depth of the bulk background sample; insufficient coverage may lead to underperformance in detecting subclonal somatic mutations. The current model may not fully capture all aspects of the amplification process, particularly with respect to the impact of doublets. Finally, the use of fixed empirical distributions might not be optimal for all datasets and protocols.
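The dependence on bulk coverage can be made concrete with a simplified calculation: the probability that a bulk sample contains at least a given number of reads supporting a subclonal variant. This sketch assumes reads sample the alternative allele independently with probability equal to the variant allele frequency and ignores sequencing errors; it is not ProSolo's model, and the function name, depths, and allele frequency are hypothetical.

```python
from math import comb


def prob_alt_reads_at_least(depth, vaf, min_alt=1):
    """P(at least min_alt alt reads) for Binomial(depth, vaf) read sampling."""
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k) for k in range(min_alt)
    )
    return 1.0 - p_fewer


# Hypothetical subclonal variant at 5% allele frequency in the bulk sample.
for depth in (10, 30, 100):
    p = prob_alt_reads_at_least(depth, vaf=0.05, min_alt=1)
    print(f"bulk depth {depth:>3}: P(>=1 alt read) = {p:.3f}")
```

Under these assumptions, a variant at 5% allele frequency has roughly a 0.79 chance of appearing in at least one read at a bulk depth of 30, and above 0.99 at a depth of 100, which illustrates why shallow bulk coverage leaves subclonal mutations without support.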