
Biology
Accurate and scalable variant calling from single cell DNA sequencing data with ProSolo
D. Lähnemann, J. Köster, et al.
Discover how ProSolo, developed by David Lähnemann and colleagues, revolutionizes the accuracy of single nucleotide variant calling from single cell DNA sequencing data. This innovative tool tackles allelic bias and copy errors, enabling precise genomic analyses unlike any state-of-the-art methods before it.
~3 min • Beginner • English
Introduction
Somatic single nucleotide variants (SNVs) accumulate during cell divisions and create genomic mosaicism, enabling lineage tracing and insights into development and cancer evolution. Single-cell DNA sequencing can profile this variation but typically requires whole-genome amplification (WGA), most commonly multiple displacement amplification (MDA). MDA introduces amplification errors and strong allelic bias, including allele-specific coverage skew and dropout, violating assumptions of bulk-oriented variant callers. Existing single-cell callers often assume global, fixed error or dropout rates and insufficiently capture the local, site-specific variability driven by polymerase behavior and sequence context. There is a need for a statistical framework that models both amplification bias and amplification errors in a site-specific manner and provides sound false discovery rate (FDR) control for SNV calling and genotyping in single cells. ProSolo addresses this by jointly modeling single-cell data with an accompanying bulk sample from the same population to mitigate MDA-induced uncertainties while enabling accurate, scalable variant calling.
Literature Review
Previous single-cell variant callers such as MonoVar and SCcaller employ global false-positive error rates or fixed dropout rates, assuming uniformity that contradicts known sequence-context dependencies of the Φ29 polymerase used in MDA. SCIPHI models allele bias via a global beta-binomial distribution and also assumes global dropout rates, integrating information across cells through phylogenetic inference but at high computational cost and with limited genotype resolution. SCcaller and SCAN-SNV estimate local allelic bias using nearby germline heterozygous sites, but SCcaller still relies on a fixed false-positive rate, and SCAN-SNV uses heuristic filters and is geared towards somatic variants. No prior method jointly modeled local amplification bias and amplification errors with rigorous FDR control. Empirical studies have shown MDA error rates orders of magnitude higher than somatic mutation rates and significant site-specific variation in amplification performance, motivating a mechanistically-informed, data-driven probabilistic approach.
Methodology
ProSolo is a probabilistic variant caller tailored to diploid single cells amplified by MDA. It addresses two key MDA issues: (i) allelic amplification bias and dropout, and (ii) amplification (copy) errors introduced by Φ29 polymerase. The model represents two allele frequency notions per site: the true underlying single-cell allele frequency θ_s ∈ {0, 0.5, 1} and the observed post-amplification allele fraction p = k/l from l total reads with k alternative reads. To model P(p|θ_s), ProSolo adopts an empirically and mechanistically motivated formulation derived from Lodato et al.: homozygous reference and homozygous alternative sites follow beta-binomial distributions with coverage-dependent parameters, and heterozygous sites follow a mixture of two symmetric beta-binomials that capture allelic bias and dropout, with mixture weight and shape parameters linear in coverage. This mirrors a Pólya urn process analogy for MDA. Symmetry is enforced for homozygous cases by swapping shape parameters.
To disambiguate amplification errors from true variants, ProSolo jointly models an accompanying bulk sequencing sample from the same population, unaffected by MDA. For the bulk, it computes read-level likelihoods over discrete allele frequency states and forms a likelihood density L(θ_b) across allele frequencies using a read emission model (from Varlociraptor). ProSolo defines mutually exclusive single-cell events over the 2D space of (θ_s, θ_b), encompassing: homozygous reference, heterozygous, homozygous alternative, allele dropout to reference, allele dropout to alternative, and amplification errors (to reference or alternative). Assuming flat priors over allele frequencies, it computes per-site posterior probabilities for each event by integrating the single-cell beta-binomial likelihoods with the bulk read-based likelihoods. Compound events (e.g., presence of an alternative allele) are obtained by summing the posterior probabilities of constituent events. Genotyping chooses the genotype with maximum posterior probability.
ProSolo provides FDR control by thresholding posterior probabilities using a multiple-testing framework (following Müller et al.), enabling user-specified trade-offs between precision and recall for events such as alternative allele presence or allele dropout. It also supports biologically meaningful imputation at single-cell sites with zero coverage by deferring to the bulk sample (setting single-cell read likelihoods to neutrality so posterior probabilities depend solely on bulk data), though imputation is cautioned against for low-frequency somatic variants. The implementation is modular within the Rust-based Varlociraptor library, scales linearly in site coverage, and parallelizes across genomic regions. Benchmarking compared ProSolo with MonoVar, SCAN-SNV, SCcaller, and SCIPHI on three datasets: (1) WGS of two kindred single cells with a clone as ground truth (IL-11, IL-12, IL-1C); (2) WES of five granulocytes with pedigree-derived germline ground truth; and (3) WES of 16 tumor and 16 normal nuclei from a TNBC patient with validated bulk ground truths.
Key Findings
Across datasets, ProSolo achieved consistently high precision with improved recall relative to state-of-the-art tools and uniquely enabled flexible FDR control.
- Whole-genome (cell line) dataset: At precision >0.99, ProSolo's maximum recall reached 0.766 versus MonoVar 0.705, SCIPHI 0.687, SCcaller 0.610; SCAN-SNV recall was 0.0001 at precision 0.992 (focused on somatic variants).
- Whole-exome (granulocytes) dataset: Only SCcaller, SCIPHI, and ProSolo exceeded 0.99 precision. ProSolo achieved recall 0.178, vs SCIPHI 0.146 and SCcaller 0.072. MonoVar's maximum precision was 0.962 (recall 0.141). SCAN-SNV achieved recall 0.0016 at precision 0.897, likely underestimated precision due to germline ground truth penalizing true somatic calls.
- TNBC single-nucleus WES: On tumor cells, ProSolo was the only tool attaining precision >0.99 (recall 0.319). On normal cells, SCIPHI and ProSolo both exceeded 0.99 precision; SCIPHI achieved higher recall (0.650) than ProSolo (0.548), though ProSolo could reach recall 0.625 at higher FDR (reduced precision). SCIPHI's across-cell integration improves recall but can introduce false-positive subclonal heterozygous calls and incurs very long runtimes (weeks, single-core) versus days for parallelizable tools including ProSolo.
- FDR control: ProSolo uniquely provided reliable, flexible control of the false discovery rate over a wide precision-recall range; extremely small FDR targets (<0.01% in WGS alt-allele calling) reduced accuracy.
- Allele dropout: ProSolo's expected allele dropout rates, computed from posterior event probabilities, were concordant with rates derived from genotyping against ground truth and fell within published ranges. Slight underestimation or overestimation occurred depending on dataset. Naive dropout estimators overestimated dropout, especially in higher-coverage or potential doublet samples.
- Bulk integration: Joint modeling with a bulk sample markedly improved sensitivity and specificity; even ~4× bulk coverage yielded large performance gains and enabled meaningful FDR control, though deeper bulk is required for detecting low-frequency somatic variants.
- Scalability: Posterior probability calculations scale linearly with site coverage and parallelize across regions, providing substantial runtime advantages over models integrating all cells/sites simultaneously.
Discussion
The work addresses the challenge of accurate SNV calling in single-cell MDA data by jointly modeling local allelic bias and amplification errors while leveraging a bulk sample as an unbiased population reference. The mechanistically motivated beta-binomial mixture for heterozygous sites captures coverage-dependent allelic skew and dropout, and the bulk likelihoods effectively distinguish true variants from amplification errors. As a result, ProSolo attains superior or competitive recall at high precision across diverse datasets and uniquely offers rigorous FDR control, enabling users to tune discoveries to study-specific needs. The model's per-site posterior event probabilities (e.g., allele presence, dropout, amplification error) pass uncertainty to downstream analyses and can inform phylogenetic methods without sacrificing scalability. ProSolo's parallelizable design contrasts with phylogeny-integrating approaches that, while increasing recall in larger cohorts, may yield false positives and become computationally prohibitive. Overall, the findings support that integrating site-specific bias modeling with bulk evidence yields more reliable single-cell variant calls and genotypes, facilitating downstream biological inference.
Conclusion
ProSolo introduces a comprehensive, scalable framework for SNV calling from MDA-amplified single-cell DNA sequencing that models coverage-dependent allelic bias, accounts for amplification errors via joint bulk modeling, and provides rigorous FDR control. Benchmarking on whole-genome and whole-exome datasets demonstrates consistently high precision and improved recall compared to state-of-the-art tools, accurate allele dropout estimation, and practical runtimes. The tool's fine-grained posterior probabilities enable uncertainty-aware downstream analyses and biologically meaningful imputation when needed. Future directions include learning empirical bias parameters per cell (coverage-dependent), incorporating priors on somatic mutation rates in bulk to improve recall at low bulk depths, refining amplification error models separate from sequencing errors (especially for homozygous genotypes), integrating copy-number profiles and read-backed phasing, and extending to indels and MNVs within the Varlociraptor framework.
Limitations
Accuracy degrades when attempting to control for extremely small false discovery rates (<0.01% in some settings). The current empirical allelic bias distributions (derived from Lodato et al.) may not generalize perfectly across datasets, affecting genotyping and allele dropout estimates. Limited bulk coverage can prevent detection of subclonal somatic variants absent from bulk reads, leading to potential misclassification; deeper bulk sequencing or priors on somatic rates are needed. Modeling of homozygous genotypes currently conflates amplification and sequencing errors, potentially misclassifying some subclonal heterozygous variants as homozygous. Dataset-specific factors (e.g., doublets, variable WGA protocols) can influence performance, suggesting benefits from per-sample parameter learning.
Related Publications
Explore these studies to deepen your understanding of the subject.