Enabling accurate and early detection of recently emerged SARS-CoV-2 variants of concern in wastewater

N. Sapoval, Y. Liu, et al.

QualD is a bioinformatics tool for detecting SARS-CoV-2 variants of concern in wastewater sequencing data. In the study summarized here, it detected variants up to three weeks earlier than a comparison tool and reached precision above 95% in simulated benchmarks. QualD was developed by Nicolae Sapoval, Yunxi Liu, Esther G. Lou, Loren Hopkins, Katherine B. Ensor, Rebecca Schneider, Lauren B. Stadler, and Todd J. Treangen.

Introduction

Wastewater monitoring provides a pooled signal of infections across communities and can reveal the emergence and spread of SARS-CoV-2 variants of concern (VoCs) even as clinical testing declines. However, wastewater-derived viral genomes present multiple challenges: they contain mixtures of variants from many hosts, can include previously unreported genotypes, and often suffer from uneven coverage and RNA degradation that hinder phasing and assembly. Many current VoC detection approaches require broad and deep genome coverage and thus may miss early, low-prevalence signals. Furthermore, most approaches rely primarily on single nucleotide variants (SNVs) and ignore insertions/deletions (indels), despite informative deletions such as N:DEL31-33 in Omicron. Methods that depend on reference genome databases can be biased by database composition and metadata errors, leading to false positives/negatives. To address these challenges, the study introduces QualD, a pipeline that leverages quasi-unique mutations, including both SNVs and indels, for early and accurate VoC detection from wastewater sequencing data.

Literature Review

The paper situates its contribution within wastewater genomic surveillance literature demonstrating the value of wastewater for early detection and population-level monitoring of SARS-CoV-2 variants. It critiques existing methods that require high coverage and predominantly use SNVs, noting the emergence of approaches that incorporate indels (e.g., COJAC) and co-occurrence information to improve specificity. It also highlights the dependency of many tools on reference databases (e.g., GISAID, GenBank) and associated biases due to database composition and potential metadata errors, which can lead to erroneous inferences. Freyja is identified as a state-of-the-art tool for abundance estimation from wastewater, but it does not currently leverage deletions in its phylogeny-based signatures, potentially reducing sensitivity for variants like Omicron.

Methodology

Study setting and sampling: Between February 23, 2021 and May 31, 2022, 2,637 wastewater samples were collected weekly from 39 wastewater treatment plants (WWTPs) in Houston, TX, covering ~580 square miles and serving >2.3 million people. Each site used refrigerated 24-hour composite samplers collecting 200 mL aliquots hourly. Samples were transported on ice to Houston Water’s laboratory for aliquoting and then to Rice University for processing.

Concentration, extraction, and sequencing: SARS-CoV-2 RNA was concentrated via electronegative filtration. RNA extraction followed prior work using a Chemagic system (PerkinElmer CMG-1433) and PerkinElmer viral RNA/DNA kits. cDNA synthesis used Superscript IV. Libraries were prepared with several kits and primer panels (updated over time to track VoCs) and sequenced on Illumina MiSeq.

Mutation database construction: Using a pre-generated GISAID H5M file, mutations were extracted with vdb (nucleotide mode, ambiguous bases included), trimmed (vdb trim), and joined to metadata by accession. Mutations yielding ambiguous bases (e.g., N, W, S) were removed. Data were aggregated by week and PANGO lineage over a sliding time window (default 4 weeks) to compute mutation prevalence.

Quasi-unique mutation processing: For each lineage, the prevalence of each mutation was computed as the fraction of lineage genomes harboring it. A mutation was deemed quasi-unique to a lineage if it occurred in >50% of genomes assigned to that lineage and in no more than 50% of genomes of any other lineage. Both thresholds are user-adjustable to trade sensitivity against specificity. Rare lineages (fewer than a user-defined number of genomes within the time window; default 2) were excluded to reduce noise. For each quasi-unique mutation m and lineage l, QualD estimates predictive power via Bayes' rule, P(l|m) = P(m|l) P(l) / P(m), where P(m) is the proportion of genomes with m, P(l) is the proportion of genomes in l, and P(m|l) is the fraction of l genomes containing m.
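The quasi-unique rule and the Bayes estimate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not QualD's actual code: the function names, the input layout (a list of lineage and mutation-set pairs), and the toy genomes are hypothetical, and the sketch assumes that rare lineages are dropped before the exclusion check.

```python
from collections import defaultdict


def quasi_unique_mutations(genomes, include=0.5, exclude=0.5, min_genomes=2):
    """Return {lineage: set of quasi-unique mutations}.

    genomes: list of (lineage, set_of_mutations) pairs (hypothetical layout).
    A mutation is kept for a lineage if its prevalence there is > include
    and its prevalence in every other retained lineage is <= exclude.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for lineage, muts in genomes:
        totals[lineage] += 1
        for m in muts:
            counts[lineage][m] += 1

    # Assumption: lineages with fewer than min_genomes genomes are dropped
    # entirely, so they neither receive signatures nor veto other lineages.
    kept = {l for l, n in totals.items() if n >= min_genomes}
    prevalence = {
        l: {m: c / totals[l] for m, c in counts[l].items()} for l in kept
    }

    result = {}
    for l in kept:
        result[l] = {
            m
            for m, p in prevalence[l].items()
            if p > include
            and all(prevalence[o].get(m, 0.0) <= exclude for o in kept if o != l)
        }
    return result


def posterior(genomes, lineage, mutation):
    """P(l|m) = P(m|l) * P(l) / P(m), estimated from genome counts."""
    n = len(genomes)
    n_l = sum(1 for l, _ in genomes if l == lineage)
    n_m = sum(1 for _, muts in genomes if mutation in muts)
    n_lm = sum(1 for l, muts in genomes if l == lineage and mutation in muts)
    if n_l == 0 or n_m == 0:
        return 0.0
    return (n_lm / n_l) * (n_l / n) / (n_m / n)


# Toy data for illustration only; not real prevalence figures.
genomes = [
    ("B.1.617.2", {"S:L452R", "S:T478K"}),
    ("B.1.617.2", {"S:L452R", "S:T478K"}),
    ("B.1.617.2", {"S:L452R"}),
    ("B.1.1.529", {"N:DEL31-33", "S:T478K"}),
    ("B.1.1.529", {"N:DEL31-33", "S:T478K"}),
    ("B.1.1.529", {"N:DEL31-33"}),
    ("B.1.1.7", {"S:N501Y"}),  # single genome, treated as a rare lineage
]

qu = quasi_unique_mutations(genomes)
print(qu)
print(posterior(genomes, "B.1.1.529", "N:DEL31-33"))
print(posterior(genomes, "B.1.617.2", "S:T478K"))
```

In this toy set, a mutation shared by most genomes of two lineages passes the inclusion threshold in both but fails the exclusion threshold, so it is quasi-unique to neither. Its posterior is correspondingly lower than that of a mutation seen in only one lineage.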
These posterior probabilities can inform downstream interpretation.

Mutation signature aggregation: Because PANGO hierarchies evolve, quasi-unique mutations are aggregated to a fixed lineage level by taking the union of descendants’ quasi-unique sets (e.g., Omicron as descendants of B.1.1.529 at level 4). The same hierarchy level is used for exclusion during construction.

Variant of concern detection in samples: For a sample collected on date D, QualD constructs quasi-unique mutation sets from the prior time window up to D (falling back to the last available window if needed). SNVs yielding synonymous changes are filtered out. Sample variant calls are matched to quasi-unique mutation sites (including indels; deletions evaluated by coverage at flanking positions). Outputs per variant include total combined allele frequency of detected quasi-unique mutations, number detected vs total possible quasi-unique sites, and percent of quasi-unique sites with coverage to distinguish absence from lack of data.

Benchmarking with simulated data: Three simulation protocols were used with base data from GenBank sequences collected between April 1, 2020 and April 15, 2022. Seventy-five samples were grouped into 10 weekly groups. Protocol (a) random SNV dropout retained 10%, 25%, or 50% of called SNVs from iVar and LoFreq outputs. Protocol (b) SNV resampling used real coverage templates from 4 weeks across all 39 WWTPs to adjust depth at SNV positions and re-evaluate detection. Protocol (c) read resampling used the same coverage templates to resample reads to target coverage near ARTIC v3 amplicon ends, re-ran the pipeline, and benchmarked tools. In total, 32,448 simulated samples were generated and analyzed to compare QualD and Freyja.

Empirical evaluation: Weekly aggregated detection signals across 39 WWTPs were compared to clinical variant prevalence in Texas (GenBank) to assess early detection for Alpha, Delta, and Omicron. Heatmaps of quasi-unique mutation allele frequencies (e.g., N:DEL31-33) across sites were used to visualize early Omicron detections.
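The per-variant outputs listed under sample-level detection (combined allele frequency, detected versus total quasi-unique sites, and percent of sites with coverage) can be illustrated with a small sketch. The function name, argument layout, dictionary keys, and example values are all hypothetical, and the sketch assumes that "combined" allele frequency means a plain sum over detected quasi-unique mutations; QualD's real report format may differ.

```python
def summarize_detection(signature, sample_calls, covered_sites):
    """Summarize one variant's signal in one sample.

    signature: set of quasi-unique mutations for the variant.
    sample_calls: dict mapping called mutation -> allele frequency.
    covered_sites: set of signature mutations whose site (or deletion
        flanks) has enough read coverage to be evaluated.
    """
    # Keep only calls that match the variant's quasi-unique signature.
    detected = {m: af for m, af in sample_calls.items() if m in signature}
    total = len(signature)
    covered = len(signature & covered_sites)
    return {
        # Assumption: combined frequency is the sum over detected mutations.
        "combined_allele_frequency": sum(detected.values()),
        "detected_sites": len(detected),
        "total_sites": total,
        "percent_sites_covered": 100.0 * covered / total if total else 0.0,
    }


# Illustrative values only; not taken from the study's data.
omicron_signature = {"N:DEL31-33", "S:G339D", "S:N440K", "S:Q493R"}
sample_calls = {"N:DEL31-33": 0.12, "S:G339D": 0.08, "S:D614G": 0.95}
covered_sites = {"N:DEL31-33", "S:G339D", "S:N440K"}

report = summarize_detection(omicron_signature, sample_calls, covered_sites)
print(report)
```

Reporting coverage alongside detections is what separates a variant that is probably absent (sites covered, no signature mutations) from one that simply could not be assessed (sites uncovered).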

Key Findings

- Early real-world detection in Houston wastewater: QualD detected the Delta VoC approximately two weeks before the first sequenced clinical Delta sample in Texas (signal sustained for four weeks thereafter, 2021-04-05 to 2021-05-03). QualD also detected Omicron approximately two weeks prior to the first clinical sample collection date, whereas Freyja required an additional week after the first clinical sample to detect Omicron.
- Sublineage detection: QualD reported consistent detections of Omicron BA.2 starting 2022-12-16 and BA.5 starting 2022-01-13, with detections persisting through 2022.
- Indel utility for Omicron: In the week of December 2, 2021, more than half of the 10 samples with Omicron presence showed the 9 bp deletion N:DEL31-33, a stable Omicron mutation (95.1% prevalence among Omicron genomes). By December 10, 2021, N:DEL31-33 was present in 16 of 23 sites with detections. Freyja’s reliance on a phylogeny lacking deletion signatures likely contributed to later Omicron detection relative to QualD.
- Robustness to low SNV retention: In simulations with severe SNV dropout (e.g., retaining only 10% of SNV calls), QualD reliably detected Delta and Omicron and sparsely detected Alpha and Gamma; Freyja failed to detect VoCs under such dropout. With 25% retention, QualD detected most present VoCs; Freyja showed sparse detections mainly when a single VoC dominated. Freyja required substantially higher retention of SNV calls to reliably detect VoCs.
- Coverage of deletion flanks: Over 61% of samples had at least 10 reads covering bases flanking N:DEL31-33, supporting the stability of deletion-based detection.
- Large-scale simulation benchmark (protocol c; n=32,448): QualD achieved the highest precision across all VoCs; Freyja achieved higher F1 scores for most VoCs, indicating better precision–recall balance.
  • Alpha: QualD precision 0.954, recall 0.517, F1 0.671; Freyja precision 0.847, recall 0.555, F1 0.670.
  • Delta: QualD precision 0.979, recall 0.532, F1 0.689; Freyja precision 0.747, recall 0.663, F1 0.720.
  • Gamma: QualD precision 0.999, recall 0.343, F1 0.511; Freyja precision 0.686, recall 0.414, F1 0.516.
  • Omicron: QualD precision 1.000, recall 0.472, F1 0.642; Freyja precision 0.859, recall 0.614, F1 0.716.
- Variant calling inputs: Using combined variant calling outputs in QualD increased precision; using only iVar increased recall but reduced precision relative to the combined approach.
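For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), which is why very high precision cannot compensate for modest recall. The short sketch below recomputes F1 from the QualD precision and recall values reported in the benchmark. The function and variable names are illustrative, and because the inputs are already rounded to three decimals, a recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for QualD, copied from the benchmark summary above.
quald = {
    "Alpha": (0.954, 0.517),
    "Delta": (0.979, 0.532),
    "Gamma": (0.999, 0.343),
    "Omicron": (1.000, 0.472),
}

recomputed = {voc: round(f1_score(p, r), 3) for voc, (p, r) in quald.items()}
for voc, f1 in recomputed.items():
    print(f"{voc}: F1 = {f1:.3f}")
```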

Discussion

The study addresses the need for earlier and more reliable VoC detection from wastewater, where low viral concentrations, mixed populations, and degraded RNA complicate inference. QualD’s quasi-unique mutation framework, explicitly including indels, enables earlier detection of VoCs like Delta and Omicron relative to clinical data and tools that ignore deletions. In real data from Houston, QualD detected Delta and Omicron earlier than Freyja, and in simulations it maintained high precision and sensitivity under substantial SNV dropout, supporting robustness to the uneven coverage characteristic of wastewater samples. The high precision of QualD is advantageous for early warning, while Freyja’s higher F1 in some simulations reflects its strength in abundance estimation when data quality is sufficient. Together, these findings suggest complementary use: QualD for sensitive early detection (especially leveraging indels), Freyja for relative abundance estimation of dominant variants, and tools like COJAC for high-specificity confirmation via co-occurrence. The results underscore the importance of incorporating indel signatures and adapting detection frameworks to evolving lineage taxonomies and database contents.

Conclusion

QualD provides a practical, accurate, and early detection framework for SARS-CoV-2 VoCs from wastewater by leveraging quasi-unique mutations, including both SNVs and indels. It demonstrated earlier detection of Delta and Omicron in Houston wastewater than clinical sequencing and outperformed a leading tool in precision across extensive simulations while remaining robust to data degradation. The work contributes a full pipeline spanning mutation database construction, quasi-unique signature generation, and sample-level detection and reporting. Future research should integrate QualD with complementary tools for abundance estimation and co-occurrence validation, extend the framework to other pathogens detectable in wastewater, incorporate probabilistic weighting by predictive power of mutations, and develop comprehensive simulated and synthetic datasets to systematically evaluate the impacts of RNA degradation, sampling, and sequencing protocols on detection performance.

Limitations

- Dependence on reference databases (e.g., GISAID, GenBank) introduces potential biases and errors due to database composition and metadata inaccuracies, which can affect quasi-unique mutation sets and detection outcomes.
- Wastewater data variability (uneven coverage, RNA degradation, mixed variants) can lead to missing regions, complicating detection and quantification; not all samples have coverage at informative sites.
- Threshold choices for defining quasi-unique mutations (>50% inclusion/exclusion by default) and time-window aggregation may trade sensitivity for specificity and can affect generalizability across settings or time.
- Current Freyja phylogeny used for comparison does not incorporate deletions, potentially disadvantaging it for indel-rich variants; tool comparisons may shift as methods evolve.
- QualD’s highest precision sometimes comes with lower recall relative to Freyja in simulations, affecting F1; early detection signals may not directly estimate variant abundance.
- The study relies on existing sequencing protocols and multiple library/primer kits, which may introduce batch effects; methodological changes over time could influence detection sensitivity.
