Systematic detection of co-infection and intra-host recombination in more than 2 million global SARS-CoV-2 samples

O. A. Pipek, A. Medgyes-horváth, et al.

This study analyzes more than 2 million SARS-CoV-2 samples for co-infection and intra-host recombination. It reports a co-infection rate of about 0.35% and identifies intragenic recombination hotspots in Delta and Omicron BA.1 co-infections. The work, authored by Orsolya Anna Pipek and colleagues, also highlights how hard these events are to detect reliably from sequencing data.

Introduction
SARS-CoV-2’s relatively high mutation rate has led to the emergence of many lineages that can co-circulate, creating opportunities for co-infections (simultaneous infection by two or more variants) and subsequent recombination within hosts. Early reports indicated co-infection rates around 0.2–0.6%, with conflicting associations to disease severity. Recombinant lineages (e.g., Omicron XBB) have since been observed, implying that co-infections occur and can yield novel transmissible variants. Detecting co-infection and intra-host recombination from sequencing data is challenging due to primer biases, reliance on consensus sequences, limited lineage-defining mutations, and potential sample mix-ups. Prior studies often focused on subsets of variants or local datasets. This study aims to perform a global, standardized, large-scale detection of co-infections using raw read data and to systematically identify intra-host recombination signals, thereby informing surveillance and evolutionary risk assessment.
Literature Review
Co-infections have been identified via ambiguous lineage assignments, heterozygous positions in consensus sequences, or inconclusive PCR genotyping. Several country-level studies (USA, France, UAE, Brazil, Costa Rica) performed systematic searches for co-infections using allele frequencies (AFs) at lineage-defining sites, mainly for variants of concern. Methods such as ASV-like read assignment and hypergeometric models (e.g., CozaCoinfect) estimate lineage proportions from defining mutations, reporting co-infection rates of ~0.3–0.5%. However, AFs often diverge from true lineage proportions because of primer and amplification bias, and such methods may also be confounded by intra-host recombination. Evidence for recombination is well documented in coronaviruses, but detecting intra-host recombination in SARS-CoV-2 is difficult given limited diversity, uneven distribution of defining mutations, and short reads. Previous intra-host recombination studies often analyzed small cohorts or aggregated frequency data; large-scale, read-level analyses remain scarce.
Methodology
Data source and preprocessing: The study used the CoV2co/CoVEO database, built from raw read datasets uploaded to the European COVID-19 Data Portal/ENA and processed through a standardized variant-calling workflow (Illumina and Nanopore) that generates uniform VCFs with allele frequencies and depth. Of 3,093,454 human-host samples (Dec 30, 2019–Jun 30, 2022), quality filters excluded samples with a total base count ≤100,000 or with depth <10 across >10% of the genome, leaving 2,172,927 good-quality samples.
Defining mutations: Rather than relying on a fixed, predefined list, the study used a marker table derived from GISAID consensus sequences (aligned to the Wuhan reference; variants called with VarScan2) to identify lineage-defining mutations. Because many lineages have few truly unique markers, the authors expanded the set to mutually exclusive defining mutations specific to each candidate variant combination: mutations present in ≥80% of one lineage and in <10% of each of the other lineages in that combination. This increases discriminatory power in the co-infection context.
Co-infection detection: Initial candidates required >50% of the unique defining mutations of at least two strains, yielding 29,666 putative co-infections. Next, for each candidate combination, the ratio of mutually exclusive defining mutations present per lineage was computed (observed count divided by the potential count for that combination). Final selection first required at least 50% of the mutually exclusive markers for all involved variants and then applied a stringent threshold of 0.8 (i.e., ≥80% of the mutually exclusive defining mutations present for each variant). This resulted in 7,700 co-infection samples among 2,172,927, an overall prevalence of ~0.35%. The threshold avoids misclassifying clonal recombinants, which cannot reach 0.8 for both parents simultaneously if only one breakpoint exists.
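The marker selection and the 0.8 rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary input format, and the toy lineage frequencies are all assumptions made for the example. Only the thresholds (≥80% in one lineage, <10% in the others, ratio ≥0.8 per variant) come from the text.

```python
from typing import Dict, Sequence, Set

# Hypothetical input shape (an assumption, not the study's format):
# lineage -> {mutation: fraction of that lineage's consensus sequences carrying it}
LineageFreqs = Dict[str, Dict[str, float]]


def mutually_exclusive_markers(
    freqs: LineageFreqs,
    combo: Sequence[str],
    present_min: float = 0.8,
    absent_max: float = 0.1,
) -> Dict[str, Set[str]]:
    """Markers present in >= present_min of one lineage and < absent_max of every other lineage in combo."""
    markers: Dict[str, Set[str]] = {}
    for lineage in combo:
        others = [other for other in combo if other != lineage]
        markers[lineage] = {
            mutation
            for mutation, freq in freqs[lineage].items()
            if freq >= present_min
            and all(freqs[other].get(mutation, 0.0) < absent_max for other in others)
        }
    return markers


def marker_ratios(sample_mutations: Set[str], markers: Dict[str, Set[str]]) -> Dict[str, float]:
    """Observed / potential mutually exclusive markers per lineage (0.0 if a lineage has no markers)."""
    return {
        lineage: (len(muts & sample_mutations) / len(muts) if muts else 0.0)
        for lineage, muts in markers.items()
    }


def is_coinfection(sample_mutations: Set[str], markers: Dict[str, Set[str]], threshold: float = 0.8) -> bool:
    """True only if every lineage in the combination reaches the threshold."""
    return all(ratio >= threshold for ratio in marker_ratios(sample_mutations, markers).values())


# Toy data, invented for illustration only.
toy_freqs: LineageFreqs = {
    "Delta": {"m1": 0.95, "m2": 0.90, "m3": 0.85, "m4": 0.99, "m5": 0.90, "shared": 0.97},
    "BA.1": {"n1": 0.95, "n2": 0.92, "n3": 0.88, "n4": 0.90, "n5": 0.81, "shared": 0.96, "m3": 0.05},
}
toy_markers = mutually_exclusive_markers(toy_freqs, ["Delta", "BA.1"])

# Carries 4 of 5 Delta markers and all 5 BA.1 markers.
coinfected_sample = {"m1", "m2", "m3", "m4", "n1", "n2", "n3", "n4", "n5", "shared"}
# Carries Delta markers from one part of the genome and BA.1 markers from another.
recombinant_like_sample = {"m1", "m2", "m3", "n4", "n5", "shared"}

print(is_coinfection(coinfected_sample, toy_markers))
print(is_coinfection(recombinant_like_sample, toy_markers))
```

In this toy case the mutation `shared` is dropped from both marker sets because it is common in both lineages, and the recombinant-like sample fails the rule because neither parent reaches 0.8. Treating an empty marker set as a ratio of 0.0 is a design choice of this sketch, so that a lineage with no usable markers can never support a co-infection call.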
Temporal and geographic analysis: The collection dates of co-infection samples were compared against weekly prevalence curves of Pango lineages from GISAID at global and country levels. Study- and country-specific co-infection prevalences were computed to evaluate potential contamination and sampling-strategy effects. Genetic diversity metrics included the number of circulating lineages, the information entropy of lineage distributions, and cumulative lineage counts.
Intra-host recombination detection (AF-shift method): For the 7,290 co-infections containing exactly two strains, allele frequency (AF) trajectories of mutually exclusive defining mutations along the genome were analyzed to detect shifts consistent with recombination breakpoints. AFs were corrected for systematic primer-related biases. Competing models with and without a breakpoint were compared via odds ratios (OR); OR>1 indicated a putative breakpoint. The fraction of recombinant genomes (R) was inferred from the AF shifts.
Intra-host recombination detection (read-level method): For a subset of samples with available BAMs, short reads overlapping mutually exclusive defining mutations from both parents were identified (filters: mapping quality >30; base quality >30 at the marker sites). Reads carrying markers of both parents were counted as recombinant reads. At each genomic position, the recombinant read ratio (recombinant reads divided by overlapping reads) was computed. Positions with ≥10 recombinant reads and a ratio ≥0.1 across multiple samples defined recombination hotspot ranges. Recombinant reads were further examined for sgRNA signatures (leader sequence presence and/or soft clipping). Artificially mixed samples served as negative controls to gauge PCR chimera rates and to guide thresholds (e.g., the 0.1 ratio). Correlations were computed between defining-mutation density and overlapping-read density, and with recombination landscapes from previously reported clonal recombinants.
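The read-level counting described above can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not from the study: each read is given as a `(start, end, parents)` tuple that has already passed the quality filters and already overlaps marker sites of both parents, a read counts toward every position in its span, and flagging is done within a single sample rather than across multiple samples. The function names are hypothetical; only the two thresholds (≥10 recombinant reads, ratio ≥0.1) come from the text.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical read record: (start, end, parents whose marker alleles the read carries).
# Coordinates are 1-based and inclusive.
Read = Tuple[int, int, Set[str]]


def recombinant_read_ratios(reads: Iterable[Read]) -> Dict[int, Tuple[int, float]]:
    """Per position: (number of recombinant reads, recombinant / overlapping reads)."""
    overlapping: Counter = Counter()
    recombinant: Counter = Counter()
    for start, end, parents in reads:
        is_recombinant = len(parents) >= 2  # carries markers of both parents
        for pos in range(start, end + 1):
            overlapping[pos] += 1
            if is_recombinant:
                recombinant[pos] += 1
    return {pos: (recombinant[pos], recombinant[pos] / n) for pos, n in overlapping.items()}


def flag_positions(
    ratios: Dict[int, Tuple[int, float]],
    min_recombinant_reads: int = 10,
    min_ratio: float = 0.1,
) -> List[int]:
    """Positions meeting both thresholds, in genomic order."""
    return sorted(
        pos
        for pos, (n_rec, ratio) in ratios.items()
        if n_rec >= min_recombinant_reads and ratio >= min_ratio
    )


# Toy reads, invented for illustration only.
toy_reads: List[Read] = (
    [(100, 110, {"A", "B"})] * 12   # 12 recombinant reads
    + [(100, 105, {"A"})] * 88      # 88 parental reads: ratio 0.12 at 100-105
    + [(200, 205, {"A", "B"})] * 5  # ratio 0.5, but only 5 recombinant reads
    + [(200, 205, {"B"})] * 5
    + [(300, 302, {"A", "B"})] * 10  # 10 recombinant reads, but ratio 0.05
    + [(300, 302, {"A"})] * 190
)

toy_ratios = recombinant_read_ratios(toy_reads)
flagged = flag_positions(toy_ratios)
print(flagged[0], flagged[-1], len(flagged))
```

Requiring both a minimum read count and a minimum ratio means that a position is not flagged on a high ratio backed by very few reads (the 200–205 block), nor on many recombinant reads that are still a small share of deep coverage (the 300–302 block).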
Code and data: Processed VCF/BAM under PRJEB45555; code and derived datasets available on GitHub/Zenodo.
Key Findings
- Scale and prevalence: Among 2,172,927 quality-controlled human SARS-CoV-2 samples, 7,700 were classified as co-infections using a stringent 0.8 threshold for mutually exclusive defining mutations per lineage, giving an overall co-infection prevalence of ~0.35%. 72 samples contained all mutually exclusive defining mutations of all comprising variants.
- Temporal/geographic patterns: Co-infections were most often detected when two variants co-circulated with sufficient sequencing volume. Prevalence varied by country and study, influenced by sampling strategies; among countries with >1,000 samples, co-infection rates ranged ~0–1.60%, with South Africa noted at ~0.92%.
- Validation and robustness: A near-linear relationship existed between the number of samples assigned to a variant and the number of co-infections involving that variant, reflecting dataset composition and underscoring the need for broad surveillance. The method recovered many known co-infection instances from prior studies despite independent pipelines, indicating robustness.
- Intra-host recombination (AF-based): Identified 13 putative intra-host recombinant samples, with estimated recombinant genome fractions ranging from 6% to 16%, predominantly in Delta–Omicron (BA.1) co-infections.
- Read-level recombination evidence: Density of mutually exclusive defining mutations correlated with density of overlapping reads across genes (Pearson R = 0.725, p = 0.012). Genomic distribution of inferred recombination breakpoint support from reads was moderately correlated with clonal recombinant landscapes reported by Turakhia et al. (R = 0.595, p < 0.001). Artificially mixed controls showed fewer positions with recombinant read ratios <0.1 compared to true co-infections; positions with ratios >0.1 tended to have higher duplicate fractions, implicating PCR chimeras. A threshold of ≥10 recombinant reads and recombinant read ratio ≥0.1 was used to define hotspots while mitigating artefacts.
- Hotspots: Many putative breakpoint ranges coincided with gene boundaries. Additional intragenic hotspots in Delta–Omicron (BA.1) co-infections were identified in S (22578–23202, 23225–24302, 24310–24503) and M (26530–26767). Some recombinant reads bore sgRNA hallmarks near known junctions, suggesting that a subset of events reflects transcriptional template switching with limited evolutionary impact.
- Artefact mitigation: PCR amplification biases distort AFs; chimeric sequences can appear at ~0.04–0.3% or more in artificial mixtures. Using read-level thresholds and duplicate-read assessments reduces false positives. Hybrid capture and long-read sequencing could further improve detection.
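As a quick sanity check, the headline percentages can be recomputed from the sample counts quoted in this summary. The variable names are illustrative; the four counts are the ones given in the Methodology and Key Findings text.

```python
# Counts quoted in the summary above.
total_human_samples = 3_093_454
good_quality_samples = 2_172_927
coinfection_samples = 7_700
two_strain_coinfections = 7_290

# Share of samples passing the quality filters.
retained_pct = 100 * good_quality_samples / total_human_samples
# Overall co-infection prevalence among good-quality samples.
prevalence_pct = 100 * coinfection_samples / good_quality_samples
# Share of co-infections that involve exactly two strains.
two_strain_pct = 100 * two_strain_coinfections / coinfection_samples

print(f"retained after QC: {retained_pct:.1f}%")
print(f"co-infection prevalence: {prevalence_pct:.2f}%")
print(f"two-strain co-infections: {two_strain_pct:.1f}%")
```

The prevalence works out to roughly 0.35%, matching the reported figure, and the large majority of co-infections involve exactly two strains.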
Discussion
The findings substantiate that SARS-CoV-2 co-infections, while relatively uncommon (~0.35%), are globally distributed and aligned with periods of co-circulation of major variants. The standardized, large-scale approach demonstrates that co-infection detection is feasible at pandemic scale when raw read data and allele frequencies are available. Evidence of intra-host recombination was detected in a small subset of co-infections, with recombinant fractions typically low (6–16%), consistent with expectations that recombinant genomes comprise a minority amid parental strains. Methodological advances—expanding defining-mutation sets to mutually exclusive markers within specific variant combinations and correcting AF biases—enhanced sensitivity to co-infections and recombination signals. The read-level analysis provided orthogonal support while enabling artefact discrimination via thresholds and duplicate-read patterns. The observed enrichment of breakpoint support at gene boundaries and specific intragenic regions (notably in S and M) suggests biological mechanisms such as template switching, including during sgRNA transcription; however, such transcriptional events may not translate into transmissible recombinants. The study underscores challenges that temper interpretation: amplification-based sequencing introduces AF distortions and chimeric reads; defining mutations are unevenly distributed, limiting breakpoint localization; and short reads reduce direct detection of recombinant haplotypes. Despite these constraints, the convergence of AF-shift and read-level evidence, supported by negative controls and cross-study reproducibility, indicates that real intra-host recombination does occur and can be systematically detected with careful controls. The results emphasize the importance of comprehensive, globally representative surveillance and the utility of standardized raw data processing for tracking viral evolution.
Conclusion
This work delivers a global, standardized analysis of >2 million SARS-CoV-2 sequencing datasets to detect co-infections and intra-host recombination. It reports an overall co-infection prevalence of ~0.35%, robust to stringent criteria, and identifies 13 putative intra-host recombinants with low recombinant fractions, primarily in Delta–Omicron (BA.1) co-infections. By expanding to mutually exclusive, context-specific defining mutations and integrating AF-based breakpoint modeling with read-level evidence, the study proposes practical thresholds and guidelines to mitigate PCR-induced artefacts and highlights recombination hotspot regions, often at gene boundaries and within S and M. Future work should prioritize: (1) increased sharing of raw reads and rich metadata; (2) validation via re-sequencing with hybrid capture or long-read technologies to reduce PCR artefacts and improve haplotype resolution; (3) expanded application to additional time periods and variants; and (4) improved statistical models that incorporate primer bias, sgRNA contributions, and read-duplication patterns to better distinguish true recombination from artefacts.
Limitations
- PCR-related artefacts: Amplicon sequencing introduces primer bias that skews allele frequencies and fosters PCR-mediated chimeras, complicating recombination inference.
- Low recombinant abundance: Recombinant genomes are typically a small fraction of intra-host populations, reducing coverage and statistical power.
- Defining-mutation constraints: Limited numbers and uneven genomic distribution of lineage-defining mutations restrict sensitivity and widen breakpoint uncertainty; detection is biased toward N and M genes where markers are denser.
- Short-read limitations: Illumina read lengths (100–200 bp) limit direct observation of reads spanning multiple defining mutations; long-read platforms could help but have higher error rates and were underrepresented in this dataset.
- Metadata limitations: Incomplete and inconsistent metadata hinder contamination assessment (e.g., flow-cell/run information). Accidental sample mixing cannot be entirely excluded.
- Sampling biases: Geographic and study-specific sampling strategies and capacities impact observed co-infection compositions; conclusions rely on representativeness of available data.