Introduction
The high mutation rate of SARS-CoV-2 has led to numerous variants. Co-infection, in which an individual is infected with two or more variants simultaneously, has been reported, with frequency estimates ranging from 0.2% to 0.6%. Its impact on disease severity remains unclear, with studies reporting conflicting results. The emergence of recombinant lineages, such as the Omicron XBB variants, further complicates the situation. Because recombination requires co-infection, the frequency of co-infection is likely linked to the emergence of new variants. Co-infections have been identified using various methods, including ambiguous PCR results and heterozygous consensus sequences, but a comprehensive global study of a large dataset has been lacking. This study leverages the vast amount of SARS-CoV-2 raw data available through the European COVID-19 Data Portal, processed with a standardized workflow by the Versatile Emerging Infectious Disease Observatory (VEO) consortium, to systematically identify co-infections and investigate intra-host recombination on an unprecedented scale.
The study acknowledges several technical challenges in detecting co-infection from whole-genome sequencing (WGS) data, including PCR amplification bias, the prevalence of consensus sequences (rather than raw read data), and the limited number of unique lineage-defining mutations. Existing detection methods also have shortcomings. Some studies rely on manual evaluation, which does not scale to large datasets. Others use bioinformatic pipelines that analyze allele frequency (AF) distributions or lineage-defining mutations, but these pipelines differ in their filtering criteria and in the set of lineage-defining mutations they use.
The study aims to overcome these limitations by developing a computational pipeline to analyze a global dataset, meticulously investigate co-infection cases, and trace intra-host recombination events, which were previously difficult due to the limited genomic diversity and the small fraction of recombinant reads in co-infection samples.
Literature Review
Previous research on SARS-CoV-2 co-infection has yielded inconsistent results regarding disease severity. While some studies reported increased severity in co-infected individuals, others found co-infection in patients with mild symptoms. Existing methods for detecting co-infection include manual inspection of samples with ambiguous results and analysis of allele frequencies (AFs) in lineage-defining positions. These methods have limitations, including scalability issues and variations in criteria and mutation sets. Studies have explored recombination in Betacoronaviruses, noting the dependence on co-infection frequency. While recombination is suspected to be common in virus evolution, it can be hard to detect in SARS-CoV-2 due to the limited genomic diversity and short evolutionary history. Previous studies demonstrating intra-host recombination often analyzed aggregated AF data or were limited to single patients, hindering comprehensive understanding. This study addresses these gaps by analyzing a large-scale dataset and employing automated methods for detection of co-infection and intra-host recombination.
Methodology
The study used the CoV2O database, developed by the VEO consortium, which contains mutational data from SARS-CoV-2 samples submitted to the European COVID-19 Data Portal. After quality filtering, which excluded samples with low base count or uneven genome coverage, 2,172,927 samples were analyzed. Co-infection was defined as the presence of mutually exclusive variant-defining mutations from at least two different viral strains. A threshold of 0.8 was applied to the ratio of mutually exclusive defining mutations, meaning that at least 80% of these mutations had to be present for each variant. Two approaches were used to detect intra-host recombination: 1) analysis of alternate AF distributions in co-infection samples to identify shifts in AFs at recombination breakpoints; 2) investigation of raw sequencing reads to find reads containing mutually exclusive mutations of multiple parental strains. The first approach involved correction for systematic biases in AF distributions and calculation of odds ratios (ORs) to identify putative recombinants. The second approach considered the density of mutually exclusive lineage-defining mutations, the ratio of recombinant reads to overlapping reads (recombinant read ratio), and the percentage of genomic positions with recombinant reads. A threshold of 0.1 was used for the recombinant read ratio to minimize the effect of PCR artifacts. The study also analyzed the geographic and temporal distribution of co-infection samples, comparing collection dates with variant prevalence data from GISAID. Artificial mixtures of viral strains were used as negative controls to evaluate the method's sensitivity and specificity. Potential confounding factors such as contamination and PCR artifacts were also considered.
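The co-infection definition above can be illustrated with a minimal Python sketch. It is not the study's pipeline: the function names, the variant names, and the mutation labels are hypothetical, and the sketch assumes that each variant is described by a set of defining mutations and that a sample is reduced to the set of mutations detected in it. Allele frequencies and the quality filters are not modeled.

```python
from typing import Dict, List, Set

# Threshold stated in the text: at least 80% of a variant's mutually
# exclusive defining mutations must be present in the sample.
THRESHOLD = 0.8


def exclusive_mutations(defining: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each variant, keep only the defining mutations shared with no other variant."""
    exclusive = {}
    for name, muts in defining.items():
        others: Set[str] = set()
        for other_name, other_muts in defining.items():
            if other_name != name:
                others |= other_muts
        exclusive[name] = muts - others
    return exclusive


def variants_present(sample: Set[str],
                     defining: Dict[str, Set[str]],
                     threshold: float = THRESHOLD) -> List[str]:
    """Return the variants whose exclusive-mutation ratio reaches the threshold."""
    present = []
    for name, muts in exclusive_mutations(defining).items():
        if not muts:
            continue  # a variant with no exclusive mutations cannot be called
        ratio = len(muts & sample) / len(muts)
        if ratio >= threshold:
            present.append(name)
    return sorted(present)


def is_coinfection(sample: Set[str], defining: Dict[str, Set[str]]) -> bool:
    """A sample counts as a co-infection when at least two variants are present."""
    return len(variants_present(sample, defining)) >= 2


# Hypothetical toy data: "S1" is shared by both variants, so it is not exclusive.
toy_defining = {
    "VariantA": {"A1", "A2", "A3", "A4", "A5", "S1"},
    "VariantB": {"B1", "B2", "B3", "B4", "B5", "S1"},
}
toy_sample = {"A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "B4", "B5", "S1"}
print(variants_present(toy_sample, toy_defining), is_coinfection(toy_sample, toy_defining))
```

Restricting the ratio to mutually exclusive mutations is what keeps a shared mutation such as the toy "S1" from counting as evidence for either variant.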
Key Findings
The study identified 7,700 co-infection samples (0.35% of the analyzed samples) using a stringent threshold of 0.8 for the ratio of mutually exclusive defining mutations. The most frequent variant compositions included Alpha, Delta, Omicron, and Epsilon variants. A near-linear relationship was found between the number of samples assigned to a given variant in the database and the number of co-infection cases containing that variant, highlighting the impact of sampling strategy and geographical distribution. Analysis of allele frequency (AF) distributions identified 13 putative recombinant samples, mostly involving Delta and Omicron (BA.1). Raw read analysis revealed a correlation between the density of mutually exclusive defining mutations and the density of reads containing mutations from multiple parental strains. This analysis also suggested that recombination breakpoints supported by fewer than 10% of overlapping reads were likely PCR artifacts. Three intergenic recombination hotspots were identified in Delta-Omicron (BA.1) co-infection samples, often located at gene boundaries, with some recombinant reads showing signs of sgRNA origin. Analysis of artificial mixed samples supported the robustness of the method, which flagged only a small fraction of these negative-control samples as recombinants. The study found a significant positive correlation (Pearson R=0.725, p=0.012) between the density of mutually exclusive defining mutations and overlapping reads. The geographic distribution of co-infection cases varied across countries, likely due to differences in sequencing capacity and local sampling strategies, which highlights the need for systematic global surveillance. The temporal distribution of co-infection samples mostly corresponded with periods when both the number of samples and the local prevalence of multiple variants were high.
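The 10% cutoff on read support can be written as a small filter. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the number of recombinant reads and the number of reads overlapping a candidate breakpoint have already been counted. How the study extracts those counts from raw reads is not reproduced here.

```python
# Cutoff stated in the text: breakpoints supported by less than 10% of
# overlapping reads are treated as likely PCR artifacts.
MIN_RECOMBINANT_READ_RATIO = 0.1


def recombinant_read_ratio(recombinant_reads: int, overlapping_reads: int) -> float:
    """Ratio of recombinant reads to all reads overlapping the candidate breakpoint."""
    if overlapping_reads <= 0:
        return 0.0  # no coverage means no support
    return recombinant_reads / overlapping_reads


def is_supported_breakpoint(recombinant_reads: int,
                            overlapping_reads: int,
                            min_ratio: float = MIN_RECOMBINANT_READ_RATIO) -> bool:
    """Keep a breakpoint only if its read support reaches the minimum ratio."""
    return recombinant_read_ratio(recombinant_reads, overlapping_reads) >= min_ratio


# Hypothetical counts: 5 of 100 reads is below the cutoff, 25 of 100 is above it.
print(is_supported_breakpoint(5, 100), is_supported_breakpoint(25, 100))
```

Using a ratio rather than an absolute read count keeps the filter comparable across positions with different sequencing depth.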
Discussion
The findings of this study provide a comprehensive analysis of SARS-CoV-2 co-infection and intra-host recombination on a global scale. The identification of co-infection in 0.35% of samples is consistent with previous reports, but the large dataset and rigorous methodology enhance the reliability of this estimate. The identification of recombination hotspots provides valuable insights into the mechanisms of viral evolution and the potential emergence of new variants. The limitations of existing methods for detecting co-infection and recombination are addressed in the study’s methodology, making the study’s findings a substantial contribution. The study highlights the importance of systematic global surveillance and the use of raw sequencing data in studying viral evolution. The findings also have implications for understanding the clinical impact of co-infection and developing effective strategies for preventing and managing COVID-19.
Conclusion
This study presents a comprehensive, large-scale analysis of SARS-CoV-2 co-infection and intra-host recombination, employing advanced methods to overcome existing limitations. The findings underscore the importance of systematic global surveillance for monitoring viral evolution and detecting potential threats. Future research should focus on validating these findings using additional datasets and exploring the clinical significance of co-infection and recombination.
Limitations
The study acknowledges limitations related to the inherent biases in sequencing data, including PCR amplification artifacts and the uneven distribution of lineage-defining mutations. The reliance on publicly available data may also introduce biases due to variations in sampling and sequencing protocols across different countries and studies. While the study employed rigorous quality control measures, the possibility of undetected contamination cannot be entirely ruled out. The study also focused primarily on the analysis of Illumina sequencing data, and the limited availability of long-read sequencing data prevented a comprehensive evaluation of the potential benefits of that technology for recombination detection.