Adaptive Evolution of the Spike Protein in Coronaviruses
X. Tang, Z. Qian, et al.
Coronaviruses (CoVs) are divided into Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus, with typical host ranges across mammals and birds. Prior to 2019, six human-infecting CoVs were known (229E, NL63, HKU1, OC43, SARS-CoV, and MERS-CoV). SARS-CoV-2, the cause of COVID-19, is a Betacoronavirus. The trimeric S protein initiates cell entry via receptor binding (S1) and membrane fusion (S2). S1 contains the N-terminal domain (S1-NTD) and C-terminal domain (S1-CTD), either of which can function as the receptor-binding domain (RBD), with S1-NTD typically recognizing sialic acids and occasionally protein receptors, and S1-CTD commonly binding protein receptors such as ACE2. Numerous S mutations (e.g., D614G) have been shown to be positively selected, increasing replication, transmissibility, and immune escape. Given that most coronaviruses circulate in nonhuman hosts, the authors hypothesize that host environment differences after SARS-CoV-2’s invasion of humans shifted the targets of positive selection within the S gene, influenced by stronger human antibody responses, vastly increased viral population size, and widespread vaccination.
Background literature establishes: (1) coronavirus diversity, host ranges, and prior human CoVs; (2) S protein structure-function relationships, with S1 determining receptor usage and tropism and acting as a key neutralizing antibody target; (3) evidence of adaptive evolution in SARS-CoV-2 S, including D614G and lineage-defining mutations in variants of concern/interest affecting receptor binding and immune evasion; (4) potential drivers of adaptive evolution such as host immune pressures (including vaccination), large effective population size during the pandemic, and recombination in coronaviruses. Prior structural and functional studies on RBD-ACE2 interactions and neutralizing epitopes, and mapping of escape mutations in RBD and NTD, frame the study’s focus on where positive selection concentrates across CoVs versus in ongoing SARS-CoV-2 evolution.
- SARS-CoV-2 genome analysis: Downloaded 7,269,791 high-quality genomes from GISAID (as of Feb 27, 2022). Each genome was aligned to NC_045512 using MAFFT v7.453. SNPs were annotated with SnpEff v5.0e. Numbers of synonymous (S) and nonsynonymous (N) sites for non-S (concatenated ORFs except S), S, S1, and S2 were obtained using YN00 (PAML v4.9a). For dN/dS (ω) analyses, retained 3,394,571 genomes with at least one synonymous change both inside and outside S; for S1 vs S2, required at least one synonymous change within both S1 and S2. Wilcoxon signed-rank tests and rank-sum tests assessed differences; one-tailed tests evaluated ω relative to 1.
- Variant group analyses: Defined VOCs/VOIs per WHO (as of Oct 3, 2022): VOCs (Alpha, Beta, Gamma, Delta, Omicron) and VOIs (Lambda, Mu, Epsilon, Zeta, Eta, Theta, Iota, Kappa). Calculated ω between NC_045512 and sequences from each lineage. For phylogeny, subsampled up to 1,000 genomes per lineage if >1,000 total; constructed maximum likelihood trees with RAxML v8.2.12; visualized in iTOL.
- McDonald–Kreitman (MK) tests: Removed mutations fixed across all VOCs/VOIs. Within each lineage, defined fixed variants as frequency >0.8 (also tested >0.9) and polymorphisms as 0.01–0.8 (or 0.01–0.9). Counted fixed synonymous (ds), polymorphic synonymous (ps), fixed nonsynonymous (dn), and polymorphic nonsynonymous (pn) changes, then computed the proportion of adaptive nonsynonymous substitutions as α = 1 − (ds/dn) × (pn/ps). Performed Fisher’s exact tests for genome-wide, S, non-S, S1, S2, NTD, and CTD.
- Across-genera coronavirus analysis: Downloaded α-, β-, γ-, and δ-CoV genomes from NCBI Virus; removed redundancy in β-CoV using USEARCH (identity ≥0.996). Retained 1,050 α-, 851 β-, 193 γ-, and 160 δ-CoVs. Aligned proteins for orf1ab, S, E, M, N with MUSCLE v3.8.31; codon-aligned CDS with RevTrans. Computed pairwise dN, dS, ω using YN00. Considered pairs with genomic dS between 0.05 and 1 and dS > 0.05 in S1, S2, NTD, CTD, and non-S. Final pairs: 44,104 (α), 56,229 (β), 14,290 (γ), 2,668 (δ). Defined S1/NTD/CTD regions by homology to SARS-CoV-2 S.
- Recombination control: Concatenated five genes and screened with RDP4 (RDP, GENECONV, BootScan, MaxChi, Chimera, SiScan, 3Seq). Considered events detected by ≥4 programs as true; excluded parent–recombinant pairs involving S; re-ran pairwise ω comparisons.
- Positive selection site inference with CODEML: Assembled SARS-CoV-2 plus 26 closely related CoVs (nine conserved ORFs concatenated). Built NJ phylogeny (MEGA X, JTT model, pairwise deletion). Tested M7 vs M8 in PAML (via EasyCodeML); identified sites with BEB ≥ 0.95.
- Genera-wide codon models: Randomly selected 50 genomes per genus (α, β, γ, δ); concatenated five conserved ORFs; ran M7 vs M8; mapped positively selected sites to S1-NTD and S1-CTD by homology to SARS-CoV-2.
- Intra-species S gene divergence: For 13 CoV species with known receptor-binding domain usage (NTD vs CTD vs both), retrieved S CDSs and computed pairwise ω for S1, S2, NTD, CTD among strains, restricting to pairs with dS > 0.05 in all subregions (SARS-CoV used dS > 0.005 due to high similarity). Statistical comparisons used Wilcoxon signed-rank tests.
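The ω and MK α calculations described in the methods above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the example site counts, and the example variant list are hypothetical, the ω function omits the multiple-hit correction that YN00 applies, and treating the 0.01 and 0.8 polymorphism bounds as inclusive is an assumption.

```python
from typing import Iterable, Tuple


def omega(n_nonsyn: int, n_syn: int, nonsyn_sites: float, syn_sites: float) -> float:
    """Simplified dN/dS: nonsynonymous changes per nonsynonymous site
    divided by synonymous changes per synonymous site (no multiple-hit correction)."""
    if n_syn == 0:
        # Mirrors the filter requiring at least one synonymous change.
        raise ValueError("omega is undefined without synonymous changes")
    return (n_nonsyn / nonsyn_sites) / (n_syn / syn_sites)


def mk_counts(
    variants: Iterable[Tuple[float, bool]],
    fixed_above: float = 0.8,
    poly_min: float = 0.01,
) -> Tuple[int, int, int, int]:
    """Classify (frequency, is_nonsynonymous) variants into dn, ds, pn, ps.

    Fixed: frequency > fixed_above. Polymorphic: poly_min <= frequency <= fixed_above
    (inclusive bounds are an assumption). Rarer variants are ignored.
    """
    dn = ds = pn = ps = 0
    for freq, is_nonsyn in variants:
        if freq > fixed_above:
            if is_nonsyn:
                dn += 1
            else:
                ds += 1
        elif freq >= poly_min:
            if is_nonsyn:
                pn += 1
            else:
                ps += 1
    return dn, ds, pn, ps


def mk_alpha(dn: int, ds: int, pn: int, ps: int) -> float:
    """alpha = 1 - (ds/dn) * (pn/ps), as defined in the MK test description."""
    return 1 - (ds / dn) * (pn / ps)


# Hypothetical lineage: (within-lineage frequency, is_nonsynonymous)
example_variants = [
    (0.95, True), (0.90, True), (0.85, False),   # fixed
    (0.30, True), (0.05, False), (0.02, False),  # polymorphic
    (0.005, True),                               # below 0.01, ignored
]
example_counts = mk_counts(example_variants)
example_alpha = mk_alpha(*example_counts)

# Hypothetical site counts for one gene region
example_omega = omega(n_nonsyn=8, n_syn=1, nonsyn_sites=3000, syn_sites=800)
```

Passing `fixed_above=0.9` to `mk_counts` reproduces the alternative fixation threshold mentioned above. The significance of the resulting 2×2 table (dn, ds, pn, ps) would then be assessed with Fisher's exact test.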
- SARS-CoV-2 genome-wide selection patterns: Across 3,394,571 genomes, ω was significantly higher for S than non-S (median S: 2.191 [2.5th–97.5th: 0.274–8.215]; median non-S: 0.523 [0.226–1.694]; P < 1e-10). S showed ω > 1 (P < 1e-10), while non-S showed ω < 1 (P < 1e-10), indicating positive selection on S and purifying selection elsewhere.
- VOC/VOI comparisons: For all VOCs and VOIs, S had significantly higher ω than non-S (P < 0.001). S ω medians for VOCs: Alpha 1.917, Delta 1.917, Beta 2.191, Gamma 3.286, Omicron 7.394 (all ω > 1, P < 1e-10). Pooling lineages, VOC S ω > VOI S ω (P < 1e-10). Several VOIs (Mu 2.465, Kappa 1.917, Lambda 1.643, Iota 1.643, Eta 1.643) also had S ω > 1 (P < 1e-10).
- Regional selection within S: ω was consistently higher in S1 than S2 across all genomes and in each VOC/VOI. VOC S1 ω > 1 (Alpha 1.122, Beta 1.964, Gamma 2.806, Delta 1.684, Omicron 6.454; all P < 1e-10). S2 ω < 1 for VOCs except Omicron (Alpha 0.798, Beta 0.266, Gamma 0.532, Delta 0.266; Omicron 1.596).
- McDonald–Kreitman tests: Positive selection detected when pooling all genes in VOCs (α = 0.696, P = 2.76×10^-6) and VOIs (α = 0.493, P = 0.006). S gene showed strong positive selection (VOCs: α = 0.96, P = 2.75×10^-5; VOIs: α = 0.797, P = 0.009). Excluding S, VOC genomes still showed selection (P = 0.02) but VOIs did not (P = 0.18). Within S, selection concentrated in S1 (VOCs P = 0.001; VOIs P = 0.017) and specifically S1-CTD/RBD (VOCs P = 0.011; VOIs P = 0.042), not in S1-NTD (not significant). Results held with the alternative fixation threshold (>0.9).
- Across-genera CoV divergence: In α-, β-, γ-, and δ-CoVs, ω < 1 overall (purifying selection), but S had higher ω than non-S, and S1 had higher ω than S2 in all genera. These patterns persisted after excluding recombinants detected by ≥4 RDP4 methods.
- Positive selection site mapping (SARS-CoV-2 + relatives): CODEML M7 vs M8 detected strong selection (P < 1e-10) with 12 positively selected sites, all in S1 (2 signal peptide, 9 NTD, 1 CTD). After length adjustment, S1-NTD had higher site density than S1-CTD (9/292 vs 1/223; P = 0.049).
- Genera-wide site counts (50 genomes/genus): Putative positively selected sites in S1 greatly exceeded S2 across genera (α: 21 vs 0; β: 56 vs 0; γ: 37 vs 1; δ: 11 vs 4). Within S1, more sites were in NTD than CTD (α: 16 vs 3; β: 33 vs 21; γ: 21 vs 13; δ: 10 vs 0).
- Intra-species S divergence (13 CoVs with known RBD usage): For all species, S1 ω > S2 ω (both < 1). In NTD-binding species (BCoV, HCoV-OC43, MHV, IBV), NTD ω > CTD ω. In CTD-binding species, results were mixed: CTD ω > NTD ω in HCoV-229E, HKU4, and MERS-CoV; NTD ω > CTD ω in CCoV, FCoV, HCoV-NL63, SARS-CoV, and PDCoV. PEDV (uses both) showed NTD ω > CTD ω. Overall, S1-NTD frequently shows the higher ω, consistent with it being a frequent target of positive selection, even when CTD mediates receptor binding.
- Synthesis: Across broader CoV evolution, S1-NTD is often the main selection target, likely reflecting diversifying receptor usage and immune pressure. In contrast, during the ongoing human pandemic, SARS-CoV-2 shows strongest selection in S1-CTD (RBD), consistent with adaptation in ACE2 binding and escape from human immunity (infection and vaccination).
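The length-adjusted comparison of positively selected sites in S1-NTD versus S1-CTD (9/292 vs 1/223) can be checked with a small sketch. The summary does not state which test produced P = 0.049; a two-sided Fisher's exact test on selected versus non-selected sites is assumed here, and the function name is hypothetical. Under that assumption the result should land close to the reported value.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that row 1 contains x of the col1 "selected" sites.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Counts from the text: 9 of 292 NTD sites and 1 of 223 CTD sites are selected.
ntd_selected, ntd_length = 9, 292
ctd_selected, ctd_length = 1, 223

p_value = fisher_exact_two_sided(
    ntd_selected, ntd_length - ntd_selected,
    ctd_selected, ctd_length - ctd_selected,
)
print(f"two-sided Fisher's exact P = {p_value:.4f}")
```

The same function could be applied to the per-genus NTD versus CTD site counts, provided the homologous domain lengths for each genus are known; those lengths are not given in this summary.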
The study demonstrates strong positive selection on the SARS-CoV-2 S gene, especially in S1 and particularly the RBD, aligning with the virus’s rapid adaptation to human hosts, ACE2 affinity changes, and immunity-driven escape. In contrast, comparative analyses across coronavirus genera and closely related CoVs indicate S1-NTD often bears more positive selection than S1-CTD, likely due to frequent shifts in receptor usage (e.g., sialic acid binding preferences) and antigenic drift to evade host antibodies. The divergence between patterns in general CoV evolution and ongoing SARS-CoV-2 evolution suggests that the human host environment, including robust antibody responses and widespread vaccination, has shifted selective pressure toward the RBD. While receptor binding can drive selection in both NTD and CTD, the mixed intra-species patterns in CTD-binding CoVs imply that receptor usage alone does not determine which S1 subdomain is most selected; immune pressure and epidemiological context also play major roles. These findings are relevant for anticipating evolutionary trajectories and guiding vaccine and therapeutic design, particularly focusing on RBD mutations that repeatedly arise under immune selection.
- The S gene is under strong adaptive evolution in SARS-CoV-2 and, more broadly, in coronaviruses; S1 drives much of this signal.
- Across coronavirus genera and in close relatives of SARS-CoV-2, S1-NTD is frequently the main target of positive selection, likely reflecting diversifying receptor usage and immune evasion.
- In ongoing SARS-CoV-2 evolution in humans, positive selection concentrates in S1-CTD (RBD), consistent with adaptation for ACE2 binding and escape from human immune responses driven by extensive infection and vaccination.
- Implications include the need for continued genomic surveillance, monitoring convergent RBD mutations, and updating vaccine strategies to maintain efficacy.
- Future research should quantify the relative contributions of vaccination versus natural infection to immune selection, dissect the roles of receptor-binding changes versus antibody escape in fitness, and further assess recombination and its interplay with selection in S evolution.
- dN/dS-based inference cannot fully distinguish positive selection from relaxation of purifying selection; the authors argue that relaxation is less likely but acknowledge it as a possibility.
- Variant prevalence comparisons are confounded by factors such as regional vaccination coverage, public health policies, and competition among co-circulating strains.
- Pairwise ω estimates required dS thresholds and subregion-specific filters that may exclude some comparisons; SARS-CoV required a lower dS cutoff due to high similarity.
- Recombination was addressed by RDP4 filtering, but undetected recombination could remain.
- CODEML analyses used a subset (50 genomes per genus) for computational feasibility, which may not capture full diversity.
- Some sequence annotations and domain mappings rely on homology to SARS-CoV-2 and may introduce boundary uncertainty.

