Predicting the antigenic evolution of SARS-COV-2 with deep learning

W. Han, N. Chen, et al.

This study by Wenkai Han, Ningning Chen, Xinzhou Xu, and others introduces Machine Learning-guided Antigenic Evolution Prediction (MLAEP), a framework that predicts viral fitness and identifies novel SARS-CoV-2 mutations, with the aim of informing vaccine development and preparedness against future variants.

Introduction
SARS-CoV-2 has accumulated numerous mutations as it spread globally, with some combinations enhancing ACE2 binding, transmissibility, and especially immune escape. A large fraction of neutralizing antibodies target the spike RBD, with four epitope classes that are differentially impacted by mutations; recent variants (e.g., Omicron) show substantial loss of neutralization across classes. While deep mutational scanning (DMS) experiments provide detailed single-mutation effects on ACE2 and monoclonal antibody binding, they are resource-intensive and do not scale to combinatorial sequence space, where epistasis is critical. Existing computational approaches forecast risks largely at the single-mutation level or rely on language models trained on evolutionary sequences, limiting perspective on multi-mutant antigenic evolution and often omitting certain antibody classes/epitopes. This study hypothesizes that under high immune pressure, short-term antigenic evolution favors increased antibody escape without major loss of ACE2 binding. The goal is to learn a fitness landscape over RBD that captures ACE2 and multi-class antibody binding changes and to search this landscape to forecast plausible high-risk, multi-mutation variants.
Literature Review
Prior work includes DMS mapping of RBD mutations for ACE2 and antibody binding (Starr et al.; Greaney et al.), which quantifies single-site effects but does not allow combinatorial exploration. Computational risk modeling at the single-mutant level (e.g., Maher et al.) identifies potential drivers but does not capture epistasis in multi-mutant VOCs. Unsupervised protein language models (e.g., ESM-1b; Hie et al.) can estimate evolutionary likelihoods and risk signals, and methods combining sequence models with structural modeling (Karim et al.) monitor existing variants but provide limited prospective design. Deep learning on RBM sequences (Taft et al.) offered predictive profiles for ACE2 binding and escape for some antibody classes, but it searched a restricted region, missed many class 3 epitopes, and did not include class 4. The need remains for a comprehensive, prospective framework spanning the full RBD and multiple antibody classes to propose plausible antigenic trajectories.
Methodology
Datasets: Nine DMS datasets measuring the binding affinity of RBD variants to ACE2 and to eight monoclonal antibodies (covering classes 1–4) were curated, cleaned, and binarized into functional labels (ACE2: enhanced vs. not enhanced relative to WT; antibodies: escaped vs. non-escaped, assigned by a mixture-of-Gaussians fit on log-scores). The final dataset included 19,132 RBD sequences, each with nine labels, with class imbalance quantified for each task. External validation used pVNT data (17 mAbs across 10 VOC pseudoviruses) reporting fold-change in IC50.
Model: A supervised multi-task deep neural network predicts binding specificity across nine targets. Sequence features are extracted by fine-tuning ESM-1b on RBD sequences. Binding partner structures (ACE2/antibodies) are converted to k-NN graphs from 3D contact/biophysical properties and encoded by a Structured Transformer. Joint sequence-structure representations feed nine parallel classification heads (hard parameter sharing) optimized end-to-end with class-weighted binary cross-entropy. Training used AdamW, dropout, weight decay, a warmup schedule, gradient accumulation, and weighted sampling to address imbalance, with model selection by macro-F1 across tasks.
Search (in silico directed evolution): The trained model provides a fitness score, the average of the nine task probabilities (ACE2 binding and antibody escape). A modified genetic algorithm initializes from recent GISAID RBD sequences or from Delta, perturbs sequences guided by BLOSUM62 neighborhoods and model-scored mutation choices, applies selection proportional to fitness, and performs crossover, iterating within a 15-mutation trust radius to yield high-fitness candidates. Searches seeded from the GISAID subset sampled Jan 1–Mar 8, 2022 were repeated to generate 38,870 unique variants; Delta-based runs generated 3,876 variants for wet-lab selection.
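The search loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `NEIGHBOURS` table is a small hypothetical stand-in for a BLOSUM62 neighborhood, the per-task scorers are placeholders for the trained model's nine output heads, and the names `mean_task_fitness`, `mutate`, `crossover`, and `evolve` are invented for illustration. The sketch mutates at random positions and omits the model-scored choice of mutations. It keeps the three ingredients named in the summary: fitness as the mean of task probabilities, fitness-proportional selection with crossover, and a trust radius around the starting sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical stand-in for a BLOSUM62 neighbourhood (assumption, not the real matrix):
# each residue maps to a few conservative substitutions.
NEIGHBOURS = {
    "K": "RQE", "R": "KQ", "E": "DQK", "D": "EN", "Q": "EKR", "N": "DS",
    "L": "IMV", "I": "LVM", "V": "IL", "F": "YW", "Y": "FW", "S": "TA", "T": "S",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_task_fitness(seq, task_scorers):
    """Fitness as the average of per-task probabilities (placeholder scorers)."""
    probs = [scorer(seq) for scorer in task_scorers]
    return sum(probs) / len(probs)


def mutate(seq, rng):
    """Substitute one random position, preferring neighbourhood residues."""
    pos = rng.randrange(len(seq))
    choices = [c for c in NEIGHBOURS.get(seq[pos], AMINO_ACIDS) if c != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def crossover(a, b, rng):
    """Single-point crossover between two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(seed, fitness, generations=30, pop_size=40, radius=15, rng=None):
    """Toy genetic algorithm constrained to a Hamming trust radius around `seed`."""
    rng = rng or random.Random(0)
    population = [seed] * pop_size
    for _ in range(generations):
        # Fitness-proportional selection; the epsilon avoids all-zero weights.
        weights = [fitness(s) + 1e-9 for s in population]
        parents = rng.choices(population, weights=weights, k=2 * pop_size)
        children = []
        for i in range(pop_size):
            child = mutate(crossover(parents[2 * i], parents[2 * i + 1], rng), rng)
            # Trust radius: fall back to a parent if the child drifts too far.
            children.append(child if hamming(child, seed) <= radius else parents[2 * i])
        population = children
    # Unique candidates, best first.
    return sorted(set(population), key=fitness, reverse=True)
```

In a realistic setting the `fitness` callable would wrap the trained multi-task model, and `seed` would be a full RBD sequence rather than a short toy string.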
Evolutionary analysis and visualization: Evo-velocity was applied to existing GISAID RBD sequences using either ESM-1b or the model-derived embeddings, with model scores setting the direction, to infer a vector field and pseudotime, visualized with UMAP. Diversity and positional shifts between generated and initial sequences were assessed via distance-preserving MDS and probability-weighted Kullback–Leibler logo plots mapped onto the RBD structure (PDB 6M0J).
Computational docking: Top-scoring generated RBDs were homology-modeled (SWISS-MODEL) and docked to representative class 1–4 antibodies (LY-CoV16, LY-CoV555, S309, CR3022) using Rosetta SnugDock, with 1,000 replicas per pair, and the interface scores were analyzed.
Wet-lab validation: Eight synthetic RBD variants (plus WT and Delta) were expressed and purified alongside eight neutralizing mAbs (two per class). A homogeneous time-resolved fluorescence (HTRF) binding assay measured mAb–RBD binding dose-response and IC50, evaluating loss of binding/escape, including epistatic and non-epitope mutation effects.
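The positional-shift analysis between generated and initial sequences rests on a per-position Kullback–Leibler divergence. The sketch below shows one plausible way to compute it, assuming aligned, equal-length sequences containing only the 20 standard residues. The pseudocount and the function names `column_frequencies` and `positional_kl` are assumptions for illustration; the summary does not specify the exact weighting used for the logo plots.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(seqs, pos, pseudocount=1.0):
    """Smoothed amino-acid frequencies at one alignment column."""
    counts = Counter(s[pos] for s in seqs)
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in AMINO_ACIDS}


def positional_kl(generated, initial, pseudocount=1.0):
    """KL(generated || initial) in bits for every position of the alignment."""
    divergences = []
    for pos in range(len(initial[0])):
        p = column_frequencies(generated, pos, pseudocount)
        q = column_frequencies(initial, pos, pseudocount)
        # Each residue contributes p * log2(p / q); the sum is the column's KL.
        divergences.append(sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS))
    return divergences
```

Positions with large divergence are those where the generated variants most strongly depart from the starting population, which is the kind of signal that can then be mapped onto a structure.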
Key Findings
- Performance: The multi-task model, integrating fine-tuned ESM-1b sequence features with structural features, outperformed augmented Potts, gUniRep, eUniRep, CNN, RNN, LSTM, linear regression, SVM, and random forest baselines across nine imbalanced classification tasks by macro-precision/recall/F1 (Fig. 2a). Ablations showed both fine-tuning and structure features are critical. External DMS validations showed consistent performance.
- pVNT validation: Predicted escape potential correlated strongly with observed log fold-change in IC50 from pseudovirus neutralization tests across VOCs and antibodies (e.g., class 4 antibody 10-40) (Fig. 2b).
- Evolutionary dynamics: Using Evo-velocity on GISAID RBD sequences (Dec 2019–Mar 2022), embeddings clustered VOCs (Alpha, Beta, Delta, Omicron) with vector directions matching known trajectories. Model-derived pseudotime correlated with sampling time (Spearman ~0.55), and model prediction scores alone correlated even more strongly with sampling time (Spearman r=0.65, p<1e-308). Predicted antibody escape potential correlated with time (r=0.67, p<1e-308), highlighting increasing importance during waves (e.g., Omicron). ACE2-binding score was more informative early and less so with Delta/Omicron emergence.
- In silico variants: Genetic algorithm generated 38,870 high-fitness RBD variants (within 15 mutations of initial sequences). Generated mutations overlapped with those observed in immunocompromised patients (e.g., R493Q, E340K, E484T, G485R, F490L/E484G) and emerging lineages (BA.4/5: L452R, F486V, R493Q; XBB.1.5: F486P), suggesting realistic evolutionary moves.
- Structural/computational validation: Docking and additional computational analyses supported high escape potential of generated variants.
- In vitro binding assay: Eight MLAEP-designed variants (Delta-based background) showed markedly reduced or abolished binding to representative mAbs across all four classes in HTRF assays. Several variants exhibited complete escape (IC50 >1000 nM) for multiple antibodies. Notably, variants such as RBD8 escaped class 3 antibodies and RBD4/RBD7/RBD8/RBD9 reduced class 4 binding despite lacking direct epitope mutations, evidencing epistasis and non-epitope effects captured by the model.
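The correlations with sampling time reported above are Spearman rank correlations, that is, the Pearson correlation of the ranks of the two variables. The sketch below is a plain-Python illustration with average ranks for ties. The function names are chosen here for illustration, and no claim is made about the software the authors used.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` might hold per-sequence model scores and `y` the corresponding sampling dates encoded as day counts. Because only ranks matter, any monotonic rescaling of the scores leaves the coefficient unchanged.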
Discussion
The findings support the central hypothesis that under immune selection, SARS-CoV-2 variants evolve toward higher antibody escape while maintaining ACE2 binding. The multi-task model captures antigenically meaningful features and epistatic interactions, enabling both monitoring of existing variant risk and prospective identification of high-risk combinatorial mutations across the full RBD and antibody classes 1–4. Strong correlations between predicted escape/fitness and real-world sampling time, as well as agreement with Evo-velocity trajectories, connect model-derived signals to observed evolutionary dynamics. The generation of variants recapitulating mutations from immunocompromised hosts and emerging lineages, and the substantial loss of mAb binding in wet-lab assays (including non-epitope-mediated escape), demonstrate the practical predictive value of MLAEP. These insights can inform surveillance, therapeutic antibody design, and vaccine antigen updates by highlighting likely antigenic trajectories and escape mechanisms.
Conclusion
MLAEP integrates structure-aware multi-task learning with directed sequence search to predict SARS-CoV-2 antigenic evolution. It accurately forecasts ACE2 binding and multi-class antibody escape, orders variants along evolutionary trajectories, and proposes realistic high-risk variants whose immune evasion was validated in vitro. The framework recovered key mutations found in chronic infections and recent VOCs (e.g., BA.4/5, XBB.1.5) and revealed epistatic, non-epitope escape. Future work will model quantitative effect sizes, extend beyond RBD to broader spike/viral regions, incorporate additional fitness components (e.g., epidemiological traits, T-cell responses), expand ACE2 variant data, and regularly update with new DMS and in vivo co-evolution datasets. MLAEP can generalize to other rapidly evolving pathogens and resistance problems, aiding public health preparedness and vaccine/therapeutic design.
Limitations
- The model predicts directionality (increase/decrease in binding) rather than quantitative magnitudes of effects.
- Focus is limited to the RBD; many impactful mutations reside outside RBD and are not modeled here.
- Fitness optimization considered only ACE2 binding and antibody escape; other drivers (e.g., transmissibility determinants beyond ACE2, viral fitness costs, T-cell immunity) were not explicitly modeled.
- Limited availability of variant ACE2 datasets constrained fitness landscape coverage.
- Antibody panels evolve over time as therapeutic and population immunity change; continual model updates are required.
- Genetic algorithm search space bounded by a 15-mutation radius and starting sequence choices may miss distant but plausible solutions.