Introduction
The ongoing evolution of SARS-CoV-2, the virus responsible for COVID-19, presents a continuous challenge to global public health. The virus adapts to immune pressure from vaccines and prior infections, so tools that can anticipate its future antigenic changes are needed. The SARS-CoV-2 sequence space is too large for experimental approaches alone to cover comprehensively. This paper addresses the challenge by introducing a machine learning-based approach, Machine Learning-guided Antigenic Evolution Prediction (MLAEP), to forecast the antigenic evolution of SARS-CoV-2. The emergence of variants of concern (VOCs) such as Alpha, Beta, Gamma, Delta, and Omicron underscores the need for such predictive capabilities. Each VOC showed a different degree of increased transmissibility, immune evasion, and/or virulence, which illustrates how hard the virus's evolution is to anticipate and why computational methods that can search the sequence space for potential future threats are required. The focus on the spike receptor-binding domain (RBD) is motivated by the fact that a substantial fraction of neutralizing antibodies target this region, making mutations here key drivers of immune escape. Existing computational methods for variant prediction have limitations. Some model the impact of single mutations well but often fail to capture the epistatic interactions among multiple mutations that characterize many VOCs. Others predict risk but cannot explore potential evolutionary pathways. MLAEP aims to overcome these limitations by combining deep mutational scanning (DMS) data, multi-task deep learning, and genetic algorithms into a single forecasting framework.
Literature Review
Several studies have addressed the challenge of predicting SARS-CoV-2 evolution. Deep mutational scanning (DMS) experiments, such as those by Starr et al. and Greaney et al., have provided valuable insights into the effects of individual mutations on RBD binding to ACE2 and various antibodies. However, the cost and time required for wet-lab experiments limit their scalability. Computational methods based on machine learning have emerged as an important complement. Maher et al. employed a computational model to forecast driver mutations for future VOCs, focusing on single-position substitutions. This approach does not fully account for epistatic effects or for the combinatorial mutations found in many high-risk variants such as Omicron. Other studies have used language models trained on evolutionarily related sequences to predict the risk of SARS-CoV-2 variants. While effective for risk monitoring, these methods rely on existing data and do not directly explore previously unseen variants. Taft et al. explored prospective mutations by applying deep learning to receptor-binding motif (RBM) sequences, but their search strategy was limited in scope and in the antibody classes it considered. These previous efforts highlight the need for a method that handles combinatorial mutations, captures epistatic effects, and predicts potential future variants beyond those already observed.
Methodology
The MLAEP framework has three components: (1) a multi-task deep learning model to predict binding specificity; (2) a genetic algorithm to perform *in silico* directed evolution; and (3) *in vitro* validation of predicted variants. The multi-task deep learning model is the core of MLAEP. It simultaneously predicts the binding of RBD variants to ACE2 and to eight antibodies representing four classes, based on both sequence and structural information. The model uses the ESM-1b language model for sequence feature extraction and a structured transformer for structural feature extraction from 3D contact maps. Multi-task learning allows the model to learn shared representations across the nine targets (ACE2 and eight antibodies), improving performance and efficiency. The model is trained on nine deep mutational scanning datasets covering 19,132 RBD variants and their binding affinities. The continuous binding scores were binarized to enhance interpretability and manage class imbalances. The genetic algorithm uses the trained model as a scoring function to simulate antigenic evolution. It starts from existing RBD sequences and iteratively generates new variants with improved antibody escape potential while maintaining ACE2 binding ability. Mutation and crossover operations explore the sequence space, and selection favors variants with higher fitness scores based on the model's predictions. For *in vitro* validation, the predicted variants are expressed and purified, and their binding to neutralizing antibodies is measured with homogeneous time-resolved fluorescence (HTRF) assays. This allows empirical verification of the model's predictions regarding immune evasion.
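The genetic-algorithm step described above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_score` is a hypothetical stand-in for the trained multi-task model, and the sequence, the threshold `ace2_threshold`, the population size, and the mutation and crossover operators are all assumptions chosen only to show the mutate, recombine, and select loop with an ACE2-binding constraint.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq, reference):
    """Hypothetical stand-in for the trained model (NOT the MLAEP network).

    Returns (ace2_binding, antibody_escape), both in [0, 1], computed from
    the number of substitutions relative to the reference sequence.
    """
    n_mut = sum(a != b for a, b in zip(seq, reference))
    escape = min(1.0, n_mut / 5)       # toy rule: more mutations, more escape
    ace2 = max(0.0, 1.0 - n_mut / 10)  # toy rule: mutations erode ACE2 binding
    return ace2, escape


def fitness(seq, reference, ace2_threshold=0.5):
    """Reward antibody escape only while ACE2 binding is retained."""
    ace2, escape = toy_score(seq, reference)
    return escape if ace2 >= ace2_threshold else 0.0


def mutate(seq, rng):
    """Substitute one randomly chosen position with a different residue."""
    pos = rng.randrange(len(seq))
    new = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]


def crossover(a, b, rng):
    """Single-point crossover between two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(reference, pop_size=30, generations=20, seed=0):
    """Run a simple elitist genetic algorithm starting from the reference."""
    rng = random.Random(seed)
    population = [reference] * pop_size
    for _ in range(generations):
        # Selection: keep the fitter half of the population unchanged.
        population.sort(key=lambda s: fitness(s, reference), reverse=True)
        parents = population[: pop_size // 2]
        # Variation: refill the population with mutated recombinants.
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.choice(parents), rng.choice(parents)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = parents + children
    return max(population, key=lambda s: fitness(s, reference))


if __name__ == "__main__":
    ref = "ACDEFGHIKLMNPQRSTVWY"  # toy peptide, not a real RBD sequence
    best = evolve(ref)
    print(best, fitness(best, ref))
```

In this sketch the ACE2 constraint is a hard cutoff inside `fitness`; a weighted sum of the two scores would be an equally plausible design, and the summary does not say which form MLAEP uses.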
Key Findings
The MLAEP model demonstrated robust performance. In five-fold cross-validation experiments, MLAEP outperformed other state-of-the-art methods for predicting ACE2 and antibody binding specificity. The model's predictions were validated using *in vitro* pseudovirus neutralization test (pVNT) data, showing a strong correlation between predicted antibody escape potential and observed immune evasion in VOCs. The MLAEP model also accurately inferred the evolutionary trajectory of existing SARS-CoV-2 variants, as evidenced by a high Spearman correlation (r=0.65, p<1e-308) between model scores and variant sampling time. Notably, this correlation was higher than that obtained using the pseudotime inferred by Evo-velocity (r=0.55), suggesting that the model's prediction scores themselves capture the direction of evolution. *In silico* directed evolution identified novel mutations that are also found in immunocompromised COVID-19 patients and in emerging variants such as XBB.1.5. Furthermore, *in vitro* HTRF-based assays validated the immune evasion potential of synthetic variants generated by MLAEP, even for variants with non-epitope mutations, indicating that the model captures epistatic interactions. Specifically, certain variants showed enhanced evasion of class 4 antibodies despite lacking mutations within the known class 4 epitope region. This suggests that the model can capture higher-order relationships between mutations and immune evasion that are not apparent from a simple epitope map analysis.
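The Spearman correlation reported above is the Pearson correlation computed on ranks, so it measures whether model scores rise monotonically with sampling time rather than whether the relationship is linear. A minimal sketch in plain Python follows; the function names and the example values are illustrative assumptions and are not data from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up illustration: days since first sample vs. a model score.
    sampling_day = [0, 30, 60, 90, 120, 150]
    model_score = [0.10, 0.15, 0.12, 0.40, 0.38, 0.70]
    print(round(spearman(sampling_day, model_score), 3))
```

Because only ranks enter the calculation, any monotonic rescaling of the model scores leaves the coefficient unchanged, which makes it a reasonable choice for comparing scores with sampling time or with a pseudotime ordering.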
Discussion
The results demonstrate the success of MLAEP in predicting SARS-CoV-2 antigenic evolution. The model's accuracy in predicting binding specificity, its ability to infer evolutionary trajectories, and its identification of novel mutations found in both immunocompromised patients and emerging variants strongly support its utility. The finding that model predictions correlate more strongly with variant sampling time than Evo-velocity pseudotime suggests that the model's scores themselves provide valuable directionality regarding antigenic evolution. The *in vitro* validation confirms the model's ability to identify variants with significant immune escape potential, including those with non-epitope mutations or complex epistatic interactions. The predictive power of MLAEP has significant implications for vaccine development and public health preparedness. The ability to forecast potential future variants can inform the design of more effective vaccines and facilitate the development of antiviral therapies.
Conclusion
MLAEP represents a significant advancement in predicting SARS-CoV-2 antigenic evolution. Its integration of deep mutational scanning data, multi-task learning, and genetic algorithms provides a powerful and accurate approach for identifying potential future high-risk variants. The model's success in both *in silico* and *in vitro* validation confirms its predictive power, highlighting its value in guiding vaccine development and public health strategies. Future work could focus on incorporating additional data, such as T-cell responses and epidemiological data, to enhance model accuracy and further refine the prediction of viral evolution. The MLAEP methodology could also be applied to other rapidly evolving pathogens.
Limitations
The study primarily focused on the RBD region, although mutations outside this region can also contribute to viral evolution. The model's predictions are based on the limited set of antibodies used in the training data. The current model predicts the direction of mutational effects rather than their magnitude; improving its ability to quantify the impact of mutations would strengthen its predictive power. The limited availability of comprehensive datasets for ACE2 binding variants also constrained the study.