Biology

Optimization of C-to-G base editors with sequence context preference predictable by machine learning methods

T. Yuan, N. Yan, et al.

This study reports efficient and precise C-to-G base editors engineered for high fidelity and predictable outcomes. The author team includes researchers from institutes in Shenzhen and Shanghai.

Introduction

The study addresses the need for efficient tools to perform C-to-G transversion edits, which are not enabled by canonical cytosine or adenine base editors but are required to potentially correct a large fraction of pathogenic point mutations. Prior work showed that replacing UGI with UNG in CBEs can yield C-to-G edits but only at limited targets and without clear rules for efficient editing. This work aims to optimize C-to-G base editors (CGBEs) for higher efficiency and fidelity, define sequence context preferences governing editing outcomes, and develop machine learning models to predict editing efficiency based on local sequence context. The goal is to expand base editing capabilities and provide rational design criteria for C-to-G editing in research and therapeutic contexts.

Literature Review

Base editors (CBEs and ABEs) allow C-to-T and A-to-G conversions, respectively, but not C-to-G or A-to-T transversions. Earlier reports demonstrated C-to-G edits by substituting UGI with UNG in CBEs, though with limited efficiency and scope. BE3 has been associated with off-target DNA and RNA mutations, which can be mitigated by engineering rAPOBEC1 (e.g., YE1 mutations W90Y, R126E). Prime editors can perform diverse edits but may be less efficient for certain C-to-G changes. Variants of deaminases (APOBEC1, APOBEC3A, APOBEC3G) have distinct motif preferences (e.g., APOBEC3A prefers TCN; APOBEC3G favors C-rich motifs), suggesting that deaminase choice can shape sequence context dependence. Computational models (BE-Hive, DeepCBE) and machine learning have previously predicted editing outcomes for CBEs and ABEs; however, predictive models tailored to CGBEs were lacking.

Methodology

Engineering of CGBEs: Constructed variants by replacing UGI with UNGs from different species (human, E. coli, mouse, C. elegans) in BE3; introduced rAPOBEC1 YE1 mutations (W90Y, R126E) to reduce off-targets; added an N-terminal FNLS for enhanced nuclear localization; performed codon optimization; tested domain positioning by fusing UNG at N-terminus versus C-terminus. Key optimized constructs include FNLS-eUNG-YE1-CGBE and FNLS-cUNG-YE1-CGBE, termed eOPTI-CGBE and cOPTI-CGBE.
Cell-based assays: Assessed editing at 34 endogenous HEK293T targets; quantified C-to-G, C-to-A, C-to-T, and indels; mapped editing windows; compared with a published CGBE (CGBE1) and prime editors (PE2/PE3).
Off-target assessments: Applied GOTI in mouse embryos to measure genome-wide DNA SNVs; performed RNA-seq to quantify transcriptome-wide RNA SNVs; evaluated sgRNA-dependent off-targets with Cas-OFFinder predictions and targeted sequencing.
Motif preference experiments: Analyzed the sequence context of successful edits; tested 20 additional sites carrying the putative preferred WCW motif; extended the analysis to alternative deaminases (eA3A and APOBEC3G, both full-length and CTD) and to broader PAM recognition using Cas9n-NG, spG, and xCas9n.
High-throughput library screen: Used a paired sgRNA-target library (41,388 sequences) in HEK293T; lentiviral delivery of library; transfection with eight OPTI-CGBEs; deep sequencing to quantify editing at positions 1–20 (PAM at 21–23); analyzed motif effects for positions 4–7 with coverage >100×.
Computational modeling:
- Logistic regression: One-hot-encoded sequence features around targeted C (positions 4–7) to learn motif weights; trained on 80% of data and tested on 20%.
- Deep learning (CGBE-SMART): Position-centric neural networks with multiple window sizes (7–11 nt) that embed nucleotides into 16-dimensional vectors, followed by two hidden layers (256 and 128 units, ReLU) and a sigmoid output per position; dropout of 30%; trained to minimize the MSE of editing efficiency; integrated with a Bayesian/Markov network that models dependencies among adjacent edited positions to predict outcome proportions.
- Data splits: Exogenous library split 6:1:3 (train:val:test) or by a trisection scheme; a separate model trained for each of the eight CGBEs; benchmarked against BE-Hive and DeepCBE; also trained to predict C-to-T editing on published CBE datasets.
In vivo embryo editing: In vitro transcription of eOPTI-/cOPTI-CGBE mRNA and sgRNAs; microinjection into mouse zygotes or two-cell embryos targeting Tyr (three sites); monitored development; deep sequencing/Sanger validation; phenotypic assessment (coat color) and germline transmission; applied CGBE-SMART to predict efficiencies at Tyr sites.
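The one-hot encoding and the 6:1:3 data split described under Computational modeling can be illustrated with a minimal sketch. It is not the authors' code: the function names (`one_hot`, `context_window`, `split_6_1_3`), the `flank` and `seed` parameters, and the example protospacer are all hypothetical, and the split uses a simple seeded shuffle as an assumption about how the partition might be drawn.

```python
import random

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Flatten a DNA string into a 4*len(seq) one-hot vector (order A, C, G, T)."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vec

def context_window(protospacer, target_pos, flank=1):
    """Return the bases from target_pos-flank to target_pos+flank (1-based positions)."""
    start = target_pos - flank - 1
    end = target_pos + flank
    if start < 0 or end > len(protospacer):
        raise ValueError("window extends past the protospacer")
    return protospacer[start:end]

def split_6_1_3(items, seed=0):
    """Shuffle items and split them into train/validation/test at a 6:1:3 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle: an assumption, for reproducibility
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical 20-nt protospacer with a C at position 6.
protospacer = "GATCACGTTAGCATCGATCG"
features = one_hot(context_window(protospacer, target_pos=6))
train, val, test = split_6_1_3(range(100))
```

Here `features` is a 12-element vector for the trinucleotide around the target C. A regression or neural model would take such vectors as input; wider windows (for example 7–11 nt) only change `flank`.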

Key Findings

Engineering and efficiency:
- eUNG and cUNG outperformed human UNG for C-to-G editing at 34 HEK293T sites.
- Introducing YE1 mutations reduced bystander C-to-A and C-to-T edits and increased product purity (C-to-G/C-to-others).
- FNLS and codon optimization increased expression and editing; fusing UNG at the N-terminus (FNLS-eUNG-YE1-CGBE; FNLS-cUNG-YE1-CGBE) further raised editing efficiency (e.g., an average of 22.7% for the optimized constructs) and purity, with the editing window narrowed to protospacer positions 4–7.
- Optimized CGBEs outperformed CGBE1 at positions 5–6 and had higher product purity; prime editors PE2/PE3 were substantially less efficient at tested targets and PE3 had higher indel rates.
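As a rough illustration of the efficiency and purity metrics used above, the sketch below computes both from per-base read counts at a single target cytosine. The definitions are assumptions: efficiency is taken as C-to-G reads over all counted reads, and purity as C-to-G reads over all substituted reads (C-to-G, C-to-T, C-to-A). The summary's "C-to-G/C-to-others" could instead denote a ratio against the other products only. Indel reads are ignored here, and the function name and counts are hypothetical.

```python
def editing_summary(counts):
    """Summarize editing at one target C from read counts keyed by observed base.

    Assumed definitions (not confirmed by the source):
      efficiency = C-to-G reads / all counted reads
      purity     = C-to-G reads / (C-to-G + C-to-T + C-to-A reads)
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads at this position")
    edited = counts.get("G", 0) + counts.get("T", 0) + counts.get("A", 0)
    c_to_g = counts.get("G", 0)
    return {
        "c_to_g_efficiency": c_to_g / total,
        "purity": c_to_g / edited if edited else 0.0,
    }

# Hypothetical counts: 700 unedited reads, 240 C-to-G, 40 C-to-T, 20 C-to-A.
summary = editing_summary({"C": 700, "G": 240, "T": 40, "A": 20})
```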
Off-target profiles:
- GOTI showed DNA SNV counts at spontaneous levels, far below BE3; no mutation bias observed; RNA-seq showed no increase in RNA SNVs or bias relative to controls; no obvious sgRNA-dependent off-targets detected.
Motif preferences:
- eOPTI- and cOPTI-CGBE preferentially edited WCW motifs, with cOPTI showing stronger T preference at W. Targets with WCW had ~3.2-fold (eOPTI) and ~2.8-fold (cOPTI) higher on-target C-to-G efficiency, and reduced bystander and indels.
- eA3A-OPTI-CGBEs preferred TCW motifs, consistent with APOBEC3A behavior.
- APOBEC3G variants (full-length and CTD) preferred CCN motifs, with ~3–5-fold higher efficiency at CCN sites; when three or more consecutive Cs were present, the third C was edited most efficiently, in contrast to the hA3G-CBE preference for C-to-T editing of the second C.
- Cas9n-NG and spG expanded NG PAM targeting with higher efficiency than xCas9n; Cas9n-NG had lower indels than spG, making it preferred for NG PAM sites.
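The motif preferences above use IUPAC codes (W = A or T, N = any base). A minimal sketch of screening a protospacer for cytosines in the 4–7 editing window whose context matches a given motif might look like the following. The function names, the `c_index` parameter (the 0-based position of the edited C within the motif), and the example sequence are hypothetical. For WCW and TCW the edited C is the central base; which C is meant in CCN is left to the caller, since the summary does not state it.

```python
# Subset of IUPAC nucleotide codes needed for the motifs discussed here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "N": "ACGT"}

def matches(motif, seq):
    """True if seq has the motif's length and every base fits its IUPAC code."""
    return len(motif) == len(seq) and all(b in IUPAC[m] for m, b in zip(motif, seq))

def motif_cytosines(protospacer, motif, c_index=1, window=(4, 7)):
    """Return 1-based positions of Cs inside `window` whose context matches `motif`.

    c_index is the 0-based position of the edited C within the motif
    (1 for the central C of a trinucleotide such as WCW or TCW).
    """
    hits = []
    for pos in range(window[0], window[1] + 1):
        start = pos - 1 - c_index
        if start < 0 or protospacer[pos - 1] != "C":
            continue
        if matches(motif, protospacer[start:start + len(motif)]):
            hits.append(pos)
    return hits

# Hypothetical protospacer: C5 sits in ACT, C7 sits in TCA.
protospacer = "GGAACTCAAGGGTACGATCA"
wcw_hits = motif_cytosines(protospacer, "WCW")  # positions 5 and 7
tcw_hits = motif_cytosines(protospacer, "TCW")  # position 7 only
```

Under the reported preferences, a site like position 7 would be a candidate for both APOBEC1-based and eA3A-based editors, while position 5 matches only WCW.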
High-throughput and modeling:
- Library screen confirmed motif preferences for positions 4–7 across eight OPTI-CGBEs.
- Logistic regression explained ~20–30% of the variance in test sets, and its learned weights were visualized as motif logos: WCW (eOPTI), TCW (cOPTI and eA3A variants), CCN (A3G variants).
- CGBE-SMART achieved Pearson R ~0.20–0.60 for per-site efficiency and 0.37–0.60 for outcome proportions on test library data; cOPTI-CGBE had best efficiency prediction among the eight.
- CGBE-SMART outperformed BE-Hive and DeepCBE on seven of the CGBE models (average R ~0.47 vs 0.15 vs 0.33), the exception being hA3G-CTD-cOPTI.
- CGBE-SMART also generalized to C-to-T datasets with high accuracy (average R ~0.75 across four datasets), performing comparably to or better than prior models depending on the dataset.
- On 80 endogenous sites, averaged R ~0.64 (efficiency) and ~0.66 (proportion) between predictions and observations.
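The R values reported above are Pearson correlations between predicted and observed values. For reference, a minimal stdlib sketch of that statistic is shown below. The function name and the example numbers are hypothetical and are not taken from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predicted vs. observed C-to-G efficiencies (fractions) at five sites.
predicted = [0.10, 0.25, 0.05, 0.40, 0.30]
observed = [0.12, 0.20, 0.08, 0.35, 0.33]
r = pearson_r(predicted, observed)
```

Because Pearson R measures linear association only, a high R does not by itself imply that predicted efficiencies are calibrated in absolute terms.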
In vivo embryo editing:
- Both eOPTI- and cOPTI-CGBE achieved high C-to-G editing at three Tyr sites; two-cell injections significantly increased editing relative to zygote injections.
- cOPTI-CGBE had lower indel frequency than eOPTI-CGBE in embryos, consistent with cell data.
- Tyr-C editing introduced a stop codon yielding albino phenotype; two-cell injections produced higher rates of mosaic and white coat color; F1 progeny from mosaic F0s yielded >50% white offspring.
- Model predictions matched two of three Tyr targets (Tyr-A, Tyr-B); Tyr-C was overpredicted, suggesting in vivo factors beyond sequence context.

Discussion

The study demonstrates that rational engineering of CGBEs—optimizing UNG species, deaminase mutations, domain positioning, and expression—yields high-efficiency, high-purity C-to-G base editors with minimized off-target effects. Crucially, editing outcomes are strongly influenced by local sequence context, with clear motif preferences (WCW for APOBEC1-based, TCW for APOBEC3A-based, CCN for APOBEC3G-based editors). Recognizing and leveraging these motifs enables more predictable and efficient editing. The deep-learning model CGBE-SMART captures these dependencies and accurately predicts per-site efficiencies and outcome distributions across exogenous libraries and endogenous sites, facilitating sgRNA/editor selection. In vivo applications in mouse embryos validate the editors’ practical utility, showing efficient generation of phenotypic edits and germline transmission. These findings address the need for robust C-to-G editing and provide computational tools to guide target design, advancing both basic research and potential therapeutic genome editing.

Conclusion

This work delivers optimized C-to-G base editors (OPTI-CGBEs) with elevated efficiency, narrowed and favorable editing windows, improved product purity, and minimal detectable off-target effects. It establishes sequence motif preferences that guide target selection and introduces CGBE-SMART, a machine-learning framework that predicts editing efficiency and outcome proportions, generalizing across editors and targets, including endogenous loci. The editors function efficiently in mouse embryos, enabling heritable phenotypic edits. Future research should investigate chromatin state, epigenetic regulation, and DNA repair pathway influences on editing efficiency in diverse cell types and tissues, refine models with additional in vivo datasets, expand PAM compatibility and delivery modalities, and further minimize indels and bystander edits to enhance therapeutic applicability.

Limitations

High-throughput library assays sometimes showed low absolute C-to-G efficiencies, potentially underestimating model performance and not fully reflecting optimized endogenous editing.
Model predictions are based primarily on sequence context; discrepancies (e.g., Tyr-C) indicate that in vivo factors such as chromatin accessibility, epigenetic state, and DNA repair activity significantly affect outcomes.
Performance varies across editor variants (e.g., lower prediction accuracy for hA3G-CTD-cOPTI in comparisons), suggesting editor-specific nuances not fully captured.
Off-target analyses, while comprehensive (GOTI, RNA-seq), cannot completely exclude rare or context-specific off-target events.
Generalizability to other cell types and organisms may require retraining or calibration due to experimental condition differences (transfection/injection methods, expression levels).
