
Automated high-throughput genome editing platform with an AI learning in situ prediction model

S. Li, J. An, et al.

This study presents an automated high-throughput platform for genome editing that can edit thousands of samples in a week. The system integrates gRNA design with a machine learning model that predicts base editing performance. Conducted by a collaborative team including Siwei Li, Jingjing An, and Meng Wang, the work aims to accelerate the development of BE-based genetic therapies.

Introduction

Many genetic diseases arise from pathogenic single-nucleotide variants (SNVs); ClinVar lists over 37,000 diseases associated with such variants, affecting hundreds of millions worldwide, with most lacking effective treatments. Base editors (BEs), including cytosine base editors (CBEs), adenine base editors (ABEs), and glycosylase base editors (GBEs), enable precise base conversions without introducing double-strand breaks, offering promise for correcting a large fraction of pathogenic SNVs. However, realizing BE-based therapies requires constructing mammalian cell disease models at scale, which is hindered by labor-intensive manual workflows and by predictive models that often ignore endogenous chromatin context. The study aims to create an automated high-throughput in situ genome-editing platform and to develop a machine-learning model (CAELM) that integrates sequence context and chromatin accessibility to predict CBE efficiencies at endogenous loci, thereby enabling scalable model generation and improved prediction of editing outcomes.

Literature Review

Previous high-throughput assessments of base editing often used integrated target libraries (e.g., BeRedit), enabling AI models but lacking in situ chromosomal context. Studies have shown chromatin accessibility strongly influences CRISPR-Cas9 activity: editing is more efficient in euchromatin than heterochromatin; high activity correlates with lower nucleosome occupancy; nucleosomes impede Cas9 binding and cleavage. Pioneer factors can modulate local chromatin to improve base editing. Existing deep learning approaches have predominantly trained on integrated targets and sequence-only features, risking bias and limited generalization to endogenous loci. Automation can reduce errors and costs in large-scale experiments. These insights motivate acquiring large in situ datasets and incorporating chromatin accessibility into predictive models.

Methodology

The automated platform comprises four modules: (1) Computational design of endogenous target gRNAs; (2) Construction of gRNA expression plasmids; (3) Automated base editing in mammalian cells; and (4) Machine learning to build a CBE performance model (CAELM). Equipment includes an acoustic liquid handler, plate sealer, centrifuge, automated thermocycler, automated colony picker (ClonePiK), incubator shaker, and Beckman liquid handlers (I7/T7).

gRNA design: Pathogenic C-to-T or G-to-A SNVs were queried from ClinVar. For each target, a 20-nt spacer was selected to place the SNV within the BE4max editing window (positions 3–18) with an NGC PAM and an additional C2 where applicable. Based on GRCh37 sequences, primer3 was scripted in batch to design PCR primers spanning ±750 bp around targets. A total of 1210 genes/targets were selected; protospacers and primers were batched (Supplementary Data 1–3).
gRNA plasmid construction: Golden Gate cloning was miniaturized (1 µL) and automated by Echo acoustic liquid handling. DH5α (or H1295) competent cells were transformed using automated workflows; colonies were picked via ClonePiK. Sanger sequencing verified constructs; scripts parsed chromatograms, generating CSV picklists to repick incorrect assemblies and lists of correct constructs for plasmid extraction (up to 576 plasmids/day) (Supplementary Data 4).
Automated cell editing: HEK293T (and HepG2) cells were seeded in 96-well plates and co-transfected with gRNA and BE4max plasmids by Beckman I7/T7 protocols. Media exchange and selection steps were automated. After 5 days, cells were lysed; target loci were PCR-amplified and Sanger-sequenced. EditR quantified C-to-T editing efficiencies. Python scripts automated batch processing, with CSVs for re-running failed PCRs with alternate primer pairs (Supplementary Data 6–8).
Machine learning (CAELM): Using 1134 valid in situ editing results, features included one-hot-encoded 20-mer protospacers and chromatin accessibility (average DNase I hypersensitive site density over the target sequence) retrieved from ENCODE (UCSC hg19/GRCh37). An XGBoost Regressor was trained with an 80/20 train/test split; hyperparameters were tuned via GridSearchCV; performance was evaluated by Pearson’s r, with additional 5×5 nested cross-validation. Feature importance from XGBoost quantified contributions of sequence versus chromatin. Models were extended to other CBEs (Anc-BE4max, hyA3-BE4max) and to HepG2 by continual training from a BE4max_HEK293T base model (85/15 splits), excluding sequence overlaps between train and test when applicable.
Cell line model isolation: For selected high-efficiency edits, FACS sorted single cells into 96-well plates using GFP from the BE4max plasmid as a marker; clones were expanded and verified by Sanger sequencing to generate homogeneous SNV cell lines. These cell lines were subsequently used to test ABE-mediated correction.
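
The CAELM input described under Machine learning above (a one-hot-encoded 20-mer protospacer plus an averaged DNase I accessibility value) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the A/C/G/T column order, and the per-base signal list are assumptions made for the example, and the study's actual feature layout may differ.

```python
from typing import List, Sequence

BASES = "ACGT"     # assumed column order for the one-hot encoding
SPACER_LEN = 20    # protospacer length stated in the summary


def one_hot_spacer(spacer: str) -> List[int]:
    """Flatten a 20-nt protospacer into 80 binary values (4 per position)."""
    spacer = spacer.upper()
    if len(spacer) != SPACER_LEN or any(b not in BASES for b in spacer):
        raise ValueError(f"expected a {SPACER_LEN}-nt A/C/G/T spacer, got {spacer!r}")
    return [1 if base == ref else 0 for base in spacer for ref in BASES]


def mean_accessibility(signal: Sequence[float]) -> float:
    """Average a per-base DNase signal over the target interval (hypothetical input)."""
    if not signal:
        raise ValueError("signal must cover at least one base")
    return sum(signal) / len(signal)


def build_features(spacer: str, dnase_signal: Sequence[float]) -> List[float]:
    """Return 81 values: 80 one-hot sequence values followed by 1 accessibility value."""
    sequence_part = [float(v) for v in one_hot_spacer(spacer)]
    return sequence_part + [mean_accessibility(dnase_signal)]


# Toy inputs for illustration only; they are not data from the study.
features = build_features("ACGTACGTACGTACGTACGT", [0.0, 0.5, 1.0, 0.5] * 5)
print(len(features), features[:8], features[-1])
```

A vector built this way is the kind of row that could be passed to a gradient-boosted regressor; because sequence and chromatin occupy separate columns, their relative feature importance can be read off after training.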

Key Findings

Automated vs manual editing: Across 32 genomic loci, the automated high-throughput system achieved comparable or significantly higher C-to-T editing efficiencies than manual workflows; 16 loci were similar, and 14 showed higher efficiencies with automation (n=3 per condition).
Throughput and performance across endogenous loci: Of 1210 targets edited with BE4max in HEK293T, 823 exhibited at least 10% C-to-T conversion (including 248 at approximately 50% or higher), 136 had 5–10%, 175 were <5%, and 76 lacked results due to PCR failure, leaving 1134 valid results (Supplementary Data 9). Performance variability across loci matched prior observations.
APOBEC3A-nCas9 rescues methylation-associated low-efficiency sites: From 175 low-efficiency targets, 80 were retested with APOBEC3A-nCas9; 23 showed higher efficiencies with APOBEC3A-nCas9 (mean 29.1%) compared with BE4max (mean 3.4%).
CAELM prediction accuracy for BE4max in HEK293T: Using 1134 in situ data points with sequence and chromatin accessibility features, CAELM achieved Pearson’s r = 0.64 on held-out test data; 5×5 nested cross-validation also averaged r = 0.64.
Comparison to BE-Hive: On 209 testing loci, BE-Hive predictions (sequence-only) correlated less with in situ outcomes (r = 0.53) than CAELM.
Generalization to other editors and cell types: Continual training produced good correlations: HEK293T—Anc-BE4max r = 0.86, hyA3-BE4max r = 0.72; HepG2—BE4max r = 0.70, Anc-BE4max r = 0.87, hyA3-BE4max r = 0.42.
Feature importance: Chromatin accessibility contributed a measurable but smaller share of predictive power than sequence context, with a relative contribution of less than about 16%.
Isolation of disease SNV cell models and ABE correction: From ten edited pools, single-cell sorting yielded an average survival of 16.7% ± 4.4% post-sort and 68.7% ± 4.2% after expansion; 47.30% ± 5.18% of sequenced clones had 100% target C:G-to-T:A conversion. Nine homogeneous disease SNV cell models were established. Subsequent ABE-mediated correction achieved an average efficiency of 62%, reaching 98% and 91% in two models, with a minimum of 30% in another.
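
The accuracy figures above are Pearson correlations between predicted and measured editing efficiencies. A minimal sketch of that metric is given below, assuming two equal-length lists of efficiencies; the function name and the toy values are illustrative and are not taken from the study.

```python
import math
from typing import Sequence


def pearson_r(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sums of centered cross-products and squares; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_p * var_m)


# Toy efficiencies (percent C-to-T) for illustration only.
toy_predicted = [10.0, 20.0, 30.0, 40.0]
toy_measured = [12.0, 18.0, 33.0, 37.0]
print(round(pearson_r(toy_predicted, toy_measured), 3))
```

Because r measures linear association only, a model can rank loci well and still be biased in absolute efficiency, so a high r does not by itself imply accurate calibration.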

Discussion

The automated platform streamlines and standardizes the end-to-end process of generating in situ edited mammalian cells, addressing the main bottlenecks of manual editing: time, cost, variability, and errors. The resulting large, consistent dataset enables building predictive models that incorporate endogenous chromatin context. CAELM’s integration of DNase-based accessibility with sequence captures key determinants of editing efficiencies, achieving moderate-to-strong correlations that outperform sequence-only tools (e.g., BE-Hive) on in situ data. The findings confirm that while target sequence is the dominant determinant, chromatin accessibility measurably influences base editing outcomes and should be considered in predictive modeling. The platform also rapidly generates homogeneous disease SNV cell lines suitable for downstream therapeutic development and benchmarking of editors, as demonstrated by effective ABE-mediated corrections in multiple models. Extending CAELM through continual learning allows adaptation to different CBEs and cell types, supporting broader applicability.

Conclusion

This work delivers (1) an automated high-throughput in situ genome editing platform for mammalian cells capable of processing thousands of samples per week with efficiencies comparable or superior to manual methods; and (2) CAELM, a machine-learning model that integrates sequence context and chromatin accessibility to predict CBE editing efficiencies at endogenous loci. The platform generates disease-relevant cell models at scale and provides high-quality in situ datasets for modeling. CAELM outperforms sequence-only predictors on endogenous data and can be adapted to additional editors and cell types via continual training. Future directions include increasing dataset size and diversity across loci, cell types, and editors; incorporating additional epigenomic features (e.g., methylation, histone marks); improving assays with deeper sequencing; and expanding deployment as accessible tools and web services.

Limitations

Dataset size and assay modality: The primary CAELM model was trained on 1134 in situ targets quantified by Sanger-based EditR, which provides limited resolution compared to deep sequencing. This likely contributes to lower correlations than some prior models trained on larger NGS datasets.
Incomplete coverage and experimental failures: PCR amplification failed for 76 targets; 175 targets showed <5% editing with BE4max, constraining training data diversity.
Model scope: Initial training focused on CBEs (BE4max and variants) in HEK293T and HepG2; performance for other editors, PAMs, or cell types may require additional data and continual training.
Chromatin feature simplification: Chromatin accessibility was represented by averaged DNase I hypersensitivity signals over target regions from ENCODE; this may not capture locus-specific or dynamic chromatin states fully.
Reporting inconsistencies: Some counts and subsets (e.g., numbers of targets in certain analyses) contain minor inconsistencies, which may reflect batching or subset usage in figures/supplementary tables.
