Machine-learning approach expands the repertoire of anti-CRISPR protein families

A. B. Gussow, A. E. Park, et al.

This study presents a machine-learning approach to predicting anti-CRISPR proteins (Acrs), the viral inhibitors of CRISPR-Cas systems. The approach identified 2,500 candidate Acr families, and experiments confirmed two new Acrs, expanding the known repertoire of proteins that counteract bacterial immunity. The research was conducted by Ayal B. Gussow and colleagues.

Introduction

CRISPR-Cas systems provide adaptive immunity in many bacteria and most archaea and have been widely repurposed for genome engineering. Their viral antagonists—anti-CRISPR proteins (Acrs)—inhibit CRISPR-Cas via diverse mechanisms, ranging from DNA mimicry to enzymatic modifications, and can serve as modulators of genome editing. However, Acrs are typically short, rapidly evolving, and highly diverse, which makes comprehensive bioinformatic discovery challenging. Prior discovery strategies have included: (i) guilt-by-association leveraging conserved anti-CRISPR associated (Aca) helix–turn–helix (HTH) regulators adjacent to Acr loci, (ii) searches in self-targeting genomes where CRISPR spacers match the host, implying the presence of Acrs, and (iii) homology-based expansion from known families. Despite successes, these approaches are not comprehensive. This study aims to develop and validate a machine-learning framework that integrates generalizable features of Acrs to predict novel Acr families at scale and to experimentally validate top candidates.

Literature Review

Known Acrs (approximately 46 families at the time) are encoded mainly by phages and other mobile genetic elements, often clustered in short, co-directional operons with small genes and an adjacent Aca HTH regulator. Some Acrs are narrow-spectrum inhibitors while others (e.g., DNA mimics) show broader specificity. Enzymatic Acrs (e.g., acyltransferases, nucleases) have also been reported. Previous discovery relied on: (1) homology searches from known Acrs; (2) guilt-by-association via Aca markers; and (3) enrichment in self-targeting hosts that would otherwise be killed by their own CRISPR-Cas unless Acrs are present. While effective, these methods can miss distant or atypical Acrs due to extreme sequence variability and sparse distinctive motifs. This necessitates a more systematic, feature-based approach capable of generalizing beyond sequence similarity.

Methodology

Data curation and positive/negative sets: Initially, 39 known Acr families were curated; PSI-BLAST expansion against NCBI NR produced a high-confidence positive set of 3564 proteins spanning 32 families (after excluding large/functionally assigned proteins). For modeling, the training positive set included Acrs known before September 2018 (n=2775; 26 families). The negative set (n=2600) comprised representative proteins randomly sampled from 1000 prokaryotic genomes and 4000 CRISPR-Cas-containing prokaryotic genomes, clustered to avoid oversampling. An unseen test set consisted of Acrs published after September 2018 (positive n=879; negative n=600). Class weights were balanced during training.
Features: Twelve features were computed, capturing protein- and operon-level properties. Final selected features (via a genetic algorithm) included: presence in self-targeting genome; fraction of annotated proteins in the operon (direction); mean protein length in the direction; mean membrane-association fraction in the direction; direction size; protein annotation status; presence of downstream HTH protein; protein membrane association; protein length; and mean hydrophobicity. Acrs are typically short (weighted mean ~104 aa), occur in short co-directional operons with small, mostly unannotated genes, and often have an adjacent HTH regulator.
Model: An Extremely Randomized Trees (ExtraTrees) classifier (scikit-learn) with 1000 trees was trained (Gini impurity splits; predict_proba scores). Performance assessment used five iterations of three-fold cross-validation over families and permutation tests for AUC significance. The probability threshold for binary calls was chosen to maximize balanced accuracy on the training set.
Acr search space and candidate generation: A search space was built from prokaryotic proteins enriched for virus/provirus contexts by selecting genes near proteins with viral signatures (CDD/pVOG hits), plus viral proteomes, yielding 10,938,430 proteins. The model scored all proteins, generating 1,546,505 candidates; removal of proteins with CDD/pVOG domain hits left 892,830 proteins; clustering by sequence similarity produced 232,616 clusters.
Heuristic filtering: To enrich for true Acrs, clusters were filtered requiring (i) at least one member with a downstream HTH-domain gene in the same direction and (ii) at least one member in a self-targeting genome, yielding 11,304 clusters. Additional filters (thresholds chosen by maximizing balanced accuracy using known Acrs) included maximum mean direction size ≤5; limited breadth of homology (HHblits hit counts and total homolog counts); and enrichment for virus homologs, resulting in 2526 clusters. Families significantly similar to known Acrs or Acr-related sequences (PSI-BLAST e<1e-3) were removed (26 families), leaving 2500 novel candidate families (16,919 proteins).
Characterization: Candidates exhibited weighted mean length 109 aa (SD 21.6), occurred in short directions (mean ~3 genes) with small proteins (mean 200 aa), had lower membrane association rates (~15%) than background (~28%), and showed modest differences in predicted secondary structure (slightly fewer beta-strand residues than negatives; Mann–Whitney p≈7.45e−13). About 10% of candidate families had viral genome members (279 virus strains; including archaeal viruses). Genomic neighborhood analysis was performed for large clusters (presence–absence matrices across ±10 genes).
Experimental validation: A focused set of Proteobacteria candidates associated with type I-C, I-E, and I-F systems in Pseudomonas aeruginosa were screened using phage plaque assays (type I-E and I-F at endogenous expression with native spacers; type I-C heterologously expressed in PAO1 with engineered spacer). Ninety-one genes were targeted; 32 were successfully cloned for testing. Two candidates corresponded to known inhibitors identified independently (AcrIC3 and AcrIC4 homologs). Two novel inhibitors, AcrIC9 (79 aa, acidic pI 3.69) and AcrIC10 (94 aa, pI 6.57), inhibited type I-C, with AcrIC9 showing strong activity. Additional case studies examined top clusters (e.g., C4338 near AcrIIA1 in Listeria prophages and later reported as an anti-Cas9; C2091 and C12085 in Listeria prophages adjacent to Aca-like regulators). Laboratory closures (COVID-19) limited broader testing.
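
The threshold-selection step mentioned under Model (choosing the probability cutoff that maximizes balanced accuracy on the training set) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels, and the toy scores are all hypothetical, and the scores merely stand in for classifier probabilities such as those returned by predict_proba.

```python
from typing import Sequence


def balanced_accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Mean of sensitivity and specificity when score >= threshold is called positive."""
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        called_positive = score >= threshold
        if label == 1:
            if called_positive:
                tp += 1
            else:
                fn += 1
        else:
            if called_positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


def best_threshold(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Try every observed score as a cutoff; ties go to the lowest cutoff."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: balanced_accuracy(labels, scores, t))


# Hypothetical toy data: 1 = Acr, 0 = non-Acr, scores = model probabilities.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.55, 0.6, 0.3, 0.2, 0.1, 0.4]

chosen = best_threshold(toy_labels, toy_scores)
print(chosen, balanced_accuracy(toy_labels, toy_scores, chosen))
```

Balanced accuracy weights sensitivity and specificity equally, which makes the chosen cutoff less sensitive to the relative sizes of the positive and negative sets than plain accuracy would be.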

Key Findings

Cross-validation performance: Mean ROC AUC 0.93 across five iterations of three-fold cross-validation (permutation p=0.001).
Unseen test set performance: ROC AUC 0.83 (permutation p=0.001). Binary classification at the selected threshold achieved precision 78% and recall 57%.
Large-scale prediction: From 10.94 million virus/provirus-enriched proteins, the model identified 1.55 million candidates; after removing domain-annotated proteins and clustering, 232,616 clusters were obtained. Heuristic filters yielded 2526 clusters; removing 26 known/related families resulted in 2500 novel candidate Acr families (16,919 proteins). Nearly half of the families were singletons; the largest family had 319 members.
Candidate characteristics: Weighted mean protein length 109 aa; candidates reside in short directions (mean ~3 genes) enriched for small, unannotated proteins; 15% predicted membrane-associated (vs ~28% background); subtle but significant reduction in beta-sheet content vs negatives.
Taxonomic and genomic context: Candidates span diverse bacteria and archaea; 10% have members encoded in viral genomes (279 strains). Some phages encode multiple candidates; archaeal viruses also represented.
Experimental validation: Two novel anti-CRISPRs against type I-C, AcrIC9 and AcrIC10, were discovered; two additional screened candidates matched independently identified type I-C inhibitors (AcrIC3 and AcrIC4). AcrIC9 was highly active, fully inactivating type I-C in assays; AcrIC10 showed weaker activity. Case analyses supported additional strong candidates in Listeria prophages and other contexts.
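
To make the reported permutation p-values concrete, here is a minimal sketch of a label-permutation test for ROC AUC. It is an illustration under assumptions rather than the study's procedure: the function names and toy data are hypothetical, and the choice of 999 permutations with a +1 correction is only one common scheme whose smallest attainable p-value is 0.001; the summary does not state how many permutations were used.

```python
import random
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(
    labels: Sequence[int],
    scores: Sequence[float],
    n_permutations: int = 999,  # assumption: not taken from the study
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed AUC, p-value) by shuffling labels against fixed scores."""
    rng = random.Random(seed)
    observed = roc_auc(labels, scores)
    shuffled: List[int] = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if roc_auc(shuffled, scores) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical toy data: ten positives that all outscore ten negatives.
toy_labels = [1] * 10 + [0] * 10
toy_scores = [0.95 - 0.02 * i for i in range(10)] + [0.45 - 0.02 * i for i in range(10)]

observed_auc, p_value = permutation_p_value(toy_labels, toy_scores)
print(observed_auc, p_value)
```

With this correction the p-value can never be exactly zero, so a reported value equal to the floor indicates that essentially no shuffled labelling matched the observed AUC.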

Discussion

By integrating a small set of generalizable features of Acrs—short protein length, occurrence in compact operons with low annotation density, proximity to HTH regulators, lower membrane association, and enrichment in self-targeting and virus/provirus contexts—the machine-learning model effectively discriminated Acrs from non-Acrs, including on an unseen test set. Combining model scores with heuristic filters enriched for hallmark Acr genomic contexts, yielding 2500 candidate families and substantially expanding the putative Acr repertoire. Experimental assays validated two novel type I-C inhibitors (AcrIC9, AcrIC10) and confirmed that top-ranked predictions are enriched for true Acrs. These findings demonstrate that feature-based machine learning can overcome sequence divergence that hampers homology-driven discovery, providing a resource for the community to explore Acr diversity, mechanisms, and potential as precision regulators of CRISPR-Cas technologies.
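
The combination of model output with heuristic filters can be sketched as a simple predicate over clusters. Only the two membership requirements (a downstream HTH gene and a self-targeting genome) and the mean direction size cutoff of 5 come from the text above; the `Cluster` fields, the homolog-count limit, the virus-fraction limit, and the toy clusters are hypothetical placeholders, since the summary does not give those values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    name: str
    any_member_downstream_hth: bool    # some member has a co-directional downstream HTH gene
    any_member_self_targeting: bool    # some member sits in a self-targeting genome
    mean_direction_size: float         # mean number of genes in the co-directional run
    homolog_count: int                 # breadth of homology
    virus_homolog_fraction: float      # share of homologs found in viruses


MAX_MEAN_DIRECTION_SIZE = 5       # stated in the summary
MAX_HOMOLOG_COUNT = 100           # hypothetical placeholder
MIN_VIRUS_HOMOLOG_FRACTION = 0.5  # hypothetical placeholder


def passes_filters(cluster: Cluster) -> bool:
    """Apply the required-context checks first, then the threshold filters."""
    return (
        cluster.any_member_downstream_hth
        and cluster.any_member_self_targeting
        and cluster.mean_direction_size <= MAX_MEAN_DIRECTION_SIZE
        and cluster.homolog_count <= MAX_HOMOLOG_COUNT
        and cluster.virus_homolog_fraction >= MIN_VIRUS_HOMOLOG_FRACTION
    )


# Hypothetical toy clusters, not taken from the study.
toy_clusters: List[Cluster] = [
    Cluster("toy_A", True, True, 3.0, 12, 0.8),    # passes every filter
    Cluster("toy_B", False, True, 2.5, 8, 0.9),    # no downstream HTH gene
    Cluster("toy_C", True, True, 7.0, 10, 0.7),    # direction too long
    Cluster("toy_D", True, True, 4.0, 500, 0.6),   # homology too broad
]

kept = [c.name for c in toy_clusters if passes_filters(c)]
print(kept)
```

Because every condition must hold, each added filter can only shrink the candidate set, which mirrors the stepwise reduction in cluster counts reported in the study.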

Conclusion

This work introduces and validates a machine-learning pipeline for Acr discovery, achieving a ROC AUC of 0.83 on an unseen test set and producing a curated set of 2500 candidate Acr families. The study experimentally identifies two new type I-C inhibitors (AcrIC9 and AcrIC10) and recovers independently discovered AcrIC3 and AcrIC4, underscoring the approach’s practical utility. The resulting candidate compendium broadens the landscape of Acr diversity and offers a roadmap for systematic experimental follow-up. Future work should expand validation across additional CRISPR-Cas subtypes, integrate structural and motif features, refine viral/proviral context detection, and apply iterative learning as new Acrs are confirmed.

Limitations

Training data constraints: the relatively small and uneven positive set of known Acr families may bias the model.
Feature and heuristic bias: reliance on characteristics like HTH adjacency, short operons, and self-targeting may under-detect atypical Acrs lacking these contexts, skewing predictions toward Acrs similar to known families.
Incomplete provirus detection: enrichment by proximity to viral proteins may miss degenerated or atypical MGEs and introduce annotation errors.
Moderate recall on the unseen test set (57%) indicates many true Acrs may be missed at the chosen threshold.
Limited experimental validation: only a subset of candidates was tested; COVID-19 disruptions curtailed broader screening and subtype coverage.
Potential database/annotation noise and sequence curation errors (e.g., truncated HTH annotations) can affect feature extraction and filtering.
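
To put the 57% recall in concrete terms, the short calculation below converts the reported precision and recall on the unseen test set (879 positives) into approximate counts. The function name is hypothetical, and because the published percentages are rounded, the resulting counts are estimates rather than figures from the study.

```python
from typing import Tuple


def implied_counts(n_positive: int, precision: float, recall: float) -> Tuple[float, float, float]:
    """Approximate true positives, false negatives and false positives."""
    true_pos = recall * n_positive                      # recall = TP / (TP + FN)
    false_neg = n_positive - true_pos
    false_pos = true_pos * (1 - precision) / precision  # precision = TP / (TP + FP)
    return true_pos, false_neg, false_pos


# Values reported for the unseen test set: 879 positives, precision 78%, recall 57%.
tp, fn, fp = implied_counts(879, 0.78, 0.57)
f1 = 2 * 0.78 * 0.57 / (0.78 + 0.57)

print(round(tp), round(fn), round(fp), round(f1, 2))
```

Under these rounded inputs, roughly 380 of the 879 held-out Acrs would fall below the cutoff, which is why lowering the threshold trades precision for coverage.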
