Biology
Deciphering microbial gene function using natural language processing
D. Miller, A. Stern, et al.
The study addresses the challenge of assigning functions to the vast fraction of microbial genes that remain uncharacterized, especially in metagenomic datasets comprising uncultivated microbes. Leveraging the observation that genomic context (co-localization and co-occurrence of genes) is informative about gene function in prokaryotes, the authors hypothesize that methods from natural language processing can model "gene semantics" by treating gene families as words and contigs/genomic regions as sentences. The purpose is to build embeddings of gene families from large-scale genomic context and use them to predict functional categories for hypothetical genes, thereby accelerating discovery of microbial systems relevant to biotechnology and medicine. The importance lies in enabling function inference beyond sequence homology, uncovering novel systems (e.g., defense, secretion) and expanding knowledge of microbial gene functions across diverse taxa.
Prior work has shown genomic context can predict functional associations and that many prokaryotic systems (e.g., CRISPR-Cas, secretion, defense islands) exhibit strong gene neighborhood signatures. NLP has been applied to protein sequences and k-mers to model protein language and predict various properties, and to classify biosynthetic gene clusters using Pfam domains. Other studies have used NLP on DNA k-mers for tasks like taxonomic classification and enhancer-promoter or chromatin accessibility prediction. However, a universal, higher-level representation modeling complete genes as words across all microbial genomes had not been broadly implemented. This work builds on these foundations by abstracting to gene families and using their genomic co-occurrence to learn embeddings and perform functional classification.
Data: All assembled microbial genomes (excluding Metazoa, Fungi, Viridiplantae) and metagenomes publicly available in NCBI WGS and EBI MGnify as of March 14, 2020 were downloaded (596,338 genomes; 22,923 metagenomes). Redundancy was reduced by restricting genomes to UniProt non-redundant proteomes and deduping metagenomes with BBMAP dedupe. Contigs shorter than 10 kbp were filtered out. Genes were predicted with Prodigal v3.0.0 and initially annotated with Prokka v1.14.6.
KO-based annotation: KEGG KO proteins (17,107,806 proteins from 24,307 KOs; downloaded May 14, 2021) were subclustered with MMseqs2 (-s 7.5 -c 0.5). Subclusters (>5 proteins) were aligned (MAFFT) and HMMs built (HMMER v3.3.2), yielding 63,234 HMMs representing 23,199 KOs. Proteins on contigs were scanned with hmmsearch (E ≤ 1e-5); matches were assigned identifiers KXXXXX.YY (KO and subcluster index). Proteins without KO matches were deemed hypothetical.
Clustering hypothetical proteins: Hypotheticals were first clustered with CD-HIT (-s 0.80 -c 0.80), then representatives clustered into families with MMseqs2 (-s 7.5 -c 0.5), producing identifiers hypo.clst.ZZZZ. Rare families (appearances <24) and ubiquitous tokens (frequency >1e-3) were removed. The resulting corpus contained 360,039,110 genes across 11,119,550 contigs, represented by 563,589 unique tokens (words). Additional NR (DIAMOND v2.0.11) searches were used to assess annotation status of hypothetical families (informative if E < 1e-10 with non-ambiguous description; unannotated if no hit E < 10).
Embedding (gene space): Contigs were treated as sentences of tokens (KO subclusters and hypothetical clusters). A word2vec Skip-gram model with negative sampling was trained on the corpus with window size w=5 and embedding dimension k=300 to learn embeddings for all tokens, capturing co-occurrence-based semantics of gene families.
Classifier: Using embeddings as sole features, a 4-layer deep neural network (DNN) classified annotated KO tokens into functional categories (KEGG 3rd-level mapped to a reduced set). Architecture: hidden layers of sizes 256, 128, 64 with ReLU activation, dropout 0.2, softmax output; trained with Adam optimizer, categorical cross-entropy, 20 epochs. Alternative classifiers (SVM with RBF kernel C=1; Random Forest 1000 estimators, max depth 50; XGBoost 800 estimators, max depth 6, learning rate 0.05) were tuned via nested CV for comparison.
Functional categories: Selected categories included amino sugar and nucleotide sugar metabolism, benzoate degradation, energy metabolism, oxidative phosphorylation, porphyrin and chlorophyll metabolism, prokaryotic defense, ribosome, secretion systems, two-component systems; plus an "Other" category from sampled remaining KOs.
Cross-validation: Two schemes—(i) naive 5-fold CV over contigs; (ii) leave-one-taxonomic-group-out CV over five major groups (Gammaproteobacteria, Alphaproteobacteria, Actinobacteria, Firmicutes, Bacteroidetes). Performance metrics included per-category F1 and AUPR. Benchmarking vs remote homology used leave-one-KO-out CV across 2,915 KOs (7,043 words), comparing DNN predictions to HHblits (v3.3.0, 2 iterations, E ≤ 1e-3), HMMER jackhmmer (v3.3.2, -N 2 -E 1e-3 -incE 1e-3), and PSI-BLAST (v2.7.1, 2 iterations, E ≤ 1e-3). Hyperparameters were explored for iterations and E-value thresholds.
Prediction on hypotheticals: The trained DNN assigned functional categories to hypothetical families, using reliability thresholds combining prediction score and category AUPR (weighted threshold t=0.9; unweighted t=0.99). Validation included NR annotations for a subset and recovery of recently discovered defense system genes (BREX, DISARM, Theoris, Durantia, Gabija, Hachiman, Kiwa, Lamssu, Septu, Wadjet, Zoria).
Discovery potential: Adjusted rarefaction analysis per category repeatedly sub-sampled predicted genes to estimate expected growth in unique predicted families with increasing sample size, indicating saturation or continued discovery potential.
System discovery: High-frequency predicted families (>1000 occurrences) were used to build a contig-by-word presence matrix and correlation matrix; highly correlated clusters were selected as candidate operons/systems. Domain/homology analyses were done with HHpred (databases: PDB mmCIF, Pfam-A v35, NCBI CD v3.18, TIGRFAMs v15.0) and BLAST against NR. Defense novelty was checked with PADLOC and DefenseFinder.
- Gene embedding space: Embeddings clustered functionally related genes; CRISPR-Cas genes clustered near other defense systems but remained separable, and secretion systems formed distinct clusters reflecting system types and evolutionary relationships. cas4 families split across multiple clusters, consistent with CRISPR and non-CRISPR-associated roles.
- Classification performance: In leave-one-taxonomic-group-out CV, the DNN outperformed SVM, RF, and XGBoost overall. Per-category AUPR ranged from 0.56 to 0.97: Secretion system (0.97), Prokaryotic defense system (0.90), Two-component system (0.83), Benzoate degradation (0.76), Energy metabolism (0.75), Porin/permease system (0.71), Oxidative phosphorylation (0.62), Amino sugar and nucleotide sugar metabolism (0.56), Other (0.56). Context-based inference worked across large evolutionary distances using only genomic context.
- Benchmarking vs sequence homology: Across 2,915 leave-one-KO-out models, the embedding-based DNN was most often better or comparable to HHblits, and on average 1.4× more sensitive; 27–44% of tested genes lacked significant hits in sequence-based searches. HHblits had 4–6% higher precision in some categories, as expected for homology-based methods.
- Predictions on hypotheticals: Of 519,398 hypothetical families (444,521 also unannotated in NR), 56,617 families (>20 million genes) received high-confidence functional predictions. For 7,691 predicted families with NR hits, descriptions agreed with predicted categories (e.g., energy metabolism genes related to nitrogen/methane metabolism; amino sugar metabolism genes as glycosyltransferases or glucosamine-6-phosphate deaminase).
- Validation on recent defense systems: Of 1,369 genes mapped to newly discovered defense systems (BREX, DISARM, Theoris, Durantia, Gabija, Hachiman, Kiwa, Lamssu, Septu, Wadjet, Zoria), 98.6% were correctly predicted as defense. Additionally, 40,247 uncharacterized families were predicted as defense, suggesting extensive undiscovered diversity.
- Discovery potential: Rarefaction showed core functions (ribosome, nucleotide metabolism) near saturation, whereas prokaryotic defense, secretion, and two-component systems exhibited high discovery potential (non-plateauing curves). Subsequent large-scale studies reporting many new defense systems corroborate this.
- Newly identified candidate systems: (1) A secretion-related operon of 8–9 genes prevalent in Roseburia, Ruminococcus, and Eubacterium (Clostridia), sharing CpaF (PilB/GspE), PilC/GspF, DUF5411, and a cell invasion domain protein; other genes lacked clear homology to known T4P/T2SS components but co-occurred near pilus/adhesion genes. (2) A distant type IV pilus variant in Veillonella (present in 86% of genomes examined), with five to six unannotated genes adjacent to pil genes; all components showed homology except outer membrane PilQ/GspD located distantly. (3) A putative defense system widespread across bacteria with core genes: a large Z1-domain protein (~925 aa), a PD-(D/E)XK nuclease with DUF4220, and a cytosine methyltransferase; Type I (six genes) associated with DNA repair endonucleases; Type II (II-A, II-B) included an AIPR abortive infection protein and was associated with MORC ATPases and a transcriptional regulator; showed low sequence similarity (~20% identity) to mzaBCDE (mzaA present in few cases) and passed novelty checks (PADLOC, DefenseFinder).
The results support the central hypothesis that genomic context encodes sufficient information to infer microbial gene function at scale. The learned gene embeddings grouped functionally related genes and distinguished subfunctions within annotated families (e.g., cas4 variants), enabling discovery beyond sequence homology. The embedding-driven classifier achieved strong cross-taxa performance, particularly for systems with strong contextual signatures (defense, secretion, two-component, oxidative phosphorylation), and matched or exceeded sensitive homology methods in sensitivity. Applying the model to the vast pool of hypothetical genes yielded tens of thousands of functional assignments, highlighted categories with high discovery potential, and uncovered candidate secretion machineries in gut-associated bacteria and a widely distributed defense system enriched for DNA-binding/cleaving activities. These advances directly address the challenge of annotating uncharacterized microbial genes and underscore the value of combining genomic context with NLP to reveal novel biology relevant to microbial interactions, immunity, and potential biotechnological applications.
This study introduces a scalable framework that models "gene semantics" by training word2vec embeddings on a global microbial genomic corpus and classifying gene functions with a DNN using only genomic context. The approach recapitulates known functional organization, generalizes across distant taxa, rivals sensitive sequence-homology methods, assigns high-confidence functions to 56,617 hypothetical gene families, and pinpoints categories with high discovery potential. It further identifies candidate secretion systems in Clostridia and Veillonella and a putative, widely distributed prokaryotic defense system. Future directions include enriching embeddings with additional genomic features (gene distance, co-directionality, regulatory elements), integrating sequence and structure features with embeddings, adopting advanced architectures (e.g., transformers) as they mature for long genomic inputs, and developing domain-focused embeddings for specific taxa, microbiomes, or functional systems to enhance targeted discovery.
- Corpus filtering removed rare gene families (<24 occurrences) and ubiquitous tokens (>1e-3 frequency), potentially excluding rare functions and broad housekeeping signals. Short contigs (<10 kbp) were excluded, limiting detection on small mobile elements (transposons, small plasmids). Standard gene prediction omitted small ORFs, overlooking short peptides. The dataset is biased toward human-associated microbes. Context-based classification is less informative for functions lacking strong genomic co-localization or for genes shared across multiple pathways (e.g., amino sugar metabolism). The embedding model (word2vec) is relatively simple and may not fully capture long-range dependencies or sparsity; more advanced models would require significant computational resources and are not yet optimized for very long genomic sequences.
Related Publications
Explore these studies to deepen your understanding of the subject.

