An expanded reference map of the human gut microbiome reveals hundreds of previously unknown species
S. Leviatan, S. Shoer, et al.
Sigal Leviatan, Saar Shoer, Daphna Rothschild, Maria Gorodetski, and Eran Segal present a new reference set of 3,594 high-quality species genomes derived from the human gut microbiome. The set includes 310 previously undescribed species, indicating that substantial gut microbial diversity remains to be explored.
Introduction
The human gut harbors an estimated 10^13 microbial cells, with hundreds of species per individual drawn from a much larger species pool. Many gut microbes are difficult to culture, and limitations in assembling and classifying metagenome-assembled genomes (MAGs) constrain our ability to comprehensively identify microbial content and relate it to health and disease. Existing approaches rely on 16S rRNA gene references or shotgun read assignment against genome catalogs, both of which depend on the breadth and quality of reference genomes. This study aims to expand and improve a curated genome reference for the human gut microbiome by assembling and integrating a very large cohort of gut metagenomes and applying rigorous clustering and quality control, thereby enabling better read mapping, taxonomic resolution, and discovery of previously unknown species.
Literature Review
Prior work has expanded human microbiome genome catalogs using metagenomic assemblies and isolates (e.g., Pasolli et al., the UNITN set) and, more recently, the Unified Human Gastrointestinal Genome collection (UHGG). These efforts showed substantial unexplored diversity and improved read assignment but included genomes from multiple body sites and public repositories that may be biased toward cultured or well-studied species (e.g., Escherichia coli overrepresented; Prevotella underrepresented). Standards such as completeness and contamination thresholds (e.g., MIMAG) and distance metrics (ANI ~95% species boundary; MASH as a proxy) guide clustering and species delineation. However, potential mis-binning, redundancy among representatives, and ecological niche heterogeneity can hamper accurate classification and read alignment. This study builds on these advances with a larger gut-focused dataset, stricter QC, and alternative sequencing modalities to reduce bias and improve representation.
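For context, the correspondence between a MASH distance of about 0.05 and the ~95% ANI species boundary follows from how Mash estimates distance from the Jaccard index j of two k-mer sketches. The expression below is the estimator described in the Mash publication. The k-mer size k is not stated in this summary and is left symbolic, and the relation to ANI is an approximation, not an exact identity.

```latex
D = -\frac{1}{k}\,\ln\!\left(\frac{2j}{1+j}\right),
\qquad
\mathrm{ANI} \approx 1 - D
\quad\Longrightarrow\quad
\mathrm{ANI} \approx 0.95 \;\Leftrightarrow\; D \approx 0.05
```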
Methodology
- Data sources: Assembled genomes from 51,052 human gut metagenome samples (largely Israeli adults; age 9–93, median 55), complemented with curated external MAGs and isolates from prior studies (through 2019), and nanopore-based MAGs. The final assembly pool comprised 241,118 genomes with available metadata. Public repositories (e.g., NCBI RefSeq) were excluded due to lack of consistent metadata and species bias.
- Sequencing and assembly: Most samples underwent short-read shotgun sequencing (single-end; some paired-end treated as single-end for alignment). A subset of 70 samples received nanopore long-read sequencing with deep paired-end support. Assembly pipelines followed Pasolli et al. with modifications: SPAdes for error correction and contig building; Bowtie2 for read mapping to contigs; MetaPhlAn for taxonomic profiling; binning and quality evaluation using Prodigal, jgi_summarize_bam_contig_depths, and CheckM; annotation using Prokka; contamination screening with stringent thresholds; and additional QC filters.
- Quality control thresholds: Completeness ≥70–75% (increased vs. prior 50% to reduce systematic bias and ensure accurate clustering), contamination <5%. Additional steps reduced chimerism and mis-binning risks (e.g., GUNC for chimerism assessment; GC frequency checks).
- Clustering and species delineation: All-vs-all genomic distances computed using MASH (sketch size 10k). Hierarchical clustering applied with species-level threshold equivalent to 95% ANI (MASH distance ~0.05; representatives bounded with ≤0.045 average distance). Bias-reduction strategies prevented overrepresentation from duplicate assemblies of the same individual over time (processed multiple samples but selected a subset per individual to limit near-duplicates).
- Representative genome selection: For each cluster, a representative was chosen by balancing completeness, contamination, and N50 (weighted scoring). Median representative genome quality achieved: completeness 95%, contamination 0.67%.
- Taxonomic identification and novelty assessment: GTDB-Tk (GTDB R202) assigned taxonomy to representatives. Novelty evaluated by MASH distance to closest UNITN and UHGG representatives; genomes exceeding species-level distance were considered novel species, with further categorization by genus/family/order-level distances.
- Annotation: Standard pipelines annotated coding sequences and non-coding features; functional categorization via eggNOG/COG. Antibiotic resistance potential assessed with ABRICate across multiple databases (NCBI, CARD, ARG-ANNOT, ResFinder), harmonizing hits across databases.
- Read mapping validation: External and internal validation cohorts from multiple geographies (e.g., Netherlands n=1,528; Israel n=3,096; India n=110; El Salvador n=113; Tanzania n=68). Reads (100 bp after QC; human-filtered) aligned to WIS, UNITN, and UHGG references using bioBakery; uniquely aligned reads defined by best single mapping. Unique relative abundance (URA) method used for prevalence/abundance estimates.
- Statistics: Mann–Whitney U tests with Bonferroni correction for multiple comparisons; no a priori sample size determination; alpha 0.05.
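The quality filter, clustering, and representative selection steps above can be sketched on toy data as follows. This is a minimal illustration, not the authors' pipeline: the genome records, the pairwise distances, the use of average linkage, and the scoring weights are all assumptions made for the example. Only the contamination cutoff (<5%), the lower completeness bound (70%), and the 0.05 distance threshold are taken from the text.

```python
import math
from itertools import combinations

# Hypothetical toy records; IDs, field names, and values are illustrative only.
genomes = {
    "g1": {"completeness": 96.0, "contamination": 0.5, "n50": 120_000},
    "g2": {"completeness": 88.0, "contamination": 1.2, "n50": 60_000},
    "g3": {"completeness": 92.0, "contamination": 0.9, "n50": 200_000},
    "g4": {"completeness": 65.0, "contamination": 0.4, "n50": 90_000},   # fails completeness
    "g5": {"completeness": 99.0, "contamination": 6.0, "n50": 300_000},  # fails contamination
}

# Hypothetical pairwise Mash distances between the genomes that pass QC.
mash = {
    frozenset(("g1", "g2")): 0.02,
    frozenset(("g1", "g3")): 0.20,
    frozenset(("g2", "g3")): 0.18,
}


def passes_qc(g, min_completeness=70.0, max_contamination=5.0):
    """Keep genomes with completeness >= 70% and contamination < 5%."""
    return g["completeness"] >= min_completeness and g["contamination"] < max_contamination


def average_linkage(ids, dist, threshold=0.05):
    """Greedy agglomerative clustering; average linkage is an assumption here."""
    clusters = [[i] for i in ids]

    def avg(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg(clusters[p[0]], clusters[p[1]]),
        )
        if avg(clusters[i], clusters[j]) > threshold:
            break  # no remaining pair is within the species-level threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def score(g, w_comp=1.0, w_cont=5.0, w_n50=0.5):
    """Weighted quality score; these weights are hypothetical, not from the study."""
    return w_comp * g["completeness"] - w_cont * g["contamination"] + w_n50 * math.log10(g["n50"])


kept = sorted(name for name, g in genomes.items() if passes_qc(g))
clusters = average_linkage(kept, mash)
representatives = [max(c, key=lambda name: score(genomes[name])) for c in clusters]

print("kept:", kept)
print("clusters:", clusters)
print("representatives:", representatives)
```

In this toy run, two genomes are dropped by the quality filter, the two close genomes merge into one species-level cluster, and the higher-scoring member of each cluster is kept as its representative. At the scale of the real dataset, a pure-Python loop like this would be far too slow; it is meant only to make the logic concrete.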
Key Findings
- Reference construction: From 241,118 curated assemblies, the WIS reference set comprises 3,594 species-level representative genomes (mostly bacterial; also archaeal), with median completeness 95% and median contamination 0.67%.
- Quality vs. UNITN: Representatives have higher completeness (95% vs 91%, q<0.0001), slightly lower contamination (0.67% vs 0.70%, q<0.05), higher N50 (median 5.57 vs 3.04, q<0.0001), and longer genome length (median ~2.24 Mb vs ~1.97 Mb, q<0.0001). Clusters are larger on average (mean size 67.09 vs 39.61). Minimal inter-representative MASH distance is 0.040 in WIS (vs closer pairs in UNITN), reducing redundancy and ambiguous read assignment.
- Read mapping performance: Across validation cohorts (Western and non-Western), WIS aligns a higher fraction of total reads than UNITN. In the external cohort, the median fraction of aligned reads is 83.65% for WIS, significantly higher than for UNITN; UHGG aligns slightly fewer reads in total but slightly more reads uniquely. In the Netherlands, Israel, India, El Salvador, and Tanzania cohorts, WIS generally increases total and uniquely aligned reads relative to UNITN, with some cohort-specific exceptions vs. UHGG in unique alignment.
- Annotation breadth: WIS has more annotations per genome across categories (CDS, tRNA, rRNA, miscRNA, tmRNA, repeats; q<0.0001 vs UNITN). rRNA genes per genome are markedly higher in WIS (e.g., 16S: 0.68 vs 0.10 per genome), driven by inclusion of isolates and nanopore MAGs. WIS exhibits more uniquely annotated genes per bp than UNITN and UHGG, indicating broader functional coverage within the gut niche.
- Novel species discovery: 1,055/3,594 representatives lack GTDB species-level IDs; 340 exceed species-level distance from UNITN/UHGG, of which 310 constitute previously undescribed species (79 exceed genus-level, 6 exceed family-level distances). Novel species are mostly singletons (76%), predominantly from the authors’ cohort (89%) and short-read MAGs (95%). Novel species have lower GUNC reference representation scores and fewer predicted antibiotic resistance genes (21.94% harbor at least one vs 30.21% in known species; mean 1.06 vs 2.18 resistances/genome).
- Prevalence examples: The most prevalent novel species (Faecalibacterium genus) appears in up to 19% of validation samples, with additional prevalent novel species in Ruminococcus and Prevotella.
- UHGG comparison: UHGG representatives show greater redundancy (smaller clusters, closer minimal inter-representative distances). WIS captures a higher fraction of total reads than UHGG in external validation (83.65% vs 82.85%, q<0.0001), while UHGG achieves slightly higher uniquely aligned reads (66.12% vs 65.60%, q<0.0001). UHGG-unique species tend to be lower prevalence than WIS-unique species.
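To put the reported counts in proportion, the short calculation below converts them to shares of the 3,594 representatives and expresses the WIS vs. UHGG alignment differences in percentage points. All input numbers are copied from the findings above; the variable and function names are illustrative.

```python
# Counts and medians as reported in the key findings.
total_representatives = 3594
without_gtdb_species_id = 1055
beyond_species_distance = 340
novel_species = 310


def pct(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


shares = {
    "no GTDB species-level ID": pct(without_gtdb_species_id, total_representatives),
    "beyond species-level distance": pct(beyond_species_distance, total_representatives),
    "previously undescribed species": pct(novel_species, total_representatives),
}

# Median alignment differences in the external cohort, in percentage points.
total_alignment_gap = round(83.65 - 82.85, 2)   # WIS minus UHGG, total reads
unique_alignment_gap = round(66.12 - 65.60, 2)  # UHGG minus WIS, unique reads

for label, value in shares.items():
    print(f"{label}: {value}% of representatives")
print(f"total alignment gap (WIS - UHGG): {total_alignment_gap} points")
print(f"unique alignment gap (UHGG - WIS): {unique_alignment_gap} points")
```

Both alignment gaps are under one percentage point, which is consistent with the description of the two catalogs as having complementary rather than sharply different strengths.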
Discussion
The study addresses the need for a comprehensive, gut-focused microbial genome reference to improve read mapping and taxonomic/functional resolution in metagenomic studies. By assembling genomes from an exceptionally large stool metagenome cohort and integrating curated external MAGs/isolates while enforcing stringent quality control, the authors deliver a reference (WIS) that outperforms the prior benchmark (UNITN) in genome quality, read mappability, and functional annotation density. The gut-only focus reduces ecological heterogeneity and taxonomy ambiguity, improving clustering and representative selection. Discovery of 310 previously unknown species, including taxa exceeding genus and family distances, expands the known phylogenetic and functional landscape of the gut microbiome. Increased rRNA gene recovery and broader functional annotations facilitate cross-modal microbiome profiling (e.g., 16S mapping) and hypothesis generation regarding microbiome–host interactions. Comparisons with UHGG reveal complementary strengths: WIS offers better total alignment performance and per-genome annotation depth within the gut niche, while UHGG offers slightly higher unique alignment. Together, these catalogs can be leveraged to maximize coverage and specificity, suggesting a path toward harmonized, non-redundant references.
Conclusion
This work presents the WIS gut microbiome reference comprising 3,594 high-quality species genomes built from 241,118 assemblies across 51,052 gut metagenomes and curated external datasets. The reference improves read alignment rates relative to established catalogs, increases per-genome annotations (notably rRNA), and reveals 310 previously unknown species, including taxa beyond genus and family boundaries. These resources enable more comprehensive microbiome profiling and functional inference in human gut studies. Future directions include: integrating WIS with UHGG and subsequent catalogs into a unified, deduplicated resource; expanding long-read and isolate sequencing to enhance contiguity and reduce chimerism; targeted sampling of individuals/samples enriched for rare taxa; extending references to other microbial kingdoms (viruses, fungi, archaea) and other body sites; and deeper functional characterization of novel species and their host/microbe interaction mechanisms.
Limitations
- Cohort and niche focus: The dataset is predominantly Israeli adults and focused solely on the gut, which may limit generalizability to other populations or body sites. Performance in non-Western cohorts showed some mixed results vs. UHGG for unique alignments.
- Assembly modality mix: Although nanopore and isolates improved contiguity and rRNA recovery, most genomes (assemblies and representatives) are short-read MAGs, which are more susceptible to fragmentation and potential contamination than long-read or isolate assemblies.
- Novel species singletons and chimerism risk: The majority of novel species are singletons, increasing the chance of mis-binning; chimerism indicators (GUNC CSS) are higher among novel species than known ones.
- Repository exclusions: Excluding public repositories like NCBI RefSeq to avoid bias may omit legitimately useful genomes and reduce overlap with other studies.
- Threshold and tool dependencies: Species delineation relies on MASH distance proxies to ANI and specific QC thresholds; alternative parameters or pipelines might yield different clustering outcomes. Differences in QC between catalogs (e.g., UHGG thresholds) complicate direct comparisons.
- Limited long-read coverage: Only 70 samples underwent nanopore sequencing, constraining the long-read contribution and potentially leaving some complex genomes under-resolved.