Biology
Ecosystem-wide metagenomic binning enables prediction of ecological niches from genomes
J. Alneberg, C. Bennke, et al.
The study asks whether an organism’s ecological niche, reflected by its distribution across environmental gradients, can be predicted directly from its genome. Building on Hutchinson’s niche concept and observations of phylogenetic clustering and ecological similarity among related taxa, the authors posit that genomes, encoding traits subject to environmental filtering, should be strong predictors of realized niches. While genome-based phenotype inference exists for many microbial traits, it remains unclear if species’ distribution patterns can be predicted from genome content alone. With abundant metagenomic data and quantitative abundance profiles across Baltic Sea gradients (salinity, depth/oxygen, size fraction, season, geography), the authors aim to reconstruct genomes ecosystem-wide and test genome-based prediction of niche placement using machine learning.
Prior work has shown: (1) community phylogenetic structure and correlations between phylogenetic relatedness and ecological similarity in macro- and microorganisms; (2) genomic inference of microbial phenotypes (e.g., oxygen tolerance, substrate use, temperature range, pathogenicity, antibiotic resistance, trophic strategy); (3) large-scale metagenomic binning in global oceans and lakes; (4) existence of a global brackish microbiome with bacterioplankton genetically distinct from freshwater and marine relatives; and (5) links between gene content, phylogeny, and environmental conditions. However, a systematic prediction of ecological distributions directly from genomes within an ecosystem had not been demonstrated.
Sampling and sequencing: 123 metagenomic samples from the Baltic Sea across five datasets (Askö Time Series 2011; Redoxcline 2014; Transect 2014; LMO Time Series 2013–2014; Coastal Transect 2015) covering gradients of salinity, depth/oxygen, season, size fraction, and geography. Sequential filtration protocols varied by dataset (e.g., 3.0, 0.8, 0.1 µm; or 0.2 µm with/without pre-filtration). Libraries were prepared with Rubicon ThruPlex and sequenced on Illumina HiSeq 2500 (~44 million paired reads per sample).
Preprocessing: Cutadapt for quality/adapter trimming; FastUniq for duplicate removal.
Assembly and quantification: Each sample was assembled with MEGAHIT (meta-sensitive). Long contigs were cut into non-overlapping 10 kb parts for binning. Kallisto quantified contig abundances across all samples (cross-quantification; 124×124 runs), with counts converted to approximate coverages.
Binning and quality screening: CONCOCT (SpeedUp_Mp) was used for per-sample binning; cut-up contigs were reconciled by majority vote to original contigs. CheckM assessed completeness/contamination; MAGs were retained with ≥75% completeness and ≤5% contamination.
Internal standard: A Thermus thermophilus genome spiked into 29 samples validated sensitivity (MAG recovered in 28/29).
Read recruitment: 1,000 R1 reads per sample were matched via BLASTN to MAG contigs (≥97% identity, ≥90% read length) to estimate recruitment proportions.
Clustering and taxonomy: MAGs were clustered at 96.5% ANI with fastANI and hierarchical clustering to define Baltic Sea clusters (BACLs). Taxonomic assignment via GTDB-Tk (release 80). The representative MAG per BACL was chosen by highest (completeness − contamination).
Abundance profiles: MAG abundance per sample was computed from Kallisto contig coverages (length-weighted) and normalized to coverage per million read pairs; BACL abundance was taken from the representative MAG (intra-BACL profiles highly correlated, mean Spearman 0.98).
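The quality screen, representative selection, and abundance normalization described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Mag` record, the function names, and the toy values are hypothetical, and the normalization (length-weighted mean coverage divided by read pairs in millions) is one plausible reading of "coverage per million read pairs".

```python
from dataclasses import dataclass


@dataclass
class Mag:
    name: str
    cluster: str          # BACL label (hypothetical identifiers)
    completeness: float   # CheckM completeness, percent
    contamination: float  # CheckM contamination, percent


def passes_quality(mag, min_completeness=75.0, max_contamination=5.0):
    """Thresholds stated in the summary: >=75% complete, <=5% contaminated."""
    return (mag.completeness >= min_completeness
            and mag.contamination <= max_contamination)


def pick_representatives(mags):
    """Per cluster, keep the passing MAG with the highest
    (completeness - contamination) score."""
    best = {}
    for mag in mags:
        if not passes_quality(mag):
            continue
        score = mag.completeness - mag.contamination
        current = best.get(mag.cluster)
        if current is None or score > current.completeness - current.contamination:
            best[mag.cluster] = mag
    return best


def coverage_per_million(contigs, read_pairs):
    """contigs: list of (length_bp, coverage) tuples for one MAG in one sample.
    Returns the length-weighted mean coverage per million read pairs
    (assumed normalization)."""
    total_length = sum(length for length, _ in contigs)
    weighted = sum(length * cov for length, cov in contigs) / total_length
    return weighted / (read_pairs / 1e6)


# Toy data, invented for illustration only.
mags = [
    Mag("mag_a", "BACL1", 95.0, 1.0),
    Mag("mag_b", "BACL1", 98.0, 4.5),
    Mag("mag_c", "BACL2", 70.0, 0.5),   # fails the completeness threshold
    Mag("mag_d", "BACL2", 80.0, 2.0),
]
representatives = pick_representatives(mags)
example_coverage = coverage_per_million(
    [(10_000, 12.0), (30_000, 8.0)], read_pairs=44_000_000
)
```

In the toy data, `mag_a` represents BACL1 because its score (94.0) exceeds that of `mag_b` (93.5), even though `mag_b` is more complete.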
Ordination and environmental associations: Pairwise BACL abundance dissimilarity was computed as (1 − Spearman)/2; PCoA with Cailliez correction produced principal coordinates (PCs). Associations with measured environmental parameters (salinity, depth, oxygen, temperature, filter size, nitrate, phosphate, silicate, chlorophyll a, DOC) were derived by correlating BACL–environment correlations with PCoA coordinates.
Gene profiles: Genes were predicted with Prodigal; functions were annotated with eggNOG-mapper (eggNOG @NOG, 35,593 groups total; 4,115 COGs). Each BACL gene profile was averaged from its constituent MAGs; pairwise gene dissimilarities were computed as (1 − Spearman)/2.
Prediction tasks and features: A priori niches were defined as log-abundance ratios: high vs low salinity (Transect 2014 surface; >14 vs <6 PSU); surface vs mid layer (Transect 2014 paired samples); 3.0 µm vs 0.8 µm fractions (Askö 2011). The virtual niche space was defined by PCoA of abundance profiles (PC1–PC10 targeted). Low-abundance BACLs were excluded per threshold for specific tasks; BACLs with cross-cluster ANI >0.95 were excluded to prevent mis-quantification, leaving 342 for PCoA predictions.
Machine learning: Gene profile predictors were filtered to eggNOGs present in ≥10% of BACLs (3,476 eggNOGs; 2,360 COGs). Algorithms: ridge regression (glmnet, alpha=0, CV-tuned lambda), random forest (randomForest, 2000 trees), gradient boosting (gbm, Gaussian loss; tuned hyperparameters n.trees=10,000; shrinkage=0.001; interaction.depth=2; n.minobsinnode=1).
Validation: 10-fold cross-validation for ridge and GBM; out-of-bag for RF; performance via Spearman correlation between predicted and observed values.
Phylogeny-based prediction: Ancestral state estimation (picante) on GTDB-inserted trees pruned to BACL representatives.
Statistics: Spearman rank correlations for predictions; Mantel and partial Mantel tests (10,000 permutations) for relationships among abundance dissimilarity, gene dissimilarity, and phylogenetic distance.
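As a small illustration of the (1 − Spearman)/2 dissimilarity used for both abundance and gene profiles, the sketch below computes it with the Python standard library only. The function names and toy profiles are hypothetical, and the Cailliez-corrected PCoA that follows in the study is not shown.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    first, last = {}, {}
    for position, value in enumerate(sorted(values), start=1):
        first.setdefault(value, position)
        last[value] = position
    return [(first[v] + last[v]) / 2 for v in values]


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_dissimilarity(x, y):
    """Maps rho in [-1, 1] to a dissimilarity in [0, 1]."""
    return (1 - spearman(x, y)) / 2


def dissimilarity_matrix(profiles):
    """profiles: dict of name -> list of per-sample abundances."""
    names = sorted(profiles)
    matrix = [[spearman_dissimilarity(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Toy abundance profiles across five samples, invented for illustration.
profiles = {
    "BACL1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "BACL2": [2.0, 4.0, 6.0, 8.0, 10.0],  # same ranking as BACL1
    "BACL3": [5.0, 4.0, 3.0, 2.0, 1.0],   # reversed ranking
}
names, matrix = dissimilarity_matrix(profiles)
```

With this mapping, perfectly co-varying profiles (BACL1 and BACL2) have dissimilarity 0 and perfectly opposed profiles (BACL1 and BACL3) have dissimilarity 1, so the values can be used directly as distances for ordination.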
Data availability: sequence data are deposited in public archives (ENA/NCBI accessions).
- Genome reconstruction: 1,961 MAGs (mean completeness 90.9%, contamination 2.5%) from 123 samples; plus 83 previous MAGs aggregated into 355 BACLs (352 species-level clusters reported in abstract; main text refers to 355 including previous data). Internal standard Thermus thermophilus recovered in 28/29 spiked samples.
- Read recruitment: MAGs recruited on average 32% of metagenomic reads (97% identity); 36% when excluding the largest (3.0 µm) and smallest (<0.1 µm) fractions, higher than Tara Oceans MAG recruitment (6.8%).
- Taxonomic breadth and novelty: Broad phylogenetic coverage (notably Actinobacteria, Bacteroidetes, Cyanobacteria, Planctomycetes, Proteobacteria, Verrucomicrobia). 320/352 BACLs unclassified to species level; 180 unclassified to genus; 56 to family, indicating substantial novelty. Certain phyla (e.g., Eisenbacteria, Myxococcaeota, SAR324) had large distances to nearest GTDB neighbors.
- Niche distributions: Along the salinity gradient, Actinobacteria and Betaproteobacteria were biased toward low salinity; Alpha- and Gammaproteobacteria toward high salinity. Surface vs mid-depth: Cyanobacteria preferred the surface; Planctomycetes, Crenarchaeota, and Thermoplasmataeota were biased deeper. Size fraction: Actinobacteria and Alpha- and Gammaproteobacteria were underrepresented on the 3.0 µm filter, consistent with a free-living lifestyle; filamentous cyanobacteria (Nostocales, Pseudanabaenales) were enriched on 3.0 µm; Bacteroidetes and Planctomycetes varied.
- Predicting a priori niches from gene content: Significant correlations between predicted and observed placements along salinity, depth, and size-fraction gradients using ML on gene (eggNOG) profiles: Spearman rho ~0.70–0.81 (all P < 1e-16).
- Virtual niche space (PCoA of abundance profiles): PC1–PC3 correlated with temperature, oxygen, and silicate, respectively. Gene-content-based predictions of BACL positions showed significant correlations for PC1–PC10; example rhos: PC1=0.74, PC2=0.70, PC3=0.64 (all P < 1e-16).
- Gene content vs phylogeny: Phylogeny-based ancestral state predictions were significant but generally weaker than gene-content models for 8/10 leading PCs. Pairwise relationships: abundance dissimilarity vs phylogenetic distance rho=0.21; abundance vs gene-profile dissimilarity rho=0.55; gene-profile dissimilarity vs phylogenetic distance rho=0.07; all Mantel P=1e-4. Partial Mantel: abundance vs gene content remained significant controlling for phylogeny (rho=0.21, P<1e-4); abundance vs phylogeny lost significance when controlling for gene content (rho=−0.06, P=1).
- Within-clade predictions: For major clades (Actinobacteria, Bacteroidetes, Alpha- and Gammaproteobacteria), first three PCs predicted with mean rho ~0.61 using gradient boosting, outperforming phylogeny-based predictions.
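For readers unfamiliar with the Mantel statistics reported above, the following is a minimal sketch of a simple one-sided Mantel permutation test on rank (Spearman) correlation, written with the standard library only. It is not the authors' code: the function names, the toy distance matrices, and the permutation count are assumptions, and the partial Mantel variant that controls for a third matrix is not shown.

```python
import random


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    first, last = {}, {}
    for position, value in enumerate(sorted(values), start=1):
        first.setdefault(value, position)
        last[value] = position
    return [(first[v] + last[v]) / 2 for v in values]


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def upper_triangle(matrix, order):
    """Upper-triangle entries after relabeling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Returns (rho, p): Spearman correlation between two distance matrices
    and a one-sided permutation p-value, (hits + 1) / (permutations + 1)."""
    identity = list(range(len(dist_a)))
    ranks_a = average_ranks(upper_triangle(dist_a, identity))
    observed = pearson(ranks_a, average_ranks(upper_triangle(dist_b, identity)))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute the labels of one matrix only
        permuted = pearson(ranks_a, average_ranks(upper_triangle(dist_b, order)))
        if permuted >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: two distance matrices derived from the same 1-D positions,
# so their rank correlation is perfect by construction.
positions = [0, 1, 3, 6, 10, 15]
dist_a = [[abs(p - q) for q in positions] for p in positions]
dist_b = [[(p - q) ** 2 for q in positions] for p in positions]
rho, p_value = mantel(dist_a, dist_b, permutations=999, seed=1)
```

Because the p-value is computed as (hits + 1) / (permutations + 1), its smallest attainable value is set by the number of permutations, which is why a test with about 10,000 permutations bottoms out near 1e-4.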
The findings directly address whether ecological niches can be predicted from genomes: machine learning models using gene content accurately predicted placements of microbial populations along key environmental gradients and within a data-driven niche space, explaining roughly half the variance in leading niche dimensions. Gene-content models generally outperformed phylogeny-based approaches, indicating that specific functional genes capture ecological determinants beyond phylogenetic signal, consistent with lateral gene transfer and functional convergence shaping niches. The results establish a robust genome–niche link and suggest that integrating genomic features into ecological modeling can improve predictions of microbial distributions. This approach enables quantitative niche comparisons without a priori knowledge of all controlling factors and can be extended within clades to resolve finer ecological differentiation. Implications include advancing predictive microbial ecology, informing environmental management (e.g., potential distributions of alien species), and enhancing species distribution models by incorporating genomic data.
This study presents an ecosystem-scale catalog of Baltic Sea bacterioplankton genomes and demonstrates that ecological niche positions—both along defined gradients and within a virtual niche space—can be predicted from genome-encoded functions using machine learning. Gene content provides superior predictive power over phylogenetic information for leading niche axes, underscoring the central role of functional genes in determining ecological distributions. Future work should refine predictions using strain-level genome reconstructions, more complete and cleaner MAGs, finer-grained functional annotations, and potentially sequence variation within genes. The framework is transferable beyond prokaryotes as genomic resources expand for microbial eukaryotes and can be integrated into next-generation species distribution modeling.
- MAG-based gene profiles may be incomplete or contain impurities due to assembly/binning limitations, potentially reducing predictive accuracy.
- Orthologous groups (eggNOG/COGs) can be functionally heterogeneous; finer functional definitions or genotypic variation could improve models.
- Predictions explain about half of the variance in leading niche axes; unmeasured biotic/abiotic factors likely drive remaining variation.
- Lower-rank niche dimensions did not correlate with measured environmental parameters, indicating missing drivers.
- Low-abundance BACLs were excluded in certain analyses, potentially biasing scope.
- Potential confounding from phylogenetic signal, though analyses suggest gene content captures ecology beyond phylogeny.
- Rarefaction suggests additional BACLs remain undetected, so genomic coverage of the ecosystem is not exhaustive.

