Agriculture
Agricultural practices influence soil microbiome assembly and interactions at different depths identified by machine learning
Y. Mo, R. Bier, et al.
Soil microbiomes contribute to crop yields and sustainable agriculture by enhancing nutrient availability, producing growth-promoting hormones, and increasing plant stress resilience. Management practices such as tillage, fertilizers, and cover crops can shape soil microbial communities through both deterministic (environmental selection) and stochastic processes. Traditional multivariate approaches (e.g., PCoA, RDA) assume linear or unimodal species–environment responses and often cannot capture non-additive effects or complex nonlinear structures. With the advent of machine learning (ML) and interpretable frameworks (e.g., SHAP), there is an opportunity to better elucidate microbial assembly mechanisms and interactions. This study asks: (1) which abiotic and management components most strongly affect assembly of soil prokaryotic and fungal communities and how can ML facilitate interpretation, and (2) to what extent soil microbiomes carry signals that indicate agricultural practices. The work leverages ML-based canonical analysis, Random Forests, and SHAP to evaluate environmental and management contributions across soil depths in a long-term field trial.
Prior studies highlight the ecological and economic functions of soil microbiomes in agriculture and the influence of management practices on microbial structure and function. Deterministic processes linked to environmental gradients and stochastic processes both contribute to community assembly, with stochasticity potentially more influential in later successional stages. Traditional ordination methods often fail to capture complex nonlinear relationships and interactions among environmental drivers. Explainable ML approaches such as SHAP have been introduced to interpret complex models used in microbiome research. Co-occurrence networks, while widely used to infer microbial interactions, can be difficult to interpret and may not capture higher-order interactions; integrating networks with ML can better quantify taxa contributions to module formation. Neutral community models (NCM) can complement ML by quantifying the role of stochastic processes not explicitly modeled by ML. Previous linear analyses at the same field site emphasized depth signals over management effects, potentially missing nonlinear patterns that ML can reveal.
Study site and design: Samples were collected January 28–29, 2019, from the Rodale Institute Farming Systems Trial (Kutztown, PA, USA; established 1981; Clarksburg silt loam, Alfisol; ~3% slope). Agricultural treatments included fertility source (organic legume, organic manure, synthetic inorganic fertilizer), tillage (full or reduced), and winter cover (cover crop or fallow for synthetic fertilizer), yielding 8 treatments with 3 replicate fields each (0.05 ha). Sampling and physicochemical data: Four 4.58 cm-diameter hydraulic soil cores per replicate were collected to 1 m and composited into depth sections: 0–10, 10–20, 20–30, and 30–60 cm, producing 96 samples. Soil physicochemical analyses included gravimetric water content, bulk density, stability index, CEC and base cation saturations, pH/acidity, macro- and micronutrients, loss-on-ignition organic matter, total C/N/H, mineralizable C, permanganate-oxidizable C, and autoclaved-citrate extractable protein. Sequencing and processing: DNA (0.25 g soil) was extracted (Qiagen DNeasy PowerSoil). Prokaryotes were amplified with 16S rRNA V4 primers 515F/806R; fungi with ITS3/ITS4 targeting ITS2. Libraries were sequenced on Illumina NovaSeq 6000. Reads were processed in R with dada2 to generate ASVs; taxonomy was assigned with SILVA v138.1 (prokaryotes) and UNITE v8.3 (fungi). Contaminants were removed with decontam (16S threshold 0.52, ITS threshold 0.1); chloroplast and mitochondrial ASVs were removed. Data deposition: SRA PRJNA635685. ML-based canonical analysis: Bray–Curtis distance matrices were reduced using t-SNE (perplexity tuned: 83 prokaryotes, 90 fungi) to maximize explained variance; degree of locality preservation was compared to PCoA. Ordinations were min–max rescaled to [-1,1]. Random Forest regression (10-fold CV) modeled t-SNE coordinates using categorical features (depth, fertility, tillage, cover) and numeric features (organic matter, C:N, total C, total H, total N, bulk density; note: some properties missing at 30–60 cm). SHAP values (TreeExplainer) quantified feature importance; significance was assessed against random features via t-tests (p<0.05). Neutral community model: NCM related ASV occurrence frequency to mean relative abundance, estimating immigration rate m and Nm (dispersal) via nonlinear least squares with 95% binomial CIs; fits (R²) were computed across all depths, by depth, and by tillage intensity. RF for ASV abundance prediction: For each depth, RF regressions predicted relative abundance of common ASVs (present in ≥50% of samples in that depth). Two feature sets were used: abiotic (24 soil properties + 3 management factors) and biotic (abundances of other ASVs). Ten-fold CV assessed R²; ASVs with R²>0.2 were deemed predictable (threshold from 1000 random-input simulations, 95% CI). Co-occurrence networks: For each depth, networks were constructed from common ASVs using Spearman correlations with thresholds from random matrix theory (|ρ|≥0.76) and FDR-adjusted p≤0.005. Topology metrics (edges, degree, density, clustering coefficient, modularity) were computed (igraph). Sensitive ASVs were identified via correlation-based indicator species analysis (indicspecies; 1000 permutations, p<0.05) and differential abundance (edgeR LRT; FDR<0.05); ASVs significant by both were deemed sensitive to a practice (fertility, tillage, cover). Modules were defined using greedy modularity optimization; fertility-sensitive modules required >5% of network nodes and >25% sensitive ASVs. Wilcoxon tests compared cumulative relative abundances among fertility sources. Hub taxa identification: Within modules, RF models predicted ASV abundances using biotic (module ASVs) and abiotic features; SHAP values quantified the impact of each ASV on others. The top 10% ASVs by summed absolute SHAP impact on predictable targets (R²>0.2) were designated hub taxa; models retrained on hub taxa were evaluated. Structural redundancy analysis compared SHAP-derived hubs versus degree-based hubs by correlating Bray–Curtis similarity matrices with the full module (higher Pearson r indicates better consistency). Classification of practices and biomarker selection: RF classifiers (n_estimators=1000; 10-fold CV; ovo ROC AUC) used common ASV abundances to classify fertility source (3 classes), tillage (2 classes), and cover cropping (2 classes) at each depth. To reduce overfitting, SHAP-based feature selection retained the top 50 ASVs per fold; models were retrained on selected features and combined across folds to compute AUC and prioritize primary biomarkers (top 50 by accumulated test-set SHAP).
Ordination and drivers: t-SNE outperformed PCoA in preserving nonlinear community structure, especially for fungi. Ordination explained variances were high (prokaryotes 81.6%; fungi 63.3%). Mapped analyses indicated environmental and management variables collectively explained major variation in ordination coordinates (prokaryotes 79.1%; fungi 72.6%). SHAP-based feature importance identified fertility source, organic matter, depth, gravimetric water content, bulk density, C:N, total C/N/H as significant (p<0.05). Fertility source was the dominant factor for fungal community variation, while depth and organic matter dominated for prokaryotes. Tillage significantly impacted prokaryotes; cover crops significantly impacted fungi, both with low importance. Neutral processes and depth: Across all depths, NCM fit was strong (prokaryotes R²=0.836, Nm=1920; fungi R²=0.788, Nm=851). Depth-specific NCMs showed higher immigration rates m in surface layers that declined with depth. Prokaryotes: R²/m at 0–10=0.855/0.0139; 10–20=0.858/0.0137; 20–30=0.825/0.0108; 30–60=0.782/0.00788. Fungi: R²/m at 0–10=0.780/0.00536; 10–20=0.781/0.00487; 20–30=0.698/0.00230; 30–60=0.535/0.00117. Tilled soils (0–20 cm) had higher m and R² than reduced-till, indicating enhanced dispersal and stochasticity. Predictability of ASV abundances: Biotic factors predicted a larger share of ASVs than abiotic factors at all depths. Prokaryotes predictable by biotic (%): 0–10=39.7, 10–20=28.3, 20–30=34.4, 30–60=17.0; by abiotic: 15.2, 15.4, 8.5, NA. Fungi predictable by biotic: 22.8, 15.4, 12.2, 2.8; by abiotic: 13.5, 8.4, 5.9, NA. Predictive accuracy with biotic features exceeded abiotic; predictability decreased with depth. Co-occurrence networks: Only 36.7% of nodes and 2.06% of edges recurred across depths, indicating strong depth stratification. The 30–60 cm network had the highest average degree (7.85), edge density (0.00664), and clustering coefficient (0.706), indicating tighter associations at depth; the 10–20 cm network was sparsest. Edge density increased with depth (R²=0.959, p=0.020). Bacteria–bacteria and fungi–fungi edge proportions varied significantly with depth (bacteria–bacteria R²=0.971, p=0.014; fungi–fungi R²=0.980, p=0.010). Bacteria–bacteria connections dominated overall. Practice-sensitive ASVs and modules: Tillage-sensitive ASVs concentrated at 0–20 cm; fertility- and cover-sensitive ASVs occurred across depths, with fertility-sensitive ASVs mostly within 0–30 cm and cover crop-sensitive ASVs mainly at 30–60 cm. Fertility- and cover-sensitive ASVs exhibited higher average degree than network means, while tillage-sensitive ASVs had lower degrees. Fertility-sensitive modules (synthetic fertilizer and legume) were identified at 0–10, 10–20, and 20–30 cm; manure formed no distinct module. Synthetic modules were enriched under synthetic fertilization; legume modules were enriched under legume or manure. Hub taxa and interactions: Within fertility modules, a large fraction of ASV abundances were mutually predictable with biotic features (69.9–86.3%), whereas abiotic predictability was limited (6.4–38.6%). SHAP-derived hub taxa (top 10% by impact) were predominantly fungi. Models retrained on SHAP hubs predicted 50.0–71.9% of module ASVs. Structural redundancy analysis showed SHAP hubs matched module variation closely (r=0.916–0.979), outperforming degree-based hubs (r=0.746–0.925). Taxonomic patterns included archaeal hubs (Nitrosotaleaceae, Nitrososphaeraceae) and bacterial hubs (e.g., Bacteroidota, Proteobacteria, Verrucomicrobiota) differing by depth and module. Classification of practices and biomarkers: Microbial communities at all depths carried signals to differentiate fertility sources, strongest at 0–10 cm (AUC: prokaryotes 0.968; fungi 0.996), with predictive power decreasing after 20 cm. Tillage was identifiable only at 0–20 cm; cover crop signals were weaker but >0.5 AUC at 0–10 and 10–20 cm. SHAP-based biomarker selection improved performance: fertility source AUC (0–60 cm) prokaryotes 0.905, fungi 0.960; tillage (0–20 cm) 0.896 and 0.917; cover crop (0–20 cm) 0.778 and 0.750. Primary biomarkers: Fungal biomarkers for fertility were enriched under organic sources (families Chaetomiaceae, Nectriaceae, Pyronemataceae; also Corynesporascaceae, Hypocreales, Magnaporthaceae, Orbiliaceae). Prokaryotic fertility biomarkers were more evenly distributed, with Microscillaceae, Nitrosomonadaceae, and Gemmatimonadaceae prominent; Nannocystis was elevated under legume/manure, Nitrospira under manure, and Acidibacter under synthetic fertilizer. Tillage biomarkers included fungal Lasiosphaeriaceae, Didymellaceae (more abundant under full tillage), and Nectriaceae (reduced tillage), and prokaryotic Nitrosomonadaceae, Gemmatimonadaceae, Roseiflexaceae, and Chitinophagaceae (often enriched under full tillage). Cover crop biomarkers at 0–20 cm included fungal Bolbitiaceae and Herpotrichiellaceae (cover), Helotiales, Dictyosporiaceae, Tubeufiaceae (no cover), and prokaryotic classes Dehalococcoidia, Acidimicrobiia, Oligoflexia, and family Rhizobiaceae (Oligoflexia indicated no cover).
ML analyses highlighted fertility source as the dominant management factor shaping fungal communities across soil profiles, contrasting with traditional linear analyses that emphasized depth for both domains. For prokaryotes, depth and organic matter covary and strongly structure communities, aligning partially with traditional findings but clarifying nonlinear depth–OM effects. Tillage effects were localized to surface soils (0–20 cm), where neutral processes and dispersal were enhanced, likely due to physical disturbance and increased hydrologic connectivity; in contrast, deeper layers showed reduced neutral model fits and lower predictability, reflecting greater heterogeneity and dispersal limitation. Fertility source and cover cropping introduced organic inputs that fostered more cohesive microbial co-occurrence patterns and fertility-sensitive modules dominated by fungi, suggesting organic amendments initially affect specific fungal taxa and cascade through the community. Integrating NCM with ML provided a more holistic view, acknowledging that ML may not capture stochasticity. SHAP-based interpretation overcame linearity assumptions, quantified feature contributions to ordination, and identified hub taxa whose dynamics closely tracked module-level changes, outperforming degree-based network hubs. Reliable microbial biomarker sets were identified for management practices, especially fertility sources, enabling accurate classification across depths and providing candidate taxa for monitoring or management.
By combining ML with traditional ecological modeling, the study reveals that soil microbial community assembly in a long-term agricultural trial is jointly shaped by deterministic factors (notably fertility source for fungi; depth and organic matter for prokaryotes) and stochastic processes that are strongest in surface soils and enhanced by tillage. Fertility sources recruit distinct, fungi-dominated co-occurrence modules whose influence diminishes with depth. ML-based canonical analysis and SHAP yielded nuanced insights beyond linear methods, improved identification of key environmental and management drivers, and enabled robust biomarker discovery for agricultural practices. Future work should expand environmental measurements to capture fine-scale heterogeneity, increase sampling depth and breadth to strengthen ML generalization, and integrate experimental manipulations to validate inferred interactions and biomarker functionality.
ML models can be difficult to interpret and often require large sample sizes; although SHAP improves interpretability, the approach may still miss stochastic processes, necessitating complementary use of NCM. Modeling environmental properties together with management practices may understate management effects since practices influence soil properties. Environmental variables exhibited low within-layer heterogeneity, limiting abiotic predictability at individual depths. The scope of measured variables focused on bulk soil properties and may not capture centimeter-scale spatial heterogeneity influencing microbes. Co-occurrence networks do not prove biological interactions and can be sensitive to sampling and compositing. Predictive power and network inference decreased with depth, potentially reflecting increased heterogeneity and limited sample sizes for complex models.
Related Publications
Explore these studies to deepen your understanding of the subject.

