Agriculture
The patterns of deleterious mutations during the domestication of soybean
M. Kim, R. Lozano, et al.
Soybean (Glycine max) is a globally important source of protein and oil, domesticated from wild Glycine soja in East Asia ~7000–9000 years ago. Both wild and domesticated soybean are predominantly self-pollinating, a mating system that can rapidly fix mutations and, together with domestication bottlenecks, reduce effective population size and alter the efficacy of selection. Prior work cataloged genome-wide SNP variation using arrays and resequencing but lacked integrated resources for large-scale genotype imputation, and wild soybean variation remained relatively underexplored. Deleterious mutations have been hypothesized to accumulate during domestication in many outcrossing crops, contributing to genetic load, but this pattern is unclear in selfing species like soybean. The study aims to characterize genome-wide variation, identify domestication-selective sweeps, quantify patterns of putatively deleterious mutations across wild-to-domesticated continua, and test whether an expanded haplotype map enhances GWAS resolution for seed protein and oil.
After the soybean reference genome release, SNP arrays and WGS expanded catalogs of common and rare variants, though prior datasets were not fully integrated for high-density imputation. In many outcrossing species (maize, cassava, dogs, grape), domestication and improvement are associated with increased deleterious variant burden ('cost of domestication'). However, reproduction mode differs across crops: maize is outcrossing, rice and sorghum cultivated forms are more selfing than their wild progenitors, and cassava and grape are outcrossing but clonally propagated. The implications for genetic load in a consistently selfing species like soybean remained unresolved. Earlier soybean studies indicated substantial diversity loss during domestication and broad LD patterns, but selective sweeps and deleterious mutation distributions with dense WGS across many wild accessions were less characterized. Imputation has improved GWAS in other species (rice, Arabidopsis, human), motivating similar approaches in soybean.
Plant materials and sequencing: The study assembled 855 WGS samples from 833 accessions representing global soybean diversity with strong East Asian representation. After QC, 74 samples with high heterozygosity or inbreeding coefficient <0.8 were excluded, yielding 781 accessions: 418 G. max (332 landraces, 86 improved), 345 G. soja, and 18 natural hybrids. DNA libraries (TruSeq DNA PCR-Free, ~500 bp inserts) were sequenced on Illumina HiSeq 2500/4000 (2×151 bp), targeting ≥10× coverage (achieved mean 14.09–61.27×). Read mapping and variant calling: Reads were QC’d (FastQC), aligned to Wm82.a2.v1 with BWA-MEM (-M), duplicates marked (Picard), base recalibrated and variants called using GATK (v4.0.1.2) joint calling and filtration. SNP quality filters included ReadPosRankSum < -2.0, MQRankSum < -2.0, QUAL < 30, QD < 3, FS > 30, MQ < 30, DP < 100 (site), and per-genotype DP < 5, GQ < 10; biallelic only; repetitive-region masking via high-depth thresholds; heterozygous allele balance filter. Indels were filtered with ReadPosRankSum < -20, QUAL < 30, QD < 2, FS > 200, biallelic only. Post-filtering retained 10,597,683 high-quality SNPs (for population analyses) and 1,436,499 indels (1,414,161 ≤50 bp; 22,338 >50 bp structural variants). For mutation load analyses, a broader set of 30,753,511 SNPs (no 1% MAF filter) was used. False-positive rate estimated via Williams 82 variants was <0.01%; detection power at 1% MAF >99% versus 180K array. Population structure and diversity: PCA (Eigensoft), FastStructure (K=2–12), and NJ tree (MEGA7) described structure. Diversity metrics (π, Tajima’s D, SNP/indel/SV densities) were computed in 100 kb windows (VCFtools). Recombination: historical ρ estimated with FastEPRR; empirical recombination (R) from a Williams 82K × IT182932 RIL map (~38,000 crossovers) used for comparison (Spearman ρ=0.256, P=7.945e-16). LD decay computed with PopLDdecay; genome-wide LD scores with GCTA. Circos plots summarized genomic landscapes. Selective sweeps: XP-CLR (updated version) scanned for domestication signals comparing 418 domesticated vs 345 wild accessions. Missing genotypes were imputed (Beagle v5.0). Genetic positions were based on an RIL genetic map; highly correlated markers down-weighted. XP-CLR scores were computed in 100 bp bins (max 200 SNPs per 0.05 cM window; correlation threshold 0.7). The top 5% genomewide XP-CLR windows (>89.4) merged into 183 sweep regions; candidate gene per sweep assigned to the nearest gene to the peak window. Sweeps were compared against domestication-related QTL from Williams 82 × PI 479752 mapping. Constraint/deleteriousness annotations: Multi-species alignment of 12 angiosperms (including 6 Fabaceae) to Wm82.a2.v1 with LASTz/MULTIz produced GERP++ scores with the soybean reference projected out to minimize bias. Evolutionarily constrained genome fractions were defined (GERP > 0 and >2). Deleterious mutations for burden analyses were primarily nonsynonymous CDS SNPs (including stop-gain) with GERP > 2. As sensitivity analyses, partitions by synonymous/nonsynonymous and GERP 2–4 vs >4 were also considered. Ancestral/derived polarization and burden estimation: Outgroups Phaseolus vulgaris and Vigna radiata anchored allele polarization of 1,187,829 CDS SNPs; est-sfs inferred posterior probabilities of derived vs ancestral states. Burden was summarized as per-accession counts of derived deleterious alleles. SIFT 4G (UniRef-based database; EnsemblPlants GTF) annotated nonsynonymous SNPs; deleteriousness defined as SIFT < 0.05. Reference bias in SIFT was corrected following Simons et al. for sites where the reference carried the derived allele. Demography: PSMC on pseudodiploids constructed from pairs of eight high-coverage domesticated and eight wild accessions (various East Asian origins) inferred changes in effective population size (Ne) over time (mutation rate 1.5×10^-8 per site per year; generation time 1 year). Of 28 pairwise pseudodiploids per group, 13 domesticated and 13 wild with acceptable profiles (wild had 15 spurious profiles removed) were retained. Imputation and GWAS: SoySNP50K VCF was curated (orientation corrections, exclusions), and 12,116 accessions from prior GWAS were filtered to 8,844 non-duplicated, high-quality G. max accessions (IBD >0.98 removed duplicates; per-SNP and per-sample QC). Using 481 G. max WGS genomes as a reference panel and a fine-scale recombination map, 4,467,134 SNPs were imputed with Beagle; after filters (Beagle r^2 < 0.3, MAF < 0.01), 3,082,234 SNPs remained (median Beagle r^2 ~0.95). GWAS for seed protein and oil used GEMMA LMM and mvLMM with FDR 5% (Benjamini–Hochberg) on both unimputed (SoySNP50K) and imputed datasets; MLMM with LD-pruned imputed set (291,388 markers) assessed multi-locus signals. Manhattan plots compared signals and examined known candidate gene regions (e.g., GmSWEET39).
- Variant catalog: 10,597,683 high-quality SNPs and 1,436,499 indels were identified across 781 accessions (418 domesticated, 345 wild, 18 hybrids); variant calling false-positive rate <0.01%; SNP detection power at 1% MAF >99%.
- Diversity and LD: Nucleotide diversity π was ~0.0023 in G. soja vs 0.0012 in G. max (≈48% reduction in domestication). LD decayed to r^2 < 0.2 within ~9–11 kb in wild soybean and extended to ~97 kb in domesticated soybean; pericentromeric regions showed much slower LD decay (~97 kb) than euchromatic regions (~7 kb). Historical recombination rate ρ correlated with empirical R (Spearman 0.256, P=7.9e-16). Genomic landscapes of gene/SNP/indel densities and GERP scores mirrored recombination patterns.
- Selective sweeps: 183 domestication-selective sweep regions (mean ~368 kb; ~20 genes per sweep) covering 6.4% of coding sequence were detected by XP-CLR. Sweeps overlapped with ~70% of domestication-related QTL regions from an interspecific mapping study; 13 of 17 QTL with >5% PVE corresponded to sweeps. A large pericentromeric sweep (~13 Mb) on chromosome 1 exhibited low diversity and highly negative Tajima’s D in domesticated soybean.
- Deleterious burden (GERP-based): 24.3% (237.5 Mb) of the genome was constrained (GERP > 0); 11.4% (111.5 Mb) highly constrained (GERP > 2). Of CDS SNPs, 41.1% overall and 48.9% of nonsynonymous (incl. stop) had GERP > 2. Using 742,149 nonsynonymous deleterious SNPs (GERP > 2): domesticated landraces carried 7.1% fewer derived deleterious alleles than wild accessions (P < 2.2e-16), and improved lines had an additional 1.4% reduction vs landraces (P = 0.0003). In sweep regions, domesticated soybean had 11.7–11.8% fewer deleterious alleles than wild (P < 2.2e-16). Within each population, sweep regions had only a modest 2.4% lower deleterious burden than control regions (P = 5.4e-13). Results were robust across deleteriousness thresholds (GERP 2–4 vs >4; synonymous vs nonsynonymous subsets).
- SIFT-based burden: Using 315,029 nonsynonymous SNPs with SIFT < 0.05 (with reference-bias correction), trends paralleled GERP-based results, though greater apparent decreases in domesticated vs wild (≈37–68%) suggested residual reference bias in SIFT. The reference cultivar Williams 82K showed unusually low SIFT-based burden but only the fifth lowest by GERP, supporting GERP’s reduced reference bias.
- Candidate domestication genes: 29 genes harbored nonsynonymous alleles nearly fixed in domesticated (AF > 0.99) and rare in wild (AF < 0.01); 24 of these lay within XP-CLR sweeps and five within 200 kb of sweeps. Many such SNPs were tolerated by SIFT (>0.05) but had GERP > 2 (21/29 genes), underscoring differences between metrics.
- Demography: PSMC on pseudodiploids indicated domesticated soybean underwent a continuous Ne decline starting ~15,000 years ago, reaching a bottleneck ~5000–9000 years ago (Ne minimum ~3,500, with bottleneck sizes ranging up to ~9,000), coinciding with domestication, followed by rapid growth. No severe bottleneck was inferred for wild soybean.
- Imputation and GWAS: Imputing 4.47 million SNPs into 8,844 accessions (3.08 million high-quality; median Beagle r^2 ~0.95) increased the number and strength of significant associations and refined peak positions for seed oil and protein. Several novel minor peaks appeared with imputed data (not retained under MLMM), suggesting weak-effect signals or complex LD. At the known chromosome 15 locus, imputation revealed a dense, highly significant cluster of 34 SNPs within the GmSWEET39 gene region (−log10 p > 27.86) where unimputed data showed a non-significant valley, facilitating candidate gene pinpointing.
The study addresses how domestication and improvement shaped deleterious mutation patterns in a predominantly selfing crop. Contrary to the ‘cost of domestication’ commonly observed in outcrossing species, domesticated soybean exhibits a reduced deleterious burden relative to its wild progenitor, with further reductions in improved lines. This pattern likely reflects selfing-mediated exposure and purging of recessive deleterious alleles, alongside artificial selection that favored haplotypes with fewer deleterious mutations, especially in sweep regions. The modest within-population reduction of deleterious burden in sweeps versus the genomic background suggests selection acted on haplotypes already depleted of deleterious alleles rather than tolerating deleterious hitchhikers. Demographic reconstructions revealed a domestication bottleneck contemporaneous with reduced diversity and extended LD, consistent with strong selfing and selection. The comprehensive WGS-based haplotype map enabled accurate genotype imputation into a large germplasm panel, enhancing GWAS power and resolution and improving localization of known causal gene regions (e.g., GmSWEET39). Collectively, these findings refine our understanding of how mating system modulates deleterious load during domestication and demonstrate practical benefits of dense reference panels for trait mapping in selfing crops.
This work delivers a high-resolution soybean haplotype map (10.6M SNPs; 1.4M indels) across 781 accessions, identifies 183 domestication-selective sweeps, and demonstrates that domestication and improvement in a selfing species reduced deleterious mutation burden—by ~7.1% from wild to landraces and an additional ~1.4% to improved lines—particularly within sweep regions. The reference panel enables accurate imputation (3.08M high-quality SNPs) into large diversity panels, improving GWAS discovery and fine-mapping, as exemplified at GmSWEET39 for seed oil. Future research should: (1) integrate additional wild and underrepresented geographic accessions to capture untapped variation; (2) functionally validate candidate domestication genes and deleterious variants; (3) extend purging-aware breeding strategies to introgress beneficial wild alleles while minimizing deleterious load; and (4) apply similar frameworks to other selfing crops with large genomes (e.g., wheat, barley) as resources mature.
- Reference bias: SIFT-based deleteriousness retained residual reference bias despite correction, inflating differences; GERP mitigated this by projecting out the soybean reference but still relies on multi-species alignment quality.
- Sweep-QTL comparisons: Some QTL span large, low-recombination pericentromeric regions (>20 Mb), limiting resolution and making overlaps suggestive rather than definitive.
- Demography: PSMC on pseudodiploids in wild soybean produced spurious profiles for many within-country pairs (15/28), reducing the number of reliable trajectories; PSMC has limited recent-time resolution.
- Sampling and exclusions: 74 samples with unexpected heterozygosity were excluded; although the final set is globally representative, it is enriched for East Asian accessions, potentially biasing patterns.
- GWAS signals: Novel minor peaks from imputed GWAS did not remain under MLMM, indicating weak effects or confounding from long-range LD; causal inference remains to be validated experimentally.
- Within-population sweep effect: Deleterious burden reductions within sweeps versus genomic background were modest (~2.4%), so selection on reduced-load haplotypes may be subtle and context-dependent.
Related Publications
Explore these studies to deepen your understanding of the subject.

