Diversity analysis of 80,000 wheat accessions reveals consequences and opportunities of selection footprints

Agriculture


C. Sansaloni, J. Franco, et al.

This study characterizes the genetic diversity of nearly 80,000 wheat accessions, including landraces and crop wild relatives, to inform future breeding programs. Conducted by Carolina Sansaloni, Jorge Franco, and colleagues, it shows how largely unexplored diversity and selection footprints could help breeders develop wheat varieties that withstand climate challenges and support growing populations.
Introduction

Wheat is the most widely grown crop globally and a major source of calories, protein, and micronutrients. However, over 10,000 years of domestication and modern breeding have narrowed the genetic diversity of elite wheat germplasm, challenging progress toward climate resilience and sustainable yield gains. Ex situ genebanks at CIMMYT and ICARDA collectively conserve large numbers of crop wild relatives (CWR) and landraces that harbor untapped diversity, but their effective use is limited by difficulty identifying accessions with favorable alleles and by linkage drag during introgression. Advances in high-throughput genotyping enable large-scale characterization and informed selection of diverse materials. This study aims to (1) characterize global genetic diversity in nearly 80,000 wheat accessions across three biological categories (CWR, domesticated tetraploids, and domesticated hexaploids); (2) map DArTseq-derived SNP and SilicoDArT variants to the IWGSC RefSeq v1.0 bread wheat genome, the Svevo durum genome, and a DArT consensus genetic map; and (3) validate the approach by examining genomic regions associated with key agronomic traits and uncover selection footprints relevant to modern breeding.

Literature Review

The study builds on prior work documenting domestication, diversity loss, and breeding history in wheat, and on the importance of CWR and landraces as reservoirs of adaptive variation. Previous genebank genomics efforts in barley and wheat demonstrated the utility of genome-wide marker data for diversity assessment, core subset formation, and genomic prediction. Earlier studies identified major QTL for grain protein content and quality traits and highlighted the role of introgressions (e.g., rye 1B/1R translocation) in wheat improvement. Strategies such as core collections and FIGS have been used to mine diversity; recent approaches integrate passport, environmental, and genomic data to target breeder-relevant alleles.

Methodology

Plant materials comprised 79,191 wheat accessions from CIMMYT (50,053) and ICARDA (29,138): 56,342 domesticated hexaploids (AABBDD), 18,946 domesticated tetraploids (AABB), and 3,903 crop wild relatives spanning 27 Aegilops–Triticum species. Five seeds per accession were germinated; leaves from a single plant per accession were harvested and lyophilized, and DNA was extracted with a modified CTAB protocol. DNA samples were tracked using CIMMYT’s DNA Sample Tracker and DArTdb LIMS. High-throughput genotyping used DArTseq™ (PstI–HpaII digestion/ligation, barcoded libraries, PCR amplification, pooling, and 77-cycle Illumina HiSeq 2500 sequencing). Proprietary DArTsoft14 pipelines performed de novo clustering and marker calling to generate SNP and SilicoDArT (dominant presence/absence) markers, with technical replicates (23% of samples) used to compute reproducibility. Marker filters included call rate ≥0.5 and reproducibility ≥0.95 for SNPs (SilicoDArT call rate >0.8); downstream analyses filtered SNPs by missingness ≤0.5 and MAF >0.001 (wild relatives: accession missingness ≤0.75). Marker sequences were aligned to reference genomes using a reciprocal Bowtie2 strategy: end-to-end and local alignments to IWGSC RefSeq v1.0 (all groups) and to the Svevo durum genome (tetraploids); alignments were retained when both alleles agreed reciprocally with minimal mismatches. SNP effects were annotated using SnpEff; density statistics were computed with vcftools. A DArT consensus genetic map (v4) was constructed from 81 individual maps using an iterative integration algorithm requiring correlation >0.5 and ≥3 shared markers per group, yielding 105,122 mapped markers across 21 chromosomes. Diversity analyses were run separately per biological group: SNPs for the hexaploid and tetraploid sets, and SilicoDArTs for CWR (better suited to deep phylogeny). After filtering, genetic distances were computed using modified Rogers’ distance (MRD) for SNPs and Jaccard distance for SilicoDArTs.
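The two distance measures named above can be sketched as follows. This is a minimal illustration, not the study's pipeline: it assumes biallelic SNPs coded as per-accession allele frequencies (0, 0.5, 1), SilicoDArTs coded 1/0, and NaN for missing calls. The function names are hypothetical.

```python
import numpy as np

def modified_rogers_distance(x, y):
    """Modified Rogers' distance between two accessions.

    x, y: per-marker frequency of one allele (0, 0.5 or 1; NaN = missing).
    Only markers called in both accessions are used (an assumption here).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    shared = ~np.isnan(x) & ~np.isnan(y)
    if not shared.any():
        return float("nan")
    diff = x[shared] - y[shared]
    # For biallelic markers the squared differences summed over both alleles
    # equal 2 * diff**2, which cancels the 1/(2m) factor in the definition.
    return float(np.sqrt(np.mean(diff ** 2)))

def jaccard_distance(a, b):
    """Jaccard distance between two presence/absence profiles (1/0; NaN = missing)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)
    a1 = a[shared] == 1
    b1 = b[shared] == 1
    union = np.sum(a1 | b1)
    if union == 0:
        return 0.0  # no marker present in either profile: treat as identical
    return float(1.0 - np.sum(a1 & b1) / union)

print(round(modified_rogers_distance([0, 1, 1, 0.5], [1, 1, 0, 0.5]), 4))  # 0.7071
print(round(jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0]), 4))              # 0.6667
```

Both measures range from 0 (identical profiles) to 1; Jaccard ignores shared absences, which suits dominant markers whose absence is less informative than their presence.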
Multidimensional scaling (MDS) visualized diversity (classical MDS for hexaploids; SMACOF for tetraploids and CWR). Iterative hierarchical clustering (Ward’s method on MRD) defined nested clusters; the number of clusters was guided by changes in a pseudo-F statistic and analysis of molecular variance (AMOVA). Pairwise fixation indices (FST) were calculated per 1 Mb sliding window using vcftools for each split across 21 chromosomes to identify selection footprints. ADMIXTURE analyses were conducted on LD-thinned SNPs (MAF >5%) across multiple K to complement clustering. Core subsets representing ~20% of each collection were formed using stratified random sampling proportional to within-cluster diversity (D-method), selecting the subset with maximum average pairwise distance after 1,000 candidates. Misclassification diagnostics in the combined cores used the percentage of SilicoDArT markers mapping to A, B, and D subgenomes. For GWAS, 3,870 accessions were phenotyped for grain protein content (GPC) and SDS sedimentation. SNPs were filtered (RepAvg ≥0.95, MAF ≥0.01, missing ≤0.6, heterozygosity ≤0.2), aligned to reference and consensus maps, retaining 35,141 mapped SNPs. Association scans used the FarmCPU method in R with kinship correction and the first three PCs as fixed effects; multiple testing control included Bonferroni and FDR thresholds. Visualization used Flapjack and CurlyWhirly; data were deposited in CIMMYT Dataverse and Germinate.
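The core-subset step can be sketched as below. This is a simplified reading of the description above, not the authors' implementation: it assumes a precomputed distance matrix and cluster labels, allocates sample sizes in proportion to each cluster's mean within-cluster distance, and keeps the candidate with the largest mean pairwise distance. The names `d_method_core` and `mean_pairwise` are hypothetical.

```python
import numpy as np

def mean_pairwise(dist, idx):
    """Mean of the pairwise distances among the accessions in idx."""
    idx = np.asarray(idx)
    if idx.size < 2:
        return 0.0
    sub = dist[np.ix_(idx, idx)]
    upper = np.triu_indices(idx.size, k=1)
    return float(sub[upper].mean())

def d_method_core(dist, labels, fraction=0.2, n_candidates=1000, seed=0):
    """Stratified random core subset; returns (sorted indices, mean distance)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    members = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    diversity = np.array([mean_pairwise(dist, m) for m in members])
    if diversity.sum() > 0:
        weights = diversity / diversity.sum()
    else:
        weights = np.full(len(members), 1.0 / len(members))
    total = max(1, round(fraction * labels.size))
    # At least one accession per cluster and never more than the cluster holds;
    # rounding means the final size is only approximately fraction * n.
    sizes = [min(len(m), max(1, int(round(w * total))))
             for m, w in zip(members, weights)]
    best, best_score = None, -1.0
    for _ in range(n_candidates):
        cand = np.concatenate([rng.choice(m, size=k, replace=False)
                               for m, k in zip(members, sizes)])
        score = mean_pairwise(dist, cand)
        if score > best_score:
            best, best_score = np.sort(cand), score
    return best, best_score
```

Drawing many random candidates and keeping the most dispersed one is a cheap way to favor diverse subsets without solving an exact optimization problem over all possible subsets.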

Key Findings
  • Scale and marker resources: 79,191 accessions genotyped, including 56,342 hexaploids, 18,946 tetraploids, and 3,903 CWR from 109, 75, and 55 countries, respectively. Over 300,000 high-quality SNP and SilicoDArT markers were generated. Unique mapping rates: hexaploids 70% (66,067 markers), tetraploids 68% (30,806) to RefSeq v1.0 and 69% (31,181) to Svevo, CWR 50% (28,054) to RefSeq v1.0. Approximately 50% of markers linked to genes; most SNPs fall in intergenic, non-repetitive regions.
  • Genetic map anchoring: 105,122 markers distributed across 21 chromosomes (mean ~5,006 per chromosome). Proportions of markers with genetic map positions: SNPs—hexaploid 52.03% (44,501), tetraploid 53.29% (24,185), CWR 33.61% (18,738); SilicoDArTs—hexaploid 89.0% (23,571), tetraploid 70.5% (18,711), CWR 30.9% (19,022).
  • SNP and PAV counts after filtering (Table 1): hexaploid—85,531 SNPs (50,068 with MAF ≥1%, 28,078 with MAF ≥5%) and 26,507 SilicoDArTs; tetraploid—45,376 SNPs (38,935 ≥1%, 25,084 ≥5%) and 26,526 SilicoDArTs; CWR—55,739 SNPs (51,626 ≥1%, 39,907 ≥5%) and 61,505 SilicoDArTs.
  • Hexaploids: MDS and clustering show extensive unexplored diversity in landraces relative to elite germplasm; 70.1% of landraces lie at MRD >0.24 from the average of the elite lines. ADMIXTURE (K=8) identified groups corresponding to traditional landraces (Mexico, Iran, Turkey, Pakistan), modern Mexican landraces, primary synthetics (Syn A), synthetic derivatives (Syn B), and elites (including 2,260 elite lines and 4,400 elite–modern landrace admixtures). Clustering into 12 groups further differentiated traditional vs modern landraces and synthetics. Two small outlier subgroups are likely misclassified tetraploids (they lack D-genome markers).
  • Tetraploids: Elite durum lines are broadly distributed across landrace diversity except for a distinct cluster of several hundred Ethiopian landraces. MRD from elites among landraces: 7.8% close (MRD <0.20), 55.9% intermediate (0.20–0.30; enriched for Turkey and Iran), 32.9% higher (0.30–0.35; 92% Ethiopia), 3.4% very high (>0.35; 42% Turkey). ADMIXTURE (K=7) resolved Ethiopian landraces, North African/Portugal landraces, West Asia landraces, additional Turkey/Azerbaijan/Iran/Russia landraces, elites (72% pure, 28% admixtures), a T. dicoccum group, and outliers likely hexaploids. A set of 1,008 accessions had ~20% SNPs on D chromosomes, indicating hexaploid misclassification.
  • CWR: 3,903 accessions across 27 species resolved primarily by genome constitution, then by species and sections (e.g., UU-genome group, section Cylindropyron vs Vertebrata vs Sitopsis/Triticum). Subspecies and ploidy differentiation observed (e.g., Ae. neglecta ssp. neglecta UUMM vs ssp. recta UUMMNN). About 5.1% (199) were outliers/potentially misclassified.
  • Core subsets: 20% cores captured diversity comparable to full sets (He, inbreeding, Shannon indices): 11,235 hexaploids, 3,157 tetraploids, 746 CWR. Combined core analysis flagged misclassifications: 4.4% of putative tetraploids had >10% markers on D genome (97 had >20%); 2.4% of putative hexaploids had <20% D-genome markers.
  • Selection footprints (FST): Variant-wise FST across chromosomes highlighted regions under selection distinguishing traditional vs elite groups and consolidating elite clusters. Notably, on 3A, high-FST clusters overlap the preharvest sprouting gene TaMFT; alleles present in cluster 2 (elites + Mexican landraces) are absent in cluster 4 (elites + cultivars), indicating recent selection history.
  • GWAS: In 3,870 accessions, 18 genomic regions associated with grain protein content (GPC) on 12 chromosomes, with highest signals on 4A and 4B, followed by 5A, 5B, 7A, 7B. Comparisons suggest potential novel QTL on 5BS and 7AL (QGPC.ndsu.5B, QGPC.ndsu.7A.2). For SDS sedimentation (gluten quality), 19 regions detected, major QTL on 1A, 1B, 1D co-locating with HMW- and LMW-glutenin loci (Glu-A1/B1/D1 and Glu-A3/B3/D3), plus a QTL on 2A.
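The ploidy misclassification check reported in the core-subset finding above can be illustrated with a short sketch. It assumes each accession is described by the chromosome names (such as "3A" or "7D") of the mapped markers scored as present in it, and it reuses the thresholds quoted in the findings (more than 10% D-genome markers for a putative tetraploid, less than 20% for a putative hexaploid). The function names and input format are hypothetical.

```python
from collections import Counter

def subgenome_fractions(chromosomes):
    """Fraction of markers on the A, B and D subgenomes, from names like '3A'."""
    counts = Counter(
        name[-1].upper()
        for name in chromosomes
        if name and name[-1].upper() in ("A", "B", "D")
    )
    total = sum(counts.values())
    if total == 0:
        return {"A": 0.0, "B": 0.0, "D": 0.0}
    return {g: counts[g] / total for g in ("A", "B", "D")}

def flag_ploidy(declared, chromosomes, tetra_max_d=0.10, hexa_min_d=0.20):
    """True when the D-genome share contradicts the declared ploidy group."""
    d_share = subgenome_fractions(chromosomes)["D"]
    if declared == "tetraploid":
        return d_share > tetra_max_d   # AABB should carry few D markers
    if declared == "hexaploid":
        return d_share < hexa_min_d    # AABBDD should carry a clear D share
    return False

# A putative tetraploid with 30% of its markers on D chromosomes is flagged.
markers = ["1A"] * 35 + ["2B"] * 35 + ["3D"] * 30
print(flag_ploidy("tetraploid", markers))  # True
```

A flag only marks an accession for curator review; a residual share of D-mapped markers in true tetraploids can arise from sequence similarity among the homoeologous subgenomes.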
Discussion

The large-scale genotypic characterization of CIMMYT and ICARDA wheat germplasm addressed the central question of how much breeder-relevant diversity resides in landraces and CWR and how it relates to elite pools. In hexaploid wheat, substantial landrace diversity remains largely untapped in modern breeding, while synthetic derivatives introduce distinct D-genome alleles that separate from elites. In contrast, elite durum germplasm spans much of tetraploid diversity, except for a genetically distinct Ethiopian landrace cluster representing priority material for allele mining. CWR analyses confirmed strong genome- and section-level differentiation and facilitated detection of misclassified accessions and ploidy levels, improving genebank curation. Genome-wide FST profiles pinpointed regions with strong differentiation between traditional and elite clusters, providing hypotheses for selective sweeps and targets for breeding, exemplified by the TaMFT region on 3A related to preharvest sprouting. GWAS validated the genotyping platform by recovering known quality loci and identifying additional candidate regions for GPC and SDS. Together, anchoring markers to physical and genetic maps, providing curated core subsets, and delivering visualization tools enable breeders and geneticists to trace selection footprints, identify diversity gaps, and more efficiently introgress valuable alleles into elite backgrounds.
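To make the idea of a windowed FST profile concrete, the sketch below computes FST between two groups in 1 Mb windows along one chromosome. It is an illustration under stated assumptions, not the study's procedure: the study used vcftools, whereas this sketch swaps in Hudson's estimator (combined as a ratio of sums within each window) and uses non-overlapping windows for brevity. The inputs are per-SNP allele frequencies in each group and the number of sampled chromosomes per group; the function name is hypothetical.

```python
import numpy as np

def windowed_fst(pos, p1, p2, n1, n2, window=1_000_000):
    """Hudson's FST per non-overlapping window on one chromosome.

    pos: SNP positions in bp; p1, p2: allele frequencies in groups 1 and 2;
    n1, n2: number of sampled chromosomes (alleles) in each group.
    Returns {window_start_bp: fst}.
    """
    pos = np.asarray(pos)
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Per-SNP numerator and denominator of Hudson's estimator.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    bins = pos // window
    result = {}
    for b in np.unique(bins):
        mask = bins == b
        d = den[mask].sum()
        # Ratio of sums rather than mean of per-SNP ratios.
        result[int(b) * window] = float(num[mask].sum() / d) if d > 0 else float("nan")
    return result

# Two SNPs fixed for opposite alleles in the two groups, in separate windows.
print(windowed_fst([100, 1_500_000], [1.0, 1.0], [0.0, 0.0], n1=20, n2=20))
# {0: 1.0, 1000000: 1.0}
```

Windows whose FST stands well above the chromosome-wide background are the candidate selection footprints; the estimate can be slightly negative when the two groups have nearly identical allele frequencies.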

Conclusion

This work delivers one of the largest diversity analyses for a crop, generating >300k markers and mapping them to reference genomes and a consensus genetic map for nearly 80,000 wheat accessions. It reveals extensive unexplored diversity in hexaploid landraces, a distinct Ethiopian cluster in tetraploids, and clear genome-based structure among CWR, while flagging misclassifications in genebank records. Selection footprints highlight genomic regions shaped by modern breeding, including loci underlying preharvest sprouting and end-use quality. The public release of genotypic data, consensus map, and visualization tools (CurlyWhirly, Flapjack) and construction of representative core subsets equip the community to mine, clone, and deploy functional diversity. Future research should integrate genomic, phenotypic, and environmental metadata to prioritize accessions, refine selection footprints across traits and environments, validate novel QTL (e.g., for GPC and SDS), and develop targeted pre-breeding pipelines that minimize linkage drag and accelerate the use of landrace and CWR alleles.

Limitations

Most accessions were represented by a single plant, limiting capture of within-accession heterogeneity. The DArTseq pipeline and some marker-calling parameters are proprietary; although robust and largely free from ascertainment bias, the approach may be less suitable than arrays for some applications and lacks full repeatability across platforms. Filtering thresholds (e.g., missingness, MAF) and reliance on de novo markers result in incomplete placement (not all markers mapped uniquely). Passport data incompleteness and misclassification (e.g., ploidy/group mislabels) required genomic curation. GWAS was conducted on a subset (3,870 accessions) and limited to two quality traits, so findings for other traits remain to be validated.
