High-quality genome assembly and resequencing of modern cotton cultivars provide resources for crop improvement

Z. Ma, Y. Zhang, et al.

Discover groundbreaking research conducted by Zhiying Ma and colleagues, revealing the hidden genetic secrets of cotton through high-quality genomes of *Gossypium hirsutum* and *Gossypium barbadense*. This study uncovers structural variations that impact fiber quality and resilience against *Verticillium* wilt, offering insights into crop improvement strategies.

Introduction
Cotton is a major fiber crop, with Gossypium hirsutum contributing over 90% of global production. Although thousands of improved varieties exist, breeding for simultaneous gains in yield, fiber quality, and resistance is constrained by limited knowledge of the genomic basis of key agronomic traits. High-quality genome assemblies have been reported for obsolete varieties (for example, TM-1 and ZM24), but recently bred modern cultivars and the extent of genomic diversification during modern breeding remain insufficiently characterized. G. barbadense provides superior fiber quality and disease resistance, making interspecific introgression an attractive strategy; however, genomic variation between G. barbadense and modern G. hirsutum is not well resolved. While SNP-based GWAS have identified loci for cotton agronomic traits, the genome-wide contribution of structural variations (insertions, deletions, inversions, translocations) to phenotype remains unclear. This study aims to assemble chromosome-scale genomes for modern cultivars (G. hirsutum NDM8 and G. barbadense Pima90), resequence a large G. hirsutum panel, catalogue structural variations, and assess their genetic effects on fiber quality, yield traits, and Verticillium wilt resistance.
Literature Review
Prior work has produced reference genomes for tetraploid cottons, including G. hirsutum (TM-1) and G. barbadense (Hai7124, 3-79), as well as diploid A and D genomes, enabling discovery of SNPs associated with agronomic traits. Reports have also documented extensive intraspecific gene order/structural variation among upland cotton cultivars and genomic diversifications across allopolyploid Gossypium species. However, high-quality assemblies of recently bred, modern cultivars and a comprehensive survey of structural variations and their phenotypic associations in large resequenced populations have been lacking. The importance of transposable elements, particularly LTR retrotransposons (Copia and Gypsy), in plant genome evolution and domestication is well-established, motivating analysis of TE insertions relative to gene expression and trait diversification.
Methodology
Plant materials: Modern G. hirsutum cv. NDM8 and G. barbadense acc. Pima90 (selfed >10 generations) were selected for de novo assembly. A total of 1,081 worldwide G. hirsutum accessions were resequenced; for Verticillium wilt resistance assessment, 401 accessions were phenotyped.
Sequencing and assembly: High-molecular-weight DNA was extracted (CTAB). PacBio Sequel long reads were generated (NDM8: 205.41 Gb; Pima90: 200.82 Gb; subread N50 ~19.84 kb and 18.82 kb). Illumina paired-end reads (350-bp inserts) were produced for polishing and resequencing. 10x Genomics linked reads (GemCode) provided barcoded long-range information. Hi-C libraries (HindIII digestion) from leaf tissue captured chromatin contacts for scaffolding. Reads were quality-filtered to remove adapters, low-quality bases, Ns, and PCR duplicates.
Assembly pipeline: PacBio reads were self-corrected and assembled with FALCON; contigs were polished with Quiver and further corrected using Pilon with Illumina reads. 10x linked reads were used to superscaffold with fragScaff. Hi-C reads were mapped (Bowtie2), processed (Hiclib/HiC-Pro), and scaffolds were anchored/oriented with LACHESIS and manual curation based on contact maps.
Assembly assessment: Short reads were aligned with BWA to assess base-level accuracy and coverage; BUSCO (embryophyta set v3) evaluated completeness. Genetic map concordance and BAC alignments were checked; centromeric region collinearity and LAI scores assessed assembly contiguity and TE assembly quality.
Annotation: Repeats were annotated via homology (RepeatMasker/Repbase, RepeatProteinMask) and ab initio tools (TRF, LTR_FINDER, PILER, RepeatScout). Gene models were predicted using evidence integration: homology (TBLASTN, GeneWise), transcript evidence (RNA-seq assembled with Trinity/PASA; PacBio Iso-Seq; TopHat/Cufflinks), and ab initio predictors (Augustus, GeneID, GenScan, GlimmerHMM, SNAP). EvidenceModeler integrated models; PASA refined UTRs and splicing. Functional annotation leveraged SwissProt, Pfam, NR, GO, and KEGG.
Structural variation discovery: For interspecific comparisons, smartie-sv called insertions/deletions between assemblies, with read-based validation; inversions/translocations were inferred from synteny (MUMmer-based) patterns. For the population panel, SNPs were called with BWA/Sentieon; small indels (≤250 bp) with Sentieon; larger SVs (insertions, deletions, duplications, inversions, translocations) were identified using SVMerge integrating LUMPY and BreakDancer; local assembly with Velvet refined breakpoints. Calls across individuals were merged into a nonredundant set, retaining SVs supported by at least ten accessions. Phylogeny was constructed with TreeBest.
GWAS: Phenotypes included fiber quality traits (fiber length, fiber strength, micronaire), yield traits (boll weight, lint percentage, seed index) across multi-environments (12–14 for core/expanded sets and an additional 2019 environment for 1,041 accessions), and Verticillium wilt disease index (401 accessions; multiple experiments with resistant/susceptible checks). BLUPs (lme4) and means were computed. Due to abundant inversions on At08, SVs on At08 were excluded from GWAS. Association mapping used GEMMA with top three PCs (population structure) and kinship (simple matching coefficients). Genome-wide threshold set at P = 1/n (n = number of SVs tested).
Expression and functional assays: RNA-seq quantified expression (RPKM; Cufflinks). qRT-PCR used Ghhistone3b as control and 2^-ΔΔCt calculation. Virus-induced gene silencing (TRV-based) in cotton (NDM8 and CCRI8) targeted GhNCS; disease assays were performed with V. dahliae (LX2-1) inoculation. Arabidopsis overexpression lines (35S::GhNCS) were generated (pGreen, Basta selection) and tested for disease susceptibility.
Statistical analyses: Permutation tests for density enrichment, Mann–Whitney U-tests, t-tests, and ANOVA (SPSS) were used.
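Two of the calculations named in the methodology are simple enough to sketch: the genome-wide threshold P = 1/n and the 2^-ΔΔCt relative-expression formula. The Python below is a minimal illustration, not the authors' code. The function names, the count of tested SVs, and the Ct values are all hypothetical; in the study the reference gene was Ghhistone3b, and n was the number of SVs actually tested after excluding At08.

```python
import math


def genome_wide_threshold(n_tested: int) -> float:
    """Return the significance cutoff P = 1/n for n tested variants."""
    if n_tested <= 0:
        raise ValueError("n_tested must be a positive integer")
    return 1.0 / n_tested


def fold_change_ddct(ct_target_treated: float, ct_ref_treated: float,
                     ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression of a target gene by the 2^-ddCt method."""
    # dCt normalizes the target gene to the reference gene within one sample.
    dct_treated = ct_target_treated - ct_ref_treated
    dct_control = ct_target_control - ct_ref_control
    # ddCt compares the treated sample with the control sample.
    ddct = dct_treated - dct_control
    return 2.0 ** (-ddct)


if __name__ == "__main__":
    n = 100_000  # hypothetical number of tested SVs, not a value from the study
    p_cut = genome_wide_threshold(n)
    print(f"threshold P = {p_cut:.1e}, -log10(P) = {-math.log10(p_cut):.1f}")
    # Hypothetical Ct values, chosen only to exercise the formula.
    print(fold_change_ddct(24.0, 18.0, 26.0, 18.0))
```

A lower Ct means more transcript, so a negative ΔΔCt gives a fold change above 1 (higher expression in the treated sample than in the control).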
Key Findings
Genome assemblies: Chromosome-scale assemblies were produced for NDM8 (2,291.77 Mb) and Pima90 (2,210.14 Mb). Anchoring to pseudochromosomes: 99.57% (NDM8) and 99.75% (Pima90). Contig N50: 13.15 Mb (NDM8) and 9.24 Mb (Pima90); scaffold N50: 107.67 Mb and 102.45 Mb; gap ratios: 0.003% and 0.06%; GC content ~34.3%. BUSCO completeness: 96.1% (NDM8) and 95.9% (Pima90). LAI scores: 14.2 (NDM8) and 12.1 (Pima90). Predicted protein-coding genes: 80,124 (NDM8) and 79,613 (Pima90), of which 98.6% were expressed in each assembly. Newly predicted genes: 1,499 (NDM8) and 1,267 (Pima90), with >92% showing transcriptional evidence.
Transposable elements: LTRs cover ~55% of genomes (NDM8: 1,263.36 Mb; Pima90: 1,204.74 Mb). Gypsy elements predominate over Copia (NDM8: 81.29% vs 17.82% of LTRs; Pima90: 81.07% vs 18.14%; P < 1e-21). Copia insertions showed a greater per-insertion impact on expressed gene counts and were more active in the last 0–1 million years (MYA).
Interspecies structural variation (Pima90 vs NDM8): 846,363 SVs detected (517,230 insertions; 317,638 deletions). Although total counts in A- and D-subgenomes were similar, SV densities were higher in D-subgenome (insertions: 312/Mb vs 188/Mb; deletions: 194/Mb vs 114/Mb; P ≤ 1.51×10^-13). Inversions: 9,515 (avg 21.85 kb), denser in A-subgenome (P = 6.44×10^-9). Translocations: 1,980 (74.09% interchromosomal). 5,815 exonic variants in 5,256 genes included 4,180 frameshifts and 381 stop-gain/loss. Notable variants: a 2-bp deletion in a sucrose synthase (GbM_D13G2394) transmembrane region associated with elevated expression during fiber development in G. barbadense; multiple small indels in fiber-related regulators (EXPANSIN, bHLH, DIR). Across fiber, tissue, and V. dahliae conditions, 31,296 variant–gene pairs with significant differential expression were identified (|log2FC| ≥ 1; P < 0.05), supporting regulatory effects of SVs.
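The contig and scaffold N50 values reported above are a length-weighted contiguity statistic: the length L such that sequences of length L or longer together cover at least half of the assembly. A minimal Python sketch of the standard definition follows. The function name and the toy lengths are illustrative assumptions and have nothing to do with the NDM8 or Pima90 contig sets.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence where the cumulative length
        # reaches half of the total assembly size.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # hypothetical lengths
    print(n50(toy_contigs))
```

Because the statistic is weighted by length, a few very long contigs dominate it, which is why long-read assemblies such as these report N50 values in the megabase range.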
Intraspecies (NDM8 vs TM-1) and breeding-era variation: 76,568 SVs in NDM8 relative to TM-1 (27,708 insertions, 47,221 deletions, 808 inversions, 831 translocations). SV densities (insertions and deletions) were significantly higher in D-subgenome (P ≤ 1.28×10^-3). SV density showed strongest enrichment near telomeres (top 20% windows) with a 3.71-fold increase over other regions (P < 1×10^-6, permutation test). Exonic indels: 603 in 526 genes; example: GhM_A02G1731 (CCR) in TM-1 has a 1-bp splice-site indel leading to 29- and 45-bp deletions that truncate an NAD-binding domain, and the gene is expressed at a lower level under V. dahliae in TM-1 than in resistant NDM8. Inversions were denser in A-subgenome (At/Dt density difference; P = 2.60×10^-5). Comparison across varieties suggested fewer and smaller rearrangements between more closely released cultivars. Relative to early (pre-1970) varieties, modern (post-1990) varieties acquired 1,128 SVs (≥51% frequency), with higher density in D-subgenome (6.79×10^-4/kb vs 3.86×10^-4/kb; P = 7.81×10^-5), implying stronger selection on D-subgenome during modern breeding.
Population resequencing and GWAS: In 1,081 G. hirsutum accessions (avg 10.65×), 304,630 SVs were called (141,145 insertions; 156,234 deletions; 39 inversions; 6,384 translocations; 828 duplications), mostly intergenic (76.94%). Together with 2,970,970 SNPs, GWAS identified 446 SVs significantly associated with seven traits: 346 for fiber quality (fiber length, FL; fiber strength, FS; micronaire, M), 97 for yield (boll weight, BW; lint percentage, LP; seed index, SI), and 3 for Verticillium wilt (VW) resistance. Of 193 SVs replicated by both BLUP and means, 160 were for fiber quality and 33 for yield; 29 lay in regulatory regions and 19 within genes. Trait–subgenome bias: fiber-quality SVs predominantly in D-subgenome (139 vs 21 in A), while yield SVs were mostly in A-subgenome (22 vs 11 in D).
Highlighted loci: For fiber length (Dt11, 24.55–24.93 Mb), a 370-kb region with 125 SVs contained variants increasing FL by 0.71–1.19 mm, elevating grades (e.g., 27–28 mm to 29 mm). For lint percentage (Dt03), two SVs increased LP from ~37.5% to ~39.7–40.0%. For Verticillium wilt resistance (Dt11, 69.00–69.33 Mb), three SVs reduced the disease index (DI) by >13.6, shifting susceptible (DI ~45) to tolerant (DI ~31).
Candidate genes and validation: 907 candidate genes (fiber/yield) and 60 (VW) were prioritized within LD (325 kb); 84.23% expressed during fiber development; 305 carried SVs in genes/regulatory regions. Four deletions in UTRs/introns of GhM_D11G2206 (GhFL2) associated with FL.
Causal validation: GhM_D11G3743 (GhNCS; PR10/Bet v1 family) on Dt11 showed lower expression in resistant vs susceptible varieties under V. dahliae. Silencing GhNCS enhanced resistance (CCRI8: DI 53.1 to 21.4; NDM8: 27.9 to 10.0), while overexpression in Arabidopsis increased susceptibility (DI 38.1 to 58.1), supporting GhNCS as a negative regulator of VW resistance and validating SV-based GWAS hits.
Introgression example: A backcross-derived line NDM373-9 carrying Pima90 introgressions showed improved VW resistance and fiber properties and harbored 171 exonic SVs from Pima90, including 34 and 12 genes implicated in disease resistance and fiber development, respectively.
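The SV and GWAS tallies in this section can be cross-checked with simple arithmetic. The Python sketch below is a reader-side sanity check, not part of the study's pipeline. It assumes that the per-class counts are meant to sum to the stated totals, with inversions and translocations included in the 846,363 interspecies figure; the dictionary layout and function name are illustrative.

```python
# Each entry: label -> (reported total, component counts from the summary).
REPORTED = {
    "Pima90 vs NDM8 SVs": (846_363, [517_230, 317_638, 9_515, 1_980]),
    "NDM8 vs TM-1 SVs": (76_568, [27_708, 47_221, 808, 831]),
    "Panel SVs (1,081 accessions)": (304_630, [141_145, 156_234, 39, 6_384, 828]),
    "GWAS-associated SVs": (446, [346, 97, 3]),
    "SVs replicated by BLUP and means": (193, [160, 33]),
    "Replicated fiber-quality SVs (D + A)": (160, [139, 21]),
    "Replicated yield SVs (A + D)": (33, [22, 11]),
}


def count_discrepancies(reported=REPORTED):
    """Return, per label, the sum of components minus the reported total."""
    return {
        label: sum(parts) - total
        for label, (total, parts) in reported.items()
    }


if __name__ == "__main__":
    for label, diff in count_discrepancies().items():
        status = "consistent" if diff == 0 else f"off by {diff}"
        print(f"{label}: {status}")
```

A nonzero difference would point to either a transcription error in the summary or an SV class that was counted but not listed.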
Discussion
This work addresses the gap in understanding how structural variations shape agronomic traits in modern cotton. Chromosome-scale assemblies for modern G. hirsutum (NDM8) and G. barbadense (Pima90), combined with deep resequencing of 1,081 G. hirsutum accessions, revealed widespread SVs with distinct subgenome patterns: the D-subgenome bears higher densities of insertions/deletions and appears to have undergone stronger selection during species formation and modern variety development, whereas the A-subgenome harbors denser inversions. SV density is enriched near telomeres, paralleling observations in humans. Functionally, numerous SVs in coding and regulatory regions correlate with gene expression differences across tissues, fiber development, and pathogen stress, and specific exonic indels can alter protein domains (e.g., Sus, CCR) with plausible phenotypic consequences. GWAS leveraging SVs identified hundreds of loci for fiber quality, yield, and disease resistance, with a clear subgenome bias (fiber in D, yield in A). The experimental validation of GhNCS as a negative regulator of Verticillium wilt resistance provides causal support for SV-trait associations. Collectively, these findings establish SVs as key components of genotype-to-phenotype maps in cotton and provide targets and genomic resources to guide marker-assisted selection and interspecific introgression for crop improvement.
Conclusion
The study delivers high-quality, chromosome-scale genomes for modern cotton cultivars NDM8 and Pima90 and a comprehensive catalogue of structural variations across species and a large upland cotton population. It demonstrates that SVs, especially in the D-subgenome, have been targets of selection and significantly contribute to fiber quality, yield, and Verticillium wilt resistance. The identification and validation of trait-associated SVs and candidate genes (e.g., GhNCS for VW resistance; GhFL2-related variants for fiber length) provide actionable markers and functional leads for breeding. Future work could expand functional validation of additional SVs, develop SV-aware genomic selection models, construct a cotton pangenome to capture rare/complex variants, and explore the mechanistic roles of TE insertions (Copia vs Gypsy) in regulatory evolution and trait diversification.
Limitations
GWAS excluded SVs on chromosome At08 due to abundant inversions that may interfere with association accuracy, potentially missing associations on that chromosome. Verticillium wilt resistance was phenotyped on a subset (401/1,081) of accessions under controlled growth-chamber conditions, which may not fully capture field variability. Population SV discovery required support from at least ten accessions and integrated specific callers (LUMPY, BreakDancer), which could limit sensitivity to rare or complex SV classes.