Graph-based pan-genome reveals structural and sequence variations related to agronomic traits and domestication in cucumber

H. Li, S. Wang, et al.

This study presents a graph-based cucumber pan-genome that catalogs approximately 4.3 million genetic variants, including more than 56,000 structural variants, some of which are linked to key agronomic traits. The work, by Hongbo Li and colleagues, clarifies how cucumber traits evolved and provides a resource for future breeding.

Introduction

High-throughput sequencing has facilitated discovery of SNPs and small indels by mapping short reads to a single reference genome, but this approach underrepresents structural variants (SVs), which are often critical for agronomic traits and evolution. Presence/absence variations mean important genes can be missing from standard references, motivating multiple reference genomes and pan-genome approaches. Graph-based genomes, which embed alternate haplotypes at variant loci, provide a population-aware representation that can improve variant discovery and genotyping. Cucumber (Cucumis sativus L.) is a major vegetable crop and model for sex determination and vascular development. Prior studies using the ‘Chinese long’ 9930 reference and short-read resequencing characterized SNPs/indels, inferred domestication history, and mapped some SVs (e.g., copy number variation at the Female locus). However, limited reference quality and diversity hindered comprehensive SV discovery. This study addresses these gaps by generating 11 additional chromosome-scale assemblies from diverse wild and cultivated accessions, building a graph-based pan-genome with the 9930 reference, detecting and genotyping millions of variants including SVs across a 115-line core collection, and associating SVs with key agronomic traits and domestication signals.

Literature Review

The paper reviews limitations of single-reference, short-read mapping for capturing genome-wide diversity, particularly SVs. It cites plant and human pan-genome studies that reveal extensive structural variation and endorses graph-based genome representations for improved read mapping and variant genotyping. For cucumber, previous population genomics based on 9930 short-read mapping elucidated domestication and diversity, and a resequencing-based SV map identified a CNV defining the Female (F) locus, underscoring the functional importance of SVs. Nevertheless, the paucity of high-quality, diverse references limited SV detection, motivating the present pan-genome and graph-genome approach.

Methodology

Sampling and sequencing: Selected 11 representative accessions from a 115-line cucumber core collection (2 East-Asian, 3 Eurasian, 1 Xishuangbanna, 5 Indian including wild var. hardwickii). Generated 15.9–22.4 Gb PacBio long reads per accession (45–64×). For Cuc37, Cuc80, and Cuc64, produced 10X Genomics linked-reads and ~200× Hi-C reads; Illumina short reads were available for polishing.
Genome assembly: Assembled PacBio reads with Canu v1.8. Removed contaminant/plastid contigs by BLASTN; polished with Pilon using Illumina reads. For three accessions (Cuc37, Cuc80, Cuc64), scaffolded with ARCS (10X) and constructed chromosome-scale pseudomolecules using Juicer/3D-DNA and Hi-C maps; for the other eight, anchored and oriented contigs using four genetic linkage maps via ALLMAPS; resolved misassemblies by comparing with linkage maps and split conflicting contigs; placed unanchored contigs by MUMmer alignment to 9930. Assembly quality assessed by BUSCO (embryophyta odb10) and Hi-C contact maps.
Genome annotation: De novo repeat identification with RepeatModeler; LTRs detected by LTR_FINDER and LTRharvest, filtered with LTR_retriever; repeat masking with RepeatMasker. Classified intact LTR-RTs into families using the 80-80-80 rule and CD-HIT-EST. Predicted protein-coding genes with EVidenceModeler integrating ab initio, RNA-seq (10 tissues in 9930 and 7 in wild PI183967), and homology; rescued missing orthologs using SPALN; functional annotation via UniProt/TAIR and InterProScan, assigning GO terms.
Chromosomal rearrangements: Aligned assemblies to 9930 with MUMmer (nucmer), manually identified megabase-scale inversions using dot-plots and Hi-C support; surveyed presence/absence across accessions by aligning contigs where Hi-C was not available.
Pan-genome: Clustered 299,692 predicted genes from 12 accessions with GET_HOMOLOGUES-EST to define core vs dispensable gene clusters; evaluated GO enrichment (Fisher’s exact, FDR<0.05), gene lengths, expression (HISAT2/StringTie), and Ka/Ks on ortholog pairs.
Variant discovery: Built one-to-one collinear alignment blocks between each accession and 9930 via MUMmer (nucmer + delta-filter), axtChain/chainNet; called SNPs and small indels (≤50 bp) using EMBOSS diffseq; removed potential assembly base errors using accession’s short reads (BWA, SAMtools/BCFtools QC). Annotated SNP/indel effects using snpEff. Identified large indels (≥50 bp) from alignments, classifying canonical vs complex; refined breakpoints with BLAT re-alignment; excluded SVs with unknown sequence. Detected inversions/translocations using SyRI; manually curated inversions not encompassed by single contigs and filtered translocations with Ns. Assessed SV support using Illumina short reads via RD (coverage drop), SR/RP evidence using LUMPY and DELLY with QC thresholds.
Graph-based pan-genome and SV genotyping: Constructed a variation graph with vg (v1.23.0) from 9930 reference and 54,107 SV alleles (insertions, deletions, inversions); pruned and indexed (XG, GCSA). Mapped paired-end reads of 115 accessions to the graph, filtered low MAPQ/base quality, packed coverage, identified snarls, and genotyped SVs per sample with vg call.
GWAS with SVs: Combined 40,720 QC-passed SVs and 2,833,550 SNPs to model kinship (EMMAX, Balding–Nichols) and population structure (LD-pruned PCs). Performed mixed-model association (EMMAX) for female flower rate on primary branch, fruit spine/wart density, and branch number; significance threshold 3.46×10^-5 (effective tests).
Domestication sweeps and divergent SVs: Mapped resequencing reads from the 115 accessions to 9930; called/filtered SNPs/indels with GATK. Computed π ratios and XP-CLR; defined regions in the top 5% of both statistics as sweeps. Classified SVs overlapping sweeps as domestication-associated (dSVs). Identified highly divergent SVs (hdSVs) by Fisher’s exact test on wild vs cultivated allele frequencies (FDR<0.01). Linked SVs to nearby orthologous genes and assessed differential expression between wild (Cuc64) and cultivated (9930) using RNA-seq across seven tissues, highlighting CDS/promoter SVs with |log2FC|≥1.5 in at least one tissue.
Validation and expression/phenotyping: Validated key SVs (pINS, iINS) with PCR and long-read mapping. Measured root traits (primary root length/weight) at 7 DAG across haplotypes; quantified PELPK7.1/7.2 expression by RNA-seq and qRT-PCR (UBIQUITIN control).
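
The hdSV screen described above compares SV allele counts between wild and cultivated accessions with Fisher’s exact test and applies an FDR cutoff of 0.01. A minimal Python sketch of that kind of screen follows. The allele counts and SV names are made-up illustrative values, the function names are hypothetical, and the Benjamini-Hochberg procedure is an assumption: the summary does not say which FDR method the authors used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: wild, cultivated. Columns: SV allele present, SV allele absent.
    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x SV alleles in the wild row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical allele counts per SV:
# (wild present, wild absent, cultivated present, cultivated absent)
sv_tables = {
    "SV_A": (18, 2, 5, 195),     # common in wild, rare in cultivated
    "SV_B": (10, 10, 100, 100),  # same frequency in both groups
}

names = list(sv_tables)
pvals = [fisher_exact_two_sided(*sv_tables[n]) for n in names]
qvals = benjamini_hochberg(pvals)

# Keep SVs passing the FDR < 0.01 cutoff quoted in the text.
hd_svs = [n for n, q in zip(names, qvals) if q < 0.01]
print(hd_svs)
```

For real data, a tested library routine such as SciPy's `scipy.stats.fisher_exact` would be a more robust choice than this hand-rolled version.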

Key Findings

Generated 11 new chromosome-scale cucumber assemblies (232.5–251.1 Mb; 830–1015 contigs; contig N50 1.7–5.3 Mb; BUSCO 96.4–97.7%). TE content 32.5–38.5% with LTR-RTs 14.3–19.3%. Predicted 24,490–26,033 genes per accession with ~96% BUSCO completeness of gene sets.
Identified seven megabase-scale inversions between wild (Cuc64) and cultivated 9930 (three on Chr5, three on Chr4, one on Chr7); W8 also carries the Chr5 inversions, W4 does not. Breakpoints supported by Hi-C interaction patterns in Cuc64 mapped to 9930. Provides an inversion map across 12 accessions to guide crosses and avoid recombination-suppressed regions.
Cucumber pan-genome: 26,822 non-redundant gene clusters comprising 18,651 core and 8,171 dispensable clusters. Core genes make up more than 80% of the genes in each accession. Dispensable genes are shorter, expressed at lower levels, and have higher Ka/Ks, indicating faster evolution and specialized functions. GO enrichments: core in essential processes (macromolecule modification, glycosylation, phosphorus metabolism); dispensable enriched for DNA integration, hormone responses (e.g., auxin), telomere maintenance.
Variant landscape relative to 9930: 2,902,954 SNPs and 1,388,197 small indels; 2.5% of SNPs and 1.5% of small indels affect coding/splicing/start/stop/frameshift. SVs: 53,912 large indels (17,130 canonical insertions; 19,334 canonical deletions; 9,399 complex insertions; 8,049 complex deletions), 196 inversions (<1 Mb; total 5.47 Mb), and 2,106 translocations (total SVs 56,214). SV density reduced around centromeres. Short-read support rates: 69.7–85.5% for canonical deletions, 89.5–96.1% for canonical insertions, 62.9–86.5% for complex deletions, 83.6–91.7% for complex insertions; remaining SVs likely require long reads.
Graph-based pan-genome integrating 9930 with SV alleles enabled accurate mapping and SV genotyping in 115 accessions. GWAS using genotyped SVs recovered known sex-determination loci (m and F) and pinpointed the causative complex insertion upstream of CsGL3 for fruit spine/wart density. Identified a novel 59 bp canonical insertion (SV_INS_7G011800) 2,593 bp upstream of Csa9930_7G025850 (BYPASS1 homolog) associated with branch number.
Functional SVs in spine/wart genes: Found 20 likely functional variants across NS, CsTTG1, CsGA20ox1, CsTu, CsTS1, CsGL3, including three SVs. Discovered a 51 bp deletion in CsTu CDS in Hx117 (SV_DEL_5G021710) removing 17 amino acids within an intrinsically disordered region; flanking 5 bp microhomology (CCACC) suggests MMBIR mechanism. The known 4,895 bp deletion removing CsTu (SV_DEL_5G021700) confers non-warty fruits, and the 802 bp complex insertion replacing 10 bp upstream of CsGL3 (SV_COMINS_6G012210) elevates spine density and shows group-specific selection in East-Asian cultivars.
CsFT flowering-time locus evolution: Identified three insertions (39.3 kb canonical; 25.3 kb and 44.0 kb complex) upstream of CsFT. Defined four upstream-region types (long-1, long-2, short-1, short-2) with distinct phylogeography and flowering-time distributions, proposing an evolutionary trajectory from ancestral long-1 (in late-flowering Indian wild) through domestication to short types in early-flowering cultivars, with independent selection in East-Asian vs Eurasian groups.
Domestication-related SVs: 2,578 dSVs overlap selective sweeps; 8,651 hdSVs show significant wild vs cultivated allele-frequency divergence (FDR<0.01); 1,611 SVs (131 CDS, 1,480 promoter) associated with expression differences between wild (Cuc64) and cultivated (9930).
Root growth candidate SVs: Identified two insertions near tandem PELPK1 homologs PELPK7.1 (Csa9930_7G006910) and PELPK7.2 (Csa9930_7G006920): a 76 bp promoter insertion (pINS; SV_INS_7G004090) and a 135 bp intronic insertion (iINS; SV_INS_7G004080). Haplotype distribution: wild Indian accessions (Hap1) carry both; some cultivated (Cuc80, Cuc37) carry pINS only (Hap2); most cultivated lack both (Hap3). pINS enriched in Indian (63.3%) and all Xishuangbanna accessions; iINS rarer in cultivated Indians. Both insertions associate with reduced PELPK7.1/7.2 root expression and shorter, lighter primary roots, consistent with domestication-driven selection on root development.
Comparative diversity: Cucumber exhibits fewer SVs per accession (3,213–21,261 large indels) than Medicago and rice with similar genome sizes, consistent with narrow genetic background (7 chromosomes vs 12 in other Cucumis) and severe domestication bottleneck.
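
The variant counts reported in these findings can be cross-checked with simple arithmetic. The short Python sketch below uses only the figures quoted above; the variable names are illustrative and not taken from the paper.

```python
# Counts relative to the 9930 reference, as quoted in the findings above.
large_indels = {
    "canonical insertions": 17_130,
    "canonical deletions": 19_334,
    "complex insertions": 9_399,
    "complex deletions": 8_049,
}
inversions = 196
translocations = 2_106
snps = 2_902_954
small_indels = 1_388_197

# Large indels are the sum of the four canonical/complex classes.
total_large_indels = sum(large_indels.values())

# Structural variants = large indels + inversions + translocations.
total_svs = total_large_indels + inversions + translocations

# All variants = SNPs + small indels + SVs (the "~4.3 million" figure).
total_variants = snps + small_indels + total_svs

print(f"large indels:   {total_large_indels:,}")
print(f"total SVs:      {total_svs:,}")
print(f"total variants: {total_variants:,}")
```

The sums reproduce the 53,912 large indels and 56,214 SVs stated in the text, and the overall total rounds to the approximately 4.3 million variants cited in the summary.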

Discussion

By assembling 11 diverse, high-contiguity genomes and integrating them into a graph-based pan-genome, the study overcomes limitations of single-reference, short-read approaches and captures a comprehensive SV landscape in cucumber. The inversion map clarifies karyotype evolution and informs breeding designs by highlighting recombination-suppressed regions, especially in wild–cultivated crosses. The variant-integrated graph enables robust SV genotyping at population scale, facilitating GWAS that not only re-identify known loci (sex determination, CsGL3 regulator) but also uncover new SV–trait associations (branch number, root growth). Pan-genome analyses delineate core vs dispensable gene functions, revealing that dispensable genes evolve faster and may contribute to environmental adaptation. Dissecting the CsFT upstream structural haplotypes links SV history to domestication and flowering-time adaptation across geographic groups. The domestication-related SV catalog, with expression-linked candidates, provides mechanistic hypotheses for traits like root growth and fruit morphology. Together, these findings address the central question of how structural and sequence variation relate to agronomic traits and domestication, and deliver a practical resource for genomics-assisted breeding using graph-based genotyping of SVs.

Conclusion

This work delivers a graph-based cucumber pan-genome built from 12 chromosome-scale assemblies, catalogs ~4.3 million variants including 56,214 SVs, and demonstrates population-scale SV genotyping and association with key agronomic traits. It refines cucumber karyotype evolution via seven large inversions, characterizes core vs dispensable gene space, elucidates structural haplotypes at CsFT underlying flowering-time adaptation, and identifies candidate SVs affecting fruit spines/warts and root development. The graph and variant resources offer a foundation for precise SV-aware breeding and functional genomics. Future directions include integrating SNPs and small indels into the graph for unified variant genotyping, expanding the pan-genome with more globally diverse accessions to capture additional diversity, and applying SV-based GWAS/QTL in larger, accurately phenotyped populations to validate causative variants and guide marker development.

Limitations

Sample size and diversity: Although the 12 accessions capture ~84% of the genetic diversity of the 115-line core collection, the relatively small number of assemblies may underestimate species-wide variation, particularly rare SVs.
Variant scope in graph: The current graph integrates SVs but not SNPs/small indels, potentially limiting full haplotype-aware genotyping and association analyses.
Recombination suppression: Large inversions may complicate linkage mapping and breeding; while mapped, their broader effects on recombination landscapes across diverse backgrounds require further study.
Support by short reads: A fraction of complex SVs lack short-read support and rely on long-read evidence; additional orthogonal validation could further increase confidence.
Functional validation: Candidate SV–trait links (e.g., PELPK insertions, CsTu 51 bp deletion) are supported by association and expression but require direct functional assays to establish causality.
