Introduction
Understanding the genetic basis of agronomic traits and domestication in plants is crucial for crop improvement. While single nucleotide polymorphisms (SNPs) and small insertions/deletions (InDels) have been extensively studied, structural variants (SVs) represent a significant source of genetic diversity often linked to complex traits and evolutionary events. Traditional approaches relying on mapping short reads to a single reference genome are limited in capturing the full spectrum of SVs, particularly presence/absence variations (PAVs). Cucumber (*Cucumis sativus* L.) is an important vegetable crop and a model system for plant biology. While previous studies using short-read sequencing have identified SNPs and InDels, the characterization of SVs has been hindered by the limited number and quality of reference genomes. This research aimed to construct a high-quality pan-genome for cucumber to comprehensively identify and characterize SVs, thereby providing a deeper understanding of the genetic architecture of agronomic traits and the domestication process.
Literature Review
Several studies have highlighted the limitations of using single reference genomes for identifying the full extent of genetic variation, especially SVs. Pan-genome studies in various species have uncovered substantial species-wide biodiversity, emphasizing the importance of SVs. Graph-based genomes, which integrate variant information into the reference sequences, offer a promising approach for pan-genome representation. In cucumber, previous studies using Illumina short reads have characterized SNPs and small InDels, providing insights into domestication history and genetic diversity. A resequencing-based SV map identified a copy number variation linked to a key reproductive trait. However, the lack of high-quality, multiple reference genomes has hampered comprehensive SV characterization in cucumber.
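As a rough illustration of how a graph-based genome folds variant information into a reference, the Python sketch below builds a tiny sequence graph containing a single insertion. The node ids, sequences, and the spell helper are invented for this example; they do not reflect the data model of any particular graph toolkit or any real cucumber locus.

```python
# Toy sequence graph: each node holds a stretch of sequence,
# and each edge says which node may follow which.
# All ids and sequences below are hypothetical.
nodes = {1: "ACGT", 2: "TTAGGC", 3: "GATC"}
edges = {1: [2, 3], 2: [3], 3: []}


def spell(path):
    """Concatenate node sequences along a path, checking every step is an edge."""
    for a, b in zip(path, path[1:]):
        if b not in edges[a]:
            raise ValueError(f"no edge {a}->{b}")
    return "".join(nodes[n] for n in path)


# The reference allele skips node 2; the alternative allele walks through it.
reference_allele = spell([1, 3])
insertion_allele = spell([1, 2, 3])
```

Both alleles share nodes 1 and 3, so reads carrying the insertion can align to a path in the graph rather than being forced onto a reference that lacks it.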
Methodology
This study assembled chromosome-scale genomes for eleven representative cucumber accessions (three wild and eight cultivated) using PacBio long-read sequencing, complemented by Illumina and Hi-C data. The accessions were chosen to represent the genetic diversity within the 115-line core collection. Genome assembly involved several steps: PacBio read assembly using CANU, contig polishing with Illumina reads using Pilon, contig anchoring to linkage maps, scaffold construction using 10X Genomics data, and chromosome-level scaffolding using Hi-C data. Genome annotation included repetitive sequence identification (RepeatModeler and LTR_FINDER), gene prediction (EvidenceModeler), and functional annotation (InterProScan). Chromosomal rearrangements were identified by aligning the assemblies to the 9930 reference genome using MUMmer. A pan-genome was constructed by clustering gene models from all twelve accessions (the eleven new assemblies plus the 9930 reference) using GET_HOMOLOGUES-EST. Genetic variants (SNPs, InDels, and SVs) were identified through inter-genomic alignments using MUMmer and diffseq. The accuracy of the SVs was assessed with short-read data via read depth (RD), split-read (SR), and read-pair (RP) analyses. A graph-based pan-genome was constructed using vg, integrating SV information into the reference genome. Genome-wide association studies (GWAS) were performed using the genotyped SVs for traits such as female flower rate, fruit spine density, and branch number. The functional impact of SVs was assessed by analyzing SVs that affect the coding sequences (CDS) of genes involved in fruit spine/wart development and flowering time. Finally, SVs associated with cucumber domestication were identified by analyzing nucleotide diversity and XP-CLR values between wild and cultivated groups. PCR and qRT-PCR were used to validate selected SVs.
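The gene-clustering step can be pictured with a small presence/absence table. The sketch below is a minimal illustration, assuming that a "core" cluster is one present in every accession and that everything else is "dispensable"; the source does not state the exact threshold used, and the accession names, cluster names, and the classify_clusters helper are all hypothetical.

```python
def classify_clusters(pav, n_accessions):
    """Split gene clusters into core and dispensable lists.

    pav maps a cluster name to the accessions in which it was found.
    Assumption: core means present in all n_accessions accessions.
    """
    core, dispensable = [], []
    for cluster, present_in in pav.items():
        if len(set(present_in)) == n_accessions:
            core.append(cluster)
        else:
            dispensable.append(cluster)
    return sorted(core), sorted(dispensable)


# Hypothetical three-accession example.
accessions = ["acc1", "acc2", "acc3"]
pav = {
    "clusterA": ["acc1", "acc2", "acc3"],  # found everywhere
    "clusterB": ["acc1", "acc3"],          # missing from acc2
    "clusterC": ["acc2"],                  # private to acc2
}
core, dispensable = classify_clusters(pav, len(accessions))
```

In this toy table only clusterA counts as core; the other two clusters are presence/absence variants of the kind a single reference genome would miss whenever the reference happens to lack them.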
Key Findings
The study generated eleven high-quality chromosome-scale cucumber genome assemblies. Seven large-scale chromosomal inversions were identified and their presence/absence patterns across accessions were mapped, providing insights into karyotype evolution and guidance for breeding programs. A cucumber pan-genome encompassing 26,822 non-redundant pan-gene clusters (18,651 core and 8,171 dispensable) was constructed. A total of ~4.3 million genetic variants, including 56,214 SVs, were identified. GWAS using the graph-based pan-genome revealed SVs associated with several agronomic traits: two genes known to be involved in sex determination (m and F) were detected in association signals; an SV upstream of CsGL3 was linked to fruit spine/wart density; and a novel SV upstream of Csa9930_7G025850 (an Arabidopsis BYPASS1 homolog) was associated with branch number. Analysis of SVs affecting genes involved in fruit spine/wart development revealed a 51 bp deletion in CsTu (a C2H2 zinc-finger transcription factor) potentially causing the loss of 17 amino acids. An evolutionary model for the CsFT locus was proposed based on SVs and flowering time, suggesting an ancestral type of UR present only in Indian wild accessions. A total of 2,578 domestication-associated SVs (dSVs) and 8,651 highly divergent SVs (hdSVs) were identified. Two SVs (PINS and iINS) within the promoter and intron of PELPK7.1/PELPK7.2 genes were associated with altered root development in cultivated cucumber.
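The link between the 51 bp deletion and the 17 missing amino acids is codon arithmetic: 51 is a multiple of 3, so the deletion removes whole codons and leaves the downstream reading frame intact. The sketch below works through that calculation. It assumes the deletion lies entirely within coding sequence, and the deletion_effect helper is hypothetical.

```python
def deletion_effect(deleted_bp):
    """Report whether a coding-sequence deletion keeps the reading frame.

    Assumption: all deleted bases fall inside the coding sequence.
    """
    codons, remainder = divmod(deleted_bp, 3)
    return {"in_frame": remainder == 0, "codons_removed": codons}


# The 51 bp deletion reported for CsTu.
cstu = deletion_effect(51)

# A hypothetical 50 bp deletion, for contrast, would shift the frame.
shifted = deletion_effect(50)
```

An in-frame deletion like this shortens the protein by 17 residues without scrambling everything downstream, although a deletion that starts mid-codon can also alter the single residue at the junction.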
Discussion
The findings demonstrate the power of a graph-based pan-genome in comprehensively characterizing genetic variation and its association with agronomic traits. The identification of numerous SVs linked to traits such as fruit morphology, flowering time, and root development significantly advances our understanding of cucumber domestication and its genetic architecture. The insights gained provide valuable resources for marker-assisted selection and genomics-assisted breeding. The evolutionary trajectory of the CsFT locus clarifies the genetic basis of flowering time variation during cucumber domestication. The discovery of SVs impacting PELPK7.1/PELPK7.2 and their association with root development highlights potential targets for improving root system architecture.
Conclusion
This study presents a comprehensive graph-based pan-genome of cucumber, revealing extensive genetic variation associated with key agronomic traits and domestication. The resources generated will be valuable for researchers and breeders. Future studies could focus on functionally characterizing the identified SVs, expanding the pan-genome with additional accessions, and incorporating SNPs and InDels into the graph to improve variant calling and genotyping.
Limitations
The study used a limited number of accessions, which might underestimate the full extent of genetic diversity within the cucumber species. Furthermore, the functional validation of many identified SVs remains to be conducted. The analysis primarily focused on SVs; integrating other types of variation could provide a more holistic view of genetic diversity.