Food Science and Technology
Chromosome-level genome assembly of bunching onion illuminates genome evolution and flavor formation in *Allium* crops
N. Liao, Z. Hu, et al.
This summary covers an 11.27-Gb chromosome-level genome assembly of bunching onion (*A. fistulosum*). The study, led by researchers from Zhejiang University and other institutes, examines the genomic basis of flavor formation and the domestication history of this crop, which originated in China.
Introduction
Allium crops possess large genomes and unique cysteine-derived sulfur compounds that drive their characteristic pungent flavors and medicinal properties. Despite their global importance, the genomic mechanisms underlying flavor formation and the evolutionary relationships among Allium species remain unclear, confounded by morphological taxonomy, frequent interspecific hybridization, and the lack of high-quality reference genomes due to extensive repetitive content. This study aims to generate a chromosome-level reference genome for bunching onion (Allium fistulosum), clarify genome evolution and chromosomal collinearity within major Allium crops, dissect the expansion and diversification of sulfur-metabolism genes involved in ACSO biosynthesis and hydrolysis (e.g., alliinase and lachrymatory factor synthase), and resolve population structure, domestication, and migration routes of A. fistulosum using re-sequencing, transcriptomic, and metabolite analyses.
Literature Review
The Allium genus is distributed across the northern hemisphere, with diversity centered from the Mediterranean to Central Asia, and displays adaptations to harsh environments. Prior work has characterized Allium-specific alk(en)yl cysteine sulfoxides (ACSOs) including alliin, isoalliin, and methiin, and identified enzymes such as alliinase (ALL) and lachrymatory factor synthase (LFS) that generate characteristic volatiles. However, Allium taxonomy has relied largely on morphology, leading to misclassification among A. fistulosum, shallots (A. ascalonicum), and Chinese red onions (A. cepa var. proliferum). Natural and artificial interspecific hybridization is common and has been used to transfer traits, but genome-assisted breeding is hampered by large, repeat-rich genomes and the scarcity of high-quality assemblies. Previous genetic mapping suggested macro-synteny (e.g., between A. fistulosum and A. cepa using SSR markers), and recent onion genomic resources have emerged, yet comprehensive, chromosome-level assemblies and comparative analyses across major Allium crops were still limited.
Methodology
Plant material and sequencing: A bunching onion accession (SXSJC, Hangzhou, China) was sequenced. For population analyses, 135 Allium accessions (including 101 A. fistulosum, shallots, Chinese red onions, and A. altaicum) were re-sequenced.
Genome sequencing and assembly: Long reads were generated using PacBio SMRT (1,649.82 Gb; ~138× coverage) and assembled with Canu v1.9 (genomeSize=12G, corOutCoverage=40G). Polishing used NextPolish v1.2.4 (PacBio, rerun=3) and Pilon v1.2.2 (Illumina). Haplotypic redundancy was removed with Purge_dups v1.2.3. Optical maps were produced with Bionano and integrated using Bionano Solve v3.5.1 (DeNovo Assembly: -i 5 -F1 -W 1 -c 1; Hybrid Scaffold: -B 2 -N 2). Hi-C libraries were prepared (DpnII) and scaffolding performed with ALLHIC v0.9.8, anchoring contigs to 8 chromosomes. Illumina short reads also supported assembly accuracy.
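The coverage figure quoted above is total sequenced bases divided by genome size. The sketch below reproduces that arithmetic under two assumed denominators: the 12 Gb value passed to Canu as genomeSize and the 11.27 Gb final assembly. The function name is hypothetical, and the summary does not state which genome size the authors used for their ~138× figure.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Average fold-coverage: total sequenced bases divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# 1,649.82 Gb of PacBio reads, as quoted in the text.
# Assumption: 12 Gb is the Canu genomeSize setting; 11.27 Gb is the assembly size.
depth_vs_canu_size = sequencing_depth(1649.82, 12.0)
depth_vs_assembly = sequencing_depth(1649.82, 11.27)

print(f"{depth_vs_canu_size:.1f}x against a 12 Gb genome")
print(f"{depth_vs_assembly:.1f}x against the 11.27 Gb assembly")
```

Dividing by 12 Gb gives a depth close to the reported ~138×, which suggests the authors used a genome-size estimate near that value rather than the final assembly length.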
Assembly evaluation: Genome size, heterozygosity, and repeats were estimated by 17-mer analysis. Completeness was assessed via BUSCO v3.0.2 (embryophyta_odb10) and CEGMA v2.5; LAI was computed with LTR_retriever v1.0.7. Read mapping used BWA v0.7.8; consensus QV and completeness were estimated with Merqury. Genetic map markers were aligned using Chromonomer to assess collinearity.
Repeat and gene annotation: TEs were annotated with RepeatMasker/RepeatProteinMask (Repbase v15.02), RepeatModeler/RepeatScout/LTR_FINDER, and TRF for tandem repeats. Protein-coding genes were predicted using ab initio tools (Augustus, Geneid, Genscan, GlimmerHMM, SNAP), homology (TBLASTN against 12 reference species), and RNA-seq evidence (TopHat v2.0.13, Cufflinks v2.1.1, Trinity v2.1.1, PASA v2.2.0), integrated with EVM v1.1.1. Functional annotation leveraged SwissProt/TrEMBL/NR, InterProScan v4.7 (InterPro v29.0), GO, and KEGG. Non-coding RNAs were predicted with tRNAscan-SE v1.4 and INFERNAL v1.1.2.
Comparative genomics and evolution: Orthogroups across 13–16 plant genomes were identified using OrthoMCL v1.4; single-copy orthologs were aligned with MUSCLE v3.8.31 and used to build ML phylogenies with RAxML v8.2.12 (GTRGAMMA). Divergence times were estimated with MCMCTree (TimeTree calibrations). Gene family expansion/contraction was inferred with CAFE v4.2. WGD inference used FASTKs and mixture modeling (MCLUST). Synteny/collinearity was assessed with MCScanX and JCVI.
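LTR analysis: LTR insertion times were computed by aligning the two LTR ends of each element (MUSCLE) and applying T = K/(2r), where K is the sequence divergence between the paired LTRs and r = 1.3 × 10⁻⁸ is the assumed substitution rate. LTRs were classified with LTRdigest v1.07; phylogenies built with TreeBest.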
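The insertion-time formula can be illustrated with a short sketch. The summary does not say which distance model was used for K, so the Jukes-Cantor correction here is an assumption, as are the function names and the example divergence value; only the rate r = 1.3e-8 comes from the text.

```python
import math


def jc69_distance(ltr5: str, ltr3: str) -> float:
    """Jukes-Cantor distance between two aligned, equal-length LTR sequences.

    Assumption: JC69 is used purely for illustration; the study's distance
    model is not stated in this summary.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to equal length")
    # Keep only columns where both sequences carry an unambiguous base.
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time_years(k: float, r: float = 1.3e-8) -> float:
    """T = K / (2r): both LTRs accumulate substitutions independently."""
    if r <= 0:
        raise ValueError("substitution rate must be positive")
    return k / (2.0 * r)


# Hypothetical divergence of 0.052 between the two LTRs of one element.
example_time = insertion_time_years(0.052)
print(f"{example_time / 1e6:.2f} million years")
```

The factor of 2 reflects that the 5′ and 3′ LTRs are identical at insertion and then diverge along two independent lineages, so the pairwise distance accumulates at twice the per-lineage rate.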
Gene duplication classification and DSD analysis: DupGen_finder classified duplications as dispersed (DSD), tandem (TD), proximal (PD), WGD, and transposed (TRD). Ancestral vs new DSD genes were designated by collinear context.
Flavor-related gene identification and expression: ACSO pathway genes (GSH1/2, PCS, GGT, FMO; ALL and LFS) were identified via BLASTP (E<1e−5, identity>50%) and domain filtering (HMMER v3.1b1), using known LFS (AB089203) and ALL references. Phylogenies for LFS and ALL were built with MUSCLE and TreeBest; expression profiled by RNA-seq across tissues (roots, leaves, pseudo-stems, bulbs, floral stalks, flowers) with four biological replicates.
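The BLASTP thresholds above (E < 1e−5, identity > 50%) can be applied to tabular output as sketched below. This assumes the standard 12-column BLAST tabular layout (outfmt 6), in which identity is column 3 and E-value is column 11; the subject gene IDs and hit values are made up for illustration, and the function name is hypothetical.

```python
def filter_hits(tabular_text: str,
                max_evalue: float = 1e-5,
                min_identity: float = 50.0) -> list[str]:
    """Return subject IDs of hits passing both thresholds, in input order.

    Expects BLAST tabular output with the default 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    kept = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        subject, identity, evalue = fields[1], float(fields[2]), float(fields[10])
        # Strict inequalities, matching E < 1e-5 and identity > 50%.
        if evalue < max_evalue and identity > min_identity:
            if subject not in kept:
                kept.append(subject)
    return kept


# Hypothetical hits against the LFS reference AB089203.
EXAMPLE = "\n".join([
    "AB089203\tAfG001\t78.5\t160\t34\t0\t1\t160\t5\t164\t2e-80\t250",
    "AB089203\tAfG002\t42.0\t150\t87\t2\t3\t150\t10\t158\t1e-20\t95",
    "AB089203\tAfG003\t61.3\t90\t35\t1\t20\t109\t30\t119\t3e-04\t40",
])
candidates = filter_hits(EXAMPLE)
print(candidates)
```

In the study, candidates passing this step were further filtered by protein domain using HMMER, which this sketch does not cover.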
Population genomics: 135 accessions were re-sequenced on Illumina (total 16,453.99 Gb; average depth 9.49×). SNPs and InDels were called; phylogenies built with SNPhylo; population structure inferred by ADMIXTURE-like cross-validation (minimum CV at K=5). PCA was conducted; LD decay analyzed with PopLDdecay; nucleotide diversity (π) and FST were calculated. Selective sweeps were detected with XP-CLR between key subgroups. Gene flow was inferred, and migration routes proposed.
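As a rough illustration of the nucleotide diversity (π) statistic mentioned above, the sketch below computes per-site π from alternate-allele counts at biallelic SNPs. It is a simplified estimator that assumes no missing genotypes and the same number of sampled chromosomes at every site; the function name and the example counts are hypothetical, and the study's actual software and window settings are not given in this summary.

```python
def nucleotide_diversity(alt_counts: list[int],
                         n_chromosomes: int,
                         n_sites: int) -> float:
    """Per-site nucleotide diversity from biallelic alternate-allele counts.

    alt_counts    -- alternate-allele count at each variable site
    n_chromosomes -- sampled chromosomes (2 x diploid individuals)
    n_sites       -- total sites in the region, including invariant ones
    """
    if n_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    # n/(n-1) converts expected heterozygosity into the unbiased
    # average number of pairwise differences.
    correction = n_chromosomes / (n_chromosomes - 1)
    total = 0.0
    for count in alt_counts:
        if not 0 <= count <= n_chromosomes:
            raise ValueError("allele count out of range")
        p = count / n_chromosomes
        total += correction * 2.0 * p * (1.0 - p)
    return total / n_sites


# Hypothetical region: 10 sites, 4 chromosomes, two SNPs.
pi_example = nucleotide_diversity([2, 1], n_chromosomes=4, n_sites=10)
print(f"pi = {pi_example:.4f}")
```

Comparing this value between subgroups is how statements such as "G1-1 had highest nucleotide diversity" are typically derived, usually averaged over sliding windows along each chromosome.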
Metabolite quantification: ACSOs (alliin, isoalliin, methiin) were quantified via UHPLC; correlations between gene expression (FPKM) and isoalliin levels were analyzed (Z-score normalization).
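The expression-metabolite correlation step can be sketched as follows: both variables are Z-score normalized, and the Pearson coefficient is then the mean product of the paired Z-scores. The population standard deviation is used here, and whether the study used Pearson or another coefficient is an assumption; the function names and all values are invented for illustration.

```python
from statistics import mean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("values have zero variance")
    return [(v - mu) / sd for v in values]


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation as the mean product of paired Z-scores."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


# Hypothetical data: FPKM of one biosynthesis gene and isoalliin content
# measured in the same five accessions.
fpkm = [3.1, 5.4, 8.0, 12.2, 15.9]
isoalliin = [0.8, 1.1, 1.9, 2.4, 3.3]
r = pearson_r(fpkm, isoalliin)
print(f"r = {r:.3f}")
```

Z-scoring puts FPKM and metabolite concentrations on a common, unitless scale, which is useful for plotting them together even though it does not change the correlation coefficient itself.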
Key Findings
Genome assembly and content: A chromosome-level Allium fistulosum genome (11.27 Gb) was assembled with contig N50 7.34 Mb and super-scaffold N50 1.34 Gb; 98.71% of assembled contigs were anchored to eight chromosomes. BUSCO recovery reached 91.0% (of 1614 genes), CEGMA 94.35%, Merqury QV 37.06 and completeness 93.61%; Illumina mapping rate 99.65% and genome coverage 99.34% with SNP rate 0.0063%. LAI was 24.81. A total of 62,255 protein-coding genes were annotated (average gene length ~5000 bp; ~3.93 exons/gene). Repeats constituted 69.81% (7,885 Mb) of the genome, dominated by LTR retrotransposons (62.18%), particularly Gypsy elements with recent bursts within the last ~2 MY. CRM Gypsy clade contributed 10.38% in A. fistulosum; Copia composition differed from garlic.
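Two of the assembly metrics above have simple definitions that a short sketch can make concrete. N50 is the length of the sequence at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half the assembly size. Merqury's QV is Phred-scaled, so QV = −10·log10(error rate). The function names and toy contig lengths below are hypothetical; only the QV of 37.06 comes from the text.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L cover half the assembly."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled consensus quality value to a per-base error rate."""
    return 10.0 ** (-qv / 10.0)


# Toy contig set (hypothetical): total 30, half 15; 10 + 8 = 18 reaches it.
toy_n50 = n50([10, 8, 5, 4, 3])

# Reported Merqury QV for the A. fistulosum assembly.
error_rate = qv_to_error_rate(37.06)
print(f"toy N50 = {toy_n50}")
print(f"QV 37.06 is about one error per {1 / error_rate:,.0f} bases")
```

Read this way, a contig N50 of 7.34 Mb means that half of the assembled bases sit in contigs at least that long, and a QV of 37.06 corresponds to roughly two consensus errors per 10 kb.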
Genome evolution: Phylogenomics (492 single-copy families) dated the A. fistulosum–A. cepa split to ~7.4 MYA and divergence from A. sativum at ~16.7 MYA. Allium genomes showed numerous expanded/contracted families, with 308 families commonly expanded across A. fistulosum, A. cepa, and A. sativum, enriched in catalytic and sulfur-metabolism-related GO terms (e.g., carbon–sulfur lyase, cysteine-type peptidase). Ks distributions support three ancient WGDs shared with garlic and no lineage-specific WGD after divergence among the three crops. Chromosome-level collinearity revealed near one-to-one synteny between A. fistulosum and A. cepa (notable inversion on Chr4), and extensive synteny with A. sativum alongside inversions and rearrangements on several chromosomes.
Gene duplication dynamics: Large-genome species exhibited high frequencies of dispersed duplications (DSD). In Allium, ancestral vs new DSD gene pairs were 13,676 in A. fistulosum, 17,817 in A. sativum, and 8,424 in A. cepa; ancestral DSD genes showed significantly higher expression than new duplicates, indicating potential functional divergence (sub-/neo-functionalization) and pseudogenization.
Flavor gene expansions and specialization: ACSO biosynthesis genes were cataloged; in A. fistulosum, seven GSH1, one GSH2, one PCS, three GGT, and three FMO orthologs were identified, many constitutively expressed across tissues. Hydrolysis genes were massively expanded: ALL copies numbered 56 (A. fistulosum), 65 (A. cepa), 48 (A. sativum). LFS genes were present only in Allium among 14 genomes and expanded to 43 (A. fistulosum), 27 (A. cepa), 29 (A. sativum). LFS family subdivided into LFS-a/b/c/d; functional onion LFS (BAC21275.1) clustered within LFS-c. In A. fistulosum, AfLFS01 (closest to BAC21275.1) showed highest expression in most tissues. A. fistulosum and A. cepa preferentially expressed LFS-c members, whereas A. sativum predominantly expressed LFS-a. Micro-collinearity indicated reciprocal one-to-many expansions between bunching onion and garlic for LFS and ALL. Duplications involved dispersed, tandem, and proximal modes; notably, 33 of 43 AfLFS formed tandem clusters within a ~10 Mb region on Chr5 intertwined with LTRs, implicating LTRs in gene family proliferation.
ACSO profiles: UHPLC quantification showed isoalliin is the dominant ACSO in A. fistulosum and A. cepa; alliin predominates in A. sativum (with minor isoalliin). LFS catalyzes conversion of isoalliin-derived sulfenic acids to propanthial S-oxide (lachrymatory factor).
Population structure, domestication, and migration: Resequencing of 135 accessions yielded 48,218,339 SNPs and 18,182,723 InDels. ML phylogeny resolved four clusters: Chinese red onions (A. cepa var. proliferum) with A. altaicum; shallots (A. ascalonicum); A. fistulosum Group 1 (G1); and Group 2 (G2). Ninety-six confirmed A. fistulosum accessions partitioned into five subgroups (G1-1, G1-2, G1-3; G2-1, G2-2). G1-1 (western China) had highest nucleotide diversity; G1-3 (north/northeast China) had lowest LD decay, indicating a diversity center. G1-2 (Middle Asia/Russia) displayed distinct structure, lower π, higher LD, higher FST versus others, and evidence of migration/gene flow from western China (G1-1). G2-1 (southeast China) was closer to G1 than G2-2; G2-2 (Japan, America, Europe) likely derived from southeast China. Collectively, data support China (northwestern and adjacent regions) as the origin and domestication center of A. fistulosum.
Genomic basis of isoalliin variation: Leaves from 91 A. fistulosum accessions showed significantly higher isoalliin in G1-2 vs G1-3 (P=8.4×10−3). XP-CLR scans identified selective sweeps containing sulfur-flavor genes: G1-2 selections included AfGSH2, AfFMO3, AfGGT3, AfALL48, AfALL49, AfALL55, AfLFS11; G1-3 selections included AfPCS1 and AfALL42. Isoalliin levels positively correlated with expression of biosynthesis genes (notably AfGSH2 and AfPCS1). Low-isoalliin accessions exhibited higher expression of hydrolysis genes (ALL/LFS), suggesting combined effects of biosynthesis and hydrolysis on isoalliin accumulation.
Discussion
This chromosome-scale assembly of A. fistulosum resolves a long-standing barrier in Allium genomics posed by large, repeat-rich genomes. The predominance of LTR/Gypsy expansions and the prominence of dispersed duplications explain both genome enlargement and extensive gene family proliferation in Allium. Strong chromosome-level collinearity among A. fistulosum, A. cepa, and A. sativum, with limited inversions/rearrangements, aligns with observed weak reproductive isolation and facilitates cross-species translational genomics and breeding. The marked, Allium-specific expansions of ALL and LFS gene families, coupled with lineage-biased expression (LFS-c in A. fistulosum/A. cepa vs LFS-a in A. sativum) and tandem clustering associated with LTRs, support a model where repeated duplication and functional differentiation underpin diversification of pungent sulfur volatiles and adaptive responses to biotic/abiotic stresses. The distinct ACSO profiles (isoalliin-dominant in A. fistulosum/A. cepa and alliin-dominant in A. sativum) are consistent with differential expression of LFS clades and selection signals in flavor-related loci. Population genomics corroborates China, particularly its northwestern to northern regions, as the origin and domestication center of A. fistulosum, with subsequent spread to Middle Asia (G1-2) and onward introductions to Japan, Europe, and America (G2-2) from southeast China. Selection and expression analyses indicate that both enhanced biosynthesis (e.g., GSH2, PCS) and modulated hydrolysis (ALL/LFS) contribute to isoalliin variation, offering tangible targets for flavor and pungency breeding.
Conclusion
The study delivers a high-quality, chromosome-level reference genome for Allium fistulosum and comprehensive comparative, functional, and population genomic insights into Allium evolution and flavor formation. Key contributions include: (1) elucidation of genome expansion driven by LTR/Gypsy bursts and dispersed duplications; (2) demonstration of strong chromosome-level collinearity among major Allium crops; (3) discovery of Allium-specific, massively expanded ALL and LFS gene families with lineage-biased expression and evidence for tandem proliferation; (4) metabolic and transcriptomic linkage of ACSO profiles to LFS clades; and (5) resolution of population structure, domestication origin in China, and migration routes of A. fistulosum. These resources and findings enable genome-assisted breeding for pungency, flavor, and other agronomic traits. Future work should include functional validation of LFS/ALL clade activities, dissection of regulatory networks controlling ACSO biosynthesis and hydrolysis, and expanded sampling to refine domestication scenarios and trait-associated loci.
Limitations
Functional roles of expanded LFS and ALL clades were primarily inferred from phylogeny and expression; direct biochemical or genetic validation (e.g., knockouts, enzyme assays across clades) remains to be completed. The contribution of specific LTR events to gene family expansion is correlative. Population analyses, while broad, may still be limited by sampling and moderate resequencing depth (average ~9.5×), potentially affecting rare variant detection. Comparative genomics included a defined set of plant genomes; additional Allium and outgroup genomes could further refine evolutionary inferences.