The giant diploid faba genome unlocks variation in a global protein crop
M. Jayakodi, A. A. Golicz, et al.
Faba bean (Vicia faba L., 2n = 12) was domesticated in the Near East more than 10,000 years ago and has become a global crop due to broad adaptability, rotational value, and high nutritional density. Despite the crop's importance, its wild progenitor remains unknown, though archaeological evidence indicates pre-domestication use and a Levantine origin. Faba bean exhibits extreme variation in seed size and comprises botanical types that reflect human selection but are not separated by reproductive barriers. As societies aim to reduce agricultural greenhouse gas emissions by replacing animal with plant proteins, faba bean is attractive: it is the highest-yielding grain legume, with favorable protein content (~29%), high biological nitrogen fixation rates, and pollinator-friendly flowers. Yet improvement is challenged by partial allogamy, a very large estimated genome size (~13 Gb), and a low seed multiplication rate, all of which make conventional breeding difficult. Although transcriptomic resources and SNP maps enabled the identification of genes such as the WD40 transcription factor underlying Zero Tannin1 and VCI, which controls vicine and convicine content, the absence of a reference genome has hindered progress. This study aims to generate a high-quality, chromosome-scale faba bean genome assembly, characterize its evolution and repeat landscape, and demonstrate practical applications for trait dissection (seed size and hilum colour) to enable genomics-assisted breeding.
Prior work in faba bean genomics established transcriptomes and SNP-based genetic maps showing strong collinearity with model legumes, facilitating gene discoveries such as the WD40 transcription factor underlying Zero Tannin1 and cloning of VCI controlling vicine and convicine content. However, lack of a reference genome complicated these efforts. Comparative studies in legumes (pea, lentil) and grasses informed expectations about gene distribution and recombination. Repeat biology in Fabaceae highlighted the Ogre family of Ty3/gypsy LTR retrotransposons as major determinants of genome size variation. Methylation and TE dynamics have been implicated in genome expansion in other taxa. Seed size regulators such as CYP78A genes are known in Arabidopsis, and hilum colour in pea maps to variation in a PPO gene, providing comparative candidates for faba bean. Genotyping technologies like SPET offer high-throughput, cost-effective marker discovery for GWAS and breeding.
Reference genotype selection and sequencing: The highly homozygous, autofertile, early maturing line 'Hedin/2' was chosen as reference. PacBio HiFi long reads were generated to ~20-fold coverage. Assembly: 11.9 Gb were assembled with contig N50 > 2.7 Mb. Scaffolding and chromosome assembly used a genetic map and Hi-C chromatin conformation data to place 11.2 Gb (94%) into six chromosomal pseudomolecules. Centromere locations were identified using CENH3 ChIP-seq. Hi-C interaction matrices were inspected for Rabl configuration and gaps corresponding to large satellite arrays. Assembly quality was evaluated with Merqury (completeness 96.3%, QV 60.5) and LTR Assembly Index (10.5). A second genotype, the German cultivar 'Tiffany', was sequenced with PacBio HiFi (~10×) and assembled to 11.4 Gb (contig N50 1.6 Mb); contigs were ordered against Hedin/2 to form pseudomolecules. Annotation and comparative genomics: Gene prediction used RNA-seq from nine diverse tissues, yielding 34,221 protein-coding genes (BUSCO: 96% Embryophyta single-copy orthologues). Gene density, recombination distribution, and synteny with Medicago truncatula and pea were analyzed. Orthologues of 262 Medicago symbiosis genes were identified and expression responsiveness to inoculation verified by RNA-seq. Intron/exon and intergenic lengths were compared across angiosperms and gymnosperms. Tandem duplication and WGD history were assessed using substitution rates (4dTv) and phylogenomics to date the last WGD (~55 Ma) and divergence times (~12.2 Ma from pea, ~13.8 Ma from lentil). Repeat and methylation analyses: Transposable elements (TEs) were annotated, with emphasis on LTR retrotransposons (Gypsy/Copia) and satellite repeats. Full-length vs solo-LTR counts were modeled to estimate DNA loss via recombination. Whole-genome DNA methylation was quantified (CG, CHG, CHH contexts), and methylation patterns profiled over genes, TEs, and satellites (e.g., FabTR-83). 
TE insertion age distributions were inferred. Genes involved in RNA-directed DNA methylation were enumerated for copy number comparisons. Genetic mapping resources: Markers from Illumina Infinium 1,536 SNP and Illumina Oligo Pool Array platforms, and biparental/consensus genetic maps, were projected onto the assembly to anchor QTLs for disease resistance and phenology, with interactive browser support. SPET genotyping and variant discovery: A 90,000-probe SPET assay (≥1 probe per predicted gene) was designed from Hedin/2. A diversity panel of 197 cultivated accessions was genotyped. Reads were mapped to Hedin/2, and 1,081,031 segregating variants (SNPs and indels) were called and functionally annotated to identify premature termination codons (PTCs) across accessions. GWAS for seed size: Seed size phenotypes (seed area, width, length, and thousand grain weight [TGW]) were measured at two locations over three years for the 197 accessions. Multiple GWAS methods were applied, yielding stable marker-trait associations. Candidate gene identification included phylogenetics and expression analyses; genomic prediction (GBLUP) assessed prediction accuracy using candidate vs genome-wide markers. Hilum colour genetics: Hilum colour (binary) was analyzed by GWAS in the diversity panel and by bulk segregant analysis using a recessive pseudo-F2 pale-hilum bulk. Laser desorption-ionization mass spectrometry imaging (LDI-MSI) profiled phenolic compounds in hilum tissue. Structural variation and expression of a tandem polyphenol oxidase (PPO) gene cluster on chromosome 1 were compared between Hedin/2 (dark hilum) and Tiffany (pale hilum), including promoter repeat insertions (MITE, 'Tippo') near transcription start sites and assessment of methylation levels.
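The variant annotation step described above flags SNPs that introduce premature termination codons. The study's actual annotation pipeline is not specified here, so the following is only a minimal sketch of the underlying check: the function names `first_stop_index` and `introduces_ptc` and the toy coding sequence are hypothetical, and the code assumes a forward-strand coding sequence with a single-base substitution.

```python
from typing import Optional

# The three standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_index(cds: str) -> Optional[int]:
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def introduces_ptc(cds: str, pos: int, alt: str) -> bool:
    """True if substituting base `alt` at 0-based CDS position `pos`
    creates an in-frame stop codon upstream of the reference stop."""
    mutated = cds[:pos] + alt + cds[pos + 1:]
    ref_stop = first_stop_index(cds)
    alt_stop = first_stop_index(mutated)
    if alt_stop is None:
        return False
    return ref_stop is None or alt_stop < ref_stop


# Toy CDS (hypothetical): ATG CAA GGA TAA
toy_cds = "ATGCAAGGATAA"
nonsense = introduces_ptc(toy_cds, 3, "T")    # CAA -> TAA, a new early stop
synonymous = introduces_ptc(toy_cds, 5, "G")  # CAA -> CAG, no new stop
```

A real annotation would also need to handle indels, reverse-strand genes, and splice structure, which this sketch deliberately leaves out.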
Genome assembly and architecture: A chromosome-scale assembly of the ~13 Gb faba bean genome was produced for 'Hedin/2' (11.9 Gb assembled; 11.2 Gb (94%) placed into six pseudomolecules). Assembly metrics: Merqury completeness 96.3%, QV 60.5; LTR Assembly Index 10.5; contig N50 > 2.7 Mb. Centromeres were defined by CENH3 ChIP-seq; chromosome 1 exhibited a Rabl configuration. Large satellite arrays (up to 752 Mb) created Hi-C coverage voids and corresponded to cytological repeat bands. A 'Tiffany' assembly (11.4 Gb; N50 1.6 Mb) of similar quality was ordered relative to Hedin/2. Gene content and recombination: 34,221 protein-coding genes (BUSCO 96%) with uniform gene density along chromosomes except at satellite arrays. Recombination is evenly distributed (~27 genes per cM), contrasting with cereals having large pericentromeric coldspots. Gene order is highly syntenic with legumes. Introns are not inflated relative to smaller angiosperms; genome expansion is largely in intergenic regions. No recent WGD; last WGD estimated at ~55 Ma; divergence from pea ~12.2 Ma and lentil ~13.8 Ma. Tandem duplications are common (1,108 syntenic clusters), including expansion of agronomically relevant leghaemoglobins; copy number variation exists between Hedin/2 and Tiffany. Repeats and genome expansion: ~79% of the genome is transposon-derived; LTR retrotransposons comprise 63.7% (Gypsy > Copia by >2:1). The Gypsy 'Ogre' family alone constitutes ~44% of the genome; elements reach up to 35 kb (Ogre) and 32 kb (SIRE). Satellite repeats total 9.4% of the genome, with FokI accounting for 4% (0.475 Gb) and forming prominent heterochromatic bands. TE density is remarkably uniform and inversely related to satellite arrays. 
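The contig N50 values quoted above for the Hedin/2 and Tiffany assemblies are a standard contiguity statistic: the contig length at which contigs of that size or longer together cover at least half of the assembled bases. A minimal sketch of the calculation follows; the function name `n50` and the toy contig lengths are illustrative and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the assembly size; the
    length of the contig that crosses that threshold is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths supplied")


# Toy example (hypothetical lengths, arbitrary units).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the assembly was also assessed with Merqury and the LTR Assembly Index.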
TE dynamics and methylation: Modeling 395,657 solo-LTRs (sLTRs) for canonical Ogre (two LTRs ~4,161 bp each; internal ~11,655 bp) suggests ~6.26 Gb DNA loss (55.6% of assembly) via recombination, yet the sLTR-to-full-length ratio is low relative to smaller genomes, indicating slower TE removal versus insertion, akin to gymnosperms. Whole-genome cytosine methylation is high: 95.8% CG, 88.2% CHG, 14% CHH. Gene body CG methylation follows canonical patterns and correlates with higher expression in young leaf tissue; TEs are heavily methylated in bodies and flanks. Recent TE bursts occurred <1 Ma, with many intact elements 3–5 Ma old; methylation is high across ages. Elevated CHH islands co-occur with satellite FabTR-83 (1.1% of genome). Copy numbers of RdDM pathway genes are comparable to related legumes, arguing against methylation deficiency as a driver of genome expansion. Genetic resources and variation: The SPET 90K assay across 197 accessions yielded 1,081,031 segregating variants uniformly distributed. Functional annotation identified 1,042 SNPs and 65 indels introducing premature termination codons affecting 933 genes, including 39 resistance gene analogues; an atlas of PTC-bearing genes/accessions is provided. Seed size genetics: GWAS identified 15 robust marker-trait associations for seed area, width, length, and TGW. The strongest association on chromosome 4 localizes within Vfaba.Hedin2.R1.4g051440, a CYP78A homolog highly expressed in seeds, in a known QTL region and homologous to Arabidopsis CYP78A seed size regulators. While this single locus does not explain most variation, using all 15 seed-size-associated SNPs enabled genomic prediction with accuracy nearly matching genome-wide markers, indicating coverage of many key loci. Population structure grouped accessions into four geographic clusters with widespread seed-enlarging alleles, suggesting extensive germplasm sharing. 
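The solo-LTR estimate of DNA loss reported above can be reproduced with simple arithmetic, under the assumption that each solo-LTR marks one recombination event between the two LTRs of a canonical Ogre element, removing the internal region plus one LTR. This is one reading of the model that matches the reported ~6.26 Gb; the variable and function names are illustrative.

```python
# Values taken from the text above.
SOLO_LTR_COUNT = 395_657   # solo-LTRs attributed to canonical Ogre
LTR_BP = 4_161             # length of one Ogre LTR
INTERNAL_BP = 11_655       # length of the internal region


def dna_lost_bp(solo_ltrs: int, ltr_bp: int, internal_bp: int) -> int:
    """Base pairs removed if every solo-LTR arose from one LTR-LTR
    recombination that deleted the internal region and one LTR."""
    return solo_ltrs * (ltr_bp + internal_bp)


lost = dna_lost_bp(SOLO_LTR_COUNT, LTR_BP, INTERNAL_BP)
print(f"{lost:,} bp lost, about {lost / 1e9:.2f} Gb")
```

Each event removes 15,816 bp under this assumption, so the total scales linearly with the solo-LTR count; any change to the assumed element structure shifts the estimate proportionally.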
Hilum colour genetics: A single major GWAS peak colocalized with the known HC locus and a peak of homozygosity in a pale hilum bulk. The top marker lies in a PPO gene within a tandem PPO cluster: eight intact PPO copies in Hedin/2 (dark) and eleven in Tiffany (pale) with structural differences (e.g., triplication of PPO-4 in Tiffany; absence of PPO-5 in Hedin/2). Expression is testa-specific; in Hedin/2 testa, PPO-2 (and to lesser extent PPO-3) dominate expression, whereas in Tiffany, PPO-6/7 dominate. A ~2-kb AT-rich MITE ('Tippo') inserted near the PPO-2 promoter in Tiffany disrupts predicted TF binding and lies near the TATA box; this MITE class is highly methylated. LDI-MSI showed absence of chlorogenic acid oligomers, gallocatechin, and tetracosylcaffeate signals in pale hila. Together, data indicate that altered regulation (likely silencing) of PPO-2 via a promoter-proximal MITE underlies hilum colour variation.
This study addresses the longstanding lack of a reference genome for faba bean, a major temperate protein crop, by delivering a high-quality, chromosome-scale assembly and a companion assembly from a second cultivar. The results reveal that genome gigantism arises primarily from proliferation and slow elimination of very long LTR retrotransposons (notably Ogre) and abundant satellite repeats, rather than intron expansion or recent polyploidy. Despite its size, the genome features evenly distributed genes and recombination, improving prospects for mapping and breeding compared to cereals with large pericentromeric coldspots. High, efficient methylation across TEs and repeats suggests that methylation deficiency is not driving TE proliferation. Practical translational outcomes include a 90K SPET genotyping platform and genome-anchored maps enabling integration of legacy markers and QTLs. High-resolution GWAS identified a suite of seed size loci, including a strong CYP78A candidate, and demonstrated that a small set of markers predicts seed size nearly as well as genome-wide data, providing immediate utility for selection. For hilum colour, integrating GWAS, bulk segregant analysis, structural variant analysis, tissue-specific expression, and metabolite imaging pinpointed regulatory variation at a PPO cluster—specifically a promoter-proximal, methylated MITE insertion near PPO-2—as the likely cause of pale hilum. These advances collectively establish a comprehensive genomics-based breeding platform for faba bean, facilitating rapid introgression and selection of agronomically important traits.
The work delivers the first high-quality, chromosome-scale assembly of the giant (~13 Gb) diploid faba bean genome and a secondary assembly, elucidating that genome expansion is driven by proliferation of very long LTR retrotransposons and satellite repeats with slow removal. Despite genome size, gene and recombination distributions are favorable for genetic analysis. Extensive repeat and methylome annotations, together with robust gene models, underpin translational tools: a 90K SPET assay, integrated marker/QTL maps, and GWAS discoveries for seed size and hilum colour. Key contributions include identification of a CYP78A homolog as a major seed size candidate and evidence that a promoter-proximal, methylated MITE modulates PPO-2 expression underlying hilum colour. These resources enable genomics-assisted breeding across diverse temperate and subtropical environments. Future directions include: constructing a faba bean pan-genome anchored on Hedin/2; population-scale resequencing of genebank and elite germplasm to catalogue segregating variation and structural variants; functional validation of candidate genes via expression perturbation and gene editing (supported by available transformation methods); and leveraging the repeat atlas to study TE-driven regulatory variation affecting traits.
Some highly repetitive satellite-rich regions remain challenging to assemble, leaving a fraction of the ~13 Gb genome unresolved and producing Hi-C gap regions. Trait dissection in tandemly duplicated gene clusters (e.g., PPO) can be confounded by copy number and structural variation, requiring multi-omic validation. GWAS and genomic prediction were performed on a panel of 197 cultivated accessions; broader sampling may reveal additional alleles and refine effect estimates. Modeling of TE elimination via solo-LTRs relies on assumptions about recombination mechanisms and element structure. The lack of a known wild progenitor limits access to broader allelic diversity and restricts interspecific introgression.

