Chromosome-level genome assembly and population genomic resource to accelerate orphan crop lablab breeding

I. Njaci, B. Waweru, et al.

This study presents a high-quality, chromosome-scale assembly of the drought-resilient lablab genome, provides evidence for two domestication events, and evaluates genetic diversity across the species. Carried out by a collaborative, Africa-led team, it lays the groundwork for lablab improvement and offers a model for researchers working with indigenous crops in low- and middle-income countries.

Introduction

The study addresses the need to diversify global food systems that currently rely heavily on a few staple crops, increasing vulnerability to environmental and social instability. Lablab purpureus, an indigenous African legume with notable drought resilience and multiple uses (food, feed, bioactive compounds), is a promising orphan crop for genome-assisted breeding. Despite advances in crop genomics, African researchers have been under-represented in sequencing indigenous orphan crops due to limited sequencing infrastructure, computing resources, and bioinformatics capacity. The authors aim to produce a high-quality, chromosome-scale lablab genome assembly, investigate domestication history and population structure, and identify genomic loci linked to agronomic traits, while demonstrating an inclusive, Africa-led genomics collaboration model.

Literature Review

The background covers the growing application of genome-assisted breeding to orphan crops and an earlier short-read draft genome for lablab. The paper underscores the under-representation of African researchers in sequencing indigenous crops and references efforts to build capacity (e.g., African Plant Breeding Academy, African Biogenome Initiative). Comparative context comes from legume genomics (common bean, cowpea) and from known cases of multiple domestications in other species (common bean, lychee, Tartary buckwheat, potentially rice and barley). Together, these works frame the significance of producing a reference-grade assembly and population resources for lablab.

Methodology

Reference genome sequencing and assembly: In-country long-read sequencing of Lablab purpureus cv. Highworth using ONT MinION generated 4.7 million reads (mean length 6.1 kb), totaling 28.4 Gbp (~67x coverage of a ~423 Mbp genome). De novo assembly with Flye (v2.7.1) yielded 2,260 contigs (N50 11.0 Mbp; total 426.2 Mbp). Assembly polishing used ~380x publicly available Illumina short reads with HyPo, and Hi-C scaffolding (Phase Genomics Proximo, Illumina HiSeq PE150; 232M read pairs) produced 11 pseudomolecules covering 417.9 Mbp (98.0% of assembly; 98.6% of estimated genome), with scaffold N50 38.1 Mbp. Assembly quality: BUSCO completeness 98.5% (embryophyta) and 98.2% (fabales); LAI 19.8. Chromosome naming was guided by synteny to Phaseolus vulgaris and Vigna unguiculata using BLASTP and MCScanX.
Genome annotation: Combined pipelines (Funannotate and plant.annot) integrating protein homology (P. vulgaris, V. angularis, C. cajan, M. truncatula, Arabidopsis thaliana), RNA-Seq evidence (leaf, stem, sepal, petal), and ab initio predictors (AUGUSTUS, SNAP, GlimmerHMM), merged via EvidenceModeler and GenomeTools. Functional annotation via AHRD, Pfam/InterPro, and GO terms. tRNAs identified with tRNAscan-SE. TE-related gene models flagged. BUSCO used to assess annotation completeness.
Repeat annotation: EDTA pipeline (LTRharvest, LTR_FINDER, LTR_retriever, TIR-Learner, HelitronScanner, RepeatModeler2, RepeatMasker) for TE discovery and genome-wide annotation; tandem repeats via Tandem Repeats Finder with overlap resolution.
Comparative genomics and gene family analyses: Orthogroups identified using OrthoFinder across lablab and four legumes (P. vulgaris, V. angularis, C. cajan, M. truncatula) with A. thaliana as outgroup. Gene family expansion/contraction assessed using CAFE; GO enrichment of expanded families via topGO.
Trypsin inhibitor (TI) gene family: Candidate identification using BLASTP against lablab proteins with TI references, domain confirmation (InterProScan: PF00228 Bowman-Birk, PF00197 Kunitz), phylogeny (MUSCLE, FastTree), genomic distribution (RIdeogram), synteny and tandem duplication analysis (MCScanX). Expression quantified from SRA RNA-Seq (petal, stem, sepal, young leaf) using fastp, Kallisto, tximport, DESeq2; visualized with pheatmap.
Resequencing and phylogenetics: Short-read resequencing (2×150 bp) of wild and domesticated two- and four-seeded accessions, subsp. bengalensis, likely feral Indian accessions, plus Dipogon lignosus outgroup. Read processing with Trimmomatic; mapping with Bowtie2; BAM processing with SAMtools and Picard; variant calling with bcftools; filtering on quality, depth, missingness, and minor allele frequency (MAF). Phylogeny from variants >2 kb apart using VCF2Dis and FastME neighbor-joining (NJ), with bootstrap support from 1,000 replicates. STRUCTURE runs (K=1–6) with the Evanno method to infer clusters; nucleotide diversity (π) and FST computed in windows.
Population diversity and GWAS in germplasm collection: 2,300 plants from 203 accessions genotyped with DArTseq (SNP and SilicoDArT markers). Quality filtering on MAF, missingness, linkage disequilibrium (LD), and genome distribution; imputation with missForest. True-to-type accessions (191) selected using identity by descent. Population structure via STRUCTURE (multiple K), hierarchical clustering, and PCA. AMOVA, pairwise FST, and Nei’s D computed. Historical phenotypes (field trials in Ethiopia, 1998 and 2000) integrated via linear mixed models, ANOVA/Tukey, and chi-square tests for qualitative traits. GWAS using GAPIT3 (GLM, MLM, MLMM, FarmCPU, BLINK) on 21,982 markers; significance by FDR (<0.05) and consensus across ≥2 models; marker-trait associations (MTAs) summarized on Circos plots.
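
The depth and contiguity figures quoted for the reference assembly come down to simple arithmetic. The Python sketch below recomputes the approximate ONT coverage from the reported read yield and estimated genome size, and shows one common way to compute N50. The function names and the toy contig lengths are illustrative assumptions and are not taken from the study's pipeline.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def n50(lengths: list[int]) -> int:
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half = sum(lengths) / 2
    running = 0
    # Accumulate from the longest sequence down until half the total is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable for non-empty input")


# Figures reported in the summary: 28.4 Gbp of ONT reads, ~423 Mbp estimated genome.
ont_depth = coverage(28.4e9, 423e6)
print(f"ONT depth: ~{ont_depth:.0f}x")

# Toy contig lengths (hypothetical), only to show how N50 behaves.
toy_contigs = [50, 40, 30, 20, 10]
print("toy N50:", n50(toy_contigs))
```

With the reported numbers the depth comes out at roughly 67x, matching the figure given for the MinION data. In practice N50 would be computed over the real contig or scaffold lengths of the assembly rather than a toy list.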

Key Findings

Reference-grade genome assembly: Hi-C–scaffolded 11 chromosomes totaling 417.9 Mbp (98.6% of estimated 423 Mbp), scaffold N50 38.1 Mbp; BUSCO completeness 98.5% (embryophyta) and 98.2% (fabales); LAI 19.8, indicating high repeat/intergenic contiguity. Chromosome naming based on synteny to common bean and cowpea.
Genome annotation: 30,922 gene models (79,512 transcripts; mean 2.57 isoforms/gene); 24,972 high-quality non-TE protein-coding genes; RNA-Seq expression support for 73%; functional descriptions for 28,927 genes (93.3%); 542 tRNA genes.
Repeats: 168,174 TE sequences covering 28.1% of the genome; LTR-RTs dominant (85,149 sequences; 19.9% of genome). Copia elements more abundant (13.2%) than gypsy (4.7%), an uncommon pattern in plants. Total repeats (TEs + unclassified) 43.4%; 142,302 tandem repeats enriched near putative centromeres.
Gene family evolution: 24,397 of 24,972 genes (97.7%) placed into orthogroups; 14,047 orthogroups shared among five legumes; 119 lablab-specific orthogroups (417 genes) enriched for fatty acid biosynthesis and arabinose metabolism, with roles in pollen–pistil interaction and plant development. Lablab shows 448 significantly expanded and 899 contracted gene families; expanded sets enriched for lignin/pectin metabolism and photosynthesis.
Trypsin inhibitor (TI) gene architecture: 35 TI genes identified, enriched relative to other legumes and Arabidopsis. Genes cluster mainly on Lp01, Lp04, Lp06, Lp11, with two major clusters on Lp04 harboring 21 genes (60% of TI set); 66% of TI genes are tandem duplicates. Synteny indicates conservation across legumes with additional clusters/duplications specific to lablab and V. angularis. Expression: Lp01/Lp06/Lp11 cluster members highly expressed in vegetative tissues; most Lp04 cluster genes show low expression.
Dual domestication confirmed: Resequencing and NJ phylogeny (rooted on Dipogon) reveal distinct two- and four-seeded clades (100% bootstrap), each containing wild and domesticated accessions; subsp. bengalensis derives from the four-seeded group; likely feral Indian accessions related to four-seeded domesticates. STRUCTURE supports K=3 with clear separation and no admixture between two- and four-seeded groups; further sub-structuring within four-seeded group.
Diversity and divergence: Total variants in four-seeded and two-seeded gene pools: 10,666,655 and 5,200,923, respectively. Nucleotide diversity (π per 100 kb, variant sites only) significantly higher in four-seeded (0.00790 ± 0.00311 SD) than two-seeded (0.00599 ± 0.00260 SD) (t=30.43, df=8095, P<0.001). Mean FST between gene pools = 0.438 ± 0.059, indicating substantial divergence that may warrant taxonomic reassessment.
Germplasm population structure: From 203 accessions, 191 true-to-type retained. STRUCTURE on 7,780 SNPs supports two main clusters (two- vs four-seeded), with K=7 further partitioning four-seeded accessions by geography/morphology (e.g., clusters enriched for Africa, Asia/Australia, subsp. bengalensis). Pairwise FST among seven clusters ranges from 0.31 to 0.97; AMOVA attributes 81% of variance to differences between clusters. Within-cluster Nei’s D lowest in cluster II (0.002) and highest in cluster I (0.186).
Phenotypic differentiation: Significant differences among genetic clusters across 14 quantitative traits and 6 of 7 qualitative traits, aligning with pod/seed morphology, plant architecture, flowering time, and flower color.
GWAS: 18 significant MTAs (8 SNPs, 10 SilicoDArTs) across eight chromosomes associated with leaf length/width/ratio, plant height, days to 50% flowering, pod length/width/ratio, and thousand seed weight, explaining 7–24% of phenotypic variance. These markers provide entry points for marker-assisted selection and QTL characterization.
Inclusive collaboration: Sequencing, assembly, and parts of analysis performed in Africa using portable, low-cost ONT MinION; complemented by capacity building and equitable international partnerships.
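
The GWAS hits listed above were retained when they passed an FDR threshold of 0.05 and were supported by at least two of the GAPIT3 models. As a rough illustration of that filtering logic, the Python sketch below applies a Benjamini-Hochberg adjustment to p-values from each model and keeps markers that are significant in two or more models. The choice of Benjamini-Hochberg is an assumption (the summary does not say which FDR procedure was used), and the marker names, p-values, and function names are made up for the example.

```python
from collections import Counter


def bh_adjust(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def consensus_hits(results: dict[str, dict[str, float]],
                   alpha: float = 0.05,
                   min_models: int = 2) -> list[str]:
    """Markers with adjusted p < alpha in at least `min_models` models."""
    counts = Counter()
    for pvals_by_marker in results.values():
        markers = list(pvals_by_marker)
        adjusted = bh_adjust([pvals_by_marker[mk] for mk in markers])
        counts.update(mk for mk, q in zip(markers, adjusted) if q < alpha)
    return sorted(mk for mk, n in counts.items() if n >= min_models)


# Hypothetical p-values for four markers under three models (not real data).
toy_results = {
    "MLM": {"m1": 0.001, "m2": 0.20, "m3": 0.04, "m4": 0.80},
    "FarmCPU": {"m1": 0.002, "m2": 0.01, "m3": 0.60, "m4": 0.90},
    "BLINK": {"m1": 0.50, "m2": 0.004, "m3": 0.30, "m4": 0.70},
}
print(consensus_hits(toy_results))
```

In this toy input, marker m3 has a raw p-value below 0.05 under one model but does not survive adjustment, while m1 and m2 each pass in two models and are kept. Requiring agreement across models trades some sensitivity for fewer model-specific false positives.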

Discussion

The high-contiguity, chromosome-scale lablab assembly fills a critical gap for an African orphan crop, enabling robust genome analyses. The assembly’s strong BUSCO and LAI metrics indicate near-complete gene and repeat representation, surpassing the previous short-read draft. Repeat analyses uncovered a predominance of copia over gypsy LTR-RTs, a pattern uncommon in plants and potentially a distinguishing genomic feature of lablab. Comparative genomics highlights lablab-specific orthogroups and expansions in cell wall and photosynthesis-related gene families, consistent with traits relevant to stress tolerance and adaptation. The resolved organization of the expanded trypsin inhibitor gene family, including large tandem arrays on Lp04, provides practical targets for breeding or genome editing to reduce antinutritional factors while maintaining defense. Population genomics across wild and cultivated accessions strongly supports dual domestication events in distinct two- and four-seeded gene pools with high divergence, and reveals sub-structure correlating with geography and morphology. The germplasm-wide GWAS identifies loci associated with key agronomic traits, advancing marker-assisted breeding. Beyond scientific insights, the project demonstrates an effective Africa-led, inclusive genomics model that combines portable sequencing, in-depth local bioinformatics training, and equitable international collaboration, and that can serve as a template for other indigenous crops.

Conclusion

This work delivers a high-quality, chromosome-level reference genome for Lablab purpureus, comprehensive genome annotation and repeat landscape, and comparative insights into unique and expanded gene families. It elucidates the genomic architecture of trypsin inhibitor genes relevant to nutrition and defense, confirms dual domestication in two distinct seed-number gene pools with strong divergence, and characterizes global germplasm structure and phenotypic differentiation. GWAS yields markers for key agronomic traits, enabling marker-assisted selection and further QTL dissection. The study also models an inclusive, Africa-led genomics approach that integrates local sequencing capacity, bioinformatics training, and international partnerships. Future work should evaluate reproductive compatibility between gene pools to inform taxonomy and breeding strategies, functionally validate GWAS signals and TI gene cluster manipulations for nutritional improvement, and investigate the biological implications of the unusual copia LTR-RT predominance.

Limitations

Biological significance of the unusually high copia vs gypsy LTR-RT abundance remains unresolved and requires further investigation.
Reproductive isolation between two- and four-seeded gene pools is unclear; crosses are mainly reported within four-seeded samples, necessitating dedicated crossing studies for taxonomic reassessment.
Resequencing panel size is modest and uneven across gene pools; additional sampling across the species range would refine diversity and domestication inferences.
GWAS relies on historical phenotypic data and moderate-density DArTseq (genotyping-by-sequencing) markers; validation in contemporary multi-environment trials and fine-mapping are needed to confirm causal loci.
Some likely feral accessions show introgression, which may complicate population assignments and domestication inference.
