The pan-tandem repeat map highlights multiallelic variants underlying gene expression and agronomic traits in rice
H. He, Y. Leng, et al.
Tandem repeats (TRs) have been little studied in rice genetics, yet they significantly affect gene expression and agronomic traits. This study, led by Huiying He and colleagues, identifies more than 227,000 multiallelic TRs and shows their relevance to rice molecular breeding.
Introduction
The study investigates how multiallelic tandem repeat (TR) variations contribute to gene expression and agronomic traits in rice. Alterations in gene expression underlie phenotypic diversity and are key for crop improvement. Bi-allelic variants such as SNPs, small indels, and presence/absence variants (PAVs) explain only part of heritability and are poorly suited to precise tuning of gene expression. TRs, by contrast, are abundant, unstable in length, and typically multiallelic, suggesting they could fine-tune gene expression and quantitative traits. However, a systematic, genome-wide assessment of TR variation and its functional contribution in rice has been lacking because repeats are technically difficult to genotype accurately. The authors aim to construct a population-scale pan-TR map from 231 assemblies, quantify TR effects on gene expression in leaves and panicles relative to bi-allelic variants, and link TRs associated with expression (eTRs) to phenotypes (e.g., plant height, grain width) to identify causal multiallelic regulatory variants relevant for breeding.
Literature Review
Prior work shows that TR variation affects gene expression across species (humans, pigs, Caenorhabditis elegans, sunflower, Arabidopsis) and is implicated in many human repeat expansion disorders. In rice, TRs have mainly served as markers for QTL mapping, but several examples indicate functional roles: a CCATTC repeat in the OsSPL13 5′UTR modulates grain length and yield; a tandem array upstream of IPA1 increases its expression by creating open chromatin, influencing stem width and panicle architecture; TR changes near FZP alter its expression and grain yield; and repeats in the COLD11 coding region affect chilling tolerance. Traditional TR genotyping from short reads mapped to a single reference suffers from reference bias and difficulty resolving repeat lengths. Long-read assemblies and pan-genome graphs enable unbiased discovery of TR polymorphisms, including those absent from the reference, and population-scale analyses of multiallelic variation.
Methodology
- Samples and genomes: 231 accessions (Nipponbare plus 202 O. sativa and 28 O. rufipogon) with ONT-based assemblies and a previously constructed rice pan-genome graph.
- TR identification: For each assembly, TRs were annotated de novo using Tandem Repeats Finder and ULTRA, together with RepeatMasker-based annotations; results were merged per accession. The quality of TR regions was assessed by comparing the 93-11 assembly with BAC clones.
- Pan-TR dataset construction: TR annotations were integrated with the pan-genome graph by extracting bubbles overlapping TRs. Pairwise whole-genome alignments (minimap2) and liftover (paftools.js) determined collinear positions. STRs and VNTRs were genotyped with distinct strategies: STRs by identifying the longest continuous repeat arrays, including loosely spaced motifs, within assemblies; VNTRs by BLASTn matches to motif-constructed pseudo-sequences, keeping matches with ≥90% identity and selecting the most frequent motif across accessions.
- Validation: TR loci validated via (i) BAC-to-assembly concordance, (ii) manual multiple sequence alignment inspection of 218 random TR loci, and (iii) PCR/Sanger validation of 8 loci.
- Bi-allelic variants: SNPs and indels were called by mapping short reads to Nipponbare using Sentieon/bwa and Sentieon variant calling; PAVs came from prior work based on long-read mapping (minimap2, NGMLR), Sniffles, and merging with SURVIVOR.
- Transcriptomes: RNA-seq from 193 panicles (10–20 mm) and 202 young leaves; reads mapped to Nipponbare (TopHat2), expression quantified to FPKM (Cufflinks), quality-controlled and normalized.
- eQTL mapping: For each gene (mean FPKM > 0.1), adjusted expression values were obtained by controlling for population structure (top 5 PCs) and hidden factors (top 20 PEER factors). Linear regression tested associations between adjusted expression and nearby (±100 kb) TR repeat numbers and genotypes of SNPs, indels, and PAVs. P-values were Bonferroni-corrected per gene and FDR-controlled across genes.
- Fine mapping and model comparison: susieR generated 95% credible sets for loci with both TR and bi-allelic signals. ANOVA compared models containing only the lead bi-allelic variant with joint models that also included the lead TR, to test whether the TR added explanatory power.
- Conditional analysis: For genes with both TR and bi-allelic associations, TR effects were tested within accessions carrying the common allele of the lead bi-allelic variant; only TR–gene pairs with ≥3 TR alleles in that subset were retained.
- LD and genomic context: Assessed TR–bi-allelic LD (Pearson r within 100 kb), distributions relative to TSS, and enrichment of eTRs in histone marks and accessible chromatin (RiceENCODE) via Fisher’s exact tests.
- Phenotypes and GWAS/coloc: Plant height and grain width were measured in field trials (n=6 plants per accession). GWAS was performed with TRs, SNPs, indels, and PAVs. For grain width signals, gene expression associations were evaluated in panicle, and coloc was used to compute the posterior probability that a single TR is the shared causal variant for phenotype and expression, followed by fine mapping.
- Functional validation: Generated CRISPR/Cas9 knockouts (osprr1, trgw6) and CRISPR-Cas12a promoter TR copy edits at TRGW6; performed dual-luciferase assays with OsPRR1 promoters containing different TR copy numbers; measured expression via RT-qPCR and phenotypes in edited lines.
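The two-stage multiple-testing correction described under eQTL mapping can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the summary does not name the FDR procedure, so Benjamini-Hochberg is assumed here, and the gene names, p-values, 0.05 threshold, and function names (`gene_level_pvalue`, `benjamini_hochberg`) are all hypothetical.

```python
def gene_level_pvalue(variant_pvalues):
    """Bonferroni-correct the best cis-variant p-value for one gene."""
    n_tests = len(variant_pvalues)
    return min(1.0, min(variant_pvalues) * n_tests)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the input order (assumed FDR method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical p-values for variants within the cis window of each gene.
cis_pvalues = {
    "geneA": [1e-6, 0.2, 0.5, 0.9],
    "geneB": [0.03, 0.4],
    "geneC": [0.3, 0.6, 0.8],
}
genes = list(cis_pvalues)
gene_p = [gene_level_pvalue(cis_pvalues[g]) for g in genes]
gene_q = benjamini_hochberg(gene_p)
# Genes passing the (assumed) 5% FDR threshold are called eGenes.
egenes = [g for g, q in zip(genes, gene_q) if q < 0.05]
print(egenes)
```

Correcting within a gene first accounts for the differing number of variants tested per gene; the across-gene step then controls the share of false eGene calls.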
Key Findings
- Pan-TR discovery: Identified 531,575 polymorphic TR loci, including 227,391 multiallelic TRs (multiTRs: 145,279 STRs; 82,112 VNTRs). A total of 54,416 TR loci were absent from the Nipponbare reference genome as TR arrays. Motif lengths ranged from 2 to 2,581 bp; allele counts per locus ranged from 3 to 105, with 31.7% having >5 alleles.
- Representation in reference: Among TRs present in Nipponbare, major alleles differed from the reference at 28.5% of STRs (37,316/130,735) and 21.0% of VNTRs (8,886/42,240). Absent-from-reference fractions: 10.0% of STRs (14,544/145,279) and 48.6% of VNTRs (39,872/82,112).
- Genomic impact: 13.31% of multiTRs (30,274/227,391) were in coding regions of 17,376 genes. Among coding TRs, 8,593 loci had alleles predicted to cause frameshifts (96.42% of these alleles had frequency <5%). Another 21,681 TR loci had alleles causing amino acid indels; 3,443 were within Pfam domains.
- LD and distribution: TRs cluster upstream of TSSs, similar to bi-allelic variants, but chromosome-wide distributions differ significantly for most chromosomes, indicating limited LD. Only a minority of TRs show strong LD with nearby bi-allelic variants within 100 kb: 31.3% with SNPs, 15.0% with indels, 26.4% with PAVs. Large fractions are weakly correlated (e.g., for 38.3% of TRs the correlation with SNPs is |r|<0.3 or not significant).
- eQTL landscape: eGenes (genes with at least one significant expression association) numbered 12,823 in panicle and 9,343 in leaf, with 6,232 shared. eGenes by variant type in panicle: TRs 10,810; SNPs 10,221; PAVs 8,409; indels 5,789. In leaf: TRs 7,685; SNPs 7,249; PAVs 6,224; indels 4,027. eGenes detected only with TRs: 1,392 (panicle) and 1,049 (leaf). Functional enrichment matched tissue function (photosynthesis for leaf; flower development for panicle).
- Fine mapping and independent TR effects: Among genes associated with both TRs and bi-allelic variants, 4,964 leaf genes and 7,369 panicle genes had 95% credible sets; TRs were in the sets for 1,867 leaf and 1,642 panicle genes. Model comparison showed 570 panicle and 550 leaf TR–gene models explained expression better than models with only the lead bi-allelic variant. Conditional analyses (controlling for lead bi-allelic variants) retained consistent TR effect directions for 485 panicle and 511 leaf TR–gene pairs, indicating independent TR contributions.
- Mechanism context: eTRs were enriched in histone-mark and accessible-chromatin regions relative to background TRs, suggesting epigenetic involvement.
- Case study (OsPRR1): A 5′UTR TR ((CGTCC)n; Chr2: 24,569,319–24,569,333) and a nearby SNP (C/A at 24,569,295) both associate with OsPRR1 expression in leaves. TR repeat number correlates positively with expression, and this holds after conditioning on the SNP. Higher TR copy numbers associate with shorter plant height; CRISPR knockout (osprr1) increases plant height. Dual-luciferase assays show that promoters with 4 TR copies have higher activity than those with 3; CRISPR editing that removes one TR copy decreases OsPRR1 expression and increases plant height.
- Case study (grain width, TRGW6; LOC_Os06g03850): TR-GWAS revealed a Chr6 peak detected with TR markers but not with bi-allelic variants. Colocalization identified a promoter (GT)n TR at Chr6: 1,552,149 as the shared causal variant (posterior probability=1) for both grain width and TRGW6 expression; seven TR alleles were observed. TR copy number correlates positively with grain width and TRGW6 panicle expression in both indica and japonica. CRISPR knockout (trgw6) reduces grain width. CRISPR-Cas12a promoter editing that removes one TR copy lowers TRGW6 expression and narrows grains; adding one TR copy increases expression and widens grains.
- Data quality: Average 379,176 TR loci per accession (range 346,269–416,093), covering ~22.9 Mb per genome; STR validation against BACs showed 89.8% concordance.
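The conditional analysis behind the independent TR effects reported above can be illustrated with a short sketch. It assumes homozygous inbred accessions (one allele label per accession), uses a plain least-squares slope in place of the full adjusted-expression model, and runs on invented toy data; `slope` and `conditional_tr_slope` are hypothetical helper names, not functions from the study.

```python
from collections import Counter


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def conditional_tr_slope(tr_copies, snp_alleles, expression, min_tr_alleles=3):
    """Slope of expression on TR copy number among carriers of the common SNP allele.

    Returns None if that subset has fewer than `min_tr_alleles` distinct TR alleles.
    """
    common = Counter(snp_alleles).most_common(1)[0][0]
    keep = [i for i, allele in enumerate(snp_alleles) if allele == common]
    sub_tr = [tr_copies[i] for i in keep]
    if len(set(sub_tr)) < min_tr_alleles:
        return None
    return slope(sub_tr, [expression[i] for i in keep])


# Toy data (invented): TR copy number, lead SNP allele, adjusted expression.
tr_copies = [3, 3, 4, 4, 5, 5, 3, 5]
snp_alleles = ["C", "C", "C", "C", "C", "C", "A", "A"]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2, 1.1, 3.1]

full = slope(tr_copies, expression)
cond = conditional_tr_slope(tr_copies, snp_alleles, expression)
# A TR-gene pair is kept when the effect direction is unchanged after conditioning.
consistent = cond is not None and (full > 0) == (cond > 0)
print(consistent)
```

Restricting to carriers of one SNP allele removes that SNP as an explanation, so any remaining trend with copy number points to a separate TR contribution.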
Discussion
The study demonstrates that multiallelic TR variations are widespread in rice and often poorly tagged by nearby bi-allelic variants, enabling discovery of regulatory variants not captured by SNP/indel/PAV analyses. TRs significantly contribute to gene expression variation across tissues, with numerous eGenes uniquely detected by TR markers. Fine mapping and conditional analyses show that many TRs explain expression differences independently of bi-allelic variants, supporting causality. Integration with GWAS and colocalization reveals cases where TRs affect both expression and phenotype, exemplified by OsPRR1 (influencing plant height) and TRGW6 (modulating grain width), with functional validation via genome editing and reporter assays. Enrichment of eTRs in regulatory chromatin contexts suggests epigenetic mechanisms by which repeat copy number modulates transcription. Collectively, these findings address the research question by establishing TRs as key multiallelic regulatory variants underlying quantitative trait variation, and highlight their utility as precise targets to fine-tune agronomic traits in breeding.
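What it means for a TR to be poorly tagged by a nearby bi-allelic variant can be shown with a small sketch of the Pearson correlation used for LD. The genotype vectors are invented, SNPs are coded 0/1 on the assumption of homozygous accessions, and only the |r|<0.3 part of the weak-LD criterion is applied (the significance test is omitted); `pearson_r` and `is_weakly_tagged` are hypothetical names.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def is_weakly_tagged(tr_copies, variant_genotypes, threshold=0.3):
    """True when |r| between TR copy number and the variant is below the threshold."""
    return abs(pearson_r(tr_copies, variant_genotypes)) < threshold


# Invented example: one TR with four alleles and two nearby SNPs (0/1 coding).
tr_copies = [2, 3, 4, 5, 2, 3, 4, 5]
snp_tagging = [0, 0, 1, 1, 0, 0, 1, 1]   # tracks short vs long TR alleles
snp_unlinked = [0, 1, 0, 1, 1, 0, 1, 0]  # unrelated to TR length

r_tagging = pearson_r(tr_copies, snp_tagging)
r_unlinked = pearson_r(tr_copies, snp_unlinked)
print(round(r_tagging, 3), round(r_unlinked, 3))
print(is_weakly_tagged(tr_copies, snp_unlinked))
```

Even the well-correlated SNP separates only short from long alleles, which is one way a two-state marker can miss part of the signal carried by a multiallelic TR.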
Conclusion
This work builds a population-scale pan-TR map for rice, identifying 227,391 multiallelic TR loci (including many absent from the reference genome) and demonstrating their broad, often independent contributions to gene expression and phenotypes. By combining eQTL mapping, fine-mapping, conditional analyses, GWAS-coloc, and experimental validation, the study establishes causal roles for specific TRs regulating OsPRR1 (plant height) and TRGW6 (grain width). The pan-TR resource and identified eTRs provide candidate targets for precise genome editing to finely modulate gene expression and optimize yield-related traits. Future research should expand datasets to capture rare and very long repeats, explore non-linear TR–expression relationships, dissect mechanistic links between TR insertions/deletions and regulatory outcomes, and systematically test eTRs for breeding applications.
Limitations
- Detection limits: Extra-long VNTRs were incompletely captured; maximum repeat motif length identifiable was <4 kb due to computational constraints.
- Modeling assumptions: Linear regression may miss non-linear (e.g., sigmoidal) relationships between TR copy number and expression, potentially overlooking some causal TRs.
- Sample size and rarity: Larger cohorts and more transcriptomes are needed to identify rare TR variants and improve power.
- Mechanistic resolution: The distinct regulatory mechanisms of TR-caused insertions/deletions versus other indels remain to be clarified; more functional experiments are necessary to generalize findings for breeding.