Medicine and Health

Whole-genome sequencing in 333,100 individuals reveals rare non-coding single variant and aggregate associations with height

G. Hawkes, R. N. Beaumont, et al.

Discover how a comprehensive analysis involving 333,100 participants uncovered 29 rare and low-frequency variants linked to height, with effects ranging from -7 cm to +4.7 cm. This research, conducted by a multidisciplinary team, sheds light on non-coding variants near *HMGA1* and *MIR497HG*, providing new insights into genetic influences on complex traits.

Introduction

Most genetic variation associated with complex traits lies in the non-coding genome, yet the role of rare non-coding variants in common human phenotypes is largely unknown. Prior work has predominantly focused on common variants via array-based GWAS or rare coding variants via exome sequencing, leaving the abundant rare non-coding variation underexplored. Identifying rare non-coding variants associated with complex traits could reveal key gene regulatory elements with large effects and provide direct insights into causal biology because such variants are less confounded by linkage disequilibrium. Whole-genome sequencing (WGS) enables discovery of rare non-coding contributors and has proven value in diagnosing monogenic conditions, but few large-scale sequencing studies have systematically assessed rare non-coding variation for complex phenotypes. Here, height is used as a model polygenic trait to test whether WGS-based single-variant and aggregate rare variant analyses focused on regulatory annotations can uncover novel non-coding associations that are not detected by standard GWAS or exome approaches.

Literature Review

Array-based GWAS have mapped thousands of common variants for complex traits, and for height a large proportion of common variant heritability has been explained (e.g., Yengo et al., 2022). Rare variant discovery has largely centered on coding regions using exome sequencing (e.g., GIGYF1 loss-of-function associated with diabetes). Although WGS has identified rare non-coding causes of monogenic disease (e.g., intronic regulatory variants in HK1 causing congenital hyperinsulinism), sequencing-based studies of rare non-coding variation for complex traits remain scarce. Recent WGS rare-variant analyses in TOPMed reported suggestive non-coding signals for lipids (N ≈ 66,000) and blood pressure (N ≈ 51,000), including signals in regulatory regions (e.g., DHS near PCSK9, aggregate signals at KIF3B). Estimates suggest 6–15% of the non-coding genome is functionally important, underscoring the potential yield of rare non-coding association studies at scale.

Methodology

Study design and cohorts: Discovery analyses used WGS data from 200,003 UK Biobank (UKB) participants, with primary analyses in 183,078 genetically inferred European-ancestry individuals and ancestry-specific analyses in South Asian (N=4,439) and African (N=3,077) subsets. Replication was performed in TOPMed (N=87,652, multi-ancestry) and All of Us (AoU; N=45,445 EUR, 20,548 AFR, 13,683 AMR). The phenotype was rank inverse-normalized standing height. WGS coverage averaged 32.5× (GRCh38; SNVs/indels called with GraphTyper; structural variants called using the same pipeline). Variant quality control excluded variants with GraphTyper AAScore < 0.5.
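As an illustration of the phenotype transformation, a rank-based inverse-normal transform can be sketched as follows. The Blom offset (c = 3/8), the average-rank handling of ties, and the example heights are assumptions made for this sketch; the exact variant of the transform used in the study is not stated in this summary.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Map values to standard-normal quantiles by rank (ties share the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # Offset ranks so the quantiles stay strictly inside (0, 1).
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


heights_cm = [158.0, 171.5, 165.2, 180.3, 165.2, 190.1]  # made-up example values
z = rank_inverse_normal(heights_cm)
```

The transformed values keep the ordering of the raw heights but follow a standard normal distribution by construction, which is why effect sizes are reported in SD units as well as centimetres.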

Common variant conditioning: To prioritize novel rare non-coding signals, analyses were conditioned on 12,661 previously reported height variants, including loci from the latest GIANT (5.4 million individuals), exome-array height associations, and significant exome-wide height results. Genome chunks (PLINK format) were grouped to ensure variants near boundaries were conditioned on known loci up to ±5 Mb.

Single-variant testing: Variants with MAC ≥ 20 (approx. MAF ≥ 5.46×10^-5 in the discovery set) were tested using REGENIE v3.1. Genome-wide significance thresholds were derived from 20 simulated null phenotypes to account for correlated tests (two-sided chi-squared statistics). Significant variants were LD-clumped (PLINK; r^2 < 0.001; 250 kb) and underwent sequential conditional analysis to identify likely independent signals. Regional visualization used LocusZoom with LD from UKB WGS.
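The minor allele count floor can be translated into a frequency with MAF = MAC / (2N) for diploid autosomes. A minimal sketch, assuming the 183,078-person primary analysis set is the relevant denominator (the function and variable names are illustrative):

```python
def mac_to_maf(mac, n_individuals):
    """Minor allele frequency implied by a minor allele count in diploid samples."""
    return mac / (2 * n_individuals)


N_DISCOVERY = 183_078  # European-ancestry discovery sample size quoted in the text
maf_floor = mac_to_maf(20, N_DISCOVERY)
print(f"{maf_floor:.2e}")  # prints 5.46e-05
```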

Genomic annotation and aggregate testing: Variants were annotated by Ensembl VEP and classified into gene-centric coding/splicing, proximal regulatory (within ±5 kb of 5'/3' UTR and not coding), and non-gene-centric regulatory (intergenic and intronic across any transcript). Additionally, non-coding sliding windows (2 kb; exons excluded) were tested. Rare variant aggregate testing was limited to variants with within-sample MAF < 0.1%, using masks informed by predicted functional impact: conservation (GERP > 2; phastCons top percentile), non-coding constraint (JARVIS), splicing (SpliceAI), and deleteriousness (CADD; applied to coding masks). Variants were uniquely assigned within genome units to avoid duplication across masks. Aggregate tests included BURDEN, SKAT, ACAT, and ACAT-O, plus singleton tests (MAC=1). An omnibus “all-mask” statistic was also computed per unit. Proximal and coding masks were tested separately per transcript to avoid mixing coding and regulatory signals.
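To illustrate how p-values from several masks or tests can be merged into a single omnibus statistic, here is a minimal sketch of the Cauchy combination that underlies ACAT-style tests. Equal weights and the example p-values are assumptions, and this is not the REGENIE implementation used in the study.

```python
import math


def cauchy_combine(pvalues, weights=None):
    """Combine p-values in (0, 1) with the Cauchy combination (ACAT) method."""
    if weights is None:
        weights = [1.0] * len(pvalues)  # assumption: equal weights
    total_w = sum(weights)
    # Transform each p-value to a standard Cauchy variate and take the weighted mean.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)) / total_w
    # The upper tail of the standard Cauchy distribution gives the combined p-value.
    return 0.5 - math.atan(t) / math.pi


combined = cauchy_combine([0.01, 0.5])  # hypothetical per-mask p-values
```

The combined value is dominated by the smallest inputs, which suits settings where only a few masks carry signal. For extremely small p-values, production code usually replaces the tangent with an approximation to avoid loss of floating-point precision.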

Multiple-testing correction and thresholds: Study-wide significance thresholds were estimated from 20 simulated null phenotypes by taking the minimum p-value across all tests (single variant and genomic units) to reflect a 95% level under the null. Resulting thresholds were P < 6.3×10^-10 for single variants and P < 6.58×10^-10 for genomic aggregates.
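One reading of this procedure is that each null phenotype is run through the full scan, its smallest p-value is recorded, and the smallest of those 20 minima becomes the threshold, a value that a new null scan would beat roughly 1 time in 20. The sketch below illustrates that logic on toy data; the independent uniform p-values, test count, and seed are stand-ins, not the study's simulations.

```python
import random


def empirical_threshold(min_p_per_simulation):
    """Smallest scan-wide minimum p-value observed across the null simulations."""
    return min(min_p_per_simulation)


# Toy stand-in: each null "scan" is 10,000 independent uniform p-values.
rng = random.Random(0)
n_sims, n_tests = 20, 10_000
minima = [min(rng.random() for _ in range(n_tests)) for _ in range(n_sims)]

threshold = empirical_threshold(minima)
bonferroni = 0.05 / n_tests  # shown only for comparison with the empirical value
```

With real genotypes the tests are correlated, so an empirical threshold of this kind reflects the effective number of tests rather than the raw count that a Bonferroni correction would use.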

Signal classification and conditional analyses: Aggregate signals were re-tested after conditioning on significant single-variant signals to distinguish multi-variant contributions from single-variant–driven associations.

Replication: Cross-ancestry replication used TOPMed (STAARpipeline; self-reported/HARE-assigned ancestry groups) and AoU (principal components–based continental ancestry classification; REGENIE association). Where available, meta-analyses across replication datasets were used and heterogeneity was assessed (R package metafor; fixed-effects). Fine-mapping of common variants within rare-variant loci used SuSiEx leveraging multi-ancestry LD.
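The fixed-effects meta-analysis step can be sketched as an inverse-variance weighted average with Cochran's Q as the heterogeneity statistic. This is a plain-Python illustration of the general technique, not the metafor code used in the study, and the cohort estimates below are hypothetical.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis.

    Returns the pooled effect, its standard error, a two-sided normal
    p-value, and Cochran's Q heterogeneity statistic.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, p, q


# Hypothetical per-cohort effect estimates (cm) and standard errors.
beta, se, p, q = fixed_effects_meta([2.0, 3.0], [1.0, 1.0])
```

Under the null of no heterogeneity, Q follows a chi-squared distribution with (number of cohorts - 1) degrees of freedom.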

Covariates and model framework: All association tests adjusted for age, sex, age^2, recruitment center (geography proxy), and the first 40 genetic PCs. REGENIE step 1 constructed null models using 487,558 LD-pruned, frequency-filtered UKB array variants to control for relatedness and population structure.

Key Findings

Novel rare and low-frequency single-variant associations: After conditioning on known height loci, 28 rare (MAF < 0.1%, MAC ≥ 20) and low-frequency (0.1% < MAF < 1%) SNVs/indels were independently associated with height in UKB; adding one structural variant yields 29 independent signals. Effect sizes ranged from -7.25 cm to +4.71 cm (-0.79 to 0.52 SD), with rarer variants showing larger effects.
Structural variant near SHOX: A 47,543 bp deletion in the pseudo-autosomal region of chromosome X (X:819,814–867,357), 173 kb downstream of SHOX and present in 0.3% (824 carriers; one homozygote), was associated with a height difference of -2.79 cm (95% CI -3.33, -2.25; P=5.01×10^-24). This deletion has previously been reported in Leri-Weill dyschondrosteosis clinical cohorts.
Replication of single variants: Of 28 SNVs/indels, 22 had consistent effect direction in replication (binomial P=1.51×10^-3). Ten showed nominal replication (P<0.05; ~1.4 expected by chance), and three reached Bonferroni significance in meta-analysis of TOPMed and AoU: HMGA1 promoter (6:34237902:G:A; β=+4.71 cm [3.41, 6.01]; discovery P=1.29×10^-12; replication P=6.82×10^-7), GHRH promoter (20:37261871:G:A; β=+1.82 cm [1.43, 2.23]; discovery P=2.52×10^-19; replication P=3.13×10^-5), and proximal to CUL3 (2:224492608:T:C; β=+2.72 cm [2.24, 3.19]; discovery P=4.29×10^-11; replication P=4.20×10^-4). Chromosome X replication was unavailable. No novel genome-wide signals replicated in UKB South Asian or African analyses.
Aggregate non-coding associations: Across 57,608,498 aggregate tests (5.94M coding; 13.01M proximal; 4.86M intergenic/deep intronic; 33.80M sliding windows), seven non-coding regions of interest were identified (P<6.31×10^-10), with four remaining significant after conditioning on known height loci. Replicating proximal non-coding aggregates were observed near HMGA1 and C17orf49 (MIR497HG) in TOPMed/AoU (and UKB non-EUR where available). After additional adjustment for significant single variants, two aggregates remained study-wide significant: C17orf49 downstream (GERP > 2; β=+1.34 cm [0.931, 1.66]; P=2.00×10^-11) and PRR5-ARHGAP8 upstream (JARVIS > 0.99; P=4.27×10^-10).
HMGA1 allelic series in promoter: The upstream non-coding HMGA1 aggregate (2,006 rare variants; 603 with MAC ≥ 5) contained multiple independent rare variants with large effects, including 6:34237902:G:A (altering the first base of the MANE Select transcript TSS; β≈-4.83 cm; P=2.00×10^-11; MAF≈0.04%) and 6:34236873:C:G (β≈-3.97 cm; P=1.00×10^-10; MAF≈0.047%). A common GWAS variant within the same enhancer (6:34237688:G:GGAGCCC; MAF=10.9%) fine-mapped with posterior probability >0.99 (95% credible set size = 1), indicating an allelic series spanning rare to common variants in a regulatory region.
MIR497HG (C17orf49) miRNA involvement: A downstream conserved non-coding aggregate (235 rare variants, 59 with MAC ≥ 5) near C17orf49 overlapped MIR497HG, with cumulative effect β=+1.36 cm (95% CI 1.11, 1.48; P=1.26×10^-11). Removing miRNA-overlapping variants attenuated the signal (β=+1.11 cm; P=3.98×10^-5). A dedicated miRNA aggregate in MIR195 showed a larger effect (β=+3.05 cm [1.44, 4.65]; P=1.97×10^-4), suggesting contributions from promoter and miRNA-sequence variants.
GH1 promoter: Nine rare, highly conserved upstream GH1 variants (5 with MAC ≥ 5) formed an aggregate associated with a 0.34 SD (~3.11 cm) reduction in height. One variant (17:63918961:A:G; MAF≈0.04%) independently replicated (β=-4.24 cm [95% CI -5.53, -2.94]; P=1.46×10^-10) and corresponds to a previously reported variant of unknown significance in idiopathic short stature cohorts, at a distal POU1F1 binding site.
Power and calibration: In replication analyses, power was >80% to detect ~9 signals at P<0.05 and ~4 at P<1.85×10^-3. Rare single-variant and aggregate thresholds were calibrated via phenotype simulations to account for correlated testing.
No intergenic-regulatory or sliding-window aggregates reached study-wide significance beyond the proximal/regulatory loci noted.
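The paired centimetre and SD effect sizes quoted in these findings imply a height standard deviation of roughly 9 cm. A quick arithmetic check, using only figures from this summary (the conversion is approximate, since the analysed phenotype was rank inverse-normalized):

```python
# (effect in cm, effect in SD) pairs quoted in the findings above.
pairs_cm_sd = [
    (-7.25, -0.79),  # largest negative single-variant effect
    (4.71, 0.52),    # largest positive single-variant effect
    (-3.11, -0.34),  # GH1 promoter aggregate
]

implied_sd_cm = [cm / sd for cm, sd in pairs_cm_sd]
mean_sd_cm = sum(implied_sd_cm) / len(implied_sd_cm)

for (cm, sd), s in zip(pairs_cm_sd, implied_sd_cm):
    print(f"{cm:+.2f} cm / {sd:+.2f} SD -> {s:.2f} cm per SD")
```

All three pairs agree to within about 0.1 cm per SD, so the centimetre and SD figures are mutually consistent.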

Discussion

This study demonstrates that rare non-coding variants uncovered by WGS contribute substantially to variation in human height, with some variants exerting effects of several centimeters, magnitudes rarely seen for common variants. By conditioning on known common and coding signals, the analyses targeted previously hidden non-coding contributions, reducing LD-driven artifacts. Findings at HMGA1 illustrate an allelic series in a regulatory region, where multiple rare promoter/enhancer variants and a fine-mapped common variant converge on the same element, providing mechanistic insight into gene regulation affecting growth. The MIR497HG/C17orf49 locus implicates miRNA biology, through promoter and mature miRNA sequence variation (MIR195 and MIR497), as a contributor to height variation, consistent with prior functional literature linking these miRNAs to skeletal muscle quiescence, osteoblast proliferation, and chondrogenesis. The GH1 promoter findings highlight that conserved regulatory variation in key endocrine pathways can have large phenotypic effects and connect population associations to variants reported in clinical short stature cohorts. Collectively, the results validate rare-variant aggregate approaches in non-coding regions and underscore their ability to pinpoint causal regulatory elements and target genes. The study emphasizes the necessity of adjusting for common variants in rare-variant discovery to avoid spurious associations. It also notes that, at very large sample sizes, many variants within a significant aggregate may individually reach significance, which does not negate the aggregate nature of the signal. These insights have broad relevance for dissecting the genetic architectures of other complex traits.

Conclusion

Large-scale WGS-based association testing focused on rare non-coding variation identified multiple novel single-variant and aggregate associations with height that elude standard GWAS and exome analyses. Key contributions include: (1) discovery and replication of rare promoter and enhancer variants with large effects (e.g., HMGA1, GHRH, CUL3), (2) evidence for miRNA-related mechanisms at MIR497HG/C17orf49, and (3) validation of conserved regulatory variants in endocrine pathways (GH1). The analytical framework—comprehensive conditioning on known signals, rare-variant aggregate testing across regulatory annotations, and cross-cohort replication—provides a template for future WGS studies of complex traits. Future directions include expanding sample sizes (especially in underrepresented ancestries), integrating richer tissue-specific functional annotations for the non-coding genome, and enabling individual-level meta-analysis across cohorts to enhance power and fine-mapping resolution.

Limitations

Sample size constraints for very rare variants: Aggregate tests used MAF < 0.1%, which caps each variant at roughly 366 minor alleles among the 183,078 discovery participants and reduces power to detect the rarest effects.
Limited representation and power in non-European ancestries: Smaller sample sizes for South Asian and African subsets constrained rare variant discovery and replication compared to European analyses.
Replication limitations: Replication relied on external cohorts (TOPMed, All of Us) with differing pipelines and incomplete variant coverage (e.g., chromosome X unavailable), precluding unified individual-level meta-analysis.
Functional annotation gaps: Limited availability of high-quality, tissue-specific regulatory annotations for the non-coding genome hampers precise interpretation of non-coding associations.
Residual uncertainty in aggregate signal architecture: At large sample sizes, multiple variants within a locus may become individually significant, complicating attribution of aggregate versus single-variant drivers, though this does not negate aggregate evidence.
