Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs


Q. S. Wang, D. R. Kelley, et al.

Using the Expression Modifier Score (EMS) predictor, Qingbo S. Wang and colleagues improve the prioritization of eQTLs, identifying over 20,000 additional putative causal variants and 310 candidate genes through colocalization with UK Biobank phenotypes.

Introduction

Genome-wide association studies identify many loci for complex traits, but most associated variants are noncoding and occur within linkage disequilibrium blocks, complicating causal interpretation. Understanding the function of noncoding variants at single-variant resolution is required to link associations to mechanisms. Existing functional genomics consortia have mapped regulatory elements enriched for trait heritability, and reporter assays and CRISPR perturbations can test variant effects but either lack native chromatin context or are low-throughput. eQTL studies associate loci with gene expression but, like GWAS, do not pinpoint causal variants. Statistical fine-mapping methods (e.g., FINEMAP, SuSiE) can infer likely causal variants within loci, yielding sets of putative causal cis-eQTLs that inform gene regulation and GWAS interpretation, though power and LD limit resolution. Sequence-based deep learning predictors estimate regulatory effects from sequence context but are typically trained to predict molecular readouts rather than directly on gold-standard expression-modifying variants. The authors propose combining statistical fine-mapping-derived putative causal eQTLs with thousands of functional features to train a supervised predictor of whether a variant modifies nearby gene expression and to use this predictor to enhance fine-mapping and colocalization, thereby expanding the catalog of confidently identified eQTLs.

Literature Review

The study builds on: (1) GWAS findings of predominantly noncoding associations and the challenge of LD (Maurano et al., Trynka et al.); (2) functional genomics maps from ENCODE and Roadmap highlighting regulatory elements enriched for heritability; (3) reporter assays (MPRAs) and genome editing approaches for variant function assessment with limitations in throughput and chromatin context; (4) eQTL catalogs (GTEx) associating loci with expression without single-variant resolution; (5) statistical fine-mapping frameworks including stochastic search and Bayesian stepwise/mixture models such as FINEMAP, DAP-G, and SuSiE, applied to eQTLs to identify high-PIP variants; and (6) deep learning variant effect predictors (DeepSEA, Basenji) that infer regulatory activity from sequence and score variants via predicted allele differences. Prior functionally informed fine-mapping approaches for complex traits adjust priors using annotations (e.g., Kichaev et al., PolyFun). The authors leverage these foundations to train a supervised classifier directly on fine-mapped cis-eQTLs and integrate its output as a prior for eQTL fine-mapping and trait colocalization.

Methodology

Training data and labels: The authors used GTEx v8 cis-eQTL fine-mapping results across 49 tissues (±1 Mb windows) and defined positive variant–gene pairs as those with posterior inclusion probability (PIP) > 0.9 by both FINEMAP and SuSiE; negatives had PIP < 1e-4 by both. This yielded 14,807 positive pairs.

Feature annotation: Each variant–gene pair was annotated with 6,121 features: distance to the gene’s canonical TSS; 12 non–cell-type-specific binary annotations from the LDSC baseline model; 795 tissue-specific binary histone mark peaks from Roadmap; and 5,313 deep learning-derived regulatory features from the Basenji model (cell-type specific activity predictors).

Feature selection and model: From these, 152 predictive features were selected based on F1 and mean decrease in impurity metrics. A random forest classifier was trained per tissue to predict whether a variant–gene pair is a putative causal eQTL. Hyperparameters were tuned using random and grid search to maximize AUROC on held-out data.

Score calibration: The RF output was scaled bin-wise to match the empirical fraction of positives, yielding the Expression Modifier Score (EMS), interpretable as an estimated probability that a variant–gene pair is a putative causal eQTL in that tissue given its features.

Feature importance: Distance to TSS (a single feature) and the aggregate Basenji features dominated predictive power, while binary epigenetic annotations contributed minimally after accounting for these.

Performance evaluation: For whole blood and other tissues, models were trained in a leave-one-chromosome-out fashion to avoid overfitting. EMS was compared against other scores (distance to TSS, DeepSEA, CADD, ncER, GERP, FATHMM) using enrichments for held-out fine-mapped eQTLs, MPRA saturation mutagenesis hits, reporter assay QTLs (raQTLs), and putative complex trait causal noncoding variants (UK Biobank hematopoietic traits), as well as AUROC/AUPRC.

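
The labeling rule and the bin-wise calibration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the PIP thresholds come from the text, while the function names, the single bin edge at 0.5, and the toy PIPs and raw scores are all hypothetical.

```python
from bisect import bisect_right

POS_PIP = 0.9   # positive if PIP exceeds this under both methods (from the text)
NEG_PIP = 1e-4  # negative if PIP is below this under both methods (from the text)


def label_pair(pip_finemap, pip_susie):
    """Return 1 (positive), 0 (negative), or None (excluded from training)."""
    if pip_finemap > POS_PIP and pip_susie > POS_PIP:
        return 1
    if pip_finemap < NEG_PIP and pip_susie < NEG_PIP:
        return 0
    return None


def fit_binwise_calibration(raw_scores, labels, edges):
    """Map each raw-score bin to the empirical fraction of positives in it.

    `edges` are interior bin boundaries in ascending order (a hypothetical
    choice; the text does not state the binning). Empty bins map to None.
    """
    pos = [0] * (len(edges) + 1)
    tot = [0] * (len(edges) + 1)
    for score, y in zip(raw_scores, labels):
        b = bisect_right(edges, score)
        tot[b] += 1
        pos[b] += y
    return [p / t if t else None for p, t in zip(pos, tot)]


def calibrated_score(raw_score, edges, bin_rates):
    """Look up the calibrated value for a new raw classifier score."""
    return bin_rates[bisect_right(edges, raw_score)]


# Toy variant-gene pairs: (PIP by FINEMAP, PIP by SuSiE, raw classifier score).
pairs = [
    (0.95, 0.97, 0.80),  # positive
    (0.95, 0.50, 0.70),  # excluded: only one method exceeds 0.9
    (1e-5, 2e-5, 0.05),  # negative
    (1e-5, 5e-5, 0.60),  # negative despite a high raw score
    (0.99, 0.92, 0.55),  # positive
    (1e-6, 1e-6, 0.10),  # negative
]
labels = [label_pair(f, s) for f, s, _ in pairs]
kept = [(raw, y) for (_, _, raw), y in zip(pairs, labels) if y is not None]
edges = [0.5]
bin_rates = fit_binwise_calibration(
    [raw for raw, _ in kept], [y for _, y in kept], edges
)
toy_ems = calibrated_score(0.7, edges, bin_rates)
```

Pairs with intermediate PIPs drop out of training entirely, which is the binary simplification noted under Limitations.
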
Functionally informed fine-mapping: The authors developed an approximate reweighting algorithm within the SuSiE framework. Starting from the SuSiE posterior vector for each single-effect component under a uniform prior, they reweighted the posterior mass of variants in ‘pure’ credible sets (those in which no pair of variants has LD r < 0.5) by the corresponding EMS values and renormalized, leaving impure sets unchanged. The updated vectors were used to compute functionally informed PIPs (PIP_EMS) and credible sets. A sensitivity analysis capped the prior max/min ratio at 100, as in prior work.

Colocalization and gene prioritization: For 95 UK Biobank traits, they computed PIP_EMS across 49 tissues and performed colocalization via CLPP (the maximum over variants and tissues of trait PIP × eQTL PIP). The evaluation used regions with an unresolved noncoding credible set and at least one evaluation gene for the same trait, where an evaluation gene is one carrying a protein-coding variant with PIP > 0.5. In such regions, a gene was prioritized if its CLPP exceeded 0.1 and it was the top CLPP gene in the region. Precision was the fraction of prioritized genes matching the evaluation gene; recall was the number of correct prioritizations divided by the number of evaluation genes.

Tissue-specificity: They defined tissue-specific putative causal eQTLs (PIP_EMS > 0.9 in one tissue and < 0.1 elsewhere) and assessed enrichment of tissue-relevant TF-related Basenji features.

Computational resources, data sources (CADD, GERP, DeepSEA, FATHMM, ncER, VEP), and software versions are detailed in the original paper.
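
The reweighting step can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the authors' implementation: it assumes a 95% credible set per component, reweights the whole posterior vector of any component whose credible set is pure, and combines components into PIPs as one minus the product of (1 − posterior) across components. All function names, EMS values, and LD values are hypothetical.

```python
from math import prod

MIN_ABS_R = 0.5  # purity threshold on pairwise LD quoted in the text


def credible_set(alpha, coverage=0.95):
    """Smallest set of variant indices whose posterior mass reaches `coverage`.

    The 95% coverage is an assumption; the text does not state it.
    """
    order = sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += alpha[i]
        if mass >= coverage:
            break
    return chosen


def is_pure(cs, ld):
    """True if no pair of variants in the credible set has |r| < MIN_ABS_R."""
    return all(abs(ld[i][j]) >= MIN_ABS_R for i in cs for j in cs if i < j)


def reweight(alpha, ems, ld):
    """Multiply one single-effect posterior vector by EMS and renormalize.

    Components with an impure credible set are returned unchanged.
    """
    if not is_pure(credible_set(alpha), ld):
        return list(alpha)
    weighted = [a * e for a, e in zip(alpha, ems)]
    total = sum(weighted)
    return [w / total for w in weighted]


def pips(alphas):
    """Per-variant PIP: 1 - product over components of (1 - posterior)."""
    n = len(alphas[0])
    return [1 - prod(1 - a[j] for a in alphas) for j in range(n)]


# Toy locus with 6 variants: variants 0-3 are in perfect LD (component 1),
# variants 4-5 are weakly correlated (component 2, impure credible set).
n = 6
ld = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for i in range(4):
    for j in range(4):
        ld[i][j] = 1.0
ld[4][5] = ld[5][4] = 0.2

alphas_unif = [
    [0.25, 0.25, 0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]
ems = [0.03, 0.001, 0.001, 0.001, 0.02, 0.001]  # made-up EMS values

alphas_ems = [reweight(a, ems, ld) for a in alphas_unif]
pip_unif = pips(alphas_unif)
pip_ems = pips(alphas_ems)
```

In this toy locus the first component moves from four tied variants at 0.25 to one variant near 0.91, while the second component is untouched because its credible set fails the purity check. The sensitivity analysis that caps the prior max/min ratio at 100 is not modeled here.
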

Key Findings

• Positives: 14,807 high-confidence cis-eQTL variant–gene pairs (PIP > 0.9 by both FINEMAP and SuSiE) across 49 GTEx tissues; 6,121 annotations used per pair.
• Functional enrichment: strong enrichment near the TSS (>500× for distance <100 bp) and in 5′UTR and H3K4me3 peaks (>10×); depletion in H3K9me3.
• EMS performance (whole blood): top-bin enrichment for held-out putative causal eQTLs was 18.3× for EMS vs 15.1× for distance to TSS (second best), Fisher’s exact p = 3.33 × 10^−4; AUPRC 0.884 vs 0.856 (TSS). EMS ranked among the top methods for prioritizing MPRA hits and raQTLs when distance to TSS was controlled. For UKBB hematopoietic traits, enrichment for putative causal noncoding variants was 17.6× for EMS vs 17.1× for DeepSEA. Results replicated in BioBank Japan and Geuvadis LCL eQTL.
• Feature importance: In whole blood, Basenji features collectively accounted for 55.0% of importance, TSS distance for 43.1%, and binary annotations for < 2% in total. Across tissues, TSS distance was the top single feature, but Basenji features collectively matched or exceeded its contribution.
• Functionally informed fine-mapping: Using EMS as a prior increased resolution. In whole blood, 95.4% of variants with PIP_unif > 0.9 also had PIP_EMS > 0.9 (2,152/2,255), whereas only 65.7% of variants with PIP_EMS > 0.9 had PIP_unif > 0.9 (2,152/3,277); the remaining 1,125 were identified only with the EMS prior. Credible set sizes mostly decreased. With a capped prior (max/min = 100), fewer additional variants were identified (269 vs 1,125 without the cap). PIP_EMS showed higher raQTL enrichment in the PIP > 0.9 bin: 16.8× vs 12.9× (PIP_unif) and 11.4× (PIP_DAP-G), p = 1.65 × 10^−2 versus PIP_DAP-G; complex trait causal variant enrichment was comparable across methods.
• Across 49 tissues, PIP_EMS identified 20,913 additional putative causal eQTLs (PIP_EMS > 0.9, not identified with the uniform prior).
• Tissue specificity: Tissue-specific putative causal eQTLs showed significantly higher Basenji TF-related scores in relevant tissues versus controls in 39/42 features (p < 0.05/42), and EMS-based fine-mapping increased these enrichments versus the uniform prior (aggregate p < 10^−100).
• Gene prioritization (95 UKBB traits): PIP_EMS achieved higher precision (0.556) and recall (0.052) than PIP_unif (precision 0.525, recall 0.039); PIP_DAP-G (distance-to-TSS prior) had precision 0.500 and recall 0.078, illustrating precision–recall trade-offs by prior choice. Overall, EMS-integrated colocalization yielded 310 additional candidate genes not identified with the uniform prior.
• Case example: At CITED4, four variants in perfect LD had PIP_unif = 0.25 each; EMS distinguished rs35893233 (SPI1 motif creation; myeloid activity) with >25× higher EMS, producing PIP_EMS = 0.956 and enabling colocalization to prioritize CITED4 for neutrophil count (CLPP = 0.173).
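
The CLPP-based prioritization rule and the precision and recall figures reported above can be illustrated with a short Python sketch. It is a simplified reading of the procedure described in the Methodology, not the authors' pipeline: it assumes exactly one evaluation gene per region, and every region, gene, variant, tissue, and PIP below is made up for illustration.

```python
CLPP_MIN = 0.1  # prioritization threshold quoted in the text


def clpp(trait_pip, eqtl_pip_by_tissue):
    """Max over variants and tissues of trait PIP x eQTL PIP for one gene.

    trait_pip: {variant: PIP}; eqtl_pip_by_tissue: {tissue: {variant: PIP}}.
    """
    return max(
        (
            trait_pip.get(variant, 0.0) * pip
            for tissue_pips in eqtl_pip_by_tissue.values()
            for variant, pip in tissue_pips.items()
        ),
        default=0.0,
    )


def prioritize(trait_pip, eqtl_by_gene):
    """Return the top-CLPP gene in a region if its CLPP > CLPP_MIN, else None."""
    scores = {gene: clpp(trait_pip, e) for gene, e in eqtl_by_gene.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > CLPP_MIN else None


def precision_recall(calls, truth):
    """calls: {region: gene or None}; truth: {region: evaluation gene}."""
    made = [region for region, gene in calls.items() if gene is not None]
    correct = sum(calls[region] == truth[region] for region in made)
    precision = correct / len(made) if made else float("nan")
    recall = correct / len(truth)
    return precision, recall


# Hypothetical regions: trait PIPs plus per-gene, per-tissue eQTL PIPs.
regions = {
    "A": ({"v1": 0.4, "v2": 0.3},
          {"GENE_A": {"blood": {"v1": 0.9}}, "GENE_B": {"liver": {"v2": 0.2}}}),
    "B": ({"v3": 0.2}, {"GENE_C": {"blood": {"v3": 0.3}}}),
    "C": ({"v4": 0.5}, {"GENE_D": {"blood": {"v4": 0.5}}}),
}
truth = {"A": "GENE_A", "B": "GENE_C", "C": "GENE_E"}

calls = {r: prioritize(trait, genes) for r, (trait, genes) in regions.items()}
precision, recall = precision_recall(calls, truth)
```

Region B produces no call because its best CLPP falls below 0.1, which lowers recall without affecting precision. The same mechanism is consistent with the low recall values in the findings: most evaluation genes receive no call at all.
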

Discussion

Training a supervised classifier directly on fine-mapped cis-eQTLs and thousands of functional features yields a calibrated probability (EMS) that a variant modifies gene expression in a given tissue. EMS leverages strong, cell-type-aware sequence-based regulatory predictions (Basenji) and simple genomic context (distance to TSS), with traditional binary epigenetic annotations adding little incremental information once these are included. EMS improves prioritization of putative causal eQTLs and provides competitive performance on reporter assay and complex trait benchmarks despite their imperfect reflection of native chromatin regulatory effects. Using EMS as a prior for eQTL fine-mapping increases resolution (smaller credible sets) and identifies many additional high-confidence cis-eQTLs, which translates into improved precision and recall in gene prioritization via colocalization with complex traits. The tissue-specific enrichments of TF-related Basenji features among EMS-identified eQTLs support the biological plausibility of the additional signals. Differences among priors (EMS vs distance-to-TSS) reveal trade-offs between precision and recall, motivating further work on prior choice. Overall, EMS represents a practical, scalable way to integrate functional predictions with statistical fine-mapping to refine cis-regulatory variant discovery and trait gene mapping.

Conclusion

The study introduces the Expression Modifier Score (EMS), a supervised, calibrated probability that a variant–gene pair is a putative causal cis-eQTL in a tissue, trained on fine-mapped GTEx v8 eQTLs using >6,000 features. EMS outperforms or matches leading scores for prioritizing functional regulatory variants. When used as a prior in SuSiE fine-mapping, it identifies 20,913 additional putative causal eQTLs across 49 tissues, reduces credible set sizes, and improves gene prioritization for complex traits, yielding 310 additional candidate genes in UK Biobank. EMS highlights the dominant predictive contributions of sequence-based regulatory activity and proximity to the TSS. Future directions include extending the framework to splicing and other molecular QTLs, incorporating richer annotations (3D genome architecture, gene constraint, pathway features), and modeling effect size direction and magnitude. The EMS resource for GTEx v8 tissues is publicly available, providing a foundation for deeper interpretation of noncoding variation.

Limitations

Key limitations include: (1) ascertainment biases in training labels: power to fine-map is affected by recombination rate and allele frequency, and GTEx donors are primarily adults of European ancestry; (2) feature set incompleteness: the lack of comprehensive 3D chromatin, gene constraint, and pathway annotations may limit predictive accuracy; (3) modeling simplification: PIP was thresholded to form a binary classification, discarding intermediate-PIP variants and avoiding a full hierarchical regression; (4) evaluation constraints: held-out chromosomes from the same dataset are not fully independent gold standards, and reporter assays and complex trait fine-mapping do not fully recapitulate native cis-regulatory effects; and (5) recognized differences between cis-expression architectures and complex trait architectures, which may limit transferability.
