PEACOCK: a machine learning approach to assess the validity of cell type-specific enhancer-gene regulatory relationships

C. Mills, C. N. Marconett, et al.

This research by Caitlin Mills, Crystal N. Marconett, Juan Pablo Lewinger, and Huaiyu Mi presents PEACOCK, a machine learning tool that scores cell type-specific enhancer-gene regulatory relationships. The model is trained on experimentally validated enhancer-gene links and is intended to support the study of gene regulation in disease.

Introduction
Enhancers are short (~50–2000 bp) DNA regulatory elements that activate the expression of target genes in a cell type- and time-specific manner. They function independently of orientation, contain transcription factor binding sites that recruit chromatin remodeling complexes and transcription machinery, and loop into close proximity with target promoters to upregulate transcription. Genes can be targeted by multiple enhancers, and a single enhancer can regulate more than one gene, sometimes over distances exceeding 1 Mb. Enhancers play vital roles in differentiation and disease, including cancers, where super-enhancers often drive oncogene upregulation. Thousands of GWAS variants map to enhancers, implicating them in disease, yet the vast majority of enhancer target genes remain unknown, limiting mechanistic understanding.
Common experimental approaches to identify enhancer–gene regulatory associations include proximity ligation assays (3C/4C/5C/Hi-C/pcHi-C/ChIA-PET/HiChIP) and advanced microscopy to detect enhancer–promoter interactions, as well as genome/epigenome editing (e.g., CRISPR) to perturb elements and test effects. These methods are powerful but not scalable genome-wide across many cell types. Given the strong cell type specificity of enhancer–gene links, scalable, cell type-specific genome-wide characterization is a critical need. Many computational resources predict enhancer–gene links using public data, but they often lack robust validation and cell type-matched evidence, leaving the accuracy of scores uncertain. While some known links are captured, the fraction of spurious links is unknown. It is also unclear how to optimally weight diverse predictive characteristics (enhancer marks, promoter marks, eQTLs, proximity) simultaneously.
To address this, the authors developed PEACOCK (Predicted Enhancer Activity in Cis Originating from Cell-specific Knowledge), a supervised machine learning framework to score enhancer–gene regulatory links in a cell type-specific manner. Curated, experimentally validated links in four cancer cell lines (HepG2, HCT116, K562, MCF7) provide positive examples, and DNA accessibility data define negatives. Models are trained per cell type and evaluated across cell types, producing a quantitative score for all cis enhancer–gene pairs. Findings show cross-cell consistency in the patterns of features associated with active regulation, enabling pooled training and genome-wide scoring integrated into the PEREGRINE database.
Literature Review
The paper reviews experimental strategies for mapping enhancer–promoter regulation, including 3C-based proximity ligation assays (3C/4C/5C/Hi-C/promoter-capture Hi-C/ChIA-PET/HiChIP) and advanced microscopy, as well as perturbation approaches (CRISPR editing, CRISPRi/CRISPRa) to establish causality. While informative, these methods are not readily scalable across the genome and across many cell types. Computational databases (e.g., HACER, GeneHancer, EnhancerAtlas, SEdb, HEDD) integrate heterogeneous evidence to predict enhancer–gene links but frequently lack sufficient cell type-matched validation and statistically validated scores, making reliability uncertain. Prior modeling frameworks such as TargetFinder and the Activity-By-Contact (ABC) model provide predictive strategies in certain contexts but have limited coverage across the specific cell types evaluated here. Overall, the literature supports key predictive features (enhancer/promoter histone marks, coactivator binding such as P300, eQTLs, proximity), yet emphasizes the need for validated cell type-specific scoring at genome scale.
Methodology
Datasets and curation:
- Positive class: 159 experimentally validated enhancer–gene links across four cancer cell lines met predefined criteria (e.g., CRISPR perturbation altering gene expression; TF binding with motif mutation effects; allele-specific SNP effects with physical interaction evidence). Positives per cell line: HepG2 n=23, HCT116 n=43, K562_1 n=60 (CRISPRi-FlowFISH across 30 genes within 450 kb), K562_2 n=6 (CRISPRi targeting enhancers within 1 Mb of MYC and GATA1), MCF7 n=27. Literature mapping required ≥33% overlap between reported enhancer coordinates and PEREGRINE enhancer coordinates; only protein-coding genes (PANTHER IDs) were included.
- Negative class: Defined using DNA accessibility (ENCODE DNase-seq/FAIRE-seq). Enhancers and/or genes (±2 kb) without overlap with accessibility peaks were deemed inaccessible. All cis pairs (<1 Mb) with an inaccessible component were labeled negative. To match genomic context, negatives were sampled to target the same genes present in positives: HepG2 n=360, HCT116 n=420, K562_1 n=400 (also included CRISPR-nonsignificant p>0.10), K562_2 n=1303 (CRISPRi nonsignificant FDR>0.05), MCF7 n=300.
- Joint training sets: Combined positives and matched negatives from multiple cell types for pooled training.
Feature engineering:
- Main features (9 total): enhancer H3K27ac (binary), enhancer H3K4me1 (continuous), promoter H3K4me3 (binary), enhancer P300 binding (continuous; unavailable in HCT116), eQTL combined Z-score for enhancer→gene (significance), nearest gene (binary), intronic (enhancer within gene intron; binary), average absolute eQTL coefficients within enhancer for same gene (effect size), promoter H3K27ac (continuous). ENCODE provided ChIP-seq/cCRE signals; GTEx tissue-level cis-eQTLs were used (liver, colon, whole blood, breast for respective cell types). Overlaps required ≥50% region overlap. Continuous variables were standardized.
- Interaction terms: Up to 8 enhancer–promoter interaction products among H3K27ac/H3K4me1 enhancer marks and promoter H3K27ac/H3K4me3; additional interactions included eQTL significance × eQTL effect size and enhancer mark interactions, yielding up to 17 features per pair.
Model training and evaluation:
- Algorithms evaluated: random forest, flexible/linear discriminant analysis, gradient boosting, ridge regression, k-nearest neighbors (k-NN), and SVMs with Gaussian radial, polynomial, linear, hyperbolic tangent, Laplace radial, Bessel, and ANOVA radial kernels. Default hyperparameters were used.
- Metrics: Area Under Precision–Recall Curve (AUPRC) prioritized due to class imbalance; AUC reported as secondary. Models trained on HepG2, HCT116, or K562_1 were tested on MCF7 and K562_2 (held-out) and on any non-training datasets among HepG2/HCT116/K562_1.
- Cross-cell generalization: Models trained in one cell type performed well on unrelated cell types, indicating consistent cross-cell patterns in features associated with active enhancer regulation despite cell-specific element activity.
Model selection:
- Selection criterion: sum of minimum AUPRC across test sets plus mean AUPRC; models with minimum AUPRC <0.10 were excluded. Among best models (mean AUPRC 0.55–0.64), the one with highest mean AUC (0.93) was chosen. Feature ablation improved performance slightly by removing one interaction feature (Feature 11); further removals did not help.
- Final model: k-NN trained on the pooled HepG2 + K562_1 set. Alternate final model (for cell types lacking P300): SVM with ANOVA RBF kernel trained on K562_1 without P300 feature.
Genome-wide scoring and outputs:
- Universe: 17,354,145 cis enhancer–gene pairs (enhancer within 1 Mb of gene TSS) were annotated with available features and scored per cell type. P300-missing cell types (e.g., HCT116) used the alternate model. Scores for 880,946 PEREGRINE links were examined against all other cis pairs.
- Interpretability: For each cell type, the raw score in [0,1] is provided along with Z-score (relative to cell type-specific distribution) and percentile F(score). Raw scores are not calibrated probabilities due to nonrepresentative class sampling.
Comparison baselines and external methods:
- Random classifier AUPRC equals positive class proportion.
- Distance-only models trained/tested across datasets using midpoint distances; best AUPRCs per test set reported. PEACOCK was also compared fairly against ABC, GeneHancer, and TargetFinder on intersecting predictions (≥33% enhancer overlap via bedtools).
Data/software availability:
- Scores downloadable from PEREGRINE (www.peregrineproj.org); integrated into PANTHER (www.pantherdb.org). Code and pipelines on GitHub: https://github.com/USCbiostats/PEACOCK.
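The Z-score and percentile reported alongside each raw score can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: the function name `summarize_scores` is hypothetical, the Z-score is assumed to use the mean and population standard deviation of all raw scores within one cell type, and F(score) is assumed to be the empirical cumulative distribution (the fraction of pairs scoring at or below a given value). The input values are made up for illustration.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def summarize_scores(raw_scores):
    """Return (raw, z, percentile) triples for one cell type's raw scores.

    Assumptions (not confirmed by the source): population SD for the
    Z-score and an empirical CDF for the percentile F(score).
    """
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)
    ordered = sorted(raw_scores)
    n = len(ordered)
    triples = []
    for s in raw_scores:
        # Z-score relative to this cell type's own score distribution.
        z = (s - mu) / sigma if sigma > 0 else 0.0
        # Empirical CDF: fraction of pairs with a score <= s.
        pct = bisect_right(ordered, s) / n
        triples.append((s, z, pct))
    return triples


# Hypothetical raw scores for five enhancer-gene pairs in one cell type.
example_scores = [0.0, 0.0, 0.2, 0.4, 0.9]
results = summarize_scores(example_scores)
for raw, z, pct in results:
    print(f"raw={raw:.2f}  z={z:+.2f}  F(score)={pct:.2f}")
```

Because both quantities are computed within a single cell type, they rank a pair against that cell type's own distribution. That makes them more comparable across cell types than the raw scores, which the summary notes are not calibrated probabilities.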
Key Findings
- Cross-cell predictive performance: Models trained in one cell type generalize well to others, indicating consistent feature patterns of active enhancer–gene regulation across cell types despite cell-specific activity of individual pairs. Reported AUPRCs reached as high as 0.77 with AUCs frequently >0.90 across tests.
- Final model performance: The pooled HepG2+K562_1 k-NN final model achieved strong precision–recall performance on held-out test sets: MCF7 AUPRC 0.73 and K562_2 AUPRC 0.56. The alternate SVM (ANOVA kernel, no P300) achieved AUPRCs of 0.55 (MCF7), 0.17 (K562_2), 0.36 (HepG2), and 0.42 (HCT116).
- Genome-wide scoring: 17,354,145 cis enhancer–gene pairs were scored. In HepG2, ~84% of pairs had non-zero scores.
- Enrichment for PEREGRINE links: Across all cell types, PEREGRINE-predicted links had significantly higher PEACOCK scores than other cis pairs (two-sample Kolmogorov–Smirnov tests p<2.2e-16), supporting that PEREGRINE links are more likely to be genuine regulatory relationships.
- Comparison with other methods (AUPRC on intersecting datasets): PEACOCK outperformed random classifier and distance-only baselines and surpassed other approaches: ABC (PEACOCK 0.68 vs ABC 0.47 in K562), GeneHancer (PEACOCK 0.45 vs 0.04 merged sets), TargetFinder (PEACOCK 0.33 vs 0.04).
- Training/test composition: Positives curated: HepG2 23, HCT116 43, K562_1 60, K562_2 6, MCF7 27. Negatives sampled: HepG2 360, HCT116 420, K562_1 400, K562_2 1303, MCF7 300; full negative search spaces numbered ~13–16 million cis pairs per cell line from accessibility screens.
- Systems-level summaries (top 5% F(score)≥0.95): Proportion of enhancers linked to >1 gene ranged ~7.9% (HCT116) to ~30% (MCF7); ~98–99% of genes linked to >1 enhancer in all cell lines. Examples of most-connected elements: GSDMC (608 enhancers, HepG2), LPP (731, HCT116), KCNMB2 (568, K562), PTPRC (535, MCF7); most pleiotropic enhancers linked to up to 102 genes in MCF7.
- Case studies: High Z-scores align with literature in disease-relevant contexts: EH37E0467415–SMAD7 in HCT116 (Z=7.55; CRC risk SNPs with allele-specific activity), EH37E0225350–CCND1 in MCF7 (Z=8.29; ERα hotspot; CRISPR deletion abolishes CCND1 activation), and EH37E0634729–CYP2D6 in HepG2 (Z=4.36; enhancer SNP rs5758550 modulates hepatic expression), with lower scores in unrelated cell lines, underscoring cell specificity.
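To put the reported AUPRCs in context, the random-classifier baseline (AUPRC equal to the positive class proportion) can be worked out from the positive and negative counts listed above. The sketch below is illustrative only. It assumes each dataset is evaluated on exactly the curated positives and sampled negatives reported in this summary, and the names `DATASETS` and `random_baseline_auprc` are hypothetical.

```python
# (positives, sampled negatives) per dataset, as reported in this summary.
DATASETS = {
    "HepG2": (23, 360),
    "HCT116": (43, 420),
    "K562_1": (60, 400),
    "K562_2": (6, 1303),
    "MCF7": (27, 300),
}


def random_baseline_auprc(n_pos, n_neg):
    """Expected AUPRC of a random classifier: the positive class proportion."""
    return n_pos / (n_pos + n_neg)


baselines = {
    name: random_baseline_auprc(n_pos, n_neg)
    for name, (n_pos, n_neg) in DATASETS.items()
}

for name, baseline in baselines.items():
    print(f"{name}: random-classifier AUPRC ~ {baseline:.3f}")
```

Under this assumption the baseline is roughly 0.08 for MCF7 and below 0.01 for K562_2, so the final model's reported AUPRCs of 0.73 and 0.56 on those held-out sets sit well above chance. This also shows why AUPRC values should be compared only within a dataset: the baseline shifts with the class balance.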
Discussion
The study addresses the challenge of accurately identifying enhancer target genes in a cell type-specific manner by learning predictive patterns from experimentally validated links. Findings demonstrate that while enhancer–gene interactions are context-specific, the combinatorial signatures of enhancer activation (histone marks, coactivator binding) and promoter activity, coupled with eQTL evidence and genomic context, form consistent patterns across cell types. This enables robust cross-cell generalization and pooled training, improving performance and scalability. The genome-wide scoring framework provides interpretable, cell type-specific measures (raw score, Z-score, percentile) that help prioritize candidate enhancer–gene links for validation and analysis. Integration into PEREGRINE and PANTHER facilitates practical use by researchers, including mapping disease-associated variants to putative regulatory targets in relevant cell contexts. Comparative analyses show PEACOCK’s advantage over distance-only and several established methods on overlapping datasets, and distributions of scores confirm enrichment among PEREGRINE links, supporting validity. Use cases in colorectal cancer (SMAD7 enhancer), breast cancer (CCND1 enhancer), and liver pharmacogenomics (CYP2D6 enhancer) illustrate that high Z-scores correspond to literature-supported functional regulation and that scores appropriately vary across cell types. Together, results provide a statistically validated, cell type-aware framework for enhancer–gene link assessment that can be incorporated into downstream disease genetics and functional genomics workflows.
Conclusion
The paper introduces PEACOCK, a supervised machine learning framework that generates cell type-specific scores for enhancer–gene regulatory links using a compact feature set derived from common assays (histone marks, P300, eQTLs, proximity). By training on curated, experimentally validated positives and carefully matched negatives, and by leveraging cross-cell consistency in feature patterns, PEACOCK delivers robust performance across diverse cell types. The final k-NN model (and an alternate SVM model for settings without P300) scored 17.35 million cis enhancer–gene pairs and assigned interpretable Z-scores and percentiles, now available via PEREGRINE and PANTHER. PEACOCK advances the field by providing a statistically validated, cell type-aware scoring system that outperforms baselines and several prior approaches on overlapping datasets and helps contextualize disease-associated variants in enhancers. Future directions include: adding gene expression and genomic distance as features; exploring class-imbalance strategies (e.g., SMOTE) and probability calibration; expanding to more cell types as data become available; and investigating higher-order regulatory architectures (cis-regulatory modules, combinatorial TF regulation).
Limitations
- Negative set definition: Based on lack of DNase/FAIRE accessibility, which may misclassify rare cases (e.g., potential active regions within heterochromatic contexts or regions with overlapping heterochromatin/euchromatin marks), introducing some false negatives.
- Positive set scope: Positives are limited in number and derived from curated literature and specific CRISPR datasets; potential inclusion of indirect regulatory effects (e.g., CRISPR deletion causing downstream changes) is acknowledged, especially under Criterion 1.
- Cell type specificity of inputs: eQTL data are at tissue-level resolution (GTEx) rather than exact cell line specificity, potentially diluting cell type-specific signal.
- Missing features: P300 ChIP-seq is unavailable in some cell lines (e.g., HCT116), necessitating an alternate model with lower performance.
- Class imbalance and calibration: Training sets are not sampled to reflect true class frequencies; raw scores are not calibrated probabilities and should not be interpreted as such across cell types.
- Generalizability beyond 1 Mb: The study only considers cis pairs within 1 Mb; long-range trans or more distal cis interactions are not modeled.