Introduction
Enhancers are cis-regulatory elements that activate gene expression in a cell type- and time-dependent manner. They function independently of orientation, recruiting transcriptional machinery to the promoters of their target genes. A single enhancer can regulate multiple genes, and a single gene can be regulated by multiple enhancers, across varying distances. Enhancers are vital in cell fate determination and are implicated in many diseases, especially cancers, because many disease-associated variants map to them. However, the target genes of most enhancers remain unknown, which hinders understanding of disease etiology. Experimental methods such as proximity ligation assays and epigenome editing can identify enhancer-gene relationships but do not scale genome-wide. Computational approaches exist, but their predictions are often not validated. This study aimed to develop a machine learning approach that predicts cell type-specific enhancer-gene regulatory links and to validate its predictions against existing data.
Literature Review
Numerous computational methods predict enhancer-gene regulatory links, and their output feeds databases such as HACER, GeneHancer, EnhancerAtlas, SEdb, and HEDD. However, these resources lack the cell type-matched data needed for robust validation. While some databases capture known relationships, the proportion of spurious links they contain is often unknown. Manually programming rules to identify active enhancer-gene pairs is infeasible given the complexity of the relationships and the number of contributing factors. Supervised machine learning offers an alternative: algorithms are trained on examples of active and inactive links, using characteristics of the enhancer, the gene, and their interaction as features.
Methodology
PEACOCK (Predicted Enhancer Activity in Cis Originating from Cell-specific Knowledge) uses a machine learning classification approach. Experimentally validated enhancer-gene links from peer-reviewed publications in four cancer cell lines (HepG2, HCT116, K562, and MCF7) comprised the positive training data (159 links total). Publicly available DNA accessibility data provided negative examples. Features included epigenetic marks (H3K27ac, H3K4me1, H3K4me3, P300 binding) from ENCODE, eQTL data from GTEx, and binary features indicating proximity and intronic location. Multiple machine learning algorithms (random forests, flexible discriminant analysis, linear discriminant analysis, gradient boosting machines, ridge regression, k-nearest neighbors, support vector machines with various kernels) were trained and evaluated using AUPRC and AUC on test datasets. A final model (k-nearest neighbors) and an alternate model (support vector machine with ANOVA kernel, for datasets lacking P300 data) were selected based on consistent performance across cell types. These models were used to score all possible cis enhancer-gene pairs (<1 Mb apart, ~17 million). Cell type-specific scores, Z-scores (standard deviations from the mean), and F-scores (percentiles) were calculated and integrated into the PEREGRINE database and PANTHER.
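The Z-scores and F-scores described above can be illustrated with a minimal sketch. The summary does not say whether a population or sample standard deviation was used, or exactly how percentiles were defined, so the choices below (population standard deviation, and percentile as the fraction of scores less than or equal to a given score) are assumptions, as are the function names and the toy scores.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def z_scores(scores):
    """Standard deviations from the mean (population SD is an assumption)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: no spread, so every Z-score is zero.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def percentile_scores(scores):
    """Fraction of scores <= each score (one possible percentile definition)."""
    ordered = sorted(scores)
    n = len(ordered)
    return [bisect_right(ordered, s) / n for s in scores]


# Toy model scores for four hypothetical enhancer-gene pairs in one cell type.
raw = [0.10, 0.20, 0.20, 0.90]
z = z_scores(raw)
f = percentile_scores(raw)

for score, zi, fi in zip(raw, z, f):
    print(f"score={score:.2f}  z={zi:+.2f}  percentile={fi:.2f}")
```

Under this definition tied scores share the same percentile, and the highest-scoring pair always receives 1.0.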
Key Findings
The final PEACOCK model demonstrated good performance within and across cell types, suggesting consistency in feature patterns despite cell-type-specific enhancer activity. The best-performing model (k-nearest neighbors) achieved AUPRCs as high as 0.77 and AUCs frequently above 0.90. PEREGRINE enhancer-gene links consistently received significantly higher scores than other cis enhancer-gene pairs, indicating higher likelihood of active regulatory relationships. PEACOCK outperformed other methods like ABC, GeneHancer, and TargetFinder in AUPRC, suggesting improved predictive capabilities. Analysis of the top 5% of predictions (F-score ≥ 0.95) revealed consistent patterns across cell lines regarding the number of genes regulated by multiple enhancers and vice versa. Specific examples highlight the cell-type-specific nature of the scores, illustrating the model's ability to discriminate between active and inactive relationships in different cellular contexts (e.g., SMAD7-EH37E0467415 in HCT116, CCND1-EH37E0225350 in MCF7, CYP2D6-EH37E0634729 in HepG2).
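The top-5% analysis can be sketched as a filter followed by two tallies. This is only an illustration under stated assumptions: the enhancer and gene identifiers, the F-scores, and the function names are hypothetical, and the actual PEREGRINE data format is not described in this summary.

```python
from collections import defaultdict


def top_links(links, threshold=0.95):
    """Keep (enhancer, gene) pairs whose F-score meets the threshold."""
    return [(enh, gene) for enh, gene, f_score in links if f_score >= threshold]


def count_partners(pairs):
    """Count distinct enhancers per gene and distinct genes per enhancer."""
    enhancers_by_gene = defaultdict(set)
    genes_by_enhancer = defaultdict(set)
    for enh, gene in pairs:
        enhancers_by_gene[gene].add(enh)
        genes_by_enhancer[enh].add(gene)
    per_gene = {g: len(e) for g, e in enhancers_by_gene.items()}
    per_enhancer = {e: len(g) for e, g in genes_by_enhancer.items()}
    return per_gene, per_enhancer


# Hypothetical (enhancer, gene, F-score) records for one cell type.
links = [
    ("enhA", "GENE1", 0.99),
    ("enhB", "GENE1", 0.96),
    ("enhB", "GENE2", 0.97),
    ("enhC", "GENE2", 0.40),
]

top = top_links(links)
enhancers_per_gene, genes_per_enhancer = count_partners(top)
print(enhancers_per_gene)
print(genes_per_enhancer)
```

Running the same tallies per cell line would allow the kind of cross-cell-line comparison the paragraph describes.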
Discussion
PEACOCK addresses the need for a validated cell type-specific scoring system for enhancer-gene regulatory links. The high performance and consistent results across cell types demonstrate the robustness of the approach. The integration into the PEREGRINE and PANTHER databases makes the results readily accessible to researchers. The cell-type specificity of the scores is crucial, as it allows for better interpretation of predicted relationships in disease-relevant contexts. The ability to discriminate between active and inactive enhancer-gene pairs significantly improves the utility of existing enhancer prediction databases.
Conclusion
PEACOCK provides a robust and accessible method for assessing the validity of cell type-specific enhancer-gene regulatory relationships. The cell type-specific scores significantly enhance the interpretation of enhancer-gene predictions, particularly in disease research. Future directions include investigating more complex regulatory interactions, such as cis-regulatory modules (CRMs), and incorporating additional features, such as gene expression levels, to further improve prediction accuracy.
Limitations
The study relied on a limited number of experimentally validated enhancer-gene links for training. The class imbalance in the training data (more negative than positive examples) might affect model performance, although evaluating with AUPRC, which is more sensitive to false positives under imbalance than AUC, partly accounts for this. The features used might not capture the full complexity of enhancer-gene interactions. Generalizability to cell types beyond the four studied needs further investigation.
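A small sketch shows why AUPRC is the more informative metric when negatives outnumber positives. It is not the study's evaluation code: AUPRC is approximated here by average precision, AUC is computed from pairwise comparisons, and the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Two positives with fixed scores; only the number of negatives changes.
pos_scores = [0.9, 0.6]

balanced_scores = pos_scores + [0.7, 0.1]
balanced_labels = [1, 1, 0, 0]

imbalanced_scores = pos_scores + [0.7] * 10 + [0.1] * 10
imbalanced_labels = [1, 1] + [0] * 20

auc_bal = roc_auc(balanced_labels, balanced_scores)
auc_imb = roc_auc(imbalanced_labels, imbalanced_scores)
ap_bal = average_precision(balanced_labels, balanced_scores)
ap_imb = average_precision(imbalanced_labels, imbalanced_scores)

print(f"AUC: balanced={auc_bal:.3f}  imbalanced={auc_imb:.3f}")
print(f"AP:  balanced={ap_bal:.3f}  imbalanced={ap_imb:.3f}")
```

In this toy setup the AUC is identical in both cases, while average precision drops once the extra negatives outrank one of the positives. That is the sense in which AUPRC reflects class imbalance rather than removing it.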