Using computational approaches to enhance the interpretation of missense variants in the *PAX6* gene

Medicine and Health

N. S. Andhika, S. Biswas, et al.

Nadya S. Andhika, Susmito Biswas, and colleagues benchmark computational prediction tools on missense variants in *PAX6*, a gene essential for eye development. Their study shows that *PAX6*-specific score thresholds improve prediction performance over default thresholds, most notably in specificity.

Introduction
PAX6 encodes a DNA-binding transcription factor essential for ocular development. Pathogenic PAX6 variants cause a spectrum of eye disorders, most commonly aniridia due to haploinsufficiency, while missense variants can range from milder disease to severe microphthalmia/anophthalmia. Interpreting the growing number of PAX6 missense variants is challenging; many are classified as variants of uncertain significance (VUS) under ACMG/AMP criteria. In silico tools incorporating features such as evolutionary conservation and protein/domain context are widely used, including meta-predictors, but their performance varies across genes. Prior evaluations in other genes show variability and suggest gene-specific thresholds may improve predictions. No prior PAX6-focused evaluation/optimization existed; this study addresses that by benchmarking and optimizing computational tools for PAX6 missense variant interpretation.
Methodology
Study design: Benchmark and optimize computational prediction tools for PAX6 missense variants using primary (public) and secondary (local) datasets, derive gene-specific thresholds via ROC/MCC optimization, assess combinations of top tools, and validate with cross-validation and an external dataset.
Datasets and curation:
- Primary datasets (accessed Feb 2023): aggregated PAX6 missense variants from gnomAD v2.1.1 and v3.1.1 (controls/biobanks), LOVD v2/v3, HGMD Public, ClinVar, plus PubMed literature (2021–2023). Duplicates and VUS (including HGMD DM?) were excluded. Variants were categorized as:
  • Primary Dataset Neutral: (i) benign/likely benign classifications, and (ii) variants present in gnomAD controls/biobanks, considered "presumed benign" without allele-frequency filtering (to preserve sample size).
  • Primary Dataset Disease: variants labeled pathogenic/likely pathogenic in ClinVar/LOVD/literature and HGMD DM.
- Secondary datasets (accessed May 2023): variants from the Manchester Centre for Genomic Medicine (MCGM) diagnostics database, classified per ACMG/AMP 2015; likely pathogenic/pathogenic variants comprised Secondary Dataset Disease. Presumed benign variants were collected from BRAVO/TOPMed Freeze 8 (Secondary Dataset Neutral). VUS were retained for downstream analysis.
- Totals: Primary = 241 variants (167 Disease, 74 Neutral). Secondary = 10 Disease, 65 Neutral, plus 7 VUS.
Variant representation:
- Genome build GRCh38; gnomAD v2 variants were lifted over. Transcript ENST00000241001 (canonical PAX6, 422 aa; UniProt P26367-1).
Descriptive analysis:
- Variant distribution was mapped along the PAX6 protein using cBioPortal (v5.4.5) lollipop plots to visualize clustering in domains (Paired Domain and Homeodomain).
Computational tools and scoring:
- Ten tools assessed: AlphaMissense, BayesDel (AddAF), CADD (phred), ClinPred, Eigen (raw coding), MutPred2, PolyPhen-2 (HumVar), REVEL, SIFT4G, VEST4. Scores were obtained from dbNSFP v4.1; AlphaMissense scores came from AlphaMissense_hg38.tsv.gz.
- Default thresholds were used per tool (from developers or prior literature). For AlphaMissense, scores 0.564–1.0 were considered predicted pathogenic by default; SIFT4G uses inverse scoring (lower = more damaging). Categorical outputs such as "deleterious/damaging/probably/possibly damaging" were mapped to predicted pathogenic; "tolerated/benign" to predicted benign.
Performance assessment and optimization:
- Metrics: sensitivity, specificity, accuracy, PPV, and Matthews Correlation Coefficient (MCC), computed on the primary datasets. MCC was used to identify the best-performing tool.
- Threshold optimization: ROC-based search, by iterative threshold adjustment, for the threshold maximizing MCC for each tool. Performance with optimized vs default thresholds was compared (IBM SPSS v25).
Combination of tools:
- A custom meta-predictor was constructed from the three highest-MCC tools with their optimized thresholds, applying a majority rule (≥2 of 3 tools predicting pathogenic → overall predicted pathogenic).
Validation and evaluation:
- Fivefold cross-validation focused on the best tool (AlphaMissense): data were randomly partitioned into 5 folds; in each iteration, the threshold was optimized on 4 folds (training) and evaluated on the held-out fold (testing). Performance was averaged across folds.
- External validation on the secondary dataset (MCGM + BRAVO) with both default and optimized thresholds; the tools were also applied to the 7 VUS.
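The threshold search and fold-wise evaluation described in the Methodology can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure (the study reports using IBM SPSS v25): the function names, the choice of observed scores as candidate thresholds, the fold assignment, and the toy scores below are all assumptions made for the example and are not data from the study.

```python
import random
from math import sqrt


def confusion(scores, labels, threshold, higher_is_pathogenic=True):
    """Return (TP, TN, FP, FN) for one threshold; labels are 1 = disease, 0 = neutral."""
    tp = tn = fp = fn = 0
    for score, is_pathogenic in zip(scores, labels):
        # SIFT4G-style inverse scales call a variant pathogenic at or below the threshold.
        predicted = score > threshold if higher_is_pathogenic else score <= threshold
        if predicted and is_pathogenic:
            tp += 1
        elif predicted:
            fp += 1
        elif is_pathogenic:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; defined as 0.0 when the denominator is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


def best_threshold(scores, labels, higher_is_pathogenic=True):
    """Try every observed score as a candidate threshold and keep the MCC-maximizing one."""
    def mcc_at(t):
        return mcc(*confusion(scores, labels, t, higher_is_pathogenic))
    best = max(sorted(set(scores)), key=mcc_at)
    return best, mcc_at(best)


def cross_validate(scores, labels, k=5, seed=0, higher_is_pathogenic=True):
    """Optimize the threshold on k-1 folds, then score the held-out fold; one MCC per fold."""
    order = list(range(len(scores)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    results = []
    for held_out in folds:
        train = [i for i in order if i not in held_out]
        threshold, _ = best_threshold(
            [scores[i] for i in train], [labels[i] for i in train], higher_is_pathogenic
        )
        results.append(mcc(*confusion(
            [scores[i] for i in held_out], [labels[i] for i in held_out],
            threshold, higher_is_pathogenic,
        )))
    return results


# Toy, made-up scores on an AlphaMissense-like 0-1 scale (NOT study data).
toy_scores = [0.99, 0.98, 0.97, 0.96, 0.90, 0.60, 0.30, 0.10]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

default_mcc = mcc(*confusion(toy_scores, toy_labels, 0.564))  # default cutoff from the text
threshold, optimized_mcc = best_threshold(toy_scores, toy_labels)
print(round(default_mcc, 3), threshold, optimized_mcc)
```

On this toy set the default cutoff calls two neutral variants pathogenic, while the searched threshold separates the classes, which mirrors the specificity gain the study attributes to gene-specific thresholds. With real data the per-fold MCC values from `cross_validate` would be averaged, as the Methodology describes.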
Key Findings
- Dataset composition:
  • Primary: 241 missense variants (167 presumed pathogenic; 74 presumed benign).
  • Secondary: 10 pathogenic/likely pathogenic; 65 presumed benign; 7 VUS.
- Distribution: Presumed pathogenic variants cluster within the Paired Domain and Homeodomain; presumed benign changes fall more often outside the DNA-binding domains. VUS showed no clear clustering.
- Default-threshold performance (primary):
  • Specificity was generally low across tools; SIFT4G (96%) and AlphaMissense (81%) were highest. Several tools had specificity <70% (e.g., CADD 12%, BayesDel 14%).
  • MCC (top three): SIFT4G 0.74; AlphaMissense 0.72; MutPred2 0.62. Accuracy ranged from 72% to 90%; PPV from 71% to 96%.
- Optimized thresholds (primary; maximizing MCC):
  • AlphaMissense: optimized threshold >0.9667 (approx. 0.967); MCC 0.81 (highest), with improved specificity and overall metrics.
  • SIFT4G: optimized threshold ≤0.03 (given as approx. 0.025 in the abstract); MCC 0.77.
  • REVEL: optimized threshold >0.77 (approx. 0.772); MCC 0.77.
  • All tools improved, notably in specificity, under gene-specific thresholds.
- Combination of top tools:
  • Majority vote of AlphaMissense + SIFT4G + REVEL (optimized thresholds) achieved MCC ~0.78, sensitivity 87%, and accuracy 90%, but did not exceed AlphaMissense alone (MCC 0.81).
- Cross-validation (AlphaMissense with optimized threshold):
  • Average ± SD across 5 folds: Specificity 93.1% ± 5.1; Sensitivity 89.5% ± 6.3; Accuracy 90.6% ± 4.0; PPV 96.7% ± 2.2; MCC 0.80 ± 0.10.
- Secondary dataset performance:
  • With default thresholds, specificities were low (e.g., AlphaMissense and PolyPhen-2 65%; SIFT4G 79%).
  • With optimized thresholds: AlphaMissense MCC 0.63 (Sp 88%, Sn 90%, Acc 88%); REVEL MCC 0.61; SIFT4G MCC 0.56; 3-tool combination MCC 0.58.
- VUS assessment (n=7): Six VUS were predicted pathogenic by all ten tools. One variant, PAX6 c.926T>G p.(Phe309Cys), located in the C-terminus, was predicted benign by AlphaMissense (0.1654) and SIFT4G (0.16), illustrating potential domain-specific behavior and AlphaMissense's refined site-specific predictions.
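As a rough illustration of the majority rule, the sketch below applies the three optimized thresholds reported above to the p.(Phe309Cys) scores. The helper names and the dictionary layout are hypothetical. The REVEL score for this variant is not given in this summary, so the example tries both extremes of REVEL's 0–1 range rather than assuming a value.

```python
# PAX6-optimized thresholds as reported in the Key Findings.
# Each entry: (threshold, True if higher scores are more pathogenic).
OPTIMIZED = {
    "AlphaMissense": (0.9667, True),  # pathogenic if score > 0.9667
    "SIFT4G": (0.03, False),          # inverse scale: pathogenic if score <= 0.03
    "REVEL": (0.77, True),            # pathogenic if score > 0.77
}


def tool_call(tool, score):
    """Return True when a single tool predicts pathogenic at its optimized threshold."""
    threshold, higher_is_pathogenic = OPTIMIZED[tool]
    return score > threshold if higher_is_pathogenic else score <= threshold


def majority_vote(scores):
    """Overall call is pathogenic when at least 2 of the 3 tools predict pathogenic."""
    if set(scores) != set(OPTIMIZED):
        raise ValueError("expected exactly one score per tool in OPTIMIZED")
    votes = sum(tool_call(tool, score) for tool, score in scores.items())
    return votes >= 2


# p.(Phe309Cys): AlphaMissense and SIFT4G scores come from the text;
# REVEL is unknown here, so both ends of its range are tried.
for revel in (0.0, 1.0):
    phe309cys = {"AlphaMissense": 0.1654, "SIFT4G": 0.16, "REVEL": revel}
    print(revel, majority_vote(phe309cys))
```

Because AlphaMissense and SIFT4G both fall on the benign side of their optimized thresholds, the majority call for this variant is benign whatever REVEL reports.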
Discussion
The study demonstrates that while many in silico tools show high sensitivity for PAX6 missense pathogenicity, specificity at default thresholds is suboptimal, risking misclassification of benign variants. Implementing PAX6-specific thresholds substantially improves performance, especially specificity. After optimization, AlphaMissense provided the highest MCC and robust cross-validated performance, outperforming combinations of multiple tools. This contrasts with some prior gene-specific studies where ensemble approaches excelled, but aligns with others suggesting a single high-performing tool can suffice. The improved results likely reflect mitigation of underfitting inherent in genome-wide default thresholds that ignore gene-specific characteristics (e.g., PAX6’s conserved DNA-binding domains). Application to local VUS showcased how AlphaMissense and SIFT4G may downclassify variants outside key domains, potentially aiding clinical resolution. Nonetheless, computational predictions should be integrated with other evidence (segregation, frequency, functional data) under ACMG/AMP frameworks and prospective Bayesian calibrations.
Conclusion
Using gene-specific thresholds substantially enhances the interpretive performance of computational tools for PAX6 missense variants. AlphaMissense with an optimized PAX6 threshold performed best and exceeded majority-vote combinations of top tools. This approach can support more precise clinical interpretation and diagnosis in PAX6-related disorders. Future work should expand variant sets, incorporate predictors of additional mechanisms (e.g., splicing, expression, 3D-structure), and refine integration with clinical and functional evidence within updated ACMG/AMP and Bayesian frameworks.
Limitations
- Sample size constraints due to the rarity of PAX6-related disease limit the number of presumed pathogenic variants.
- Possible circularity: some variants used for evaluation may have been included in the training sets of the assessed tools.
- Presumed benign variants from population datasets (gnomAD, BRAVO) could include pathogenic changes (subclinical phenotypes, reduced penetrance); analyses with intentionally contaminated datasets still reproduced the main findings.
- The study did not systematically exclude potential exonic splice-affecting missense variants or evaluate non-missense mechanisms (e.g., splicing, gene expression regulation).
- Splicing- or structure-focused tools were not integrated; an ensemble of diverse mechanism predictors was not examined.
- Differences between the primary and secondary datasets (case mix, ancestry, domain representation) may influence performance estimates, particularly specificity.