Epistatic Features and Machine Learning Improve Alzheimer's Risk Prediction Over Polygenic Risk Scores
S. Hermes, J. Cady, et al.
Alzheimer's disease (AD) is the most common cause of dementia and lacks effective prevention or cure. Early intervention is a growing focus, but pathologic changes precede symptoms by years, limiting biomarker-based screening for pre-symptomatic risk. A genetic risk test could identify high-risk individuals at any point in life. However, the genetic architecture of late-onset AD (LOAD) is complex. APOE ε4 is the strongest known risk factor but accounts for only a small fraction of phenotypic variance, and many loci of small effect also contribute. Polygenic risk scores (PRS) model additive effects across many SNPs and achieve AUCs of around 0.62–0.78 in clinically diagnosed LOAD and ~0.82 in pathologically confirmed cases, but they explain only part of the heritability and are sensitive to population structure, which limits generalizability. Non-additive (epistatic) interactions may account for some of the missing heritability, and prior work combining epistatic and polygenic risk achieved modest gains. This study aims to build a paragenic risk model that integrates epistatic features with machine learning to improve prediction of lifetime LOAD risk beyond standard PRS, and to evaluate its performance overall, across APOE genotypes, and on an external holdout dataset.
Prior PRS studies for LOAD report AUCs of ~0.62–0.78 using clinical diagnoses and up to ~0.82 in pathologically confirmed cases. Despite LOAD heritability estimates near 75%, only ~24% is explained by additive genetic components, and PRS capture roughly 21% of overall heritability. Epistatic interactions among genes implicated in LOAD, including between loci not significant individually, have been reported and may explain missing heritability. A recent model combining epistatic and polygenic risk reported AUC 0.67, a minor improvement over PRS alone within that dataset. PRS transfer poorly across datasets and ancestries, a known limitation for clinical deployment. These gaps motivate integrating epistatic features and advanced machine learning to capture non-linear genetic architecture.
Study population: Data were aggregated from ADNI, NACC/ADGC, FHS, Knight-ADRC (Washington University), and Emory University. Phenotypes and covariates (case/control status, age, APOE genotype, education) were harmonized across cohorts. Individuals younger than 55 and those of non-European ancestry (per the first two genetic PCs) were excluded to minimize stratification. The final modeling dataset comprised 9,139 participants. ADNI3 served as an external holdout: after removing related or overlapping individuals, it contained 316 unique participants with multiple age assessments (681 records), including 28 cases and 238 controls; 77 with MCI were excluded. Participants could contribute multiple age points for progression analysis.
Genotyping and QC: Data from multiple genotyping arrays were imputed to HRC r1.1 using the Michigan Imputation Server. Pre-imputation checks used HRC-1000G-check-bim. After imputation, variants were filtered to biallelic SNPs with Rsq > 0.8, and SNPs with large MAF differences across studies or strand issues were removed. Duplicates were identified with KING and removed. Variants were further filtered to MAF > 0.1 (PLINK).
Cross-validation: All models used 10-fold nested cross-validation with consistent fold partitions. Related individuals were assigned to the same fold for association testing, and the maximum unrelated set was retained in each training fold.
Feature selection: Individual SNPs were selected via BOLT-LMM association with case/control status on each training set, and the top 50 SNPs by log-odds ratio were retained. Epistatic features were mined using Crush-MDR. SNPs were first downsampled by LD to ~100k (PLINK r^2 > 0.11 yielded 98,903 SNPs), and MultiSURF then selected the top 10,000 SNPs associated with disease. Crush-MDR searched 2- and 3-way SNP interactions using multiobjective optimization guided by expert knowledge: the number of shared Gene Ontology pathways between SNP-associated genes, and pairwise mutual information conditioned on case/control status. Interactions were ranked by Pareto optimality on balanced accuracy and mean Cartesian entropy, and the top 100 interactions were selected.
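The fold constraint described under cross-validation (related individuals kept in the same fold) can be illustrated with a minimal sketch. This is not the authors' code: the `family_ids` input, the function name `assign_group_folds`, and the greedy largest-family-first balancing rule are all assumptions made for illustration, and the example uses 3 folds rather than 10 to keep the toy data small.

```python
from collections import defaultdict


def assign_group_folds(family_ids, n_folds=10):
    """Assign each sample to a fold so that all members of a family share a fold.

    Families are placed largest-first into whichever fold is currently
    smallest, which keeps fold sizes roughly balanced (an illustrative
    heuristic, not a method taken from the study).
    """
    members = defaultdict(list)
    for idx, fam in enumerate(family_ids):
        members[fam].append(idx)

    fold_sizes = [0] * n_folds
    fold_of = [None] * len(family_ids)
    # Sort by family size (descending), then by id so the result is deterministic.
    for fam in sorted(members, key=lambda f: (-len(members[f]), str(f))):
        target = min(range(n_folds), key=lambda k: fold_sizes[k])
        for idx in members[fam]:
            fold_of[idx] = target
        fold_sizes[target] += len(members[fam])
    return fold_of


# Hypothetical family labels for eight samples.
family_ids = ["F1", "F1", "F2", "F3", "F3", "F3", "F4", "F5"]
folds = assign_group_folds(family_ids, n_folds=3)
for fam, fold in zip(family_ids, folds):
    print(fam, "-> fold", fold)
```

Reusing the same `folds` list for every model is one simple way to keep partitions consistent across models, as the study describes.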
Interaction encoding: Each interaction was encoded per individual as high-risk or low-risk; genotype combinations not seen in training were encoded as missing.
Models: Two epistatic models were trained per fold on the selected SNPs, epistatic terms, and covariates (age, sex, APOE, education, top 20 genomic PCs): (1) gradient boosting (XGBoost), with hyperparameters tuned via a distributed NSGA-II (Origin) on AWS; and (2) deep learning with NODEnn. Because NODEnn could not accommodate missing covariates, only SNP dosages and epistatic terms were used for that model; genotype dosages were used to reduce missingness, and remaining missing values were imputed by k-NN (k=5). Features were normalized to [0,1]. The NODEnn architecture comprised two blocks of 1,024 neural trees with depth 6 and dimension 3. It was trained with quasi-hyperbolic Adam (recommended settings) and early stopping (10% validation split, patience of 5 epochs) on an NVIDIA Titan RTX GPU.
PRS construction: PRS followed the methodology of Escott-Price et al. Discovery was performed within each training fold and validation on the corresponding test fold. QC retained SNPs with MAF ≥ 0.01, HWE p ≥ 1e-6, and call rate ≥ 0.9, and removed individuals with genotype missingness ≥ 0.1. LD clumping (PLINK --clump) used r^2 > 0.2 and a 1 Mb window. PRS were built using IGAP effect sizes; p-value thresholds from 0.05 to 1.0 were evaluated, and the best threshold was p = 0.6. Performance was assessed via logistic regression including PRS, APOE ε2 and ε4 genotype, age, and sex.
Ensembles (paragenic models): Out-of-fold predictions from XGBoost, NODEnn, and PRS were stacked using logistic regression as a meta-learner (scikit-learn). Ensembles were trained for all combinations of component models, with paragenic models defined as any ensemble including PRS plus at least one epistatic model. Thresholds for matched sensitivity/specificity were chosen per training fold and applied to the test sets. External validation was performed by training on the full aggregated dataset and evaluating on the ADNI3 holdout.
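The high-risk/low-risk encoding of SNP interactions, with unseen genotype combinations treated as missing, can be sketched as follows. The labeling rule here is the classic MDR rule (a combination is high-risk when its case:control ratio exceeds the overall ratio in the training data); whether Crush-MDR applies exactly this rule is an assumption, and the function names and toy data are hypothetical.

```python
from collections import Counter


def fit_mdr_table(genotypes, labels):
    """Learn a high-risk (1) / low-risk (0) label per observed genotype combination.

    genotypes: list of tuples, one tuple of SNP genotypes (0/1/2) per individual
    labels:    list of 0 (control) / 1 (case)
    Assumption: classic MDR rule, comparing each combination's case:control
    ratio with the overall case:control ratio in the training data.
    """
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, labels) if y == 0)
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    table = {}
    for combo in set(cases) | set(controls):
        # Cross-multiplied form of cases/controls > n_cases/n_controls,
        # which avoids dividing by zero when a cell has no controls.
        table[combo] = int(cases[combo] * n_controls > controls[combo] * n_cases)
    return table


def encode_mdr(genotypes, table):
    """Return 1 (high-risk), 0 (low-risk), or None (combination unseen in training)."""
    return [table.get(g) for g in genotypes]


# Toy 2-way interaction: each tuple is (genotype at SNP A, genotype at SNP B).
train_genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1), (2, 0)]
train_labels = [0, 0, 0, 1, 1, 1, 0, 1]
table = fit_mdr_table(train_genotypes, train_labels)

test_genotypes = [(1, 1), (0, 0), (2, 2)]
encoded = encode_mdr(test_genotypes, table)
print(encoded)
```

Fitting the table on the training fold only and then applying it to the test fold mirrors the per-fold feature selection described in the study. The `None` entries would then need a model that tolerates missing values or a separate imputation step.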
- Cross-validation: The full paragenic ensemble (PRS + XGBoost + NODEnn) achieved mean AUC 0.83 (95% CI 0.82–0.84) with matched sensitivity/specificity of ~0.75, significantly outperforming all individual models in AUC (DeLong Z = 3.2555, p = 0.0006). PRS showed higher specificity in some comparisons (χ2 ≈ 553.0, p < 1e-10) but markedly lower sensitivity than the paragenic models (e.g., χ2 = 88.0, p < 1e-16).
- External holdout (ADNI3): XGBoost (epistatic) outperformed all models in AUC (Z = 42.9, p < 1e-15) and had higher sensitivity than PRS (χ2 = 6.0, p = 0.0003). It also had significantly higher specificity among models using SNP/epistatic features (χ2 = 79.0, p = 4.5×10^-5). The baseline model (age + sex + APOE) had the highest specificity but poor AUC and sensitivity. Ensembles did not surpass their components on the holdout; adding PRS generally reduced performance (e.g., XGBoost vs. XGBoost+PRS: Z = 42.9, p < 1e-15; χ2 = 149.0, p < 1e-30 for specificity).
- APOE strata: The paragenic model maintained strong AUC within all APOE genotypes, generally within 4 percentage points of the overall AUC (except ε4/ε4). PRS performance dropped 6–7 points within most strata, except ε2/ε4 where it matched overall.
- Age-related risk: Kaplan–Meier curves by paragenic risk quartiles showed significant separation (log-rank p = 0.0016 for Q1 vs Q2; p < 1e-14 for Q2 vs Q3; p < 1e-17 for Q3 vs Q4).
- Clinical utility: In cross-validation at assumed prevalences of 17% and 32%, the full paragenic and PRS+XGBoost ensembles had the highest PPV/NPV among evaluated models. On holdout, the full paragenic model achieved the strongest PPV, while epistatic-only models (XGBoost, NODEnn) had stronger NPV.
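The dependence of PPV and NPV on assumed prevalence follows from Bayes' rule, and a short sketch makes it concrete. The inputs are illustrative: sensitivity and specificity are both set to 0.75, the approximate matched operating point reported for the full paragenic ensemble in cross-validation, and the printed values are not the study's reported PPV/NPV figures.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Compute positive and negative predictive value via Bayes' rule."""
    tp = sensitivity * prevalence              # true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Assumed prevalences taken from the text; the 0.75 operating point is approximate.
results = {}
for prevalence in (0.17, 0.32):
    ppv, npv = ppv_npv(0.75, 0.75, prevalence)
    results[prevalence] = (ppv, npv)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

At a fixed operating point, PPV rises and NPV falls as prevalence increases, so predictive values depend on the case proportion of the population being screened.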
The study demonstrates that incorporating epistatic interaction features and using machine learning substantially improves genetic risk prediction for LOAD over standard PRS. The paragenic ensemble increased AUC and balanced sensitivity and specificity in cross-validation, and maintained discriminative performance across APOE genotypes, addressing a key limitation of relying on APOE alone. External validation highlighted data-shift challenges: the XGBoost epistatic model generalized best, while stacking with PRS hurt performance on the holdout, consistent with known cross-cohort portability issues of PRS. Nonetheless, ensembling with PRS improved specificity and PPV in some settings. Survival analysis confirmed that paragenic scores stratify age-dependent risk, supporting potential clinical application for trial enrichment and personal risk assessment. The results suggest epistatic modeling captures non-additive genetic risk components missed by PRS, directly addressing the research objective of improving lifetime LOAD risk prediction.
A paragenic risk modeling strategy that integrates epistatic interaction features with machine learning outperforms traditional PRS for clinically diagnosed LOAD, achieving state-of-the-art AUC in cross-validation and robust stratification across APOE genotypes. External validation indicates epistatic models generalize better than PRS across datasets, although ensemble gains may depend on cohort characteristics. Future work should: (1) incorporate environmental and lifestyle covariates to further improve accuracy and enable modifiable risk insights; (2) extend modeling to diverse ancestries to ensure broad applicability; (3) explore genotype-specific modeling (e.g., within APOE strata) and data augmentation for rare genotypes (ε4/ε4); and (4) enhance interpretability to inform biological understanding and potential therapeutic targets.
- Cohort ancestry: Analyses were restricted to individuals of European ancestry; generalizability to other ancestries is unknown.
- Data heterogeneity: Differences in data collection and completeness across cohorts led to informative missingness, precluding inclusion of environmental/lifestyle covariates.
- Model components: The deep learning model (NODEnn) could not incorporate covariates due to missing data handling constraints.
- External validation distribution shift: The ADNI3 holdout had a lower case proportion, contributing to reduced specificity and NPV for paragenic models; PRS portability issues likely reduced ensemble performance.
- Interpretability: The focus was predictive performance rather than mechanistic interpretation; feature importance and causal inferences were not emphasized.