Introduction
Alzheimer's disease (AD) is the most prevalent type of dementia, primarily affecting the elderly. Its progressive nature necessitates early risk prediction for clinical trials and experimental studies. While lifestyle factors account for approximately 35% of lifetime dementia risk, increased lifespan continues to drive prevalence. Even age-matched control groups in studies may include future AD cases. The APOE-ε4 allele, associated with earlier AD onset, has an age-dependent frequency, further complicating risk assessment. Genome-wide association studies (GWAS) often compare AD cases with cognitively normal individuals without strict age matching. AD risk is polygenic, with multiple genes contributing to disease development, yet there is no consensus on the optimal method for calculating polygenic risk scores (PRS), how to model APOE within a PRS, which p-value threshold to use for SNP selection, or how to make scores comparable across studies. This study aimed to address these methodological inconsistencies in order to robustly identify individuals at very high or very low AD risk.
Literature Review
Existing literature highlights inconsistencies in approaches to calculating polygenic risk scores (PRS) for Alzheimer's disease (AD). Disagreement exists regarding the optimal p-value threshold for SNP selection, how to model the effect of the APOE gene, and how to compare PRS across different studies and cohorts. Some studies suggest a polygenic architecture for AD, while others propose an oligogenic model. Methods for PRS calculation vary, from simpler methods such as clumping and thresholding (C+T) to more complex Bayesian approaches. The choice of methodology significantly affects which individuals are identified as high or low risk, even at the extremes of the PRS distribution. Previous research demonstrates that selecting individuals more than two standard deviations from the PRS mean distinguishes high-risk from low-risk individuals with high accuracy, but which individuals are selected depends on the methodology used. This study aimed to synthesize these divergent approaches and propose a best-practice model.
Methodology
The study used multiple datasets, including the 1000 Genomes Project, UK Biobank, ADNI, ROSMAP, Mount Sinai Brain Bank, and Mayo Clinic Alzheimer's Disease Research Center. The primary PRS calculation employed a clumping and thresholding (C+T) approach using summary statistics from a large GWAS (N = 63,296). Scores were calculated at several p-value thresholds (pT = 1e-5, 0.1, 0.5). Several models were developed, including: PRS.FULL (including SNPs with pT ≤ 1e-5, excluding APOE region), PRS.NO.APOE (including SNPs with pT ≤ 1e-5 or 0.1, excluding APOE region), and PRS.AD (weighted sum of PRS.NO.APOE and APOE genotypes). Other PRS calculation methods (PRSice, LDPred-Inf, PRS-CS, LDAK, SBayesR) were also employed for comparison. Logistic regression was used for case-control analysis, with prediction accuracy assessed via AUC and R². PRS standardization was performed both within the sample and against the 1000 Genomes European population. The analysis focused on identifying individuals at the extremes of the PRS distribution (more than 2 SD above or below the population mean). A simulation study was also conducted to assess the impact of age-dependent APOE-ε4 frequency.
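To make the scoring step concrete, the following is a minimal sketch of a thresholded PRS and of a weighted sum with APOE allele counts. It is illustrative only and is not the study's pipeline: the SNPs are assumed to be LD-clumped already, the function names `thresholded_prs` and `combined_score` are hypothetical, and the toy dosages, effect sizes, and APOE weights are invented placeholders rather than estimates from any dataset.

```python
import numpy as np

def thresholded_prs(dosages, betas, pvalues, p_threshold):
    """Sum of effect-size-weighted allele dosages over SNPs with p <= p_threshold.

    dosages: (n_individuals, n_snps) effect-allele counts (0, 1 or 2).
    betas:   (n_snps,) GWAS log odds ratios for the effect allele.
    pvalues: (n_snps,) GWAS p-values. Assumption: SNPs are already LD-clumped.
    """
    dosages = np.asarray(dosages, dtype=float)
    betas = np.asarray(betas, dtype=float)
    keep = np.asarray(pvalues, dtype=float) <= p_threshold
    return dosages[:, keep] @ betas[keep]

def combined_score(prs_no_apoe, e4_count, e2_count, w_prs, w_e4, w_e2):
    """Weighted sum of a non-APOE PRS and APOE allele counts.

    The weights are supplied by the caller; how they are estimated is
    outside the scope of this sketch.
    """
    return (w_prs * np.asarray(prs_no_apoe, dtype=float)
            + w_e4 * np.asarray(e4_count, dtype=float)
            + w_e2 * np.asarray(e2_count, dtype=float))

# Toy data: 3 individuals, 4 clumped SNPs outside the APOE region.
dosages = np.array([[0, 1, 2, 0],
                    [1, 1, 0, 2],
                    [2, 0, 1, 1]])
betas = np.array([0.10, -0.05, 0.20, 0.02])
pvalues = np.array([1e-8, 0.03, 0.08, 0.40])

prs_strict = thresholded_prs(dosages, betas, pvalues, p_threshold=1e-5)
prs_liberal = thresholded_prs(dosages, betas, pvalues, p_threshold=0.1)

# Placeholder APOE allele counts and weights, chosen only for illustration.
e4_count = np.array([0, 1, 2])
e2_count = np.array([1, 0, 0])
score = combined_score(prs_liberal, e4_count, e2_count,
                       w_prs=1.0, w_e4=1.2, w_e2=-0.5)
```

With the strict threshold only one SNP contributes to the score, whereas the 0.1 threshold admits three, which is the kind of difference the threshold comparison is meant to expose.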
Key Findings
The study found that the best prediction accuracy for Alzheimer's disease risk was achieved using a model combining APOE genotype and a PRS excluding the APOE region, with a p-value threshold of ≤ 0.1 for SNP selection. Different PRS calculation methods yielded similar overall prediction accuracy (AUC) but identified different individuals as high or low risk. Standardizing PRS against the population mean (1000 Genomes) improved comparability across studies and revealed more individuals at the extremes of the risk distribution. The number of individuals identified at the extremes of the PRS distribution differed substantially depending on the standardization approach (sample vs. population). The highest odds ratio and prediction accuracy were observed with PRS.AD (OR = 124, AUC = 88.2%), while the lowest were observed with the oligogenic risk scores (OR = 10, AUC = 74.6%). Excluding APOE reduced the number of extremes identified, but prediction accuracy remained high (OR = 95, AUC = 95.7%). The oligogenic model was not useful for discriminating between ε3 cases and controls. The greatest overlap in identified extremes was between the PRS(C+T) and PRSice methods. Overall, LDPred-Inf, PRS(C+T), PRSice, and PRS-CS showed considerable overlap in identified extremes, in contrast to LDAK and SBayesR.
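The effect of the standardization reference on who counts as an extreme can be illustrated with a small sketch. It rests on assumptions: the raw scores are toy values, the 2 SD cutoff follows the text, and the helper names `standardize`, `find_extremes`, and `jaccard_overlap` are hypothetical. The Jaccard index is used here only as one plausible way to quantify overlap between methods; the source does not state which overlap measure was used.

```python
import numpy as np

def standardize(scores, ref_mean, ref_sd):
    """Convert raw PRS to z-scores using a chosen reference mean and SD."""
    return (np.asarray(scores, dtype=float) - ref_mean) / ref_sd

def find_extremes(z, cutoff=2.0):
    """Return index sets of individuals at or beyond +cutoff and -cutoff SD."""
    z = np.asarray(z, dtype=float)
    high = set(np.flatnonzero(z >= cutoff).tolist())
    low = set(np.flatnonzero(z <= -cutoff).tolist())
    return high, low

def jaccard_overlap(a, b):
    """Share of individuals flagged by both methods among those flagged by either."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy raw scores: a reference population and a case-enriched study sample.
reference_scores = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
sample_scores = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])

# Within-sample standardization uses the sample's own mean and SD.
z_sample = standardize(sample_scores, sample_scores.mean(), sample_scores.std())
# Population standardization uses the reference population's mean and SD.
z_population = standardize(sample_scores,
                           reference_scores.mean(), reference_scores.std())

high_sample, low_sample = find_extremes(z_sample)
high_population, low_population = find_extremes(z_population)

# Overlap between the high-risk sets of two hypothetical scoring methods.
overlap = jaccard_overlap({3, 4, 5}, {4, 5, 7})
```

In this toy case the within-sample z-scores flag nobody, because a case-enriched sample shifts its own mean upward, while standardizing against the reference population flags three individuals as high risk.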
Discussion
This study’s findings emphasize the importance of methodological choices in calculating and interpreting PRS for AD. The superior performance of the model incorporating APOE separately underscores the gene’s significant role in AD risk. The need for population-based standardization highlights the challenge of directly comparing PRS across diverse cohorts. The substantial variation in identified high-risk individuals across different PRS methods underscores the need for methodological standardization and consensus. This research has implications for selecting participants in AD research, potentially leading to more efficient clinical trials and drug development. Future research should focus on refining PRS models by integrating additional data types and improving our understanding of the interplay between genetic and environmental factors in AD pathogenesis.
Conclusion
This study demonstrates that for AD risk prediction, a p-value threshold of ≤ 0.1, separate modeling of APOE, and population-based PRS standardization are optimal. The PRS(C+T) method showed superior performance. This work provides valuable guidance for researchers aiming to reliably identify individuals at high or low risk of AD using PRS, facilitating more efficient and impactful AD research.
Limitations
The relatively small size of the case-control sample may have limited statistical power and precision. Variability in clinical definitions of AD and age recording across cohorts may also have influenced results. Excluding the entire APOE locus to avoid linkage disequilibrium issues might have inadvertently excluded independently associated SNPs. Replication in independent datasets is crucial to ensure generalizability of findings.