Polygenic prediction of educational attainment within and between families from genome-wide association analyses in 3 million individuals

Economics

A. Okbay, Y. Wu, et al.

This genome-wide association study of about 3 million individuals identified 3,952 approximately independent SNPs associated with educational attainment. The resulting polygenic index predicts educational and cognitive outcomes and also carries predictive power for ten common diseases. The research, led by Aysu Okbay and collaborators, further examines how much of that prediction reflects direct genetic effects rather than family environment and assortative mating.

Introduction
Educational attainment (EA) is a key dimension of socioeconomic status, measured accurately and routinely in cohort studies and linked to many health behaviors and outcomes, including mortality. Prior GWAS meta-analyses of EA (N ~1.1 million) identified many loci. This study updates and expands EA GWAS to N = 3,037,499, primarily by increasing the 23andMe contribution to ~2.3 million genotyped participants. Core aims are to: (1) identify additional associated SNPs and improve effect-size precision; (2) construct and evaluate a more predictive polygenic index (PGI) for EA and related cognitive/academic outcomes; (3) quantify the fraction of the PGI’s predictive power due to direct genetic effects versus indirect genetic and environmental correlations using within-family designs; (4) assess assortative mating contributions using mate-pair PGIs; and (5) test for non-additive (dominance) genetic effects and update X-chromosome associations. The study’s significance lies in advancing predictive genetics for EA, dissecting sources of polygenic predictability, and informing long-standing debates in behavior genetics about dominance, assortative mating, and gene–environment correlation.
Literature Review
- Previous EA GWAS meta-analyses (e.g., Lee et al., 2018) with ~1.1M individuals identified 1,271 genome-wide-significant SNPs and demonstrated polygenic prediction but left open questions about direct versus indirect effects and assortative mating.
- Quantitative genetics theory and recent work (Hivert et al., 2021; Pazokitoroudi et al., 2021) indicate dominance variance from common variants is small across many traits, but EA had not been specifically assessed.
- Prior analyses reported high male–female genetic correlation for EA (~0.98) and X-chromosome SNP heritability ~0.4%.
- Polygenic scores trained in European-ancestry samples often show reduced predictive accuracy in non-European ancestries; frameworks by Wang et al. quantify contributions of LD and MAF differences to this loss of accuracy.
- Within-family designs and relatedness disequilibrium regression have been proposed to separate direct from indirect genetic effects and to mitigate confounding from population stratification and assortative mating.
- Behavior genetics debates have involved differing assumptions on dominance, assortative mating, and special twin environments to explain familial resemblance in cognitive traits; large-scale GWAS and PGIs can refine these models.
Methodology
Phenotype coding: EA (EduYears) was coded by mapping each participant's highest qualification to ISCED 1997 categories with years-of-education equivalents, consistent with prior GWAS. In UK Biobank (UKB), individuals with NVQ/HND/HNC but no degree were recoded as the age at which they left full-time education minus five (dropping implausible values below 12), correcting an overestimate in the prior coding. Earlier analyses excluded participants younger than 30, but ~16% of 23andMe participants were aged 16–29; UKB-based simulations indicated minimal impact on the meta-analysis results.
Additive autosomal GWAS meta-analysis: Three components were combined: (1) public Lee et al. results excluding 23andMe and UKB (N = 324,162); (2) new 23andMe association results (N = 2,272,216); (3) a new UKB GWAS with updated EA coding (N = 441,121). Analyses were restricted to European-ancestry individuals passing each cohort's QC and typically aged ≥30 at EA measurement (except in 23andMe). Sex-stratified autosomal analyses were not run given prior evidence of near-unity male–female genetic correlation for EA. Quality control used an updated EasyQC-like pipeline, and the sample-size-weighted meta-analysis was run in METAL. To adjust for stratification and relatedness, standard errors were inflated by the square root of the LDSC intercept (√1.663). Lead SNPs were identified by PLINK clumping (r² < 0.1, no distance cutoff) at P < 5×10⁻⁸; sensitivity checks used COJO (GCTA) to identify jointly associated SNPs.
Biological annotation: Stratified LD Score regression compared enrichment patterns with the prior meta-analysis using a recent set of SNP annotations, with improved precision due to the larger N.
X-chromosome analysis: Pooled-sex association analyses were conducted in 23andMe (N = 2,272,216; male 0/2 coding), and sex-stratified UKB analyses were meta-analyzed (N = 440,817; harmonized to 0/2 male coding). A sample-size-weighted meta-analysis of 211,581 QC-passed X SNPs was performed, with SEs inflated by √1.663 (the autosomal LDSC intercept). Lead X SNPs were selected with the same clumping algorithm.
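The sample-size-weighted meta-analysis and the intercept-based standard-error inflation described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: it assumes the common sqrt(N)-weighted Z-score scheme, the function names are hypothetical, and the per-cohort Z-scores are made-up values. Only the cohort sizes and the intercept of 1.663 come from the text.

```python
import math


def sample_size_weighted_z(z_scores, sample_sizes):
    """Combine per-cohort Z-scores using weights proportional to sqrt(N)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def intercept_adjusted_z(z, ldsc_intercept):
    """Inflating the SE by sqrt(intercept) divides the Z-score by the same factor."""
    return z / math.sqrt(ldsc_intercept)


# Cohort sizes from the text: Lee et al. (excluding 23andMe and UKB), 23andMe, UKB.
cohort_n = [324_162, 2_272_216, 441_121]
# Hypothetical Z-scores for a single SNP, for illustration only.
cohort_z = [2.1, 5.8, 2.9]

z_meta = sample_size_weighted_z(cohort_z, cohort_n)
z_adjusted = intercept_adjusted_z(z_meta, 1.663)

print(f"total N: {sum(cohort_n):,}")
print(f"meta-analysis Z: {z_meta:.2f}, intercept-adjusted Z: {z_adjusted:.2f}")
```

Because the adjustment shrinks every Z-score by the same factor, it makes the significance threshold harder to reach without changing the ranking of SNPs.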
Dominance GWAS: Tested dominance deviations from additivity by meta-analyzing 5,870,596 autosomal SNPs available in both 23andMe (N = 2,272,216) and UKB (N = 302,037); SEs adjusted using dominance-adapted LD scores (HapMap3, 1000 Genomes Phase 1 reference). Estimated dominance SNP heritability by adapted LDSC; conducted preregistered replication checks; and estimated inbreeding depression (ID) for EA using Idscdom from dominance summary statistics.
Polygenic prediction: Constructed PGIs from a meta-analysis excluding Add Health, HRS and WLS. Main PGIs used LDpred (v1.0.11) on HapMap3 SNPs with Gaussian prior, LD from HRC reference; PGIs computed in PLINK2. Alternative PGIs used clumping+thresholding at multiple P-value cutoffs and SBayesR (GCTB) using ~2.5M pruned common SNPs with a four-component mixture prior. Prediction cohorts: Add Health (N ~5,653), HRS (N ~10,843), and WLS (N ~8,395 for cognitive/grades; EA excluded due to range restriction). Incremental R² (or incremental Nagelkerke R² for binaries) was computed by adding the PGI to regressions controlling for sex, age/birth year polynomials and interactions, and principal components; 95% CIs via bootstrap (1,000 reps). Prediction was also evaluated in African-ancestry subsets of HRS and Add Health.
Cross-ancestry relative accuracy: Following Wang et al., trained PGIs in UKB European-ancestry (N = 425,231) on HapMap3 SNPs (1,365,446), identified 507 lead SNPs, and evaluated accuracy in UKB African-ancestry (N = 6,514) vs European-ancestry holdout (N = 10,000). Predicted loss-of-accuracy due to LD and MAF was quantified using 1000 Genomes EUR/AFR reference.
Disease risk prediction: In UKB European-ancestry participants, assessed EA PGI predictive power for 10 common diseases (hypertension, ischemic heart disease, myocardial infarction, hypercholesterolemia, type 2 diabetes, asthma, osteoporosis, rheumatoid arthritis, migraine, major depression). Compared incremental Nagelkerke R² for EA PGI, disease-specific PGI, and both PGIs plus their interaction; covariates included sex, birth-year polynomial and interactions, 40 PCs, and batch dummies.
Within-family analyses (direct vs population effects): Used ~53,000 genotyped siblings and ~3,500 trios from UKB, Generation Scotland (GS), and Swedish Twin Registry (STR). For siblings, regressed phenotypes on within-family PGI deviation and family mean PGI; for trios, regressed phenotypes on offspring PGI controlling for both parents’ PGIs. Linear mixed models accounted for relatedness. Phenotypes and PGIs were standardized; effects interpreted as partial correlations. Meta-analyzed across datasets, applying a correction for assortative mating to estimate population effects and the difference between population and direct effects. Analyzed EA plus 22 additional health, cognitive, socioeconomic, and biomarker phenotypes.
Assortative mating analyses: Identified mate pairs in UKB (862) and GS (1,603) via genotyped parents of genotyped individuals. Tested phenotypic assortment predictions by comparing observed mate-pair PGI correlations to predicted values r_Y × r_P × r_M (phenotypic correlation times phenotype–PGI correlations of each mate), with SEs via bootstrapping. Examined residual mate-pair PGI correlations after residualizing on phenotype and up to 40 genetic PCs; further adjusted for geography (birth coordinates, UKB assessment center) and added cognitive/vocabulary proxies in GS.
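The sibling design described above (regressing the phenotype on the within-family PGI deviation and on the family mean PGI) can be illustrated with a toy simulation. This is a sketch under strong assumptions, not the study's method: it uses plain least squares rather than linear mixed models, applies no assortative-mating correction, simulates only sibling pairs, and represents all indirect and environmental pathways by a single hypothetical `family_level` coefficient on the family mean PGI. All names and parameter values are illustrative.

```python
import random


def simulate_sibling_pairs(n_families, direct, family_level, seed=0):
    """Simulate sibling pairs; returns (phenotype, PGI deviation, family-mean PGI) rows."""
    rng = random.Random(seed)
    sd = 0.5 ** 0.5  # shared and sibling-specific PGI parts each have variance 0.5
    rows = []
    for _ in range(n_families):
        shared = rng.gauss(0.0, sd)
        pgis = [shared + rng.gauss(0.0, sd) for _ in range(2)]
        family_mean = sum(pgis) / len(pgis)
        for pgi in pgis:
            phenotype = direct * pgi + family_level * family_mean + rng.gauss(0.0, 1.0)
            rows.append((phenotype, pgi - family_mean, family_mean))
    return rows


def ols_slope(x, y):
    """Slope from a univariate least-squares regression of y on x (with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def within_between_slopes(rows):
    """Deviations sum to zero within each family, so they are orthogonal to the
    family mean and the joint regression separates into two univariate slopes."""
    phenotype = [r[0] for r in rows]
    deviation = [r[1] for r in rows]
    family_mean = [r[2] for r in rows]
    return ols_slope(deviation, phenotype), ols_slope(family_mean, phenotype)


rows = simulate_sibling_pairs(n_families=20_000, direct=0.20, family_level=0.15, seed=1)
within, between = within_between_slopes(rows)
print(f"within-family slope:  {within:.3f}  (simulated direct effect: 0.20)")
print(f"between-family slope: {between:.3f}  (simulated direct + family-level: 0.35)")
```

In this toy model the within-family slope recovers the simulated direct effect because siblings share the family-level component, while the between-family slope also absorbs that component.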
Key Findings
- Loci discovery: 3,952 approximately independent genome-wide-significant autosomal SNPs for EA at P < 5×10⁻⁸ (3,277 at P < 1×10⁻⁸); COJO identified 2,925 signals, with 41 likely secondary within loci.
- Effect sizes: Winner’s-curse-adjusted effects are small; the median lead SNP corresponds to ~1.4 weeks more schooling per reference allele (5th/95th percentiles: ~0.9/3.5 weeks).
- Polygenic prediction (EA): The LDpred PGI explained 15.8% of variance in EduYears in Add Health and 12.0% in HRS (sample-size-weighted mean ~13.3%). The SBayesR PGI improved this to 17.0% (Add Health) and 12.9% (HRS) (sample-size-weighted mean ~14.3%). Across PGI deciles, college completion ranged from ~7% in the lowest decile to ~71% (Add Health) and ~53% (HRS) in the highest; individual-level prediction remains limited.
- Prediction of related outcomes: In Add Health, the PGI explained 8.7% of Peabody verbal test variance and 12.3% of overall GPA; in WLS, 6.1% of Henmon–Nelson and 7.7% of high-school grades percentile variance.
- Cross-ancestry accuracy: In African-ancestry samples, incremental R² for EduYears was 1.3% (HRS; 95% CI 0.6–2.2%) and 2.3% (Add Health; 95% CI 1.1–3.7%), implying relative accuracies of ~11% and ~15% vs European-ancestry. In UKB, LD/MAF differences explain only part of the loss; the remaining loss is likely due to G×E, gene–environment correlation differences, assortative mating, and environmental variance.
- Disease prediction: The EA PGI significantly predicted 10 diseases in UKB Europeans (all P < 3×10⁻⁸), with mean incremental Nagelkerke R² ~0.63%, compared to ~1.19% for disease-specific PGIs. EA and disease-specific PGIs contributed roughly additively to prediction; a higher EA PGI was associated with lower disease risks.
- Direct vs population effects: For EA, the ratio of direct to population effect estimates was 0.556 (s.e. 0.020), implying that ~30.9% of the PGI's R² (the squared ratio, 0.556²) is due to direct effects. Comparators: height 0.910 (s.e. 0.009; ~82.8% of R² direct), BMI 0.962 (s.e. 0.017; ~92.5%), cognitive performance 0.824 (s.e. 0.033; ~67.9%). Across 22 other phenotypes predicted by the EA PGI, the inverse-variance-weighted mean ratio was 0.588 (s.e. 0.013).
- Assortative mating: The observed mate-pair EA PGI correlation of 0.175 (s.e. 0.020) greatly exceeded the phenotypic assortment prediction of 0.031 (s.e. 0.004). Residualizing on EA reduced the correlation by only ~37% to 0.110 (s.e. 0.021). Adjusting for 40 PCs further reduced it to 0.091 (s.e. 0.021); additional adjustments for geography and cognition/vocabulary explained some of the correlation but left a substantial residual. For height, the observed 0.106 (s.e. 0.020) was close to the predicted 0.087 (s.e. 0.007).
- Dominance GWAS: No genome-wide-significant dominance SNPs; 80% power to detect dominance effects with R² ≥ 0.0015%, over an order of magnitude smaller than the largest additive effects (R² ~0.04%). The dominance SNP heritability estimate was 0.00015 (s.e. 0.00024; P = 0.54), indistinguishable from zero. Inbreeding depression estimate: offspring of first cousins have ~1.0 fewer months of EA (P = 0.04).
- X chromosome: Identified 57 lead X-chromosome SNPs; X-chromosome SNP heritability ~0.4% and male–female genetic correlation ~0.94 (s.e. 0.03), consistent with prior work.
- LD score regression: The intercept for autosomes was 1.66, suggesting ~7% of inflation due to confounding; analyses are reported using intercept-adjusted statistics.
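Several of the summary figures above follow from simple arithmetic on the reported estimates, and a short script makes that explicit. The sketch below rests on two assumptions: that the cohort means are weighted by the approximate prediction-sample sizes given in the methodology, and that the share of R² attributable to direct effects is the square of the direct-to-population ratio (which reproduces the height, BMI, and cognitive performance percentages listed above). Function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def direct_share_of_r2(ratio):
    """Share of PGI R² due to direct effects, assumed to be the squared ratio."""
    return ratio ** 2


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Approximate prediction-cohort sizes (Add Health, HRS) from the methodology.
cohort_n = [5_653, 10_843]

ldpred_mean = weighted_mean([15.8, 12.0], cohort_n)
sbayesr_mean = weighted_mean([17.0, 12.9], cohort_n)
print(f"LDpred weighted mean R²:  {ldpred_mean:.1f}%")
print(f"SBayesR weighted mean R²: {sbayesr_mean:.1f}%")

# Relative accuracy in African-ancestry samples vs the same cohort's European-ancestry R².
print(f"HRS relative accuracy:        {1.3 / 12.0:.1%}")
print(f"Add Health relative accuracy: {2.3 / 15.8:.1%}")

# Direct-effect share of R² for each reported direct-to-population ratio.
for trait, ratio in [("EA", 0.556), ("height", 0.910), ("BMI", 0.962),
                     ("cognitive performance", 0.824)]:
    print(f"{trait}: ratio {ratio:.3f} -> direct share of R² {direct_share_of_r2(ratio):.1%}")

# Drop in mate-pair PGI correlation after residualizing on EA.
print(f"reduction after residualizing on EA: {relative_reduction(0.175, 0.110):.0%}")
```

Squaring matters most when the ratio is far from one: a ratio a little above one half corresponds to less than a third of the explained variance.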
Discussion
The expanded GWAS substantially improves power for locus discovery and polygenic prediction of educational attainment, enabling finer-grained analyses of genetic architecture and sources of prediction. The EA PGI shows nontrivial predictive power for educational, cognitive, and disease outcomes, but within-family analyses reveal that its direct effect on EA is only about 56% as large as its population effect, so direct genetic effects account for roughly 31% of its R² (the squared ratio). The remainder likely arises from indirect genetic effects (genetic nurture), broader gene–environment correlations, and assortative mating, which also amplifies variance components correlated with the PGI. The observed mate-pair EA PGI correlation far exceeds what is expected under phenotypic assortment solely on EA, indicating additional assortment on PGI-correlated factors (e.g., personality, cognitive proxies) and social homogamy related to ancestry and geography. Dominance deviations from additivity among common variants appear negligible for EA at current power, and inbreeding-related directional dominance yields modest reductions in EA. Collectively, the findings address key questions about the nature of polygenic prediction, the contributions of direct versus indirect genetic effects, and the roles of assortative mating and non-additive genetics. They inform behavior-genetic modeling and indicate that models assuming substantial dominance, zero gene–environment correlation, or purely phenotypic assortment are likely misspecified for EA.
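The phenotypic assortment benchmark referred to above predicts the mate-pair PGI correlation as the product of the mates' phenotypic correlation and each mate's phenotype–PGI correlation. The sketch below shows that product with made-up inputs (the component correlations are not given in this summary) and then compares the observed and predicted correlations reported in the key findings. Function and variable names are hypothetical.

```python
def predicted_mate_pgi_correlation(r_mates_phenotype, r_pgi_phenotype_a, r_pgi_phenotype_b):
    """Mate-pair PGI correlation expected if mates assort only on the phenotype."""
    return r_mates_phenotype * r_pgi_phenotype_a * r_pgi_phenotype_b


# Hypothetical inputs for illustration only; they are not estimates from the study.
example = predicted_mate_pgi_correlation(0.6, 0.3, 0.3)
print(f"illustrative prediction: {example:.3f}")

# Observed vs predicted mate-pair PGI correlations as reported in the key findings.
reported = {
    "EA": {"observed": 0.175, "predicted": 0.031},
    "height": {"observed": 0.106, "predicted": 0.087},
}
excess = {trait: v["observed"] / v["predicted"] for trait, v in reported.items()}
for trait, ratio in excess.items():
    print(f"{trait}: observed / predicted = {ratio:.1f}")
```

A ratio near one, as for height, is what assortment on the phenotype alone would produce; the much larger ratio for EA is the pattern the discussion attributes to assortment on other PGI-correlated factors.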
Conclusion
This study presents the largest EA GWAS to date (N > 3 million), identifying 3,952 approximately independent genome-wide-significant autosomal SNPs and 57 lead SNPs on the X chromosome, and delivering PGIs that explain up to ~17% of EA variance in European-ancestry samples. Within-family analyses show that the PGI's direct effects on EA and related traits are only a little over half as large as its population effects, which corresponds to roughly a third of its R²; the remainder is due to indirect genetic effects and gene–environment correlations, augmented by assortative mating. Dominance effects from common variants are essentially negligible. These results advance understanding of EA’s genetic architecture, broaden the utility of EA PGIs for research in social science and medicine, and provide empirical constraints for behavior-genetic models. Future work should expand ancestrally diverse GWAS to improve cross-population predictive equity, increase sample sizes for better power to detect effect-size heterogeneity and potential epistasis, and develop methods that more fully disentangle direct, indirect, and environmental contributions across populations and contexts.
Limitations
- Ancestry portability: PGI predictive accuracy is much lower in African-ancestry samples (relative accuracy ~11–15%), and LD/MAF differences do not fully explain the loss; results may not generalize across ancestries.
- Within-family power: Family-based samples (~53k siblings, ~3.5k trios) limited power for disease outcomes; sibling-based direct-effect estimates can be biased by sibling indirect effects, though these appear small for studied traits.
- Dominance scope: Null dominance findings pertain to common variants; substantial dominance from rare variants cannot be excluded.
- Assortative mating interpretation: While adjustments for EA, PCs, geography, and cognition reduced mate-pair PGI correlations, a sizeable residual remains with multiple plausible contributors (social homogamy, unmeasured traits), limiting causal attribution.
- Phenotype coding and measurement: Although refined (e.g., UKB recoding), EA mapping to years may introduce measurement error; some cohorts included participants aged <30 (in 23andMe), though simulations suggest minimal impact.
- Diminishing returns for PGI R²: Further increases in sample size yield diminishing returns for prediction, though they enable other analyses (e.g., cross-phenotype/population effect heterogeneity, epistasis).