Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan


D. McGuire, H. Markus, et al.

Discover groundbreaking research by Daniel McGuire and colleagues at Penn State College of Medicine, revealing how genetics and environmental factors converge to influence disease risk. This study leverages a vast dataset to provide insights into the causal effects of air pollution on a range of diseases, showcasing innovative modeling strategies that can redefine our understanding of health.

Introduction
The study addresses how to disentangle and quantify genetic versus environmental contributions to complex diseases, a longstanding question complicated by limited datasets that jointly capture genetics and environmental exposures. Traditional family-based variance components models and twin studies can be confounded by shared but unmeasured community-level environmental factors, inflating heritability. GWAS-based chip heritability estimates using unrelated individuals differ conceptually, may suffer from geographic confounding, and typically do not quantify community-level environmental effects. To overcome these gaps, the authors introduce a spatial mixed linear effect (SMILE) model that uses participants’ approximate geographic locations as proxies for spatially correlated community-level environmental risks, enabling joint estimation of genetic heritability and environmental components. They apply SMILE to 1,083 PheWAS-defined diseases in >50 million individuals from MarketScan, integrate external environmental data (PM₂.₅, NO₂, climate, sociodemographics), and extend the approach (SMILE-2) to test causal effects of air pollution using wind speed and direction as instrumental variables.
Literature Review
The paper summarizes prior approaches: (1) family-based and twin designs that partition phenotypic variance into genetic and environmental components but often assume independence of environmental exposures across families, ignoring spatially correlated community-level risks; (2) GWAS and linear mixed models with unrelated individuals that estimate chip heritability but can be biased by geographic confounding and do not quantify community-level environment; and (3) studies estimating environmental impacts without modeling genetic relatedness, thus lacking joint estimation of heritability and environmental contributions. These limitations motivate a framework that explicitly models shared community-level environmental variance alongside familial and genetic structure.
Methodology
- Data and cohort: The IBM MarketScan health insurance claims database was used to assemble 257,620 quality-filtered nuclear families (parents and children) enrolled ≥6 years (2005–2017), with all children ≥10 years at entry. Inpatient/outpatient ICD-9/10 codes were mapped to PheWAS codes (1,083 binary disease outcomes). Individual-level covariates included year of birth, sex, and approximate location (U.S. county or Metropolitan Statistical Area, MSA). Community-level risk factors from external sources were linked by location: ACS sociodemographics, satellite-derived PM₂.₅ and NO₂, and wind speed/direction.
- SMILE model: A linear mixed effects framework with random effects for genetic relatedness, shared family environment (parents and children separately), and shared community-level spatial environment. Model: Y = Xπ + u + Z_s u_s + Z_par u_par + Z_child u_child + ε, where Y is a binary disease indicator vector; X includes fixed covariates; u captures genetic random effects determined by within-family kinship (individuals in different families assumed unrelated); u_par and u_child capture within-family shared environmental effects for parents and for children; u_s captures community-level spatial random effects shared by families in the same location. The design matrices Z_s, Z_par, and Z_child map individuals to the corresponding random effects.
- Spatial covariance: Community-level effects were modeled under independent normal (IND), conditional autoregressive (CAR), or simultaneous autoregressive (SAR) covariance structures. Adjacency (sharing borders) among counties/MSAs defined the weight matrix (W), standardized by row. Using M as a diagonal normalizing matrix, covariance matrices were defined as Σ_CAR = σ²{I − (1 − ρ)M⁻¹W}⁻¹M⁻¹ and Σ_SAR = σ²{(I − ρM⁻¹W)⁻¹(I − ρM⁻¹W′)⁻¹}. A Gower factor was used to report averaged variance explained by spatial random effects across individuals. Variance components estimated on the observed scale were converted to the liability scale for comparability across traits with differing prevalence.
- Estimation and model selection: Models were fit using the TMB R package with Laplace approximation and automatic differentiation. Bayesian Information Criterion (BIC) was used to compare specifications (e.g., with/without spatial effects; alternative spatial covariance structures). Extensive simulations sampled families and covariates from MarketScan to emulate realistic spatial distributions, prevalence, risk factors, and confounders, varying variance components and prevalence. Eight model variants combining genetic (G), parental (P), child (C), and spatial (S) components were evaluated (e.g., GPC+S full model), with performance assessed via bias, MSE, and BIC selection accuracy. Robustness to family relationship misclassification and varying enrollment length was assessed.
- Causal inference (SMILE-2): A two-stage instrumental variable regression to estimate causal effects of air pollution (PM₂.₅, NO₂). Instruments: long-term averaged wind speed and direction at each location. Assumptions verified conceptually: instrument relevance (wind correlates with pollution), exogeneity (uncorrelated with unmeasured confounders in disease model), and exclusion restriction (no direct effect on disease). Stage 1 regresses pollution on instruments; Stage 2 uses predicted pollution in a mixed model akin to SMILE: Y = P̂β + Xπ₂ + u + Z_par u_par + Z_child u_child + ε. Simulations compared SMILE-2 to a fixed-effects independent-individuals two-stage model (IND-FE) to assess type I error, power, and MSE under realistic confounding and variance components, and evaluated robustness to measurement noise in wind/pollution.
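The two-stage logic behind SMILE-2 can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it uses a single instrument, a continuous outcome, ordinary least squares in both stages, and no random effects, whereas the study fits Stage 2 as a mixed model on binary outcomes. All variable names, effect sizes, and the simulated data are assumptions made for illustration only.

```python
import random
from statistics import fmean


def ols_fit(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def two_stage_estimate(instrument, exposure, outcome):
    """Two-stage estimate of the exposure effect on the outcome."""
    # Stage 1: regress the exposure (pollution) on the instrument (wind).
    a1, b1 = ols_fit(instrument, exposure)
    predicted = [a1 + b1 * z for z in instrument]
    # Stage 2: regress the outcome on the predicted exposure.
    _, beta = ols_fit(predicted, outcome)
    return beta


# Simulated toy data (hypothetical values, not from the study).
rng = random.Random(0)
n = 20000
true_beta = 0.5
wind = [rng.gauss(0, 1) for _ in range(n)]        # instrument
confounder = [rng.gauss(0, 1) for _ in range(n)]  # unmeasured confounder
pollution = [0.8 * z + c + rng.gauss(0, 1) for z, c in zip(wind, confounder)]
outcome = [true_beta * p + c + rng.gauss(0, 1)
           for p, c in zip(pollution, confounder)]

# Naive regression is biased upward by the shared confounder.
_, naive_beta = ols_fit(pollution, outcome)
# The instrument-based estimate should land near true_beta.
iv_beta = two_stage_estimate(wind, pollution, outcome)
print(f"naive={naive_beta:.3f}  two-stage={iv_beta:.3f}  truth={true_beta}")
```

In this toy setup the confounder raises both pollution and the outcome, so the naive slope overstates the effect, while the wind-based estimate recovers it because wind is generated independently of the confounder. That independence is exactly the exogeneity assumption listed above.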
Key Findings
- Model performance and bias: In simulations with 250,000 quad-families, BIC selected the correct variance component specification in 81.8% of replicates (71.1% with 50,000). When community-level spatial effects exist, omitting them inflates heritability; the bias grows with the spatial variance component. The full SMILE (GPC+S) showed minimal bias and MSE across variance components and remained unbiased even when a component (e.g., parental environment) was absent in truth. Heritability estimates were robust to pedigree misclassification and measurement noise in wind/pollution.
- Empirical variance components (1,083 traits): For 1,021/1,083 (94.3%) phenotypes, the model including spatial effects (SMILE) had better BIC than models without spatial effects. Adding spatial variance generally decreased estimated heritability (median decrease 0.03; IQR 0.018–0.051), confirming the importance of modeling community-level environment. Among traits selecting spatial models, CAR was chosen for 783 (76.7%), SAR for 203 (19.9%), and IND for 35 (3.4%). Heritability and spatial variance estimates were nearly identical across CAR/SAR/IND structures (mean absolute difference vs CAR ≈ 0.002).
- Disease category landscape: Highest average heritability observed for hematopoietic traits and congenital anomalies. Largest spatial variance components were seen for parasitic infections (e.g., Lyme disease) and allergic reactions (e.g., contact dermatitis due to plants, dermatitis due to solar radiation), consistent with known spatial/environmental drivers.
- Cross-study comparisons: SMILE heritability estimates correlated with prior family-based and chip heritability studies (e.g., CaTCH, NY, MS, LDSC-UKB; see Table 2), but SMILE typically yielded smaller heritabilities than family-based models not accounting for community effects, consistent with reduced upward bias. Example: Type 2 diabetes heritability decreased from 37.7% (without spatial) to 28.4% (with spatial), and to 31% with medication-enhanced case definition; obesity decreased from 53.1% to 46.3%. Spatial random effects for T2D correlated strongly with those for obesity (r≈0.67) and lipid traits (hyperlipidemia r≈0.75, hypercholesterolemia r≈0.61).
- SES and environment correlations: For 115 diseases (prevalence ≥2% and spatial variance ≥2%), spatial random effects correlated with community-level factors. Many diseases showed higher risk with lower SES indicators (lower median income, higher poverty, lower education), including obesity, diabetes, chronic liver disease, COPD, influenza, and fever. Some traits associated with higher SES (e.g., benign skin neoplasms, hemorrhoids, adjustment reaction), potentially reflecting care access and ascertainment patterns.
- Causal effects of air pollution (SMILE-2): Across 1,083 diseases, after Bonferroni correction, 135 (12.5%) had significant causal associations with PM₂.₅, NO₂, or their standardized sum (Psum). Of these, 105/135 (77.8%) had positive effects (higher pollution increases risk). Distinct pollutant-disease patterns were observed. PM₂.₅-only associations included multiple sleep disorders (e.g., hypersomnia OR=1.13, P=5.4×10⁻¹⁷; obstructive sleep apnea OR=1.04, P=1×10⁻⁸; parasomnia OR=1.07, P=1.2×10⁻⁶; narcolepsy OR=1.13, P=5.3×10⁻⁷), respiratory infections (acute sinusitis OR=1.07, P=3.8×10⁻²⁴; acute bronchitis/bronchiolitis OR=1.05, P=1.1×10⁻¹⁸), ear infections (otitis media OR=1.04, P=1.2×10⁻²³), and ADHD (OR=1.04, P=1.4×10⁻⁷). NO₂-only associations included gastrointestinal disorders (gastritis OR=1.10, P=4.4×10⁻⁷; IBS OR=1.10, P=8.1×10⁻⁷), type 1 and type 2 diabetes (ORs=1.15 and 1.17; P=2.5×10⁻⁶ and 2.3×10⁻⁶), and lipid disorders (hyperlipidemia OR=1.09, P=1.3×10⁻⁷; hypercholesterolemia OR=1.10, P=7.5×10⁻⁶). PM₂.₅ was also causally linked with MRSA infection in the general population (OR=1.05, P=5.1×10⁻⁶). Simulations showed SMILE-2 controlled type I error and had higher power and lower MSE than fixed-effects two-stage models, especially at low prevalence or larger effects.
- Robustness: Variance component estimates were strongly correlated between subsets with 6–7 years versus 10–12 years enrollment, indicating minimal impact of reduced enrollment length on estimates.
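The Bonferroni screen and the reported proportions can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the summary does not give the family-wise error level or the exact number of tests, so `ALPHA = 0.05` and one test per disease (1,083) are assumed here. The p-values are a subset of those quoted above.

```python
ALPHA = 0.05        # assumed family-wise error level (not stated in the summary)
N_DISEASES = 1083   # number of PheWAS-defined diseases tested

# Per-test significance threshold under Bonferroni correction.
threshold = ALPHA / N_DISEASES

# A subset of the p-values reported in the findings.
reported_p = {
    "hypersomnia (PM2.5)": 5.4e-17,
    "acute sinusitis (PM2.5)": 3.8e-24,
    "otitis media (PM2.5)": 1.2e-23,
    "gastritis (NO2)": 4.4e-7,
    "type 2 diabetes (NO2)": 2.3e-6,
    "hypercholesterolemia (NO2)": 7.5e-6,
    "MRSA infection (PM2.5)": 5.1e-6,
}
significant = {name: p < threshold for name, p in reported_p.items()}

# Proportions quoted in the text.
share_significant = round(100 * 135 / N_DISEASES, 1)  # diseases with a hit
share_positive = round(100 * 105 / 135, 1)            # hits with positive effect

print(f"threshold = {threshold:.2e}")
print(f"all listed associations pass: {all(significant.values())}")
print(f"{share_significant}% significant, {share_positive}% positive")
```

Under these assumptions the per-test threshold is about 4.6×10⁻⁵, and every association listed passes it. If the correction also counted the three exposures (PM₂.₅, NO₂, Psum) separately, the threshold would be three times stricter, which the listed p-values would still meet.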
Discussion
The findings demonstrate that spatially correlated community-level environmental factors substantially contribute to disease risk and, if unmodeled, inflate family-based heritability estimates. By incorporating spatial random effects, SMILE refines heritability estimates and reconciles discrepancies between family-based and GWAS-derived estimates. The approach is particularly valuable for large EHR and claims datasets that lack genotype information, complementing genome-wide interaction studies by capturing unmeasured environmental influences. SMILE-2 further reveals that different air pollutants can have distinct causal relationships with disease classes (e.g., PM₂.₅ with respiratory infections and sleep disorders; NO₂ with gastrointestinal, metabolic, and lipid conditions), offering mechanistic hypotheses and informing targeted environmental and public health policies. Extensive simulations and robustness checks support the validity and stability of variance component and causal effect estimates across modeling choices, phenotype definitions, and measurement noise. These results underscore the importance of jointly modeling genetic relatedness and community-level environment to accurately characterize disease etiology across the phenome.
Conclusion
The study introduces SMILE, a spatial mixed linear effects framework that jointly estimates genetic heritability and community-level environmental variance using EHR-derived phenotypes, and SMILE-2, an instrumental-variable extension for causal inference of air pollution effects. Applied to >50 million individuals in MarketScan across 1,083 diseases, the methods improved model fit for most traits, reduced inflated heritability estimates by accounting for spatial environment, mapped the disease-category landscape of genetic versus environmental contributions, and identified 135 significant pollutant-disease causal relationships with pollutant-specific patterns. These resources and methods provide a foundation for future research leveraging national biobanks and claims data. Future directions include modeling longitudinal and time-to-event outcomes, incorporating temporal dynamics of pollutant exposures, and extending spatial random effects to enhance power in genetic association studies.
Limitations
- Sample representativeness: MarketScan includes individuals with employer-sponsored insurance, underrepresenting low-income populations; findings should be generalized beyond this demographic with caution.
- Limited observation window: Claims/EHR data cover finite enrollment periods (children observed only up to age 26). This may lower observed prevalence for late-onset conditions in children; the study mitigated this via enrollment-duration filters, age thresholds, and inclusion of age and age² covariates.
- Phenotype accuracy and ascertainment: EHR-derived diagnoses can be incomplete or heterogeneous across providers, and some conditions (e.g., substance use) may be under-diagnosed. Spatial random effects may partially control for such documentation biases but cannot eliminate them.
- Causal inference caveats: IV-based conclusions rely on instrument assumptions and temporality. Pollution exposures were represented by long-term averages; transient changes (e.g., COVID-19-era NO₂ reductions) are not modeled. Some negative associations (e.g., with certain infections) may reflect differential ascertainment or access to care rather than protective effects of pollution.