Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan

D. McGuire, H. Markus, et al.

This study by Daniel McGuire and colleagues at Penn State College of Medicine examines how genetic and environmental factors jointly influence disease risk. Using insurance claims data on more than 50 million individuals, it estimates heritability with a model that accounts for spatially correlated environmental factors and tests the causal effects of air pollution on a range of diseases.

Introduction
The relative contributions of genetic and environmental factors to complex traits remain a subject of considerable debate. Traditional family-based studies and variance components models have been used to partition phenotypic variance, but these often assume independence of environmental exposures across families. However, many environmental risk factors, such as air pollution, climate, and sociodemographic characteristics, are spatially correlated, leading to potential bias in heritability estimates. Twin studies, while informative, may also be susceptible to such confounding, and often don't explicitly account for community-level environmental effects. Genome-wide association studies (GWAS) using unrelated individuals offer larger sample sizes and reduced shared environmental confounding, but geographical confounding can still bias chip heritability estimates. Existing studies often lack the scale and quality of data necessary to resolve these challenges. This study addresses these limitations by using a large electronic health record (EHR) dataset to jointly model genetic and environmental effects. The researchers developed a new model, the Spatial Mixed Linear Effect (SMILE) model, to address the shortcomings of previous approaches. This model allows the investigators to simultaneously consider the roles of genetics and spatially correlated environmental factors in the determination of disease susceptibility. The study leverages the large scale and depth of the MarketScan database in this endeavor, seeking to generate refined estimates of genetic heritability and environmental contributions for a wide range of diseases. The use of instrumental variables such as wind speed and direction allows the investigators to move beyond simple correlations and assess causal effects of environmental pollution, providing an enhanced understanding of how environmental exposures influence disease risk in large populations. 
This approach can significantly improve our understanding of disease etiology and inform the development of effective preventive and therapeutic strategies.
Literature Review
The literature extensively highlights the complex interplay between genetic predisposition and environmental factors in shaping disease risk. Early family-based studies using variance components models attempted to dissect these influences, but often suffered from limitations due to the assumption of independent environmental exposures between families. This assumption is unrealistic for spatially correlated factors like air pollution, climate, or socioeconomic status. Twin studies, while valuable, may also be influenced by environmental confounding. Genome-wide association studies (GWAS) using unrelated individuals address some of these limitations due to their larger sample sizes, but geographical confounding can still impact results. Furthermore, GWAS chip heritability estimates differ conceptually from those in family-based studies. Prior research has explored the impact of specific environmental exposures on disease risk, yet frequently lacked simultaneous consideration of genetic relatedness, preventing comprehensive assessment of heritability and environmental contribution. The authors reviewed the strengths and limitations of previous methods, highlighting the need for a more robust and comprehensive approach such as the one proposed.
Methodology
The researchers developed a novel statistical model, the Spatial Mixed Linear Effect (SMILE) model, to jointly estimate genetic heritability and environmental contributions to disease risk. This model uses the geographical locations of participants as a proxy for spatially correlated community-level environmental risk factors. The SMILE model incorporates random effects to capture phenotypic variation due to genetic relatedness, shared family environment, and shared community-level environment. The model is expressed as: $Y = X\pi + u + Z_c u_c + Z_{par} u_{par} + Z_{child} u_{child} + \varepsilon$. Here, $Y$ represents the disease status (0/1), $X$ is the design matrix for fixed-effect individual-level covariates (year of birth, sex), $u$ represents genetic random effects (correlation determined by genetic relatedness), $u_{par}$ and $u_{child}$ are random effects for shared parental and children-level family environment, and $u_c$ is the random effect for community-level shared environment (families in the same location share the same effect). The spatial structure of the community-level effects is modeled using independent normal distribution (IND), conditional autoregressive (CAR), or simultaneous autoregressive (SAR) distributions. The model was applied to data from 257,620 nuclear families extracted from the MarketScan database, encompassing 1083 disease outcome measurements. External datasets were integrated to incorporate community-level risk factors, including PM2.5 and NO2 levels, climate data, and socioeconomic variables. To assess causal effects of air pollution, the researchers used a two-stage regression framework (SMILE-2), employing wind speed and direction as instrumental variables. In the first stage, pollution levels were regressed against instrumental variables, and in the second stage, SMILE-2 tested for the causal effect of pollution using the predicted pollution levels from the first stage.
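The two-stage idea can be illustrated with a minimal sketch in Python. This is plain two-stage least squares on simulated data, not SMILE-2 itself: it omits the genetic, family, and community random effects, uses a continuous outcome for simplicity, and every name and coefficient below (the two "wind" instruments, the unobserved confounder, the true effect of 0.5) is a hypothetical choice made only for illustration.

```python
import numpy as np

def two_stage_least_squares(outcome, exposure, instruments):
    """Plain 2SLS: fit exposure on instruments, then outcome on fitted exposure."""
    n = len(outcome)
    Z = np.column_stack([np.ones(n), instruments])
    # Stage 1: predict the exposure (pollution) from the instruments (wind).
    gamma, *_ = np.linalg.lstsq(Z, exposure, rcond=None)
    exposure_hat = Z @ gamma
    # Stage 2: regress the outcome on the predicted exposure only.
    X2 = np.column_stack([np.ones(n), exposure_hat])
    beta, *_ = np.linalg.lstsq(X2, outcome, rcond=None)
    return beta[1]  # slope on the predicted exposure

def simulate(n=20000, true_effect=0.5, seed=0):
    """Hypothetical data: wind drives pollution; a hidden confounder drives both."""
    rng = np.random.default_rng(seed)
    wind = rng.normal(size=(n, 2))        # stand-ins for wind speed and direction
    confounder = rng.normal(size=n)       # unobserved, affects pollution and outcome
    pollution = wind @ np.array([0.8, -0.6]) + confounder + rng.normal(size=n)
    outcome = true_effect * pollution + confounder + rng.normal(size=n)
    return outcome, pollution, wind

outcome, pollution, wind = simulate()
naive = np.polyfit(pollution, outcome, 1)[0]          # ordinary regression slope
iv = two_stage_least_squares(outcome, pollution, wind)
print(f"naive slope: {naive:.3f}, two-stage slope: {iv:.3f}")
```

In this simulation the ordinary regression slope is pulled away from the true effect by the hidden confounder, while the two-stage estimate stays close to it, because the simulated instruments influence the outcome only through pollution. That exclusion assumption is what any real analysis has to defend.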
Extensive simulations were conducted to evaluate the accuracy and robustness of the SMILE model, varying sample size, disease prevalence, and model specifications. Model selection was based on the Bayesian Information Criterion (BIC). The causal inference approach was assessed in simulations, comparing SMILE-2 to a standard two-stage model ignoring genetic relatedness. Finally, the SMILE model was used to analyze the 1083 diseases in the MarketScan dataset, comparing models with and without spatial random effects. Correlations between spatial random effects and community-level environmental risk factors were also investigated.
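To make the random-effect structure concrete, the sketch below builds the phenotypic covariance matrix implied by a model of this form for a few toy nuclear families, using the independent (IND) structure for community effects. It rests on several assumptions that are not taken from the source: parents are unrelated, expected relatedness is 1/2 for parent-child and sibling pairs, the parental effect is shared only by the two parents, the child effect is shared only by siblings, and the variance values are arbitrary. The function and variable names are hypothetical.

```python
import numpy as np

def nuclear_family_relatedness(n_children):
    """Expected relatedness: two unrelated parents followed by full siblings."""
    size = 2 + n_children
    K = np.full((size, size), 0.5)   # parent-child and sibling pairs
    K[0, 1] = K[1, 0] = 0.0          # assumption: parents are unrelated
    np.fill_diagonal(K, 1.0)
    return K

def shared(labels):
    """1 where two individuals carry the same non-negative label, else 0."""
    same = labels[:, None] == labels[None, :]
    valid = (labels[:, None] >= 0) & (labels[None, :] >= 0)
    return (same & valid).astype(float)

def toy_covariance(children_per_family, community_of_family, var):
    """Covariance of Y under independent (IND) community effects."""
    n = sum(2 + c for c in children_per_family)
    K = np.zeros((n, n))
    community = np.empty(n, dtype=int)
    couple = np.full(n, -1)    # family id for parents, -1 for children
    sibship = np.full(n, -1)   # family id for children, -1 for parents
    start = 0
    for fam, (n_children, comm) in enumerate(
            zip(children_per_family, community_of_family)):
        stop = start + 2 + n_children
        K[start:stop, start:stop] = nuclear_family_relatedness(n_children)
        community[start:stop] = comm
        couple[start:start + 2] = fam
        sibship[start + 2:stop] = fam
        start = stop
    return (var["genetic"] * K
            + var["community"] * shared(community)
            + var["parental"] * shared(couple)
            + var["child"] * shared(sibship)
            + var["residual"] * np.eye(n))

# Hypothetical variance components, chosen only for illustration.
var = {"genetic": 0.30, "community": 0.05, "parental": 0.10,
       "child": 0.10, "residual": 0.45}
# Three families; the first two live in community 0, the third in community 1.
V = toy_covariance([2, 1, 1], [0, 0, 1], var)
print("siblings:", V[2, 3], "spouses:", V[0, 1], "neighbours:", V[0, 4])
```

Under these assumptions, two unrelated people in the same community still covary through the community term. A model without that term has to absorb the shared community signal elsewhere, which is consistent with the upward bias in heritability that the simulations report.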
Key Findings
The SMILE model demonstrated superior performance in variance component estimation compared to models ignoring community-level effects. Simulations showed that models failing to account for spatial effects resulted in upwardly biased heritability estimates. The full SMILE model yielded minimal bias and mean squared error (MSE) across various scenarios. In the MarketScan data analysis, the SMILE model (with spatial random effects) was selected as the best model for 94.3% of phenotypes based on BIC values. Incorporating spatial effects generally decreased estimated heritability, confirming the influence of shared community-level environment. The choice of spatial covariance structure (CAR, SAR, IND) had minimal impact on heritability estimates but affected estimates of spatial variance components. The study observed a consistent decrease in heritability estimates after accounting for spatial community effects, with a median decrease of 0.03 and IQR of (0.018, 0.051). For example, type 2 diabetes heritability decreased from 37.7% to 28.4%. Heritability estimates from SMILE showed significant correlations with published studies but generally yielded smaller estimates, aligning with the expectation of reduced bias due to accounting for spatial effects. Strong correlations were observed between spatial random effects and lower socioeconomic status (SES) indicators for many diseases, potentially reflecting disparities in healthcare access. SMILE-2 identified 135/1083 phenotypes with significant causal associations with PM2.5 or NO2 pollution after Bonferroni correction. PM2.5 and NO2 tended to affect distinct disease categories, with PM2.5 significantly associated with sleep disorders and respiratory infections, while NO2 was associated with gastrointestinal disorders and diabetes. The findings highlight the importance of considering both genetic and environmental factors, particularly spatially correlated ones, when studying complex diseases. 
The causal inference analyses further revealed the distinct effects of different air pollutants on various disease categories, suggesting potentially different biological mechanisms.
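For readers unfamiliar with the Bonferroni correction mentioned above, here is a small Python sketch of how such a threshold works. The significance level of 0.05 and the use of 1083 as the number of tests are assumptions for illustration; the summary does not state the exact number of tests or the level the authors used, and the p-values below are made up.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

def passes_bonferroni(p_values, alpha, n_tests):
    """Flag each p-value that falls below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return [p < threshold for p in p_values]

# Assumed inputs: alpha = 0.05 and one test per phenotype (1083 phenotypes).
threshold = bonferroni_threshold(0.05, 1083)
example_p_values = [1e-6, 1e-4, 0.03]   # hypothetical p-values
flags = passes_bonferroni(example_p_values, 0.05, 1083)
share_significant = 135 / 1083          # counts reported in the summary
print(f"threshold: {threshold:.2e}, flags: {flags}, share: {share_significant:.1%}")
```

With these assumed inputs, only the smallest of the three example p-values clears the corrected threshold, which shows how much stricter the per-test bar becomes when roughly a thousand phenotypes are screened at once.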
Discussion
This study's findings contribute significantly to our understanding of complex disease etiology by demonstrating the importance of considering spatially correlated environmental factors alongside genetic influences. The SMILE model addresses limitations of previous approaches by jointly modeling genetic and environmental effects, reducing biases in heritability estimates. The refined heritability estimates from SMILE can help reconcile discrepancies between family-based studies and GWAS using unrelated individuals. SMILE's applicability to EHR data without genotype information makes it uniquely suitable for analyzing large-scale insurance claim databases. The identification of distinct causal effects of PM2.5 and NO2 on different disease categories provides valuable insights into the biological mechanisms underlying air pollution's impact on health. The study's findings could inform the development of more targeted preventive and therapeutic strategies, particularly in public health initiatives aiming to address environmental risk factors.
Conclusion
This research introduced the SMILE model, a statistical framework for jointly analyzing genetic and spatially correlated environmental influences on disease risk. Using a large EHR dataset, the study produced refined heritability estimates and identified significant causal effects of air pollution on a wide range of diseases. The findings highlight the importance of incorporating spatial environmental factors to reduce bias and improve the accuracy of heritability estimates. Future research could extend these methods to model time-to-event outcomes, recurrent events, and time-varying environmental exposures, further enhancing our understanding of disease mechanisms and informing public health strategies. The SMILE model offers a promising approach for analyzing national biobanks and improving our understanding of complex trait etiology.
Limitations
The study's reliance on the MarketScan database, which primarily includes individuals with employer-sponsored insurance, may limit the generalizability of findings to low-income populations. The use of EHR data also introduces potential limitations related to incomplete or inaccurate medical records, potentially affecting disease prevalence estimates. The reliance on a single, long-term average of air pollution levels may not fully capture the impact of transient environmental changes on disease risk. Furthermore, causal inference relies on assumptions that need to be carefully considered when interpreting the results.