Introduction
The COVID-19 pandemic resulted in millions of deaths globally, highlighting the need to understand how social and environmental factors shape health disparities. Many studies have explored individual stressors, but integrated frameworks that consider heterogeneous stressors at multiple scales are lacking. This research uses the socioexposome framework, which integrates exposome concepts with socioeconomic and behavioral factors to understand the impact of environmental and social factors on public health. The study focuses on New Jersey, a diverse state with varying environmental conditions and socioeconomic strata, which makes it well suited to investigating socioexposomic influences on COVID-19 outcomes. Previous studies using linear models have limitations; for example, ordinary linear regression may produce inaccurate estimates because of its Gaussian assumptions. Generalized linear regression (GLR) models such as Poisson and negative binomial regression relax these assumptions, but they may still fail to capture high variability and spatial autocorrelation. Machine learning (ML) methods can capture complex nonlinear relationships, but their interpretability is often a concern. This study aims to fill these gaps by comparing geostatistical and machine learning approaches for modeling COVID-19 outcomes at the municipality level in New Jersey.
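The concern about "high variability" in count models can be illustrated with a quick overdispersion check. The sketch below is not taken from the study; it assumes a hypothetical list of municipality death counts and computes the ratio of sample variance to sample mean, which is close to 1 for Poisson-distributed counts and well above 1 when a negative binomial or more flexible model may be more appropriate. It ignores covariates, so it is only a rough first look.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return variance(counts) / m


# Hypothetical death counts for ten municipalities (illustrative values only,
# not data from the study).
deaths = [0, 1, 0, 3, 2, 45, 1, 0, 12, 6]

ratio = dispersion_index(deaths)
overdispersed = ratio > 1.0  # variance exceeds the mean
print(f"variance/mean = {ratio:.2f}, overdispersed: {overdispersed}")
```

A ratio far above 1, as in this made-up example, is the situation in which a plain Poisson model tends to understate uncertainty.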
Literature Review
Existing literature associates COVID-19 morbidity and mortality with various environmental and social determinants of health. Most studies have focused on single stressors or groups of stressors and lack an integrated approach. The socioexposome framework is proposed as a way to incorporate biological, demographic, socioeconomic, and environmental factors together. Studies comparing linear statistical models have shown that no single method is consistently superior; the best choice depends on the data and the research question. Previous work on machine learning in environmental health often prioritizes prediction over association inference and is limited in interpretability. This study aims to address these limitations by comparing various geospatial and machine learning methods to determine their effectiveness in modeling socioexposomic patterns of COVID-19.
Methodology
This study analyzed data from 565 municipalities in New Jersey, focusing on the first wave of the pandemic (March to September 2020). Data included COVID-19 cases and deaths (with and without long-term care facility deaths) and 84 socioexposomic factors (air pollution, proximity to industrial sites, noise levels, occupation, housing characteristics, demographics, socioeconomic status, etc.) drawn from various public databases. Eight modeling approaches were used: six geostatistical models (Poisson and negative binomial regression, mixed-effect models, and Besag-York-Mollié (BYM) spatial models) and two machine learning methods (Random Forest and Extreme Gradient Boosting). A forward stepwise algorithm was used for variable selection, prioritizing variables with variance inflation factors below 5 and selecting among them based on the deviance information criterion. Shapley effects plots were used to interpret the machine learning models, and change of support validation compared model performance across different scales (municipality vs. county). A simulation analysis compared the performance of the models under various scenarios of data variability and missing data. Bayesian inference was applied to the geostatistical models, using non-informative priors and R-INLA for computational efficiency.
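The variance inflation factor (VIF) screen used in the variable selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the column names and values are hypothetical, the function names are invented for this example, and the deviance information criterion step is omitted because it requires fitting the Bayesian models. The sketch relies on the standard identity that the VIF of each predictor equals the corresponding diagonal entry of the inverse correlation matrix.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """Map each predictor name to its VIF (diagonal of the inverse correlation matrix)."""
    names = list(columns)
    corr = [[1.0 if a == b else pearson(columns[a], columns[b]) for b in names]
            for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


def passes_vif_screen(columns, candidate, selected, threshold=5.0):
    """True if adding `candidate` keeps every VIF in the set below `threshold`."""
    trial = {name: columns[name] for name in selected + [candidate]}
    return max(vifs(trial).values()) < threshold


# Hypothetical predictors for eight municipalities (illustrative values only).
data = {
    "no2": [10, 12, 15, 11, 20, 18, 14, 9],
    "traffic": [20.5, 24, 30.2, 22, 40.1, 36, 28.3, 18],  # nearly 2 * no2
    "pct_no_hs": [5, 20, 8, 15, 7, 22, 10, 12],
}

selected = ["no2"]
for candidate in ["traffic", "pct_no_hs"]:
    if passes_vif_screen(data, candidate, selected):
        selected.append(candidate)
print(selected)
```

In this made-up data the near-duplicate `traffic` column is rejected because it is almost collinear with `no2`, while `pct_no_hs` is kept. In a full forward stepwise procedure, the candidates that pass this screen would then be ranked by a model fit criterion such as the deviance information criterion.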
Key Findings
The study found robust positive associations of COVID-19 mortality with historical NO2 exposure, population density, percentage of minorities, percentage of individuals with less than a high school education, and other socioexposomic factors. Excluding long-term care facility deaths did not significantly affect the associations for most factors. However, model structure significantly influenced the findings. The best-performing geostatistical models incorporated flexible structures to account for variation in the data. Machine learning models captured association patterns consistent with those of the best-performing geostatistical models, and they additionally identified consistent nonlinear associations that the geostatistical models did not reveal. Specifically, the machine learning models indicated a saturation effect of population density on mortality rates at higher densities. The change of support validation showed that machine learning models achieved prediction accuracy comparable to that of the geostatistical models, with the negative binomial BYM spatial model performing best in terms of prediction intervals. Simulation analysis confirmed that advanced geostatistical models (especially BYM spatial models) were more accurate than simpler models, while machine learning models efficiently captured nonlinear trends.
Discussion
This study demonstrates the importance of considering multiple social and environmental determinants when assessing COVID-19 health outcomes. The findings highlight the influence of model structure on results, emphasizing the need for models that appropriately address data variability and spatial autocorrelation. The comparison of geostatistical and machine learning models reveals that the two approaches are complementary. Geostatistical models, particularly the BYM spatial models, perform best when associations are approximately linear, while machine learning models provide valuable insights into nonlinear associations. Using the methods together improves the understanding of complex patterns. The results align with previous research indicating a significant impact of air pollution, socioeconomic factors, and demographic characteristics on COVID-19 severity. The study provides evidence to support public health interventions that address social and environmental inequalities in order to mitigate the effects of future pandemics.
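For readers unfamiliar with how a BYM model handles both data variability and spatial autocorrelation, the textbook Besag-York-Mollié formulation with a negative binomial likelihood can be written as below. This is the standard form and is offered as an assumption; the exact parameterization and priors used in the study are not given in this summary. Here $y_i$ is the death count in municipality $i$, $E_i$ an expected count or population offset, $\mathbf{x}_i$ the socioexposomic covariates, $r$ the negative binomial size parameter, $v_i$ an unstructured random effect, $u_i$ a spatially structured random effect, and $n_i$ the number of neighbors of municipality $i$.

```latex
\begin{aligned}
y_i \mid \mu_i, r &\sim \mathrm{NegBin}(\mu_i, r), \\
\log \mu_i &= \log E_i + \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i + v_i, \\
v_i &\sim \mathcal{N}\!\left(0, \sigma_v^{2}\right), \\
u_i \mid \mathbf{u}_{-i} &\sim \mathcal{N}\!\left(\frac{1}{n_i}\sum_{j \sim i} u_j,\; \frac{\sigma_u^{2}}{n_i}\right).
\end{aligned}
```

The unstructured term $v_i$ absorbs extra municipality-level variability, while the intrinsic conditional autoregressive term $u_i$ pulls each municipality toward the average of its neighbors, which is how the model accounts for spatial autocorrelation.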
Conclusion
This study provides a comprehensive comparison of geostatistical and machine learning approaches for modeling the socioexposomic determinants of COVID-19 mortality. Bayesian geospatial models, particularly the BYM spatial model, proved superior in capturing data variability and spatial autocorrelation. Explainable machine learning methods served as valuable complements, uncovering nonlinear relationships not detected by geostatistical models alone. The findings underscore the importance of using integrated modeling frameworks to understand the complex interplay of socioexposomic factors in influencing health outcomes.
Limitations
The ecological design of the study is subject to the ecological fallacy, so interpretation is limited to group-level associations rather than individual-level relationships. The study did not explicitly model synergistic effects of exposure mixtures, although the machine learning models implicitly detected some interactions. The machine learning models employed in this study did not provide confidence or prediction intervals, which limits uncertainty quantification.