Socioexposomics of COVID-19 across New Jersey: a comparison of geostatistical and machine learning approaches

X. Ren, Z. Mi, et al.

This study examines how socioexposomic factors relate to COVID-19 outcomes across New Jersey municipalities, finding robust positive associations between mortality rates and factors such as NO2 exposure and population density. The research was conducted by Xiang Ren, Zhongyuan Mi, and Panos G. Georgopoulos.

Introduction
The study addresses how heterogeneous social and environmental stressors (the socioexposome) relate to disparities in COVID-19 morbidity and mortality at fine local scales. COVID-19 has caused substantial global mortality, and prior work has linked outcomes with various determinants, often considering single stressors or broad spatial units. Recognizing COVID-19 as a syndemic driven by interacting demographic, socioeconomic, and environmental factors, the authors argue for integrated frameworks that model multiple correlated stressors simultaneously, capture spatial heterogeneity, and allow for complex (nonlinear, interacting) associations. The objective is to evaluate socioexposomic associations with COVID-19 outcomes across New Jersey municipalities during the first wave (March–September 2020) and to compare the consistency and efficiency of multiple geostatistical and machine learning modeling approaches for association inference and prediction.
Literature Review
Prior exposome-health studies comparing linear statistical models show no single method is universally superior; model choice should reflect task and data behavior. In COVID-19 association studies, ordinary linear regression has been used but relies on Gaussian assumptions that can yield biased p-values in count data. Generalized linear models (Poisson, Negative Binomial) address mean–variance structure but may still be inadequate when overdispersion, group-level randomness, or spatial autocorrelation are present. Advanced spatial models (e.g., Besag–York–Mollié) add random and spatial components to better represent data variability. Machine learning (ML) is increasingly used in environmental health but is often prediction-focused and considered less interpretable; prior comparisons under linear assumptions reported limited advantages for tree boosting and elevated false positives in variable selection for certain designs. Nonetheless, ML can capture nonlinearities and interactions when adequately tuned and paired with explainability tools (e.g., variable importance, Shapley values). There remain open questions about ML performance in detecting and quantifying complex socioexposome–health associations, especially at small to moderate sample sizes.
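For orientation, the mean–variance structures discussed above can be written out in their standard textbook forms. This is a sketch rather than the authors' exact specification: the symbols ($y_i$ for the count in unit $i$, $\mu_i$ for its mean, $\theta$ for the NB dispersion, $E_i$ for an exposure offset such as population, $v_i$ and $u_i$ for unstructured and spatial random effects, $n_i$ for the number of neighbours of unit $i$) are introduced here for illustration and do not appear in the summary.

```latex
\begin{align*}
\text{Poisson:}\quad & y_i \sim \mathrm{Poisson}(\mu_i),
  & \mathrm{Var}(y_i) &= \mu_i \\
\text{Negative Binomial:}\quad & y_i \sim \mathrm{NB}(\mu_i, \theta),
  & \mathrm{Var}(y_i) &= \mu_i + \mu_i^2/\theta \\
\text{BYM linear predictor:}\quad
  & \log \mu_i = \log E_i + \mathbf{x}_i^{\top}\boldsymbol{\beta} + v_i + u_i \\
\text{Random effects:}\quad & v_i \sim N(0, \sigma_v^2),
  & u_i \mid u_{j \neq i} &\sim N\!\Big(\tfrac{1}{n_i}\sum_{j \sim i} u_j,\ \tfrac{\sigma_u^2}{n_i}\Big)
\end{align*}
```

The Poisson form forces the variance to equal the mean; the NB form lets the variance exceed the mean through $\theta$; the BYM form additionally splits residual variation into an unstructured part and a part that is smoothed across neighbouring units.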
Methodology
Study area: 565 municipalities in New Jersey, focusing on the first wave of COVID-19 (March–September 2020) to avoid confounding by vaccination and variants; NJ was an early hotspot with high per-capita fatalities and substantial demographic, socioeconomic, and environmental heterogeneity.
Data sources:
- COVID-19: Cumulative cases/deaths as of Sept 24, 2020 from local health departments; long-term-care facility (LTCF) deaths from NJDOH; mortality rates computed with and without LTCF deaths.
- Demographics (ACS 2015–2019): population density, age structure, racial/ethnic composition.
- Socioeconomics: education, language isolation, household crowding, poverty, disability, unemployment, uninsured, social vulnerability indices.
- Air pollution: annual PM2.5 and seasonal ozone DM8HA (2016) from EJSCREEN; annual daily max 1-h NO2 (2016) at 1 km from Di et al.; air toxics and risk indices from EPA NATA.
- Proximity to industrial sites: EJSCREEN proximity metrics; NJDEP power plants (2020) with inverse distance weighting by facility size.
- Transportation noise: 24-h A-weighted equivalent levels (2018) from BTS for aviation, roadway, rail at 30 m, aggregated to municipality.
- Occupation/commuting: LEHD (2018) for sectors (health care, food service, transportation, retail/wholesale) and commuting to other counties/NYC.
- Other: Number of licensed LTC beds (NJDOH, 2020); number of restaurants and supermarkets per 1000 residents (AHRQ).
Variables: 84 socioexposomic factors considered.
Variable selection: Forward stepwise under Poisson and NB regression frameworks with VIF < 5 to control multicollinearity and DIC to select variables. Both frameworks selected 10 variables: % age > 64, % minority, % below high school education, median gross rent, population density, % occupation (high risk), PM2.5 average, ozone seasonal DM8HA, % high-occupancy residence, % unemployed. Base models built on these; remaining variables were added one-at-a-time for extended analyses.
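The VIF < 5 screen used during variable selection can be illustrated with a minimal sketch. It covers only the collinearity check, not the stepwise search or the DIC comparison, and the function names `vif` and `vif_ok` are hypothetical helpers introduced here, not taken from the study. Each predictor is regressed on the others by ordinary least squares, and VIF is computed as 1 / (1 − R²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus all remaining columns
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

def vif_ok(X, threshold=5.0):
    """True if every predictor's VIF is below the threshold (5 in the text)."""
    return bool(np.all(vif(X) < threshold))

# Synthetic illustration: column c is almost a copy of column a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
print(vif(np.column_stack([a, b, c])))   # a and c inflate; b stays near 1
print(vif_ok(np.column_stack([a, b])))   # independent columns pass the screen
```

In a forward stepwise loop, a candidate variable would be admitted only if `vif_ok` still holds after appending it to the already selected columns.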
Geostatistical models: Six models of increasing structural complexity: Poisson regression; Poisson mixed-effects model (random intercept by 5 NJ regions); Poisson BYM spatial model; Negative Binomial (NB) regression; NB mixed-effects model; NB BYM spatial model. BYM models used municipality-specific random effects for non-spatial heterogeneity and intrinsic conditional autoregressive (ICAR) spatial effects defined by a 565×565 adjacency matrix. Frequentist fits used R packages (stats, lme4, spaMM); Bayesian inference used INLA with non-informative priors.
Machine learning models: Random Forest (bagging) and Extreme Gradient Boosting (XGBOOST, boosting) with repeated coarse-to-fine grid search hyperparameter tuning; fivefold cross-validation; best parsimonious structures selected by maximum predicted R².
Explainability: Predictor effects plots for GLMs/GLMMs; Shapley values for ML to construct Shapley effects plots enabling fair comparison of effect trends and association magnitudes.
Association quantification: For ML, relative percent change in predicted outcome between first and third quartiles versus median of each predictor using Shapley-based effects; note limitations for highly non-monotonic relations.
Validation (change of support): Predicted deaths for 209 municipalities lacking death data, aggregated to counties; compared predicted county totals to NJDOH observations; evaluated R² and prediction interval (PI) coverage (for statistical models).
Simulation study: Generated outcomes using NB with fixed effects (the 10 selected predictors with specified coefficients), ordinary random effects (N(0, σ²)), and spatial random effects (Gaussian process with adjacency-based covariance). Considered scenarios varying variance proportions: (1) r1=0.6, r2=0.6 (reference), (2) r1=0.6, r2=0.3 (more ordinary random effect), (3) r1=0.3, r2=0.6 (more total random effect). Also assessed impact of missing data by fitting spatial models to a subset matching observed data availability.
Offsets: log(population) included in models of counts.
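The change-of-support validation described above can be sketched as follows. This is a sketch under stated assumptions, not the authors' code. It assumes that predictive draws of municipal deaths are available (for example, posterior predictive samples from a Bayesian fit), that R² is defined as 1 − SS_res/SS_tot (the study may have used a different definition), and that the county interval is taken from quantiles of the summed draws. The function name `county_validation` and its arguments are hypothetical.

```python
import numpy as np

def county_validation(draws, county_of, observed, level=0.95):
    """Aggregate municipal predictive draws to counties and score them.

    draws:     (n_draws, n_munis) sampled death counts per municipality
    county_of: (n_munis,) integer county index (0..n_counties-1) per municipality
    observed:  (n_counties,) observed county death totals
    Returns (r2, n_covered, n_counties).
    """
    draws = np.asarray(draws, dtype=float)
    county_of = np.asarray(county_of)
    observed = np.asarray(observed, dtype=float)
    n_counties = observed.shape[0]

    # Sum municipalities within each county separately for every draw,
    # so the county-level interval reflects the joint predictive distribution.
    county_draws = np.zeros((draws.shape[0], n_counties))
    for c in range(n_counties):
        county_draws[:, c] = draws[:, county_of == c].sum(axis=1)

    point = county_draws.mean(axis=0)
    alpha = 1.0 - level
    lo, hi = np.quantile(county_draws, [alpha / 2, 1 - alpha / 2], axis=0)

    ss_res = np.sum((observed - point) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    covered = int(np.sum((observed >= lo) & (observed <= hi)))
    return r2, covered, n_counties
```

Summing draws before taking quantiles matters: adding up municipal interval endpoints would generally overstate the width of the county interval. A coverage figure such as 19/21 corresponds to `covered` out of `n_counties` in this sketch.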
Key Findings
- Robust positive associations of COVID-19 mortality with NO2 exposure, population density, % minority, and % below high school education, among other social/environmental factors.
- Effect sizes from BYM spatial models and ML were generally consistent when exponential assumptions held:
  • Population density: 20% quantile increase associated with 17% (95% CI: 7%, 29%) higher mortality rate (BYM); ML indicated ~7–8% increase.
  • % below high school education: 1 SD increase (6.3%) associated with 36% (20%, 55%) higher mortality rate (BYM); ML ~28–40% increase.
  • NO2: 1 ppb increase associated with 6% (4%, 9%) higher mortality rate (BYM); ML ~4–5% increase.
- Excluding LTCF deaths modestly affected correlations for a few variables (e.g., % uninsured ρ from 0.27 to 0.38; % below high school education ρ from 0.24 to 0.36) but had minimal impact (Δρ < 0.08; p > 0.2) for most factors; similar patterns for case rates.
- Model structure influenced uncertainty and estimates: Poisson regression produced the narrowest CIs but risked underestimating variability; BYM spatial models produced widest, more realistic CIs and lowest DIC. NB regression/mixed models had wider CIs than Poisson but point estimates differed by −14% to +14% from BYM.
- ML captured nonlinear associations not available to linear log-link models, including saturation of population density effects and steeper increases at low % minority.
- Change of support validation (county-level aggregation):
  • Poisson regression achieved highest R²=0.984 but had poor PI coverage (Accuracy 3/21).
  • NB BYM spatial model had strong PI coverage (19/21) with R²=0.965; BYM models best captured local spatial variation.
  • ML models achieved comparable R² (RF 0.961; XGBOOST 0.980) but do not provide CIs/PIs.
- Simulation analyses: BYM spatial models yielded estimates closest to the true effects under scenarios with overdispersion and spatial/ordinary random effects; simpler models showed significant bias. Missing data increased uncertainty for spatial models but did not drastically bias estimates. ML identified correct trend shapes and approximate slopes but did not outperform advanced geostatistical models for linear-generative scenarios.
- Consistency with external literature: NO2 effect (~4–6% per 1 ppb) aligns with Los Angeles County analysis; larger than some nationwide county-level estimates, likely reflecting finer spatial resolution. PM2.5 showed no stable linear association overall, but ML indicated positive nonlinear effects above ~8 µg/m³. Positive associations observed with proximity to industrial and waste sites; transportation noise correlated univariately but not after adjustment. Age showed expected J-shaped pattern (negative for 15–44, positive for >64). Multiple demographic and socioeconomic variables remained significant, reflecting underlying disparities.
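Because the geostatistical models use a log link, the reported percent changes compound multiplicatively rather than add. The short sketch below converts between a percent increase per stated increment and the implied log-link coefficient, using the BYM point estimates quoted above. The helper names and the 5 ppb increment are illustrative assumptions, and the arithmetic only applies where the exponential (log-linear) form holds.

```python
import math

def rate_ratio_to_beta(pct_increase, per_units=1.0):
    """Log-link coefficient per one unit, given a % increase per `per_units`."""
    return math.log(1.0 + pct_increase / 100.0) / per_units

def pct_change(beta, delta):
    """Percent change in the mortality rate for an increase of `delta` units."""
    return (math.exp(beta * delta) - 1.0) * 100.0

# NO2: 6% higher mortality rate per 1 ppb (BYM point estimate in the text)
beta_no2 = rate_ratio_to_beta(6.0)
# Education: 36% higher per 1 SD, where 1 SD = 6.3 percentage points
beta_edu = rate_ratio_to_beta(36.0, per_units=6.3)

# A hypothetical 5 ppb NO2 difference compounds to about 33.8%, not 5 x 6% = 30%
print(round(pct_change(beta_no2, 5.0), 1))
# Implied change per single percentage point below high school: about 5%
print(round(pct_change(beta_edu, 1.0), 1))
```

The same conversion is what makes BYM coefficients and Shapley-based quartile contrasts comparable on a common percent-change scale, provided the ML effect curve is close to exponential over the range considered.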
Discussion
Findings demonstrate that multiple social and environmental stressors jointly influenced COVID-19 severity across New Jersey and that inference depends substantially on appropriate modeling of overdispersion, group-level randomness, and spatial autocorrelation. BYM spatial models provided the most reliable association estimates and uncertainty quantification (lowest DIC, strong PI coverage), highlighting the importance of spatial random effects in ecological health analyses. ML methods, while not superior for linear-generative structures, complemented geostatistical models by uncovering nonlinear relationships (e.g., saturation in density effects, non-exponential trends at low % minority) and delivering comparable predictive accuracy at the county aggregation level. Bayesian inference via INLA matched Frequentist estimates under non-informative priors but was more computationally efficient for complex models, favoring Bayesian frameworks for flexible ecological modeling. The observed NO2 effect sizes and other associations align with prior local and national studies, reinforcing the role of environmental exposures and structural socioeconomic factors in driving disparities. Together, the results advocate for a multi-model strategy combining Bayesian geospatial models for robust inference with explainable ML to detect complex patterns, thereby improving the assessment of socioexposomic determinants and informing environmental justice and public health interventions.
Conclusion
This work provides an integrated socioexposomic framework to analyze COVID-19 outcomes at fine spatial scale, comparing six geostatistical models and two explainable ML methods. Key contributions include: (1) demonstration that Bayesian BYM spatial models yield the most reliable association estimates with realistic uncertainty when accounting for overdispersion and spatial structure; (2) introduction of Shapley effects plots for interpretable ML effect analysis alongside traditional predictor effects plots; (3) evidence that ML complements geostatistical models by revealing nonlinear association patterns while achieving competitive predictive performance. The study strengthens evidence that NO2 exposure, population density, lower educational attainment, and minority composition are associated with higher COVID-19 mortality. Future work should incorporate explicit modeling of exposure mixtures and interactions, probabilistic ML frameworks to provide uncertainty estimates, and individual-level or multiscale designs to mitigate ecological bias and enhance causal interpretation.
Limitations
- Ecological study design may not reflect individual-level associations (risk of ecological fallacy).
- Mixtures and interactions were not explicitly modeled in geostatistical analyses; while ML detected interactions, a dedicated mixtures framework (e.g., BKMR-like) was not implemented.
- ML models were deterministic and did not provide confidence or prediction intervals, limiting uncertainty characterization for ML-derived associations.
- Small/moderate sample sizes can pose overfitting risks for ML; careful hyperparameter tuning was applied, but generalization may still be sensitive.
- Missing municipality death data required prediction for validation; while BYM models handled uncertainty better, clustered missingness can increase uncertainty in spatial estimates.