
Environmental Studies and Forestry
Exploring interactions between socioeconomic context and natural hazards on human population displacement
M. Ronco, J. M. Tárraga, et al.
This research, conducted by Michele Ronco, José María Tárraga, Jordi Muñoz, María Piles, Eva Sevillano Marco, Qiang Wang, Maria Teresa Miranda Espinosa, Sylvain Ponserre, and Gustau Camps-Valls, examines how socioeconomic factors and natural hazards jointly shape internal displacement, using explainable machine learning to identify the main drivers and inform recovery strategies.
Introduction
The study addresses how sudden-onset natural hazards (floods, storms, landslides) interact with socioeconomic context to drive internal displacement. Human mobility is multi-causal, with hazard, exposure, and vulnerability jointly shaping outcomes. Prior work often emphasized international migration, assumed linear relations, or suffered from data limitations and inconsistent targets, making it difficult to detect environmental drivers. The authors posit that internal displacements are primarily influenced by the interplay of extreme weather with local vulnerability and exposure, and seek to quantify these relationships using data-driven, explainable machine learning without strong parametric assumptions. The purpose is to build a harmonized global disaster-level dataset and model the number of new displacements per event to identify key drivers and interactions, particularly testing the hypothesis of differential vulnerability (greater impacts in poorer areas under similar hazards).
Literature Review
The paper situates itself within literature recognizing migration as an adaptive response to environmental change and conceptualizing disaster risk as the intersection of hazard, exposure, and vulnerability. Previous studies frequently focused on international migration, found weak or mixed evidence for direct environmental effects depending on definitions and data, and often employed linear models that may miss non-linear interactions. Limitations include scarce, incomplete, and heterogeneous mobility data, varying spatial/temporal scales, and confounding from socioeconomic and governance factors. Some work has used non-linear or ML methods and identified roles for agriculture dependence, conflict, and socioeconomic status, but comprehensive analyses integrating hazard, exposure, and vulnerability at subnational event level remain limited. This study addresses these gaps by combining event-level displacement data with multi-source covariates and XAI/causal tools.
Methodology
Design: A global, event-level dataset was constructed for sudden-onset hazards (storms, floods, landslides) from 2016–2021. The target is the number of new displacements of people (NDP) per event, taken from IDMC’s Global Internal Displacement Database. Spatial units are geo-referenced polygons of affected areas (administrative level 1/2 or nearest available polygon) extracted via OSMnx from OpenStreetMap; polygons covering entire countries were discarded to reduce aggregation bias. The analysis focuses on low- and middle-income countries, where Absolute Wealth Index (AWI) data are available.
Dataset size: 2400 disaster events (2016–2021).
Covariates (Table 1):
- Hazard: ERA5-Land precipitation (sum over event duration; maximum accumulation per polygon), 10 m wind speed (max over event).
- Exposure: polygon area (m²), population (mean), kNDVI (mean), elevation (mean).
- Vulnerability/societal: AWI (max within polygon from Meta Data4Good 2.4 km grid), education expenditures (% of GNI; national), % agricultural land (national), global human modification index (gHM; mean), conflict fatalities (ACLED; annual sum within polygon).
Data extraction/aggregation: Weather variables from ERA5-Land (hourly, ~9 km) via Google Earth Engine; kNDVI (MODIS), elevation (SRTM), population (GPWv4), gHM (CSP). AWI was aggregated by polygon-grid intersection, taking the maximum. National indicators come from the UN SDG API (last available year carried forward if missing). Areas were computed by projecting each polygon to the UTM zone of its centroid.
Preprocessing: The target and some skewed covariates were log-transformed: log(NDP) as target; log(population) and log(conflict fatalities) as covariates. All variables were standardized via z-score. An alternative target (NDP per population, without log) was tested but performed worse.
Models:
- Baselines: Linear Regression (LR).
- Non-linear ML: Random Forests (RF) and Gradient Boosting Machines (GBM) regressors (scikit-learn).
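The preprocessing step described above (log transform, then z-score) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and sample values are hypothetical, and because the summary does not say how zero counts (for example, zero conflict fatalities) are handled before taking logs, only strictly positive columns are shown.

```python
import math
import statistics


def zscore(values):
    """Standardize to zero mean and unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


def preprocess(ndp, population):
    """Log-transform the skewed columns, then z-score them."""
    log_ndp = [math.log(v) for v in ndp]          # target: log(NDP), assumes NDP > 0
    log_pop = [math.log(v) for v in population]   # covariate: log(population)
    return zscore(log_ndp), zscore(log_pop)


# Made-up values for four events (not taken from the dataset)
ndp = [12, 150, 3_000, 45_000]
population = [8_000, 120_000, 950_000, 4_200_000]
target, pop_feature = preprocess(ndp, population)
```

The log transform compresses the several orders of magnitude spanned by NDP, so a handful of very large events does not dominate the fit.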
Train/test splitting: Stratified sampling (70/30) using quantile binning on the target to balance NDP ranges.
Performance estimation: Bootstrapping with ~10^3 iterations, reporting mean and standard deviation for R², RMSE, and mean error (ME).
Hyperparameters:
- RF: n_trees=40, max_depth=6, min_samples_split=4, max_features=3.
- GBM: n_estimators=60, learning_rate=0.05, max_depth=4, min_samples_split=4, min_samples_leaf=2.
Grid search tuning minimized overfitting, defined as 1 - (R²_train/R²_test).
Weather contribution: Models were compared with and without weather variables to assess their contribution. The significance of performance differences was evaluated by counting the bootstrap iterations in which the no-weather RF outperformed the full RF (yielding a p-value).
Explainability and causal analysis:
- SHAP (Shapley values) to rank features and interpret per-event contributions (python SHAP package).
- Causal forests and Double ML (EconML) to estimate conditional average treatment effects, treating each covariate as a treatment with the others as controls; medians and spreads are reported and statistical significance is assessed.
- Individual Conditional Expectation (ICE) plots to visualize non-linear responses and interactions by varying one feature at a time.
Robustness: Additional cross-validation strategies and subset analyses (by continent or hazard type) are noted in the supplementary information.
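The stratified 70/30 split and the bootstrap-counting p-value described in this section can be sketched in plain Python. Several details here are assumptions, not taken from the paper: the number of quantile bins (`n_bins=5`), the pairing of bootstrap iterations between the two models, the helper names, and the R² values in the usage lines, which are made up.

```python
import random


def stratified_split(y, n_bins=5, test_frac=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each target quantile bin ~70/30."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])  # indices ranked by target
    train_idx, test_idx = [], []
    for b in range(n_bins):
        # Equal-count (quantile) bin b of the ranked indices
        lo = b * len(y) // n_bins
        hi = (b + 1) * len(y) // n_bins
        bin_idx = order[lo:hi]
        rng.shuffle(bin_idx)
        n_test = round(test_frac * len(bin_idx))
        test_idx.extend(bin_idx[:n_test])
        train_idx.extend(bin_idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def bootstrap_p_value(r2_full, r2_reduced):
    """Fraction of paired bootstrap iterations where the reduced model beats the full one."""
    wins = sum(1 for full, reduced in zip(r2_full, r2_reduced) if reduced > full)
    return wins / len(r2_full)


# Stand-in for the standardized log(NDP) target of 100 events
rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(100)]
train_idx, test_idx = stratified_split(y)

# Made-up hold-out R² values from four bootstrap iterations
p = bootstrap_p_value([0.37, 0.38, 0.36, 0.35], [0.33, 0.34, 0.37, 0.32])
```

Binning on the target before sampling keeps rare, very large displacement events represented in both the training and the test set.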
Key Findings
- Predictive performance: RF and GBM substantially outperformed linear regression on log(NDP). RF achieved R² = 0.37 ± 0.02, RMSE = 0.90 ± 0.02, ME = −0.006 ± 0.90; GBM achieved R² = 0.36 ± 0.02, RMSE = 0.91 ± 0.02, ME = −0.003 ± 0.91; LR achieved R² = 0.19 ± 0.02, RMSE = 1.02 ± 0.02. Predicted and observed values had a Pearson correlation of r = 0.57.
- Role of weather: Excluding weather variables (precipitation and wind speed) degraded performance (RF R² dropped to 0.33 ± 0.02; GBM to 0.32 ± 0.02; LR to 0.16 ± 0.02). The drop in RF hold-out R² had p = 0.05, supporting the relevance of the weather predictors.
- Feature importance and directions (SHAP and model weights): Vulnerability (AWI) and hazard (precipitation) emerged as the top drivers, followed by exposure (area). Poorer areas (low AWI) experienced higher NDP; higher precipitation and stronger winds increased NDP. Larger affected area and higher conflict fatalities were associated with more displacement. Elevation was negatively associated with NDP (higher-altitude regions are less exposed). Greater dependence on agriculture (% agricultural land) was linked to higher NDP. kNDVI and gHM had marginal contributions; higher gHM indicated more developed, resilient areas. Lower national education expenditures were associated with higher NDP, consistent with education as a proxy for adaptation capacity.
- Non-linearity and interactions (differential vulnerability): SHAP analyses revealed non-linear responses, including saturation at very high precipitation and large areas. Under similar high precipitation, events in poorer areas (low AWI) had higher SHAP contributions to NDP, and the same held for large areas at low AWI. This is evidence of differential vulnerability: the same hazard yields greater displacement in poorer contexts.
- Causal estimates: The directions and relative importance of the causal forest treatment effects aligned with the SHAP findings for key predictors (precipitation, AWI, area), but none reached statistical significance given current data.
- Spatial aggregation: Asia had the highest NDP totals across hazards; North America’s storm-related NDP totals were ~10² times lower than Asia’s; floods dominated NDP in other continents; landslides contributed marginally worldwide.
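The differential-vulnerability reading of the SHAP results above can be illustrated with exact Shapley values on a toy model. Everything in this sketch is hypothetical: `toy_model`, its 0-to-1 wealth scale, and the single-baseline value function are assumptions made for illustration. The study itself used the python SHAP package on fitted RF and GBM models, which estimates Shapley values differently.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical displacement score: precipitation matters more where wealth is low."""
    precip, wealth, area = x
    return precip * (2.0 - wealth) + 0.5 * area


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` come from x, all others from the baseline
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


baseline = [0.0, 1.0, 0.0]  # no rain, high wealth, zero area (assumed reference event)
poor = shapley_values(toy_model, [1.0, 0.0, 1.0], baseline)
rich = shapley_values(toy_model, [1.0, 1.0, 1.0], baseline)
```

Both events receive the same precipitation, yet in this toy setup the precipitation attribution is larger for the low-wealth event (`poor[0]` is 1.5, `rich[0]` is 1.0). That is the kind of pattern the summary describes as differential vulnerability.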
Discussion
The findings demonstrate that internal displacement from sudden-onset hazards is driven by a complex, non-linear interplay of hazard intensity, exposure, and socioeconomic vulnerability. Data-driven ML, coupled with XAI and causal tools, captured relationships that linear models missed, particularly the strong role of precipitation and its interactions with wealth. The evidence supports the concept of differential vulnerability: under comparable hazards, poorer communities experience more displacement. Exposure (affected area, population, agriculture dependence) and exacerbating factors (conflict) also materially shape outcomes. These insights advance understanding beyond traditional risk models that treat hazard, exposure, and vulnerability as independent, highlighting the need for integrated approaches. The work underscores the importance of selecting effective hydro-climatic indicators, improving geolocation and polygon delineation for disasters, and enriching datasets with high-resolution socioeconomic and governance-related variables to better quantify adaptive capacity and coping mechanisms. While top-down quantitative modeling reveals broad patterns, it cannot fully capture individual decision processes, pointing to the value of mixed-methods research.
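The interaction between precipitation and wealth discussed above is the kind of pattern that Individual Conditional Expectation (ICE) curves are meant to expose: one feature is varied for a single event while the others are held fixed. The sketch below uses a hypothetical saturating response function, `toy_model`, in place of a fitted model. Its functional form and the feature values are assumptions, not results from the study.

```python
def toy_model(precip, wealth):
    """Hypothetical response: saturating in precipitation, steeper where wealth is low."""
    return (2.0 - wealth) * precip / (1.0 + precip)


def ice_curve(model, event, feature, grid):
    """Vary one feature over `grid` while holding the event's other features fixed."""
    curve = []
    for value in grid:
        modified = dict(event)      # copy so the original event is untouched
        modified[feature] = value
        curve.append(model(**modified))
    return curve


grid = [0.0, 1.0, 2.0, 4.0]
poor_event = {"precip": 1.0, "wealth": 0.0}
rich_event = {"precip": 1.0, "wealth": 1.0}
poor_curve = ice_curve(toy_model, poor_event, "precip", grid)
rich_curve = ice_curve(toy_model, rich_event, "precip", grid)
```

In this toy setup the low-wealth curve lies above the high-wealth curve at every positive precipitation level, and both flatten as precipitation grows, mirroring the interaction and saturation effects reported in the summary.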
Conclusion
The study builds a harmonized global event-level dataset and demonstrates that explainable ML can identify and quantify the drivers of internal displacement from sudden-onset hazards using observational data. It provides empirical evidence of differential vulnerability: the same hazard intensity displaces more people in poorer areas, with precipitation and socioeconomic conditions jointly determining displacement magnitude. Methodologically, integrating ML with XAI and causal analyses illuminates non-linear dependencies across hazard, exposure, and vulnerability, offering a foundation for evidence-based mitigation and adaptation planning. Future work should incorporate improved geolocated displacement data, refine hazard metrics (including lagged and compound effects), enhance polygon identification, include additional dimensions such as coping capacity and governance quality, and leverage higher-resolution, higher-quality variables to strengthen causal inference and generalizability.
Limitations
- Data coverage and bias: AWI currently available only for low- and middle-income countries, restricting global representativeness and potentially introducing income-related biases. Not all IDMC countries/events could be included due to harmonization needs.
- Event geolocation and exposure delineation: Polygons derived from place names via OSMnx may not perfectly match affected areas; in some cases larger administrative units were used, potentially biasing polygon-level covariates. Entire-country polygons were removed, but residual aggregation bias may remain.
- Variable measurement and proxies: Some covariates (e.g., education expenditures) may proxy unobserved governance or response effectiveness; potential confounding among variables persists despite data-driven controls.
- Limited size and resolution: Dataset size is modest for ML; many variables available only at aggregated (national) scales; hydro-climatic drivers and lagged/compound effects may be incompletely captured.
- Modeling constraints: Although SHAP and causal forests agree on directions, causal effects did not reach statistical significance with the current dataset. Linear models underperform due to non-linearities; ML interpretability depends on the chosen model and explanations.
- Top-down perspective: Quantitative, event-level modeling may not capture individual-level decision factors and heterogeneous responses.
- Potential saturation effects and data artifacts: Observed SHAP saturation at high precipitation/area may reflect true limits of exposure or artifacts from polygon matching and data limitations.