
What dictates income in New York City? SHAP analysis of income estimation based on Socio-economic and Spatial Information Gaussian Processes (SSIG)
R. Bai, J. C. K. Lam, et al.
This study by Ruiqiao Bai, Jacqueline C. K. Lam, and Victor O. K. Li develops a Socio-economic and Spatial-Information Gaussian Process (SSIG) model to estimate district-level income in New York City and to examine income inequality there. The model improves the accuracy of district-based income estimates, and the analysis highlights the influence of education and the persistent disparities associated with race and sex, with implications for policy-making.
Introduction
The paper addresses the challenge of measuring and understanding intra-city income inequality in developed economies, focusing on New York City (NYC). Income data at high spatial granularity are vital for evidence-based policy but are difficult and costly to collect through field surveys, and are often constrained by privacy and reporting limitations. The study investigates between-district inequality (within-district measures are available, but between-district indicators are not) using tract- and ZIP-level data. Two research questions guide the work: (1) Does incorporating socio-economic variables from field surveys into a spatial Gaussian Process (GP) model improve the accuracy of fine-grained income estimation? (2) Which socio-economic variables best predict district-based income in NYC? To answer these, the authors develop a Socio-economic and Spatial-Information GP (SSIG) model and apply SHAP analysis to interpret the contributions of individual socio-economic features. The study emphasizes modeling complex non-linear relationships, beyond traditional linear and log-linear approaches, to better capture the determinants of income and inform urban sustainability policy.
Literature Review
Two broad approaches exist for fine-grained income estimation in developed economies: (1) models inferring income from socio-economic variables collected via field surveys, and (2) machine-learning/big-data models using alternative proxies (e.g., housing prices, satellite imagery, street views, mobility, restaurant data, social media, spatial coordinates). Traditional field-survey-based models commonly use variables such as education, employment, age structure, and population density, and have repeatedly found educational attainment (the share of residents aged 25 and over with a bachelor’s degree or higher) to be a strong predictor of income, an effect attributed to enhanced productivity. However, such studies often use a limited set of variables, smaller samples, coarser spatial resolution (county level), and simpler models (linear or log-linear), which limits their ability to capture complex non-linear relationships at high spatial resolution. On the ML/big-data side, prior work (Bai et al., 2020) developed three models, notably a Spatial-Information-GP that uses only the latitudes and longitudes of district centroids and achieved strong accuracy with a low data burden. The literature suggests benefits from integrating richer socio-economic features with spatial information, while also highlighting the risk of overfitting with high-dimensional proxy data. This study aims to fill these gaps by combining socio-economic and spatial inputs in a GP framework and providing interpretable attributions via SHAP.
Methodology
Study area and labels: NYC tract- and ZIP-level income from the 2015–2019 American Community Survey (ACS; 5-year estimates). Labels are per-capita income and median household income at two granularities: 2117 tracts and 180 ZIPs for per-capita income; 2095 tracts and 179 ZIPs for median household income. Between-district inequality indicators (Gini coefficient, Decile Dispersion Ratio) are computed to document the extent of inequality.
Inputs: Two input groups feed the SSIG model: (a) spatial information (latitude and longitude of district centroids, derived from Census Reporter shapefiles), and (b) ten socio-economic variables selected through literature review and ACS availability: the proportion of residents aged ≥25 with a bachelor’s degree (≥25UDG), the proportion aged ≥25 with a postgraduate degree (≥25PGD), the unemployment rate (Un-employ), the proportion aged ≥65, the proportion aged <18, population density (Pop-density), the proportion male (Male), and three racial-composition shares (Black or African American, Asian, White). A proportion is set to zero when its denominator is zero; districts lacking income labels are excluded from training and validation.
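The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the dictionary keys, the helper names `safe_proportion` and `build_feature_row`, and the two example districts are all hypothetical, and only three of the ten socio-economic variables are shown.

```python
def safe_proportion(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    if denominator == 0:
        return 0.0
    return numerator / denominator


def build_feature_row(district):
    """Concatenate socio-economic proportions with centroid coordinates."""
    pop25 = district["pop_25_plus"]
    total = district["pop_total"]
    socio = [
        safe_proportion(district["bachelor_25_plus"], pop25),  # >=25UDG
        safe_proportion(district["postgrad_25_plus"], pop25),  # >=25PGD
        safe_proportion(district["male"], total),              # Male
    ]
    spatial = [district["lat"], district["lon"]]
    return socio + spatial


# Hypothetical districts; the second has no residents, so its
# proportions fall back to zero instead of raising ZeroDivisionError.
districts = [
    {"pop_25_plus": 200, "pop_total": 400, "bachelor_25_plus": 50,
     "postgrad_25_plus": 20, "male": 190, "lat": 40.71, "lon": -74.00},
    {"pop_25_plus": 0, "pop_total": 0, "bachelor_25_plus": 0,
     "postgrad_25_plus": 0, "male": 0, "lat": 40.64, "lon": -73.78},
]
rows = [build_feature_row(d) for d in districts]
```

Each row places the socio-economic block first and the two coordinates last, so the same matrix can be sliced to feed a socio-economic-only or spatial-only model.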
Baselines and additional data: For comparative baselines, the study considers prior models (GP-Mixed-Siamese-like-Double-Ridge, Mixed-Siamese-like-GP, Mixed-Siamese-like-Random-Forest, Mixed-Spatial-Siamese-like, Mixed-Siamese-like). Additional proxies used for baselines include 2019 house price transactions (NYC Department of Finance), 2018 daytime satellite imagery (NYC GOV), and 2018–2020 street view images (Google Street View Static API), processed following Bai et al. (2020).
Model: Gaussian Process regression with a Matérn 3/2 kernel; inputs are concatenated socio-economic variables and spatial coordinates (SSIG). Labels and features are standardized before training. Evaluation uses fivefold cross-validation and metrics: r² (Pearson correlation squared), Coefficient of Determination (CoD), RMSE, MAE. Implementation uses GPy. The Spatial-Information-GP (spatial only) and Socio-economic-Information-GP (socio-economic only) serve as ablation baselines.
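To make the model description concrete, here is a dependency-free sketch of GP regression with a Matérn 3/2 kernel, k(r) = σ²(1 + √3·r/ℓ)·exp(−√3·r/ℓ). It is not the authors' GPy implementation: the function names, the fixed hyperparameters (`lengthscale`, `variance`, `noise`), and the toy data are assumptions for illustration, and the features are assumed to be standardized already.

```python
import math


def matern32(x, y, lengthscale=1.0, variance=1.0):
    """Matern 3/2 covariance between two feature vectors."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    s = math.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + s) * math.exp(-s)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def standardize(values):
    """Return z-scores plus the mean and standard deviation used."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values], mean, std


def gp_predict(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean of a zero-mean GP with a Matern 3/2 kernel."""
    y_std, mean, std = standardize(y_train)
    K = [[matern32(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X_train)] for i, a in enumerate(X_train)]
    alpha = solve(K, y_std)  # (K + noise * I)^-1 y
    preds = []
    for x in X_test:
        k_star = [matern32(x, a) for a in X_train]
        z = sum(k * a for k, a in zip(k_star, alpha))
        preds.append(z * std + mean)  # undo the label standardization
    return preds


# Hypothetical standardized inputs: [education share, latitude, longitude].
X_train = [[-1.0, 0.0, 0.0], [0.0, 1.0, -1.0],
           [1.0, -1.0, 1.0], [2.0, 0.5, 0.5]]
y_train = [30000.0, 45000.0, 80000.0, 120000.0]
preds = gp_predict(X_train, y_train, [[0.5, 0.0, 0.0]])
```

In practice a library such as GPy would also fit the kernel hyperparameters by maximizing the marginal likelihood; they are held fixed here to keep the sketch short. Dropping the coordinate columns or the socio-economic columns from `X_train` gives the two ablation variants.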
SHAP analysis: To quantify each socio-economic feature’s contribution to predictions, SHAP values are computed for the SSIG model. Mean absolute SHAP values indicate global importance, and scatter plots of feature values against SHAP values show the direction of each feature’s effect. SHAP is chosen because it is model-agnostic, works with non-linear models, and is described as robust to multicollinearity. The analysis compares the relative importance of the spatial and socio-economic feature groups, and of the individual socio-economic features.
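The idea behind the attribution can be illustrated with a brute-force Shapley computation. This sketch enumerates every feature coalition and replaces absent features with a single background value; it is not the `shap` library's algorithm and not the authors' configuration. The stand-in model `toy_model`, its coefficients, and the example rows are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one instance by enumerating feature subsets.

    Features outside a coalition are replaced by their background value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else background[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_shap(predict, rows, background):
    """Global importance: mean absolute Shapley value per feature."""
    per_row = [shapley_values(predict, r, background) for r in rows]
    n = len(background)
    return [sum(abs(p[i]) for p in per_row) / len(per_row) for i in range(n)]


# Hypothetical stand-in model over standardized
# [postgraduate share, bachelor share, unemployment rate].
def toy_model(f):
    return 0.4 * f[0] + 0.3 * f[1] - 0.1 * f[2] + 0.05 * f[0] * f[1]


rows = [[1.0, 0.5, -0.2], [-0.5, -1.0, 0.8], [0.2, 0.1, 0.0]]
background = [0.0, 0.0, 0.0]
importance = mean_abs_shap(toy_model, rows, background)
```

The per-instance values sum to the difference between the prediction for that instance and the prediction for the background, which is what lets mean absolute values be compared across features and summed over feature groups. Enumeration costs grow exponentially with the number of features, so it is only practical for small inputs like this one.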
Between-district inequality indicators (reported for context): Gini coefficient and Decile Dispersion Ratio at tract and ZIP levels: tract per-capita (Gini 0.33; decile ratio 7.87), tract median household (Gini 0.26; decile ratio 6.05), ZIP per-capita (Gini 0.35; decile ratio 7.73), ZIP median household (Gini 0.26; decile ratio 5.59).
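For readers unfamiliar with these indicators, the sketch below computes both from a list of district incomes. Two assumptions are made that the summary does not confirm: districts are unweighted (no population weights), and the Decile Dispersion Ratio is taken as the mean of the top 10% of districts divided by the mean of the bottom 10%. The income values are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (unweighted)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


def decile_dispersion_ratio(values):
    """Mean of the top 10% of values divided by the mean of the bottom 10%."""
    xs = sorted(values)
    k = max(1, len(xs) // 10)
    bottom = sum(xs[:k]) / k
    top = sum(xs[-k:]) / k
    return top / bottom


# Hypothetical per-capita incomes for ten districts (USD).
incomes = [18000, 22000, 25000, 31000, 36000,
           42000, 55000, 70000, 95000, 140000]
district_gini = gini(incomes)
district_ddr = decile_dispersion_ratio(incomes)
```

A Gini of 0 means all districts have the same income, and values approach 1 as income concentrates in a few districts; a decile ratio of about 8 means the richest tenth of districts average roughly eight times the income of the poorest tenth.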
Key Findings
- Model performance: SSIG outperforms all baselines across tasks and granularities. Fivefold cross-validation results (examples):
  - Tract-level per-capita income: SSIG r²=0.89, CoD=0.89, RMSE=9,619, MAE=5,424; higher than Spatial-GP (r²=0.84) and Socio-economic-GP (r²=0.84).
  - ZIP-level per-capita income: SSIG r²=0.85, CoD=0.84, RMSE≈13,655–14,654, MAE≈10,036–10,040; higher than Spatial-GP (r²=0.69) and Socio-economic-GP (r²=0.81).
  - Tract-level median household income: SSIG r²=0.89, CoD=0.89, RMSE=9,619 (analogous to the per-capita table), with strong gains over baselines.
  - ZIP-level median household income: SSIG r²=0.85, CoD=0.84, RMSE≈14,654, MAE≈10,036; higher than Spatial-GP (r²=0.68–0.69) and Socio-economic-GP (r²=0.80–0.81).
- Ablation: SSIG (socio-economic + spatial) consistently outperforms GP variants using only socio-economic or only spatial inputs. Socio-economic-only GP generally surpasses spatial-only GP, especially at ZIP level, indicating socio-economic features carry more predictive power than pure spatial autocorrelation.
- SHAP, feature group importance: Socio-economic features exhibit larger aggregate SHAP importance than spatial coordinates, indicating that content features dominate spatial location for prediction in NYC.
- SHAP, top individual predictors: Educational attainment is most influential. The proportion aged ≥25 with a postgraduate degree (≥25PGD) is the strongest predictor; the proportion aged ≥25 with a bachelor’s degree (≥25UDG) is second. Higher values of either increase predicted income. Example mean absolute SHAP values at ZIP level: for per-capita income, ≥25PGD 0.4230 and ≥25UDG 0.3484; for median household income, ≥25PGD 0.3494 and ≥25UDG 0.3067.
- Race and sex: “White” share contributes positively to income; “Male” share contributes positively especially for median household income. Race contributes more to between-district income gaps than sex, per SHAP magnitudes (e.g., White > Male). Median household income is more sensitive to White than per-capita income.
- Other variables: Population density and unemployment rate show negative associations with income (higher values linked to lower predicted income). Age structure effects (<18, ≥65) are mixed/ambiguous across outcomes and granularities.
- Sensitivity differences: Estimated per-capita income is more sensitive to ≥25UDG than estimated median household income, suggesting concentration of bachelor’s degree holders in higher-income households.
- Inequality context: Between-district inequality in NYC is substantial (e.g., tract-level per-capita Gini 0.33; decile ratio 7.87).
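The four accuracy metrics quoted in the model-performance bullets can be computed as in the sketch below. The function names and the small example are illustrative assumptions, not the authors' evaluation code. The example is chosen to show why r² and CoD are reported separately: a constant bias leaves the squared Pearson correlation at 1 while lowering the coefficient of determination.

```python
import math


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between labels and predictions."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


def coefficient_of_determination(y_true, y_pred):
    """CoD = 1 - SS_res / SS_tot; penalizes bias and scale errors."""
    mt = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the label."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the label."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


# Hypothetical labels and predictions that are offset by a constant 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
scores = {
    "r2": pearson_r2(y_true, y_pred),
    "cod": coefficient_of_determination(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
    "mae": mae(y_true, y_pred),
}
```

When r² and CoD are close, as in the tract-level results above, the predictions are well calibrated in level and scale as well as well correlated with the labels.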
Discussion
The study’s findings address the research questions by showing that integrating socio-economic variables with spatial coordinates in a GP framework (SSIG) significantly improves fine-grained income estimation accuracy over spatial-only and other complex multimodal baselines. The superior generalization suggests the relationship between field-survey socio-economic variables and income is more stable than relationships with high-dimensional proxy data, reducing overfitting risks.
SHAP analysis demonstrates that socio-economic features, especially educational attainment (≥25PGD, ≥25UDG), dominate predictive power for district-level income, and provides interpretable evidence for policy prioritization. The analysis also indicates that race-related composition (White) contributes more to between-district income differences than sex composition, while some structural factors (population density, unemployment) negatively relate to income. The socio-economic-only GP outperforming spatial-only GP and the SHAP group importance both support the primacy of socio-economic content over spatial autocorrelation in NYC’s income patterns, particularly at ZIP level where data quality may be higher. These insights are relevant for urban sustainability policies, suggesting targeted educational opportunities and addressing structural inequities related to race and gender could help mitigate income disparities. The authors note SHAP reflects correlational attribution, not causality, so policy inferences should be made cautiously and supplemented with causal analyses.
Conclusion
The paper introduces SSIG, a Gaussian Process model combining socio-economic and spatial inputs, and uses SHAP to interpret feature contributions to district-level income in NYC. SSIG achieves state-of-the-art performance for per-capita and median household income estimation at tract and ZIP levels, outperforming spatial-only GP and other multimodal baselines. SHAP reveals that higher educational attainment (≥25PGD, ≥25UDG) is the strongest predictor of income, with race (White) and sex (Male) contributing to observed inequalities; race effects are stronger than sex effects. Per-capita income is more sensitive to bachelor’s degree prevalence than median household income, consistent with assortative mating and concentration in higher-income households. The results suggest policy levers emphasizing expanded access to higher education in lower-income districts and addressing race- and sex-related structural inequities. Methodologically, the study demonstrates the value of integrating field-survey socio-economic data with spatial information and using SHAP for interpretable, non-linear modeling of socio-economic phenomena. Future work should incorporate annual (not 5-year) high-granularity data, analyze SHAP interaction effects when supported, examine pandemic-era impacts on inequality, and extend the framework to other cities for comparative analyses.
Limitations
- Data granularity and sampling: ACS tract/ZIP-level income data are 5-year averages (2015–2019) and smaller districts are sampled less frequently, which may introduce sampling variability and limit temporal resolution. No annual fine-grained income labels were available.
- Coverage constraints: Some baseline models could only be trained/validated where street-view imagery was available, potentially affecting comparisons.
- Correlational interpretation: SHAP attributions are correlational and may reflect household self-selection and unobserved confounding; causal conclusions should not be drawn without further analysis.
- Period coverage: The analysis does not include the COVID-19 pandemic period (2020+), during which differential labor market impacts may have altered income patterns and inequalities.
- Interaction effects: SHAP interaction analyses were not conducted; future software support could enable investigation of feature interactions.