Introduction
Income inequality significantly affects urban sustainability in developed economies. High-granularity income data, crucial for evidence-based policy-making, are traditionally obtained from field surveys, but privacy concerns hinder accurate data collection. Previous machine-learning models for income estimation, such as the Spatial-Information-GP model, have shown promise but do not incorporate socio-economic factors. This study aims to improve income estimation accuracy and to understand the relative contributions of socio-economic variables to income inequality in New York City (NYC). We address two research questions: 1) Does incorporating socio-economic data into the Spatial-Information-GP model improve income estimation accuracy? and 2) Which socio-economic variables best predict income in NYC? To answer these questions, we propose a novel SSIG model that combines socio-economic and spatial data within a Gaussian Process (GP) framework. We compare SSIG's performance against state-of-the-art baselines and use SHAP analysis to interpret the model's predictions and identify key factors influencing income inequality.
Literature Review
Existing literature reveals two main approaches to fine-grained income estimation: one relying on socio-economic variables from field surveys, and another employing machine learning and big data. Traditional models using field survey data have explored variables such as education, employment, age, and population density, but they are often limited by a small number of variables, low spatial resolution, and simple machine learning techniques. Previous machine-learning studies have drawn on various data sources, including house prices, satellite imagery, and street views. One notable model, the Spatial-Information-GP, demonstrated high accuracy but ignored socio-economic information. This study builds upon these previous approaches by incorporating both socio-economic data from field surveys and spatial information in a novel framework.
Methodology
The study focuses on NYC, using income data from the 2015-2019 American Community Survey (ACS) as labels (per-capita and median household income at Tract and ZIP levels). Input data comprised spatial information (latitude and longitude of district centroids) and ten socio-economic variables from the ACS: the proportions of adults aged ≥25 with bachelor's degrees and with postgraduate degrees, unemployment rate, proportions of the population aged <18 and ≥65, population density, proportion of males, and proportions of Black/African American, Asian, and White populations. For comparison, the baseline models used house prices, daytime satellite images, and street views. The SSIG model is a Gaussian Process (GP) with a Matern-3/2 kernel. Model performance was evaluated using R², Coefficient of Determination (CoD), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE) through fivefold cross-validation. SHAP analysis was employed to interpret the model's predictions and assess the contribution of each socio-economic variable. An ablation study compared SSIG with models using only socio-economic or only spatial data to determine the relative importance of each data type.
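The GP setup described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the data are synthetic, the hyperparameters (`length_scale`, `variance`, `noise`) are fixed by hand rather than learned, and CoD is assumed here to mean 1 - SSE/SST. The function names `matern32`, `gp_predict`, and `fivefold_cv` are hypothetical.

```python
import numpy as np

def matern32(A, B, length_scale=1.0, variance=1.0):
    """Matern-3/2 kernel: variance * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    s = np.sqrt(3.0) * np.sqrt(np.maximum(sq, 0.0)) / length_scale
    return variance * (1.0 + s) * np.exp(-s)

def gp_predict(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean of a GP fitted to mean-centred targets."""
    y_mean = y_train.mean()
    K = matern32(X_train, X_train) + noise * np.eye(len(X_train))
    L = np.linalg.cholesky(K)
    # Solve K alpha = (y - mean) through the Cholesky factor.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    return matern32(X_test, X_train) @ alpha + y_mean

def fivefold_cv(X, y, seed=0):
    """Fivefold cross-validation returning RMSE, MAE, and CoD (assumed 1 - SSE/SST)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 5)
    preds = np.empty(len(y), dtype=float)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        preds[test] = gp_predict(X[train], y[train], X[test])
    err = preds - y
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "cod": float(1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)),
    }

# Synthetic stand-in data (NOT the ACS data): columns 0-1 play the role of
# centroid coordinates, columns 2-4 the role of socio-economic proportions.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 5))
y = 30 + 40 * X[:, 2] - 15 * X[:, 3] + 5 * np.sin(3 * X[:, 0]) + rng.normal(0, 1, 60)

scores = fivefold_cv(X, y)
print(scores)
```

Dropping columns 0-1 or columns 2-4 from `X` before calling `fivefold_cv` gives a rough analogue of the ablation comparison, under the same assumptions.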
Key Findings
SSIG significantly outperformed state-of-the-art baseline models in income estimation accuracy across all evaluation metrics (R², CoD, RMSE, MAE) at both Tract and ZIP levels. The ablation study showed that SSIG consistently outperformed models using only socio-economic or spatial information, indicating the synergistic effect of combining both data types. SHAP analysis revealed that the proportion of adults ≥25 with postgraduate degrees (≥25PGD) was the most important predictor of income, followed by the proportion with bachelor's degrees (≥25UDG). The proportion of the White population was also a significant predictor. Race and sex were shown to contribute to income inequality, with race exhibiting a stronger effect. Population density and unemployment rate showed negative correlations with income. The effects of age (<18 and ≥65) were ambiguous. Further analysis revealed that per-capita income was more sensitive to ≥25UDG and ZIP-level analysis provided more robust estimates than Tract-level analysis. Median household income showed a stronger correlation with the proportion of White residents. Comparison of SHAP values across income types and geographical levels showed that ≥25UDG, ≥25PGD, and White were consistently key predictors.
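To make the attribution idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy model by enumerating feature subsets and replacing absent features with a baseline value. It does not use the SHAP library and is not the study's procedure. The linear model, its coefficients, and the input values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                phi[i] += w * (value(s | {i}) - value(s))
    return phi

# Hypothetical toy model: income (in $1000s) as a linear function of
# [postgraduate share, bachelor's share, unemployment rate].
def toy_income(z):
    return 20 + 50 * z[0] + 30 * z[1] - 40 * z[2]

x = [0.30, 0.40, 0.05]         # hypothetical district
baseline = [0.10, 0.20, 0.10]  # hypothetical reference district

phi = shapley_values(toy_income, x, baseline)
print(phi)
```

For a linear model each attribution equals the coefficient times the feature's deviation from the baseline, and the attributions sum to the difference between the prediction for `x` and the prediction for `baseline`. Exact enumeration scales exponentially with the number of features, which is why practical SHAP analyses rely on approximations.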
Discussion
The superior performance of SSIG over baseline models demonstrates the value of integrating socio-economic and spatial data for accurate income estimation. The ablation study highlighted the importance of both data types, particularly socio-economic data at the ZIP level. SHAP analysis provides valuable insights into the drivers of income inequality in NYC. The significant influence of higher education aligns with economic theory. The persistence of racial and sex-based inequalities reinforces existing concerns about systemic biases. The findings offer actionable insights for policymakers, highlighting the need to address both educational attainment and systemic inequities related to race and sex to promote greater income equality.
Conclusion
This study introduces SSIG, a novel model combining socio-economic and spatial data for high-accuracy income estimation in NYC. SHAP analysis effectively revealed the importance of education and persistent inequalities. These findings highlight opportunities to improve urban sustainability by focusing on higher education and addressing racial and sex-based disparities. Future research could explore the interaction effects of variables and extend the model to include additional data sources or time periods.
Limitations
The study's reliance on 5-year average income data from the ACS, together with the less frequent surveys in smaller districts, limits the precision of the findings. Future research could explore the impact of the COVID-19 pandemic and employ yearly income data for more accurate analysis. Investigating SHAP interaction effects could also yield further insights.