Using publicly available satellite imagery and deep learning to understand economic well-being in Africa
C. Yeh, A. Perez, et al.
A team of researchers from Stanford University and AtlasAI trained deep learning models on publicly available satellite imagery to predict asset wealth in about 20,000 villages across Africa. The approach captures substantial variation in wealth across villages and offers insight into how wealth changes over time, with practical implications for policy and research.
Introduction
Reliable, local-level measurements of economic well-being are essential for policymaking, program targeting, and private-sector decision-making, yet such data are sparse across much of sub-Saharan Africa. Nationally representative consumption or asset surveys are infrequent (often ≥4-year intervals) and rarely provide repeated observations at the same locations, limiting the ability to measure local changes over time. Meanwhile, satellite sensors repeatedly observe the Earth's surface, providing frequent, wide-coverage imagery. The study investigates whether publicly available multispectral daytime Landsat imagery and nighttime lights can be used to accurately infer spatial and temporal variation in local-level asset wealth across Africa, particularly in countries or years lacking reliable survey data. The central research question is whether deep learning models trained on these imagery sources can predict village-level (and aggregated district-level) asset wealth and its changes over time, and how such predictions compare to ground-based measures and prior benchmarks.
Literature Review
Prior work has shown that coarse-resolution nighttime lights correlate with national or regional economic performance over time and that high-resolution commercial imagery can predict spatial variation in local economic outcomes in selected countries. Transfer learning approaches using nightlights as a proxy to extract features from daytime imagery have been used to predict poverty. Other studies have used mobile phone metadata to predict wealth and geostatistical models to estimate health and housing outcomes. However, there remains a gap in generating accurate, scalable, and temporally informative local-level economic indicators across sub-Saharan Africa using only publicly available imagery. This study builds on these literatures by directly combining multispectral Landsat and nightlights as inputs in an end-to-end deep learning framework, evaluating both cross-sectional and temporal performance across many African countries, and validating against independent census-based wealth measures.
Methodology
Data and outcome construction:
- Ground data: Asset wealth for over 500,000 households in 19,669 DHS clusters (villages/neighborhoods) across 23 African countries, surveys from 2009–2016. Asset index constructed via PCA (first principal component) using common assets: rooms, electricity, floor quality, water, toilet, ownership of phone, radio, TV, car, motorbike; standardized to mean 0, SD 1. Household indices averaged to cluster level. Alternative index constructions (subsets of assets, country-specific mappings) yield highly correlated results.
- Validation and auxiliary data: LSMS panel data in five countries (Malawi, Nigeria, Tanzania, Ethiopia, Uganda) for temporal analyses (~9,000 households in ~1,400 clusters). Independent census microdata (10% samples) in eight countries to construct district-level asset indices for validation within 4 years of DHS surveys. The wealth index correlates with log consumption (weighted r^2 ≈ 0.50) in LSMS subsets.
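The asset index construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `asset_index` and `cluster_means` are hypothetical, asset columns are assumed to be numeric and coded so that larger values mean more or better assets, and every column is assumed to vary across households.

```python
import numpy as np

def asset_index(assets: np.ndarray) -> np.ndarray:
    """First principal component of standardized assets, rescaled to mean 0, SD 1.

    assets: (n_households, n_assets) numeric matrix (hypothetical layout).
    """
    # Standardize each asset column so no single asset dominates the PCA.
    z = (assets - assets.mean(axis=0)) / assets.std(axis=0)
    # Rows of vt are the principal directions of the standardized data.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    pc1 = vt[0]
    # The sign of a principal component is arbitrary. Assumption: orient it
    # so that owning more assets raises the index.
    if pc1.sum() < 0:
        pc1 = -pc1
    scores = z @ pc1
    return (scores - scores.mean()) / scores.std()

def cluster_means(index: np.ndarray, cluster_ids: np.ndarray):
    """Average household-level index values within each cluster."""
    ids, inverse = np.unique(cluster_ids, return_inverse=True)
    counts = np.bincount(inverse)
    return ids, np.bincount(inverse, weights=index) / counts

# Synthetic demo: 200 households, 5 assets driven by one latent wealth factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
demo_assets = latent[:, None] * np.ones(5) + 0.5 * rng.normal(size=(200, 5))
demo_index = asset_index(demo_assets)
demo_ids, demo_cluster_index = cluster_means(demo_index, np.arange(200) // 20)
```

Averaging after the PCA, rather than before, mirrors the order given in the text: household indices first, cluster means second.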
Satellite imagery:
- Daytime: Landsat 5/7/8 surface reflectance, 30 m/pixel, seven bands (RED, GREEN, BLUE, NIR, SWIR1, SWIR2, Thermal). Three-year median composites for 2009–11, 2012–14, 2015–17 to reduce cloud/seasonal noise and reflect slow-moving wealth changes.
- Nighttime lights: DMSP (30 arc-sec, unitless) for 2009–11 and VIIRS (15 arc-sec, nW cm−2 sr−1) for 2012–14 and 2015–17; treated as separate bands due to resolution/units differences.
- Preprocessing: Imagery exported from Google Earth Engine as 255×255 tiles centered on cluster coordinates, center-cropped to 224×224 (~6.72 km per side at 30 m/pixel) and normalized (mean 0, SD 1 per band). DHS cluster GPS coordinates are jittered (up to 2 km urban/10 km rural), which introduces spatial misalignment; the 6.72 km window aims to encompass true locations.
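The crop and normalization steps can be illustrated with a short sketch. The helper names are hypothetical, and the per-band statistics are assumed to be computed over a set of tiles (the text does not say over which set they are computed).

```python
import numpy as np

LANDSAT_RES_M = 30  # metres per pixel, as stated in the text
TILE_SIZE = 255     # exported tile size in pixels
CROP_SIZE = 224     # model input size in pixels

def center_crop(tile: np.ndarray, size: int = CROP_SIZE) -> np.ndarray:
    """Crop a (H, W, bands) tile to (size, size, bands) around its center."""
    h, w = tile.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return tile[top:top + size, left:left + size]

def normalize_bands(tiles: np.ndarray, band_mean: np.ndarray, band_std: np.ndarray) -> np.ndarray:
    """Shift and scale each band to mean 0, SD 1 using precomputed statistics."""
    return (tiles - band_mean) / band_std

# Ground footprint of the cropped window: 224 px * 30 m = 6,720 m.
footprint_km = CROP_SIZE * LANDSAT_RES_M / 1000

# Synthetic demo: four random 7-band tiles standing in for Landsat composites.
rng = np.random.default_rng(0)
tiles = rng.uniform(0.0, 1.0, size=(4, TILE_SIZE, TILE_SIZE, 7))
cropped = np.stack([center_crop(t) for t in tiles])
band_mean = cropped.mean(axis=(0, 1, 2))  # one value per band
band_std = cropped.std(axis=(0, 1, 2))
normalized = normalize_bands(cropped, band_mean, band_std)
```

With a half-width of 3.36 km, the cropped window fully covers the maximum urban displacement of 2 km but not the largest rural displacement, which is one reason jitter still costs accuracy.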
Models:
- Core architecture: ResNet-18 (v2, preactivation) adapted for multi-band inputs and scalar regression outputs. For combined models, separate ResNet-18s are trained for Landsat (MS) and nightlights (NL), then final feature layers are concatenated and a ridge regression layer is trained on top.
- Initialization: “Same-scaled” ImageNet initialization for the first layer: RGB-channel weights taken from ImageNet; non-RGB weights initialized as the mean of the RGB weights, with scaling by 3/C (C = number of input channels). Remaining layers initialized from ImageNet; final-layer weights drawn from a truncated normal. NL-only models use He initialization. For change-prediction tasks and the LSMS “index of differences,” random initialization performed better.
- Training: Adam optimizer, mean squared error loss, batch size 64, learning rate decayed by 0.96 each epoch; 150 epochs (200 for DHS out-of-country). Early stopping via best validation r^2. Hyperparameters grid-searched over learning rate (1e-2 to 1e-5) and L2 regularization (1e0 to 1e-3). Data augmentation: random flips; for MS bands, random brightness and contrast perturbations. For temporal stacks, images from two years are stacked (224×224×2C), with randomized order and sign adjustment on labels.
- Baselines: (i) k-nearest neighbor on scalar nightlights (nonlinear mapping via nearest neighbors in nightlights space); (ii) regularized linear regression on scalar nightlights; (iii) transfer learning, in which a model is trained to predict DMSP/VIIRS nightlights from MS imagery (multitask regression) and its frozen features are then used to predict wealth.
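The first-layer initialization for multi-band inputs can be sketched in NumPy. This is an illustration under stated assumptions: the function name, the `(kh, kw, channels, filters)` kernel layout, and the `rgb_positions` parameter are hypothetical, and the 3/C factor is assumed to apply to every first-layer channel, RGB included.

```python
import numpy as np

def same_scaled_first_layer(rgb_weights: np.ndarray, n_channels: int,
                            rgb_positions=(0, 1, 2)) -> np.ndarray:
    """Expand a pretrained 3-channel kernel to n_channels input channels.

    rgb_weights: (kh, kw, 3, filters) pretrained first-layer kernel.
    rgb_positions: indices of the R, G, B bands in the multi-band input.
    """
    if rgb_weights.shape[2] != 3:
        raise ValueError("expected a 3-channel pretrained kernel")
    # Every channel starts as the mean of the pretrained RGB weights ...
    mean_rgb = rgb_weights.mean(axis=2, keepdims=True)
    w = np.repeat(mean_rgb, n_channels, axis=2)
    # ... then the true RGB channels are overwritten with their own weights.
    for src, dst in enumerate(rgb_positions):
        w[:, :, dst, :] = rgb_weights[:, :, src, :]
    # Assumption: the 3/C factor is applied to the whole first layer.
    return w * (3.0 / n_channels)

# Demo with random stand-in weights for a 7x7 kernel with 64 filters
# and the seven Landsat bands listed in the text.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(7, 7, 3, 64))
expanded = same_scaled_first_layer(pretrained, n_channels=7)
```

Under that assumption the channel-summed kernel is unchanged, so an input with the same value in every band produces the same first-layer response as the pretrained RGB network would; this is one way to read the motivation for the 3/C factor.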
Evaluation design:
- Data splits: 5-fold cross-validation. Out-of-country setting: countries grouped into folds with roughly equal village counts; train/validate on four folds (excluding test country), test on held-out country. In-country setting: folds constructed to avoid any overlap in satellite footprints using DBSCAN grouping. LSMS evaluated in-country only.
- Outcomes: Cross-sectional prediction of cluster-level wealth; aggregated to district-level for robustness. Temporal prediction using (a) matched nearest DHS clusters across survey rounds, (b) LSMS panel differences of cluster-level wealth index, and (c) LSMS “index of differences” (PCA on changes in asset ownership). Performance measured by r^2 on held-out data. Weighted r^2 reported when aggregating by number of villages.
- Noise diagnostics: Compared DHS and satellite predictions to independent census-based district wealth; simulated impacts of GPS jitter by adding artificial noise and re-evaluating; estimated r^2 loss attributable to jitter (~0.07).
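The evaluation metric and the district-level aggregation can be sketched as below. The sketch assumes that r^2 denotes the squared Pearson correlation between predictions and survey values, and that the weighted variant weights each district by its number of villages; the function names are hypothetical.

```python
import numpy as np

def r2(pred, truth, weights=None) -> float:
    """Squared Pearson correlation, optionally weighted."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    w = np.ones_like(pred) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean_p = (w * pred).sum()
    mean_t = (w * truth).sum()
    cov = (w * (pred - mean_p) * (truth - mean_t)).sum()
    var_p = (w * (pred - mean_p) ** 2).sum()
    var_t = (w * (truth - mean_t) ** 2).sum()
    return float(cov ** 2 / (var_p * var_t))

def aggregate_to_districts(values, district_ids):
    """Return district ids, district means, and village counts per district."""
    ids, inverse = np.unique(district_ids, return_inverse=True)
    counts = np.bincount(inverse)
    means = np.bincount(inverse, weights=np.asarray(values, dtype=float)) / counts
    return ids, means, counts

# Synthetic demo: 300 villages in 30 districts, noisy predictions.
rng = np.random.default_rng(0)
districts = rng.integers(0, 30, size=300)
true_wealth = rng.normal(size=300) + 0.1 * districts
predicted = true_wealth + rng.normal(scale=0.8, size=300)

village_r2 = r2(predicted, true_wealth)
_, pred_d, n_villages = aggregate_to_districts(predicted, districts)
_, true_d, _ = aggregate_to_districts(true_wealth, districts)
district_r2 = r2(pred_d, true_d, weights=n_villages)
```

Averaging within districts cancels part of the independent village-level noise, which is consistent with the higher district-level fit reported in the findings.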
Scalability demonstration:
- Generated a 7.65 km/pixel gridded wealth map for Nigeria (2012–2014) by tiling inputs and running the trained MS+NL model; processing ~9.1 billion pixels. End-to-end runtime <30 hours (≈4 hours training on NVIDIA Titan X GPU; ≈24 hours imagery processing and raster generation). Aggregation to administrative units via population rasters (GHSL).
Key Findings
- Cross-sectional accuracy: The combined MS+NL CNN explains ~70% of the variation in village-level asset wealth in held-out countries (pooled r^2 ≈ 0.67; average within-country-year r^2 ≈ 0.70). Country-specific performance never falls below 0.50 and often exceeds 0.80; median ≈ 0.704.
- Aggregation improves fit: At the district level, predictions explain on average ~83% of variation in held-out countries (weighted r^2 up to ~0.83).
- Input modality contributions: CNNs trained only on MS or only on NL perform similarly to each other and close to the combined model for spatial variation. Direct use of NL imagery in CNNs outperforms transfer-learning approaches that use NL as proxy labels. A KNN model on scalar NL captures much of spatial variation but fails on temporal changes; linear NL models perform worst.
- Temporal changes: Using matched DHS clusters and LSMS panels, satellite-based models explain ~15–17% of variation in survey-measured changes at the village level. For the LSMS “index of differences,” MS-based models reach r^2 ≈ 0.35 at village level; aggregating to districts yields weighted r^2 ≈ 0.51 (unweighted ≈ 0.43). Nightlights alone contribute little to temporal prediction (r^2 < 0.01), while daytime MS is essential.
- Ground-truth validation: District-level DHS-based wealth correlates strongly with independent census-based wealth (weighted r^2 up to ~0.89). Satellite-predicted wealth also correlates highly with census measures (weighted r^2 up to ~0.83), only slightly below DHS vs. census.
- Error sources and limits: GPS jitter and survey noise materially degrade performance; extrapolation suggests location jitter reduces r^2 by ~0.07. Within-village heterogeneity is associated with lower performance. Simulations indicate that small average temporal wealth changes (≈0.08 SD) are difficult to detect given survey noise.
- Downstream utility: Satellite-based estimates closely recover the observed nonlinear relationship between maximum temperature and wealth; simple NL-based models do not. For hypothetical targeting of transfers to the bottom 50% by assets, MS+NL achieves ≈81% village-level targeting accuracy vs. ≈75% (transfer learning) and ≈62% (scalar NL), noting these are lower bounds due to ground data noise.
- Scalability: Country-scale mapping (Nigeria) is feasible with public imagery and modest compute, producing interpretable spatial gradients and handling artifacts (e.g., ignoring NL blooms from gas flares when not corroborated by MS).
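The targeting comparison can be made concrete with a small sketch. The exact definition behind the reported accuracies is not given here, so this assumes one plausible reading: a village is targeted when its wealth falls in the bottom share of the distribution, and accuracy is the fraction of villages whose predicted targeting status matches the survey-based status. The function name is hypothetical.

```python
import numpy as np

def targeting_accuracy(pred, truth, share: float = 0.5) -> float:
    """Fraction of villages given the same targeting status by both measures.

    Assumption: "targeted" means at or below the `share` quantile of the
    respective wealth measure.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    pred_targeted = pred <= np.quantile(pred, share)
    true_targeted = truth <= np.quantile(truth, share)
    return float((pred_targeted == true_targeted).mean())

# Toy demo with four villages: two of the four statuses agree.
survey_wealth = [1.0, 2.0, 3.0, 4.0]
model_wealth = [1.0, 3.0, 2.0, 4.0]
accuracy = targeting_accuracy(model_wealth, survey_wealth)
```

Because the survey values themselves are noisy, agreement with them understates agreement with true wealth, which is why the text treats the reported figures as lower bounds.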
Discussion
The study demonstrates that deep learning models trained on publicly available multispectral daytime imagery and nighttime lights can accurately and scalably estimate local-level asset wealth across sub-Saharan Africa, including in countries without training data. Spatial performance comparable to or exceeding prior benchmarks and strong validation against independent census measures indicate that public imagery contains rich socioeconomic signals. Temporal estimates, especially when aggregating to districts, capture meaningful changes in wealth, with daytime multispectral imagery providing most of the temporal signal.
These findings address the challenge of sparse, infrequent ground surveys by offering a complementary measurement approach that can fill spatial and temporal gaps, supporting research (e.g., environmental determinants of wealth) and policy applications (e.g., targeting social protection). The analyses highlight that model performance is constrained by noise in ground truth (survey sampling variability and DHS location jitter), not only by the capacity of satellite-based predictors. The results also reveal a performance–interpretability tradeoff: CNNs outperform simpler proxies like scalar nightlights but are less transparent, which may affect policy uptake. Nonetheless, feature visualizations and downstream validations suggest the models learn semantically meaningful patterns (urbanization, agricultural landscapes, water bodies, deserts) relevant to wealth.
Overall, satellite-derived wealth estimates can amplify ground survey efforts, enabling more frequent, granular, and timely socioeconomic monitoring. Integration with improved survey data, higher-resolution and multimodal remote sensing, and complementary data sources (e.g., mobile metadata) could further enhance accuracy and usability.
Conclusion
This work shows that publicly available satellite imagery combined with deep learning can produce accurate, scalable estimates of local asset wealth across Africa and track changes over time when aggregated appropriately. The approach outperforms or matches prior benchmarks, validates against independent census measures, and proves useful for downstream research and policy tasks, including targeting social programs. Key contributions include an end-to-end dual-input CNN architecture, cross-continental evaluation on held-out countries, temporal change estimation, validation against independent data, and a demonstration of country-scale mapping.
Future research directions include: improving interpretability of deep models for policy contexts; expanding to consumption-based poverty and other livelihood outcomes as more training data become available; leveraging higher-resolution optical and radar imagery with increased revisit frequencies; integrating additional passive data sources (e.g., mobile phone or social media); and refining methods to mitigate the impact of survey noise and location jitter on training and evaluation.
Limitations
- Ground-truth noise: DHS survey sampling variability, recall bias, and particularly GPS coordinate jitter (up to 2 km urban/10 km rural) misalign imagery with true locations, reducing performance (estimated r^2 loss ≈ 0.07).
- Temporal sensitivity: True changes in wealth over short horizons are small relative to cross-sectional differences and can be obscured by survey noise, limiting village-level temporal r^2; aggregation helps but may mask local heterogeneity.
- Interpretability: CNN-derived features are less transparent than simple proxies (e.g., scalar nightlights), potentially hindering adoption by policymakers.
- Heterogeneity within clusters: High within-village wealth variation reduces predictive performance, possibly due to both model limitations in heterogeneous contexts and noisier survey-based cluster means.
- Outcome scope: Asset indices are proxies, not direct consumption-based poverty measures; some assets and definitions vary across datasets. Certain socioeconomic dimensions (intra-household distribution, inequality within villages) are not directly observable from imagery.
- Data dependence: Although out-of-country generalization is strong, availability and quality of training data (e.g., for consumption outcomes) remain limiting for extending to other targets.