A human-machine collaborative approach measures economic development using satellite imagery
D. Ahn, J. Yang, et al.
Reliable measures of economic activity are difficult to collect in many developing countries, constraining research and policy. North Korea exemplifies extreme data scarcity, with limited recent census information. Alternative approaches such as expert interviews, official media reports, and nighttime lights provide partial insights but suffer from limited precision and coverage. Recent computer vision models using satellite imagery can infer socioeconomic indicators like consumption and assets, especially when combined with auxiliary data (e.g., Wikipedia, mobile phone–based metrics). However, most methods rely on substantial ground-truth labels that are often unavailable in low-income and hard-to-access settings. To address this gap, the authors propose a human-machine collaborative deep learning model that predicts grid-level economic development from publicly available satellite imagery using lightweight human-provided relative rankings, avoiding dependence on ground-truth socioeconomic labels. The study applies this approach to North Korea and evaluates generalizability across five least developed countries in Asia (Nepal, Myanmar, Cambodia, Bangladesh, Laos).
The paper situates its contribution within work that measures economic development via alternative and remotely sensed data. Prior methods for North Korea have relied on interviews, official North Korean media, and nightlight imagery, each with limitations in precision and spatial coverage. In broader contexts, computer vision on satellite imagery has predicted poverty and wealth across Sub-Saharan Africa and Southeast Asia, with improved performance when integrating nontraditional sources such as Wikipedia and mobile audience estimates. Nevertheless, these approaches generally require extensive ground-truth labels. Semi-supervised or transfer learning strategies can help but still depend on labeled data from other regions. The literature thus highlights a need for methods that can generate economic indicators without substantial ground-truth labels, motivating the proposed human-guided ranking and machine learning framework.
Data: The study uses Sentinel-2 daytime satellite imagery for North Korea from 2016–2019 at 10 m per pixel (resampled to 9.557 m per pixel), in 256×256 pixel tiles representing ~2.45×2.45 km grids. The approach is also tested on five least developed countries (LDCs) in Asia.
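The grid size follows directly from the resampled resolution and the tile dimensions. A short check of that arithmetic, using only the figures quoted above (the variable names are illustrative):

```python
# Ground footprint of one image tile, from the figures quoted in the text.
PIXEL_SIZE_M = 9.557   # resampled ground resolution, metres per pixel
TILE_SIDE_PX = 256     # tile width and height, in pixels

tile_side_km = PIXEL_SIZE_M * TILE_SIDE_PX / 1000.0
tile_area_km2 = tile_side_km ** 2

print(f"tile side: {tile_side_km:.2f} km")     # about 2.45 km
print(f"tile area: {tile_area_km2:.2f} km^2")  # just under 6 km^2
```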
Overview: A three-stage human-machine collaborative pipeline produces a grid-level score (siScore, 0–1) reflecting economic development as visually captured by features like building density, roads, and agricultural conversion.
Stage 1 – Clustering satellite images: The authors use DeepCluster (unsupervised deep learning) to learn image embeddings and cluster images by visual similarity. Given that uninhabited/natural areas dominate North Korea’s terrain, a pre-training step isolates uninhabited regions into a single cluster to focus subsequent clustering on inhabited areas. Silhouette analysis over the embedding space determines the optimal cluster number; 23 clusters are used for North Korea.
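The silhouette step can be illustrated on toy data. The sketch below is not the authors' pipeline: it assumes 2-D points in place of DeepCluster embeddings and a small deterministic k-means in place of the deep clustering model, and all function names are hypothetical. It only shows how a mean silhouette score can be compared across candidate cluster counts.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain k-means, used here only as a stand-in clustering method."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster and b to the nearest other cluster."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in idx) / len(idx)
            for lab, idx in clusters.items() if lab != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy "embeddings": three well-separated groups of five points each.
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, "-> chosen k =", best_k)
```

On this toy data the score peaks at the true number of groups; on real embeddings the curve is usually flatter, so the chosen value is a judgment informed by the scores rather than a guarantee.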
Stage 2 – Human-guided ranking via Partial Order Graph (POG): Human annotators inspect representative images from each cluster and provide relative rankings of clusters using four operators (higher than, less than, equal, incomparable). The result is a POG that encodes the ordinal relations of cluster-level development. Ten annotators from three groups (economists, satellite imagery experts, and North Korean defectors) performed the task, typically within ~2 hours, indicating a lightweight labeling effort. Individual POGs are ensembled into a representative POG. The paper notes that data-guided POGs can also be constructed from existing nationwide spatial data (e.g., nightlights, land cover) by ranking clusters based on average values, but the human-guided approach avoids dependence on pre-existing ground-truth statistics.
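To make the POG idea concrete, the sketch below stores each annotator's pairwise judgments and combines them by strict majority vote. The summary does not say how the authors ensemble individual POGs, so the voting rule, the cluster names, and the function names here are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical annotations: (cluster_a, cluster_b) -> relation of a to b.
# ">" a is more developed, "<" a is less developed,
# "=" equal, "?" incomparable.
annotations = [
    {("A", "B"): ">", ("B", "C"): ">", ("A", "C"): ">"},
    {("A", "B"): ">", ("B", "C"): "=", ("A", "C"): ">"},
    {("A", "B"): ">", ("B", "C"): ">", ("A", "C"): "?"},
]

def ensemble_pog(annotations):
    """Combine annotators by strict majority per pair (an assumed rule).
    Pairs without a strict majority are marked incomparable."""
    votes = {}
    for annotation in annotations:
        for pair, relation in annotation.items():
            votes.setdefault(pair, Counter())[relation] += 1
    consensus = {}
    for pair, counter in votes.items():
        relation, count = counter.most_common(1)[0]
        consensus[pair] = relation if 2 * count > sum(counter.values()) else "?"
    return consensus

def higher_than_edges(consensus):
    """Directed edges (more_developed, less_developed) of the ensembled POG.
    Equal and incomparable pairs produce no edge."""
    edges = set()
    for (a, b), relation in consensus.items():
        if relation == ">":
            edges.add((a, b))
        elif relation == "<":
            edges.add((b, a))
    return edges

consensus = ensemble_pog(annotations)
edges = higher_than_edges(consensus)
print(sorted(edges))
```

A fuller version would also check the combined graph for cycles, which majority voting over pairs does not rule out.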
Stage 3 – Rank-wise score model (siScore): A CNN is trained to assign a scalar siScore to each grid image such that the distribution of scores respects the POG ordering (e.g., clusters judged more developed receive higher scores). The training objective is formulated to maximize alignment with POG orderings (equivalently maximizing Spearman correlation between scores and cluster order). Because rank operations are non-differentiable, a differentiable ranking surrogate is used to approximate ranks during backpropagation. The model also learns within-cluster variability so individual images in the same cluster can receive different scores based on visual cues.
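The role of the differentiable ranking surrogate can be sketched as follows. The summary does not specify which surrogate the authors use; the pairwise-sigmoid "soft rank" below is one common choice and is an assumption, as are the temperature `tau`, the toy scores, and the target ranks. The code is plain Python for readability; the same expressions written in an autodiff framework would yield gradients with respect to the scores.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def soft_ranks(scores, tau=0.01):
    """Smooth approximation of ascending ranks (1 = lowest score).
    Each pairwise comparison is a sigmoid instead of a hard step, so the
    result varies smoothly with the scores; as tau -> 0 it approaches
    the ordinary integer ranks."""
    return [
        1.0 + sum(sigmoid((si - sj) / tau) for j, sj in enumerate(scores) if j != i)
        for i, si in enumerate(scores)
    ]

def pearson(u, v):
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def spearman_loss(scores, target_ranks, tau=0.01):
    """1 - (approximate) Spearman correlation: Pearson correlation between
    soft ranks of the scores and the target ranks. Lower is better."""
    return 1.0 - pearson(soft_ranks(scores, tau), target_ranks)

# Hypothetical target ranks derived from a POG (4 = most developed).
target_ranks = [4, 1, 3, 2]
loss_aligned = spearman_loss([0.9, 0.2, 0.6, 0.4], target_ranks)
loss_reversed = spearman_loss([0.2, 0.9, 0.4, 0.6], target_ranks)
print(loss_aligned, loss_reversed)
```

Scores that respect the target order give a loss near 0 and fully reversed scores give a loss near 2, so minimizing this quantity pushes the model's scores toward the POG ordering.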
Evaluation and baselines: For North Korea, with limited contemporaneous ground-truth, validation uses (i) a manually constructed building footprint dataset (2014) covering ~70% of the country to derive building area; (ii) establishment counts by district from North Korean news mentions; and (iii) district-level population density from the 2008 census. For the five LDCs, official census and survey-based statistics provide ground truth. Baselines include: (1) NL-regression using VIIRS 2019 nightlights; (2) NL-guided POG; (3) land cover–guided POG; and (4) Relative Wealth Index (RWI). An additional transfer-learning baseline (train on another country with labels, then apply to North Korea) is considered in supplementary analyses.
Interpretability: Grad-CAM highlights image regions contributing to siScore, enabling inspection of features like newly constructed buildings, roads, and reclaimed land.
Statistical analysis: Grid-level regressions (OLS with province fixed effects; standard errors clustered at province level) relate changes in siScore (or nightlights) from 2016–2019 to region-specific features: distances to economic/political hubs (capital, nearest city, major port), Economic Development Zones (EDZs; agriculture, tourism, industrial, export processing), mining sites (gold, coal, copper, iron), and nuclear-related sites (test sites, uranium mines), controlling for district population (2008) and area. Both continuous change (Δ log outcomes) and indicators for positive change are analyzed.
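The effect of province fixed effects can be seen in a minimal sketch with a single regressor. It uses the within transformation (subtracting province means before fitting), which is equivalent to including province dummies in OLS. The data are invented, the function name is hypothetical, and clustered standard errors and the other covariates in the paper's specification are omitted.

```python
from collections import defaultdict

def within_slope(y, x, groups):
    """OLS slope of y on x after subtracting group means from both,
    i.e. a regression with group fixed effects and one regressor."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # group -> [sum_y, sum_x, n]
    for yi, xi, g in zip(y, x, groups):
        sums[g][0] += yi
        sums[g][1] += xi
        sums[g][2] += 1
    sxy = sxx = 0.0
    for yi, xi, g in zip(y, x, groups):
        y_bar = sums[g][0] / sums[g][2]
        x_bar = sums[g][1] / sums[g][2]
        sxy += (xi - x_bar) * (yi - y_bar)
        sxx += (xi - x_bar) ** 2
    return sxy / sxx

# Invented toy data: outcome = province effect - 0.5 * distance.
province = ["P", "P", "P", "Q", "Q", "Q"]
distance = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
effect = {"P": 1.0, "Q": 3.0}
outcome = [effect[p] - 0.5 * d for p, d in zip(province, distance)]

fe_slope = within_slope(outcome, distance, province)
# Putting every grid in one group reduces to ordinary pooled OLS.
pooled_slope = within_slope(outcome, distance, ["all"] * len(outcome))
print(fe_slope, pooled_slope)
```

Here the fixed-effects estimate recovers the slope of -0.5 used to build the data, while the pooled estimate is attenuated because the province with larger distances also has a higher baseline.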
- The model produces fine-grained, grid-level economic development scores (siScore) for North Korea (2016–2019), revealing higher development in western plains, eastern coastal ports, and particularly around Pyongyang; lower scores are observed in central and northern mountainous regions. The approach offers higher spatial resolution than nightlights.
- Performance: In North Korea, siScore correlates strongly with proxies of development—grid-level Spearman ρ = 0.77; district-level R² up to 0.83—despite the absence of contemporaneous ground-truth labels. Across five LDCs in Asia, performance is comparable to or better than baselines.
- Against baselines: The human-machine model often matches or outperforms NL-regression, NL-guided POG, land cover–guided POG, and RWI. A transfer-learning baseline trained on a different country underperforms substantially (ρ ≈ 0.50 vs. 0.77 for the proposed model in North Korea).
- Temporal change (2016–2019): Increases in siScore are concentrated in and around Pyongyang and in areas with state-led development projects (e.g., Samjiyon, Kalma). Grad-CAM highlights new structures, road networks, and reclaimed land contributing to score increases.
- Regression analysis: Greater distance from major hubs (Pyongyang, nearest city) is associated with less development (negative coefficients for Δ ln(siScore) and reduced likelihood of positive change). EDZs oriented toward agriculture and tourism show higher development and a higher likelihood of positive change relative to industrial/export processing EDZs and non-EDZ areas. Most mining site indicators are not consistently significant, but districts with uranium mines show relatively higher development signals in siScore. Nightlight-based outcomes do not replicate all siScore patterns (e.g., distance to capital and EDZ agriculture/tourism effects are not evident with nightlights), underscoring that nightlights and siScore capture different facets of development.
- Generalizability: Application to five Asian LDCs (≈400,000 grids; ~300 million population) demonstrates broad applicability beyond North Korea.
The study demonstrates that a human-machine collaborative approach can generate high-quality, granular economic development indicators without relying on scarce or unavailable ground-truth socioeconomic labels. By combining unsupervised visual clustering with lightweight human ranking to form a POG, and training a rank-respecting CNN, the model captures physical manifestations of development—buildings, roads, industrial or tourism infrastructure—visible in daytime imagery. The resulting siScore complements nightlight-based measures by emphasizing physical capital stock rather than nocturnal utilization, revealing development patterns in data-poor regions such as North Korea. Empirical validations against building footprints, establishment counts, and population data, along with competitive or superior performance relative to established baselines, indicate that the method effectively addresses the core challenge of measuring development in low-information environments. The observed concentration of development around Pyongyang and in designated agriculture/tourism EDZs aligns with expectations under a planned economy and recent state-led projects. The divergence between siScore and nightlights highlights that different proxies illuminate distinct dimensions of development, suggesting that combining such measures may provide a more complete picture for policy and research.
This work introduces a scalable human-machine collaborative framework that infers grid-level economic development from publicly available satellite imagery using minimal human input and no ground-truth socioeconomic labels. Applied to North Korea and extended to five Asian LDCs, the model provides interpretable, high-resolution development maps, detects temporal changes, and often outperforms or matches established baselines. The approach can inform policy design in data-scarce settings, such as monitoring urban growth and regional inequality. Future research directions include: integrating additional remote sensing modalities (e.g., near-infrared bands, higher-resolution or proprietary imagery); improving training via tighter coupling of clustering and human guidance (e.g., active learning); enhancing robustness to initial clustering noise; and extending and validating cross-country comparability towards global-scale applications.
- Human annotation subjectivity: POG rankings reflect subjective judgments, potentially influenced by annotator expertise and biases. The ensemble approach mitigates, but does not eliminate, such variability.
- Remote sensing constraints: Cloud cover and limited temporal frequency can obscure features and miss dynamic economic activities (e.g., factory utilization, daily mobility, underground operations), so static imagery tends to capture the stock of physical capital while underrepresenting how intensively it is used.
- Proxy limitations and differences: siScore captures physical capital visible in daytime imagery and may diverge from other proxies like nightlights that reflect nighttime utilization, particularly in regions with weak illumination or unreliable electricity supply.
- Privacy and dual-use concerns: Even low-resolution predictive analyses may raise ethical considerations. Responsible use requires safeguards and standards to prevent misuse.
- Top-down perspective: Remote sensing approaches cannot capture individuals’ lived experiences; complementary on-the-ground information remains important for policy decisions.