A spatio-temporal analysis investigating completeness and inequalities of global urban building data in OpenStreetMap
B. Herfort, S. Lautenbach, et al.
OpenStreetMap building data are highly complete in some urban areas but lag far behind in many others. This research, conducted by Benjamin Herfort, Sven Lautenbach, João Porto de Albuquerque, Jennings Anderson, and Alexander Zipf, quantifies these inequalities and discusses their implications for urban planning.
Introduction
The study addresses how complete OpenStreetMap (OSM) building data are for global urban analyses and how unevenly these data are distributed within and across cities worldwide. Rapid urban population growth and built-up expansion, particularly in low- and lower-middle income countries (LMICs), heighten the need for consistent global building inventories to support the Sustainable Development Goals (SDGs), notably SDG 11. Traditional national statistical systems are underfunded and often lack accessible, standardized, and up-to-date geospatial baselines, which motivates the use of non-traditional sources. OSM is widely used across domains (urban planning, SDG monitoring, disaster management, public health), but its volunteer-driven nature introduces geographic biases, with stronger contributions in high-income countries and targeted humanitarian efforts in LMICs. A key requirement for many applications is building data completeness; spatial bias that is not accounted for can lead to misleading generalizations. This work investigates: (1) the completeness of OSM building data for global urban analysis, and (2) inequalities in its spatial distribution within cities, across regions, and globally.
Literature Review
The paper situates its contribution within prior OSM data quality research and global urban analytics. Earlier studies compared OSM against authoritative datasets to assess completeness in selected cities, but generalizing methods globally is constrained by limited reference data. Proxy data such as remote sensing (nighttime lights, built-up area, Sentinel-2 indices) and population datasets have been proposed to estimate completeness. Corporate and humanitarian mapping efforts have grown, with corporations focusing mostly on roads (and offering auxiliary building datasets like Microsoft’s), while humanitarian mapping has expanded building coverage in specific LMIC areas. Prior work found higher and more even completeness for OSM roads than buildings, and indicated that completeness and accuracy patterns for OSM land use/land cover (LULC) differ. This study extends these threads by assembling broader training data, applying machine learning to infer building area and completeness globally, and by analyzing spatial inequality using segregation metrics.
Methodology
Study scope: 13,189 urban centers globally (GHS-UCDB), defined as contiguous 1 km² grid-cell clusters with ≥1,500 inhabitants/km² and ≥50,000 population. Each urban center was spatially disaggregated into ~1 km² grid cells (equal-area Mollweide). OSM building area per cell (polygons with building=*) was computed via the ohsome API (as of 2023-01-01).
Reference building data: A composite of authoritative and commercial sources (including Microsoft Global ML Building Footprints), intersected with the grid to obtain per-cell reference building area; authoritative sources were preferred where available. Reference coverage spanned 6,633 urban centers (404,982 grid cells) across 162 countries. The suitability of the Microsoft building data was assessed using Geo-Wiki built-up reference samples to compute precision and recall.
Explanatory variables: GHS-POP (population), Subnational Human Development Index (SHDI), VIIRS nighttime lights (annual average 2020), ESA WorldCover 2020 (10 m) per-class area, and OSM main-road length per grid cell (primary to residential). Some covariates reflect 2020 conditions while completeness is analyzed for 2008–2023, implying a temporal mismatch.
Modeling: A Random Forest regressor (scikit-learn) predicted building area per grid cell from the covariates. Completeness per cell was not directly predicted; instead, completeness per urban center was inferred as the ratio of OSM building area to predicted building area, aggregated annually (2008-01-01 to 2023-01-01). Model evaluation used 20-fold spatial cross-validation via k-means spatial blocking; metrics included r^2, explained variance, MSE, and MAE; residual spatial autocorrelation was assessed with Moran's I.
Inequality analysis: Global and regional non-spatial evenness was measured with the Gini coefficient (from Lorenz curves of observed vs missing building area across urban centers), and spatial clustering with Moran’s I (global and local), using distance-band weights (5 degrees, inverse-distance power 1, row-standardized).
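The completeness ratio and the Gini-based evenness measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the tuple layout, the function names `completeness_by_center` and `segregation_gini`, the cap at 1.0, and the use of the standard segregation-style Gini index (computed from mapped vs. missing area shares) are all assumptions made for this sketch.

```python
from collections import defaultdict


def completeness_by_center(cells):
    """Aggregate per-cell areas into one completeness ratio per urban center.

    `cells` is an iterable of (center_id, osm_area, predicted_area) tuples,
    a hypothetical layout chosen for this sketch.
    Returns {center_id: (completeness, predicted_area)}.
    """
    osm = defaultdict(float)
    pred = defaultdict(float)
    for center_id, osm_area, predicted_area in cells:
        osm[center_id] += osm_area
        pred[center_id] += predicted_area
    # Cap at 1.0 (an assumption): OSM area can exceed a conservative prediction.
    return {
        c: (min(osm[c] / pred[c], 1.0), pred[c])
        for c in pred
        if pred[c] > 0
    }


def segregation_gini(units):
    """Segregation-style Gini index over (completeness, total_area) pairs.

    G = sum_i sum_j t_i t_j |p_i - p_j| / (2 T^2 P (1 - P)),
    where t_i is the total (predicted) building area of unit i, p_i its
    completeness, T the sum of all t_i, and P the area-weighted overall
    completeness.
    """
    units = list(units)
    total = sum(t for _, t in units)
    overall = sum(p * t for p, t in units) / total
    if overall in (0.0, 1.0):
        return 0.0  # nothing mapped or everything mapped: no unevenness
    pairwise = sum(
        ti * tj * abs(pi - pj) for pi, ti in units for pj, tj in units
    )
    return pairwise / (2 * total ** 2 * overall * (1 - overall))


# Toy data: center "A" is fully mapped, center "B" is unmapped.
cells = [
    ("A", 0.30, 0.30), ("A", 0.10, 0.10),
    ("B", 0.00, 0.25), ("B", 0.00, 0.15),
]
per_center = completeness_by_center(cells)
gini = segregation_gini(per_center.values())  # close to 1.0 for this split
```

Two equally sized centers, one fully mapped and one unmapped, give a value close to the maximum of 1, while identical completeness everywhere gives 0. The pairwise sum is quadratic in the number of units, which is acceptable for a sketch but would be replaced by a sorted Lorenz-curve computation for large inputs.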
Intra-urban analysis: For urban centers ≥25 km2 (4,722 centers), completeness was computed per 1 km cell; intra-urban Gini and Moran’s I used Queen contiguity weights; agglomerative hierarchical clustering (Ward linkage; Euclidean distance over completeness, Gini, Moran’s I) defined three main clusters with subtypes for low/moderate completeness patterns. Contributions categorization: Humanitarian mapping identified via HOT Tasking Manager logs; corporate mapping identified via OSM user bios (Apple, Meta, Mapbox, Microsoft, Kaart). Shares computed by building area added. Time series aggregation was provided by world region (World Bank), SHDI class, and city size (OECD thresholds).
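The intra-urban Moran’s I with Queen contiguity mentioned above can be illustrated on a small grid of per-cell completeness values. The sketch below assumes a full rectangular grid (real urban centers have irregular outlines) and row-standardized binary weights; the function names `queen_neighbors` and `morans_i` are hypothetical and not taken from the study.

```python
def queen_neighbors(n_rows, n_cols, r, c):
    """Return the in-grid cells that share an edge or a corner with (r, c)."""
    return [
        (r + dr, c + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
        and 0 <= r + dr < n_rows
        and 0 <= c + dc < n_cols
    ]


def morans_i(grid):
    """Global Moran's I on a rectangular grid with row-standardized queen weights."""
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    z = [[v - mean for v in row] for row in grid]
    denom = sum(d * d for row in z for d in row)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant grid")
    num = 0.0
    for r in range(n_rows):
        for c in range(n_cols):
            nbrs = queen_neighbors(n_rows, n_cols, r, c)
            # Row standardization: the spatial lag is the mean of the neighbors.
            lag = sum(z[i][j] for i, j in nbrs) / len(nbrs)
            num += z[r][c] * lag
    # With row-standardized weights the factor n / S0 equals 1.
    return num / denom


# Mapped western half, unmapped eastern half: strong positive clustering.
clustered = [[1, 1, 0, 0] for _ in range(4)]
# Alternating cells: mapped cells sit next to unmapped ones.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

i_clustered = morans_i(clustered)
i_checker = morans_i(checkerboard)
```

A city split into one mapped and one unmapped block yields a clearly positive value, whereas an alternating pattern yields a negative one. This mirrors the contrast the clustering step draws between cities with localized hotspots and cities with scattered edits.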
Key Findings
- Completeness distribution: 1,848 cities exceeded 80% OSM building completeness (14% of analyzed cities; home to 492 million people, ~16% of urban population). In contrast, 9,163 cities (69% of analyzed; ~48% of urban population) had <20% completeness. Global average urban OSM building completeness was 24% per urban center.
- Regional completeness: Europe & Central Asia 71%; North America 64%; Sub-Saharan Africa 30%; Latin America & Caribbean 20%; East Asia & Pacific 20% (strongly influenced by minimal mapping in China); Middle East & North Africa 12%; South Asia 9%.
- SHDI and size: Urban centers in very high SHDI regions had the highest completeness. However, completeness did not increase monotonically with SHDI: low SHDI centers sometimes surpassed high SHDI centers, likely due to post-2015 humanitarian mapping. Larger metropolitan areas exhibited higher completeness than small urban areas.
- Humanitarian and corporate mapping shares: Humanitarian mapping contributed ~10% of global urban building area edits and more than 50% of building edits in Sub-Saharan Africa. Corporate mapping accounted for <2% globally and ~0.1% in urban centers, contrasting with the ~20% corporate share in OSM roads reported elsewhere.
- Spatial inequality: Global Gini coefficient of completeness was ~0.8, indicating strong unevenness, most pronounced in South Asia and Sub-Saharan Africa. Moran’s I increased through 2014 (indicating clustered growth near already mapped cities) and then declined from ~0.71 (2014) to ~0.56 (2023), suggesting reduced clustering over time; both Gini and Moran’s I stagnated since 2019, with Gini rising again after 2021.
- Intra-urban patterns: Three main city types identified. Type (1) very low completeness: (1a) largely unmapped with scattered edits (high Gini, low Moran’s I; e.g., Faisalabad), (1b) hardly mapped with localized hotspots (higher Moran’s I; e.g., Guadalajara). Type (2) divided cities: (2a) a few well-mapped clusters amid unmapped areas (e.g., Las Vegas), (2b) large mapped blocks coexisting with unmapped neighborhoods (e.g., Abidjan). Type (3) well-mapped cities: high overall completeness and low clustering (e.g., Paris).
- Model performance (20-fold spatial CV): Building area prediction (1 km grid): global r^2=0.74, MAE=0.034 km^2; regional r^2 ranged 0.67–0.84. OSM completeness prediction (urban centers): global r^2=0.90, MAE=0.055; regional r^2 ranged 0.81–0.93. Residuals showed slight spatial autocorrelation (Moran’s I≈0.29) and some heteroscedasticity in certain regions.
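The spatial cross-validation behind the model performance figures keeps nearby grid cells out of the training set when they are used for testing. The following is a minimal sketch of k-means spatial blocking in plain Python. It assumes that each k-means cluster forms one fold, which the summary does not confirm; the deterministic initialization and the function names `kmeans_labels` and `spatial_folds` are likewise choices made for this illustration. The study itself used scikit-learn with 20 folds.

```python
import math


def kmeans_labels(points, k, n_iter=50):
    """Plain Lloyd's k-means; returns one block label per point.

    Initial centers are k evenly spaced input points, a deterministic
    simplification made for this sketch.
    """
    step = len(points) / k
    centers = [points[int(i * step)] for i in range(k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def spatial_folds(points, k):
    """Yield (train_idx, test_idx), holding out one spatial block at a time."""
    labels = kmeans_labels(points, k)
    for block in sorted(set(labels)):
        test_idx = [i for i, lab in enumerate(labels) if lab == block]
        train_idx = [i for i, lab in enumerate(labels) if lab != block]
        yield train_idx, test_idx


# Toy cell centroids forming three well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
folds = list(spatial_folds(points, k=3))
```

Each fold's test set is one contiguous spatial block, so the reported scores reflect prediction in areas the model has not seen rather than interpolation between neighboring cells.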
Discussion
The findings quantify substantial global and intra-urban inequalities in OSM building coverage, directly addressing the research questions on completeness levels and their unequal spatial distribution. While Europe & Central Asia and North America show higher completeness, many regions, especially South Asia and parts of MENA and LAC, remain far below the levels needed for robust comparative urban analyses. Humanitarian mapping has measurably improved completeness in low and medium SHDI contexts since 2015, partially offsetting historical biases. The evolution of inequality shows that early growth clustered near already well-mapped areas and gave way to broader diffusion after 2014, although progress has plateaued since 2019, around the COVID-19 pandemic period. The study underscores implications for policy and research: analysts must explicitly assess and adjust for coverage bias to avoid skewed conclusions, and mapping communities can use completeness maps to strategically target underrepresented areas. The results also highlight the contrast between building and road network completeness, pointing to the need for domain-specific quality assessments and tailored data fusion strategies.
Conclusion
This work delivers a global, spatio-temporal assessment of OSM building completeness across 13,189 urban centers, introducing a machine-learning framework to infer completeness where authoritative references are unavailable, and applying segregation metrics to characterize inequalities across scales. It provides actionable completeness maps and an open dataset to support bias-aware urban analyses and to guide mapping prioritization. Future research should: (1) extend to rural areas and develop localized completeness models where global modeling is less certain; (2) integrate OSM with machine learning-derived and official datasets into ensemble building inventories; (3) assess additional quality dimensions (e.g., accuracy, building use attributes) and their relationship to completeness and mapping trajectories; and (4) further analyze temporal dynamics of inequality and mapping behavior to inform equitable data production and SDG monitoring.
Limitations
- Scope limited to urban centers; results may not generalize to rural areas, where OSM data quality often differs, except where humanitarian efforts targeted rural zones.
- Temporal mismatch: Explanatory datasets (e.g., WorldCover, GHS-POP, nighttime lights) represent 2020 conditions, while completeness is analyzed for 2008–2023, potentially overestimating completeness in rapidly urbanizing areas post-2020 and underestimating for earlier years.
- Training data biases: Authoritative datasets may be outdated; Microsoft building footprints exhibit variable recall by region (generally sufficient in urban areas but potentially lower in rural contexts), potentially biasing building area predictions and inflating inferred completeness where reference undercounts exist.
- Geographic gaps: Uncertainty is higher for countries with many urban centers lacking training data (e.g., China), though feature importance suggests limited dependency on region codes.
- Model uncertainty: Residuals show slight spatial autocorrelation and heteroscedasticity in some regions; predictions tend to be conservative but still carry varying uncertainty (higher for Sub-Saharan Africa).
- Road network completeness varies regionally; while included as a predictor, incomplete road data in some urban centers could affect predictions.

