High-resolution population estimation using household survey data and building footprints

G. Boo, E. Darin, et al.

This study by Gianluca Boo and colleagues presents a Bayesian hierarchical model that integrates household surveys and building footprints to estimate population across five provinces in the DRC. Predicted population totals fit the survey data well (R² = 0.81 in-sample and 0.79 out-of-sample), which shows the value of combining diverse data sources in regions with limited statistical information.
Introduction

Accurate population figures are essential to support decision-making in areas such as urban planning, environmental hazard risk management, and public health. National population and housing censuses are the most complete and reliable source, but intercensal periods (often a decade or longer) mean data can quickly become outdated due to migration, fertility, and mortality changes. UN DESA produces national annual population estimates using projection models, but these can be highly uncertain, especially at subnational scales when census data are old. Bottom-up modeling addresses these limitations by producing high-resolution estimates independently from the census using recent household surveys that fully enumerate representative microcensus clusters and link them to geospatial covariates with full coverage (e.g., settlement extents, satellite imagery classes). Such models can incorporate additional covariates, stratification, and age/sex structures. In this work, the authors develop a Bayesian hierarchical bottom-up model that uses complex household survey data and building footprints to estimate population totals and age/sex breakdowns with uncertainty at ~100 m resolution across five provinces in western DRC, where the last census was in 1984.

Literature Review

The paper situates its contribution within prior work on population estimation: UN DESA projection models provide annual national estimates but lack subnational detail when censuses are outdated. Bottom-up approaches using microcensus-cluster enumerations have been shown to produce high-resolution estimates independent of censuses, leveraging settlement extents and satellite imagery classifications, with options for additional covariates, stratification, and demographic disaggregation. UNFPA highlighted the role of bottom-up models for census planning and preparation. Prior studies have explored census-independent mapping, hierarchical Bayesian frameworks for sparse survey data, and integrating survey data with remote sensing, as well as sub-national mapping of population pyramids. This study extends these by incorporating weighted-precision methods for complex survey designs and extensive building-footprint-derived covariates and settlement classifications.

Methodology

Study area and data: Geolocated household surveys fully enumerated 926 microcensus clusters (~3 settled hectares each) across five western DRC provinces in two rounds: May–July 2017 (Kinshasa, Kwango, Kwilu, Mai-Ndombe; random sampling) and Oct–Dec 2018 (Kinshasa, Kongo Central; population-weighted sampling). Essential demographic data (household size; age/sex) were collected, and the data were anonymized. Seed 100 m grid cells were selected, and cluster boundaries were manually delineated from satellite imagery to include ~3 settled hectares with similar settlement characteristics. Sampling weights for the 2018 round were assessed; outliers caused by uncertainties in the gridded population data used for sampling were truncated at the 90th percentile. Population totals per cluster were compiled from household counts (n=79,126 people), with nonresponse imputed (n=629) using the cluster mean household size. Province-level counts by standardized age/sex groups were aggregated from individual records. Ethical approvals were obtained from KSPH and the UCLA IRB; the analysis was approved by the University of Southampton.

Building footprints: Building footprints were automatically extracted by Ecopia.AI (2019) using Maxar imagery (generally <5% cloud coverage; on average more recent than 2017). Extraction accuracy may be affected in some areas by outdated imagery (as old as 2009) and contextual factors (clouds, smoke, canopy). Building footprints were used as the best available approximation of settled extent and structure distribution. Morphological and topological attributes (e.g., area, perimeter, nodes, distance to nearest building) were computed and summarized within clusters and ~100 m grid cells using sum/mean/CV; additional focal summaries were produced for ~500 m, 1 km, and 2 km windows.
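The sum/mean/CV summaries of footprint attributes described above can be sketched as follows. This is a minimal illustration, not the authors' processing code: the `(cell_id, area_ha)` input layout and the function name are hypothetical, and the coefficient of variation is assumed to use the population standard deviation, which the text does not specify.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_footprints(footprints):
    """Summarize building areas per grid cell as count, sum, mean, and CV.

    `footprints` is an iterable of (cell_id, area_ha) pairs. Both names are
    hypothetical stand-ins for whatever the footprint layer provides.
    """
    areas_by_cell = defaultdict(list)
    for cell_id, area_ha in footprints:
        areas_by_cell[cell_id].append(area_ha)

    summaries = {}
    for cell_id, areas in areas_by_cell.items():
        avg = mean(areas)
        # Assumption: CV = population SD / mean (the source does not say which SD).
        cv = pstdev(areas) / avg if avg > 0 else float("nan")
        summaries[cell_id] = {
            "count": len(areas),
            "sum": sum(areas),
            "mean": avg,
            "cv": cv,
        }
    return summaries


# Toy data: two buildings in cell "c1", one in cell "c2".
toy = [("c1", 0.01), ("c1", 0.03), ("c2", 0.02)]
result = summarize_footprints(toy)
```

The same grouping could be applied to perimeter, node count, or nearest-building distance, and to cluster polygons instead of grid cells.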
Settlement type (urban/rural) was assigned using CIESIN’s morphological classification of settlement extents derived from the same building footprints: urban defined by contiguous built-up areas ≥40 building footprint ha with at least thirteen footprints; remaining areas classified as rural (including small settlements and hamlets).

Administrative boundaries: Boundaries from BCR included country, provinces, territories/cities, and sectors/chiefdoms/municipalities (level 3 only for Kinshasa). Sub-provincial regions were created by combining level 2 and available level 3 units; in Kinshasa, 24 municipalities were merged into nine groups according to the Strategic Orientation Plan to ensure each sub-provincial region contained at least one microcensus cluster. Gridded datasets (~100 m) with unique identifiers for province and sub-provincial region were produced; clusters were allocated accordingly.

Covariate processing and selection: The effective cluster extent was constrained to building footprints within ~50 m of surveyed households to exclude inaccessible unsurveyed areas. Morphological/topological summaries from building footprints and additional standard gridded datasets (e.g., temperature, precipitation, land use, night-time lights) were extracted. Relationships between log population densities (people/building footprint ha) and candidate attributes were assessed via scatterplots and Pearson correlations. The five strongest linear associations were: building count, average building area (ha), average building perimeter (m), average building proximity (inverse distance to nearest building, m), and average building focal count (~2 km window). To avoid multicollinearity/circularity, average perimeter (correlated with area) and building count (used elsewhere in the model) were dropped. Selected covariates were mean-standardized using grid-cell-level statistics across the study area.
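The covariate screening step above can be sketched as follows, using only the standard library. The data values are invented for illustration, and the function names are hypothetical. "Mean-standardized" is assumed here to mean centering on the grid-cell mean and scaling by the grid-cell standard deviation, which the text does not spell out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_density(population, footprint_area_ha):
    """Log population density in people per building-footprint hectare."""
    return math.log(population / footprint_area_ha)


def standardize(value, grid_mean, grid_sd):
    """Assumed standardization: center and scale with grid-cell statistics."""
    return (value - grid_mean) / grid_sd


# Invented toy clusters: population, footprint area (ha), mean building area (ha).
population = [120, 80, 200, 60]
footprint_area = [0.5, 0.8, 0.4, 1.0]
avg_building_area = [0.004, 0.008, 0.003, 0.010]

log_dens = [log_density(n, a) for n, a in zip(population, footprint_area)]
r = pearson(avg_building_area, log_dens)
z = standardize(0.006, grid_mean=0.005, grid_sd=0.002)
```

In this toy data larger average building area goes with lower log density, so `r` is negative, which matches the direction of the effect reported for average building area.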
Population model: Population totals per cluster Ni were modeled as a Poisson process with mean Di Ai, where Ai is the total building-footprint area (ha) within the cluster and Di is the population density (people/building footprint ha). To relax the strict Poisson assumptions, Di followed a log-normal distribution with a hierarchical precision term τi estimated by settlement type and province and adjusted via a weighted-precision approach using survey sampling weights (vi = 1/wi, the inverse sampling weights normalized to sum to 1; weights for the random-sample round were imputed proportionally). Where weights were unavailable for prediction, τi was derived from a weighted average of τtp across clusters. The expected log-density (the mean of log Di) was modeled with a linear regression comprising a nested random intercept αtpi (settlement type, province, sub-provincial region) and covariate effects βk; covariate effects were initially estimated as random by settlement type and, where posterior distributions were similar between settlement types, converted to fixed effects. Uninformative priors were used for hierarchical means and variances. Model estimation used MCMC in JAGS 4.3.0 via runjags, with three chains; convergence was assessed via the Gelman–Rubin statistic (<1.1). Spatial autocorrelation of residuals was assessed with semivariograms and Moran’s I. Ten-fold cross-validation iteratively withheld 10% of clusters for out-of-sample evaluation.

Age/sex composition model: Province-level age/sex counts were modeled as a Dirichlet-multinomial: the observed counts Ngp across age/sex groups g within each province p followed a Multinomial(Np, πp) distribution, with the proportions πp assigned an uninformative Dirichlet prior (χ = 1/G). Age/sex proportions were not modeled sub-provincially because sample sizes for the smallest groups were too small.

Model fit metrics included bias, imprecision (SD of residuals), inaccuracy (mean absolute residuals), R², and coverage of 95% credible intervals; standardized residual metrics were also computed.
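For reference, the model structure described above can be written compactly as below. This is a reconstruction from the prose rather than the authors' published notation: the symbols \bar{D}_i (expected log-density), x_{k,i} (covariate k for cluster i), and the index r for sub-provincial region are assumptions introduced here, and τ_i is read as a precision on the log scale, following the convention of JAGS's log-normal distribution. The exact functional form of the weighted-precision adjustment is not given in the text and is therefore left out.

```latex
\begin{align}
N_i &\sim \mathrm{Poisson}(D_i A_i) \\
D_i &\sim \mathrm{LogNormal}(\bar{D}_i,\ \tau_i) \\
\bar{D}_i &= \alpha_{t,p,r} + \sum_{k} \beta_k\, x_{k,i} \\
(N_{1p}, \dots, N_{Gp}) &\sim \mathrm{Multinomial}(N_p,\ \pi_p) \\
\pi_p &\sim \mathrm{Dirichlet}(\chi, \dots, \chi), \qquad \chi = 1/G
\end{align}
```

Because A_i enters the Poisson mean multiplicatively, a predicted total for any area with known footprint area follows directly from the predicted density.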

Key Findings
  • High-resolution estimates: Produced 100 m grid estimates of population totals and age/sex breakdowns with uncertainty for five western DRC provinces; accessible via WorldPop Open Population Repository (WOPR) and WOPR Vision.
  • Data scope: 926 clusters enumerated; 21 discarded (7 undercounting due to limited survey coverage; 14 with no detected building footprints), leaving 905 clusters for modeling.
  • Population densities by settlement: Rural clusters showed highly heterogeneous and often higher observed population densities (people/building footprint ha) than urban clusters, particularly due to prevalence of non-residential footprints in urban areas and footprint detection issues in some rural clusters.
  • Hierarchical intercepts: Posterior means of intercepts were lower in urban settlements in Kinshasa and Kongo Central; 95% credible intervals wider in urban settlements, especially in Kongo Central, Kwango, and Mai-Ndombe, reflecting heterogeneous urban forms.
  • Covariate effects (significance assessed with 95% credible intervals): Average building proximity had a positive effect in rural settlements and was non-significant in urban settlements; average building focal count had a negative effect in rural settlements and a positive effect in urban settlements; average building area had a strong negative effect across both settlement types (modeled as a fixed effect).
  • Age/sex structures: Rural provinces (Kwango, Kwilu, Mai-Ndombe) exhibited broad-based pyramids narrowing with age; Kinshasa and Kongo Central had narrower bases (lower fertility), with Kinshasa showing a larger 20–49 age share consistent with work-related migration. Credible intervals were generally narrow at the province level.
  • Model diagnostics: Converged in 10,000 iterations for 3 chains. Population totals fit well: R² = 0.81 (in-sample) and 0.79 (out-of-sample). Population densities had lower fit: R² = 0.52 (in-sample) and 0.47 (out-of-sample). Approximately 90% of observations fell within 95% credible intervals for out-of-sample totals and densities, indicating robust uncertainty quantification. Slight bias observed: overprediction of totals and underprediction of densities at the cluster level; larger imprecision/inaccuracy for densities, especially in densely populated rural clusters due to building-footprint under-detection. No significant spatial autocorrelation in residuals. Province-level age/sex proportions had perfect fit: R² = 1.00 (in- and out-of-sample), with 100% coverage by 95% credible intervals.
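The fit metrics reported in the diagnostics above (bias, imprecision, inaccuracy, R², and interval coverage) can be computed along these lines. This is a sketch under stated assumptions: residuals are taken as predicted minus observed, so positive bias means overprediction; imprecision uses the population standard deviation; and R² is computed as one minus the ratio of residual to total sum of squares. The source does not state these conventions, and the toy numbers are invented.

```python
from statistics import mean, pstdev


def fit_metrics(observed, predicted, lower, upper):
    """Cluster-level fit metrics for point predictions and 95% intervals."""
    # Assumption: residual = predicted - observed (positive = overprediction).
    residuals = [p - o for o, p in zip(observed, predicted)]
    obs_mean = mean(observed)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    inside = [1.0 if lo <= o <= hi else 0.0
              for o, lo, hi in zip(observed, lower, upper)]
    return {
        "bias": mean(residuals),                            # mean residual
        "imprecision": pstdev(residuals),                   # SD of residuals
        "inaccuracy": mean([abs(r) for r in residuals]),    # mean absolute residual
        "r2": 1.0 - ss_res / ss_tot,                        # assumed R² definition
        "coverage": mean(inside),                           # share inside interval
    }


# Invented toy values for four clusters.
observed = [100, 200, 300, 400]
predicted = [110, 190, 310, 420]
lower = [80, 150, 250, 405]
upper = [130, 230, 350, 460]
metrics = fit_metrics(observed, predicted, lower, upper)
```

For out-of-sample evaluation, the same function would be applied to predictions for the clusters withheld in each cross-validation fold.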
Discussion

The study extends bottom-up population modeling by integrating complex household survey data with a weighted-precision Bayesian hierarchical framework and building-footprint-derived attributes. The approach leverages the strengths of microcensus-based enumeration to deliver timely, high-resolution estimates where censuses are outdated. The strong performance for population totals, robust uncertainty coverage, and interpretable covariate effects demonstrate the value of combining building footprints with survey data. The reduced fit for population densities is attributed to footprint detection issues (e.g., outdated or obscured imagery) and the prevalence of non-residential footprints in urban areas; importantly, totals remained well-estimated because building-footprint area acts as a multiplicative constraint in the Poisson component. Province-level age/sex estimates from the same survey data yielded precise demographic compositions, offering more current structures than those derived from older sources. Outputs can be flexibly aggregated for diverse operational needs (e.g., administrative units, health catchments) and support validation against small-area enumerations in public health campaigns. The model’s incorporation of weighted-precision addresses biases from probabilistic survey designs, though practical challenges with sampling weights persist. More broadly, the framework supports census planning and implementation and is applicable beyond sub-Saharan Africa when similar inputs are available.
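The flexible aggregation mentioned above relies on summing posterior draws across grid cells iteration by iteration, then summarizing the summed draws, rather than adding up per-cell interval bounds. The sketch below illustrates this with simulated draws. The log-normal parameters and footprint areas are invented, the Poisson step is skipped for brevity, and the draws are generated independently per cell, whereas real posterior draws would share model parameters within each iteration.

```python
import random


def aggregate_draws(cell_draws):
    """Sum draws across grid cells, keeping the iteration index aligned."""
    n_iter = len(cell_draws[0])
    return [sum(cell[i] for cell in cell_draws) for i in range(n_iter)]


def credible_interval(draws, level=0.95):
    """Simple order-statistic interval from a list of posterior draws."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    lower = ordered[int((1 - level) / 2 * last)]
    upper = ordered[int((1 + level) / 2 * last)]
    return lower, upper


random.seed(1)
footprint_area_ha = [0.12, 0.30, 0.05]  # invented grid-cell footprint areas
n_iter = 2000

# Hypothetical draws: density (people per footprint ha) times footprint area.
cell_draws = [
    [random.lognormvariate(5.0, 0.4) * area for _ in range(n_iter)]
    for area in footprint_area_ha
]

region_draws = aggregate_draws(cell_draws)
region_mean = sum(region_draws) / n_iter
region_low, region_high = credible_interval(region_draws)
```

The same pattern applies to any aggregation unit, such as an administrative unit or a health catchment, by selecting the grid cells that fall inside it.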

Conclusion

The study presents a Bayesian hierarchical bottom-up framework that fuses recent household surveys and building footprints to produce up-to-date, 100 m resolution population totals and age/sex compositions with quantified uncertainty across five provinces in DRC. Key contributions include: a weighted-precision approach enabling the use of complex survey designs (e.g., DHS/MICS-like), modeling of age/sex structures from the same surveys, and the integration of morphological/topological building-footprint attributes for both settlement classification and covariates. The model fits population totals well, provides robust uncertainty estimates, and yields interpretable covariate effects, despite known limitations in building-footprint data affecting density estimates. Outputs are openly available via WOPR for decision-making and can aid multiple stages of census operations. Future work should address measurement error in surveys and footprints, refine handling of sampling weights, and explore hierarchical demographic modeling at finer administrative levels where sample sizes permit, with continued expansion to other regions through GRID3.

Limitations
  • Assumed no observation error in household survey population totals; inaccessible areas within microcensus clusters likely lead to undercounts.
  • Assumed no population outside areas delineated by building footprints; under-detection due to outdated or obscured imagery (clouds, smoke from slash-and-burn, canopy) can underestimate totals and inflate observed densities in affected clusters.
  • Weighted-precision approach is sensitive to sampling-weight uncertainties; weight outliers due to underlying gridded population inputs required truncation at the 90th percentile; use of seed-based weights introduces additional uncertainty as they may not represent selection probabilities for all locations within clusters.
  • Age/sex proportions modeled at province level only; assumes within-province homogeneity, potentially masking sub-provincial variation and leading to sensitivity where small proportional differences translate into large count differences when applied to totals.
  • Building-footprint data include non-residential structures, influencing urban density relationships; variability in urban form increases uncertainty in urban intercepts.
  • Lack of recent official subnational population figures and inconsistencies in administrative boundaries hinder direct comparisons and validation against official statistics.
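The sensitivity noted for province-level age/sex proportions can be illustrated with the conjugate form of the Dirichlet-multinomial model: with a Dirichlet(χ, ..., χ) prior and observed counts, the posterior is Dirichlet with parameters χ plus the counts. The sketch below computes posterior mean proportions with χ = 1/G and applies them to a population total. The counts and the total are invented, and the function names are hypothetical.

```python
def posterior_mean_proportions(counts):
    """Posterior mean proportions under a Dirichlet(1/G, ..., 1/G) prior."""
    n_groups = len(counts)
    chi = 1.0 / n_groups
    # Posterior is Dirichlet(chi + count_g); its mean normalizes by the sum.
    denominator = sum(counts) + n_groups * chi
    return [(c + chi) / denominator for c in counts]


def disaggregate(total_population, proportions):
    """Split a population total into groups using fixed proportions."""
    return [total_population * p for p in proportions]


# Invented counts for three age/sex groups in one province.
counts = [30, 50, 20]
proportions = posterior_mean_proportions(counts)

# A one-percentage-point shift in a proportion moves 10,000 people
# when applied to a total of 1,000,000.
shift_in_people = disaggregate(1_000_000, [0.01])[0]
```

Applying one set of province-level proportions to every grid cell assumes the same age/sex structure everywhere in the province, which is the homogeneity assumption listed above.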