Medicine and Health
Generating synthetic population for simulating the spatiotemporal dynamics of epidemics
K. Zhu, L. Yin, et al.
The study addresses how to generate realistic synthetic populations for agent-based epidemic simulations when full census microdata are unavailable. Existing population synthesizers often match only marginal distributions or inadequately capture intra-household structural relationships (e.g., age composition across generations), which can bias transmission dynamics, particularly for diseases with age-dependent risks and within-household spread. The authors focus on age and gender attributes and spatial allocation at subzone level to create a synthetic population that preserves household structure and spatial distribution. The objective is to design and validate a synthesis framework that reproduces both household- and person-level distributions and interdependencies, and to evaluate how different synthesizers affect epidemic outcomes in an agent-based SEIR model.
Population synthesis methods commonly include Iterative Proportional Fitting (IPF) and related extensions (e.g., iterative proportional updating, hierarchical/multistage IPF) that fit marginals but struggle to jointly capture household- and person-level dependencies and face issues like zero-cell problems and computational burden with many attributes. Approaches split into Synthetic Reconstruction (estimating distributions then allocating individuals via weights) and Combinatorial Optimization (replicating micro-samples to best fit marginals, e.g., simulated annealing). More recent probabilistic graphical and deep generative models (HMMs, Bayesian networks, GANs, GAEs) and spatially explicit synthesis have emerged but often require complex training and tuning. For epidemic modeling, synthetic populations are crucial due to heterogeneity in age-specific risks and contacts. Many agent-based epidemic models rely on IPF or general-purpose tools from transport/urban planning (e.g., MATSim, TRANSIMS, SPEW, Gen), which can be complex, computationally intensive, and not tailored to epidemiological attributes of interest. There is a gap in accurately reproducing intra-household structural relationships, which affects simulated contact patterns and disease dynamics. The paper proposes a lightweight combinatorial optimization framework that prioritizes household motifs to better preserve within-household interdependencies relevant to transmission.
Framework: A two-phase approach is proposed: (1) motif selection and (2) motif heuristic optimization (MHO). The goal is to capture frequent household structures (motifs) from microdata and reweight them to match city- and subzone-level demographic marginals, producing a spatially explicit synthetic population suitable for epidemic modeling.
Data preprocessing and encoding: Household travel survey microdata (2016 Shenzhen survey) are used as a proxy for PUMS. Only household/person IDs, age, and gender are retained; non-family collective dwellings are filtered. Ages are grouped into bins (e.g., ten-year groups; nine groups used in experiments). Each household is encoded as counts of gender-age categories, yielding a household structure code (e.g., (M,2)+(F,2)+(F,1)+(F,1)).
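The encoding step can be illustrated with a minimal sketch. The function names, the 0-based age-group index, the open-ended last group, and the ordering rule (males first, then older groups first) are assumptions chosen for illustration; the source does not specify them.

```python
def age_group(age, width=10, n_groups=9):
    """Map an age in years to a 0-based group index (assumed convention).

    The last group is open-ended, so nine ten-year groups cover all ages.
    """
    return min(age // width, n_groups - 1)

def encode_household(members):
    """Encode a household as a canonical tuple of (gender, age_group) pairs.

    `members` is a list of (gender, age) pairs with gender "M" or "F".
    Sorting makes the code independent of the order members were listed in.
    """
    pairs = [(g, age_group(a)) for g, a in members]
    return tuple(sorted(pairs, key=lambda p: (p[0] != "M", -p[1])))

def format_code(code):
    """Render a household code in the (gender,group)+... notation."""
    return "+".join(f"({g},{k})" for g, k in code)

# Hypothetical household: two adults in their twenties and two teenagers.
household = [("F", 12), ("M", 25), ("F", 27), ("F", 15)]
print(format_code(encode_household(household)))
```

Because the code is a sorted tuple, two households with the same composition always map to the same key, which is what makes frequency counting over structures possible.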
Motif selection: Among all observed structures, the S most frequent household structures are selected, where S is the smallest number for which their cumulative frequency exceeds a threshold α (e.g., α between 90% and 99% of observed households). These typical structures are termed motifs. To mitigate zero-cell problems (unseen but plausible structures), zeros in the initial distributions are replaced with very small positive values (e.g., 1e-5).
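A minimal sketch of the selection rule follows, assuming household codes are hashable values (for example strings or tuples). The function name `select_motifs` is hypothetical, and the sketch stops as soon as the cumulative share reaches α rather than strictly exceeds it.

```python
from collections import Counter

def select_motifs(codes, alpha=0.9):
    """Return the most frequent structures whose cumulative share reaches alpha.

    Each returned item is (code, share of surveyed households).
    """
    counts = Counter(codes)
    total = sum(counts.values())
    motifs, covered = [], 0
    for code, n in counts.most_common():  # descending frequency
        motifs.append((code, n / total))
        covered += n
        if covered / total >= alpha:
            break
    return motifs

# Toy survey with four distinct structures (labels are placeholders).
survey = ["A"] * 5 + ["B"] * 3 + ["C"] + ["D"]
print(select_motifs(survey, alpha=0.75))
```

The returned shares can serve as the initial motif weights, with rare structures outside the selected set dropped.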
Optimization (MHO): Let D be an attribute matrix describing each motif’s composition: household size Np, counts by gender (Nm, Nf) and by age groups (Nmk). Let X be motif weights initialized from survey frequencies, and Y be target totals from census marginals (population size and counts by gender and age; household-size distribution at city level). The objective minimizes discrepancy between simulated Ŷ = D·X and targets Y with a penalty encouraging consistency with initial weights: F(X) = φ(Ŷ,Y) + τ(X,X_init), where φ = Σ(ŷ_i − y_i)^2/n (or linear loss variant in implementation) and τ = Σ (log x_i − log x_init,i)^2/m. Constraints: X ≥ 0; upper bounds set large. The problem is solved via a trust-region reflective (TRF) bound-constrained optimizer (SciPy optimize), iteratively updating X until change in loss < ftol (e.g., 1e-6) or max iterations reached.
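Written out, the objective described above takes the following form. This is a reconstruction from the definitions in the text, assuming D has one row per target (n rows) and one column per motif (m columns); the motif index is written as j here to keep it distinct from the target index i. The logarithm requires strictly positive weights, which is consistent with replacing zeros by small positive values.

```latex
\hat{Y} = D X, \qquad
F(X) = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}_{\varphi(\hat{Y},\,Y)}
     + \underbrace{\frac{1}{m}\sum_{j=1}^{m}\left(\log x_j - \log x_{\mathrm{init},j}\right)^2}_{\tau(X,\,X_{\mathrm{init}})},
\qquad x_j \ge 0 .

% Equivalent residual form, F(X) = \lVert r(X) \rVert_2^2, with
r(X) = \begin{bmatrix}
  (D X - Y)/\sqrt{n} \\[2pt]
  (\log X - \log X_{\mathrm{init}})/\sqrt{m}
\end{bmatrix}
```

The residual form is the shape a bound-constrained least-squares solver expects: for instance, `scipy.optimize.least_squares` with `method="trf"` minimizes half the squared norm of a residual vector subject to bounds. Whether the authors used that exact entry point is not stated in the source.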
Generation of agents: Synthetic households are drawn by sampling motifs proportional to optimized weights X (to avoid rounding bias). Households are replicated across subzones so that each subzone’s age marginals match local census data, while city-level household size, gender, and motif distributions align with city marginals. The same motif pool is used for all subzones; only Y varies by subzone. Parallel optimization across 673 subzones is performed on HPC (800 nodes).
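The sampling step can be sketched as follows. The motif list and weights are made-up placeholders, and the function names are hypothetical; the point is only that drawing households with probability proportional to weight sidesteps rounding fractional weights to integer household counts.

```python
import random
from collections import Counter

def sample_households(motifs, weights, n_households, seed=0):
    """Draw household motifs with probability proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(motifs, weights=weights, k=n_households)

def expand_to_agents(households):
    """Turn sampled households into person-level agents with household ids."""
    agents = []
    for hh_id, motif in enumerate(households):
        for gender, group in motif:
            agents.append({"hh": hh_id, "gender": gender, "age_group": group})
    return agents

# Placeholder motifs as tuples of (gender, age_group); weights are illustrative.
motifs = [
    (("M", 3), ("F", 2)),
    (("M", 2), ("F", 2), ("F", 1), ("F", 1)),
    (("F", 6),),
]
weights = [0.07, 0.02, 0.0]  # a zero-weight motif is never drawn

households = sample_households(motifs, weights, n_households=1000)
agents = expand_to_agents(households)
print(Counter(len(h) for h in households), len(agents))
```

In a per-subzone setting the same motif list would be reused and only the weights would change.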
Validation: Accuracy is evaluated by comparing (a) subzone-level age marginals (R^2), (b) city-level marginal distributions (household size, age, gender), (c) joint age-gender distribution vs survey, (d) motif distribution preservation vs survey, and (e) within-household cross-age contact matrices. Two baselines are used: Direct Inflating (DI; direct replication) and IPF.
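For the subzone-level comparison, one common definition of R^2 measures agreement with the 1:1 line between observed and synthetic counts. The source does not say which definition was used, so the choice below (one minus residual sum of squares over total sum of squares) is an assumption.

```python
def r_squared(observed, synthetic):
    """Coefficient of determination of synthetic counts against observed counts.

    Computed relative to the 1:1 line: 1 - SS_res / SS_tot.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - s) ** 2 for o, s in zip(observed, synthetic))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Illustrative counts for a few (subzone, age group) cells; not real data.
observed = [1200, 850, 430, 95]
synthetic = [1195, 858, 428, 101]
print(round(r_squared(observed, synthetic), 4))
```

Under this definition a perfect match gives exactly 1, and a synthesizer that only reproduces the overall mean gives 0.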
Epidemic model: A stochastic discrete-time agent-based SEIR model is built on multilayer contact networks: household, school (0–20y), workplace (20–60y), and community. Contact weightings are 10:2:2:1 (household:school:workplace:community). Disease course: incubation 2 days (E to I), infectious period 10 days (I to R), immunity thereafter. Initialization: 100 exposed seeds in ages 0–20y. Simulated 80 days. Metrics: daily and cumulative incidence, peak incidence rate, peak date, and final attack rate across age groups and overall. Populations synthesized by MHO, DI, and IPF are compared over repeated runs.
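A heavily simplified sketch of such a model is shown below. It keeps the stated layer weights (10:2:2:1), the 2-day incubation, and the 10-day infectious period, but several things are assumptions made for brevity: the transmission scale `beta`, a single school, workplace, and community group for the whole population, and an infection probability of 1 - exp(-force) with force proportional to the infectious share of each group.

```python
import math
import random
from collections import defaultdict

WEIGHTS = {"household": 10, "school": 2, "workplace": 2, "community": 1}
INCUBATION_DAYS, INFECTIOUS_DAYS = 2, 10

def layers_of(agent):
    """Contact groups an agent belongs to, as (layer, group id) keys."""
    keys = [("household", agent["hh"]), ("community", 0)]
    if agent["age"] < 20:
        keys.append(("school", 0))
    elif agent["age"] < 60:
        keys.append(("workplace", 0))
    return keys

def simulate(agents, seeds, days=80, beta=0.05, seed=0):
    """Run a stochastic discrete-time SEIR; return final states and daily new exposures."""
    rng = random.Random(seed)
    state = ["S"] * len(agents)
    timer = [0] * len(agents)
    for i in seeds:
        state[i], timer[i] = "E", INCUBATION_DAYS
    size = defaultdict(int)
    for a in agents:
        for key in layers_of(a):
            size[key] += 1
    history = []
    for _ in range(days):
        # Count infectious members of every contact group at the start of the day.
        infectious = defaultdict(int)
        for a, s in zip(agents, state):
            if s == "I":
                for key in layers_of(a):
                    infectious[key] += 1
        new_state, new_cases = list(state), 0
        for i, a in enumerate(agents):
            if state[i] == "S":
                force = sum(beta * WEIGHTS[k[0]] * infectious[k] / size[k]
                            for k in layers_of(a))
                if rng.random() < 1 - math.exp(-force):
                    new_state[i], timer[i] = "E", INCUBATION_DAYS
                    new_cases += 1
            elif state[i] in ("E", "I"):
                timer[i] -= 1
                if timer[i] == 0:  # E -> I after incubation, I -> R after infectious period
                    if state[i] == "E":
                        new_state[i], timer[i] = "I", INFECTIOUS_DAYS
                    else:
                        new_state[i] = "R"
        state = new_state
        history.append(new_cases)
    return state, history

# Toy population: 200 identical four-person households (ages are placeholders).
agents = [{"hh": h, "age": age} for h in range(200) for age in (35, 33, 8, 70)]
seeds = [i for i, a in enumerate(agents) if a["age"] < 20][:10]
final_state, daily_new = simulate(agents, seeds)
print(sum(daily_new), final_state.count("R"))
```

Swapping in populations produced by different synthesizers changes only the `hh` and `age` fields, which is how household structure can affect the simulated dynamics while the disease parameters stay fixed.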
- Scalability and synthesis: A spatially explicit synthetic population of approximately 17.37 million individuals across 673 subzones in Shenzhen was generated using the proposed framework.
- Motif distribution in survey: Household structure frequencies follow a truncated power-law; the single most frequent motif (M,3)+(F,2) constituted about 7% of surveyed households. Only 124 motifs cover >90% of surveyed households; 216 cover 95%; 828 cover 99%.
- Optimization performance: TRF optimization required more than 8000 iterations to converge, at ~0.2 s per iteration; parallelization across 800 nodes completed the optimization in ~30 minutes. The objective function approached zero, indicating close matching of the targeted marginals.
- Subzone-level fit: Synthetic vs observed subzone age group counts achieved R^2 ≈ 0.9997. Larger relative deviations occurred mainly in sparsely populated elderly groups (e.g., 70+ in youthful subzones).
- City-level marginals: Household size, age, and gender marginal distributions matched with errors within ~0.01% (largest absolute error ~2.23×10^-5 for share of 2-person households).
- Joint age-gender distribution: Largest absolute deviation occurred in female 40–50y; largest relative error (~7.6%) in female 80y+, reflecting small denominators.
- Motif distribution preservation: MHO and DI better preserved ranked motif frequencies than IPF; IPF systematically underrepresented top motifs. DI’s mean frequencies matched observations over many runs but exhibited higher variance, depending on survey representativeness.
- Within-household interdependency: Contact matrices from MHO had lowest summed mean absolute error (MAE sum ≈ 0.40) versus DI (≈ 0.58) and IPF (≈ 0.48), best capturing cross-age co-residence patterns.
- Epidemic outcomes: Final attack rates were similar across synthesizers when marginals matched. However, epidemic dynamics differed: DI produced greater uncertainty (SD of full-population final attack rate ≈ 0.99; SD of peak incidence rate ≈ 0.10) than MHO and IPF, for which the corresponding SDs were both ≈ 0.12 and ≈ 0.03. IPF's peak dates tended to occur later; in the 60+ groups, IPF's peak lagged MHO's by about 2 days. These differences are attributed to differing household motif distributions, which affect within-household transmission pathways.
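The within-household comparison reported above rests on cross-age contact matrices. A minimal sketch of how such a matrix and its error could be computed is given below; the normalization (average number of co-residents in group j per person in group i) and the cell-wise mean absolute error are assumptions, since the source does not give the exact definitions.

```python
def household_contact_matrix(households, n_groups):
    """Average number of co-residents in age group j per person in age group i.

    `households` is a list of households, each a list of 0-based age-group indices.
    """
    pairs = [[0] * n_groups for _ in range(n_groups)]
    persons = [0] * n_groups
    for members in households:
        for idx, i in enumerate(members):
            persons[i] += 1
            for jdx, j in enumerate(members):
                if idx != jdx:  # skip self-contact
                    pairs[i][j] += 1
    return [[pairs[i][j] / persons[i] if persons[i] else 0.0
             for j in range(n_groups)] for i in range(n_groups)]

def mean_absolute_error(a, b):
    """Cell-wise mean absolute difference between two equally sized matrices."""
    cells = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(cells) / len(cells)

# Toy data: survey households vs. a synthetic population (illustrative only).
survey = [[0, 2, 2], [2, 2], [0, 0, 2, 3]]
synthetic = [[0, 2, 2], [2, 2], [0, 2, 3]]
m_survey = household_contact_matrix(survey, n_groups=4)
m_synth = household_contact_matrix(synthetic, n_groups=4)
print(mean_absolute_error(m_survey, m_synth))
```

Two populations with identical age marginals can still yield different matrices, which is the property this kind of comparison is meant to expose.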
The work demonstrates that accurately reproducing within-household structural relationships in synthetic populations materially affects epidemic dynamics, especially peak timing and daily incidence profiles, even when marginal distributions (age, gender, household size) are matched. By selecting a compact set of frequent household motifs and reweighting them to census marginals, the proposed method better preserves cross-age co-residence patterns than IPF and is more stable than DI, leading to more reliable household contact networks for transmission modeling. This addresses a key limitation in prior synthesizers that fit marginals but neglect household interdependencies, thereby reducing bias in agent-based epidemic simulations. The high fidelity to subzone age marginals also supports spatially explicit risk assessment without requiring full activity-based mobility models. Overall, the findings highlight that structural realism at the household level is critical for capturing transient epidemic behavior (e.g., peak timing) important for public health response, while final size is dominated by marginals.
The study contributes: (1) empirical evidence that real-world household structures follow a truncated power-law, enabling representation by a limited set of motifs; (2) a Motif Heuristic Optimization framework that integrates survey-derived household motifs with census marginals to produce spatially explicit synthetic populations preserving within-household interdependencies; and (3) demonstration that synthesizer choice affects epidemic peak timing and incidence dynamics in agent-based SEIR simulations, despite similar final attack rates. Future research directions include extending the approach to additional attributes relevant to infectious diseases (e.g., immunity status), improving handling of higher-dimensional attribute spaces without incurring prohibitive computational costs, incorporating subzone-level gender and household-size marginals when available, and validating across diverse cities and datasets to assess generalizability and performance.
- Attribute scope: The synthesized populations focus on age and gender; socioeconomic and behavioral variables (e.g., income, mobility) were excluded, which may influence transmission in real settings.
- Motif coverage and zero-cell issues: Motifs are drawn from observed survey structures; unobserved but plausible households may be underrepresented despite small positive adjustments for zero cells.
- Survey representativeness: The household travel survey may not perfectly represent the full population; non-family collective dwellings were removed and some non-traditional cohabitation patterns may be undercaptured, potentially biasing contact structures.
- Spatial marginals: Subzone-level gender and household-size marginals were unavailable; city-level distributions were assumed for all subzones, neglecting spatial heterogeneity.
- Scalability with added attributes: Adding more attributes substantially increases motif count and computational burden; methodological enhancements are needed for high-dimensional synthesis.
- Epidemic model simplifications: The SEIR model used simplified assumptions (e.g., fixed contact weights, full school/work participation for 0–60y) that may not capture all real-world nuances.