Quantifying ethnic segregation in cities through random walks



S. Sousa and V. Nicosia

This research by Sandro Sousa and Vincenzo Nicosia introduces a non-parametric approach to measuring spatial segregation in cities through random walks on graphs. By analyzing the time a random walker takes to encounter the ethnic groups present in a city, the study characterizes multiple dimensions of ethnic segregation and their association with socioeconomic deprivation across US and UK metropolitan areas.
Introduction

Spatial heterogeneity characterizes many complex systems, including urban areas, and affects their functioning and dynamics. Quantifying spatial segregation (how groups cluster and how evenly they are distributed) is challenging because existing measures often depend on neighborhood definition, data granularity, and arbitrary parameters. The literature distinguishes exposure/isolation and evenness/clustering as key dimensions of segregation, yet there is no consensus on robust quantification or cross-city comparisons. Evidence suggests segregation operates across multiple spatial scales, motivating approaches that can capture multiscale organization without scale-dependent biases. The goal here is to develop a principled, non-parametric, and scalable framework for quantifying and comparing spatial segregation, one that incorporates large-scale spatial correlations and yields results comparable across systems of different size and shape.

Literature Review

Prior work on segregation highlights multiple dimensions (exposure/isolation and evenness/clustering) but struggles with scale dependence, modifiable areal unit effects, parameter choices, and varying data granularity. Recent studies argue segregation is multiscale, with processes acting simultaneously across scales. Network science has been increasingly applied to urban segregation by modeling cities as graphs (e.g., adjacency of census tracts, transportation, commuting) and deriving statistics from within- and between-group connectivity. While informative, many indices remain sensitive to system size, shape, and class abundance, and often require tuning parameters. Diffusion/random-walk-based approaches have been used to capture exposure and structural properties of networks, suggesting their potential to reflect multi-scale spatial organization without explicit aggregation procedures.

Methodology
  • Data and graph construction: For each metropolitan area (US and UK), construct a spatial adjacency graph G where nodes are census tracts (US) or wards (UK), and undirected edges connect spatially bordering units. Each node i is associated with a vector x_i = {m_i1, m_i2, …, m_iI} of counts for I ethnic groups (UK: 250 classes from 2011 ONS census; US: 64 race/ethnic groups from 2010 Decennial Census). For some illustrations, nodes can be labeled by the majority class (categorical color), but the framework generalizes to full composition vectors.
  • Random-walk symbolic dynamics: Consider unbiased discrete-time random walks on G. For a walk that starts at node i at time 0 and visits the node sequence i_0 = i, i_1, i_2, …, define W_i(t) as the fraction of distinct classes encountered up to step t, averaged over R independent realizations (R=1000 per starting node in experiments). Each walk is simulated until it has encountered all I classes; trajectories are then padded to the maximum length τ_i among the realizations from node i by setting W_i(t)=1 for t beyond completion.
  • Class Coverage Time (CCT): For a given coverage fraction c∈(0,1], define the CCT for node i as C_i(c) = min { t : W_i(t) ≥ c }, i.e., the first step at which the realization-averaged coverage W_i(t) reaches c. It serves as an estimate of the number of steps a walker starting at i needs to encounter a fraction c of all classes for the first time.
  • System-level CCT profiles: For each c, compute: (1) Mean μ(c) = (1/N) Σ_i C_i(c), reflecting overall difficulty of reaching a fraction of classes; (2) Spatial variance (coefficient of variation) σ(c) = sqrt(Var(C_i(c)))/μ(c), capturing dependence on starting location; (3) Local spatial diversity Q(c) = (1/K) Σ_i Σ_j a_ij |C_i(c) − C_j(c)|, measuring differences between neighboring nodes (a_ij entries of the adjacency matrix, K edges).
  • Null-model normalization: Construct a null ensemble by randomly permuting node-level class distributions {x_i} across the same graph G (uniform random reassignment), preserving each node’s local population distribution and the global class abundances while destroying spatial organization. Compute μ_null(c), σ_null(c), Q_null(c) as averages over many random permutations. Define deviations integrated over c to remove dependence on class abundance and size/shape: Δμ = ∫ dc [μ(c) − μ_null(c)], Δσ = ∫ dc [σ(c) − σ_null(c)], ΔQ = ∫ dc [Q(c) − Q_null(c)]. These yield scale- and shape-comparable measures termed spatial heterogeneity (Δμ), spatial variance (Δσ), and local spatial diversity (ΔQ).
  • Synthetic benchmarks: Evaluate behavior on 2D square lattices (8×8, 16×16; degree 4) with random class assignments (varied number of classes Γ={2,4,8,16,32}), and on stylized patterns with pre-assigned clusters (e.g., four quadrant clusters plus a minority class arranged in center/corner/spread/corners). Additional tilings with fixed cluster sizes and shapes (e.g., 2×4, 2×2, stripes 1×4, 1×2) and domain modifications (lateral appendix) test sensitivity to cluster size, shape, and domain geometry. Null models for the synthetic cases use 100 realizations, and CCT values per node are averaged over 1000 random-walk trajectories.
  • Real urban systems: Apply to US and UK metropolitan areas using the above graph construction and compute C_i(c), μ(c), σ(c), Q(c), and Δμ, Δσ, ΔQ for each city. For visualization, map normalized CCT C_i(c)/C_null(c) at c≈0.7.
  • Robustness across spatial resolution: Assess sensitivity to tract aggregation via detrended fluctuation analysis (DFA) of random-walk trajectories on graphs at different spatial resolutions; report that large-scale correlations persist across resolutions.
  • Socio-economic correlations (US): Using ACS 2011 5-Year estimates (employment, commuting, occupation, income, security), compute two-sided Spearman correlations between socio-economic indicators and each city-level segregation measure. The measures compared are the diffusion-based Δμ, Δσ, ΔQ and a suite of classical segregation indices (Moran’s I, Spatial Gini, Spatial Dissimilarity, Distance-decay exposure/isolation, Perimeter/Boundary spatial dissimilarity).
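The random-walk steps in the list above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: each node carries a single class label (as in the majority-class illustrations) rather than a full composition vector, the graph is a small square lattice, and all function names (`grid_graph`, `coverage_curve`, `cct`, `cct_profiles`) are hypothetical. The variance uses the population form, which the summary does not specify, and Q follows the double sum exactly as written in the text.

```python
import random
from statistics import mean, pstdev

def grid_graph(n):
    """Adjacency lists of an n x n square lattice (up to 4 neighbours, no wrap-around)."""
    adj = {}
    for r in range(n):
        for c in range(n):
            nbrs = []
            if r > 0:
                nbrs.append((r - 1) * n + c)
            if r < n - 1:
                nbrs.append((r + 1) * n + c)
            if c > 0:
                nbrs.append(r * n + c - 1)
            if c < n - 1:
                nbrs.append(r * n + c + 1)
            adj[r * n + c] = nbrs
    return adj

def coverage_curve(adj, labels, start, n_classes, walks, rng):
    """W_i(t): fraction of distinct classes seen by step t, averaged over `walks` runs.

    Assumes a connected graph in which every one of the n_classes labels occurs,
    otherwise a walk would never finish.
    """
    runs = []
    for _ in range(walks):
        node, seen = start, {labels[start]}
        curve = [len(seen) / n_classes]
        while len(seen) < n_classes:
            node = rng.choice(adj[node])       # unbiased step to a random neighbour
            seen.add(labels[node])
            curve.append(len(seen) / n_classes)
        runs.append(curve)
    tau = max(len(run) for run in runs)
    # Walks that finished early are padded with 1.0 up to the longest run.
    return [sum(run[t] if t < len(run) else 1.0 for run in runs) / walks
            for t in range(tau)]

def cct(curve, c):
    """C_i(c): first step t at which W_i(t) >= c, for c in (0, 1]."""
    return next(t for t, w in enumerate(curve) if w >= c)

def cct_profiles(adj, labels, n_classes, c, walks=200, seed=0):
    """Return (mu, sigma, Q) at coverage fraction c."""
    rng = random.Random(seed)
    C = {i: cct(coverage_curve(adj, labels, i, n_classes, walks, rng), c) for i in adj}
    values = list(C.values())
    mu = mean(values)
    sigma = pstdev(values) / mu if mu > 0 else 0.0   # coefficient of variation
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    # Double sum over i and j as in the text, so each undirected edge appears twice.
    q = sum(abs(C[i] - C[j]) for i in adj for j in adj[i]) / n_edges
    return mu, sigma, q

if __name__ == "__main__":
    n = 4
    adj = grid_graph(n)
    halves = [0 if i % n < n // 2 else 1 for i in range(n * n)]   # two large clusters
    checker = [(i // n + i % n) % 2 for i in range(n * n)]        # fully mixed pattern
    print("halves :", cct_profiles(adj, halves, 2, 1.0))
    print("checker:", cct_profiles(adj, checker, 2, 1.0))
```

On the checkerboard every neighbour belongs to the other class, so every walk completes in one step; the two-halves pattern forces walkers in the outer columns to travel further, which is the effect μ(c) is meant to pick up.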
Key Findings
  • Synthetic patterns: Larger lattices and fewer classes yield larger homogeneous clusters, increasing mean CCT μ(c). Pre-assigned large clusters produce higher Δμ (spatial heterogeneity) and lower Δσ (when clusters are large and deep) relative to dispersed minority distributions. Domain geometry (e.g., a lateral appendix) significantly increases Δμ and ΔQ due to bottlenecks limiting diffusion. The measures capture differences in cluster size and shape (e.g., stripes vs squares) and are sensitive to the typical cluster scale even when relative arrangements are preserved.
  • Cross-city comparisons (US and UK): The framework yields comparable segregation assessments across differently sized and shaped metropolitan areas. For example, Boston exhibits extreme values in both the (Δμ, ΔQ) and (Δσ, ΔQ) planes, indicating strong spatial heterogeneity and variance; Los Angeles shows comparably high spatial variance with lower heterogeneity. London shows high Δμ with relatively low ΔQ (numerous small adjacent clusters with similar compositions), whereas New York shows relatively lower spatial variance but higher spatial diversity, implying larger, more distinct clusters.
  • Neighborhood-level insights: Maps of normalized CCT C_i(c)/C_null(c) at c≈0.7 reveal within-city heterogeneity, identifying hot spots where walkers require up to approximately three times the null-model steps to encounter a given fraction of ethnic groups.
  • Robustness to scale: DFA suggests that while spatial sampling granularity induces small fluctuations, the geographic distance over which random walks detect large-scale correlations in ethnicity distributions remains stable, mitigating size- and scale-dependence.
  • Socio-economic associations (US): Diffusion-based measures, especially Δμ and ΔQ, show stronger and more consistent two-sided Spearman correlations with socio-economic variables than many classical indices (asterisks mark the statistical significance level). Examples include: Income <$10k correlates with Δμ at ≈ 0.77*** and with Δσ at ≈ 0.69***; Education/health/social care industry correlates with Δμ at ≈ 0.82*** and with Δσ at ≈ 0.61***; Employment and commuting variables (e.g., walked to work) correlate with Δμ up to ≈ 0.73*** and with Δσ up to ≈ 0.45**. In many categories, ΔQ attains higher explanatory power (R^2, confidence) than Moran’s I and other spatial segregation indices.
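For readers unfamiliar with the statistic behind these numbers, the sketch below computes a Spearman rank correlation coefficient from scratch: rank both variables (ties share an average rank), then take the Pearson correlation of the ranks. It is only an illustration. The function names are hypothetical, the sample values are made up and are not taken from the study, and the two-sided significance test reported in the summary is not computed here.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1          # mean of the 1-based ranks pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (vx * vy) ** 0.5

if __name__ == "__main__":
    # Made-up city-level values, for illustration only.
    delta_mu = [0.12, 0.40, 0.33, 0.75, 0.58]
    low_income_share = [0.05, 0.09, 0.08, 0.14, 0.10]
    print(spearman_rho(delta_mu, low_income_share))
```

Because only ranks enter the calculation, the coefficient is insensitive to monotone rescaling of either variable, which suits indicators measured in very different units.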
Discussion

The proposed diffusion-based measures directly address the research goal of quantifying multi-scale spatial segregation in a way that is comparable across heterogeneous urban systems. By leveraging random-walk Class Coverage Times and their deviations from a null model that preserves class abundances, the approach captures global constraints (Δμ), dependence on starting location (Δσ), and local differences between neighboring areas (ΔQ). These measures integrate information across scales naturally through diffusion, overcoming common limitations such as arbitrary aggregation or fixed neighborhood sizes. Applications to synthetic patterns verify interpretability and sensitivity to cluster size, shape, and domain geometry; applications to US/UK metropolitan areas reveal nuanced differences in segregation structures and highlight areas where exposure to diverse groups is particularly limited. Strong correlations with socio-economic indicators suggest that diffusion-based segregation captures salient structural aspects tied to deprivation, employment, and income that are only partially reflected by classical indices. While causality is not established, the results indicate that diffusion metrics can serve as robust proxies for multi-dimensional socio-economic segregation and urban quality of life, facilitating fairer cross-city comparisons.
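The null-model deviation that underlies Δμ, Δσ, and ΔQ can be illustrated with a short Python sketch. It assumes that a profile such as μ(c) has already been evaluated on a grid of coverage fractions, once for the observed city and once per permuted assignment; the numbers in the demo are made up. The trapezoidal rule is an assumption, since the summary does not say how the integral over c is evaluated, and all function names are hypothetical.

```python
import random

def permute_nodes(compositions, rng):
    """Null-model draw: reassign whole node-level composition vectors to random nodes.

    Each vector stays intact, so local populations and global class abundances
    are preserved while the spatial arrangement is destroyed.
    """
    shuffled = list(compositions)
    rng.shuffle(shuffled)
    return shuffled

def trapezoid(cs, values):
    """Trapezoidal estimate of the integral of `values` over the grid `cs`."""
    return sum((cs[k + 1] - cs[k]) * (values[k] + values[k + 1]) / 2
               for k in range(len(cs) - 1))

def integrated_deviation(cs, observed, null_runs):
    """Integral over c of [observed(c) - mean over null realizations of null(c)]."""
    null_mean = [sum(run[k] for run in null_runs) / len(null_runs)
                 for k in range(len(cs))]
    return trapezoid(cs, [o - n for o, n in zip(observed, null_mean)])

if __name__ == "__main__":
    rng = random.Random(0)
    x = [(30, 2), (1, 25), (12, 12)]            # hypothetical per-node group counts
    print(permute_nodes(x, rng))                # same vectors, shuffled across nodes

    cs = [0.25, 0.5, 0.75, 1.0]                 # grid of coverage fractions
    mu_observed = [0, 0, 4, 6]                  # made-up mu(c) for the real layout
    mu_null = [[0, 0, 1, 2], [0, 0, 3, 4]]      # made-up mu(c) for two permutations
    print(integrated_deviation(cs, mu_observed, mu_null))
```

The same `integrated_deviation` call would apply unchanged to σ(c) and Q(c) profiles; a positive value means the observed layout makes classes harder to reach than a spatially random one with identical abundances.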

Conclusion

This work introduces a principled, non-parametric framework to quantify spatial segregation via random-walk Class Coverage Times on adjacency graphs. Three complementary measures—spatial heterogeneity (Δμ), spatial variance (Δσ), and local spatial diversity (ΔQ)—derived as deviations from null-model expectations enable scale- and shape-robust comparisons across systems. The framework accurately discriminates synthetic spatial patterns differing in cluster size, shape, and domain geometry, and yields interpretable, neighborhood-level insights in real metropolitan areas in the US and UK. Diffusion-based measures correlate strongly with multiple socio-economic indicators, often outperforming classical segregation indices, indicating that diffusion effectively integrates multi-scale spatial information. Future research directions include: (i) longitudinal analyses to investigate causal links between segregation dynamics and socio-economic outcomes; (ii) extensions to incorporate full composition vectors or weighted class targets to account for class abundances; (iii) exploration of biased or multi-layer random walks (e.g., incorporating mobility networks) to capture additional facets of urban structure; and (iv) computational and theoretical advances to further characterize coverage-time distributions and their scaling properties.

Limitations
  • Rare classes: Very infrequent ethnic groups concentrated in few nodes can inflate CCTs, biasing measures upward for c near 1; careful choice of class granularity or alternative definitions of node attributes may mitigate this.
  • Scale considerations: Although the null-model normalization reduces size and scale biases, the scale of analysis remains important; comparisons across vastly different spatial aggregations should be made cautiously.
  • Causality: Observed correlations with socio-economic indicators are not causal; establishing causation requires longitudinal data and appropriate modeling.
  • Data and topology constraints: Physical barriers and graph construction choices (e.g., adjacency definitions) influence diffusion paths; while the method is sensitive and informative, it inherits any biases present in spatial delineations.
  • Computational cost: Extensive random-walk simulations (R≈1000 per node, multiple null permutations) can be computationally intensive for very large systems.
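To give a rough sense of the simulation budget behind the last point, the arithmetic can be written out as below. This is a back-of-the-envelope sketch that counts walks only, not the number of steps per walk. It also assumes that every null-model permutation is simulated with the same number of walks per node as the observed configuration, which the summary does not state explicitly. The function name is hypothetical.

```python
def total_walks(n_nodes, walks_per_node, n_permutations):
    """Walks needed for one observed configuration plus n_permutations null draws."""
    return n_nodes * walks_per_node * (1 + n_permutations)

if __name__ == "__main__":
    # 8x8 and 16x16 lattices with R = 1000 and 100 null realizations,
    # the settings quoted for the synthetic benchmarks.
    print(total_walks(8 * 8, 1000, 100))      # 6464000
    print(total_walks(16 * 16, 1000, 100))    # 25856000
```

The count grows linearly in the number of nodes, while the length of each walk also tends to grow with system size, so the total cost rises faster than this figure alone suggests.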