Predicting urban innovation from the US Workforce Mobility Network
M. Bonaventura, L. M. Aiello, et al.
The study addresses whether structural properties of a national-scale professional mobility network among metropolitan areas can predict urban innovation outcomes better than traditional demographic predictors. Prior work on urban scaling showed strong superlinear relationships between population and many socioeconomic outputs, but weaker associations for innovation-related indicators such as patents and inventors. Population-based models have three limitations: they treat cities as isolated units, neglect the selective attraction of talent, and operate on time scales that are mismatched with fast-moving startup ecosystems. Leveraging CrunchBase data, the authors ask to what extent proxies of US workforce mobility (captured as inter-city flows of startup professionals) predict two innovation metrics for cities, the number of successful startups and the cumulative acquisition price, beyond demographic and investment-based predictors.
Urban scaling literature has found population size to strongly predict many outputs (R^2 ~0.88–0.99) but less so innovation indicators (e.g., patents R^2 ~0.72; inventors R^2 ~0.76; R&D establishments R^2 ~0.77). Theoretical and empirical work emphasizes the role of social interactions, network thickness, and the selective attraction of a creative class in fostering innovation. Few empirical studies, constrained by data availability, have examined how inter-city social network structures relate to innovation and economic development. Recent advances that apply statistical physics to spatial networks suggest that network models can capture complex urban dynamics more accurately than simple scaling laws. This study builds on and extends these strands by constructing a national Workforce Mobility Network from open startup data to quantify how centrality within mobility flows correlates with and predicts urban innovation performance.
Data sources: (1) 2010 US Census (population size, land area, density) at Metropolitan Statistical Area (MSA) level; (2) USPTO patents granted in 2010, mapped to inventors' MSAs; (3) CrunchBase API: organizations (HQ address, founding date, funding rounds, acquisitions/exits, IPOs, status, team members) up to end-2016; people/job roles up to end-2010. Firms were mapped to 369 of 374 MSAs; 243 MSAs had at least one active startup in 2010. About 42% of job records included a start date enabling longitudinal analysis; 75% of role-date records are from 2000–2010.
Workforce Mobility Network (WMN) construction: Nodes are MSAs; directed edges represent worker flows between MSAs derived from sequences of roles for individuals. For any worker with roles r1 in MSA i and r2 in MSA j (i ≠ j), if start(r1) precedes start(r2), increment weight w_ij by 1. When end dates exist and roles overlap (end(r1) after start(r2)), increment both w_ij and w_ji by 1 to capture potential bidirectional information exchange. WMN aggregates CrunchBase job transitions from 1960 through end-2010.
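The edge-weight rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout `(msa, start, end)`, the function name `build_wmn`, and the sample workers are hypothetical. It also assumes that every pair of a worker's roles is compared (not only consecutive ones), that pairs with identical start dates are skipped, and that an overlap adds one unit to the reverse edge on top of the forward edge.

```python
from collections import defaultdict
from itertools import combinations


def build_wmn(roles_by_worker):
    """Return {(msa_i, msa_j): weight} from per-worker role lists.

    roles_by_worker maps a worker id to a list of (msa, start, end) tuples,
    where end may be None when no end date is recorded (hypothetical layout).
    """
    weights = defaultdict(int)
    for roles in roles_by_worker.values():
        for a, b in combinations(roles, 2):
            # Order the pair so that `first` starts before `second`.
            first, second = sorted((a, b), key=lambda r: r[1])
            msa_i, start_i, end_i = first
            msa_j, start_j, _ = second
            if msa_i == msa_j or start_i == start_j:
                continue  # same MSA, or no clear precedence (assumption)
            weights[(msa_i, msa_j)] += 1
            # Overlapping roles: also credit the reverse direction.
            if end_i is not None and end_i > start_j:
                weights[(msa_j, msa_i)] += 1
    return dict(weights)


# Illustrative records only; years stand in for full dates.
sample = {
    "w1": [("SF", 2001, 2004), ("NYC", 2003, None)],  # overlapping roles
    "w2": [("SF", 2000, 2002), ("BOS", 2005, None)],  # sequential roles
}
wmn = build_wmn(sample)
```

Here worker `w1` contributes to both SF→NYC and NYC→SF because the first role ends after the second begins, while `w2` contributes only to SF→BOS.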
Centrality measures: Four normalized centralities computed per MSA: (a) Degree centrality (sum of in- and out-degrees in the unweighted projection); (b) Strength (sum of incident edge weights, in+out); (c) Harmonic closeness (sum over inverse weighted shortest-path distances where edge length is 1/weight); (d) Weighted PageRank with damping factor 0.85, where transition probability along an outgoing link is proportional to its weight. Centralities are normalized to sum to 1 across nodes.
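As a rough illustration of three of these measures, the sketch below computes degree, strength, and weighted PageRank on a directed weighted edge dictionary using only the standard library. The function names, the toy graph, and the treatment of nodes without outgoing edges (their rank is spread uniformly) are assumptions made for this example; the source does not say which implementation was used. Harmonic closeness is left out to keep the sketch short.

```python
from collections import defaultdict


def normalize(scores):
    """Rescale scores so they sum to 1 across nodes."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total else dict(scores)


def degree_and_strength(weights):
    """Degree counts incident edges (in + out); strength sums their weights."""
    degree, strength = defaultdict(int), defaultdict(int)
    for (i, j), w in weights.items():
        degree[i] += 1
        degree[j] += 1
        strength[i] += w
        strength[j] += w
    return normalize(degree), normalize(strength)


def weighted_pagerank(weights, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration; a step along an out-edge has probability w / out-strength."""
    nodes = sorted({n for edge in weights for n in edge})
    n = len(nodes)
    out_strength = defaultdict(float)
    for (i, _), w in weights.items():
        out_strength[i] += w
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no out-edges is redistributed uniformly (assumption).
        dangling = sum(rank[v] for v in nodes if out_strength[v] == 0)
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for (i, j), w in weights.items():
            new[j] += damping * rank[i] * w / out_strength[i]
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank  # already sums to 1


# Toy graph with hypothetical weights.
toy = {("A", "B"): 3, ("B", "A"): 1, ("C", "A"): 2}
deg, stren = degree_and_strength(toy)
pr = weighted_pagerank(toy)
```

In the toy graph, node `C` receives no inflow, so its PageRank reduces to the teleportation term `(1 - damping) / n`.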
Outcome variables: For each MSA, two innovation metrics are measured over 2011–2016: (1) S_i: the count of successful startups (those that were acquired, went public via IPO, or acquired another startup); (2) A_i: the cumulative acquisition price (USD) of its startups.
Predictors: Two groups: (i) Socio-economic: population size, population density, patents in 2010, number of active startups N_i in 2010 (control/upper bound for S_i), and total past funding F_i up to 2010. (ii) WMN-based: degree, strength, PageRank, harmonic closeness.
Modeling: Ordinary least squares regressions on base-10 log-transformed variables (absolute, not per-capita), following urban scaling practice. Univariate models assess each predictor separately; multivariate models evaluate combinations. Stepwise feature selection via stepAIC (Akaike Information Criterion) identifies the best subset. Relative importance of correlated predictors estimated using the Lindeman–Merenda–Gold (LMG) method (R package relaimpo), averaging R^2 contributions over all predictor orderings. Validation with a null configuration randomizing A_i and S_i across areas to confirm non-spurious fit.
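A univariate log-log fit of the kind described above can be sketched without external libraries. The source reports using R (stepAIC, relaimpo); the Python function below, its name `loglog_ols`, and the synthetic data are illustrative assumptions only. It covers the single-predictor case and assumes all values are strictly positive, since the source does not describe how zeros were handled before the log transform.

```python
import math


def loglog_ols(x, y):
    """Fit log10(y) = alpha + beta * log10(x); return (beta, alpha, adjusted R^2)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    beta = sxy / sxx                 # scaling exponent
    alpha = my - beta * mx           # intercept in log space
    ss_res = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(lx, ly))
    ss_tot = sum((b - my) ** 2 for b in ly)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor
    return beta, alpha, adj_r2


# Synthetic data following an exact power law y = 2 * x^1.2.
x = [10, 100, 1000, 10000]
y = [2 * v ** 1.2 for v in x]
beta, alpha, adj_r2 = loglog_ols(x, y)
```

An exponent `beta` above 1 corresponds to superlinear scaling and below 1 to sublinear scaling; on the synthetic data the fit recovers the exponent 1.2.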
Network summary and visualization: WMN contains 243 nodes and 2,169 directed edges, reflecting 26,660 worker flows; max degree (in+out) 165; max node strength 8,370. Strength distribution follows a power law with exponent ~2. A backbone extraction (Coscia–Neffke) was used solely for visualization; analyses use the full WMN. A centrality-to-population ratio η = (PageRank_i / sum PageRank) / (Population_i / sum Population) identifies small-yet-central versus large-but-not-central MSAs.
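The η ratio is a ratio of two shares, which the following sketch makes explicit. The function name and the two-city input are hypothetical; the numbers are not from the study.

```python
def centrality_to_population_ratio(pagerank, population):
    """eta_i = (PageRank share of MSA i) / (population share of MSA i)."""
    pr_total = sum(pagerank.values())
    pop_total = sum(population.values())
    return {
        msa: (pagerank[msa] / pr_total) / (population[msa] / pop_total)
        for msa in pagerank
    }


# Hypothetical values for two made-up areas.
eta = centrality_to_population_ratio(
    pagerank={"X": 0.6, "Y": 0.4},
    population={"X": 100, "Y": 300},
)
```

Values above 1 mark areas that are more central than their population share would suggest (here `X`), and values below 1 mark the opposite (here `Y`).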
- Residual variability under population-based scaling: Although S_i and A_i scale superlinearly with population (β ≈ 1.2–1.6) and sublinearly with past funding (β ≈ 0.6–0.8), substantial performance variance remains among similarly sized and funded cities (e.g., North Port–Bradenton–Sarasota vs Colorado Springs with ~10^6 population and ~10^8 USD funding but vastly different cumulative acquisition outcomes: 5.8×10^9 USD vs 4.3×10^7 USD).
- WMN structure: 243 MSAs, 2,169 edges, 26,660 flows; coastal MSAs are most central by PageRank; population and PageRank correlate (Spearman ρ = 0.70) but with large deviations—some small MSAs (e.g., Boulder, Ithaca) are highly central relative to population, while several large MSAs are not.
- Centrality vs population: The η ratio surfaces top MSAs with high centrality relative to size (e.g., San Jose, San Francisco, Boulder, Boston, Ithaca) and bottom MSAs that are populous yet not central and show limited innovation returns (with exceptions like Virginia Beach).
- Predictive performance (adjusted R^2 from log-log OLS; univariate unless noted):
  - Number of successful startups S_i: population 0.73; density 0.39; past funding 0.79; patents 0.74; active startups 0.92; PageRank 0.90; strength 0.88; degree 0.80; harmonic closeness 0.82; best multivariate subset (stepAIC) 0.94.
  - Cumulative acquisition price A_i: population 0.44; density 0.22; past funding 0.48; patents 0.51; active startups 0.57; PageRank 0.60; strength 0.55; degree 0.49; harmonic closeness 0.49; best multivariate subset (stepAIC) 0.60; all predictors 0.61.
  - Null check: randomized outcomes yield R^2 ≈ 0 and non-significant coefficients.
- Effect sizes: PageRank outperforms population by 23% for predicting S_i and by 36% for predicting A_i. In multivariate models, PageRank is the only retained network metric and has the highest β for A_i and the second-highest for S_i (after active startups).
- Relative importance (LMG): After controlling for active startups, PageRank explains the largest share of additional variance among predictors for both S_i and A_i.
- Temporal robustness: Using workforce mobility data up to 2005 still predicts S_i and A_i with adjusted R^2 of 0.56 and 0.67, respectively, compared to 0.60 and 0.75 when using data up to 2010 (per Supplementary Information).
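The percentage gains quoted for PageRank over population match the relative differences between the univariate adjusted R^2 values listed above, assuming that is how they were computed:

```latex
\begin{align*}
S_i:&\quad \frac{R^2_{\mathrm{PageRank}} - R^2_{\mathrm{pop}}}{R^2_{\mathrm{pop}}}
      = \frac{0.90 - 0.73}{0.73} \approx 0.23 \\
A_i:&\quad \frac{R^2_{\mathrm{PageRank}} - R^2_{\mathrm{pop}}}{R^2_{\mathrm{pop}}}
      = \frac{0.60 - 0.44}{0.44} \approx 0.36
\end{align*}
```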
The findings directly address the hypothesis that inter-city workforce mobility structures, rather than sheer population or density, are key predictors of urban innovation performance. Centrality within the national WMN—especially weighted PageRank capturing access to flows of knowledge and opportunities—explains substantial variance not captured by population-based scaling, aligning with theories that talent migration and network thickness drive innovation. This refines the standard narrative attributing superlinear scaling solely to within-city interaction density by emphasizing selective attraction and circulation of skilled professionals across cities. Policy implications include shifting focus from undifferentiated population growth to fostering professional connectivity and selective talent attraction while mitigating adverse effects such as displacement and gentrification. Practically, the approach leverages open, near-real-time data to monitor and potentially guide urban innovation ecosystems, offering tools for researchers and practitioners to track and benchmark cities based on mobility-network embeddedness.
This work constructs the first US-wide Workforce Mobility Network from open startup data and demonstrates that network centrality, particularly weighted PageRank, is a strong and often superior predictor of urban innovation outcomes compared to traditional demographic and investment measures. The approach captures dynamic, exogenous flows of talent and knowledge that shape startup success and acquisition value, providing a complementary and more timely lens than patent-based metrics. The study suggests that policies aimed at enhancing inter-city professional linkages and selectively attracting talent may yield innovation gains. Future research should test causal mechanisms with richer longitudinal datasets, evaluate robustness across different economic cycles and sectors, refine mobility proxies beyond CrunchBase (e.g., integrating other labor platforms), and explore intervention simulations leveraging network-aware urban policy design.
- Causality and temporality: Lack of sufficiently long longitudinal data prevents causal inference and limits assessment across varying macroeconomic regimes; workforce mobility and innovation measurement windows may not fully overlap.
- Data completeness and bias: CrunchBase coverage is partial; funding and acquisition amounts are not always disclosed (about 83% of funding rounds are fully disclosed), which introduces missing data. The authors report evidence that the missingness is largely random and has limited impact on comparative analyses.
- Model performance over time: Predictive power diminishes with older mobility data, though remains substantial (e.g., using data up to 2005 yields adjusted R^2 of 0.56 for S_i and 0.67 for A_i vs 0.60 and 0.75 with data up to 2010, per Supplementary Information).
- Single data source for mobility: Mobility inferred from startup roles may not capture all workforce flows or non-startup sectors, potentially limiting generalizability beyond the startup ecosystem.

