What drives intercity venture capital investment? A comparative analysis between multiple linear regression and random forest
D. Du, J. Wang, et al.
Start-ups are key drivers of innovation and economic growth, and venture capital (VC) finances their development. While prior research has identified many determinants of VC activity, including attributes of investors and entrepreneurs and broader external conditions, less is known about the relative importance of city-level factors shaping intercity VC flows. Intercity investments face higher information asymmetry and coordination costs than local deals, bringing distance, transport, institutional relations, and regional economic/financial conditions into focus. This study addresses the research gap by quantifying the relative importance of city-level determinants of intercity VC investment within China, a major global VC market. It compares linear and nonlinear perspectives using multiple linear regression with LMG variance decomposition and random forest variable importance to assess the roles of economic, financial, innovation, location, and policy factors. The goal is to inform firms’ investment strategies and guide regional policy to enhance cities’ attractiveness to VC.
The literature identifies three broad groups of determinants of VC investments: (1) investor-side factors (e.g., VC firm reputation, networks, age, location, syndication) that influence project screening and selection and entrepreneurs’ willingness to accept offers; (2) investee-side factors (e.g., sector, technology/innovation, stage/size; entrepreneur’s experience, personality, and political background) that shape expected returns; and (3) external city/regional factors (economy, finance, innovation, location/transportation, institutions/policy, culture/social ties) that affect the spatial distribution of VC. Economic growth and mature financial markets create exit opportunities; innovation ecosystems (universities, patents, talent) attract high-tech VC; proximity and transport connectivity reduce information frictions and monitoring costs; institutional proximity and supportive policies further facilitate deals. Most prior work has emphasized investor/entrepreneur attributes or tested significance of selected city-level variables without establishing their relative importance. Methodologically, studies use interviews/surveys to elicit priorities, regression models (MLR, logit, Poisson, DiD) to test effects, and machine learning (e.g., decision trees, RF) to uncover patterns. Traditional regression coefficients and correlations can mislead when predictors are correlated and when relationships are nonlinear. Advanced variance decomposition methods (LMG, dominance analysis, proportional marginal variance decomposition) and RF importance address these gaps. This study leverages LMG (order-invariant variance decomposition) and RF permutation importance to compare linear vs. nonlinear determinants and quantify their relative weights.
Data and scope: Intercity VC investments in mainland China were analyzed at the city level for 2018, focusing on flows between the lead VC firm’s city and the investee enterprise’s city. Intracity deals were excluded to concentrate on intercity flows. The dataset covers 337 cities (4 municipalities and 333 prefecture-level cities). VC deal data (firm and investee locations, timing, lead investor) were sourced from CVSource. The year 2018 was chosen to avoid pandemic-related disruptions; independent variables are lagged by one year and reflect 2017 values.
Independent variables: Guided by the literature, the variables span five dimensions (data sources are listed in Table 2):
- Economy: GDP, GDP growth rate (Growth), tertiary industry share (Tertiary).
- Finance: number of IPOs in the most recent five years (IPO), number of VC firms (VCF), bank deposits (Deposit), marketization index (Marketization), and number of new enterprises in the most recent five years (initially captured via data sources).
- Innovation: number of universities (University), university students (Student), R&D internal expenditure (RaD), granted patents (Patent).
- Location/transportation: presence of an international airport (Airport), airline network degree centrality (AIR), high-speed rail degree centrality (HSR), distance to the VC centers Beijing/Shanghai/Shenzhen (DIS_Center).
- Policy/visibility: administrative status (Administration: provincial capital/municipality), presence of an economic and technological development zone (ETDZ), presence of a high-tech industrial zone (HTIZ), and Baidu Index (Baidu_Index) as a proxy for city visibility.
Preprocessing and collinearity: Most variables were log-transformed after adding 1 to handle zeros; the binary dummy variables (Airport, Administration, ETDZ, HTIZ) were not logged. Variables with a variance inflation factor (VIF) above 10 were removed through iterative refitting. Deposit and Patent were dropped because of high collinearity with GDP and R&D expenditure. The remaining 18 variables all had VIF < 10 (Table 3).
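The preprocessing steps above can be illustrated with a minimal Python sketch (the study itself used R). It applies log(1 + x) to hypothetical raw values and then repeatedly drops the variable with the highest VIF until every remaining VIF is at most 10. The toy data, the helper names (`solve`, `r_squared`, `vif`, `prune_by_vif`), and the drop-the-largest rule are illustrative assumptions; the source states only that variables with VIF > 10 were removed through iterative refitting.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(data):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 regresses predictor j on the others."""
    scores = {}
    for name, col in data.items():
        others = [c for other, c in data.items() if other != name]
        scores[name] = 1.0 / (1.0 - r_squared(others, col))
    return scores

def prune_by_vif(data, threshold=10.0):
    """Assumed rule: drop the highest-VIF variable, refit, repeat."""
    data = dict(data)
    while len(data) > 1:
        scores = vif(data)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del data[worst]
    return data

# Hypothetical raw values for eight cities (not the study's data).
raw = {
    "GDP":        [100, 250, 400, 800, 1500, 3000, 6000, 12000],
    "Deposit":    [210, 480, 820, 1650, 2900, 6100, 11800, 24500],
    "University": [3, 0, 10, 2, 25, 1, 8, 40],
}
# log(1 + x) keeps zero counts finite, as in the "add 1, then log" step.
logged = {name: [math.log1p(v) for v in col] for name, col in raw.items()}
kept = prune_by_vif(logged)
print(sorted(kept), vif(kept))
```

In this toy example GDP and Deposit are nearly proportional, so one of the two is removed. Which member of a collinear pair is discarded depends on the rule used; the study retained GDP and dropped Deposit.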
MLR with LMG: Stepwise model selection by AIC (MASS::stepAIC) further pruned the variable set. The final linear model retained eight predictors: GDP, VCF, IPO, Marketization, University, DIS_Center, HTIZ, and Baidu_Index. Model performance: adjusted R² = 0.852, residual SE = 0.509, F-statistic = 206.6, p < 2.2e-16. Relative importance was computed with the LMG method (relaimpo package), which decomposes R² into non-negative, order-invariant contributions.
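To make the LMG idea concrete, the following Python sketch computes it from scratch for three predictors: each predictor's share is the increase in R² when it enters the model, averaged over every possible entry order. This is an illustration under stated assumptions, not the relaimpo implementation used in the study; the toy values and all function names are hypothetical.

```python
from itertools import permutations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def lmg(data, y):
    """Average each predictor's incremental R^2 over all entry orders."""
    names = list(data)
    cache = {}

    def r2(subset):
        key = frozenset(subset)
        if key not in cache:
            cache[key] = r_squared([data[n] for n in subset], y) if subset else 0.0
        return cache[key]

    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        for pos, name in enumerate(order):
            totals[name] += r2(order[:pos + 1]) - r2(order[:pos])
    shares = {name: t / len(orders) for name, t in totals.items()}
    return shares, r2(tuple(names))

# Toy values only; the labels borrow the study's variable names for readability.
toy = {
    "GDP":        [1, 2, 3, 4, 5, 6, 7, 8],
    "VCF":        [2, 1, 4, 3, 6, 5, 8, 7],
    "University": [5, 3, 8, 1, 9, 2, 7, 4],
}
y = [3.1, 3.9, 7.2, 6.8, 11.1, 9.7, 14.9, 13.2]

shares, full_r2 = lmg(toy, y)
normalized = {name: s / full_r2 for name, s in shares.items()}  # sums to 1
print(shares, full_r2, normalized)
```

Because the increments telescope within every ordering, the shares add up to the full-model R² regardless of how correlated the predictors are, which is why the decomposition is described as order-invariant. The number of orderings grows factorially, so this brute-force form is only practical for a handful of predictors.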
Random forest (RF): RF regression was trained in R (randomForest package). Data were split 80% train / 20% test, with a fixed random seed for reproducibility. Two key hyperparameters were tuned: mtry (the number of candidate variables tried at each split) and ntree (the number of trees). Performance stabilized around ntree ≈ 100; the final model used mtry = 4, ntree = 627, sampsize = 80%. Variable importance was measured by the permutation-based % increase in MSE (%IncMSE), where higher values indicate greater importance. The final RF explained 83.69% of variance with a mean squared residual of ≈ 0.284.
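The logic behind a permutation-based importance score can be sketched in a few lines of Python. The example below shuffles one feature at a time on a held-out set and records the average increase in MSE for an arbitrary fitted predictor. It is a simplified stand-in, not the randomForest measure: that package, as its documentation describes, works per tree on out-of-bag data and scales the result, which this sketch does not reproduce. The predictor, feature names, and data are hypothetical.

```python
import random

def mse(predict, rows, y):
    """Mean squared error of predict() over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_importance(predict, rows, y, names, repeats=20, seed=0):
    """Average increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, y)
    importance = {}
    for j, name in enumerate(names):
        increase = 0.0
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increase += mse(predict, shuffled, y) - baseline
        importance[name] = increase / repeats
    return importance

# Hypothetical held-out data: three features, the third unrelated to y.
names = ["strong", "weak", "unused"]
x0 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x1 = [4, 9, 2, 7, 10, 1, 6, 3, 8, 5]
x2 = [7, 1, 9, 3, 5, 10, 2, 8, 4, 6]
noise = [0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05, 0.0]
rows = [[a, b, c] for a, b, c in zip(x0, x1, x2)]
y = [2.0 * a + 0.5 * b + e for a, b, e in zip(x0, x1, noise)]

def predict(row):
    """Stand-in for a fitted model; it ignores the third feature."""
    return 2.0 * row[0] + 0.5 * row[1]

importance = permutation_importance(predict, rows, y, names)
print(importance)
```

A feature the model never uses receives an importance of exactly zero, because shuffling it leaves every prediction unchanged, while features with larger effects on the prediction produce larger error increases when shuffled.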
Spatial heterogeneity: Variable importance analyses (normalized LMG and %IncMSE) were repeated by region (eastern, central, western China) to assess geographic differences in determinant importance.
All analyses were conducted in R 4.3.0.
- Model performance: MLR explained 85.2% of variance (adj R²=0.852), RF explained 83.69% (MSE ≈ 0.284). RF optimal settings: mtry=4, ntree=627.
- MLR coefficients and importance (LMG, normalized): VCF (Estimate 0.408, p<0.01; LMG 0.233), IPO (0.350, p<0.01; LMG 0.213), GDP (0.131, p<0.10; LMG 0.154), University (0.247, p<0.01; LMG 0.147), Baidu_Index (0.330, p<0.01; LMG 0.083), DIS_Center (−0.074, p<0.10; LMG 0.073), HTIZ (0.186, p<0.05; LMG 0.064), Marketization (0.259, p=0.112; LMG 0.032). DIS_Center is the only negatively signed coefficient; Marketization is the only predictor not significant at the 10% level.
- RF variable importance (%IncMSE): GDP 28.500, IPO 24.664, VCF 21.875, University 14.997, Marketization 13.331, Baidu_Index 11.752, HTIZ 8.589, DIS_Center 8.055.
- Convergent determinants: Across both models, economic and financial variables dominate (GDP, VCF, IPO), followed by innovation potential (University). Location (DIS_Center), policy/visibility (HTIZ, Baidu_Index), and Marketization are of secondary importance; Marketization ranks higher in RF than in MLR, consistent with nonlinear (U-shaped) effects.
- Spatial heterogeneity: In the eastern region, the financial environment is paramount (VCF, IPO), with GDP and University also important. In the central region, the economy (GDP) takes precedence, and RF elevates Baidu_Index and Marketization relative to MLR. In the western region, innovation/education (University) is most influential, followed by IPO, Baidu_Index, and HTIZ; economy and finance play comparatively smaller roles.
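The national-level scores reported above can be put on a common footing with a short Python sketch. It takes the LMG shares and %IncMSE values exactly as listed, rescales %IncMSE to shares that sum to 1, ranks the variables under each method, and computes a Spearman rank correlation. The comparison itself, the rescaling, and the helper names are illustrative assumptions and not part of the study.

```python
# Values as reported in the findings above.
lmg_share = {"VCF": 0.233, "IPO": 0.213, "GDP": 0.154, "University": 0.147,
             "Baidu_Index": 0.083, "DIS_Center": 0.073, "HTIZ": 0.064,
             "Marketization": 0.032}
inc_mse = {"GDP": 28.500, "IPO": 24.664, "VCF": 21.875, "University": 14.997,
           "Marketization": 13.331, "Baidu_Index": 11.752, "HTIZ": 8.589,
           "DIS_Center": 8.055}

def normalize(scores):
    """Rescale scores so that they sum to 1."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}

def ranks(scores):
    """Rank 1 is the most important variable (no ties in these data)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

def spearman(rank_a, rank_b):
    """Spearman rank correlation for untied ranks."""
    n = len(rank_a)
    d2 = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

rf_share = normalize(inc_mse)
lmg_rank, rf_rank = ranks(lmg_share), ranks(inc_mse)
rho = spearman(lmg_rank, rf_rank)

top3_lmg = {name for name, r in lmg_rank.items() if r <= 3}
top3_rf = {name for name, r in rf_rank.items() if r <= 3}
biggest_shift = max(lmg_rank, key=lambda name: abs(lmg_rank[name] - rf_rank[name]))

print(round(rho, 3), top3_lmg == top3_rf, biggest_shift)
```

Both methods place GDP, VCF, and IPO in the top three, though in a different order, and the variable whose rank changes most between them is Marketization (eighth under LMG, fifth under RF).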
The study’s central question, which city-level factors most strongly drive intercity VC inflows, receives consistent answers from the linear (LMG in MLR) and nonlinear (RF) perspectives. Both approaches indicate that local economic scale and vitality (GDP) and the depth of the financial ecosystem (VCF concentration and IPO activity) are the primary attractors of intercity VC, with innovation potential (universities) next in importance. Proximity to VC centers (negative DIS_Center effect) aids inflows by reducing information and monitoring frictions, though it matters less than economic and financial fundamentals. Marketization’s higher RF importance and the identified U-shaped relationship suggest nonlinear institutional effects that linear models do not fully capture, underscoring the value of combining MLR and RF. Regional analyses reveal how development stages and industrial structures condition determinant salience: mature eastern markets rely most on local finance; the central region hinges more on broad economic development; the western region’s VC attraction is most sensitive to innovation capacity and talent pipelines. These findings align with theory on agglomeration, exit opportunities, and information frictions, and they help reconcile prior mixed evidence by distinguishing relative importance and allowing for nonlinearity.
This study integrates multiple linear regression with LMG variance decomposition and random forest permutation importance to quantify the relative importance of city-level determinants of intercity VC investments in China. The main contributions are: (1) establishing that economy and finance (GDP, VCF, IPO) are the strongest drivers of VC inflows, with innovation potential (universities) next, and distance to VC centers negatively associated; (2) demonstrating complementarities between linear and nonlinear importance measures, revealing nonlinear effects (e.g., Marketization); and (3) documenting spatial heterogeneity: finance is most crucial in the eastern region, economy in the central region, and innovation in the western region. Policy implications include strengthening local financial institutions and IPO pipelines, investing in universities and innovation ecosystems, improving transport to reduce effective distance to VC centers, and tailoring strategies to regional contexts. Future research should incorporate multi-year panel data to track evolving determinant importance, integrate attributes of VC firms and entrepreneurs alongside city factors, and examine shocks (financial crises, pandemics) to inform risk management and resilience.
- Cross-sectional design (2018 outcomes with 2017 predictors) limits temporal inference; determinant importance may change over time. Future work should use panel data to capture dynamics.
- Focus on city-level characteristics excludes direct modeling of investor and entrepreneur attributes; interactions between actor-level and city-level factors warrant study.
- Exogenous shocks (e.g., financial crises, COVID-19) are not explicitly modeled; incorporating such events could clarify risk and stability effects.
- Collinearity management required dropping certain variables (e.g., Deposit, Patent), which may omit some nuanced effects correlated with GDP/R&D.