
Networks and identity drive the spatial diffusion of linguistic innovation in urban and rural areas
A. Ananthasubramaniam, D. Jurgens, et al.
This study by Aparna Ananthasubramaniam, David Jurgens, and Daniel M. Romero examines how social networks and identity jointly shape the spread of linguistic innovation across urban and rural areas. Weak-tie interactions drive diffusion among urban counties, while strong ties aligned with shared identity drive it among rural counties.
Introduction
The paper addresses why cultural and linguistic innovations are adopted in specific geographic regions and how diffusion differs across urban and rural areas. Two primary hypothesized mechanisms underlie regional adoption: (i) identity performance, where individuals adopt language to signal demographic identities, and (ii) network diffusion through homophilous social ties. Existing theories and models often emphasize one mechanism, leading to limited explanatory power, especially for urban–rural dynamics where urban centers tend to adopt early and diverse innovations that later reach more homogeneous rural areas. The authors propose that network and identity jointly determine diffusion, with weak ties driving spread among urban counties and strong ties aligned with shared identity driving spread among rural counties. They test this hypothesis with an agent-based model calibrated and validated on large-scale Twitter data, comparing a full network+identity model against network-only and identity-only counterfactuals to directly assess each mechanism’s contribution.
Literature Review
Prior research shows that innovation in technology, religion, music, memes, and language often diffuses regionally across the USA. Linguistic variables are used as proxies for cultural change because shifts in culture and language are interlinked. Urban centers, being larger and more diverse, often pioneer new cultural artifacts, with subsequent diffusion to more homogeneous rural areas; however, mechanisms explaining urban diffusion often fail for rural dynamics and vice versa. Two dominant accounts frame spatial diffusion: identity-based adoption (performing demographic identity through language) and network-based diffusion (homophily and tie structure shaping exposure and contagion). Weak-tie diffusion can increase exposure across diverse contacts, while strong-tie diffusion emphasizes influence among similar individuals. Existing models frequently consider only one mechanism, struggling particularly with urban–rural diffusion. The literature calls for frameworks uniting network and identity to explain spatial heterogeneity, including recognized cultural regions and known migration-linked pathways.
Methodology
Study design: An agent-based model (ABM) simulates the diffusion of a single neologism through a directed social network of agents (Twitter users), integrating both network exposure and identity alignment. The model is validated against empirical diffusion patterns of new words on Twitter and contrasted with counterfactuals isolating network-only and identity-only mechanisms.
Data and targets: 76 new words were identified from 1.2 million Urban Dictionary entries as terms rare before 2013 and frequent after, spanning 2013–2020 Twitter usage (examples: fleeky, birbs, ubering). For each word, the first 10 Twitter users in the sample constitute initial adopters (seeding). The empirical benchmark includes county-level adoption counts over time and inter-county pathway strengths.
Network construction: Using the Twitter Decahose (10% sample, 2012–2020), agents are US-based users with at least one reciprocal mention tie. A directed edge i→j exists if j mentioned i at least once; the tie weight w_ij is proportional to the number of times j mentioned i (2012–2019). The resulting directed network has ~4M nodes and ~30M edges, exhibits demographic and geographic homophily, and shows the expected urban/rural tie-strength patterns (urban–urban ties are generally weaker, rural–rural ties stronger). Robustness checks consider the Facebook Social Connectedness Index and the inclusion of non-reciprocal ties.
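The edge definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline. The input format (a list of `(mentioner, mentioned)` pairs), the function name `build_mention_network`, and the choice to keep only reciprocated edges are all assumptions. Raw mention counts stand in for the weights, which the text describes only as proportional to those counts.

```python
from collections import Counter


def build_mention_network(mentions):
    """Build a directed, weighted mention network.

    `mentions` is an iterable of (mentioner, mentioned) user-ID pairs
    (a hypothetical input format). The result maps each edge (i, j) to a
    weight, where the edge i -> j means that j mentioned i.
    """
    counts = Counter()
    for mentioner, mentioned in mentions:
        if mentioner != mentioned:  # ignore self-mentions (assumption)
            counts[(mentioned, mentioner)] += 1

    # Assumption: keep only edges whose reverse edge also exists, so every
    # remaining user has at least one reciprocal mention tie.
    return {
        (i, j): weight
        for (i, j), weight in counts.items()
        if (j, i) in counts
    }


# Toy data: "a" and "b" mention each other; "c" mentions "a" one-way.
mentions = [("a", "b"), ("b", "a"), ("b", "a"), ("c", "a")]
network = build_mention_network(mentions)
print(network)
```

In this toy input the one-way mention from "c" is dropped, while the two reciprocated edges keep their mention counts as weights.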
Agent identity: Each agent’s identity is represented across D=5 categories with d=26 registers: (i) location, (ii) race/ethnicity, (iii) socioeconomic status (income, education, workforce participation), (iv) languages spoken, and (v) political affiliation. Agent locations are inferred via GPS-tagged tweets using a high-precision method requiring ≥5 GPS posts within a 15-km radius. Demographic registers are assigned from the agent’s Census tract and Congressional district (2018 ACS for race/ethnicity, SES, languages; 2018 US House election for politics). Identity registers are continuous in [0,1], reflecting local demographic proportions. Age and gender are excluded due to limited spatial variation; sensitivity analyses adding them do not materially change performance.
Word identity (enregisterment): Each word conveys identity along some categories. A weight vector v in [0,1]^D specifies category importance, and binary indicators specify which register(s) within those categories the word signals. Word identity is inferred from demographics of the first 10 adopters and remains fixed; agent identities are not altered.
Diffusion dynamics: A discrete-time simulation updates each agent j’s probability of using word w at time t+1, P_{j,w,t+1}, based on exposure and six factors: (i) attention fading (exponential decay when not exposed, with retention r∈[0,1]), (ii) novelty decay with cumulative exposures, (iii) stickiness S_jw (word-specific adoption propensity), (iv) relevance (similarity between agent identity and word identity), (v) variety (fraction of neighbors adopting at time t), and (vi) relatability (heavier weight on demographically similar neighbors and stronger ties). A linear-threshold-like rule with damping captures complex contagion. Adoption is modeled as ongoing usage rather than a one-time event, so agents may use the word multiple times across timesteps. The simulation halts when growth falls below 1% over ten timesteps, with a minimum of 100 steps post-initialization.
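Two of the mechanics above are concrete enough to sketch: attention fading and the stopping rule. The snippet below is an illustration under stated assumptions, not the paper's implementation. It assumes that fading multiplies the usage probability by r for each unexposed timestep, and that "growth" means the relative increase in cumulative adopters over the last ten timesteps. The function names `fade` and `should_stop` are hypothetical.

```python
def fade(probability, retention, unexposed_steps=1):
    """Exponentially decay a usage probability while the agent is unexposed.

    Assumption: each unexposed timestep multiplies the probability by
    `retention` (r in [0, 1]).
    """
    return probability * retention ** unexposed_steps


def should_stop(cumulative_adopters, min_steps=100, window=10, min_growth=0.01):
    """Return True once the simulation may halt.

    `cumulative_adopters[t]` is the adopter count after timestep t, with
    index 0 holding the count at initialization (assumed bookkeeping).
    """
    steps_run = len(cumulative_adopters) - 1
    if steps_run < max(min_steps, window):
        return False  # always run the minimum number of steps first
    previous = cumulative_adopters[-1 - window]
    current = cumulative_adopters[-1]
    if previous == 0:
        return current == 0
    # Relative growth over the last `window` timesteps.
    return (current - previous) / previous < min_growth
```

Under these assumptions, a run whose adopter count has been flat for the last ten of at least 100 steps stops, while a run that is still growing by more than 1% per ten steps continues.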
Counterfactual models: Three configurations remove key components to isolate mechanisms: (1) Network-only: remove identity signaling (all identity similarity terms set to 1). (2) Identity-only: shuffle edges in a configuration-model-like fashion, which eliminates homophily while preserving the degree and population distributions. (3) Null: shuffled network and no identity variables.
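The Identity-only counterfactual relies on a degree-preserving shuffle. One simple way to do this for a directed edge list is to permute the edge targets, as sketched below. The function names and the edge-list representation are assumptions, and the authors' exact procedure may differ. This variant can produce self-loops or duplicate edges, which a fuller implementation would need to handle.

```python
import random
from collections import Counter


def shuffle_targets(edges, seed=None):
    """Rewire a directed edge list by permuting its targets.

    Every node keeps its out-degree (as a source) and in-degree (as a
    target), but who is connected to whom is randomized, which breaks
    homophily. Self-loops and duplicate edges are possible (not handled).
    """
    rng = random.Random(seed)
    sources = [i for i, _ in edges]
    targets = [j for _, j in edges]
    rng.shuffle(targets)
    return list(zip(sources, targets))


def degree_sequences(edges):
    """Return (out-degree, in-degree) counters for a directed edge list."""
    out_degree = Counter(i for i, _ in edges)
    in_degree = Counter(j for _, j in edges)
    return out_degree, in_degree


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
shuffled = shuffle_targets(edges, seed=0)
print(degree_sequences(edges) == degree_sequences(shuffled))
```

Permuting targets keeps the sketch short. Edge-swap schemes avoid self-loops and duplicates at the cost of more bookkeeping.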
Parameter fitting and trials: Parameters Q, r, and θ are tuned via grid search on a random 20% of words to match empirical adoption counts, yielding Q=0.75, r=0.4, θ=100. Stickiness S_jw is tuned per word to match usage: for each word, five stickiness values are explored and five simulations are run per value, giving 25 trials per word. With 76 words, each of the four model variants is evaluated on 76 × 25 = 1,900 trials.
Evaluation metrics: (i) Spatial correspondence: Lee’s L correlation between simulated and empirical county-level adoption counts; thresholds classify “broadly similar” (L≥0.13) and “very similar” (L≥0.4). (ii) Spatiotemporal pathways: compute pathway strength between county i and j as j’s propensity to adopt at t+1 given i’s adoption at t via a zero-inflated correlation; compare empirical vs. simulated pathway strengths using Bayesian likelihood. Analyses also segment pathways by urban–urban, rural–rural, and urban–rural using OMB-based county classification. A regression of empirical pathway strengths on network-only and identity-only pathway strengths and pathway type tests weak-tie vs. strong-tie mechanisms.
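The spatial-correspondence metric can be illustrated with a small pure-Python version of Lee's L, following the commonly cited form of the statistic. This is a sketch rather than the authors' code. The spatial weight matrix `w`, the function names, and the label used for values below 0.13 are assumptions; the two thresholds come from the text above.

```python
import math


def lees_l(x, y, w):
    """Lee's L between two variables observed over the same n regions.

    `x` and `y` are lists of length n (e.g. simulated and empirical county
    adoption counts); `w` is an n-by-n spatial weight matrix (list of rows).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]

    # Spatially lagged deviations: each region's weighted neighborhood sum.
    lag_x = [sum(w[i][j] * dx[j] for j in range(n)) for i in range(n)]
    lag_y = [sum(w[i][j] * dy[j] for j in range(n)) for i in range(n)]

    numerator = sum(a * b for a, b in zip(lag_x, lag_y))
    denominator = math.sqrt(sum(d * d for d in dx)) * math.sqrt(sum(d * d for d in dy))
    scale = n / sum(sum(row) ** 2 for row in w)
    return scale * numerator / denominator


def classify_similarity(l_value):
    """Apply the thresholds from the text (L >= 0.13 and L >= 0.4)."""
    if l_value >= 0.4:
        return "very similar"
    if l_value >= 0.13:
        return "broadly similar"
    return "below threshold"  # hypothetical label for everything else
```

With an identity weight matrix, the statistic reduces to the Pearson correlation, which is a convenient sanity check before plugging in real county adjacency weights.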
Key Findings
- The Network+Identity model best reproduces spatial diffusion overall. It is the only model with average Lee’s L above 0.15 (“broadly similar”) and yields more than 50% higher likelihood for empirical pathways compared to other models.
- Distribution of best-performing models by trial: Network+Identity best in 46% of trials; Network-only in 34%; Identity-only in 20%.
- Quality of spatial matches: Nearly 40% of Network+Identity simulations are at least “broadly similar,” and 12.3% are “very similar,” compared to 6.8% for Network-only and 3.7% for Identity-only. “Very similar” cases involve highly localized adopters (average Moran’s I ≈ 0.84) and often occur when a counterfactual (Network-only or Identity-only) also performs very well.
- Temporal dynamics: Early adoption is captured relatively well by Network-only, but its performance deteriorates over time; later adoption is best captured by Network+Identity. Identity-only and Null perform poorly throughout.
- Urban–rural segmentation (H2):
  - Urban–urban pathways are best approximated by Network-only, indicating weak-tie diffusion among dissimilar and lower-weight ties in urban regions.
  - Rural–rural pathways are best approximated by Identity-only, consistent with strong-tie diffusion among demographically similar counties.
  - Urban–rural pathways are best predicted by Network+Identity, indicating both mechanisms are necessary for cross-geography spread.
- Mechanism analysis: Network-only and Identity-only pathway strengths are correlated (Pearson R=0.78; Spearman ρ=0.81), reflecting homophily, but each captures distinct aspects of diffusion. A three-way interaction between network strength, identity strength, and pathway type explains nearly 70% of the variance in empirical pathway strengths.
- Null model underperforms all others overall, though it sometimes predicts urban–urban pathways better than Identity-only and rural–rural better than Network-only, indicating population and degree structure alone cannot reproduce observed dynamics but may play roles in some geographies.
- The Network+Identity model’s strongest pathways align with known cultural regions and historical-migration and economic corridors (e.g., Great Migration routes, coastal clusters, Texas Triangle, Texas–West Coast links).
Discussion
The findings directly address the research question of how identity and network jointly shape the spatial diffusion of linguistic innovation. The model shows that considering identity alone overlooks weak-tie diffusion prevalent in urban, diverse networks, while considering network alone neglects identity-aligned strong ties that facilitate rural diffusion. Empirical spatial distributions and inter-county pathways are best reproduced when both mechanisms are integrated, especially for cross-urban–rural transmission. The emergence of urban/rural heterogeneity from minimal, general assumptions indicates these differences arise from the interplay of demographic distributions and network structure rather than population size or degree distributions alone. The results bridge sociolinguistic theories of identity signaling with network diffusion theory, suggesting that strong ties can be potent when reinforced by shared identity and that weak ties facilitate inter-group, inter-regional transmission. The alignment of simulated pathways with culturally significant regions underscores the model’s capacity to capture real-world sociohistorical processes. Nonetheless, the analysis suggests that some urban weak-tie dynamics may involve additional behavioral mechanisms (e.g., preference for diversity, reduced attention to identity) not fully captured by the current implementation.
Conclusion
This study demonstrates that models of cultural and linguistic diffusion must incorporate both network structure and identity signaling to reproduce spatial adoption patterns and spatiotemporal pathways. Network effects predominantly drive weak-tie diffusion across urban areas, identity effects dominate strong-tie diffusion among rural areas, and both are required for diffusion between urban and rural geographies. The agent-based framework provides a principled way to test counterfactuals that are unobservable in empirical data, revealing complementary mechanisms that collectively explain a large share of observed variation. Beyond lexical innovation on Twitter in the USA, the framework and insights likely extend to other cultural artifacts and contexts where networks and identities are geographically correlated. Future research should adapt identity and network estimation to other countries and platforms, enrich behavioral components (e.g., diversity-seeking, media influence), integrate additional demographic factors where spatially informative, and explore policy-relevant domains (e.g., health behaviors, activism) where strong- vs. weak-tie processes differ.
Limitations
- Data source limitations: reliance on a 10% sample of Twitter (Decahose); access and licensing constraints; platform-specific behaviors may limit generalizability.
- Identity inference: demographic attributes are approximated from Census tract and Congressional district data; individual-level identities may differ, raising potential ecological inference issues despite using fine-grained tracts.
- Simplifying assumptions: a parsimonious ABM with limited mechanisms; parameters tuned primarily to match usage counts; potential omitted factors (e.g., media effects, exogenous shocks).
- Demographic scope: age and gender excluded due to limited spatial variation; although sensitivity analyses suggest minimal effect, these factors can influence adoption.
- Network operationalization: reciprocal mention network may not capture all exposure channels (e.g., follower-only ties, algorithmic feeds); robustness checks mitigate but do not eliminate this concern.
- Urban weak-tie dynamics may involve behavioral mechanisms (e.g., diversity preference) not fully modeled, as suggested by residual discrepancies in urban–urban pathways.