American cultural regions mapped through the lexical analysis of social media
T. Louf, B. Gonçalves, et al.
This study shows how cultural regions can be inferred from geotagged microblogging posts in the United States. By examining the frequency distributions of content words, Thomas Louf and colleagues identify distinct regional discourse patterns shaped by cultural factors and use them to map modern American cultural areas.
Introduction
The paper addresses how to infer American cultural regions in an objective, data-driven manner, without the subjective selection and weighting of cultural factors (e.g., politics, religion, ethnicity) on which previous theories relied. The authors argue that regional cultural differences should manifest as systematic regional variation in the topics of everyday discourse. Using geotagged social media language (Twitter), they aim to (i) develop a method to discover cultural regions by identifying regional topical patterns and (ii) apply it to map modern American cultural regions. The work matters for understanding geographic variation in values and behaviors relevant to governance, economics, and social planning, and for providing a broad, accessible classification of populations within a nation.
Literature Review
The paper situates itself within longstanding debates over American cultural geography, reviewing seven prominent theories (Odum, 1936; Elazar, 1970; Zelinsky, 1973; Gastil, 1975; Garreau, 1996; Lieske, 1993; Woodard, 2012) that disagree on the number and boundaries of cultural regions. Common threads across theories include North–South and East–West distinctions, but proposed systems range from 4 to 12 regions (e.g., Northeast, South, West, Midwest). The authors critique prior approaches for subjectively selecting and weighting cultural factors and for limited or inconsistent quantitative analysis. They also reference computational sociolinguistics and prior work using Twitter for mapping dialect and topical variation, noting that no prior study has used such data to infer general cultural regions.
Methodology
Data: 3.3 billion geotagged tweets from the contiguous US (2015-01-01 to 2021-12-31) collected via Twitter’s filtered stream API. Users tweeting at inhuman rates (>10 tweets/hour over their tweeting span) and tweets from non-Twitter platforms were removed, retaining 17 million users. Tweets were cleaned by removing links, hashtags, and mentions; tweets with fewer than 5 remaining words were discarded. Language detection (CLD2) excluded non-English tweets. Tweets were geotagged by GPS or place; tweets with geotags too coarse for county-level assignment were removed.
Unit of analysis: US counties and county equivalents (3108). Counties with fewer than 50,000 tokens were excluded, leaving N=2576 counties, totaling 9.1 billion tokens.
Text processing: lowercasing, aggregation of token counts, removal of function words and interjections; selection of the 10,000 most frequent remaining content word forms (emergent from data).
Measuring regional variation: For each word w and county c, compute relative frequency f_cw, then compute Getis–Ord G z-scores using a proximity matrix W linking a county to its 10 nearest neighbors (including itself). This spatial statistic smooths noise and identifies usage hotspots.
Dimensionality reduction: Perform PCA on G-score maps across the 10,000 words to obtain principal components of regional lexical variation. Number of components (NPC=326) selected via the broken-stick rule; these explain ~92% of variance (first 4 PCs explain ~31%). Robustness checks vary neighbor definitions (5, 15 nearest; fixed distance), finding limited sensitivity to neighbor count but sensitivity to fixed distance due to non-uniform county sizes.
Clustering cultural regions: Apply hierarchical clustering (Euclidean distance, Ward’s variance minimization) to the PCA-transformed county representations. Determine number of clusters using dendrogram structure and average silhouette scores; notable drops at 2 and 5 clusters.
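As an illustration of the hotspot step, the sketch below computes Getis–Ord z-scores for one word over a toy set of locations, using the G* form of the statistic because each location counts among its own neighbors. It is a minimal sketch, not the authors' code: the function names, the planar Euclidean distance, and the toy data are assumptions made for this example, and binary weights over the k nearest neighbors stand in for the proximity matrix W.

```python
import math

def knn_indices(coords, k):
    """For each point, return the indices of its k nearest points (itself included)."""
    neighbours = []
    for xi, yi in coords:
        order = sorted(
            range(len(coords)),
            key=lambda j: (coords[j][0] - xi) ** 2 + (coords[j][1] - yi) ** 2,
        )
        neighbours.append(order[:k])
    return neighbours

def getis_ord_g_star(values, neighbours):
    """Getis-Ord G* z-scores with binary weights.

    Assumes the values are not all identical and that every neighbour
    list is shorter than the number of locations (otherwise the
    denominator is zero).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    z = []
    for nbrs in neighbours:
        k = len(nbrs)  # with binary weights, sum(w) == sum(w^2) == k
        local_sum = sum(values[j] for j in nbrs)
        denom = s * math.sqrt((n * k - k * k) / (n - 1))
        z.append((local_sum - mean * k) / denom)
    return z

# Hypothetical toy data: two groups of three locations on a line,
# with the word used only in the first group.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
freqs = [5.0, 5.0, 5.0, 0.0, 0.0, 0.0]
z_scores = getis_ord_g_star(freqs, knn_indices(coords, k=3))
```

Positive z-scores mark locations whose neighborhood uses the word more than the overall average and negative ones mark the opposite. Real county centroids would need a geodesic rather than planar distance.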
Characteristic topics: Define for each cluster a word specificity score S_C,w by comparing the cluster’s average G for word w to the closest other cluster’s average G, selecting words with highest specificity as most characteristic.
Temporal analysis: Split corpus into 2015–2016, 2017–2018, 2019–2021; repeat PCA/clustering to assess stability. Compute annual distributions of inter-cluster Euclidean distances in G-space for the top-level (North–South) split to evaluate temporal stability.
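The specificity score can be illustrated with a small sketch. The exact formula is not given in this summary, so the code assumes a simple form: for each word, the cluster's average G minus the average G of whichever other cluster lies closest on that word. The function names and toy values are hypothetical.

```python
def cluster_means(g_scores, labels):
    """Average G-score per word for each cluster.

    g_scores: one list of per-word G-scores per county.
    labels:   the cluster label of each county.
    """
    sums, counts = {}, {}
    for row, lab in zip(g_scores, labels):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for w, g in enumerate(row):
            acc[w] += g
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [t / counts[lab] for t in acc] for lab, acc in sums.items()}

def specificity(means, cluster):
    """Assumed score: own mean minus the closest other cluster's mean, per word.

    Requires at least two clusters.
    """
    scores = []
    for w, g in enumerate(means[cluster]):
        closest = min(
            (m[w] for lab, m in means.items() if lab != cluster),
            key=lambda other: abs(g - other),
        )
        scores.append(g - closest)
    return scores

def top_words(vocab, scores, n):
    """Return the n words with the highest specificity."""
    ranked = sorted(zip(scores, vocab), reverse=True)
    return [word for _, word in ranked[:n]]

# Hypothetical toy input: four counties, two words, three clusters.
vocab = ["grits", "snow"]
g_scores = [[2.0, 0.0], [4.0, 0.0], [0.0, 1.0], [-1.0, 3.0]]
labels = ["A", "A", "B", "C"]
means = cluster_means(g_scores, labels)
scores_a = specificity(means, "A")
```

Comparing against the closest other cluster, rather than the global mean, means a word only scores highly when it separates the cluster from all of the others.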
Key Findings
- The US divides into five primary cultural regions based on regional topical patterns in Twitter discourse, derived objectively via PCA and hierarchical clustering. A two-way split (North vs. Southeast) is the strongest division; the five-cluster solution is supported by silhouette analysis and dendrogram structure.
- Data scale and processing: 3.3B geotagged tweets, 17M users, 9.1B tokens, 2576 counties, 10,000 content words. PCA reduced to 326 components explaining ~92% variance (first 4 PCs capture ~31%).
- Region descriptions and defining topics:
1) Southeast (blue): Aligns with areas of high African American population density; characterized by African-American culture (e.g., cuisine: grits, cookout; fashion: braids, dreads; music: rappers) and African-American English lexical items (e.g., bruh, dawg, lawd, turnt; acronyms like frfr, stg). Fewer place-name references than other clusters.
2) Midwest-centered (yellow): Strong focus on sports; 40 of top 50 words relate to American team sports (basketball, baseball, football, wrestling), with terms like basketball, rebound, innings, teams, tourney; place names tied to teams (e.g., Cubs, Chiefs) and state names (Ohio, Illinois); also school-related terms (locker, choir).
3) Rural/outdoors, discontinuous (green): Encompasses rural and mountainous/outdoor areas (Rockies, Appalachians). Common words relate to nature, weather, and outdoor activities (mountains, tree, snow, seasonal, adventures, trail), work (hiring, jobs), and place names (Colorado, Montana). Verbs tied to communication, thought, and physical actions are less frequent, as are negative words (wrong, bad), suggesting a more positive outlook and discourse that is less focused on the individual.
4) Urban coastal, discontinuous (red): Major metros on Northeast and West coasts; urban-life terms (homeless, traffic), nationalities/immigration (Latino, Asian), many place names (California, NYC), and many negative-connotation words (violence, attack, dangerous, crime, asshole, fucking, protests, indicted, Nazi, supremacist, disgusting, abusive), indicating more discussion of negative topics.
5) South Central/Texas-Oklahoma (cyan): Strong local pride and place references (e.g., Whataburger, Texas, TX, Texan, Dallas; Oklahoma terms), plus dialect items of Anglo (yalls, fixing) and Hispanic origins (queso, taco).
- The analysis reveals both contiguous and non-contiguous cultural regions; a strong North–South divide influenced by African American culture; and a notable coastal vs. interior opposition not highlighted in classic theories.
- Temporal stability (2015–2021): The North–South split remains stable across sub-periods; distributions of inter-cluster distances show roughly constant medians and little variation year-to-year, supporting robustness of detected regions.
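The component count reported above comes from the broken-stick rule. The sketch below shows one common form of that rule: a component is kept while its share of variance exceeds the share expected if a unit stick were broken at random into p pieces. The function names and eigenvalues are hypothetical, and the paper's own implementation may differ in detail.

```python
def broken_stick(p):
    """Expected variance share of the k-th largest of p random stick pieces."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]

def n_components_broken_stick(eigenvalues):
    """Count leading components whose variance share beats the broken-stick share."""
    total = sum(eigenvalues)
    expected = broken_stick(len(eigenvalues))
    kept = 0
    for lam, b in zip(sorted(eigenvalues, reverse=True), expected):
        if lam / total <= b:
            break  # stop at the first component that fails the test
        kept += 1
    return kept

# Hypothetical PCA eigenvalues (variance per component).
eigenvalues = [5.0, 2.5, 0.3, 0.1, 0.1]
n_kept = n_components_broken_stick(eigenvalues)
```

In this toy case the first two components hold 62.5% and 31.25% of the variance, both above their broken-stick thresholds, while the third does not, so two components are kept.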
Discussion
The study directly addresses the challenge of objectively defining American cultural regions by inferring them from regional topical variation in everyday language use on social media. By leveraging spatial statistics (Getis–Ord G), PCA, and clustering on a very large geotagged corpus, the approach avoids subjective selection/weighting of cultural factors. The resulting five-region map aligns with elements of prior theories (e.g., North–South divide, Midwest, coastal vs. interior distinctions) while revealing modern, non-contiguous patterns and a pronounced coastal–interior opposition. The topical profiles clarify how ethnicity (especially African American culture), urbanization, sports culture, outdoor orientation, and local pride contribute differently across regions. Stability over time suggests these are persistent cultural structures at the scale analyzed. The findings enrich understanding of American cultural geography and demonstrate that large-scale linguistic data can uncover both well-known and previously overlooked cultural dimensions (e.g., sentiment orientation, social interaction focus) that are difficult to capture in traditional data sources.
Conclusion
This paper introduces a fully data-driven method to infer cultural regions by analyzing regional topical variation in geotagged social media. Applied to 3.3B US tweets (2015–2021), it identifies five stable cultural regions with distinct topical signatures and confirms a strong North–South divide and a coastal–interior opposition. Contributions include: (i) an automated, scalable pipeline integrating spatial hotspot detection, PCA with broken-stick selection, and hierarchical clustering; (ii) quantitative identification of characteristic regional topics; and (iii) empirical evidence for persistent cultural structures over recent years. Future research could apply the method to other languages, platforms, and genres; incorporate richer demographic annotations; refine spatial modeling for heterogeneous units; extend diachronic analyses to longer periods; and explore finer-grained or multi-level regional structures and cross-country comparisons.
Limitations
- Data source bias: Twitter users are not demographically representative; geotagged users are further biased; topic distributions differ from offline discourse; and spatio-temporal tweet density is uneven.
- Genre limitation: Results reflect patterns in Twitter microblogging; other genres may show different topical emphases.
- Spatial unit heterogeneity: Counties vary in size/population; fixed-distance neighbor definitions can distort results due to non-uniform county sizes.
- Methodological choices: Selection of nearest neighbors (though tested), PCA-based dimensionality reduction, and hierarchical clustering may influence outcomes; the optimal number of regions beyond the supported 2 and 5 remains uncertain.
- Temporal scope: The 2015–2021 window may be too short to confirm structural cultural changes; temporal analyses show stability but cannot rule out longer-term shifts.
- Content filtering: Exclusion of hashtags and function words may omit certain topical signals, though hashtags represented <5% of content and were removed to reduce trend noise.