American cultural regions mapped through the lexical analysis of social media

Interdisciplinary Studies

T. Louf, B. Gonçalves, et al.

This study shows how cultural regions can be inferred from geotagged microblogging posts in the United States. By examining the frequency distributions of content words, the researchers uncover distinct regional discourse patterns shaped by a range of cultural factors. Conducted by Thomas Louf and colleagues, the research offers a comprehensive picture of modern American cultural areas.

Introduction
Defining cultural regions is crucial for understanding human behavior within a nation. Previous attempts to map American cultural regions have yielded conflicting results due to subjective selection and weighting of cultural factors. This paper proposes a novel method to objectively identify cultural regions by analyzing regional patterns in topics of conversation from a large corpus of geolocated social media data. The hypothesis is that regional variations in everyday language reflect underlying cultural differences. Analyzing these patterns allows for the inference of broad cultural regions without predefining topics or their importance. The method leverages the massive amount of geotagged social media data, specifically from Twitter, to identify and analyze these regional topical patterns. While acknowledging biases inherent in social media data, the authors argue that pervasive cultural regions should manifest in any large sample of everyday language.
Literature Review
Existing research on American cultural regions presents diverse and often conflicting maps. Studies by Elazar (1970), Garreau (1996), Gastil (1975), Lieske (1993), Odum (1936), Woodard (2012), and Zelinsky (1973) disagree on both the number and the boundaries of these regions. Some commonalities exist, such as North-South and East-West divisions, but the number of primary cultural areas ranges widely, from 4 to 12. These approaches are limited by the subjective selection and weighting of cultural factors and by the lack of an objective method for synthesizing them. This study aims to address these limitations with a data-driven approach that identifies cultural regions objectively from regional variation in language.
Methodology
The study analyzes 3.3 billion geotagged tweets posted from the contiguous US between 2015 and 2021. Preprocessing discards tweets from accounts that post at inhuman rates, tweets not sent from Twitter's mobile applications, and non-English tweets, and strips links and hashtags. Each tweet is geolocated using its GPS coordinates or its tagged 'place,' and tweets that cannot be reliably assigned to a US county are removed. The remaining tweets (9.1 billion tokens) are processed by converting words to lowercase, removing function words and interjections, and keeping the 10,000 most common remaining words. Regional variation is measured as the relative frequency of each word in each county. To reduce noise, Getis-Ord z-scores are computed for each county and word, which identifies geographical hotspots of word usage. Principal component analysis (PCA) then reduces the data from approximately 10,000 dimensions to 326, the number of components selected by the broken-stick rule, which together explain 92% of the variance. Hierarchical clustering with Euclidean distance and the Ward variance minimization algorithm groups counties with similar lexical signatures, and these groups are interpreted as cultural regions. The optimal number of clusters is determined using the silhouette score. Finally, a specificity metric identifies the words most characteristic of each cluster, and the temporal stability of the results is assessed by splitting the dataset into three time periods: 2015-2016, 2017-2018, and 2019-2021.
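The steps described above (relative frequencies, Getis-Ord hotspot z-scores, PCA with the broken-stick rule, Ward clustering, silhouette-based choice of cluster count) can be sketched on a toy example. This is an illustrative sketch, not the authors' code. Everything specific in it is an assumption: the synthetic 12-county by 6-word count matrix, the chain-shaped binary neighbourhood matrix (each county neighbours itself and the adjacent counties), the standard Gi* formula, the hand-written silhouette function, and all function and variable names.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def relative_frequencies(counts):
    """Turn a (counties x words) count matrix into per-county word shares."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)


def getis_ord_gi_star(freq, w):
    """Gi* z-score of every word in every county; w must include self-links."""
    n = freq.shape[0]
    mean = freq.mean(axis=0)
    std = np.sqrt((freq ** 2).mean(axis=0) - mean ** 2)
    w_sum = w.sum(axis=1, keepdims=True)
    w_sq = (w ** 2).sum(axis=1, keepdims=True)
    numerator = w @ freq - w_sum * mean
    denominator = std * np.sqrt((n * w_sq - w_sum ** 2) / (n - 1))
    return numerator / denominator


def pca_broken_stick(z):
    """Project onto the leading components that beat the broken-stick share."""
    centred = z - z.mean(axis=0)
    u, sing, _ = np.linalg.svd(centred, full_matrices=False)
    share = sing ** 2 / (sing ** 2).sum()
    p = share.size
    # Expected share of the k-th largest piece of a randomly broken stick.
    stick = np.cumsum(1.0 / np.arange(p, 0, -1))[::-1] / p
    keep = 0
    while keep < p and share[keep] > stick[keep]:
        keep += 1
    keep = max(keep, 1)
    return u[:, :keep] * sing[:keep]


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    values = []
    for i, lab in enumerate(labels):
        same = labels == lab
        if same.sum() == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        a = dist[i, same].sum() / (same.sum() - 1)
        b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != lab)
        values.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return float(np.mean(values))


# Synthetic, noise-free toy data (assumption): counties 0-5 favour words 0-2,
# counties 6-11 favour words 3-5.
n_counties, n_words = 12, 6
counts = np.full((n_counties, n_words), 10)
counts[:6, :3] = 60
counts[6:, 3:] = 60

# Counties laid out on a line; neighbours are self plus adjacent counties.
weights = np.eye(n_counties) + np.eye(n_counties, k=1) + np.eye(n_counties, k=-1)

freq = relative_frequencies(counts)
z = getis_ord_gi_star(freq, weights)
scores = pca_broken_stick(z)

# Ward linkage uses Euclidean distance on the retained component scores.
tree = linkage(scores, method="ward")
labels_by_k = {k: fcluster(tree, t=k, criterion="maxclust") for k in (2, 3, 4)}
best_k = max(labels_by_k, key=lambda k: silhouette(scores, labels_by_k[k]))
print(best_k, labels_by_k[best_k])
```

On real data the count matrix would have thousands of counties and about 10,000 word columns, and the neighbourhood matrix would come from county geography rather than a chain. The sketch also clusters on raw component scores; whether to standardise the z-scores before PCA is a design choice the summary does not specify.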
Key Findings
The analysis reveals five primary cultural areas in the modern US, each with distinct topical patterns. The strongest division is between the Southeast and the rest of the US, reflecting the cultural divide between White and Black America and between the North and the South. A significant opposition also exists between coastal and inland areas, suggesting a cultural divide between urban and rural regions. The study finds three basic types of lexical patterns associated with the five clusters: (1) words associated with specific locations (cities, states, sports teams); (2) words connected with non-regional topics that nonetheless show regional variation; and (3) dialect items. Cluster 1 (Southeast) is strongly associated with African American culture, including cuisine, fashion, music, and African American English. Cluster 2 (Midwest) is characterized by a focus on American team sports. Cluster 3 (rural areas, mountainous regions) is linked to nature, weather, outdoor activities, and work-related words. Cluster 4 (coastal urban areas) shows frequent use of words associated with urban life and immigration, as well as words with negative connotations relating to violence, crime, and political unrest. Cluster 5 (South Central states, especially Texas and Oklahoma) features frequent references to place names, reflecting local pride, and a mix of Anglo and Hispanic dialect terms. The temporal analysis, which splits the data into three time periods, shows that the North-South division remains stable over time, supporting the validity of the identified cultural regions.
Discussion
The findings confirm the importance of factors like ethnicity and religion in defining American cultural regions, while revealing variations in their relevance across the US. The study identifies subtler cultural patterns overlooked in previous research. This data-driven approach not only maps cultural regions but also reveals the cultural factors defining them. While acknowledging the limitation of analyzing only one genre of American English (Twitter), the authors argue that similar patterns should be observed in other genres if cultural regions are pervasive. The method's applicability to other big data resources with linguistic value is highlighted.
Conclusion
This study presents a novel data-driven method to map American cultural regions using social media data. It identifies five distinct cultural areas, revealing a strong North-South division, an urban-rural divide, and regionally varying topical focuses. Future research could explore other genres of American English, compare findings with other data sources, and investigate the temporal evolution of cultural regions in greater detail.
Limitations
The study is limited to Twitter data, which may not fully represent the entire US population or encompass all aspects of cultural practice. The method relies on specific algorithms (PCA and hierarchical clustering), and different algorithms could produce slightly different cluster assignments. The results are descriptive and do not explain the causal mechanisms behind the observed cultural patterns. Temporal stability is assessed over a relatively short time frame (2015-2021).