The rising entropy of English in the attention economy

C. Pilgrim, W. Guo, et al.

Dive into groundbreaking research by Charlie Pilgrim, Weisi Guo, and Thomas T. Hills, revealing how the word entropy of American English has steadily increased since 1900! Discover fascinating variations across media types, with short-form media providing a richer landscape of information. This study combines ecological models and consumer behavior insights, explaining why shorter formats are becoming the go-to choice for media consumers.

Introduction

Word entropy quantifies repetition versus novelty in word distributions, with higher entropy indicating more diverse and less predictable language. Typical word frequencies follow Zipf's law with a stable exponent near 1, often explained by a balance between speaker effort and listener information preferences. Modern communication systems have increased information accessibility, creating a scarcity of attention and intensifying competition among media producers. In this attention economy, competitive pressures should shift content toward consumer preferences for information density, measurable as higher entropy. Framing this as information foraging, people search and consume information under exploration–exploitation trade-offs across environments such as web browsing and software debugging. The study investigates historical changes in information across media over two centuries and models how competition for attention can explain both the overall rise in word entropy and systematic differences across media categories (e.g., books versus news versus social media).

Literature Review

The study builds on: (a) Zipf's law and the principle of least effort linking linguistic distributions to equilibria between production costs and informational benefits; (b) information foraging theory describing search and selection behavior under resource constraints, applied to diverse information tasks; (c) attention economy perspectives positing limited human attention amid abundant information; and (d) prior findings on language evolution influenced by cognitive and social factors, including morphological simplification and learnability pressures (e.g., Linguistic Niche Hypothesis). The authors integrate these strands to hypothesize that increasing information abundance drives higher word entropy and that media formats differ in entropy due to different search and switching costs.

Methodology

Corpora and sampling: The study analyzes multiple English text corpora. COHA (Corpus of Historical American English) includes 116,614 texts from the 1810s–2000s, balanced across fiction (n=11,010), non-fiction (n=2,635), news (n=41,677), and magazines (n=61,292). The COCA (Corpus of Contemporary American English) sample comprises 2,362 texts (fiction n=275, academic journals n=266, news n=872, magazines n=949). The BNC (British National Corpus) contains 8,098 texts (written categories including fiction n=904, academic prose n=994, newspapers n=972, and others). For analyses requiring balanced categories, fiction and newspapers were common to all three corpora; the non-fiction category combined COHA non-fiction, COCA academic journals, and BNC academic prose. Because lexical measures are sensitive to sample size, all text samples were truncated to N=2,000 words, and samples with fewer words were discarded.

Cleaning and tokenization: For COHA/COCA, the pipeline removed headers, XML tags, sentences containing '@' (copyright masking), apostrophes and extra whitespace; tokenized with NLTK; removed non-text tokens (punctuation, numbers, special symbols); lowercased; filtered out samples with <2,100 tokens; then selected the middle 2,000 tokens per sample to avoid anomalous sections. For BNC, the NLTK BNC corpus reader extracted tokens; apostrophes and extra whitespace were removed. After cleaning, surviving samples with N=2,000 words: COHA total n=22,233 (fiction 8,162; non-fiction 2,045; news 720; magazines 11,306); COCA total n=505 (fiction 167; non-fiction 166; news 39; magazines 133); BNC total n=1,319 (fiction 447; non-fiction 477; news 395). Outliers (>5 SD from corpus mean in any lexical measure) were excluded (<0.1% of data).
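The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it swaps NLTK tokenization for a plain regular expression, uses a naive sentence splitter, and the function names `clean_tokens` and `middle_sample` are hypothetical.

```python
import re

SAMPLE_SIZE = 2000  # tokens kept per sample (from the text)
MIN_TOKENS = 2100   # shorter samples are dropped (from the text)


def clean_tokens(raw_text):
    """Return lowercased alphabetic tokens after the cleaning steps."""
    text = re.sub(r"<[^>]+>", " ", raw_text)        # strip markup tags
    sentences = re.split(r"(?<=[.!?])\s+", text)     # naive split (assumption)
    kept = [s for s in sentences if "@" not in s]    # drop masked sentences
    text = " ".join(kept).replace("'", "").replace("\u2019", "")
    # Keeping only runs of letters discards punctuation, numbers and symbols.
    return re.findall(r"[a-z]+", text.lower())


def middle_sample(tokens, size=SAMPLE_SIZE, minimum=MIN_TOKENS):
    """Return the middle `size` tokens, or None if the text is too short."""
    if len(tokens) < minimum:
        return None
    start = (len(tokens) - size) // 2
    return tokens[start:start + size]
```

Taking the middle slice rather than the first 2,000 tokens is what keeps front matter and end matter out of the sample.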

Social media datasets: Twitter 2009: 1.6M tweets (Apr–Jun 2009; Sentiment140), chronologically collated into n=9,180 samples of N=2,000 words each. Twitter 2020: 16M unique tweets (Feb 2020–Jan 2021) from followers of major US news sources, collated into n=143,045 samples of N=2,000 words. Reddit 2024: 14,892 English posts from the homepage feed (Jan 15–17, 2024) aggregated into n=222 samples of N=2,000 words. Social media text was cleaned as above; URLs, emails, hashtags, and usernames were effectively removed. Posts were collated to simulate realistic feed consumption; this collation inherently increases entropy through rapid context switching, aligning with user experience.
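One way to picture the collation step is the sketch below. It assumes posts are already tokenized and in feed order, that a post may be split across two consecutive samples, and that a trailing remainder shorter than 2,000 tokens is discarded; the source does not state these details, and `collate_posts` is a hypothetical name.

```python
def collate_posts(posts, size=2000):
    """Yield consecutive samples of exactly `size` tokens.

    `posts` is an iterable of token lists in chronological (feed) order.
    """
    buffer = []
    for tokens in posts:
        buffer.extend(tokens)
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[size:]
    # Any leftover tokens (fewer than `size`) are dropped (assumption).


# Tiny usage example with size=4 instead of 2,000:
feed = [["a"] * 3, ["b"] * 4, ["c"] * 2]
samples = list(collate_posts(feed, size=4))
```

Because each sample mixes several unrelated posts, its vocabulary is drawn from many contexts at once, which is the feed-level effect the text describes.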

Measures: The primary measure was unigram word entropy, H_1 = −∑_{i=1}^{W} f_i log2 f_i, where f_i is the empirical frequency of the i-th of W unique words in a sample. Robustness measures included the type-token ratio (TTR = #types/#tokens) and the Zipf exponent a, estimated via maximum likelihood. Each measure was computed once per distinct text sample.
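The three measures can be sketched in a few lines of standard-library Python. The entropy and TTR functions follow the definitions above directly. The Zipf fit is only one plausible reading of "maximum likelihood": it assumes a rank-frequency model p(r) ∝ r^(−a) over ranks 1..W and a golden-section search on a fixed interval, neither of which is specified in the source. All function names are hypothetical.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Plug-in unigram entropy in bits."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())


def type_token_ratio(tokens):
    """Number of distinct words divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


def zipf_exponent_mle(tokens, lo=0.01, hi=5.0, iters=100):
    """ML estimate of a for p(r) proportional to r**(-a), r = 1..W."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    n = sum(freqs)
    log_ranks = [math.log(r) for r in range(1, len(freqs) + 1)]
    weighted = sum(f * lr for f, lr in zip(freqs, log_ranks))

    def neg_loglik(a):
        normaliser = sum(math.exp(-a * lr) for lr in log_ranks)
        return a * weighted + n * math.log(normaliser)

    # Golden-section search; the negative log-likelihood is convex in a.
    phi = (math.sqrt(5) - 1) / 2
    left, right = lo, hi
    for _ in range(iters):
        c = right - phi * (right - left)
        d = left + phi * (right - left)
        if neg_loglik(c) < neg_loglik(d):
            right = d
        else:
            left = c
    return (left + right) / 2
```

For a 2,000-token sample, entropy is bounded above by log2(2000) ≈ 10.97 bits, reached only if every token is distinct.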

Time series analyses (COHA): For each category and lexical measure, annual medians (and means for combined-category series) were computed. Breakpoints were estimated via Python piecewise-regression (default settings). Trend tests from 1900–2009 used KPSS (null: stationarity; one-sided; statsmodels; sub-0.01 p-values obtained via simulated critical values) and Mann–Kendall (two-sided; pymannkendall). Holm–Bonferroni corrections were applied across 24 tests. For visualization, time series were smoothed with a ±5-year moving average; combined series averaged across available categories per year. Confidence intervals were 1.96×SE; for combined categories, SE was computed via the delta method.
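Two of the steps above are simple enough to sketch without the statistical libraries the authors used: the Holm–Bonferroni adjustment and the ±5-year smoothing. The version below is an illustration only. In particular, shrinking the window at the ends of the series is an assumption, and both function names are hypothetical.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def centered_moving_average(values, half_window=5):
    """Average each point with up to `half_window` neighbours per side."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half_window):i + half_window + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

With 24 tests, the smallest raw p-value is multiplied by 24, the next by 23, and so on, which is less conservative than a plain Bonferroni factor of 24 for every test.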

Between-category differences: Distributions within COHA (restricted to 2000–2007), COCA, and BNC were compared using one-way ANOVAs per corpus and measure, with Holm–Bonferroni-adjusted p-values. Kernel density estimates used Silverman’s rule; outliers >5 SD were removed (<0.1%).
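For readers who want to see what each reported F and η² summarizes, a one-way ANOVA can be written out by hand as below. This is a generic textbook computation, not the authors' code, and `one_way_anova` is a hypothetical name.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared).

    `groups` is a list of lists, one list of measurements per category.
    """
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of category means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of samples around their own category mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Toy example: three categories with three entropy-like values each.
f_stat, df1, df2, eta2 = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

Here η² is the share of total variance explained by category membership, so a value near 0.5 means media type accounts for about half of the spread in the measure.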

Supplementary industry case study: US magazine circulation (Audit Bureau of Circulation data) was plotted to explore links between entropy changes and media competition; the dataset covers well-known magazines and indicates historical circulation growth.

Ethics and availability: Analyses complied with fair use; social media data collection followed platform terms at the time. Data generated from text samples and code are publicly available; original corpora and some social datasets are restricted due to copyright/size.

Modeling framework: An information-foraging model is developed. Information items of type i arrive as Poisson processes with rate λ_i; each item has utility u_i and handling time t_i. The forager maximizes the utility rate R_media = U/(T_s+T_h), the total utility gained divided by the total time spent searching and handling, which leads to a diet inclusion rule: an item type is consumed only if r_i = u_i/t_i ≥ R_media. Increasing information prevalence raises selectivity and pressures producers to increase the information utility rate, proxied by word entropy. Extending the model to media platforms (patches), a merged Poisson process yields R_media = λ_i ū_i / (1 + f_i λ_i), where ū_i is the average utility of consumed items, f_i the average consumption time (not the word frequency used in the entropy measure), and λ_i the encounter rate for items in the diet. Platform inclusion depends on average item size, item utility rate, and prevalence. Short-form platforms incur higher switching/search costs and thus must offer higher information utility rates to compete, particularly at high information prevalence. Simulations illustrate selective survival of higher-utility information under abundance, and analyses predict viability of very short-form media when search costs are low and prevalence is high.
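The diet inclusion rule can be illustrated with the classic prey-choice procedure from optimal foraging theory: rank item types by u/t and keep adding them while the next type's u/t is at least the rate achieved without it. The sketch below assumes the standard rate expression R = Σλ_i u_i / (1 + Σλ_i t_i), which matches the merged form quoted above, but it is not the authors' simulation. The numbers and function names are hypothetical.

```python
def optimal_diet(items):
    """Return (diet, rate) for items given as (lam, utility, handling_time)."""
    ranked = sorted(items, key=lambda it: it[1] / it[2], reverse=True)
    diet, gained, time_cost = [], 0.0, 1.0  # 1.0 = one unit of search time
    for lam, u, t in ranked:
        current_rate = gained / time_cost
        if u / t < current_rate:
            break  # this and all lower-ranked types are ignored
        diet.append((lam, u, t))
        gained += lam * u
        time_cost += lam * t
    return diet, gained / time_cost


def scale_prevalence(items, factor):
    """Multiply every encounter rate by `factor` (more abundant information)."""
    return [(lam * factor, u, t) for lam, u, t in items]


# Two item types: high utility rate (10 per unit time) and low (2 per unit time).
items = [(0.1, 10.0, 1.0), (0.1, 2.0, 1.0)]
scarce_diet, scarce_rate = optimal_diet(items)
abundant_diet, abundant_rate = optimal_diet(scale_prevalence(items, 10))
```

With these toy numbers, both types are consumed when information is scarce, but only the high-utility type survives once every encounter rate is multiplied by ten. That is the selectivity effect the model relies on.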

Key Findings

Historical trends: Using COHA (n=22,233 samples), American English shows a steady rise in word entropy since around 1900, corroborated by changes in type-token ratio and Zipf exponent that indicate a broadening word distribution. Category-specific series (magazines n=11,306; news n=720; fiction n=8,162; non-fiction n=2,045) all show upward trends in entropy from 1900. Breakpoint analyses indicate changes in gradient consistent with 20th-century shifts.

Trend tests: Across 24 KPSS and Mann–Kendall tests (four categories × three measures × two tests), all results remained significant after Holm–Bonferroni correction, supporting non-stationary upward trends in lexical diversity measures from 1900–2009.

Media-type differences: Short-form media (news, magazines) exhibit higher entropy than long-form media (fiction, non-fiction); social media feeds have the highest entropy. One-way ANOVAs (Holm–Bonferroni-adjusted p<0.001 in all cases) quantify robust between-category differences:

Word entropy: COHA F(3,3586)=83, η²=0.065; COCA F(3,501)=34, η²=0.172; BNC F(2,1316)=662, η²=0.502.
Type-token ratio: COHA F(3,3586)=34, η²=0.028; COCA F(3,501)=17, η²=0.094; BNC F(2,1316)=416, η²=0.388.
Zipf exponent: COHA F(3,3586)=89, η²=0.070; COCA F(3,501)=39, η²=0.190; BNC F(2,1316)=682, η²=0.509.
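As a quick consistency check on the figures above, η² for a one-way ANOVA can be recovered from F and its degrees of freedom. This is a standard identity, worked here with the reported (rounded) values, so small discrepancies in the third decimal are expected.

```latex
% F = (SS_b / df_1) / (SS_w / df_2)  and  \eta^2 = SS_b / (SS_b + SS_w)
\eta^2 = \frac{F \, df_1}{F \, df_1 + df_2}

% BNC word entropy: F(2, 1316) = 662
\eta^2 = \frac{662 \times 2}{662 \times 2 + 1316} = \frac{1324}{2640} \approx 0.502

% COHA word entropy: F(3, 3586) = 83
\eta^2 = \frac{83 \times 3}{83 \times 3 + 3586} = \frac{249}{3835} \approx 0.065
```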

Social media feeds: Aggregated Twitter (2009 n=9,180; 2020 n=143,045) and Reddit (2024 n=222) feeds show the highest unigram entropy distributions, reflecting rapid context shifts and information density at the feed level.

Model insights: The information-foraging model explains (i) rising entropy with increasing information prevalence due to heightened selectivity and producer adaptation, and (ii) systematically higher entropy in short-form media where switching costs require higher information utility rates to achieve competitive platform utility. The model predicts very short-form media viability when prevalence is high and search time is minimized, aligning with platform features like infinite scroll and autoplay.

Discussion

The findings address the central question of how English word entropy has changed over time and why it differs across media. Empirically, a century-scale rise in entropy and consistent differences between short- and long-form media are observed across multiple corpora and validated with trend tests and ANOVAs. The attention economy model grounded in information foraging provides a parsimonious mechanism: as information becomes more abundant and search costs decline, consumers become more selective, favoring higher information utility rates, and producers adapt by increasing information density (entropy), especially in short-form contexts with greater switching costs. This framework connects linguistic change to broader cognitive and social processes and complements existing theories (e.g., the Linguistic Niche Hypothesis), suggesting that pressures toward learnability may coexist with pressures toward expressivity that reduce redundancy and increase entropy. The model also explains the emergence and success of very short-form platforms under high prevalence and low search costs, potentially amplified by personalization algorithms and the attractiveness of high-entropy information. Overall, the results support the view that humans, within limits, act as information rate maximizers, and that media ecosystems coevolve with attention constraints.

Conclusion

The study demonstrates that the unigram entropy of American English has risen since about 1900 and that contemporary media differ systematically in entropy: short-form media (news, magazines) exceed long-form (fiction, non-fiction), while social media feeds are highest. An ecological model of information foraging in the attention economy explains both the temporal rise and cross-media differences, attributing them to increased information prevalence, reduced search costs, heightened consumer selectivity, and producer adaptation toward higher information utility rates. Contributions include: (1) cross-corpus empirical evidence of rising entropy and widening word distributions; (2) robust statistical confirmation of trends and category differences; and (3) a mechanistic model linking attention economics to linguistic distributions. Future work should extend analyses beyond English and text-only media, incorporate richer entropy estimators and conditional structures, and test model predictions with experimental and real-world behavioral data, including personalized algorithmic feeds and multimodal media.

Limitations

Key limitations include: (1) Corpus representativeness: Big-data cultural analyses can reflect data-source variation. COHA’s balanced design mitigates this, but other sources (e.g., Google Ngrams) were avoided due to composition concerns. (2) Sample-size sensitivity: Lexical measures depend on text length; truncating to 2,000 words standardizes comparisons but sacrifices absolute estimation accuracy and long-range dependencies. (3) Entropy estimation: The plug-in unigram estimator is simple but biased in finite samples; alternatives (e.g., compression-based estimators) were not adopted to avoid added assumptions, and conditional entropy was avoided due to compounded biases. Bigram entropy analyses showed broadly similar trends but do not capture full linguistic structure. (4) Modeling simplifications: The foraging model abstracts away complex human goals, heterogeneity, cultural and economic dynamics, and other attention drivers (e.g., valence, emotion, predictiveness, belief-consistency). (5) Modality constraints: Analyses are text-only; attention competition spans multimedia, where processing differs and may show analogous entropy trends (e.g., declining shot lengths in video). (6) Data access: Some social media datasets are restricted due to copyright and platform policies. Results should be interpreted cautiously and generalized with care.