The Russian war in Ukraine increased Ukrainian language use on social media
D. Racek, B. I. Davidson, et al.
This research, conducted by Daniel Racek, Brittany I. Davidson, Paul W. Thurner, Xiao Xiang Zhu, and Göran Kauermann, explores the significant shift in the language choices of Ukrainian citizens on social media before and during the war in Ukraine. With over 4 million tweets analyzed, the study reveals a dramatic move towards a more Ukrainian online identity amidst the conflict.
Introduction
The study investigates how Ukrainian citizens’ language use on Twitter (now X) shifted before and during the Russian invasion of Ukraine. Social media plays an important role in crises, both amplifying misinformation and enabling real-time updates and calls for aid. Ukraine’s context is marked by long-standing bilingualism (Ukrainian and Russian) and politicized language use tied to identity and nation-building. Prior to the 2022 full-scale invasion, there had been gradual shifts away from Russian linguistic identification, especially after the 2014 Euromaidan and Crimea/Donbas events. The research aims to quantify language choice and tweeting activity over time, to distinguish whether observed changes stem from user turnover (users entering or leaving the sample) or from behavioural changes among continuing users, and to measure the extent of language switching, especially from Russian to Ukrainian, at and around the outbreak of war.
Literature Review
The paper situates itself in research on social media’s role in crises and political events, including crisis informatics and monitoring conflict via social media. It draws on sociolinguistic literature linking multilingualism to identity and self-presentation online, where language choice can signal audience targeting (e.g., English for wider reach). Language is inherently political, especially in post-Soviet contexts where language laws and policies aim to support nation-building and reclaim native languages. In Ukraine, census and survey evidence across decades documented gradual shifts from Russian towards Ukrainian ethnic and linguistic identification, with accelerated changes post-2014. Small-scale qualitative findings also indicated language shifts on social platforms following earlier Russian interventions. This background motivates a large-scale, longitudinal, ecologically valid analysis of Ukrainian Twitter to quantify language shifts and their drivers.
Methodology
Ethics: Approved by the ethics commission of the Faculty of Mathematics, Computer Science and Statistics at LMU Munich (EK-MIS-2022-127). No user demographics were collected; no informed consent was required per ethics guidance. Study not preregistered.
Data collection: Tweets from 2020-01-09 to 2022-10-12 were collected via Twitter’s 1% real-time stream, filtering for tweets with geo-information. The dataset was manually filtered to retain tweets tagged with country code UA and to exclude retweets, leaving primary tweets, quotes, and replies.
Coverage recovery: Gaps (>10 min without tweets) and low-volume days were identified and backfilled using the Twitter Research API v2 (tweets/search/all) to retrieve tweets with Ukrainian geo-information; duplicates were removed. This added 350,359 tweets. A sensitivity analysis on 29 randomly selected days indicated 98.24% coverage from the stream vs. 77.67% from the historical API alone, suggesting that the stream captured most geo-tagged tweets, including many that were later deleted.
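The gap rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `find_gaps` and the example timestamps are hypothetical, and it only covers the ">10 min without tweets" criterion, not the low-volume-day check.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=10)):
    """Return (start, end) pairs where consecutive tweets are more than max_gap apart."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > max_gap:
            gaps.append((prev, curr))
    return gaps

# Hypothetical tweet arrival times, for illustration only.
stream = [
    datetime(2022, 2, 24, 5, 0),
    datetime(2022, 2, 24, 5, 4),
    datetime(2022, 2, 24, 5, 31),  # 27 minutes after the previous tweet
    datetime(2022, 2, 24, 5, 35),
]

gaps = find_gaps(stream)
for start, end in gaps:
    # Each window would then be queried against a historical search endpoint.
    print(f"backfill window: {start:%H:%M} to {end:%H:%M}")
```

Each returned window marks a stretch of the stream that would need to be requested from the historical API and deduplicated against what was already collected.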
Cleaning and spam/bot filtering: Steps included (1) removing duplicate tweets; (2) training a random forest bot detector following Yang et al. (2020) on botometer-feedback, celebrity, political-bots, and 100 manually labeled accounts; nested CV yielded AUROC 0.9837, AUPRC 0.7707. Conservative thresholds removed users with predicted bot probability >50% and >10 tweets, and users with >30% and >10,000 tweets; (3) removing users with >100 tweets/day; (4) keeping only tweets from official Twitter clients or Instagram; (5) filtering repetitive within-1-minute duplicates and high-volume BTS spikes. From 4,453,341 tweets (62,712 users), the cleaned set comprised 2,845,670 tweets (41,696 users).
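As a rough sketch of how the user-level removal thresholds in steps (2) and (3) combine, the snippet below applies them to a handful of made-up accounts. The `UserStats` fields and the example values are hypothetical, and reading the ">100 tweets/day" rule as a cap on a user's busiest day is an assumption; the summary does not say whether the paper uses a maximum or an average.

```python
from dataclasses import dataclass

@dataclass
class UserStats:
    user_id: str
    bot_probability: float   # score from a bot classifier, in [0, 1]
    total_tweets: int
    max_tweets_per_day: int  # assumption: the daily rule looks at the busiest day

def is_removed(u: UserStats) -> bool:
    """Apply the three user-level removal rules described in the text."""
    likely_bot = u.bot_probability > 0.50 and u.total_tweets > 10
    heavy_suspect = u.bot_probability > 0.30 and u.total_tweets > 10_000
    hyperactive = u.max_tweets_per_day > 100
    return likely_bot or heavy_suspect or hyperactive

# Hypothetical accounts, for illustration only.
users = [
    UserStats("a", 0.62, 45, 6),       # high score, enough tweets: removed
    UserStats("b", 0.62, 8, 3),        # high score, but only 8 tweets: kept
    UserStats("c", 0.35, 12_000, 40),  # moderate score, very high volume: removed
    UserStats("d", 0.10, 900, 130),    # low score, but >100 tweets in a day: removed
    UserStats("e", 0.20, 300, 12),     # kept
]

kept = [u.user_id for u in users if not is_removed(u)]
print(kept)
```

Pairing each probability threshold with a minimum tweet count means that low-activity accounts are not dropped on the strength of a noisy classifier score alone.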
User activity aggregation: For modeling, tweets per user per week were aggregated across three languages (Ukrainian UA, Russian RU, English EN). Weeks with any activity (and up to two subsequent weeks) were included. The modeling sample spanned 2020-01-13 to 2022-10-10 (143 weeks), 13,643 users, 1,045,245 observations.
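One way to read the inclusion rule (active weeks plus up to two following weeks) is sketched below for a single user. The helper names, the integer week index, and the choice to emit one row per week and language with zero counts filled in are all assumptions made for illustration; the summary does not spell out the exact row layout.

```python
def observation_weeks(active_weeks, carry=2, last_week=None):
    """Weeks to include: every active week plus up to `carry` weeks after it."""
    included = set()
    for week in active_weeks:
        for offset in range(carry + 1):
            if last_week is None or week + offset <= last_week:
                included.add(week + offset)
    return sorted(included)

def build_rows(counts, languages=("UA", "RU", "EN"), carry=2, last_week=None):
    """counts maps (week, language) to a tweet count for one user.

    Returns (week, language, count) rows, with zeros for included weeks
    in which the user did not tweet in that language.
    """
    active = {week for (week, _lang), n in counts.items() if n > 0}
    rows = []
    for week in observation_weeks(active, carry, last_week):
        for lang in languages:
            rows.append((week, lang, counts.get((week, lang), 0)))
    return rows

# Hypothetical user: tweets in weeks 3 and 7 of a study that ends at week 8.
counts = {(3, "RU"): 4, (3, "UA"): 1, (7, "UA"): 2}
rows = build_rows(counts, last_week=8)
for row in rows:
    print(row)
```

Under this reading, the trailing zero-count weeks let the model register a drop in activity after a user's last tweets instead of treating the user as simply unobserved.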
Models:
- Tweet activity model (GAMM): For user u, language l, and week t, the count Y_{tul} is Poisson distributed with intensity λ_{tul}, where log λ_{tul} = μ + s_l(t) + W_{ul}. Here s_l(t) is a smooth global time trend per language and W_{ul} are normally distributed user-language random intercepts. Implemented in R with mgcv (bam, discrete=TRUE) using thin plate regression splines; explained deviance 71.3%. Effect sizes: behavioural = exp(Δs_l(t)); sample = exp(Δ average W_{ul} among the users active at each of the two times).
- Language choice models (GAMM, binomial): Pairwise probabilities (UA vs RU, UA vs EN, RU vs EN) are modeled per user-week as X_{tu} ~ Binomial(n_{tu}, π_{tu}) with logit(π_{tu}) = μ + s(t) + W_u, where s(t) is a smooth time trend and W_u are user random intercepts. Implemented with mgcv bam; explained deviances: 85.8% (UA/RU), 90.5% (UA/EN), 90.0% (RU/EN). Estimation samples: UA/RU n=194,178; UA/EN n=146,984; RU/EN n=170,853; all over 143 weeks. Effect sizes on odds: behavioural = exp(Δs(t)); sample = exp(Δ average W_u among the users active at each of the two times).
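The effect-size definitions follow directly from the two link functions. The short derivation below is a sketch under stated notation: the set A_t of users active in week t, the bar notation for the average random intercept, and the week index t on π are introduced here for illustration and are not taken from the paper.

```latex
\begin{align*}
% Count model: expected weekly count for user u, language l, week t
\mathbb{E}[Y_{tul}] &= \lambda_{tul} = \exp\bigl(\mu + s_l(t) + W_{ul}\bigr) \\
% Behavioural effect: same user at two times, so \mu and W_{ul} cancel
\frac{\lambda_{t_2 ul}}{\lambda_{t_1 ul}} &= \exp\bigl(s_l(t_2) - s_l(t_1)\bigr) \\
% Sample effect: change in the average intercept of whoever is active
\exp\bigl(\bar{W}_l(t_2) - \bar{W}_l(t_1)\bigr),
  &\qquad \bar{W}_l(t) = \frac{1}{|A_t|} \sum_{u \in A_t} W_{ul} \\
% Choice model: the logit link makes the same differences act on odds
\frac{\pi_{tu}}{1 - \pi_{tu}} &= \exp\bigl(\mu + s(t) + W_u\bigr) \\
% Reported percentage for a log-scale difference \Delta
\text{relative change} &= 100 \cdot \bigl(\exp(\Delta) - 1\bigr)\,\%
\end{align*}
```

Because both effects are differences on the log scale, changes over consecutive periods add on that scale and multiply on the scale of counts or odds.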
Topic modeling: Multilingual BERTopic identified war-related topics (#1 updates/help; #3 political aspects) for subgroup analyses.
Key dates: Start (2020-01-27), Aggression (first US report of Russian troop mobilization: 2021-11-11), War (invasion: 2022-02-24), End Study (2022-10-10).
Key Findings
Descriptives and activity:
- Language distribution (API lang field): Ukrainian 35.8%, Russian 35.4%, English 11.5%, undefined 11.1%, others ≤1.2%.
- Over time, RU tweets declined steadily; UA rose, especially after the invasion; EN spiked around war onset and stabilized above pre-war levels.
- Active users/week fell from ~2,800 (early 2020) to ~1,800 pre-war, then increased post-invasion; turnover was ~250 users/week switching between active and inactive; first-time joiners and final leavers numbered ~50/week each, roughly doubling after the invasion.
Tweet activity model (effect sizes; sample vs behavioural):
- Sample effects (relative change in expected counts):
  - RU: largely stable until Nov 2021; then −21% (Nov 2021 to Oct 2022). Table 1: Start→Aggression −2.91%; Aggression→War −17.41%; War→End −4.12%; Aggression→End −20.82%.
  - EN: minimal change pre-aggression; sharp +107% after aggression. Table 1: Start→Aggression +6.16%; Aggression→War +34.87%; War→End +53.14%; Aggression→End +106.54%.
  - UA: +43% pre-aggression; little change right before war; +87.7% after war onset. Table 1: Start→Aggression +43.12%; Aggression→War −0.44%; War→End +87.70%; Aggression→End +86.87%.
- Behavioural effects (relative change in expected counts):
  - RU: Start→Aggression −48.90%; Aggression→War +4.68%; War→End −23.86%; Aggression→End −20.30%.
  - EN: Start→Aggression −34.41%; Aggression→War +130.11%; War→End −39.98%; Aggression→End +38.09%.
  - UA: Start→Aggression +4.67%; Aggression→War +35.72%; War→End +15.18%; Aggression→End +56.32%.
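Since these effects are differences on the log scale, the Aggression→War and War→End changes should multiply to give the Aggression→End change. The check below is a small sketch of that arithmetic using the behavioural figures listed above, rounded to two decimals; the helper names are hypothetical.

```python
import math

def to_log(pct):
    """Convert a percentage change to a log-scale difference."""
    return math.log1p(pct / 100.0)

def to_pct(log_change):
    """Convert a log-scale difference back to a percentage change."""
    return math.expm1(log_change) * 100.0

def compose(*pcts):
    """Chain consecutive percentage changes by adding them on the log scale."""
    return to_pct(sum(to_log(p) for p in pcts))

# Behavioural effects on tweet counts: (Aggression→War, War→End, Aggression→End)
reported = {
    "RU": (4.68, -23.86, -20.30),
    "EN": (130.11, -39.98, 38.09),
    "UA": (35.72, 15.18, 56.32),
}

for lang, (agg_to_war, war_to_end, agg_to_end) in reported.items():
    combined = compose(agg_to_war, war_to_end)
    print(f"{lang}: composed {combined:+.2f}% vs reported {agg_to_end:+.2f}%")
```

The composed values match the reported Aggression→End figures up to rounding, which also explains why a large rise followed by a large fall (as for EN) can still leave a net increase.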
Language choice models (odds changes; sample vs behavioural):
- Sample effects (odds of language1 over language2):
  - UA over RU: Start→Aggression +66.13%; Aggression→War +13.00%; War→End +65.72%; Aggression→End +87.25%.
  - UA over EN: Start→Aggression +21.43%; Aggression→War −52.08%; War→End +41.96%; Aggression→End −31.98%.
  - RU over EN: Start→Aggression −19.01%; Aggression→War −61.74%; War→End −29.33%; Aggression→End −72.96%.
- Behavioural effects (odds):
  - UA over RU: Start→Aggression +128.69%; Aggression→War +64.14%; War→End +129.24%; Aggression→End +248.63%.
  - UA over EN: Start→Aggression +52.08%; Aggression→War −38.23%; War→End +92.66%; Aggression→End +27.90%.
  - RU over EN: Start→Aggression −33.61%; Aggression→War −38.69%; War→End −20.66%; Aggression→End −51.36%.
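Odds changes are easy to misread as probability changes. The sketch below shows what a given percentage change in odds does to a probability at a few baseline values; the baselines are hypothetical and chosen only for illustration, since the summary does not report baseline probabilities.

```python
def shift_probability(p_before, odds_change_pct):
    """Apply a percentage change in odds to a baseline probability."""
    odds_before = p_before / (1.0 - p_before)
    odds_after = odds_before * (1.0 + odds_change_pct / 100.0)
    return odds_after / (1.0 + odds_after)

# +128.69% is the behavioural Start→Aggression change for UA over RU listed above.
# The baseline probabilities are assumptions, not values from the study.
for baseline in (0.2, 0.5, 0.8):
    after = shift_probability(baseline, 128.69)
    print(f"baseline {baseline:.2f} -> {after:.3f}")
```

The same odds change moves a probability most when the baseline is near the middle and much less when it is already close to 0 or 1.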
User-level switching (UA vs RU; users active both before and after the war began; n=3,237 with UA or RU tweets in both periods):
- Among 1,363 RU-dominant users pre-war (>80% RU): 839 (61.6%) tweeted more in UA post-war; 566 (41.5%) had significant change; 341 (25.0%) hard-switched to >80% UA; 296 (21.7%) significant hard-switches.
- Among 1,172 UA-dominant users pre-war (>80% UA): 471 (40.2%) tweeted more in RU post-war; 83 (7.1%) significant; 35 (3.0%) hard-switches.
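The grouping used in these counts can be sketched as a simple classification on each user's share of Ukrainian among their UA and RU tweets. The function names and example users are hypothetical, and the sketch covers only the 80% dominance and hard-switch thresholds; the significance test behind the "significant" counts is not described in this summary and is not reproduced here.

```python
def ua_share(ua, ru):
    """Share of Ukrainian among a user's UA and RU tweets, or None if there are none."""
    total = ua + ru
    return ua / total if total else None

def classify(pre, post, threshold=0.80):
    """pre and post are (ua_count, ru_count) pairs for the two periods."""
    pre_share, post_share = ua_share(*pre), ua_share(*post)
    if pre_share is None or post_share is None:
        return {"group": "excluded", "moved": False, "hard_switch": False}
    if 1.0 - pre_share > threshold:    # more than 80% RU before the war
        return {
            "group": "RU-dominant",
            "moved": post_share > pre_share,       # tweets more in UA afterwards
            "hard_switch": post_share > threshold,  # more than 80% UA afterwards
        }
    if pre_share > threshold:          # more than 80% UA before the war
        return {
            "group": "UA-dominant",
            "moved": post_share < pre_share,            # tweets more in RU afterwards
            "hard_switch": 1.0 - post_share > threshold,  # more than 80% RU afterwards
        }
    return {"group": "mixed", "moved": False, "hard_switch": False}

# Hypothetical users: ((UA, RU) before, (UA, RU) after)
examples = {
    "u1": ((2, 48), (45, 5)),
    "u2": ((1, 30), (10, 20)),
    "u3": ((40, 2), (38, 1)),
    "u4": ((10, 10), (12, 3)),
}
results = {name: classify(pre, post) for name, (pre, post) in examples.items()}
for name, result in results.items():
    print(name, result)
```

Here `u1` is an RU-dominant user who hard-switches, `u2` shifts toward Ukrainian without crossing the 80% threshold, `u3` stays UA-dominant, and `u4` falls into neither dominant group.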
Content patterns: RU→UA switchers produced more war-topic tweets in absolute counts (+62.5% topic #1; +100% topic #3), but differences were nonsignificant when normalized by each user’s total tweets (+4.71% and +17.6%, both ns).
Overall: A steady pre-war shift from RU to UA accelerated sharply with aggression and invasion, driven primarily by behavioural changes rather than only user turnover, with a pronounced temporary behavioural spike in EN use around war onset.
Discussion
The study demonstrates that language choice on social media is a salient expression of identity and political stance. Disentangling user turnover from behavioural change reveals that the long-term shift away from Russian toward Ukrainian predates the war but accelerates drastically with the invasion. Behavioural changes dominate the effect: active users reduce RU and increase UA tweeting, indicative of a conscious repositioning toward a Ukrainian identity and distancing from Russia. The temporary surge in English use around the invasion suggests strategic communication to international audiences for visibility and aid. These findings align with historical census and survey evidence of growing Ukrainian linguistic identification, particularly after 2014, and extend them by providing large-scale, longitudinal, ecologically valid behavioral data. The approach also clarifies that shifts are not merely due to different users joining or leaving; many individuals actively changed their language behavior, with substantial RU→UA switching and relatively few UA→RU reversals.
Conclusion
This large-scale longitudinal analysis of geo-tagged tweets from Ukraine shows substantial, accelerating shifts from Russian to Ukrainian around the Russian invasion, primarily driven by behavioural changes among active users. Over half of previously RU-dominant users tweet more in UA after the invasion, and a quarter hard-switch to predominantly UA. The work contributes a methodological framework using GAMMs to separate sample turnover from behavioural change in both tweet volumes and language choice and quantifies effect sizes over key periods. Future research could deepen insights by analyzing content and sentiment, incorporating media (images/videos), and studying retweet and follower networks to compare switchers vs. non-switchers and extend analyses across platforms.
Limitations
The sample is not representative of the full Ukrainian population and skews younger. Only tweets with geo-information (not enabled by default) were included, potentially biasing the sample. Users may create new accounts that cannot be linked to their old ones, causing some behavioural changes to be attributed to sample effects. Users may stop tweeting with Ukrainian geolocation for reasons such as displacement, introducing selection biases; behavioural shifts are shown only for those active at or after war onset. Bot detection and spam filtering, while conservative, may still misclassify some accounts. Language identification relied on the API’s language field. The analysis focuses on tweeting behavior and does not incorporate richer content, sentiment, media, or network structure; generalizability to other platforms may be limited.