Linguistics and Languages
Social, economic, and demographic factors drive the emergence of Hinglish code-mixing on social media
A. Sengupta, S. Das, et al.
India’s linguistic diversity (700+ languages; 26% multilingual as per the 2011 census) provides fertile ground for code-mixing, especially between Hindi (over 500 million speakers) and English, whose contact intensified during British colonization. Code-mixing involves a matrix (dominant) language with an embedded language and is widespread in multilingual societies and online social networks, often accompanied by script-mixing through Romanized transliteration. The paper positions Hinglish as a dynamic, evolving phenomenon influenced by social, cultural, and economic shifts and increasingly prevalent on social media among youth and diaspora. The research asks: What empirical and statistical evidence supports Hinglish’s evolution on social media? Does code-mixing impact different word groups similarly? What socio-economic and demographic drivers influence Hinglish’s spread? Can these drivers predict future adaptation? The authors hypothesize that code-mixing trends correlate with socio-economic trends and analyze these using Twitter data (2014–2022), aiming to model language dynamics among Hindi, English, and Hinglish and their exogenous drivers, and to understand micro-level linguistic changes over time.
Foundational code-mixing theories include the Matrix Language Frame (MLF) model (Myers-Scotton), equivalence constraint (Poplack), and functional head constraint (Di Sciullo et al.). Prior work explored language competition and exogenous influences via ODE/PDE, reaction-diffusion, and control-theoretic models (Abrams & Strogatz; Walters; Nie et al.; Parshad et al.). Studies examined Hinglish’s integration and sociological aspects (Kothari & Snell; Nema & Chawla), code-switching in Bollywood scripts (Si), and computational tasks on code-mixed data (sentiment, POS, NER, hate/sarcasm detection). However, large-scale empirical analyses of Hinglish evolution with explicit socio-demographic drivers, particularly using curated social media data rather than census data, have been lacking. This study addresses that gap by combining linguistic analysis with econometric modeling on Twitter data.
Dataset collection and labeling: Using the Twitter Academic API, the authors collected 262,578 tweets from 16,710 unique handles between Jan 2014 and Sep 2022, querying broadly popular Indian topics (Cricket, Bollywood, Politics). Tweets were filtered for users in Delhi and Mumbai metropolitan regions. Twitter’s language tags ‘hi’ and ‘en’ were used. Word-level language identification and POS tagging were performed with a pretrained model (Sagorsarker 2020). The code-mixing index (CMI) was defined as CMI = 1 − max(n_hi, n_en)/n, where n is total tokens and n_hi, n_en are Hindi and English counts. Classification: Monolingual Hindi if CMI < 0.5 and n_hi > n_en; Monolingual English if CMI < 0.5 and n_en > n_hi; Hinglish if CMI ≥ 0.5. User-level preferences were computed per quarter using mean CMI and language word totals.
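The CMI definition and the three-way classification rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the per-language token counts are assumed to come from an upstream word-level language identifier, and the handling of empty tweets is an assumption.

```python
def code_mixing_index(n_hi: int, n_en: int, n: int) -> float:
    """CMI = 1 - max(n_hi, n_en) / n, with n the total token count."""
    if n <= 0:
        # Assumption: a tweet with no tokens is treated as not mixed.
        return 0.0
    return 1.0 - max(n_hi, n_en) / n


def classify_tweet(n_hi: int, n_en: int, n: int) -> str:
    """Apply the thresholds stated in the text (CMI >= 0.5 means Hinglish)."""
    cmi = code_mixing_index(n_hi, n_en, n)
    if cmi >= 0.5:
        return "hinglish"
    if n_hi > n_en:
        return "hindi"
    if n_en > n_hi:
        return "english"
    # Not covered by the stated rule (e.g. an empty tweet); labeled here as an assumption.
    return "undetermined"


print(classify_tweet(8, 1, 10))  # hindi    (CMI = 0.2)
print(classify_tweet(2, 7, 10))  # english  (CMI = 0.3)
print(classify_tweet(4, 4, 10))  # hinglish (CMI = 0.6)
```

Because max(n_hi, n_en) is always at least half of n_hi + n_en, the CMI can exceed 0.5 under this definition only when some tokens are tagged as neither Hindi nor English, so the total n matters as much as the two language counts.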
Dynamic econometric model: The authors curated 1,442 annual, country-level socio-economic indicators (2014–2022) and selected 10 via Spearman correlation with the fractions of monolingual Hindi (h_i), monolingual English (e_n), and code-mixed (c_m) users. They modeled yearly changes in the three population fractions via transition probabilities P_ij among groups, bias terms b_hi, b_en, b_cm, and standardized year-wise rates of change of exogenous features (ΔX) with weight vectors W_hi, W_en, W_cm. The system was trained with OLS regression, using 2014 as base (t=0), and iteratively predicting 2015–2022. Forecasts assumed exogenous factors remain as in the last recorded year.
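One plausible reading of this update rule is sketched below. The functional form (a linear combination of transition inflows, a bias term, and weighted exogenous changes) is an assumption inferred from the description, not the paper's exact equation, and every numeric value is an illustrative placeholder rather than an estimate from the study. Fitting P, the biases, and the weights by OLS is not shown.

```python
GROUPS = ("hi", "en", "cm")  # monolingual Hindi, monolingual English, code-mixed


def step(fractions, P, bias, W, delta_x):
    """Advance the three group fractions by one year.

    Assumed form: x_j(t+1) = sum_i P[i][j] * x_i(t) + bias[j] + W[j] . delta_x(t)
    """
    k = len(GROUPS)
    nxt = []
    for j in range(k):
        inflow = sum(P[i][j] * fractions[i] for i in range(k))
        exogenous = sum(w * dx for w, dx in zip(W[j], delta_x))
        nxt.append(inflow + bias[j] + exogenous)
    return nxt


def rollout(x0, P, bias, W, delta_x_by_year):
    """Iterate from the base year, feeding each prediction into the next step."""
    trajectory = [list(x0)]
    for delta_x in delta_x_by_year:
        trajectory.append(step(trajectory[-1], P, bias, W, delta_x))
    return trajectory


# Illustrative placeholders only; none of these numbers come from the paper.
P = [
    [0.6, 0.1, 0.3],  # from hi to (hi, en, cm)
    [0.1, 0.5, 0.4],  # from en
    [0.2, 0.1, 0.7],  # from cm
]
bias = [0.0, 0.0, 0.0]
W = [[0.01, -0.02], [-0.01, 0.0], [0.0, 0.02]]  # two hypothetical features
x0 = [0.30, 0.25, 0.45]                          # base-year fractions (t = 0)
traj = rollout(x0, P, bias, W, [[0.5, -0.2], [0.1, 0.3]])
for year, x in enumerate(traj):
    print(year, [round(v, 3) for v in x])
```

In this toy setup the rows of P sum to one and the weights for each feature sum to zero across groups, so the predicted fractions keep summing to one; a fitted model would not guarantee that without an explicit constraint.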
Word representation and retention: Year-wise Word2Vec models (Gensim) were trained on tweets for 2014–2022 to obtain 100D embeddings, window size 4, min count 10. Each word’s neighborhood comprised its 25 closest words by cosine similarity. A retention rate quantified stability of a word’s neighborhood across consecutive years, serving as a proxy for semantic stability under code-mixing influence. Words were analyzed by POS and topical categories to assess differential semantic change.
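The retention idea can be illustrated with a small helper. The summary does not give the exact formula, so the definition used here (the share of one year's neighbors that reappear among the next year's neighbors) is an assumption, and the function name and toy neighbor lists are hypothetical.

```python
def retention_rate(neighbors_prev, neighbors_next):
    """Fraction of the earlier year's neighbors still present the following year."""
    prev, nxt = set(neighbors_prev), set(neighbors_next)
    if not prev:
        # Assumption: a word with no neighbors in the earlier year scores 0.
        return 0.0
    return len(prev & nxt) / len(prev)


# Toy neighborhoods of size 4 (the study uses the 25 nearest words per year).
# With Gensim, a year's list could be built from
# model.wv.most_similar(word, topn=25), keeping only the words.
year_a = ["sarkar", "policy", "minister", "budget"]
year_b = ["policy", "budget", "tax", "election"]
print(retention_rate(year_a, year_b))  # 0.5
```

A word whose neighbors are fully replaced scores 0 and one whose neighborhood is unchanged scores 1, which makes scores comparable across POS and topic groups as long as the neighborhood size is held fixed.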
- Evidence of evolution and phases: CMI distributions were split into three phases using Fisher–Jenks natural breaks: Phase 1 (Jan–Dec 2014; median CMI 0.52), Phase 2 (Jan 2015–Mar 2020; median CMI 0.53), Phase 3 (Apr 2020–Sep 2022; median CMI 0.57). Overall median CMI increased ~0.2% per year (intercept 0.506). In Phase 2, CMI grew at 1.2% annually with a strong fit (adjusted R² = 0.755, F = 59.41, p < 0.001). After 2020, CMI stabilized.
- Usage trends: Proportion of code-mixed tweets rose from ~42% to ~60% (2015–2020) and stabilized near 60% after 2020; total yearly tweet volume increased ~12× in Phase 3. From 2014–2022, monolingual English usage decreased at ~1.2% per year; code-mixing increased at ~2% per year; monolingual Hindi remained roughly constant (prevalence ~26.6%).
- User preferences: Hinglish was the most popular mode throughout: 44.9% of users preferred it in 2014, rising to 56.3% after 2020 (about 1.2% annual growth). Monolingual English preference fell from 23.3% to 11.2% (−1.6% annually).
- Script usage: Devanagari usage grew from 35% (2014) to 82% (2022). Hindi adverbs (e.g., आज, अब) were more often in Devanagari than Romanized script.
- Switching patterns: Hindi verbs tended to occur in English contexts; English nouns appeared in Hindi contexts. Switched words were mostly in Romanized script, indicating higher susceptibility to code-mixing than script-switching.
- Semantic retention: Proper nouns showed highest retention probability (~0.23), nouns lowest (~0.14). Retention increased across all 14 POS categories from 2017 to 2022. Topic-wise, cricket-related words had highest average retention (~0.35), while political (~0.19) and entertainment (~0.20) terms were lower. Case studies showed ‘government’ changed neighborhood over time (lower retention), whereas ‘khan’ remained stable (higher retention).
- Drivers: Wholesale Price Index had strongest positive correlation with code-mixing extent (ρ≈0.86), followed by net secondary income (ρ≈0.83) and government consumption expenditure. Model importance highlighted agriculture value-added and bilateral aid flows among key drivers.
- Transition dynamics: Estimated transition probabilities (from→to): Hi→CM ≈0.433; En→CM ≈0.783; CM→CM ≈0.736; CM→Hi ≈0.261; CM→En ≈0.004. Thus, English users are far likelier to switch to Hinglish than to Hindi; Hinglish users tend to remain Hinglish, and when they do switch to a monolingual mode, it is overwhelmingly to Hindi (~0.98 conditional on leaving CM). Prior (bias) probabilities favored Hinglish (~0.36) over either monolingual group.
- Forecasting: The dynamic model achieved RMSE 0.029 versus 0.045 without exogenous variables (≈36% lower error; equivalently, the baseline's error is ≈55% higher). It predicts that Hinglish will grow at ~2.97–2.98% annually post-2022, that monolingual Hindi will remain roughly constant, and that monolingual English will decline through 2025. A 2022 dip reflects a training-data discontinuity. The environment’s long-run support for the code-mixed proportion may approach an asymptote near 50%.
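Two of the derived figures in the list above, the conditional probability of moving to Hindi when leaving the code-mixed group and the relative RMSE comparison, can be rechecked directly from the reported numbers. The sketch below uses only values quoted in the bullets; the variable names are illustrative.

```python
# Reported transition probabilities out of the code-mixed (CM) group.
p_cm_to_cm, p_cm_to_hi, p_cm_to_en = 0.736, 0.261, 0.004

# The three outgoing probabilities should sum to roughly 1 (rounding aside).
row_sum = p_cm_to_cm + p_cm_to_hi + p_cm_to_en

# P(to Hindi | user leaves CM) = P(CM->Hi) / (P(CM->Hi) + P(CM->En))
p_hi_given_leave = p_cm_to_hi / (p_cm_to_hi + p_cm_to_en)

# Reported RMSE with and without exogenous variables.
rmse_with, rmse_without = 0.029, 0.045
reduction = 1 - rmse_with / rmse_without  # error reduction relative to baseline
excess = rmse_without / rmse_with - 1     # baseline error relative to full model

print(round(row_sum, 3))           # 1.001
print(round(p_hi_given_leave, 3))  # 0.985
print(round(reduction, 3))         # 0.356
print(round(excess, 3))            # 0.552
```

The conditional probability of about 0.98 follows from the CM row alone. For the RMSE comparison, the same two numbers give a reduction of about 36% relative to the baseline, or equivalently a baseline error about 55% larger than that of the full model, depending on which RMSE is taken as the reference.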
The results substantiate that Hinglish adoption on Indian Twitter has grown steadily and is significantly influenced by macro socio-economic factors. The econometric dynamic model outperforms a version without exogenous drivers, which supports the hypothesis that external economic and demographic trends (e.g., WPI, income flows, government expenditure, agriculture value-added) shape linguistic preferences over time. The non-uniform effects observed across word classes and topics show that code-mixing alters the semantics of some categories (e.g., common nouns, political terms) more than others (e.g., proper nouns, sports terms); English fusion therefore does not affect all Hindi words alike. Transition probabilities reveal a strong pull toward Hinglish from both Hindi and, especially, English users, together with high persistence among Hinglish users, which explains the observed macro-level rise and subsequent stabilization. The findings underscore the need to treat code-mixed NLP tasks differently from monolingual ones: because semantics in Hinglish evolve over time, datasets need periodic refreshes and models need periodic updates. The study’s insights also suggest practical applications in content moderation and conversational AI for mixed-language contexts, and highlight cultural influences (e.g., Bollywood) that may further personalize code-mixing patterns.
This study presents the first large-scale, social-media-based empirical and econometric analysis of Hinglish evolution, quantifying its growth (2014–2022), identifying key socio-economic drivers, modeling language-group transitions, and revealing differential semantic retention across POS and topics. It provides strong evidence that Hinglish is growing and stabilizing at high prevalence on Indian Twitter, driven by macroeconomic dynamics, with English users especially likely to transition to Hinglish. Linguistic analyses show non-uniform semantic change across word groups and increasing Devanagari usage. The dynamic model forecasts continued Hinglish growth (~3% annually), a stable Hindi share, and declining English share in the near term. Future work should: expand beyond two cities and selected topics; incorporate richer, finer-grained exogenous variables; consider user-level demographics; extend to other code-mixed pairs; and regularly curate new datasets to capture evolving semantics for robust NLP and safety applications.
- Data source and scope: Tweets were limited to Delhi and Mumbai users and to topics (Cricket, Bollywood, Politics), which may not represent all Hindi-speaking communities or domains. Twitter/X users skew urban/younger and do not represent the Indian population at large.
- Labeling and thresholds: Language identification and POS tagging rely on pretrained models and the chosen CMI threshold (0.5), which may introduce classification noise, especially with transliteration and mixed scripts.
- Population measurement: Analyses use fractional populations rather than absolute counts due to API retrieval constraints and non-representative user counts per year; findings reflect relative trends, not total population adoption.
- Econometric assumptions: The dynamic model assumes dependence only on current fractions and changes in exogenous features, with future exogenous variables held at last observed levels—potentially oversimplifying real-world dynamics. Transition probabilities may vary across subpopulations and contexts not modeled here.
- Semantic modeling: Word2Vec with fixed hyperparameters and minimum frequency thresholds may miss rare or emerging terms; retention rate approximates semantic stability but depends on neighborhood size and embedding quality.
- Temporal discontinuities: A noted dip in 2022 stems from data discontinuity; pandemic-era effects and media cycles could confound trends.
- Generalizability: Although the framework can extend to other code-mixed languages, cultural and platform differences may limit direct transferability.