Economics
Exploring socioeconomic similarity-inequality: a regional perspective
M. L. Mouronte-López and J. S. Ceres
The study motivates measuring and understanding regional similarity and inequality across multiple socioeconomic dimensions to inform policy and progress toward the UN 2030 Sustainable Development Goals, particularly SDGs 4, 5, 9, 10, and 16. Inequality is multidimensional, spanning income, health, education, gender, and justice. The paper frames three research questions: (1) Can similarities and dissimilarities between countries and world regions be detected by studying time series of educational, economic, gender, and labour market variables, and can countries be ranked for each variable? (2) Can unknown relationships between variables across these domains be discovered? (3) Can a model be built to predict the Gini coefficient from a set of socioeconomic variables from health, economic, labour protection and gender domains, generalizable to any country? The work argues these analyses support policy design, inter-territorial learning, and monitoring progress on equality.
The background highlights that research often emphasizes differences rather than similarities; prior work (e.g., Hanel et al., 2019; Stern, 2020) shows high similarity across social variables and the value of similarity analysis for marginalized groups. Regional similarity has been explored through the World Bank’s regional updates, US state similarity indices (Economy League, 2023), EU forest-resource/economic similarity (Niskanen & Lin, 2001), and EU twinnings using cosine similarity on Eurostat indicators. For inequality, foundational and empirical studies (Kuznets, 1955; Alesina & Rodrik, 1994; Cingano, 2014) link inequality and growth. Work on measuring inequality suggests that single measures like Gini can be limiting and that multi-parametric models can be superior (Blesch et al., 2022), while some studies combine Gini with income shares (Sitthiyot & Holasut, 2020). Gini has been related to education outcomes, health, and social variables in diverse contexts. For Gini modelling, prior efforts used Random Forest, SVR, and linear regression for OECD countries (Koç & Akın, 2021) and deep learning for the US (Zhang, 2022). The present study extends the scope to 220+ countries and more domains, and uses time-series clustering and machine learning to link Gini with a broad set of determinants.
Data sources: World Bank Gender Statistics Database (888 variables, 264 countries, 1960–2019), World Bank Gini Index (up to 2018), and UNDP Gender Inequality Index (GII; 189 countries, 1995–2019). Regions analyzed: Europe (53), North America (33), South America (12), Africa (53), Asia (50), Oceania (19).
Variables: 26 indicators across four domains were used for the similarity/inequality analysis.
- Educational: enrollment (primary/secondary/tertiary by sex), gender parity indices, and education expenditures (SE.ENR.PRSC.FM.ZS; SE.ENR.TERT.FM.ZS; SE.PRM.ENRR.FE; SE.PRM.ENRR.MA; SE.SEC.ENRR.FE; SE.SEC.ENRR.MA; SE.TER.ENRR.FE; SE.TER.ENRR.MA; SE.XPD.PRIM.PC.ZS; SE.XPD.SECO.PC.ZS; SE.XPD.TOTL.GD.ZS).
- Economic: NY.GDP.MKTP.CD; NY.GDP.MKTP.KD.ZG; SI.POV.GINI.
- Labour market: SL.TLF.ADVN.FE.ZS; SL.TLF.ADVN.MA.ZS; SL.TLF.BASC.FE.ZS; SL.TLF.BASC.MA.ZS; SL.TLF.TOTL.FE.ZS; SL.UEM.TOTL.FE.NE.ZS; SL.UEM.TOTL.MA.NE.ZS.
- Gender: SG.TIM.UWRK.FE; SG.TIM.UWRK.MA; SG.GEN.MNST.ZS; SG.GEN.PARL.ZS; plus GII.
Software: R with packages including ranger, caret, NbClust, FactoMineR, dplyr, tidyr, ggplot2, etc.
Preprocessing for Gini modelling: remove CountryName, CountryCode, and IndicatorName; transpose to wide format with IndicatorCode as columns and years as rows; drop rows missing SI.POV.GINI; drop columns with >30% missing values; impute categorical variables by mode and numeric variables by mean; remove collinear variables (|r|>0.9).
Dataset split: 80% for 5-fold cross-validation (vfold_cv) and 20% for hold-out validation, giving a CV set of 1,344 rows and a validation set of 336 rows (1,680 rows in total).
Time series characterization and correlations: For each country-variable time series, characterize stationarity and seasonality by applying the Augmented Dickey-Fuller test (alpha=0.05) for stationarity and decomposing the series into trend, seasonal, and random components. Normality is tested with Shapiro–Wilk (alpha=0.05) to choose between Spearman (non-normal) and Pearson correlation; all variables were non-normal, so Spearman was used.
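The preprocessing steps listed for the Gini model can be illustrated with a small sketch. This is not the authors' R code: it is a standard-library Python stand-in, and the helper names, the greedy order of the collinearity filter, the random seed, and the toy table (columns `x1` to `x4`) are all assumptions made for illustration. Mode imputation for categorical columns is omitted because the toy table is numeric only.

```python
import math
import random
import statistics

def drop_rows_missing(table, target):
    """Drop rows (positions) where the target column is missing (None)."""
    keep = [i for i, v in enumerate(table[target]) if v is not None]
    return {name: [col[i] for i in keep] for name, col in table.items()}

def drop_sparse_columns(table, max_missing=0.30):
    """Keep columns whose share of missing values is at most max_missing."""
    n_rows = len(next(iter(table.values())))
    return {name: col for name, col in table.items()
            if sum(v is None for v in col) / n_rows <= max_missing}

def impute_mean(table):
    """Replace None with the mean of the observed values in each column."""
    out = {}
    for name, col in table.items():
        mean = statistics.fmean(v for v in col if v is not None)
        out[name] = [mean if v is None else v for v in col]
    return out

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either column is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def drop_collinear(table, target, threshold=0.9):
    """Greedy filter (an assumption): keep a predictor only if its |r| with
    every predictor kept so far does not exceed the threshold."""
    kept = []
    for name in table:
        if name != target and all(
            abs(pearson(table[name], table[k])) <= threshold for k in kept
        ):
            kept.append(name)
    return {name: table[name] for name in [target] + kept}

def split_rows(n_rows, train_share=0.8, seed=0):
    """Shuffle row indices and split them into CV and hold-out sets."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = round(n_rows * train_share)
    return idx[:cut], idx[cut:]

# Toy table with made-up values, one list per indicator column.
toy = {
    "SI.POV.GINI": [30.0, 35.0, 40.0, 45.0, 50.0, None],  # last row dropped
    "x1": [1.0, 2.0, None, 4.0, 5.0, 9.0],      # 20% missing: kept, imputed
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0, 18.0],     # collinear with x1: dropped
    "x3": [None, None, 3.0, None, 1.0, 2.0],    # 60% missing: dropped
    "x4": [5.0, 1.0, 4.0, 2.0, 3.0, 7.0],       # kept
}

clean = drop_rows_missing(toy, "SI.POV.GINI")
clean = drop_sparse_columns(clean)
clean = impute_mean(clean)
clean = drop_collinear(clean, "SI.POV.GINI")

train_idx, valid_idx = split_rows(1680)
print(list(clean), len(train_idx), len(valid_idx))
```

With 1,680 rows, the 80/20 split reproduces the 1,344 and 336 row counts reported above. Which member of a collinear pair is retained depends on column order in this sketch; the summary does not say how the study resolved that choice.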
Clustering of time series and regional similarity/inequality: For each univariate indicator, compute pairwise distances between country time series of equal length and sampling times, then apply Ward’s hierarchical clustering (Ward2) on squared Euclidean distances. Determine the optimal number of clusters per variable with NbClust, using multiple validity indices (Calinski-Harabasz, Dunn, Silhouette, KL, etc., where larger is better; McClain, Cindex, DB, etc., where smaller is better) and selecting by majority rule. Missing values in a series are handled by country-wise mean imputation for numeric values and mode imputation for categorical values. Regional similarity levels S_p (p=25,50,75,100) are defined from the proportion of country pairs that co-cluster across variables within a region; inequality levels I_p are defined as their complements.
Gini prediction model: Random Forest (Breiman, 2001) using ranger with bootstrapping and feature subsampling. Hyperparameters were tuned; the best configuration was mtry=149, trees=5000, max_depth=20. Variable importance was assessed by permutation importance (OOB RMSE increase) and node purity (Gini impurity decrease), followed by iterative backward elimination to optimize average RMSE. Final model: 16 predictors with the best trade-off (average RMSE=3.55701). Selected variables: SP.ADO.TFRT; SI.POV.DDAY; SP.POP.65UP.TO.ZS; SP.POP.0014.TO.ZS; SP.DYN.CBRT.IN; SH.STA.ODFC.ZS; SP.DYN.TFRT.IN; NY.GDP.PCAP.KD; SH.STA.ODFC.UR.ZS; SL.SRV.EMPL.FE.ZS; SP.DYN.CDRT.IN; SL.TLF.CACT.MA.ZS; SL.AGR.EMPL.FE.ZS; SP.POP.1564.TO.ZS; SL.EMP.VULN.FE.ZS; SH.PAR.LEVE.AL.
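The regional levels S_p and I_p are described here only in words, so the following Python sketch fixes one possible reading and should be treated as an assumption rather than the paper's formula: S_p is taken as the share of country pairs in a region that fall in the same cluster for at least p% of a domain's variables, and I_p as the share of pairs that fall in different clusters for at least p% of the variables. The function name, the input layout, and the toy cluster labels are hypothetical.

```python
from itertools import combinations

def similarity_levels(labels, levels=(25, 50, 75, 100)):
    """Compute ({p: S_p}, {p: I_p}) for one region and one domain.

    labels: {variable: {country: cluster_id}}.
    Only countries with a cluster label for every variable are compared.
    """
    countries = sorted(set.intersection(*(set(m) for m in labels.values())))
    pairs = list(combinations(countries, 2))
    if not pairs:
        raise ValueError("need at least two countries with complete labels")
    n_vars = len(labels)
    # Percentage of variables in which each pair shares a cluster.
    same_pct = [
        100.0 * sum(m[a] == m[b] for m in labels.values()) / n_vars
        for a, b in pairs
    ]
    s = {p: sum(x >= p for x in same_pct) / len(pairs) for p in levels}
    i = {p: sum(100.0 - x >= p for x in same_pct) / len(pairs) for p in levels}
    return s, i

# Toy region: four countries, two variables, made-up cluster labels.
toy_labels = {
    "var1": {"A": 1, "B": 1, "C": 2, "D": 2},
    "var2": {"A": 1, "B": 1, "C": 1, "D": 2},
}
s_levels, i_levels = similarity_levels(toy_labels)
print(s_levels[100], i_levels[100])
```

In the toy region only the pair (A, B) co-clusters on both variables, so S_100 is 1/6, while (A, D) and (B, D) differ on both, so I_100 is 2/6. Under this reading S_100 and I_100 need not sum to one when a domain has several variables, because some pairs agree on a subset of variables only.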
- Correlations: All variables were non-normal (Shapiro-Wilk p<0.05), so Spearman correlation was used. The GPI for primary/secondary enrollment correlates moderately and positively with the GPI for tertiary enrollment. Female and male enrollments correlate strongly at primary and tertiary levels. Secondary and tertiary enrollments correlate positively with GDP and moderately negatively with Gini; government expenditure per student in secondary education is also negatively associated with Gini and strongly correlated with primary expenditure. Female labour force share correlates positively with secondary/tertiary enrollments; the basic-educated labour force correlates negatively with male unemployment; male and female unemployment are highly correlated. Women’s parliamentary seats correlate moderately with secondary enrollment and education spending.
- Clustering outcomes (examples): Educational indicators mostly split into 2–3 clusters; e.g., SE.TER.ENRR.FE (2 clusters, means 11.28 vs 51.57), SE.TER.ENRR.MA (2 clusters, means 10.64 vs 42.27). Education spending SE.XPD.TOTL.GD.ZS had 3 clusters (means 4.12, 11.10, 9.50). Economic: NY.GDP.MKTP.CD (3 clusters), NY.GDP.MKTP.KD.ZG (2 clusters), SI.POV.GINI (3 clusters with means 34.51, 45.98, 55.31). Labour market: SL.UEM.TOTL.MA.NE.ZS (3 clusters; means 13.16, 5.17, 21.90); SL.UEM.TOTL.FE.NE.ZS (2 clusters; means 19.18, 6.36). Gender: SG.TIM.UWRK.FE (5 clusters; means 22.91, 19.23, 15.24, 8.36, 29.89); SG.TIM.UWRK.MA (2 clusters; means 3.94, 9.39); SG.GEN.MNST.ZS (3 clusters; means 26.50, 13.18, 40.16); SG.GEN.PARL.ZS (2 clusters; means 11.77, 28.91). GII (3 clusters; means 0.63, 0.22, 0.46).
- Regional similarity/inequality indices (S_p/I_p): Education S_100 is highest in Europe (0.3669) and South America (0.3025) and lowest in Oceania (0.0316). Economic S_100 is highest in Europe (0.6143) and South America (0.3938). Gender S_100 is approximately zero in every region, Europe included, and gender inequality (I_100) is high in all regions (e.g., North America 0.9177; Oceania 0.7658). Labour market similarity is highest in South America (S_100=0.6038), while labour-market inequality is highest in North America and Africa (I_100=0.7552 and 0.5760 respectively). For SI.POV.GINI, internal inequalities (I_100) are largest in North America (0.5076), Asia (0.3914), Africa (0.2950), and Oceania (0.2405); similarity is highest in Europe (S_100=0.6143).
- Gini prediction model: Best RF with mtry=149, trees=5000, max_depth=20. Optimal 16-variable model yields average RMSE=3.55701. Most influential predictors (descending): adolescent fertility rate; poverty headcount at $1.90/day; population ages 65+; population ages 0–14; crude birth rate; open defecation (%); total fertility rate; GDP per capita (constant); open defecation urban; female employment in services; crude death rate; male labour force participation; female employment in agriculture; population ages 15–64; vulnerable female employment; length of paid shared parental leave.
- Substantive insights: Higher secondary/tertiary enrollment and greater education spending associate with lower income inequality; increased female labour force participation associates with lower inequality; gender time use and political representation show wide dispersion across countries.
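The correlation findings in this list rest on Spearman's rank correlation, which is the Pearson correlation of the ranks of the two series. A minimal standard-library Python sketch follows; it is a stand-in for whatever R routine the authors used, and the enrollment and Gini values are made up purely to show a negative rank association.

```python
import math

def average_ranks(values):
    """1-based ranks, with ties receiving the average rank of their group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho over the positions where both series are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    n = len(pairs)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("rho is undefined when a series is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values, not taken from the study.
enrollment = [20.0, 35.0, 50.0, 65.0, 80.0]
gini = [55.0, 48.0, 50.0, 38.0, 30.0]
rho = spearman(enrollment, gini)
print(round(rho, 3))
```

Because the statistic depends only on ranks, it does not require the normality that the Shapiro-Wilk tests rejected for these variables.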
The study’s findings address the research questions by showing that time-series clustering of socioeconomic indicators can systematically categorize countries into similarity groups across domains and reveal regional patterns of similarity and inequality. Europe shows the strongest internal similarities in education and economics, consistent with EU convergence policies, while South America shows the highest similarity in labour market indicators but persistent disparities in gender outcomes. North America displays comparatively high internal inequality in education, labour market, and gender indices. Correlation analyses suggest mechanisms linking education and labour variables with income distribution: greater enrollment (secondary/tertiary) and higher per-student expenditure are associated with lower Gini coefficients, and female labour force participation is associated with lower inequality. The Gini RF model supports the view that inequality is multifactorial, with demographic structure, fertility, poverty incidence, sanitation, sectoral employment, and social protection features contributing to predictive performance. The cross-domain approach thus provides an integrated view of regional similarity-inequality and actionable associations relevant for policy targeting in education, labour, health/sanitation, and gender equity.
The paper contributes by: (1) introducing a time-series clustering framework to rank countries and quantify regional similarity-inequality across education, economic, labour market, and gender domains; (2) proposing interpretable regional similarity (S_p) and inequality (I_p) metrics; (3) identifying cross-domain relationships (e.g., education and female labour participation associated with lower inequality); and (4) delivering a global Random Forest model that predicts the Gini coefficient from 16 multi-domain variables with competitive accuracy (average RMSE ≈ 3.56). Europe and South America exhibit higher intra-regional similarity in several domains, whereas the gender domain shows the largest disparities globally. Future work outlined includes deeper analysis of the gender domain, further decomposition of time series into trend and random components, and applying the developed software to other social variables and settings.
- Data availability and missingness: Not all indicators and years are available for all countries; the study excludes columns with >30% missing and uses mean/mode imputation, which may bias estimates and cluster assignments.
- Indicator selection and scope: From large repositories, a subset of 26 variables was expert-selected for clustering; results may be sensitive to variable choice and do not encompass all socioeconomic dimensions.
- Temporal alignment: Time series lengths and availability vary; element-wise distance comparison assumes equal-length aligned series after imputation, potentially smoothing dynamics.
- Composite measures: GII is a composite index and was excluded from correlation heatmaps; interpretation differs from raw indicators.
- Clustering validity: Optimal cluster numbers were chosen by majority across indices; different validity criteria or distance metrics could yield alternative partitions.
- Generalizability of the Gini model: While built to be global, model performance depends on the quality and representativeness of available inputs per country-year; causal inference is not established.
- Regional aggregation: Broad regional groupings may mask heterogeneity within subregions or income groups.