Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach

S. Bej, J. Sarkar, et al.

This study identifies distinct sub-populations among Type-2 Diabetes Mellitus patients in India, including younger, non-obese, predominantly rural groups and a largely vegetarian cluster. Conducted by Saptarshi Bej, Jit Sarkar, Saikat Biswas, Pabitra Mitra, Partha Chakrabarti, and Olaf Wolkenhauer, this research calls for a reevaluation of T2DM screening criteria in rural areas.

Introduction
The study asks whether distinct sub-populations of Type-2 diabetes mellitus (T2DM) patients emerge when patients in a large epidemiological dataset are characterized by socio-demographic and lifestyle factors. While T2DM has traditionally been viewed as a homogeneous condition, recent evidence indicates heterogeneity with different underlying pathophysiologies, suggesting potential for personalized treatment. Multiple factors (obesity, age, sex, socio-economic status, residence, smoking, alcohol, diet) are associated with T2DM; they are modifiable to varying degrees and affect glycemic control. Using the fourth National Family Health Survey (NFHS-4), India’s large, comprehensive survey, the authors apply unsupervised clustering to identify patterns and characterize sub-populations by socio-demographic and lifestyle features.
Literature Review
Prior studies have used clinical and biochemical variables to identify heterogeneous T2DM subtypes with differing pathophysiologies and outcomes, e.g., severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), and mild age-related diabetes (MARD). Epidemiological correlates of T2DM include age, obesity, sex, socio-economic status, residence, smoking, alcohol, diet, and cooking methods. Asian populations show T2DM onset at lower BMI and younger ages. Dimension reduction methods such as t-SNE and UMAP are common for uncovering structure in high-dimensional data, with UMAP offering advantages in speed and in preserving both local and global structure. However, applying these methods conventionally to mixed-type datasets can bias the embedding toward continuous features, which motivates similarity metrics tailored to each feature type.
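To make the idea of a similarity metric per feature type concrete, the sketch below implements the three distances named in the methodology (Euclidean, Canberra, Hamming) in plain Python. The example vectors and the ordinal coding (never=0, occasionally=1, weekly=2, daily=3) are illustrative assumptions, not values or encodings taken from the paper.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance; suited to continuous features."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canberra(a, b):
    """Sum of |x - y| / (|x| + |y|); terms where both values are 0 count as 0."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total


def hamming(a, b):
    """Fraction of positions that differ; ignores any ordering of categories."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical individuals (illustration only).
d_cont = euclidean((38.3, 23.9), (41.3, 26.7))      # (age, BMI)
d_ord = canberra((3, 2, 0), (1, 2, 0))              # coded food frequencies
d_nom = hamming(("female", "rural", "gas"),
                ("female", "urban", "gas"))         # unordered categories
```

Each metric only sees the feature block it is suited to, so no block is compared on a scale that was designed for another.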
Methodology
Data source: NFHS-4 (DHS Program), a nationally representative survey (stratified two-stage sampling) covering all 640 districts of India, with 601,509 households; 112,122 men and 699,686 women were interviewed. Four questionnaires (Household, Woman’s, Man’s, Biomarker) collected demographic, socio-economic, nutrition, and biomarker data (including random blood glucose for ages 15–49 via finger-stick) alongside anthropometrics and blood pressure.

Dataset preparation: The Woman’s, Man’s, and Biomarker questionnaires were merged via a unique individual code. The combined dataset initially included 810,971 individuals (men and women ages 15–54). Individuals with missing diabetes and blood pressure status were excluded. Pregnant women were not excluded, to avoid losing gestational diabetes cases.

Selected variables: known risk factors (BMI, age, residence, wealth index, smoking, alcohol intake, hypertension), socio-economic variables (sex, religion, social group, education), dietary practices, and hemoglobin. BMI, age, and hemoglobin were treated as continuous; the others as categorical. Outliers in continuous variables were removed, yielding 610,498 individuals (526,678 females; 83,240 males) for the broader dataset; the T2DM subset used for clustering comprised 10,125 patients.

Feature categories (36 total features across the T2DM subset):
- Continuous (4): age, BMI, hemoglobin, time to drinking water source.
- Nominal (7): sex, household type/structure, type of place of residence (urban/rural), type of cooking fuel, source of drinking water, and related nominal attributes as listed.
- Ordinal (25): primarily food frequency variables (milk/curd, pulses/beans, oats, leafy vegetables, fruits, eggs, fish, chicken/meat, fried foods, aerated drinks, with levels daily/weekly/occasionally/never) and other ordered categorical variables.
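The merge and exclusion steps of the dataset preparation can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the field names (`individual_id`, `diabetes`, `hypertension`) are hypothetical, the join is assumed to keep only individuals present in both sources, and a record is assumed to be dropped if either status is missing. The paper summary does not spell out these details.

```python
def merge_by_individual(interviews, biomarkers):
    """Join interview rows to biomarker rows on a shared individual code."""
    bio_by_id = {row["individual_id"]: row for row in biomarkers}
    merged = []
    for row in interviews:
        bio = bio_by_id.get(row["individual_id"])
        if bio is not None:                    # keep only matched individuals
            merged.append({**row, **bio})
    return merged


def drop_missing_status(rows, required=("diabetes", "hypertension")):
    """Exclude rows where any required status field is missing (None)."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


# Toy records standing in for the Woman's, Man's, and Biomarker files.
women = [{"individual_id": "W1", "sex": "female", "age": 34},
         {"individual_id": "W2", "sex": "female", "age": 41}]
men = [{"individual_id": "M1", "sex": "male", "age": 45}]
biomarker = [
    {"individual_id": "W1", "diabetes": True, "hypertension": False},
    {"individual_id": "W2", "diabetes": None, "hypertension": True},
    {"individual_id": "M1", "diabetes": False, "hypertension": False},
]

merged = merge_by_individual(women + men, biomarker)
kept = drop_missing_status(merged)
kept_ids = [r["individual_id"] for r in kept]
```

Here `W2` is dropped because its diabetes status is missing, leaving `W1` and `M1`.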
Clustering workflow:
- Dimensionality reduction: UMAP applied separately by feature type, using n_neighbors=30 and min_distance=0.1 for all runs, with these metrics:
  • Continuous: Euclidean.
  • Ordinal: Canberra (chosen to reflect ordered relationships and retain high variance in lower dimensions).
  • Nominal: Hamming (captures categorical dissimilarity without order).
- Low-dimensional embeddings: 2D for continuous and ordinal; 1D for nominal (to avoid excessive variance-induced overclustering). These were concatenated to form a five-dimensional representation per individual.
- Clustering: DBSCAN on the 5D space with eps=1 and min_points=200; DBSCAN avoids pre-specifying the number of clusters and identifies dense regions as clusters.

Validation/visualization: UMAP visualizations of the separate feature-type embeddings and of the integrated 5D representation (colored by DBSCAN labels) confirmed meaningful, interpretable clusters. An initial naive UMAP on all features with the Euclidean metric showed dominance by continuous features; the feature-type-distributed approach mitigated this bias.
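The workflow above could be sketched roughly as follows with the umap-learn and scikit-learn packages. This is an assumption-laden outline, not the authors' code: the `EmbeddingSpec` class, the `embed_and_cluster` function, and the `seed` argument are hypothetical names introduced here. Note that umap-learn spells the minimum-distance parameter `min_dist` and scikit-learn spells the DBSCAN minimum-points parameter `min_samples`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    metric: str        # distance metric passed to UMAP
    n_components: int  # output dimensions for this feature type


# Settings as reported in the text, keyed by feature type.
SPECS = {
    "continuous": EmbeddingSpec("euclidean", 2),
    "ordinal": EmbeddingSpec("canberra", 2),
    "nominal": EmbeddingSpec("hamming", 1),
}


def total_dimensions(specs=SPECS):
    """Dimensionality of the concatenated representation."""
    return sum(spec.n_components for spec in specs.values())


def embed_and_cluster(blocks, specs=SPECS, seed=42):
    """blocks maps feature type -> numeric 2D array (rows = individuals).

    Nominal and ordinal columns are assumed to be integer-encoded already.
    """
    # Imported here so the settings above can be inspected without
    # numpy, umap-learn, or scikit-learn installed.
    import numpy as np
    import umap
    from sklearn.cluster import DBSCAN

    parts = []
    for name, spec in specs.items():
        reducer = umap.UMAP(n_neighbors=30, min_dist=0.1,
                            n_components=spec.n_components,
                            metric=spec.metric, random_state=seed)
        parts.append(reducer.fit_transform(blocks[name]))

    combined = np.hstack(parts)  # one row per individual, 5 columns
    labels = DBSCAN(eps=1.0, min_samples=200).fit_predict(combined)
    return combined, labels      # label -1 marks DBSCAN outliers
```

Keeping the per-type settings in one table makes it easy to see that the concatenated space has 2 + 2 + 1 = 5 dimensions, and to vary a metric or dimensionality when checking how sensitive the clusters are.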
Key Findings
- Cluster detection: 7 clusters identified with 261 outliers; four significant clusters with sizes 2898, 2301, 2226, and 1315.
- Age and BMI: Two clusters (Cluster 2 and Cluster 4) were non-obese and younger:
  • Cluster 2: Age 38.3 ± 0.19 years; BMI 23.9 ± 0.10; Hemoglobin 12.3 ± 0.04.
  • Cluster 4: Age 37.9–41.3 reported; table lists 37.9 ± 0.26 years and BMI 23.6 ± 0.13; Hemoglobin 12.3 ± 0.06.
  • Cluster 1 (obese comparator): Age 41.3 ± 0.14; BMI 26.7 ± 0.09; Hemoglobin 12.5 ± 0.04.
  • Cluster 3 (obese comparator): Age 39.9 ± 0.18; BMI 26.1 ± 0.11; Hemoglobin 12.1 ± 0.04.
- Residence and wealth:
  • Rural residents: Cluster 2 = 69.4%; Cluster 4 = 72.02%; vs Cluster 1 = 31.3%; Cluster 3 = 49.19%.
  • Richest wealth quintile: Cluster 2 = 4.3%; Cluster 4 = 8.37%; vs Cluster 1 = 64.04%; Cluster 3 = 54.9%.
- Diet patterns:
  • Cluster 3 showed very low non-vegetarian intake: Egg “never” 89.08–89.80%; Fish “never” 97.12%; Chicken/meat “never” 97.71%.
  • Daily milk/curd intake was highest in Cluster 3 (61.81%), which also had high daily pulses/beans intake (50.31%). Other food items (leafy vegetables, fruits, fried foods, aerated drinks) were more evenly distributed across clusters.
- Living conditions and access indicators (lower in Clusters 2 and 4):
  • Possess refrigerator: Cluster 2 ≈ 0.22% vs Cluster 1 95.48%; Cluster 4 24.79% vs Cluster 3 65.77%.
  • Possess motorbike: Cluster 2 30.96%; Cluster 4 32.78%; vs Cluster 1 51.36%; Cluster 3 67.03%.
  • Possess car/truck: Cluster 2 3.26%; Cluster 4 3.19%; vs Cluster 1 23.5%; Cluster 3 17.43%.
  • Cooking fuel: Higher plant/biomass-based use in Clusters 2 and 4 (e.g., 42.44% and 54.89%) vs lower in Clusters 1 and 3 (12.22% and 19.63%); gas/oil usage higher in Clusters 1 and 3.
  • Unprotected water sources: Cluster 2 6.35%; Cluster 4 15.51%; vs Cluster 1 2.62%; Cluster 3 1.98%.
- Comorbidities: Similar distributions across clusters for asthma, thyroid disease, heart disease, cancer, history of TB, hypertension, and hemoglobin levels.
- Additional distinguishing feature: Cluster 4 exhibited a very high “time to water source (min)” compared to other clusters (mean ≈ 18.6 ± 0.39 min), indicating access issues.
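The reported cluster sizes can be put in proportion with a little arithmetic. The sketch below assumes that each of the 10,125 patients in the T2DM subset was assigned either to one of the seven clusters or to the outlier set; the summary does not state this explicitly, and the derived figures hold only under that assumption.

```python
total_patients = 10_125                          # T2DM subset used for clustering
significant_sizes = [2898, 2301, 2226, 1315]     # four significant clusters
outliers = 261                                   # points labeled as outliers

in_significant = sum(significant_sizes)
share_significant = 100 * in_significant / total_patients
share_outliers = 100 * outliers / total_patients

# Under the stated assumption, whatever is left belongs to the
# three smaller clusters.
in_minor_clusters = total_patients - in_significant - outliers

print(f"significant clusters: {in_significant} ({share_significant:.1f}%)")
print(f"outliers: {outliers} ({share_outliers:.1f}%)")
print(f"remaining three clusters: {in_minor_clusters}")
```

Under that assumption the four significant clusters cover 8,740 patients (about 86.3% of the subset), and the three remaining clusters together hold 1,124.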
Discussion
The study demonstrates that conventional UMAP on mixed-type epidemiological data can be biased toward continuous variables, obscuring meaningful structure from ordinal and nominal features. By applying UMAP with tailored similarity metrics per feature type and integrating embeddings, the authors uncovered interpretable T2DM sub-populations. The findings support known heterogeneity in T2DM, aligning epidemiological clusters with clinically recognized subtypes: the two younger, non-obese clusters (Clusters 2 and 4) may be analogous to severe insulin-deficient diabetes (SIDD) phenotypes, whereas the obese clusters (1 and 3) may resemble obesity-related or severe insulin-resistant diabetes. Importantly, the non-obese clusters were disproportionately rural and economically disadvantaged, suggesting a need to adjust screening thresholds (age, BMI) for rural populations and address inequities in access to care. The predominantly vegetarian Cluster 3 highlights dietary patterns relevant to T2DM management; ensuring adequate protein intake for such groups may be necessary. Overall, the clustering approach answers the research question by revealing distinct socio-demographic and lifestyle profiles within T2DM, emphasizing the role of socio-economic and environmental factors (e.g., fuel type, water access) in disease characterization and potential management strategies.
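One plausible reason a single Euclidean metric over all features favors continuous variables is scale: raw ages and BMIs span tens of units, while coded frequency levels span only a few. The toy calculation below illustrates this with made-up numbers and a hypothetical 0 to 3 ordinal coding. It is an illustration of the general effect, not a reproduction of the paper's analysis, and the summary does not say that scale is the only mechanism involved.

```python
def squared_differences(a, b):
    """Per-feature contributions to the squared Euclidean distance."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


def continuous_share(cont_a, cont_b, ord_a, ord_b):
    """Fraction of the squared distance that comes from continuous features."""
    cont = sum(squared_differences(cont_a, cont_b))
    ordinal = sum(squared_differences(ord_a, ord_b))
    return cont / (cont + ordinal)


# Two hypothetical individuals: (age, BMI) plus ten food-frequency codes.
# The ordinal codes disagree as much as they possibly can (0 vs 3).
share = continuous_share(
    cont_a=[30.0, 20.0], cont_b=[50.0, 28.0],
    ord_a=[0] * 10, ord_b=[3] * 10,
)
print(f"continuous features supply {share:.0%} of the squared distance")
```

Even with maximal disagreement on all ten ordinal features, the two continuous features supply about 84% of the squared distance in this toy case, which is the kind of imbalance that separate embeddings per feature type are meant to avoid.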
Conclusion
The study introduces a feature-type-distributed UMAP workflow that overcomes biases in mixed-type epidemiological data, producing meaningful, interpretable clusters. Four significant T2DM sub-populations were identified, including two younger, non-obese, predominantly rural and economically disadvantaged groups, and a largely vegetarian obese cluster. These insights suggest revising T2DM screening criteria (e.g., BMI and age cutoffs) for rural populations and tailoring dietary and management guidelines to subgroup-specific needs. Future work should link epidemiological clusters to clinical/biochemical subtypes, incorporate broader age ranges, and examine longitudinal outcomes to refine screening and intervention strategies.
Limitations
- Age range limitation: NFHS-4 biomarker-based diabetes data primarily covered ages up to 49 years, limiting identification of age-related T2DM subtypes (e.g., MARD).
- Data scope: The analysis relied on self-reported socio-demographic, dietary, and comorbidity information alongside limited biomarkers; detailed clinical/biochemical measures were not available, constraining subtype validation.
- Socio-economic inference: While proxies for socio-economic status and living conditions were analyzed, direct measures to ascertain causal pathways of socio-economic inequalities were limited.
- Generalizability: Findings are specific to the NFHS-4 Indian population and may not generalize without validation in other populations or surveys.
- Clustering sensitivity: Although the workflow mitigates bias by feature type, results may be sensitive to metric choices, dimensionality of embeddings, and DBSCAN parameters.
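The sensitivity to DBSCAN parameters noted in the last limitation can be seen even on a toy example. The sketch below is a minimal pure-Python DBSCAN written for illustration only; the points and parameter values are invented and far smaller than the study's eps=1 and min_points=200.

```python
from math import dist


def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Each point's eps-neighborhood (a point counts as its own neighbor).
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_points:
            labels[i] = -1            # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_points:
                queue.extend(neighbors[j])   # core point: keep growing
    return labels


def n_clusters(labels):
    return len(set(labels) - {-1})


# Two tight groups and one isolated point on a line.
points = [(0.0,), (0.5,), (1.0,), (3.0,), (3.5,), (4.0,), (10.0,)]

tight = n_clusters(dbscan(points, eps=0.6, min_points=2))    # groups stay apart
loose = n_clusters(dbscan(points, eps=2.5, min_points=2))    # groups merge
strict = n_clusters(dbscan(points, eps=0.6, min_points=4))   # nothing is dense
```

The same seven points yield two clusters, one cluster, or none depending on the parameters, which is why reporting eps and min_points alongside the results matters.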