Introduction
Type-2 diabetes mellitus (T2DM) is a global health concern, projected to affect 629 million people by 2045. Although T2DM has traditionally been viewed as a homogeneous disease, recent research indicates significant heterogeneity in its underlying pathologies. This study aims to identify T2DM sub-populations within the large and comprehensive NFHS-4 dataset from India, which contains rich information on T2DM patients, including medical history, dietary habits, socioeconomic status, and lifestyle factors. The heterogeneity of T2DM suggests that personalized treatment approaches may be possible, and understanding the characteristics of different sub-populations is crucial for effective disease management and public health interventions. Factors such as age, sex, socioeconomic status, residence (rural/urban), smoking, alcohol consumption, and dietary habits are significantly associated with T2DM. These factors, many of which are modifiable, influence glycemic control and treatment response. This study uses an unsupervised machine learning approach to identify clusters within the T2DM population based on these socio-demographic and lifestyle factors, and then characterizes those clusters to identify associated patterns.
Literature Review
Existing literature highlights the heterogeneity of T2DM, revealing varied pathophysiologies and suggesting the potential for personalized treatment. Studies have reported associations between T2DM and factors such as age, sex, socioeconomic status, residence, smoking, alcohol consumption, and dietary habits. However, identifying distinct T2DM sub-populations from epidemiological data remains largely unexplored. Previous studies using clinical and biochemical data have identified T2DM subtypes, such as severe insulin-deficient diabetes (SIDD) and mild age-related diabetes (MARD). The current study seeks to expand upon these findings by using a large epidemiological dataset to identify distinct T2DM sub-populations based on socio-demographic and lifestyle factors, potentially offering further insights into T2DM subtypes and informing targeted interventions.
Methodology
This study used the NFHS-4 dataset, a large-scale epidemiological survey from India. Pre-processing involved merging questionnaires, creating a unique identifier for each individual, excluding individuals with missing data, and removing outliers from continuous variables. The final dataset comprised 10,125 T2DM patients with 36 features, categorized as continuous (e.g., BMI, age), ordinal (e.g., frequency of food consumption), and nominal (e.g., sex, religion). Applying standard UMAP (Uniform Manifold Approximation and Projection) dimension reduction to all features together proved ineffective because of the dataset's diverse feature types: continuous features dominated the clustering results, overshadowing the contributions of ordinal and nominal variables. To address this, the researchers implemented a feature-type-distributed clustering workflow. UMAP was applied separately to the continuous, ordinal, and nominal features, with a similarity metric suited to each type: Euclidean distance for continuous features, Hamming distance for nominal features, and Canberra distance for ordinal features. The UMAP parameters were n_neighbors=30 and min_distance=0.1 for all feature types. The resulting lower-dimensional embeddings were integrated into a five-dimensional representation, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), with eps=1 and min_points=200, was used to identify clusters within it. The approach aimed to give equal weight to all feature types and to produce interpretable clusters.
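The three distance measures named above can be illustrated with a minimal sketch. It is not the authors' code: the two patient records, their field names, and their values are hypothetical, and the functions are plain standard-library implementations of the textbook definitions. The sketch only shows why each feature type gets its own metric and how raw continuous values can swamp the others.

```python
import math


def euclidean(a, b):
    """Euclidean distance, suited to continuous features (e.g., BMI, age)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Fraction of positions that differ, suited to nominal features."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def canberra(a, b):
    """Canberra distance, here applied to integer-coded ordinal features.

    Terms where both values are zero are skipped (they contribute 0).
    """
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total


# Two hypothetical records (illustrative values, not taken from NFHS-4).
p = {
    "continuous": [27.5, 52.0],        # BMI, age
    "nominal": ["F", "rural", "A"],    # sex, residence, a category label
    "ordinal": [0, 2, 3],              # coded food-frequency levels
}
q = {
    "continuous": [22.5, 40.0],
    "nominal": ["M", "rural", "A"],
    "ordinal": [0, 1, 3],
}

d_cont = euclidean(p["continuous"], q["continuous"])
d_nom = hamming(p["nominal"], q["nominal"])
d_ord = canberra(p["ordinal"], q["ordinal"])

print(f"continuous (Euclidean): {d_cont:.3f}")
print(f"nominal    (Hamming):   {d_nom:.3f}")
print(f"ordinal    (Canberra):  {d_ord:.3f}")
```

In this toy case the continuous distance is more than an order of magnitude larger than the other two, which matches the dominance problem described above. Embedding each feature type separately before combining them is one way to keep any single type from driving the result. If the umap-learn library is used, its `metric` parameter accepts the names `euclidean`, `hamming`, and `canberra`, although the source does not state which implementation the authors used.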
Key Findings
The analysis revealed four significant T2DM clusters, each with distinct characteristics. Two of these clusters (Clusters 2 and 4) were predominantly composed of non-obese individuals with lower mean ages (38.3 ± 0.19 years and 41.3 ± 0.14 years, respectively) than the other two clusters. These clusters also showed a higher proportion of rural residents and a lower representation from the richest wealth quintile. Surprisingly, approximately 90% of participants in Cluster 3 reported never consuming eggs, fish, or chicken/meat. Cluster 3 participants showed the highest daily intake of milk/curd and pulses/beans, although the proportions consuming these items daily were nearly as high in the other clusters. The distribution of other food items (e.g., leafy vegetables, fruits, fried food, aerated drinks) across clusters was relatively similar. The findings suggest a potential link between non-obese T2DM, younger age, rural residence, lower socioeconomic status, and dietary habits. The presence of two non-obese clusters suggests that a more nuanced understanding of T2DM epidemiology is needed, and it may point to distinct subtypes not captured by conventional BMI and age thresholds.
Discussion
The findings of this study challenge the traditional view of T2DM as a homogeneous disease and provide important insights into the epidemiological characteristics of T2DM sub-populations in India. The identification of two significant non-obese T2DM clusters, characterized by younger age, rural residence, and lower socioeconomic status, underscores the need for modified screening criteria and targeted interventions for rural populations. The feature-type-distributed clustering workflow addresses the limitations of applying standard UMAP to datasets with diverse feature types, providing a valuable methodological contribution to epidemiological research. The differences in dietary habits observed across clusters further highlight the need for personalized treatment approaches that consider individual lifestyle and socio-economic factors. The cluster in which approximately 90% of participants reported never consuming eggs, fish, or chicken/meat (Cluster 3) warrants further exploration. Future research should investigate the pathophysiological mechanisms driving the observed cluster differences, exploring interactions between genetic factors, environmental exposures, and lifestyle choices. This would support the creation of more effective prevention and treatment strategies that account for the heterogeneity of T2DM.
Conclusion
This study demonstrates the effectiveness of a novel feature-type-distributed clustering workflow using UMAP for analyzing epidemiological datasets with diverse feature types. The identification of four distinct T2DM clusters, including two non-obese clusters with specific socio-demographic and lifestyle characteristics, necessitates revised screening criteria for T2DM, particularly in rural communities. Further research should investigate the pathophysiological mechanisms underlying these clusters and explore targeted interventions based on these identified subgroups. The findings highlight the importance of incorporating diverse data types and considering socio-economic factors in future T2DM research.
Limitations
The study's reliance on self-reported data from the NFHS-4 survey might introduce biases and inaccuracies. The cross-sectional nature of the data limits the ability to establish causal relationships between identified clusters and outcomes. Because the study focuses on India, the generalizability of the findings to other populations and settings might be limited. Further research using longitudinal data and incorporating more detailed clinical information would strengthen the findings.