Introduction
Predicting individual well-being is crucial for developing personalized interventions. Well-being, encompassing both hedonic (subjective well-being, life satisfaction, happiness) and eudaimonic (purpose, meaning) aspects, is complex and influenced by various factors. This study focuses on subjective well-being, aiming to build prediction models that integrate data from the exposome (environmental exposures) and genome (genetic predispositions). The exposome is categorized into specific (psychosocial factors, lifestyle) and general (objective environmental characteristics) components. Prior research has identified potential risk and protective factors from both the exposome and genome, but these studies often relied on cross-sectional designs and a pick-and-choose approach to predictors, neglecting the complex interplay of factors. Therefore, this study applies machine learning methods to a large longitudinal dataset that spans a wide range of predictors from multiple data modalities. This allows the most important predictors of well-being to be identified at the individual level, paving the way for personalized interventions.
Literature Review
Existing literature highlights various risk and protective factors for well-being, drawing from both the exposome and genome. The exposome, encompassing all environmental exposures across the lifespan, has been increasingly linked to health and well-being. The specific exposome includes lifestyle factors and psychosocial elements like personality, social support, and life events, with evidence suggesting that adult well-being is rooted in childhood and adolescence. The general exposome comprises objective neighborhood characteristics such as socioeconomic status (SES), housing quality, and environmental factors. Genetic influences on well-being are also significant, with polygenic scores (PGSs) capturing the cumulative effects of many genetic variants on specific traits. However, previous studies often focused on individual predictors using cross-sectional data, neglecting the interactive nature of genetic, childhood, psychosocial, and environmental factors. Studies using multiple data modalities are frequently limited to clinical samples, impacting external validity. Machine learning offers a robust approach to model this complexity, capable of handling numerous variables and nonlinear interactions.
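To make the idea of a polygenic score concrete, the following minimal sketch computes one as a weighted sum of effect-allele counts. The variant names, effect sizes, and the `polygenic_score` helper are hypothetical and chosen only for illustration; they are not taken from the study, and real PGS pipelines involve additional steps (for example, accounting for correlated variants) that are omitted here.

```python
# Hypothetical per-variant effect sizes, standing in for weights that would
# normally come from a genome-wide association study.
effect_sizes = {"variant_a": 0.02, "variant_b": -0.01, "variant_c": 0.03}


def polygenic_score(dosages, weights):
    """Return the weighted sum of effect-allele counts (0, 1, or 2 per variant)."""
    return sum(weights[variant] * dosages[variant] for variant in weights)


# Hypothetical genotype: number of effect alleles carried at each variant.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
score = polygenic_score(person, effect_sizes)
print(f"PGS = {score:.3f}")
```

Each variant contributes only a small amount, which is why a score sums over many variants to capture their cumulative effect on a trait.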
Methodology
This study leverages data from the Netherlands Twin Register (NTR), a large population cohort with longitudinal data (1991-2022) across seven waves of the Young NTR (YNTR) and three adult waves (ANTR). Three unimodal datasets (specific exposome, genome, general exposome) and four multimodal datasets (combining these modalities) were created. The specific exposome comprised longitudinal psychosocial predictors from childhood to adulthood, the genome was represented by PGSs for various traits, and the general exposome by objectively measured neighborhood characteristics linked to participants' postal codes. The outcome variable, adult well-being, was a continuous score derived from life satisfaction, happiness, and quality of life measures using structural equation modeling.
Three machine learning algorithms (XGBoost, SVM, Random Forest) were trained, and their predictions served as input for a final XGBoost meta-model. Feature importance was assessed using SHAP values and permutation importance.
Data preprocessing involved handling missing values (k-nearest neighbors imputation), removing irrelevant features, standardizing/normalizing continuous variables, and dummy coding categorical variables. For the general exposome, variables with high correlations (>0.95) were removed, and skewed variables were transformed. Feature selection was performed using elastic net regression in unimodal analyses.
Model performance was evaluated using R² in an independent test set, with 95% confidence intervals calculated using bootstrapping. Statistical comparisons between models used clustered Wilcoxon signed rank tests. The study followed a data-driven approach, including features based on availability.
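The removal of highly correlated general exposome variables can be sketched as follows. This is an illustrative version only: the summary does not say which member of a correlated pair was dropped, so the greedy "keep the first one seen" rule, the `drop_highly_correlated` helper, and the toy neighborhood variables are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_highly_correlated(columns, threshold=0.95):
    """Greedy filter (an assumed rule): keep a column only if its absolute
    correlation with every column kept so far does not exceed the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical neighborhood variables for five postal codes.
features = {
    "houses_total": [10, 20, 30, 40, 50],
    "houses_rented": [11, 19, 31, 41, 49],  # nearly collinear with houses_total
    "green_space": [5, 1, 4, 2, 3],
}
kept = drop_highly_correlated(features)
print(kept)
```

In this toy data the second variable is almost a copy of the first, so it is removed, while the weakly correlated third variable is retained.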
Key Findings
Unimodal analyses revealed that the specific exposome model exhibited high predictive accuracy in the independent test set (R² = 0.702, 95% CI [0.637, 0.753]). The genome showed no predictive power (R² = -0.007, 95% CI [-0.026, 0.010]), while the general exposome showed modest but significant predictive power (R² = 0.047, 95% CI [0.015, 0.076], after dichotomizing features with a mode of zero). Multimodal analyses showed that adding the genome or general exposome to the specific exposome model did not significantly improve prediction. The model including both specific and general exposomes performed slightly better (R² = 0.688) than the model with specific exposome and genome (R² = 0.671). The model including all three data modalities (R² = 0.634) performed worse than either two-modality model. Feature importance analysis using SHAP values and permutation importance revealed that optimism, loneliness, personality traits, mental health symptoms, and social support were the most predictive features from the specific exposome. For the general exposome, housing stock characteristics, particularly the number of newly built social rent houses, were highly predictive. Analyses of single waves of specific exposome data showed that predictive power increased as the time point of measurement approached adulthood, with significant predictive power present from age 7 onwards. Adding distal childhood/adolescence features to adulthood features did not significantly improve prediction.
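The R² values and confidence intervals above can be read with the help of a small sketch. It computes R² on held-out predictions, shows that R² becomes negative when predictions are worse than simply predicting the mean (as with the genome model), and derives a percentile bootstrap interval. The data are made up, and the plain resampling scheme is an assumption: the study's exact bootstrap procedure is not described here, and simple resampling of individuals ignores the family clustering present in twin data.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for R², resampling test cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        if len(set(yt)) > 1:  # skip degenerate resamples where SS_tot is zero
            stats.append(r_squared(yt, yp))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Made-up test-set outcomes and three sets of predictions.
y_true = [1, 2, 3, 4, 5, 6, 7, 8]
good_pred = [1.2, 1.9, 3.1, 3.8, 5.3, 5.9, 7.2, 7.8]
mean_pred = [4.5] * 8                 # always predict the mean: R² is 0
bad_pred = [8, 7, 6, 5, 4, 3, 2, 1]   # worse than the mean: R² is negative

print(r_squared(y_true, good_pred))
print(r_squared(y_true, mean_pred))
print(r_squared(y_true, bad_pred))
print(bootstrap_ci(y_true, good_pred))
```

An interval that includes zero, as reported for the genome model, means the model cannot be distinguished from one that always predicts the mean outcome.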
Discussion
This study demonstrates the high predictive power of longitudinal psychosocial data (specific exposome) in predicting adult well-being, surpassing the predictive power of genetic and objective environmental factors. The strong predictive power of the specific exposome aligns with the complex and interactive nature of well-being, emphasizing the significant role of psychosocial factors accumulated across the lifespan. The modest contribution of the general exposome highlights the importance of neighborhood-level factors, particularly housing, especially in childhood/adolescence, but also suggests limitations in using postal code-level data to capture the nuances of environmental exposure. The lack of predictive power of the genome may be attributed to limitations in current PGSs, including limited variant coverage and sample size. The findings underscore the importance of longitudinal data collection and the value of a data-driven approach for identifying key predictors. Future research should investigate causal pathways linking parental exercise behavior and well-being, as well as the moderating factors influencing the relationship between housing characteristics and well-being across different groups of individuals.
Conclusion
This study provides strong evidence for the effectiveness of machine learning models in predicting well-being using longitudinal exposome data. The specific exposome proved highly predictive, while the general exposome and genome offered limited additional predictive value. The identified key features highlight potential targets for personalized well-being interventions. Future studies should investigate causal mechanisms, refine the assessment of the exposome and genome, and extend these models to other populations and well-being concepts.
Limitations
This study has several limitations. The data-driven approach might have artificially boosted the specific exposome model's performance because some included features conceptually overlap with well-being. The use of primarily hedonic well-being measures limits generalizability to eudaimonic well-being. The sample, drawn largely from a Western population, may limit generalizability to other populations. The use of postal codes for general exposome data may not fully capture the dynamic nature of environmental exposures. Attrition might have introduced bias. Finally, the study focused on prediction, not causation, which limits causal inferences about well-being development.