
Building machine learning prediction models for well-being using predictors from the exposome and genome in a population cohort

D. H. M. Pelt, P. C. Habets, et al.

Dirk H. M. Pelt, Philippe C. Habets, and colleagues used longitudinal data from the Netherlands Twin Register to examine how exposome factors such as optimism and social support predict adult well-being. Genetic data added little predictive value, whereas psychosocial and lifestyle factors were the strongest predictors.

Introduction

The study addresses the need for accurate, individualized prediction of adult well-being to enable personalized interventions and to elucidate underlying mechanisms. Well-being is defined here as subjective (hedonic) well-being, encompassing cognitive (life satisfaction) and affective (happiness/positive affect, absence of negative affect) evaluations of life. Prior research has identified a broad set of potential risk and protective factors spanning the exposome (lifestyle, psychosocial and environmental exposures) and genome (polygenic influences), but most evidence stems from cross-sectional, association-based designs that may not translate to individual-level prediction. The authors aim to develop optimal machine learning models using multi-modal, longitudinal data to predict adult well-being and to identify the most predictive features across the specific exposome (psychosocial and lifestyle factors), general exposome (objective neighborhood/environmental exposures), and genome (polygenic scores). The work emphasizes moving beyond a pick-and-choose approach to integrate complex, potentially interacting influences across development.

Literature Review

The exposome comprises internal, specific external, and general external domains. Psychosocial and lifestyle factors (specific exposome) such as childhood psychopathology, personality, social support, health indicators, SES, maltreatment, substance use, and life events are associated with well-being and mental health. The general exposome (built environment) includes neighborhood characteristics like urbanicity, air pollution, greenspace, and SES measures linked to well-being and depression. Genetically, well-being, depression, anxiety, and personality share substantial polygenic architecture, with high genetic correlations indicating shared etiology. However, most studies have examined individual predictors or single modalities, often in clinical samples, limiting ecological validity and predictive performance. Multimodal machine learning models have improved prediction for mental illness phenotypes (e.g., depression, resilience), but few have targeted well-being, and many lacked environmental and genetic data or used cross-sectional designs. Gene–environment correlations and interplay, such as polygenic scores for education correlating with residential SES and mobility, underscore the need for integrative modeling that accounts for complex interactions.

Methodology

Sample and data: Data were drawn from the Netherlands Twin Register (NTR) across ten waves (1991–2022): seven Young NTR (YNTR) waves at approximately ages 3, 5, 7, 10, 12, 14, and 16 (parent reports) and three Adult NTR waves (ANTR8: 2009–2012; ANTR10: 2013–2015; ANTR14: 2019–2022; self-reports). Multiple unimodal datasets were built (specific exposome, genome, general exposome), plus four multimodal combinations. Family structure was preserved in data splitting and cross-validation.

Outcome: A continuous adult well-being score was computed using a latent trait–state–occasion model (lavaan) from the Satisfaction with Life Scale, Subjective Happiness Scale, and Cantril Ladder across ANTR8/10/14. Factor scores reflect stable well-being adjusted for age; the model was fit in the training set to avoid leakage and then applied to the test set.

Predictors: The specific exposome included individual items (not sum scores) spanning personality (e.g., NEO-FFI items), mental health symptoms, social support/relations, life events, lifestyle (e.g., exercise), SES, health, etc., from childhood through adulthood. Genome predictors were 60 polygenic scores (PGSs) covering personality, childhood health/psychopathology, substance use, SES, exercise, etc., computed using the NTR pipeline (LDpred 0.9, all SNPs), with 10 genomic principal components and genotyping platform dummies as covariates. General exposome predictors were objective neighborhood exposures from GECCO (1990–2022), linked to participants’ postal codes at the year of each survey; features included housing stock, income, population composition, education, amenities, safety, liveability, air pollution, urbanization, etc.

Preprocessing: The dataset was split into train (80%) and test (20%) sets, with relatives kept together. Grouped tenfold cross-validation respected family clustering.
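The family-preserving split and grouped tenfold cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy `family_ids`, and the choice to apply the 20% fraction to families rather than to individual participants are all assumptions.

```python
import random

def grouped_train_test_split(family_ids, test_fraction=0.2, seed=0):
    """Split row indices so all members of a family land on the same side."""
    families = sorted(set(family_ids))
    rng = random.Random(seed)
    rng.shuffle(families)
    # Assumption: the fraction is applied to families, not to individual rows.
    n_test = max(1, round(test_fraction * len(families)))
    test_families = set(families[:n_test])
    train_idx = [i for i, f in enumerate(family_ids) if f not in test_families]
    test_idx = [i for i, f in enumerate(family_ids) if f in test_families]
    return train_idx, test_idx

def grouped_k_folds(family_ids, indices, k=10, seed=0):
    """Yield (train, validation) index lists; each family is in one validation fold."""
    families = sorted({family_ids[i] for i in indices})
    rng = random.Random(seed)
    rng.shuffle(families)
    fold_of = {fam: j % k for j, fam in enumerate(families)}
    for fold in range(k):
        val = [i for i in indices if fold_of[family_ids[i]] == fold]
        train = [i for i in indices if fold_of[family_ids[i]] != fold]
        yield train, val

# Toy data: 50 hypothetical families with 2 members each (e.g., twin pairs).
family_ids = [f"fam{n // 2}" for n in range(100)]
train_idx, test_idx = grouped_train_test_split(family_ids)
folds = list(grouped_k_folds(family_ids, train_idx, k=10))
```

Keeping relatives on the same side of every split matters because family members share genes and environments; if one twin were in the training data and the co-twin in the test data, test performance could be overstated.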
Features with nonsensical values were set to missing; zero-variance, character/text, twin-specific, and direct well-being indicators (adult “happy/happiness” items) were removed. Continuous variables were standardized and scaled to the 0–1 range; categorical variables were dummy-coded. For the general exposome, highly collinear features (r>0.95) were iteratively removed, and skewness transformations (cube or cube-root) were applied.

Missingness: In the unimodal datasets, participants and features with >55% missing values were dropped; remaining missing values were imputed via k-nearest neighbors (k≈√N). Postal code gaps were imputed by assuming no move if identical codes were present before and after the missing waves; exposure values for years between waves were obtained by linear interpolation.

Feature selection: Applied only in the unimodal analyses, using elastic net regression with hyperparameters tuned by tenfold cross-validation (randomized grid with 100 searches). The selected features were then fed into the multimodal models. Selections: 212/2615 specific exposome features (43% from childhood/adolescence), 13/60 genome features (PGSs: agreeableness, asthma, childhood BMI, childhood maltreatment, circadian rhythm, educational attainment, household income, loneliness, MVPA, pubertal growth, resilience, smoking cessation, well-being), and 29/732 (dichotomized mode-zero features) or 36/732 (non-dichotomized) general exposome features, many related to housing stock and adolescent periods.

Modeling: A stacked ensemble was used, with XGBoost, Random Forest, and Support Vector Machine as level-1 models and XGBoost as the level-2 meta-model. Hyperparameters were tuned via randomized grid search (100 searches) within tenfold grouped cross-validation. An OLS baseline was also fit. Performance was evaluated by R² in the independent test set, with 95% CIs from a family-wise bootstrap (10,000 samples). Model comparisons used nonparametric clustered Wilcoxon signed-rank tests on squared errors (two-tailed, significance threshold P<0.005).
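The family-wise bootstrap for the R² confidence intervals can be illustrated with a short sketch. It assumes a percentile interval and resampling of whole families with replacement; the summary does not say which bootstrap variant was used. The function names and toy data are hypothetical, and the default number of resamples is reduced from the 10,000 reported so the example runs quickly.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def family_bootstrap_ci(y_true, y_pred, family_ids, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for R^2, resampling whole families with replacement."""
    rows_by_family = {}
    for i, fam in enumerate(family_ids):
        rows_by_family.setdefault(fam, []).append(i)
    families = list(rows_by_family)
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many families as in the test set; relatives stay together.
        sampled = rng.choices(families, k=len(families))
        rows = [i for fam in sampled for i in rows_by_family[fam]]
        stats.append(r_squared([y_true[i] for i in rows], [y_pred[i] for i in rows]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy test set: 100 hypothetical families of 2, predictions = truth + noise.
rng = random.Random(1)
family_ids = [n // 2 for n in range(200)]
y_true = [rng.gauss(0, 1) for _ in range(200)]
y_pred = [t + rng.gauss(0, 0.5) for t in y_true]
point = r_squared(y_true, y_pred)
lo, hi = family_bootstrap_ci(y_true, y_pred, family_ids)
```

Resampling families rather than individuals keeps the dependence between relatives intact in every bootstrap sample, which typically gives wider and more honest intervals than treating twins as independent observations.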
Feature importance: Importance was assessed with SHAP values (mean absolute) and permutation importance across all three base models.

Sensitivity analyses: These examined outcome measurement counts, outliers, number of features, and feature-to-sample ratios. In one sensitivity analysis, dichotomizing general exposome features whose mode was zero improved performance, and this version was used for the main general exposome results.

Longitudinal analyses: Specific-exposome models were built from single waves (YNTR3–16, ANTR8/10/14), from childhood/adolescence-only features (YNTR3–16), from adulthood-only features (ANTR8/10/14), and from all waves combined, to assess predictive power over development.
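Of the two feature-importance methods mentioned in the methodology, permutation importance is the simpler to sketch: shuffle one feature at a time and measure how much the prediction error grows. The example below is an illustration under stated assumptions, not the study's pipeline. The `predict` function is a stand-in for a fitted model, the three toy features are hypothetical, and SHAP is not shown.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy data with hypothetical features: [optimism, loneliness, irrelevant noise].
rng = random.Random(42)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(300)]
y = [2.0 * r[0] - 1.0 * r[1] + rng.gauss(0, 0.1) for r in X]

def predict(row):
    # Stand-in for a fitted model; it ignores the third feature.
    return 2.0 * row[0] - 1.0 * row[1]

importances = permutation_importance(predict, X, y)
```

In this toy setup the first feature receives the largest importance, the second a smaller one, and the third exactly zero, because the stand-in model never uses it.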

Key Findings

Unimodal performance (independent test set):

Specific exposome: R² = 0.702 (95% CI 0.637–0.753), high accuracy by conventional standards.
Genome (PGSs): R² = −0.007 (95% CI −0.026–0.010), not predictive.
General exposome: the initial model showed small but significant predictive performance; after dichotomizing mode-zero features, R² = 0.047 (95% CI 0.015–0.076), a modest predictive value.
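The slightly negative R² for the genome model can look odd at first. Assuming the test-set R² is the usual coefficient of determination (the summary does not spell out the formula), with y_i the observed well-being scores, ŷ_i the model predictions, and ȳ the mean of the observed test outcomes, it reads:

```latex
R^2_{\text{test}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Under this definition R² drops below zero whenever the model's squared error on unseen data exceeds that of simply predicting the mean for everyone. A value of −0.007 therefore indicates that the PGS-based model did essentially no better than a mean-only prediction.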

Multimodal performance:

Specific exposome + genome: R² = 0.671 (0.574–0.738).
Specific + general exposome: R² = 0.688 (0.606–0.750).
Genome + general exposome: R² = 0.022 (−0.034–0.066).
All three modalities: R² = 0.634 (0.490–0.728).

Incremental value beyond the specific exposome: genome P=0.334 (ΔMSE ≈ 0.118), general exposome P=0.695 (ΔMSE ≈ 0.045), both together P=0.029 (not significant at P<0.005; ΔMSE ≈ 0.289). Thus, adding the genome and/or general exposome did not improve prediction; performance slightly decreased.

Top predictors:

Specific exposome: features related to optimism (e.g., optimistic about the future), loneliness, personality (neuroticism, extraversion), subjective health, mental health traits (e.g., feelings of emptiness, worthlessness), social relations/support; one childhood factor—parental exercise behavior around age 10—was among top features. SHAP plots showed mostly linear effects, with some nonlinear, person-specific influences (e.g., ‘feel empty’ and ‘worthless’ items affecting subgroups differently).
General exposome: many top features related to housing stock and neighborhood SES, with a notable role for adolescent exposures; examples include number of newly built social rent houses around age 12 (high-ranking in combined models), percentage public housing, rented business premises, apartment transactions, and housing unit counts. SHAP indicated nonlinear, heterogeneous effects across individuals.

Longitudinal prediction (specific exposome, single waves):

YNTR3: R² 0.028 (−0.015–0.060); YNTR5: 0.045 (−0.018–0.094) — nonsignificant.
YNTR7: 0.081 (0.028–0.124); YNTR10: 0.073 (−0.009–0.141); YNTR12: 0.082 (−0.006–0.154).
YNTR14: 0.156 (0.069–0.228); YNTR16: 0.229 (0.133–0.315).
ANTR8: 0.275 (0.169–0.369); ANTR10: 0.491 (0.413–0.556); ANTR14: 0.463 (0.340–0.562).

Models by developmental period:

Childhood/adolescence-only: R² = 0.268 (0.193–0.333), only marginally higher than YNTR16 alone (0.229; Z=−1.936, P=0.051), and dominated by YNTR16 features among top importances.
Adulthood-only (ANTR8/10/14): R² = 0.629 (0.517–0.712), significantly higher than childhood/adolescence-only (Z=−6.219, P<0.001; ΔMSE ≈ −0.745).
All waves combined: R² = 0.702; not significantly higher than adulthood-only (Z=0.892, P=0.372; ΔMSE ≈ −0.089).

Robustness: Results held across sensitivity analyses; differences in feature counts or feature-to-sample ratios were not the sole drivers of performance differences.

Discussion

The study demonstrates that adult well-being can be predicted with high accuracy using longitudinal psychosocial and lifestyle features (specific exposome), while objective neighborhood exposures (general exposome) provide modest predictive value and polygenic scores currently offer negligible predictive power in out-of-sample tests. These findings address the aim of individualized prediction by showing that integrative, data-driven, machine learning approaches across development can identify key predictors such as optimism, personality, loneliness, subjective health, and social support. The discrepancy between strong associations in prior literature for certain environmental exposures (e.g., greenspace, air pollution) and their limited predictive contribution here underscores the difference between association and out-of-sample prediction in high-dimensional contexts, where effects may be small, redundant, or context-dependent. SHAP analyses suggest nonlinear and heterogeneous effects for some features, implying subgroup-specific pathways to well-being that could inform tailored interventions and future moderator analyses. From a policy perspective, housing-related neighborhood features, including those experienced in childhood and adolescence, emerged as predictive, highlighting the potential impact of housing policies on long-term well-being. The lack of incremental prediction from genome and general exposome beyond the specific exposome suggests that, given current measurement precision and sample sizes, psychosocial and lifestyle factors encapsulate much of the predictive signal for adult well-being. Continued improvements in environmental exposure assessment and genetic prediction may change this balance.

Conclusion

Using extensive longitudinal data from the Netherlands Twin Register and a stacked ensemble machine learning framework, the study achieves high accuracy in predicting adult well-being from the specific exposome, with modest prediction from the general exposome and minimal contribution from current polygenic scores. Key predictive domains include optimism, personality, loneliness, mental health symptoms, subjective health, and social relations, with some influential neighborhood housing characteristics. Proximal (adult) features predict best, but childhood/adolescent data add value and enable earlier risk stratification from as early as age 7. Future work should: refine environmental exposure measurement (higher spatial/temporal resolution, activity-space assessments via EMA and GPS); expand and improve genetic predictors (larger GWAS, broader trait coverage, improved PGS methods); integrate additional modalities (e.g., internal exposome such as microbiome/metabolome); and pursue external validation across diverse populations and contexts. Such advances may further enable personalized prevention and intervention strategies based on individuals’ exposome and genome profiles.

Limitations

Conceptual overlap: While direct well-being indicators were excluded, related constructs (e.g., optimism, self-esteem) were retained and may inflate specific-exposome performance; however, models excluding mental health features still performed well in sensitivity analyses.
Outcome scope: The well-being factor emphasized hedonic measures (life satisfaction, happiness, quality of life); the limited number of indicators per wave and the lack of eudaimonic/social well-being measures may constrain generalizability across well-being constructs.
Generalizability: WEIRD sample from the Netherlands; overrepresentation of women and underrepresentation of individuals with migration background; Dutch-specific environmental context (e.g., housing market, urban form, relatively low air pollution) may limit transferability.
Environmental exposure measurement: Postal code–level linkage may be too coarse for some exposures; assumes uniform exposure within postal codes; does not capture mobility or activity spaces (commuting, work, leisure), potentially attenuating predictive power.
Attrition and sample differences: Included participants were slightly less happy and more highly educated than those who dropped out (small effects), which may affect generalizability; clinical samples not represented.
Causality: Models are predictive and based on associations; causal interpretations are not warranted. Bidirectional and downstream effects may explain limited incremental value of childhood features when adult features are included.
Genetic predictors: Limited out-of-sample predictiveness of PGSs may reflect current GWAS limitations (coverage of common variants), sample size for genetic data (N≈5,874), and number of PGSs.
Feature-to-sample ratios: High dimensionality could contribute to overfitting, although this risk was controlled via feature selection, cross-validation, and independent testing, and sensitivity analyses suggest that these ratios are not the sole reason for the results.
