Plasma proteomic profiles predict individual future health risk

Medicine and Health


J. You, Y. Guo, et al.

A proteomic risk score (ProRS) was developed by Jia You, Yu Guo, Yi Zhang, Ju-Jiao Kang, Lin-Bo Wang, Jian-Feng Feng, Wei Cheng, and Jin-Tai Yu using data from 52,006 UK Biobank participants. The score stratifies risk across many diseases, including cancer and dementia, and often matches or outperforms conventional clinical predictors. Only internal validation has been performed so far; independent external validation is needed before real-world application.

Introduction

The study addresses the need for efficient multi-disease risk stratification by proposing a single-domain plasma proteomics-based approach. Conventional risk assessment often requires accumulating numerous predictors for one disease at a time, which is time- and cost-intensive. Since routine blood testing is common in preventive care, the authors hypothesize that plasma proteomic profiles can simultaneously and systematically predict risks for multiple future health outcomes, overcoming limitations of cross-sectional and case-control proteomic studies that are prone to reverse causality. The purpose is to evaluate whether proteomic profiles can provide accurate, practical prediction across a wide spectrum of diseases and mortality in a longitudinal setting and how they compare or add to established clinical predictors.

Literature Review

Prior work has linked plasma proteomics to gene–protein interactions, biomarkers for individual diseases and risks, aging, and drug pharmacology. Many proteomic prediction studies have been cross-sectional or case-control (e.g., in dementia, Alzheimer’s disease, coronary heart disease, and type 1 diabetes), which limits inference on preclinical signatures because of potential reverse causality. Although shared molecular bases have been observed among closely related diseases, little is known about common pathways across less-related diseases. Recent studies show that proteomic profiles can predict cardiovascular events, obesity, dementia, and cancer, and that metabolomic profiles can predict multi-disease outcomes, but comprehensive longitudinal proteomic prediction across many endpoints alongside clinical variables has been lacking.

Methodology

Design and cohort: Retrospective analysis of the UK Biobank (UKB). A total of 52,006 participants (median age 58 years; 53.9% female; 93.7% White) with plasma proteomic data were included after excluding samples with >30% proteomic missingness. Baseline recruitment occurred in 2006–2010, with a median follow-up of 14.1 years until March 2023. Ethical approvals and participant consent were in place.

Endpoints: 45 longitudinal endpoints were defined from ICD-10 three-character codes, comprising 14 disease categories, 26 specific diseases, all-cause mortality, and 4 cause-specific mortalities. First occurrences were ascertained from primary care, hospital inpatient, self-report, and death registry data. For each endpoint, participants with a prevalent condition in the same disease category at baseline were excluded.

Proteomics: Plasma proteins were quantified by the Olink Proximity Extension Assay with next-generation sequencing in the UKB Pharma Proteomics Project (April 2021–January 2022). After QC, 1461 unique proteins across the cardiometabolic, inflammation, neurology, and oncology panels were analyzed.

Clinical predictors: 54 baseline variables were included: demographics (n=5), lifestyle (n=6), physical measures (n=7), disease and medication history (n=7), family history (n=4), and serum assays (n=25). Three predictor sets were defined: Age+Sex; Serum (the 25 serum assays); and PANEL (all 54 variables).

ProNNet and ProRS: A multilayer perceptron neural network (ProNNet) was developed to derive proteomic risk scores (ProRS) for each endpoint.
The architecture included (1) a comorbid network with two branches predicting counts of disease categories before and after baseline from the 1461-protein inputs (dense layers: 512-256-128-64 with ReLU; linear output; trained with MSE; performance MAE 1.84±0.32 pre-baseline and 1.41±0.10 post-baseline), and (2) an endpoint-specified network that concatenated frozen features from the comorbid network with an encoding block and used sigmoid activation for endpoint outputs, trained with log loss. Optimization used Adam (learning rate 1e-5), batch size 128, and up to 1000 epochs with early stopping after 25 epochs without improvement in validation loss. Hyperparameters were tuned via grid search (optimizers: SGD, RMSprop, Adam; learning rates 1e-5 to 1e-1). Implementation: Keras (v2.7.0) in Python.

Survival modeling: ProRS was used in Cox proportional hazards (CPH) models to evaluate discrimination for each endpoint. Models included ProRS-only; Age+Sex; Serum; PANEL; and the combinations ProRS+Age+Sex, ProRS+Serum, and ProRS+PANEL. CPH was implemented via lifelines CoxPHFitter (v0.27.4) with penalizer 0.01. Pairwise C-index comparisons used a one-shot non-parametric test (R CompareC v1.3.2) that accounts for correlation between models evaluated on the same sample. Risk stratification used tertiles of ProRS with Kaplan–Meier plots. Time-window analyses assessed AUC, AUPR, accuracy, sensitivity, and specificity for events within 5 years, within 10 years, and beyond 10 years (Youden index cutpoints).

Model evaluation: Internal leave-one-region-out cross-validation was run over 10 UK regions (formed by merging the 22 assessment centers). Nine folds were used for training and one for testing, iterating over all folds. ProNNet hyperparameter tuning used 5-fold CV within the training folds; the model was then retrained on the full training fold to produce ProRS for both training and test sets prior to CPH modeling. Discrimination was assessed with Harrell’s C-index, calibration via calibration plots, and clinical utility via decision curve analysis (standardized net benefit).
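
Harrell’s C-index, the discrimination metric used throughout, can be illustrated with a minimal pure-Python sketch. This is not the authors’ implementation (the study fitted and scored models with lifelines); the function name `harrell_c_index` and the toy data are assumptions made for illustration, and pairs with tied follow-up times are skipped for simplicity.

```python
def harrell_c_index(times, events, scores):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    times  : follow-up time per subject
    events : 1 if the event was observed, 0 if censored
    scores : risk score per subject (higher = higher predicted risk)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a comparable pair needs an observed event for the earlier subject
        for j in range(n):
            # j is comparable with i only if j was still event-free after i's event time
            if j == i or times[j] <= times[i]:
                continue
            comparable += 1
            if scores[i] > scores[j]:
                concordant += 1.0   # earlier failure had the higher score
            elif scores[i] == scores[j]:
                concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (hypothetical data): subject 3 is censored at t=6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(harrell_c_index(times, events, [0.9, 0.7, 0.3, 0.1]))
print(harrell_c_index(times, events, [0.9, 0.1, 0.3, 0.7]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values (for example 0.86 for dementia) should be read.
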
Data processing: Continuous variables were standardized and categorical variables one-hot encoded. Missing continuous values were imputed with KNN (k=50, Euclidean distance, median of neighbors), separately for proteomics and clinical data within each CV fold; missing categorical values were imputed with the mode.

SHAP analysis: DeepExplainer (shap v0.41.0) computed SHAP values for proteomic attributions; values were normalized by their sum across the 1461 proteins so that they could be aggregated across folds. The top 1% of proteins (15/1461) per endpoint were reported. Associations between individual proteins and endpoints were also estimated via age- and sex-adjusted CPH models with Bonferroni correction (family-wise error rate 0.01; threshold 0.01/1461 = 6.84×10^-6).

Software: Python 3.9.16, scikit-learn 1.2.2. Code availability: https://github.com/jasonHKU0907/FutureHealthProteomicPrediction. Data: UK Biobank under application 19542.
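
The arithmetic behind the Bonferroni threshold, the top-1% cutoff, and the per-fold normalization of attributions described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the inputs are taken to be non-negative per-protein importance values (for example mean absolute SHAP), and averaging the normalized values across folds is one plausible reading of "aggregate across folds", not a confirmed detail of the authors' pipeline.

```python
import math

N_PROTEINS = 1461   # proteins analyzed after QC (from the text)
FWER = 0.01         # family-wise error rate (from the text)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def top_fraction_count(n_items, fraction):
    """Number of items in the top `fraction`, rounded up."""
    return math.ceil(n_items * fraction)


def normalize_attributions(importance):
    """Scale one fold's importances so they sum to 1 across proteins."""
    total = sum(importance.values())
    return {protein: value / total for protein, value in importance.items()}


def aggregate_folds(per_fold_importance):
    """Average normalized importances across folds (assumed aggregation rule)."""
    normalized = [normalize_attributions(fold) for fold in per_fold_importance]
    proteins = normalized[0].keys()
    return {p: sum(fold[p] for fold in normalized) / len(normalized) for p in proteins}


print(bonferroni_threshold(FWER, N_PROTEINS))   # about 6.84e-06
print(top_fraction_count(N_PROTEINS, 0.01))     # 15

# Hypothetical two-protein, two-fold example with very different raw scales.
folds = [{"GDF15": 2.0, "NEFL": 2.0}, {"GDF15": 30.0, "NEFL": 10.0}]
print(aggregate_folds(folds))
```

Normalizing within each fold before averaging keeps a fold with larger raw attribution magnitudes from dominating the aggregate ranking.
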

Key Findings
  • Cohort and events: N=52,006 (median age 58, 53.9% female, 93.7% White). Median follow-up 14.1 years. Deaths: 5625 (10.82%); cancer incidence: 7654 (15.76%); common diseases: hypertension 4911 (15.96%), anemia 4528 (9.31%).
  • ProRS stratification: Higher baseline ProRS percentiles showed higher observed event rates across all 14 disease categories and all-cause mortality. ProRS correlated positively with age for all 45 endpoints (r>0.1 except obesity and breast cancer), most strongly for eye disorders (r=0.78 [0.77–0.78]), cancer (0.70 [0.70–0.71]), and circulatory disorders (0.67 [0.66–0.68]). Sex differences: at the same ProRS percentile, males had higher risks of cancer, circulatory disease, and all-cause mortality than females.
  • Risk gradients: The top and bottom tertiles of ProRS showed markedly different survival. Odds ratios (top vs bottom tertile): all-cause mortality 11.83 [10.11–13.55]; blood/immune disorders 5.08 [4.66–5.50]; skin disorders 1.82 [1.50–2.14]; digestive disorders 1.65 [1.48–1.82]. ORs exceeded 20 for death from respiratory system causes (53.80 [38.17–69.43]), dementia (32.83 [21.18–44.48]), death from circulatory system causes (29.51 [21.68–37.34]), diabetes (23.58 [19.36–27.79]), and obesity (20.45 [15.00–25.89]).
  • Time windows: ProRS generally achieved its highest AUC for outcomes within 5 years, suggesting strong near-term predictive value; for some endpoints (e.g., viral infections, neurotic disorders, sleep disorders, and death from nervous system causes), models for events beyond 10 years performed best.
  • Discrimination vs clinical predictors: ProRS alone often matched or outperformed Age+Sex, Serum, and PANEL across endpoints. C-index examples for ProRS-only CPH: prostate cancer 0.85 [0.84–0.87]; dementia 0.86 [0.85–0.87]; anemia 0.85 [0.84–0.86]; nervous system disorders 0.83 [0.82–0.85]; inflammatory bowel disease 0.82 [0.80–0.84]; heart failure 0.74 [0.73–0.75]; stroke 0.75 [0.73–0.76]; COPD 0.71 [0.68–0.73]; asthma 0.67 [0.65–0.70]; hypertension 0.67 [0.65–0.70]. For several disease categories (infections, blood/immune, nervous, respiratory, genitourinary), for specific diseases (bacterial/viral infections, leukemia, anemia, dementia, heart failure, COPD), and for all-cause and cause-specific mortalities, ProRS significantly outperformed the clinical predictor sets.
  • Added value of combining predictors: Adding ProRS to Age+Sex or Serum significantly improved discrimination across most endpoints; adding ProRS to PANEL improved many endpoints (11 disease categories, all-cause mortality, 20 specific diseases, 4 causes of death). However, for over one-third of endpoints, ProRS+PANEL had a C-index comparable to ProRS alone, indicating that clinical variables added little beyond proteomics.
  • Protein attributions: SHAP identified proteins with broad discriminative importance. GDF15 was a robust predictor across all 14 disease categories and all-cause mortality, ranking first for several causes of death and specific diseases (bacterial infections, anemia, mood disorders, arrhythmias, heart failure, inflammatory bowel disease, renal failure). CDCP1, CXCL17, EDA2R, and HAVCR1 were important across >10 categories with consistent effect directions. Proteins with narrower importance included NEFL, BCAN, TNFRSF10B, and CA14; some were limited to a few categories (e.g., NTproBNP, TSPAN1, ACE2). For cancer, higher CXCL14, GDF15, HAVCR1, CDCP1, TSPAN1, LTBP2, and ACTA2 increased risk, whereas higher RET was protective. For dementia, NEFL, BCAN, GFAP, and GDF15 were key contributors; additional signals included ACTA2, LTBP2, and NCS1.
  • Calibration and clinical utility: Models were well calibrated for most endpoints (except viral infections). Decision curve analysis showed that ProRS had greater net benefit than Age+Sex or PANEL for most outcomes. Adding ProRS to clinical predictors improved net benefit but often performed similarly to ProRS alone; in some threshold ranges, ProRS alone exceeded the combinations (notably for all-cause mortality, and to a lesser extent dementia and heart failure).

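
The net benefit reported in the decision curve analysis can be made concrete with a small sketch. It uses the standard binary-outcome definition, net benefit = TP/N − (FP/N) × pt/(1 − pt), and divides by prevalence to obtain the standardized form. Treating the endpoint as a binary label at a fixed horizon is a simplifying assumption (the study analyzed time-to-event data), and the function names and toy data are hypothetical.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def standardized_net_benefit(y_true, risk, threshold):
    """Net benefit divided by prevalence, so that 1.0 is the best achievable value."""
    prevalence = sum(y_true) / len(y_true)
    return net_benefit(y_true, risk, threshold) / prevalence


# Toy cohort (hypothetical): 2 events among 10 people.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_risk = [0.9, 0.8, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
treat_all = [1.0] * len(y)

print(net_benefit(y, model_risk, 0.5))               # model-guided strategy
print(standardized_net_benefit(y, model_risk, 0.5))
print(net_benefit(y, treat_all, 0.5))                # treat-everyone reference
```

A decision curve plots this quantity over a range of thresholds for each model alongside the treat-all and treat-none (net benefit 0) references; a model is useful at a given threshold when its curve lies above both.
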
Discussion

The findings show that a single-domain plasma proteomic profile can comprehensively stratify future risk across 45 diseases and mortality endpoints. ProRS effectively captures preclinical signals preceding disease onset, aligning with the goal of early identification and enabling timely preventive interventions. Although ProRS correlates with age, it provides risk information beyond age and sex and reveals sex-specific disparities at comparable risk-score percentiles. Compared to widely used clinical predictors and serum measures, proteomics delivers superior or comparable discrimination for most endpoints, and the added value of combining proteomics with clinical variables is often modest, indicating that proteomics encapsulates rich pathophysiological information relevant to multiple disease processes. Proteins such as GDF15, CDCP1, CXCL17, and EDA2R likely represent shared pathways influencing diverse diseases, suggesting opportunities for interventions targeting common mechanisms without adverse trade-offs across disease domains. Robust calibration and favorable decision-analytic net benefit underline the translational potential of ProRS for clinical decision-making and population health screening. Overall, the study supports integrating plasma proteomics into risk assessment workflows to improve precision prevention and triage across a broad spectrum of conditions.

Conclusion

Plasma proteomic profiling, operationalized via the ProNNet-derived ProRS, provides accurate, calibrated, and clinically useful prediction of future risk for numerous diseases and mortality, frequently outperforming or matching established clinical predictors. The approach offers a scalable single-domain alternative to complex, multi-parameter risk assessments and supports individualized health checks and targeted prevention. Future work should include independent external validation in ethnically diverse, unselected populations; assessment of implementation pathways and cost-effectiveness; exploration of expanded protein panels; evaluation of longitudinal changes in proteomic profiles; and mechanistic studies to validate shared protein targets such as GDF15 and CDCP1 for potential intervention.

Limitations
  • Proteomic scope: The Olink panels used may omit proteins predictive of multi-disease outcomes; larger or different panels might improve performance.
  • Cohort characteristics: UK Biobank participants are generally healthier and predominantly White/European, which may lower event rates and limit generalizability.
  • Validation: Only internal cross-validation was performed; external validation in independent, diverse cohorts is required before clinical deployment.
  • Endpoint ascertainment: Reliance on routine clinical coding and self-report may introduce misclassification despite comprehensive sources.