Personalized Federated Learning for In-Hospital Mortality Prediction of Multi-Center ICU

Computer Science


H. Hamdan, R. Yaakob, et al.

This research by Hazlina Hamdan, Razali Yaakob, Ting Deng, and Khairul Azhar Kasmiran presents POLA, a personalized federated learning approach for predicting in-hospital mortality in multi-center ICU settings. POLA improves prediction accuracy while reducing communication time across participating centers.

Introduction
The study addresses predictive modeling on electronic health records (EHRs) while preserving privacy, since centralizing data poses legal and technical challenges. Federated learning (FL) offers a privacy-preserving alternative but suffers under heterogeneous, non-IID, and unbalanced client data, with higher skewness causing accuracy loss and even divergence. This threatens institutions' incentive to participate. Personalized FL (PFL) can mitigate heterogeneity by tailoring models per client. The paper argues that a single global model is unsuitable in such settings and proposes POLA, a personalized, one-shot local adaptation method, to improve in-hospital mortality prediction across multi-center ICU data while maintaining the natural non-IID and unbalanced distributions. Contributions: (1) an empirical analysis of baseline FL degradation in this setting; (2) a reformulation of the FL optimization objective into per-client structure and parameter search; (3) the proposal and evaluation of POLA against baseline FL and two PFL methods, demonstrating improved performance and reduced communication.
Literature Review
Preliminaries cover: (A) Baseline FL via FedAvg, an SGD-based distributed optimization that minimizes a weighted average of client losses by iteratively aggregating client-updated parameters (a sketch of the aggregation step follows below); non-IID and unbalanced data induce client drift and degrade convergence and accuracy. (B) Personalized FL approaches: model fine-tuning; local loss regularization (e.g., proximal terms/Moreau envelopes); meta-learning (MAML/Reptile); multi-task learning; transfer learning; and knowledge distillation (KD). KD supports weak model coupling and can personalize structure, parameters, and hyperparameters. (C) Federated knowledge distillation (FKD): exchanging logits or scores to reduce communication (FD, HFD), often at the cost of accuracy; public-data-based methods (FedMD, MHAT) can help but conflict with FL's no-data-sharing principle. KD also enables heterogeneous client models. (D) AutoML in FL: most work centers on federated neural architecture search (NAS) for deep networks (e.g., evolutionary NAS, FedNAS, MGF-NAS) to personalize architectures and reduce communication; this study instead uses a genetic algorithm (GA) to optimize multilayer perceptron (MLP) structures and hyperparameters during local adaptation, without NAS on deep CNNs.
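To make the FedAvg aggregation concrete, here is a minimal, hedged sketch in Python (NumPy only). The function name fedavg_round and the (weights, sample-count) interface are illustrative assumptions rather than the paper's implementation; the sketch only shows the sample-size-weighted parameter averaging described above.

```python
import numpy as np

def fedavg_round(global_weights, client_updates):
    """One FedAvg aggregation round: each client trains locally on its own
    data, then the server averages the returned parameters weighted by the
    client's number of local samples n_k. `client_updates` is a list of
    (local_weights, n_k) pairs, where local_weights matches the layout of
    `global_weights` (a list of NumPy arrays)."""
    total = sum(n_k for _, n_k in client_updates)
    new_weights = [np.zeros_like(w) for w in global_weights]
    for local_weights, n_k in client_updates:
        for i, w in enumerate(local_weights):
            new_weights[i] += (n_k / total) * w  # sample-size-weighted average
    return new_weights
```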
Methodology
Problem definition: Standard FL optimizes a single global model parameter vector ω by minimizing the weighted sum of client objectives. The paper reformulates FL as per-client optimization over each client's model structure α_i and parameters θ_i, which depend on the global model ω: minimize f_i(α_i, θ_i(ω)) for every client i, rebalancing global generalization against local specificity (written out below).

Overall framework (POLA): a two-step, one-shot PFL approach. Step 1 performs FL training to obtain a robust teacher model. Because global performance can be unstable under heterogeneous data, the authors adjust FedAvg by splitting each client's data into training and validation subsets, tracking the average validation error after a warm-up of R_w rounds, and selecting as teacher the global model with the minimum average validation error over training (Algorithm 3). Step 2 conducts parallel local adaptation at each client: a GA-driven AutoML process (Procedure 1) searches over model structures (MLPs with 2–3 hidden layers) and hyperparameters; for each candidate, Procedure 2 builds a student model, initializes its parameters layerwise from the teacher for the input and first hidden layers (remaining layers randomized), and trains it with a combined KD objective.

Knowledge distillation: Output distillation minimizes the divergence between teacher and student softmax outputs at temperature T, using MSE on the soft targets (l_s1); feature distillation minimizes the KL divergence between teacher and student conditional feature-space distributions (l_s2). Hard targets are learned via binary cross-entropy (l_h). The overall loss is L = β(l_s1 + l_s2)/2 + (1−β)l_h, where β balances global knowledge against local data.

GA setup: chromosomes encode the model structure and four hyperparameters; fitness is the inverse of the validation error returned by Procedure 2; the GA applies selection, crossover (p=0.9), and mutation (p=0.1) over a population of 20 for 5 generations, returning the best personalized configuration.

Implementation: all models are MLPs. The unified FedAvg model has two 100-unit hidden layers, ReLU activations, SGD with momentum 0.9, and BCE loss; personalized models share these training settings except for the structure and hyperparameters chosen by the GA, and are trained for up to 20 epochs with early stopping. Hyperparameters: client fraction C=1.0, batch size B=50, local iterations E=5, learning rate η=0.01, warm-up R_w=5, total rounds R=100; KD temperature T=10; train/validation split 4:1. The scaling factor β is set to 0.6 for the hospital-based non-IID split and 0.4 for the unit-type-based non-IID split, reflecting teacher quality under different levels of skewness. Evaluation uses AUROC averaged across clients.
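The objectives described above can be written compactly as follows. This is a reconstruction from the summary's wording (with n_i the number of samples at client i, n the total sample count, and N the number of clients), not a verbatim copy of the paper's equations.

```latex
% Standard FL: one global parameter vector \omega shared by all clients
\min_{\omega} \; F(\omega) = \sum_{i=1}^{N} \frac{n_i}{n}\, f_i(\omega)

% POLA's per-client reformulation: structure \alpha_i and parameters \theta_i(\omega)
\min_{\alpha_i,\,\theta_i} \; f_i\bigl(\alpha_i,\, \theta_i(\omega)\bigr), \qquad i = 1, \dots, N

% Combined local adaptation loss used in Step 2
L \;=\; \beta\,\frac{l_{s1} + l_{s2}}{2} \;+\; (1-\beta)\, l_h
```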
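A minimal sketch of the adjusted FedAvg teacher selection in Step 1 (Algorithm 3) follows, assuming hypothetical server and client objects; only the warm-up and minimum-average-validation-error selection logic from the description above is illustrated.

```python
def train_teacher(server, clients, rounds=100, warmup=5):
    """Sketch of the adjusted FedAvg teacher selection described above:
    after a warm-up of `warmup` rounds, the global model with the lowest
    average client validation error seen during training is retained as
    the teacher. `server` and the `clients` objects (local_train,
    validation_error, aggregate, snapshot) are hypothetical placeholders."""
    best_err, teacher = float("inf"), None
    for r in range(rounds):
        updates = [c.local_train(server.global_weights) for c in clients]
        server.aggregate(updates)                        # FedAvg averaging step
        val_err = sum(c.validation_error(server.global_weights)
                      for c in clients) / len(clients)   # average validation error
        if r >= warmup and val_err < best_err:           # skip warm-up rounds
            best_err, teacher = val_err, server.snapshot()
    return teacher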
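Finally, a hedged PyTorch sketch of the combined distillation objective used in Step 2. Two-class output logits, a single intermediate feature layer for l_s2, and float labels in {0, 1} are assumptions made for illustration; the paper's exact feature-space formulation may differ.

```python
import torch
import torch.nn.functional as F

def pola_kd_loss(student_logits, teacher_logits,
                 student_feats, teacher_feats,
                 targets, beta=0.6, T=10.0):
    """Sketch of the combined loss L = beta*(l_s1 + l_s2)/2 + (1-beta)*l_h.
    l_s1: MSE between temperature-softened teacher and student outputs;
    l_s2: KL divergence between softmax distributions over an intermediate
    feature layer (a simplification of the paper's feature distillation);
    l_h: binary cross-entropy on the hard labels. Shapes assumed:
    logits (batch, 2), feats (batch, d), targets (batch,) as floats."""
    l_s1 = F.mse_loss(F.softmax(student_logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1))
    l_s2 = F.kl_div(F.log_softmax(student_feats, dim=1),
                    F.softmax(teacher_feats, dim=1),
                    reduction="batchmean")
    probs_pos = F.softmax(student_logits, dim=1)[:, 1]   # P(mortality = 1)
    l_h = F.binary_cross_entropy(probs_pos, targets)
    return beta * (l_s1 + l_s2) / 2 + (1 - beta) * l_h
```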
Key Findings
Data preprocessing: From eICU-CRD v2.0, a cohort of 30,680 patients (1,164,966 records) was selected. Using a 48-hour observation window, 19 input features (plus one mortality label) were engineered with one-hot encoding, normalization, and padding to a 200×442 matrix per patient.

Data distributions: three scenarios were evaluated: (1) IID data evenly partitioned across clients; (2) a natural hospital-based non-IID and unbalanced split (12 hospitals with ≥600 cases each; 9,660 patients in total); and (3) a natural ICU unit-type-based non-IID and unbalanced split (8 unit types; 30,680 patients), the last having the highest non-IID skew.

Baseline FL impact: Under IID data, FedAvg performs well; under natural non-IID data, performance degrades as skew increases. In the hospital-based split, FL still outperforms local training; in the more skewed unit-type split, FedAvg degrades severely and can fail to converge, sometimes underperforming purely local models.

Comparative results: POLA was compared with FT-FedAvg and pFedMe over 100 rounds. POLA achieves higher AUROC and reaches strong performance in far fewer communication rounds. pFedMe attains strong end-of-training performance but needs many more rounds to close the gap; for example, pFedMe requires ≥30 additional rounds to match POLA's performance at round 5. FT-FedAvg yields unstable, limited gains that depend heavily on the global model.

Client-level outcomes: In the unit-type non-IID split, POLA improves performance for all centers; in the hospital-based split, 58.33% of centers see gains over the best comparator, indicating sensitivity to local sample sizes and distribution skew. Overall, POLA consistently improves AUROC and reduces communication relative to baselines, without using public data for distillation.
Discussion
The findings support that a single global model is ineffective under natural, highly non-IID and unbalanced EHR distributions, and that one-shot local adaptation via KD with automated personalization can rebalance global and local knowledge effectively. POLA’s success depends on selecting an appropriate teacher model: strong generalization is necessary but not sufficient; the teacher must also align with student update directions for diverse clients. Thus, POLA is not universally compatible with arbitrary FL training schemes—teacher selection is critical. The method is extensible: applicable to other cross-silo domains (biomedical prediction, medical imaging, finance) and to more complex NN models where personalized architecture/hyperparameters may yield larger gains and lower communication by training smaller global models and adapting to richer local models. POLA demonstrates that one-shot personalization can deliver competitive performance earlier in training, addressing communication and compute constraints common in healthcare collaborations.
Conclusion
The study preserves the natural non-IID and unbalanced characteristics of multi-center ICU EHRs and shows baseline FL degrades, especially with high skewness. It introduces POLA, a one-shot, two-step PFL method combining adjusted FL teacher selection with local KD-based adaptation and GA-driven personalization of model structure and hyperparameters. POLA improves in-hospital mortality prediction AUROC across clients, reduces communication rounds, and outperforms FT-FedAvg and pFedMe in early to mid training, while matching or exceeding final performance. Future work includes extending POLA to deeper architectures, exploring broader FL training strategies compatible with the teacher-student paradigm, enhancing teacher selection criteria, and investigating adaptive β scheduling based on client-specific data characteristics.
Limitations
Effectiveness depends on the quality and suitability of the selected teacher model; not all FL training procedures produce teachers that align with diverse local student updates. Performance gains vary by data distribution and client dataset size—improvements were universal for unit-type-based splits but only observed in 58.33% of hospital-based centers. Experiments used MLPs and a constrained GA search (population 20, 5 generations) limited by computational resources; results may differ with deeper models or larger search budgets. The approach’s compatibility with arbitrary FL variants is limited, as noted by the authors. Evaluation focused on a single large EHR dataset (eICU-CRD); generalization to other datasets/tasks, and sensitivity to β and other hyperparameters, warrant further study.