Advancing COVID-19 diagnosis with privacy-preserving collaboration in artificial intelligence

Medicine and Health

X. Bai, H. Wang, et al.

Discover how the Unified CT-COVID AI Diagnostic Initiative (UCADI), led by researchers including Xiang Bai and Hanchen Wang, advances COVID-19 diagnosis while protecting patient data through federated learning. The approach improves diagnostic accuracy without sharing data across institutions, pointing to the future of privacy-preserving AI in healthcare.

Introduction
RT-PCR is the primary diagnostic for COVID-19 but shows limited sensitivity (~0.60–0.71), leading to false negatives. Chest CT exhibits radiological signs (ground-glass opacities, interlobular septal thickening, consolidation) that can aid diagnosis, but CT-based decisions vary across radiologists and are not specific to COVID-19. Developing accurate automated CT-based diagnosis requires large, well-annotated datasets, yet faces three data-related challenges: incompleteness (limited high-quality CTs may not cover full radiologic variability), isolation (privacy/security concerns hinder multi-center data sharing and limit generalization of local models), and heterogeneity (differences in acquisition protocols such as contrast use and reconstruction kernels). The study aims to develop a robust, generalized, and privacy-preserving AI diagnostic model for COVID-19 CT interpretation by enabling multinational collaboration without sharing patient data, via a federated learning framework (UCADI).
Literature Review
Reported CT sensitivities for COVID-19 vary widely (0.56–0.98) across settings, and radiologists show discrepancies distinguishing COVID-19 from other viral pneumonias. Prior AI studies explored multimodal and imaging-based COVID-19 detection. Federated learning, introduced for decentralized model training without data centralization, provides a pathway to collaboration under privacy constraints, while related approaches like Swarm Learning aim at decentralization but may be computationally demanding at scale. The UK’s NCCID aggregates multi-institutional imaging data but includes heterogeneous acquisition protocols (notably high prevalence of contrast-enhanced CT). Image-to-image translation methods (e.g., CycleGAN) have been proposed to mitigate domain shifts between contrast and non-contrast CTs to improve generalizability.
Methodology
Study design and datasets: CT datasets were collected from 23 hospitals: three branches of the Wuhan Tongji Hospital Group (Main Campus, Optical Valley, Sino-French) in China and 18 UK hospitals via the NCCID, totaling 9,573 CTs from 3,336 patients. The 5,740 Chinese CTs were all non-contrast, acquired on GE LightSpeed16/Discovery 750 HD and Siemens SOMATOM Definition AS+ scanners with 1–1.25 mm slices: 2,723 CTs from 432 COVID-19 patients (January 2020 onward) and 3,017 CTs from healthy individuals and patients with other viral or bacterial pneumonia (2016 onward). The China cohort was split into train/validation (1,095 cases; 1,136 healthy, 2,200 COVID-19, 250 other viral, and 1,078 bacterial CTs) and test (254 cases; 262 healthy, 523 COVID-19, 84 other viral, and 207 bacterial CTs). Two hold-out COVID-19-only sets were reserved: 507 cases from Wuhan Union Hospital and 645 cases from Wuhan Tianyou Hospital. The UK NCCID contributed 2,682 CTs across diverse scanners and protocols, of which 2,145 were contrast-enhanced. Non-contrast subset: train/val 116 non-COVID-19 cases (394 CTs) and 54 COVID-19 cases (163 CTs); test 75 non-COVID-19 cases (103 CTs) and 27 COVID-19 cases (37 CTs). Contrast subset: train/val 276 non-COVID-19 cases (1,097 CTs) and 145 COVID-19 cases (491 CTs); test 160 non-COVID-19 cases (259 CTs) and 83 COVID-19 cases (138 CTs). CTs with fewer than 40 slices were excluded.

Preprocessing and sampling: From each CT study, 16 slices were selected by adaptive sampling (random start and scalable interval). During training/validation a single sample was drawn per study; at test time, five independent samplings were performed and the predictions averaged. Slices were standardized (channel-wise de-meaned, variance rescaled), resized via trilinear interpolation from 512 to 128 pixels per axis, and lung-windowed to −1200 to 600 HU.

Model architecture: A 3D-DenseNet was developed, adapting DenseNet to 3D with 14 convolutional layers organized into six 3D dense blocks and two transition blocks.
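The adaptive sampling and normalization steps above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names and the exact sampling rule are assumptions, and the trilinear resize to 128 pixels per axis is omitted for brevity.

```python
import numpy as np

def sample_slices(volume, n_slices=16, rng=None):
    """Adaptive sampling: a random start and an interval that scales with
    study depth, so `n_slices` slices span the whole scan (illustrative)."""
    rng = rng or np.random.default_rng()
    depth = volume.shape[0]
    interval = depth // n_slices                      # scalable interval
    start = rng.integers(0, depth - interval * (n_slices - 1))
    idx = start + interval * np.arange(n_slices)
    return volume[idx]

def preprocess(slices, lo=-1200.0, hi=600.0):
    """Lung-window to [-1200, 600] HU, then de-mean and rescale variance."""
    x = np.clip(slices.astype(np.float32), lo, hi)
    return (x - x.mean()) / (x.std() + 1e-6)
```

Because the interval scales with depth, a 40-slice and a 200-slice study are both covered end to end by the same 16-slice budget.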
Each dense block has two 3D conv layers with inter-residual connections; transition blocks comprise a 3D conv and average pooling. 3D DropBlock regularization was applied before and after the dense blocks. Batch normalization momentum was 0.9; LeakyReLU slope 0.2.

Training: The network outputs four-class probabilities (healthy, COVID-19, other viral, bacterial). Weighted cross-entropy addressed class imbalance, with weights 0.2 (healthy), 0.2 (COVID-19), 0.4 (other viral), and 0.2 (bacterial). Optimization used SGD with momentum 0.9 and batch size 16. The learning rate warmed up linearly from 0 to 0.01 over 5 epochs, then was cosine-annealed to 0 over the remaining 95 epochs (100 in total). Fivefold cross-validation on the train/val split selected the best model for testing. For main-text comparisons, a binary COVID-19 vs non-COVID-19 evaluation was emphasized; four-class results were reported in the supplementary materials.

Domain adaptation (NCCID): To mitigate heterogeneity between contrast and non-contrast CTs, CycleGAN-based unpaired image translation was used to synthesize non-contrast-like images from contrast CTs (and vice versa for comparison experiments). The CycleGAN used a ResNet encoder, batch size 12, and 200 epochs (learning rate 2e-4 for 100 epochs, then linearly decayed to 0). Augmentation with CycleGAN-imputed non-contrast images improved non-contrast model performance; mixing raw and translated contrast images could degrade performance.

Federated learning framework (UCADI): A central server orchestrated model aggregation via FedAvg, weighting client updates by dataset sizes and the number of local epochs between synchronizations. Clients trained locally and transmitted encrypted model parameters back every few epochs over web sockets. Additive homomorphic encryption based on Learning With Errors (LWE) was applied to all transmissions (model parameters and metadata) to preserve privacy; the server aggregates updates without access to individual plaintext weights.
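The class weighting and learning-rate schedule described above can be written down directly. This sketch assumes the schedule is computed per epoch (the summary does not state per-step vs per-epoch granularity), and `weighted_cross_entropy` is an illustrative per-sample form rather than the authors' exact loss code.

```python
import math

import numpy as np

# Class weights reported for (healthy, COVID-19, other viral, bacterial)
CLASS_WEIGHTS = np.array([0.2, 0.2, 0.4, 0.2])

def weighted_cross_entropy(probs, label):
    """Per-sample weighted cross-entropy over the four-class output."""
    return float(-CLASS_WEIGHTS[label] * np.log(probs[label] + 1e-12))

def learning_rate(epoch, base_lr=0.01, warmup=5, total=100):
    """Linear warmup from 0 to base_lr over `warmup` epochs, then
    cosine annealing to 0 over the remaining `total - warmup` epochs."""
    if epoch < warmup:
        return base_lr * epoch / warmup
    progress = (epoch - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Upweighting the rare "other viral" class (0.4 vs 0.2) counteracts its small share of the training CTs.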
LWE overhead: encryption–decryption took ~2.7 s per round (on an Intel Xeon E5-2630 v3 @ 2.40 GHz); the model size increases from 2.8 MB to 62 MB when encrypted; at 900 KB/s bandwidth, transmission time increases from ~3.1 s to ~68.9 s. The framework supports dynamic participation and breakpoint resumption.

Communication–performance trade-off: Experiments varied the synchronization interval (every 1, 2, or 4 epochs, or once at the end) using two clients (each with 200 training CTs and 100 testing CTs; GTX 1080Ti GPU + Xeon E5-2660 v4 CPU; bandwidth ~7.2 Mbit/s). More frequent synchronization yielded better AUC (every epoch: AUC 0.986) at less than 20% additional time compared to the least frequent setting.

Interpretability: Grad-CAM heatmaps localized class-discriminative regions overlapping radiologist annotations, suggesting the model learned radiologic lesion features and can aid clinical lesion localization.

Radiologist comparison: Six radiologists (average 9 years' experience) provided individual and majority-vote diagnoses on the China test split for four-class classification; performance was compared on COVID-19 vs non-COVID-19.
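The FedAvg aggregation step can be sketched in plaintext as below. In UCADI the server operates on LWE-encrypted parameters, which is omitted here, and the reported weighting by local epoch counts is folded into the illustrative `client_sizes` argument; names are assumptions, not the released API.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """FedAvg: average each layer's parameters across clients, weighted
    by local dataset size. `client_params` is a list of per-client
    parameter lists (one ndarray per layer)."""
    total = float(sum(client_sizes))
    coeffs = [n / total for n in client_sizes]
    n_layers = len(client_params[0])
    return [
        sum(c * params[layer] for c, params in zip(coeffs, client_params))
        for layer in range(n_layers)
    ]
```

Each client contributes a list of per-layer arrays; the server returns their weighted average, which becomes the next global model broadcast to all clients.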
Key Findings
- Federated learning improved generalization across multinational datasets without sharing patient data. On the China test set (1,076 CTs from 254 cases), the federated model achieved sensitivity 0.973, specificity 0.951, and AUC 0.980, outperforming all local models.
- On the UK NCCID test set (from 18 hospitals), the federated model achieved sensitivity 0.730, specificity 0.942, and AUC 0.894, again surpassing locally trained variants.
- Local model baselines: Across the three Chinese UCADI sites, locally trained models averaged sensitivity/specificity of 0.804/0.708 for COVID-19 identification. For NCCID non-contrast CTs, local model sensitivity/specificity improved from 0.703/0.961 to 0.784/0.961 with CycleGAN augmentation.
- Cross-domain generalization of local models was poor: the NCCID non-contrast-trained model tested on China achieved sensitivity/specificity/AUC of 0.313/0.907/0.745.
- Radiologist comparison (China test split): individual radiologists averaged sensitivity 0.79 and specificity 0.90; the majority-vote consensus achieved sensitivity 0.900 and specificity 0.956. The federated 3D-DenseNet was comparable (0.973/0.951).
- Hold-out external validation: On two independent COVID-19-only cohorts (Wuhan Tianyou Hospital, n=645; Wuhan Union Hospital, n=507), the federated model outperformed local models, though calibration issues were noted.
- Communication–performance trade-off: Synchronizing every epoch produced the best test AUC (0.986) with less than a 20% increase in total time versus minimal synchronization; estimated times showed higher transmission cost with more frequent syncs but improved accuracy.
- Privacy-preserving encryption: LWE increased the encrypted model size from 2.8 MB to 62 MB and transmission time from ~3.1 s to ~68.9 s at 900 KB/s; encryption–decryption averaged 2.7 s per round.
- Interpretability: Grad-CAM heatmaps aligned with radiologist-annotated lesions, indicating the model learned meaningful radiologic features.
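The reported transmission overhead follows directly from the model sizes and bandwidth; a quick back-of-the-envelope check (assuming 1 MB = 1000 KB, as the reported figures imply):

```python
bandwidth_kb_s = 900                    # reported bandwidth, KB/s
plain_mb, encrypted_mb = 2.8, 62.0     # model size before/after LWE encryption

plain_s = plain_mb * 1000 / bandwidth_kb_s          # ~3.1 s
encrypted_s = encrypted_mb * 1000 / bandwidth_kb_s  # ~68.9 s
print(f"{plain_s:.1f} s -> {encrypted_s:.1f} s")
```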
Discussion
The study demonstrates that federated learning can effectively address the core challenges of incompleteness, isolation, and heterogeneity in multinational medical imaging datasets. By training 3D-DenseNet models locally and aggregating updates centrally with privacy-preserving LWE encryption, the federated model generalized better than any local model on both China and UK test sets, despite pronounced differences in acquisition protocols and demographics. Comparable performance to a panel of experienced radiologists suggests clinical relevance, while Grad-CAM visualizations provide lesion-localizing evidence to support trust and facilitate workflow. CycleGAN-based augmentation mitigated contrast/non-contrast domain gaps in the UK data, highlighting the importance of distribution alignment in heterogeneous cohorts. Communication frequency in federated training influences performance, with modest time overheads yielding measurable AUC gains, informing practical deployments where bandwidth and synchronization cadence must be balanced. Collectively, these findings support federated learning as a viable strategy for developing robust, privacy-preserving AI tools in global health crises and routine clinical practice.
Conclusion
This work introduces UCADI, a multinational federated learning framework for CT-based COVID-19 diagnosis that preserves privacy while enabling cross-institutional collaboration. A tailored 3D-DenseNet, combined with standardized preprocessing, CycleGAN augmentation for contrast heterogeneity, and secure LWE-encrypted FedAvg aggregation, yielded a federated model that outperformed local models and matched expert radiologists on key metrics. The framework, code, and trained model are shared to support continued evolution and broader applications beyond COVID-19. Future directions include improving calibration, enhancing 3D CNN computational efficiency, expanding to additional modalities and diseases, refining domain adaptation across more diverse protocols, and optimizing communication strategies and encryption to further balance accuracy, latency, and cost.
Limitations
- Potential bias in the expert comparison: due to legal constraints, radiologists were recruited locally rather than across geographies, which may limit the generalizability of the human–AI comparison.
- Engineering and infrastructure issues: participants occasionally dropped out due to unstable internet connections, and communication overhead from encryption and synchronization lengthens training.
- Computational efficiency: the 3D CNN has room for optimization; model efficiency and deployment latency could be improved.
- Calibration: although the federated model performed better on hold-out datasets, its confidence calibration was sometimes suboptimal.
- Data heterogeneity and demographics: differences in acquisition protocols (contrast vs non-contrast) and demographic distributions across regions (e.g., an older, more male UK cohort) may affect performance and generalizability; not all non-COVID-19 demographics in NCCID were recorded.