Medicine and Health

Machine-learning algorithms for asthma, COPD, and lung cancer risk assessment using circulating microbial extracellular vesicle data and their application to assess dietary effects

A. McDowell, J. Kang, et al.

This study reports high-performing predictive models for COPD, asthma, and lung cancer, built by applying machine learning to microbial extracellular vesicle (EV) metagenomes from patient serum. The research, conducted by Andrea McDowell and her team, proposes serum microbial EVs as noninvasive diagnostic features with high reported accuracy.

Introduction

Chronic respiratory diseases such as COPD, asthma, and lung cancer are major global health burdens. COPD accounted for 3.0 million deaths in 2016, and asthma contributed 24.8 million disability-adjusted life years (DALYs). Chronic airway inflammation can promote carcinogenesis, elevating lung cancer risk. Accurate, noninvasive predictive methods for these diseases are needed for early diagnosis and prevention. The human microbiome profoundly influences health, and microbial extracellular vesicles (EVs) are emerging as functional mediators that can enter circulation and carry microbial nucleic acids, proteins, and lipids systemically. Prior work indicates that microbial EVs can be either harmful or protective in pulmonary contexts and that they show disease-associated signatures. This study aimed to determine whether serum microbial EV metagenomes can serve as robust features for machine-learning models that predict the risk of COPD, asthma, and lung cancer, and to test whether dietary interventions modulate model-predicted risk in a high-fat diet mouse model.

Methodology

Human cohort: 1825 Korean participants were enrolled from five hospitals (2017–2020): COPD (n=93), asthma (n=454), lung cancer (n=283), and healthy controls (n=995). After quality filtering, 1727 samples remained for analysis: COPD (n=92), asthma (n=428), lung cancer (n=279), controls (n=928). IRB approvals were obtained and informed consent collected.

Sample processing and sequencing: Serum was collected in SSTs. EVs were isolated by centrifugation, filtration, and boiling; DNA was extracted with the QIAGEN DNeasy Blood & Tissue Kit and quantified. The V3–V4 regions of 16S rDNA were amplified and sequenced on Illumina MiSeq.

Taxonomic assignment: Reads were trimmed/merged (Cutadapt 1.1.6, CASPER). High-quality reads were retained (length 350–550 bp; Phred ≥20). OTUs were clustered de novo (VSEARCH, 97% similarity). Taxa were assigned against Silva 132; unresolved genera were assigned to the highest resolvable rank. Samples with <1000 OTUs were excluded.

Taxonomic hierarchical accumulation (feature coding): Species were summarized to genus; abundances were log2-scaled and then normalized to relative abundance per sample. To reduce zero-inflation and weight imprecise assignments, accumulated values per genus combined contributions from higher taxa (order, class, phylum, kingdom) using small weighting coefficients (e.g., k1=10^-1, etc.). Accumulated values across genus-to-kingdom levels were computed and used as features.

Machine learning: 1513 serum EV taxa features were used. Five methods were implemented per disease (asthma, COPD, lung cancer): generalized linear model (GLM) without feature selection (all 1513 features), GLM with feature selection (Wilcoxon test with Bonferroni-adjusted p<0.05), gradient boosting machine (GBM; scikit-learn GradientBoostingRegressor: learning_rate=0.01, n_estimators=3000, max_depth=10), artificial neural network (ANN; Keras, 5-layer with L1 regularization, loss=MSE, activation=ReLU, optimizer=RMSProp, epochs=200), and an ANN+GBM ensemble (average of outputs). To mitigate overfitting, 30 iterations of random 70/30 train-test splits were performed; the training set was used for model fitting and the test set for validation. Permutation feature importance was computed over 30 iterations (ELI5 0.10.1).

In vivo dietary study: 180 female C57BL/6 mice (6 weeks old) were randomized (5 per IVC cage) to 36 dietary groups: normal chow diet (NCD), high-fat diet (HFD; 60% fat, 20% protein, 20% carbohydrate), or HFD plus one of 34 supplements (foods added to drinking water at 2% w/v, alternated every 12 h). After 4 weeks, mice were sacrificed, serum was collected, EVs were processed as above, and the human-trained ensemble models were applied to compute asthma, COPD, and lung cancer prediction values. Animal protocols were approved (Chung-Ang University Approval No. 2018-00057).

Statistical analyses: Alpha diversity (observed OTUs, Chao1, ACE, Shannon, Simpson) was computed via phyloseq in R. Beta diversity was assessed with PCA, MDS (stats package), and t-SNE (tsne package) using UniFrac distances. Between-group differences were evaluated by Pearson correlation, t-test, or Wilcoxon test (p≤0.05).
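
The evaluation scheme described above (repeated random 70/30 splits, a GBM and a neural network fitted as regressors on binary labels, and an ensemble that averages their outputs) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data are synthetic, scikit-learn's MLPRegressor stands in for the Keras ANN, the GBM is much smaller than the reported configuration so the example runs quickly, stratified splitting is an added assumption, and the function name repeated_split_auc is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor


def repeated_split_auc(X, y, n_iter=30, test_size=0.3, seed=0):
    """Return {model: (mean AUC, SD)} over repeated random train/test splits."""
    aucs = {"gbm": [], "nn": [], "ensemble": []}
    for i in range(n_iter):
        # Stratification is an assumption; the source only says "random" splits.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + i
        )
        # Regressors are fitted on 0/1 labels; outputs are used as risk scores.
        # The source reports n_estimators=3000 and max_depth=10; smaller here.
        gbm = GradientBoostingRegressor(
            learning_rate=0.01, n_estimators=300, max_depth=3, random_state=seed + i
        ).fit(X_tr, y_tr)
        # Stand-in for the Keras ANN described in the source.
        nn = MLPRegressor(
            hidden_layer_sizes=(32, 16), max_iter=500, random_state=seed + i
        ).fit(X_tr, y_tr)
        p_gbm, p_nn = gbm.predict(X_te), nn.predict(X_te)
        scores = {"gbm": p_gbm, "nn": p_nn, "ensemble": (p_gbm + p_nn) / 2}
        for name, p in scores.items():
            aucs[name].append(roc_auc_score(y_te, p))
    return {k: (float(np.mean(v)), float(np.std(v))) for k, v in aucs.items()}


# Synthetic stand-in for the taxa feature matrix (200 samples, 20 features).
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # label depends on two features

results = repeated_split_auc(X, y, n_iter=3)  # the source used 30 iterations
for name, (mean_auc, sd) in results.items():
    print(f"{name}: AUC {mean_auc:.3f} (SD {sd:.3f})")
```

Reporting the mean and SD over splits mirrors how the AUCs are summarized in the Key Findings, although repeated splits of a single cohort do not replace external validation.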

Key Findings

Cohort and sequencing: 1727 serum samples passed QC (COPD 92, asthma 428, lung cancer 279, controls 928). Mean read count: 55,993; mean OTUs: 10,384. Dominant phyla: Firmicutes, Proteobacteria, Actinobacteria, Bacteroidetes. Dominant genera: Acinetobacter, Cutibacterium, Pseudomonas, Bacteroides, Staphylococcus, Sphingomonas, Lactobacillus.
Diversity: Alpha diversity (richness and diversity indices) ranked lowest in COPD, followed by controls, asthma, and highest in lung cancer. t-SNE provided distinct clustering by disease group; PCA and MDS did not clearly separate groups.
Model performance (AUC, mean over 30 test iterations, SD in parentheses):
• COPD: GLM-all 0.49 (0.054), GLM-selected 0.66 (0.158), ANN 0.86 (0.012), GBM 0.91 (0.003), Ensemble 0.93 (0.004).
• Asthma: GLM-all 0.50 (0.031), GLM-selected 0.80 (0.057), ANN 0.97 (0.007), GBM 0.98 (0.001), Ensemble 0.99 (0.002).
• Lung cancer: GLM-all 0.50 (0.040), GLM-selected 0.77 (0.057), ANN 0.87 (0.030), GBM 0.92 (0.003), Ensemble 0.94 (0.004).
Ensemble models showed low variability (SD <0.01) and the highest AUC across diseases.
Feature importance: Proteobacteria contributed the most important features for asthma and lung cancer; Firmicutes dominated in COPD. Notable taxa:
• Asthma: Fimbriimonadaceae (Fimbriimonas-related features) highly associated; shared mild-to-moderate importance with Ralstonia; overlap with lung cancer included Stenotrophomonas, Acinetobacter, and Ralstonia.
• COPD: Megamonas most associated; shared mild-to-moderate importance with Burkholderia-Caballeronia-Paraburkholderia and Ralstonia (overlap with lung cancer).
• Lung cancer: Curvibacter and Helicobacter most important.
Dietary modulation in mice (model predictions applied to serum EVs): HFD increased asthma prediction values (mean 0.31 vs 0.09 for NCD) and slightly increased lung cancer prediction values (0.23 vs 0.18 for NCD); COPD predictions were unchanged on average (0.09 for both HFD and NCD). Foods that changed HFD-associated predictions:
• Asthma: glutinous rice flour, lotus root powder, mungbean powder, and sesame oil lowered predictions; policosanol and mealworm oil increased them.
• Lung cancer: mungbean powder, glutinous rice flour, and Kakadu plum powder lowered predictions; honey and propolis spray increased them.
• COPD: minimal changes; brown rice oil, pear extract, and sesame oil slightly lowered predictions; safflower extract slightly increased them.
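
The feature-importance rankings reported above come from permutation importance: a feature's values are shuffled in held-out data and the resulting drop in model score is recorded. The sketch below illustrates the idea with scikit-learn's permutation_importance, swapped in here for the ELI5 implementation the study used. The data, the feature names (taxon_0 to taxon_9), and the mapping of the study's "30 iterations" onto n_repeats=30 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only one hypothetical feature ("taxon_3") carries signal.
rng = np.random.default_rng(0)
feature_names = [f"taxon_{i}" for i in range(10)]
X = rng.random((300, 10))
y = (X[:, 3] > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 30 times on the held-out set and measure the mean
# drop in the model's default score (R^2 for a regressor).
result = permutation_importance(model, X_te, y_te, n_repeats=30, random_state=0)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
top_feature = ranking[0][0]
for name, score in ranking[:3]:
    print(f"{name}: {score:.4f}")
```

Because importance is measured on held-out data, a taxon ranks highly only if the fitted model actually relies on it for prediction; it does not by itself indicate a causal role in disease.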

Discussion

The study addressed whether circulating microbial EV metagenomes can serve as robust, noninvasive biomarkers for respiratory disease risk. By introducing a taxonomic hierarchical accumulation method to mitigate zero inflation and leverage higher-rank taxonomy, and applying modern ML (ANN, GBM, ensemble), the models achieved high discriminative performance, particularly for asthma (AUC 0.99) and lung cancer (AUC 0.94), and strong performance for COPD (AUC 0.93). The results suggest serum EV-derived microbial signatures reflect systemic alterations associated with these diseases. Feature importance patterns aligned with and extended prior literature: Proteobacteria were prominent in asthma and lung cancer, and Firmicutes in COPD. Taxa such as Stenotrophomonas, Helicobacter, Curvibacter, Megamonas, and Fimbriimonadaceae emerged as influential features, offering hypotheses for disease mechanisms and potential targets for further research. Dimensionality reduction showed that while linear methods (PCA, MDS) poorly separated groups, t-SNE captured disease-specific structure in the EV microbiome feature space, supporting the utility of nonlinear approaches. Applying the human-trained models to mouse serum demonstrated biologically plausible dietary effects: HFD elevated asthma and lung cancer risk predictions and had minimal impact on COPD, while certain foods attenuated HFD-associated risk. These in vivo results provide preliminary support for the translational relevance of the EV-based models and suggest that diet-microbiome-EV interactions influence respiratory disease risk signatures.
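
To make the taxonomic hierarchical accumulation idea more concrete, here is one possible reading of it as code. The summary does not give the exact formula, so everything specific here is an assumption: counts are log2-scaled with a +1 pseudocount, the weights are taken as k_i = 10^-i from order up to kingdom, and each genus feature is its own relative abundance plus the down-weighted totals of its ancestor taxa. The function name accumulate and the data layout are hypothetical.

```python
import math
from collections import defaultdict

# Higher ranks named in the source, nearest to the genus first.
HIGHER_RANKS = ("order", "class", "phylum", "kingdom")
# Assumed weights: 10^-1 for order, 10^-2 for class, and so on.
WEIGHTS = {rank: 10.0 ** -(i + 1) for i, rank in enumerate(HIGHER_RANKS)}


def accumulate(sample, lineages):
    """sample: {genus: read count}; lineages: {genus: {rank: taxon name}}."""
    # log2-scale the counts (the +1 pseudocount is an assumption), then
    # normalize to relative abundance within the sample.
    scaled = {g: math.log2(c + 1) for g, c in sample.items()}
    total = sum(scaled.values())
    rel = {g: v / total for g, v in scaled.items()}

    # Total relative abundance of every higher-rank taxon in this sample.
    rank_totals = defaultdict(float)
    for g, v in rel.items():
        for rank in HIGHER_RANKS:
            rank_totals[(rank, lineages[g][rank])] += v

    # Genus value plus down-weighted contributions from its ancestors, so a
    # genus with zero reads still gets a small value if its relatives occur.
    return {
        g: rel.get(g, 0.0)
        + sum(WEIGHTS[r] * rank_totals[(r, lin[r])] for r in HIGHER_RANKS)
        for g, lin in lineages.items()
    }


gamma = {"order": "Pseudomonadales", "class": "Gammaproteobacteria",
         "phylum": "Proteobacteria", "kingdom": "Bacteria"}
lineages = {
    "Acinetobacter": dict(gamma),
    "Pseudomonas": dict(gamma),
    "Lactobacillus": {"order": "Lactobacillales", "class": "Bacilli",
                      "phylum": "Firmicutes", "kingdom": "Bacteria"},
}
sample = {"Acinetobacter": 7, "Lactobacillus": 1}  # Pseudomonas not observed

features = accumulate(sample, lineages)
for genus, value in features.items():
    print(f"{genus}: {value:.5f}")
```

In this toy sample Pseudomonas has no reads, yet its feature value is nonzero because it shares an order, class, phylum, and kingdom with the observed Acinetobacter. That is the sense in which accumulation can reduce zero inflation.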

Conclusion

Serum microbial extracellular vesicles provide rich, noninvasive features for predicting COPD, asthma, and lung cancer risk. Using a novel taxonomic hierarchical accumulation strategy and an ANN/GBM ensemble, the study achieved high AUCs across diseases and identified key taxa contributing to prediction. Preliminary mouse experiments indicated that high-fat diets increase predicted asthma and lung cancer risk, and that specific foods (e.g., mungbean, glutinous rice flour, lotus root, Kakadu plum) may mitigate these effects, while some (e.g., honey, propolis spray) may worsen risk predictions. Future work should include large, multi-center, demographically diverse cohorts; stage-specific analyses; detailed data on medications and smoking; external validation; longitudinal designs; mechanistic studies of key taxa; and controlled trials to validate dietary interventions impacting EV-based risk signatures.

Limitations

Cohort composition was limited to Korean subjects with skewed sex ratios (e.g., predominantly male in COPD and lung cancer groups), which may limit generalizability.
No external validation cohort was used; performance was assessed via repeated random train-test splits (cross-validation-like), raising potential overfitting concerns despite low variance.
Disease staging, medication use, and smoking history were not comprehensively integrated, which could confound microbiome EV signatures.
16S rDNA profiling and reliance on public taxonomic databases can limit genus-level resolution; although the hierarchical accumulation method addresses imprecision and zero inflation, it requires further validation.
Mouse dietary findings represent model-predicted risk changes rather than clinical outcomes and require controlled validation in humans.
Some influential taxa (e.g., Fimbriimonadaceae, Megamonas, Curvibacter) lack established links to these diseases, necessitating cautious interpretation and mechanistic follow-up.
