Medicine and Health

Faecal microbiome-based machine learning for multi-class disease diagnosis

Q. Su, Q. Liu, et al.

This study by Qi Su and colleagues shows how systematic characterization of the human faecal microbiome can support non-invasive disease diagnostics. Using metagenomic data from more than 2,300 individuals, the authors developed a machine-learning model that predicts multiple diseases with high accuracy, pointing to the potential of microbiome-based tools in clinical practice.

Introduction

Recent studies have linked imbalanced gut microbiota (dysbiosis) to various diseases. Current microbial marker development primarily uses binary classifiers, which are limited by overlapping gut microbiome signatures across multiple health conditions. Single-disease models risk misclassification due to confounding from unrelated diseases. While previous attempts at multi-class diagnostic models exist, they suffer from limitations like heterogeneity and batch effects in public datasets. This study addresses these issues by creating a large single-site dataset of multiple diseases and applying machine learning multi-class models to predict diseases using species-level fecal microbiome profiling. The findings are then validated against public metagenome datasets.

Literature Review

The existing literature highlights a strong correlation between the gut microbiome and various diseases. However, most studies rely on binary classification models that compare a single disease state against healthy controls. This approach overlooks the overlap between diseases, which reduces specificity and can produce misleading conclusions. Public datasets, while valuable, vary in sample collection, processing, and sequencing protocols, which introduces batch effects and heterogeneity. A larger, more homogeneous dataset combined with a multi-class approach is therefore needed for accurate and reliable disease prediction from microbiome profiles.

Methodology

This study used metagenomic sequencing of fecal samples from 2320 Hong Kong Chinese individuals spanning nine phenotypes, namely eight diseases plus healthy controls: colorectal cancer (CRC), colorectal adenomas (CA), Crohn's disease (CD), ulcerative colitis (UC), diarrhoea-predominant irritable bowel syndrome (IBS-D), obesity, cardiovascular disease (CVD), post-acute COVID-19 syndrome (PACS), and healthy controls. The 14.3 terabytes of sequencing data yielded 1208 bacterial species, of which 325 (relative abundance > 0.15% and present in >5% of subjects) were selected for analysis. Alpha diversity and richness varied across disease phenotypes, suggesting that these ecological indices alone are not robust disease indicators. Associations between species-level microbial composition and disease phenotypes were assessed using MaAsLin2, adjusting for biological and technical confounders. Five machine learning multi-class classifiers (random forest (RF), K-nearest neighbours (KNN), multi-layer perceptron (MLP), support vector machine (SVM), and graph convolutional neural network (GCN)) were trained on 70% of the data, with the remaining 30% held out as an independent test set. Model performance was evaluated using AUROC, sensitivity, specificity, and accuracy. The RF model performed best and was selected for further analysis and for external validation on 1597 samples from 12 public datasets across diverse populations. Finally, the top 50 bacterial species contributing to the model were correlated with disease phenotypes to aid interpretation of the model.
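The species filter and the 70/30 split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline. The function names `filter_species` and `stratified_split` are hypothetical. The sketch assumes that the abundance threshold applies to the mean relative abundance across subjects and that the split is stratified by phenotype; the summary states neither.

```python
import random
from collections import defaultdict


def filter_species(profiles, min_mean_abundance=0.15, min_prevalence=0.05):
    """Return species passing both thresholds, sorted by name.

    profiles: one dict per subject mapping species name to relative
    abundance in percent. Assumption: the abundance threshold is applied
    to the mean across subjects, and "present" means abundance > 0.
    """
    n_subjects = len(profiles)
    all_species = set().union(*(p.keys() for p in profiles))
    kept = []
    for species in sorted(all_species):
        values = [p.get(species, 0.0) for p in profiles]
        mean_abundance = sum(values) / n_subjects
        prevalence = sum(v > 0 for v in values) / n_subjects
        if mean_abundance > min_mean_abundance and prevalence > min_prevalence:
            kept.append(species)
    return kept


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train and test sets within each phenotype.

    Assumption: stratifying by phenotype keeps the class proportions
    similar in both sets; the original split procedure may differ.
    """
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)
    train, test = [], []
    for label in sorted(indices_by_label):
        indices = indices_by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

Requiring both conditions removes species that are abundant in only a handful of subjects as well as species that are widespread but present only in trace amounts.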

Key Findings

The random forest (RF) multi-class model achieved high performance in predicting multiple diseases from fecal microbiome profiles. In the independent test set, the model achieved a mean AUROC ranging from 0.90 to 0.99 (IQR 0.91–0.94) for different disease phenotypes. Sensitivity ranged from 0.81 to 0.95 (IQR 0.87–0.93) at a specificity of 0.76 to 0.98 (IQR 0.83–0.95). The model's performance was consistent across different data splits and age strata. Validation on independent public datasets (1597 samples) yielded a mean AUROC of 0.69 to 0.91 (IQR 0.79–0.87), demonstrating the model's generalizability across populations. Analysis of the top 50 bacterial species revealed 363 significant associations with disease phenotypes. Many diseases showed decreased abundance of Firmicutes or Actinobacteria and increased abundance of Bacteroidetes compared to healthy controls. Disease-specific microbial signatures were also identified; for example, *Parvimonas micra* was significantly higher in CRC patients than in those with colorectal adenomas. The model accurately classified COVID-19 recovered patients as healthy, and showed a high degree of specificity for the nine phenotypes studied, demonstrating low misclassification for unrelated diseases.
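For readers unfamiliar with how per-phenotype AUROC, sensitivity, and specificity are derived from a multi-class model, the sketch below computes them one-vs-rest from predicted class probabilities. It is an illustration under stated assumptions, not the authors' evaluation code. The function names and the 0.5 decision threshold are hypothetical; the summary does not say which thresholds produced the reported values.

```python
def auroc(scores, is_positive):
    """AUROC as the probability that a random positive sample scores
    higher than a random negative one; ties count as half a win."""
    positives = [s for s, y in zip(scores, is_positive) if y]
    negatives = [s for s, y in zip(scores, is_positive) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_metrics(true_labels, probabilities, target, threshold=0.5):
    """Score one phenotype against all others.

    probabilities: one dict per sample mapping phenotype to predicted
    probability. The threshold is an illustrative assumption.
    """
    scores = [p[target] for p in probabilities]
    is_positive = [label == target for label in true_labels]
    predicted = [s >= threshold for s in scores]
    tp = sum(a and b for a, b in zip(is_positive, predicted))
    fn = sum(a and not b for a, b in zip(is_positive, predicted))
    tn = sum(not a and not b for a, b in zip(is_positive, predicted))
    fp = sum(not a and b for a, b in zip(is_positive, predicted))
    return {
        "auroc": auroc(scores, is_positive),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Repeating the call for each phenotype gives one AUROC per class, which is how a range such as 0.90 to 0.99 across phenotypes can arise from a single model.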

Discussion

This study demonstrates the feasibility of a fecal microbiome-based multi-class model for disease diagnosis. The high performance and generalizability of the RF model, coupled with its non-invasive nature, offer potential clinical applications for disease screening, risk assessment, and treatment response monitoring. The identification of shared and disease-specific microbial signatures contributes to a deeper understanding of the microbiome's role in disease pathogenesis. However, further research is needed to fully elucidate the mechanisms underlying the observed microbiome-phenotype associations.

Conclusion

This research presents the largest single-site fecal microbiome dataset to date, encompassing multiple disease phenotypes, and a high-performing machine learning multi-class model for disease classification. This non-invasive approach holds promise for clinical applications in disease diagnostics and treatment response monitoring. Future work should focus on expanding the disease spectrum, investigating the underlying mechanisms, and validating the model in diverse clinical settings.

Limitations

The study has some limitations. The disease spectrum is limited, and inclusion of more phenotypes could improve the model's diagnostic capabilities. Further research is needed to establish the biological mechanisms underlying the identified microbiome-phenotype associations. The public datasets used for validation lacked detailed information on comorbidities and antibiotic use, potentially affecting model performance. Finally, while the model predicts probabilities for multiple diseases simultaneously, this aspect warrants further investigation and validation.
