Faecal microbiome-based machine learning for multi-class disease diagnosis


Q. Su, Q. Liu, et al.

This study by Qi Su and colleagues shows how systematic characterization of the human faecal microbiome can support non-invasive disease diagnostics. Using metagenomic data from 2,320 individuals, the authors trained a machine-learning model that predicts multiple diseases from a single faecal microbiome profile, pointing to the potential of microbiome-based tools in clinical practice.

Introduction
Recent studies have linked imbalanced gut microbiota (dysbiosis) to various diseases. Current microbial marker development primarily uses binary classifiers, which are limited by overlapping gut microbiome signatures across multiple health conditions. Single-disease models risk misclassification due to confounding from unrelated diseases. While previous attempts at multi-class diagnostic models exist, they suffer from limitations like heterogeneity and batch effects in public datasets. This study addresses these issues by creating a large single-site dataset of multiple diseases and applying machine learning multi-class models to predict diseases using species-level fecal microbiome profiling. The findings are then validated against public metagenome datasets.
Literature Review
The existing literature reports strong associations between the gut microbiome and various diseases. However, most studies rely on binary classification models that compare a single disease state against healthy controls. This approach overlooks the overlap between disease signatures, which reduces specificity and can lead to misleading conclusions. Public datasets, while valuable, vary in sample collection, processing, and sequencing protocols, and this variability introduces batch effects and heterogeneity. A larger, more homogeneous dataset analysed with a multi-class approach is therefore needed for accurate and reliable disease prediction from microbiome profiles.
Methodology
This study used metagenomic sequencing of fecal samples from 2320 Hong Kong Chinese individuals spanning nine phenotypes: eight diseases, namely colorectal cancer (CRC), colorectal adenomas (CA), Crohn's disease (CD), ulcerative colitis (UC), diarrhoea-predominant irritable bowel syndrome (IBS-D), obesity, cardiovascular disease (CVD), and post-acute COVID-19 syndrome (PACS), plus healthy controls. A total of 14.3 terabytes of sequencing data yielded 1208 bacterial species, of which 325 (relative abundance > 0.15% and present in > 5% of subjects) were selected for analysis. Alpha diversity and richness varied across disease phenotypes, suggesting that these ecological indices alone are not robust disease indicators. Associations between species-level microbial composition and disease phenotypes were assessed using MaAsLin2, adjusting for biological and technical confounders. Five machine learning multi-class classifiers (random forest (RF), K-nearest neighbours (KNN), multi-layer perceptron (MLP), support vector machine (SVM), and graph convolutional neural network (GCN)) were trained on 70% of the data, with the remaining 30% held out as an independent test set. Model performance was evaluated using AUROC, sensitivity, specificity, and accuracy. The RF model performed best and was selected for further analysis and for external validation on 1597 samples from 12 public datasets covering diverse populations. Finally, the top 50 bacterial species contributing to the model were correlated with disease phenotypes to aid interpretation of the model.
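The species-selection step described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the summary does not say how the abundance threshold was applied, so the sketch assumes it refers to mean relative abundance across subjects, and the function name, data layout, and toy values are all hypothetical.

```python
from statistics import mean


def select_species(abundance, min_mean_abundance=0.0015, min_prevalence=0.05):
    """Return the species that pass both the abundance and prevalence filters.

    abundance: dict mapping species name -> list of relative abundances
               (fractions in [0, 1]), one value per subject.
    Assumption: "relative abundance > 0.15%" is read as the MEAN across
    subjects; the source summary does not specify this.
    """
    kept = []
    for species, values in abundance.items():
        # Prevalence = fraction of subjects in which the species is detected.
        prevalence = sum(v > 0 for v in values) / len(values)
        if mean(values) > min_mean_abundance and prevalence > min_prevalence:
            kept.append(species)
    return kept


# Toy data: four hypothetical species measured in 40 subjects.
toy = {
    "species_a": [0.02] * 40,               # abundant and present everywhere
    "species_b": [0.0001] * 40,             # present everywhere but too rare
    "species_c": [0.5] + [0.0] * 39,        # high mean, but only 2.5% prevalence
    "species_d": [0.01] * 10 + [0.0] * 30,  # mean 0.25%, prevalence 25%
}
selected = select_species(toy)
print(selected)
```

Only the species that clear both thresholds are kept, which is how a filter of this kind would reduce 1208 detected species to a smaller feature set before model training.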
Key Findings
The random forest (RF) multi-class model predicted multiple diseases from fecal microbiome profiles with high accuracy. In the independent test set, the mean AUROC ranged from 0.90 to 0.99 (IQR 0.91–0.94) across disease phenotypes. Sensitivity ranged from 0.81 to 0.95 (IQR 0.87–0.93) at a specificity of 0.76 to 0.98 (IQR 0.83–0.95). Performance was consistent across different data splits and age strata. Validation on independent public datasets (1597 samples) yielded a mean AUROC of 0.69 to 0.91 (IQR 0.79–0.87), indicating that the model generalizes across populations. Analysis of the top 50 bacterial species revealed 363 significant associations with disease phenotypes. Many diseases showed decreased abundance of Firmicutes or Actinobacteria and increased abundance of Bacteroidetes compared to healthy controls. Disease-specific microbial signatures were also identified; for example, *Parvimonas micra* was significantly more abundant in CRC patients than in those with colorectal adenomas. The model accurately classified COVID-19 recovered patients as healthy and showed high specificity for the nine phenotypes studied, with low misclassification for unrelated diseases.
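To make the reported per-phenotype metrics concrete, the following Python sketch computes one-vs-rest AUROC, sensitivity, and specificity from multi-class predicted probabilities. It is an illustration under stated assumptions rather than the study's evaluation code: the predicted label is taken as the highest-probability class (the summary does not state the decision rule), and the class labels, probabilities, and function names are invented for the example.

```python
def auroc(scores, is_positive):
    """AUROC as the probability that a random positive sample is scored
    above a random negative sample (ties count as 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_metrics(y_true, proba, classes):
    """Per-class AUROC, sensitivity and specificity.

    proba: list of dicts mapping class -> predicted probability.
    Assumption: the predicted label is the class with the highest probability.
    """
    y_pred = [max(classes, key=lambda c: p[c]) for p in proba]
    metrics = {}
    for c in classes:
        truth = [t == c for t in y_true]
        called = [yp == c for yp in y_pred]
        tp = sum(t and k for t, k in zip(truth, called))
        fn = sum(t and not k for t, k in zip(truth, called))
        tn = sum(not t and not k for t, k in zip(truth, called))
        fp = sum(not t and k for t, k in zip(truth, called))
        metrics[c] = {
            "auroc": auroc([p[c] for p in proba], truth),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return metrics


# Toy example with three hypothetical classes and six samples.
classes = ["healthy", "CRC", "CD"]
y_true = ["healthy", "healthy", "CRC", "CRC", "CD", "CD"]
proba = [
    {"healthy": 0.7, "CRC": 0.2, "CD": 0.1},
    {"healthy": 0.4, "CRC": 0.5, "CD": 0.1},  # healthy sample called CRC
    {"healthy": 0.2, "CRC": 0.6, "CD": 0.2},
    {"healthy": 0.1, "CRC": 0.8, "CD": 0.1},
    {"healthy": 0.2, "CRC": 0.1, "CD": 0.7},
    {"healthy": 0.3, "CRC": 0.3, "CD": 0.4},
]
metrics = one_vs_rest_metrics(y_true, proba, classes)
for name, m in metrics.items():
    print(name, m)
```

In the toy data one healthy sample is called CRC, so CRC specificity drops to 0.75 and healthy sensitivity to 0.5 while every AUROC stays at 1.0. This shows why a threshold-free AUROC and threshold-dependent sensitivity and specificity are reported together.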
Discussion
This study demonstrates the feasibility of a fecal microbiome-based multi-class model for disease diagnosis. The high performance and generalizability of the RF model, coupled with its non-invasive nature, offer potential clinical applications for disease screening, risk assessment, and treatment response monitoring. The identification of shared and disease-specific microbial signatures contributes to a deeper understanding of the microbiome's role in disease pathogenesis. However, further research is needed to fully elucidate the mechanisms underlying the observed microbiome-phenotype associations.
Conclusion
This research presents the largest single-site fecal microbiome dataset to date, encompassing multiple disease phenotypes, and a high-performing machine learning multi-class model for disease classification. This non-invasive approach holds promise for clinical applications in disease diagnostics and treatment response monitoring. Future work should focus on expanding the disease spectrum, investigating the underlying mechanisms, and validating the model in diverse clinical settings.
Limitations
The study has some limitations. The disease spectrum is limited, and inclusion of more phenotypes could improve the model's diagnostic capabilities. Further research is needed to establish the biological mechanisms underlying the identified microbiome-phenotype associations. The public datasets used for validation lacked detailed information on comorbidities and antibiotic use, potentially affecting model performance. Finally, while the model predicts probabilities for multiple diseases simultaneously, this aspect warrants further investigation and validation.