Microbiome disease prediction: AI foundation models fail to beat classical baselines
Source Publication: Springer Science and Business Media LLC
Primary Authors: Mu, Tang, Chen

Researchers have completed a massive benchmarking study of microbiome disease prediction, revealing that classical machine learning algorithms still outperform modern AI foundation models. Such a head-to-head comparison is exceptionally difficult to run fairly, because microbiome data are notoriously sparse and baseline differences between separate patient groups often obscure true biological signals. The sheer scale of this evaluation, spanning 83 public cohorts and 20 diseases, provides unprecedented clarity on the current limits of artificial intelligence in biology.
The challenge of microbiome disease prediction
Predicting illness from biological data relies on identifying patterns among the trillions of microbes inhabiting the human body. Translating these patterns into reliable clinical diagnostics is notoriously difficult. Data from different clinical sites often look completely different due to substantial inter-study heterogeneity and severe domain shifts between cohorts.
Scientists recently theorised that the same artificial intelligence architecture powering modern text models could smooth out these inconsistencies. Foundation models are designed to learn broad representations from vast datasets, which could theoretically improve robustness when tested on new patient groups.
Benchmarking the algorithms
In this large-scale systematic evaluation, researchers compared classic machine learning techniques against newer foundation models. They measured predictive performance using both 16S rRNA sequencing and shotgun metagenomics. The traditional methods included regularised logistic regression and random forests, which rely on standard numerical data representations.
The investigators compared these against GPT-derived semantic embeddings, a general tabular foundation model called TabPFN, and a microbiome-specific foundation model known as MGM. The GPT-derived embeddings consistently underperformed the traditional numerical representations. TabPFN achieved strong out-of-the-box performance and competitive cross-cohort robustness, but it did not consistently beat well-tuned classical algorithms on held-out cohorts.
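Cross-cohort robustness of this kind is typically measured with a leave-one-cohort-out protocol: train on every study but one, then test on the held-out study so that the model faces a genuinely unseen patient population. A minimal sketch of such a splitter (the function name and data are illustrative, not from the paper):

```python
def leave_one_cohort_out(cohort_labels):
    """Yield (held_out, train_indices, test_indices) for each cohort.

    cohort_labels gives one study/cohort identifier per sample. Each
    split trains on all other cohorts, mimicking deployment on a
    completely new clinical site.
    """
    for held_out in sorted(set(cohort_labels)):
        test = [i for i, c in enumerate(cohort_labels) if c == held_out]
        train = [i for i, c in enumerate(cohort_labels) if c != held_out]
        yield held_out, train, test

# Three cohorts -> three train/test splits, one per held-out study.
cohorts = ["A", "A", "B", "C", "B"]
splits = list(leave_one_cohort_out(cohorts))
```

Averaging a metric such as AUROC over these splits gives the cross-cohort numbers on which the classical baselines and foundation models were judged.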
What this preliminary research does not solve
This early-stage study makes it clear that researchers cannot simply plug biological sequence data into an AI framework and expect immediate clarity. The research does not solve the fundamental problem of inter-study heterogeneity, as standard batch-effect correction methods provided limited and non-uniform improvements. Furthermore, the microbiome-specific model (MGM) lagged behind standard baselines, suggesting that current pretraining is restricted in scope: its reliance on broad genus-level resolution lacks the biological granularity needed to capture fine disease signals.
Future implications for microbial analysis
The findings suggest that developers may need to rethink how they train artificial intelligence on biological data. Building better diagnostic tools will require more than just adopting the latest algorithmic architectures. For future models to succeed, developers will likely need to:
- Increase the overall scale of pretraining data.
- Improve the taxonomic resolution to identify specific bacterial strains rather than broad genera.
- Design new architectures capable of translating pretraining into reliable cross-study generalisation.
Until these technical hurdles are cleared, classical machine learning remains a formidable standard for assessing microbial data. Foundation models are not purely theoretical, however: models like TabPFN already offer modest gains and strong out-of-the-box utility, pointing towards a promising, if challenging, future for foundation models in biology.