Microbiome disease prediction: Why new AI models struggle to beat classical algorithms in early study
Source Publication: Springer Science and Business Media LLC
Primary Authors: Mu, Tang, Chen

Researchers evaluating microbiome disease prediction have found that the latest AI foundation models offer little advantage over older, classical machine learning techniques. Accurate prediction from gut bacteria is notoriously difficult because microbiome data is highly sparse and varies widely between clinical studies.
The challenge of microbiome disease prediction
For years, scientists have tried to use gut bacteria to diagnose illness. The established approach relies on classical machine learning models such as random forests and regularised logistic regression. These algorithms require careful manual tuning to handle the noisy, inconsistent data generated by bacterial sequencing.
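To illustrate the kind of manual tuning involved, a common preprocessing step for sparse abundance tables is to drop taxa that appear in too few samples before fitting a classifier. The data layout, taxon names, and threshold below are illustrative assumptions, not taken from the study:

```python
# Drop taxa present in too few samples -- a typical manual
# preprocessing step for sparse microbiome abundance tables.
def filter_rare_taxa(abundances, min_prevalence=0.2):
    """abundances: list of samples, each a dict of taxon -> relative abundance."""
    n = len(abundances)
    counts = {}
    # Count how many samples each taxon is present (non-zero) in.
    for sample in abundances:
        for taxon, value in sample.items():
            if value > 0:
                counts[taxon] = counts.get(taxon, 0) + 1
    keep = {t for t, c in counts.items() if c / n >= min_prevalence}
    return [{t: v for t, v in sample.items() if t in keep}
            for sample in abundances]

# Hypothetical samples: most taxa are absent from most samples.
samples = [
    {"Bacteroides": 0.6, "Prevotella": 0.0, "Akkermansia": 0.1},
    {"Bacteroides": 0.5, "Prevotella": 0.2, "Akkermansia": 0.0},
    {"Bacteroides": 0.7, "Prevotella": 0.0, "Akkermansia": 0.0},
]
filtered = filter_rare_taxa(samples, min_prevalence=0.5)
```

Here only taxa seen in at least half the samples survive the filter; choosing that threshold is exactly the sort of per-dataset tuning the classical pipelines demand.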
Recently, researchers questioned whether foundation models could bypass these hurdles and improve cross-cohort generalisation. A new preprint, currently awaiting peer review, systematically tested this assumption by comparing classical baselines against modern AI paradigms.
Testing the new algorithms
The investigators benchmarked these approaches across 83 public case-control cohorts spanning 20 different diseases. The source data was profiled using both 16S rRNA sequencing and shotgun metagenomics, providing a characteristically sparse, compositional dataset for evaluation.
The study measured the predictive accuracy of several distinct models using intra-cohort cross-validation, cross-cohort transfer, and leave-one-study-out validation:
- Standard numerical feature representations and classical baselines.
- GPT-derived semantic embeddings based on text representations.
- TabPFN, a general-purpose tabular foundation model.
- MGM, a foundation model specifically designed for microbiome data.
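The strictest of the three evaluation schemes, leave-one-study-out validation, can be sketched as follows: each cohort in turn serves as the held-out test set while the model trains on all remaining cohorts. The cohort labels below are hypothetical:

```python
# Leave-one-study-out validation: hold out each cohort in turn as the
# test set and train on the samples from every other cohort.
def leave_one_study_out(study_ids):
    """study_ids: one cohort label per sample.

    Yields (held_out_cohort, train_indices, test_indices).
    """
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test

# Six hypothetical samples drawn from three studies.
studies = ["A", "A", "B", "B", "C", "C"]
splits = list(leave_one_study_out(studies))
```

Because the test cohort contributes nothing to training, this scheme directly measures the cross-cohort generalisation that the study found foundation models fail to deliver.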
The results were sobering. GPT-derived text embeddings consistently underperformed standard numerical data. The general-purpose TabPFN model showed strong out-of-the-box performance, but it failed to consistently outpace well-tuned classical baselines. The microbiome-specific MGM model generally lagged behind the older methods, showing high disease dependency.
What this preliminary research does not solve
This early-stage research highlights a persistent barrier: inter-study heterogeneity. The new foundation models do not yet solve the problem of data generalisation, meaning an algorithm trained on patients in one hospital often fails when tested on patients in another.
Furthermore, the study suggests that current microbiome-specific pretraining at the genus level is simply not detailed enough. Bacteria within the same genus can have vastly different functions, so models may require species or strain-level resolution to detect meaningful clinical signals. Standard batch-effect correction methods also provided limited and uneven improvements across the evaluated cohorts, leaving scientists without a reliable way to harmonise data across different laboratories.
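One simple form of batch-effect correction, centring each feature within each study to remove cohort-specific offsets, can be sketched as below. This is an illustrative procedure only; the preprint evaluates standard correction methods, not this exact code:

```python
import statistics

# Per-cohort centring: subtract each study's mean from its samples so
# that cohort-specific offsets ("batch effects") are removed for one
# feature. Illustrative only -- not the study's exact method.
def center_per_study(values, study_ids):
    """values: one feature value per sample; study_ids: cohort label per sample."""
    means = {
        s: statistics.mean(v for v, sid in zip(values, study_ids) if sid == s)
        for s in set(study_ids)
    }
    return [v - means[s] for v, s in zip(values, study_ids)]

# Two hypothetical cohorts whose measurements sit on different scales.
corrected = center_per_study([1.0, 3.0, 10.0, 12.0], ["A", "A", "B", "B"])
# After centring, both cohorts share a mean of zero.
```

Even when such corrections align cohort means, they cannot remove deeper technical differences (sequencing platform, extraction protocol), which is consistent with the limited and uneven gains the study reports.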
Future outlook
These findings offer a rigorous reality check for the field. While foundation models excel in text and image processing, biological data presents a distinct structural challenge that current architectures struggle to exploit. The older methods, despite their need for manual tuning, still capture the complex compositional nature of the microbiome more effectively than off-the-shelf AI.
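The compositional structure mentioned above (relative abundances that sum to one) is commonly handled with log-ratio transforms before classical models are applied. A minimal sketch of the centred log-ratio (CLR) transform, with an assumed small pseudocount to handle the many zeros in sparse microbiome data, might look like:

```python
import math

# Centred log-ratio (CLR) transform: maps a composition (parts of a
# whole) into unconstrained space by taking logs relative to the
# geometric mean. A small pseudocount replaces zeros, since log(0)
# is undefined. The pseudocount value here is an assumption.
def clr(composition, pseudocount=1e-6):
    shifted = [x + pseudocount for x in composition]
    log_vals = [math.log(x) for x in shifted]
    geom_mean_log = sum(log_vals) / len(log_vals)
    return [lv - geom_mean_log for lv in log_vals]

# A hypothetical four-taxon composition with one absent taxon.
transformed = clr([0.7, 0.2, 0.1, 0.0])
# CLR values always sum to zero (up to floating-point error).
```

Transforms like this are part of the careful feature engineering that lets well-tuned classical pipelines respect the data's compositional constraints, something generic pretrained architectures do not do out of the box.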
Moving forward, developers may need to drastically increase the scale of pretraining and improve taxonomic resolution to make these models viable. Until those advances materialise, standard numerical representations and classical machine learning algorithms remain the most robust tools available.