Medicine & Health · 16 December 2025

The Evidence Gap in Generative AI for Mental Health: Widespread Adoption Meets Limited Data

Source Publication

Primary Authors: Gaus, Gross, Korman et al.

Millions currently utilise Large Language Models (LLMs) as a de facto public health intervention, yet the empirical foundation supporting this shift remains surprisingly thin. A recent scoping review examining 132 studies from 2017 to 2025 indicates that scientific validation has not kept pace with commercial deployment.

The review focused on transformer-based LLMs used to deliver or analyse mental health support. While the volume of research is substantial, the methodological rigour varies considerably. Of the 36 client-facing studies involving human participants, the median sample size was merely 42, a figure too small to detect anything but large effects reliably. Furthermore, 26 of these trials were uncontrolled; without a control group, attributing observed improvements specifically to the technology is methodologically unsound.
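A back-of-the-envelope power calculation makes the fragility concrete. The sketch below is illustrative only: it assumes a hypothetical two-arm trial that splits the median sample of 42 evenly and targets a medium effect (Cohen's d = 0.5), neither of which is specified in the review.

```python
# Illustrative power calculation: what can a trial with the review's
# median sample (n = 42) actually detect? Assumes a hypothetical
# two-arm design with 21 participants per arm and alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Statistical power to detect a medium effect (Cohen's d = 0.5)
power = analysis.power(effect_size=0.5, nobs1=21, ratio=1.0, alpha=0.05)
print(f"Power with 21 per arm at d = 0.5: {power:.2f}")  # roughly 0.35

# Per-arm sample size needed to reach the conventional 80% power
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"Per-arm n required for 80% power: {n_needed:.0f}")  # roughly 64
```

Under these assumptions, a study of this size would miss a genuine medium-sized benefit roughly two times out of three, and would need about three times as many participants per arm to reach the conventional 80% power threshold.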

Evaluating Generative AI for Mental Health

The metrics utilised in these studies often fail to address clinical reality. Rather than measuring clinical outcomes, such as a tangible reduction in depressive symptoms, 23 out of 36 studies relied on user-experience metrics. Do users find the chatbot engaging? Is the interface intuitive? While relevant for product design, these data points do not constitute medical evidence.

Perhaps most concerning is the composition of the study cohorts. In 35 of the 36 human studies, participants did not have a diagnosed mental disorder. Consequently, generalising these findings to clinical populations—the very groups these tools aim to help—is fraught with risk. The efficacy of generative AI for mental health in treating actual pathology remains largely untested in the current literature.

Safety protocols also appear dangerously scarce. The review found that only 18% of studies implemented mechanisms for risk detection, and protocols for handling potentially harmful content appeared in just 16% of the research. While two high-quality controlled studies did show promising effect sizes (d = 0.44 to 0.90), they stand as exceptions. The current body of evidence suggests potential, but it does not yet support the unregulated, real-world implementation currently seen in the market.
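For readers unfamiliar with the metric: Cohen's d expresses the difference between two group means in units of their pooled standard deviation,

$$
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}.
$$

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), the reported range of 0.44 to 0.90 spans medium to large effects, a genuinely promising result that nonetheless rests on only two well-controlled studies.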

Cite this Article (Harvard Style)

Gaus et al. (2025) 'The Evidence Gap in Generative AI for Mental Health: Widespread Adoption Meets Limited Data'. PsyArXiv [Preprint]. Available at: https://doi.org/10.31234/osf.io/n7qep_v2

Source Transparency

This intelligence brief was synthesised by The Synaptic Report's autonomous pipeline. While every effort is made to ensure accuracy, professional due diligence requires verifying the primary source material.

Tags: Psychiatry · Artificial Intelligence · Can AI chatbots safely treat mental disorders? · Risks of using LLMs for mental health therapy