The Evidence Gap in Generative AI for Mental Health: Widespread Adoption Meets Limited Data
Source Publication
Primary Authors: Gaus, Gross, Korman et al.

Millions of people currently use Large Language Models (LLMs) as a de facto public health intervention, yet the empirical foundation supporting this shift remains surprisingly thin. A recent scoping review examining 132 studies from 2017 to 2025 indicates that scientific validation has not kept pace with commercial deployment.
The review focused on transformer-based LLMs used to deliver or analyse mental health support. While the volume of research is substantial, methodological rigour varies widely. Of the 36 client-facing studies involving human participants, the median sample size was just 42, a statistically fragile figure. Furthermore, 26 of these trials were uncontrolled; without a control group, attributing observed improvements specifically to the technology is methodologically unsound.
Evaluating Generative AI for Mental Health
The metrics used in these studies often fail to address clinical reality. Rather than measuring clinical outcomes, such as a tangible reduction in depressive symptoms, 23 of the 36 studies relied on user-experience metrics. Do users find the chatbot engaging? Is the interface intuitive? While relevant for product design, these data points do not constitute medical evidence.
Perhaps most concerning is the composition of the study cohorts. In 35 of the 36 human studies, participants did not have a diagnosed mental disorder. Consequently, generalising these findings to clinical populations—the very groups these tools aim to help—is fraught with risk. The efficacy of generative AI for mental health in treating actual pathology remains largely untested in the current literature.
Safety protocols also appear dangerously scarce. The review found that only 18% of studies implemented mechanisms for risk detection, and protocols for handling potentially harmful content appeared in just 16% of the research. While two high-quality controlled studies did show promising effect sizes (Cohen's d = 0.44 to 0.90), they stand as exceptions. The current body of evidence suggests potential, but it does not yet support the unregulated, real-world deployment currently seen in the market.
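To make the reported effect sizes concrete: Cohen's d is the standardised difference between two group means, expressed in units of the pooled standard deviation, where roughly 0.2 is conventionally "small", 0.5 "medium", and 0.8 "large". The sketch below uses entirely hypothetical numbers, not data from the review, simply to show the calculation.

```python
import math

def cohens_d(mean_treat, mean_ctrl, sd_treat, sd_ctrl, n_treat, n_ctrl):
    """Cohen's d: difference in group means divided by the pooled SD."""
    pooled_var = ((n_treat - 1) * sd_treat**2 + (n_ctrl - 1) * sd_ctrl**2) \
                 / (n_treat + n_ctrl - 2)
    return (mean_treat - mean_ctrl) / math.sqrt(pooled_var)

# Hypothetical depression-scale scores (lower = better); NOT from the review.
d = cohens_d(mean_treat=12.0, mean_ctrl=16.0,
             sd_treat=5.0, sd_ctrl=5.0,
             n_treat=50, n_ctrl=50)
print(round(d, 2))  # -0.8: a "large" effect favouring the treatment group
```

A d of 0.44 to 0.90, as reported for the two controlled studies, would thus span moderate to large effects, but the thin surrounding evidence base is what limits the conclusion.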