Large Language Models (LLMs) are increasingly proposed as tools for mental health research and applications, from early risk detection to clinical decision support. Yet, despite rapid progress, a fundamental question remains largely unanswered: How well do modern LLMs actually perform across core psychiatric tasks, and what limits their deployment in sensitive mental health settings?
In our paper, “A Comprehensive Evaluation of Large Language Models on Mental Illnesses,” we present the largest and most systematic evaluation to date of LLMs in mental health contexts. We assess 33 state-of-the-art models, spanning open and closed systems and ranging from 2B to over 405B parameters, across multiple clinically relevant tasks using human-annotated datasets.
Why This Study
Mental health disorders represent a growing global burden, affecting hundreds of millions of people worldwide. At the same time, access to qualified mental health professionals remains limited. AI, and LLMs in particular, offers the promise of scalable support, from screening and triage to education and clinical assistance.
Prior work has demonstrated encouraging results, but existing evaluations suffer from three major gaps:
1. Limited scope: Most studies evaluate only a small number of models, often older generations such as GPT‑3.5.
2. Inconsistent prompting: Prompt engineering is rarely examined systematically, despite its known impact on performance.
3. Reproducibility challenges: Many evaluations lack transparent reporting of prompts, sampling strategies, or variability across runs.
Our work directly addresses these gaps by providing a broad, transparent, and reproducible evaluation of modern LLMs on mental health tasks.
What We Evaluated
We evaluated LLMs across three core task categories that reflect common use cases in mental health research and applications:
1. Binary Disorder Detection (Zero-Shot)
Models were asked to determine whether a social media post indicates the presence of a specific mental health condition, such as depression, suicide risk, or stress, without seeing any labeled examples.
This task measures the model’s inherent psychiatric understanding based solely on its pretraining and instructions.
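To make the zero-shot setup concrete, here is a minimal sketch of how a binary detection prompt could be built and how a model's reply could be mapped to a label. The template wording and the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_binary_answer` are illustrative assumptions, not the prompts or code used in the paper.

```python
import re
from typing import Optional

# Hypothetical template: the wording is an illustration, not the paper's prompt.
PROMPT_TEMPLATE = (
    "Read the following social media post.\n"
    "Post: {post}\n"
    "Does the author show signs of {condition}? "
    "Answer with Yes or No only."
)


def build_prompt(post: str, condition: str) -> str:
    """Fill the zero-shot template; no labeled examples are included."""
    return PROMPT_TEMPLATE.format(post=post.strip(), condition=condition)


def parse_binary_answer(response: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', and None for anything else."""
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    if match is None:
        return None  # treated as an invalid response
    return match.group(1).lower() == "yes"
```

Returning `None` rather than guessing keeps invalid or off-format replies visible, so they can be counted separately from wrong answers.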
2. Disorder Severity Evaluation (Zero-Shot and Few-Shot)
Beyond detection, we evaluated whether models could assess severity levels (e.g., mild to severe) from social media text.
We compared zero-shot prompting with few-shot prompting, where models were given a small number of labeled examples before inference. This allowed us to quantify how much contextual guidance improves performance on more nuanced clinical judgments.
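The difference between the two settings can be sketched as a single prompt builder in which the zero-shot case is simply the few-shot case with no examples. The 0 to 3 scale, the instruction wording, and the function name `build_severity_prompt` are assumptions made for illustration; the paper's actual scales and prompts may differ.

```python
from typing import Sequence, Tuple

# Hypothetical instruction and scale, chosen only for illustration.
INSTRUCTION = (
    "Rate the severity of depression expressed in the post on a scale "
    "from 0 (none) to 3 (severe). Answer with a single number only."
)


def build_severity_prompt(
    post: str, examples: Sequence[Tuple[str, int]] = ()
) -> str:
    """Build a severity prompt; an empty `examples` gives the zero-shot case."""
    blocks = [INSTRUCTION]
    # Each labeled example is shown in the same format the model must complete.
    for text, label in examples:
        blocks.append(f"Post: {text}\nSeverity: {label}")
    # The target post comes last, with the label left open for the model.
    blocks.append(f"Post: {post}\nSeverity:")
    return "\n\n".join(blocks)
```

Keeping the example format identical to the final, unanswered item is a common design choice, since it shows the model exactly what kind of completion is expected.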
3. Psychiatric Knowledge Assessment
To test foundational psychiatric knowledge, models answered multiple-choice questions drawn from medical exam datasets, focusing on psychiatry-related content.
This task evaluates factual accuracy rather than interpretation of personal narratives.
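Scoring a multiple-choice task of this kind reduces to extracting an option letter from each reply and comparing it with the answer key. The sketch below assumes four options labeled A to D and a deliberately strict parser that accepts only a bare letter; both the parsing rule and the function names are assumptions, not the paper's scoring code.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Accept replies such as 'B', '(b)', or 'b.'; return None otherwise."""
    pattern = rf"\s*\(?([{options}])\)?[.:]?\s*"
    match = re.fullmatch(pattern, response, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of correct answers; unparseable replies count as wrong."""
    if len(responses) != len(gold) or not gold:
        raise ValueError("responses and gold must be non-empty and equal in length")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

A strict parser will reject verbose but correct replies, so it works best together with a prompt that asks for the letter only. A more lenient parser is possible, at the risk of misreading words such as the article "a" as an answer.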
Data: Prioritizing Human Annotation
A key principle of our study was data quality. We deliberately avoided weakly labeled datasets (e.g., posts inferred from subreddit membership or self-declared diagnoses) whenever possible.
Instead, we prioritized human-annotated datasets, labeled by:
● Clinical professionals,
● Trained crowdworkers,
● Or expert-guided volunteers.
Our experiments focused on depression, suicide risk, and stress, conditions for which sufficiently large, high-quality datasets are available. In total, we evaluated models across six primary datasets, supplemented by a comprehensive survey of existing mental health datasets in the literature.
Models and Scale
We evaluated 33 LLMs, including:
● Proprietary models (e.g., GPT‑4, GPT‑4o, Claude 3.5, Gemini),
● Open-weight models (e.g., Llama 2, Llama 3, Llama 3.1, Gemma, Mistral, Phi),
● Models spanning more than two orders of magnitude in parameter count.
This breadth allowed us to analyze how scale, architecture, and accessibility interact with performance and cost, an essential consideration for real-world deployment.
Prompt Engineering Matters a Lot
Rather than treating prompts as a fixed detail, we systematically evaluated multiple prompt templates for each task.
Key insights include:
● Structured prompts consistently outperformed vague or open-ended instructions.
● Explicit constraints (e.g., “Answer with Yes or No only”) significantly reduced invalid responses.
● For severity estimation, few-shot prompting reduced mean absolute error by up to 1.3 points, particularly for smaller models such as Phi‑3‑mini.
● Simple techniques such as repeating output-format instructions yielded measurable gains for certain model families.
These findings highlight prompt design as a first-class component of LLM evaluation, not an afterthought.
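Two of the quantities mentioned above, mean absolute error for severity estimation and the rate of invalid responses, can be computed together as in the following sketch. It assumes that unparseable replies have already been mapped to `None` and that they are excluded from the error and reported separately; how the paper treats invalid responses is not stated here, so this handling is an assumption.

```python
from typing import Optional, Sequence, Tuple


def mae_and_invalid_rate(
    predictions: Sequence[Optional[int]], gold: Sequence[int]
) -> Tuple[Optional[float], float]:
    """Return (MAE over valid predictions, fraction of invalid predictions)."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    # Keep only the items for which the model produced a usable severity score.
    valid = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    invalid_rate = 1 - len(valid) / len(gold)
    # MAE is undefined when every response was invalid.
    mae = sum(abs(p - g) for p, g in valid) / len(valid) if valid else None
    return mae, invalid_rate


# Toy data, invented for illustration only.
mae, invalid = mae_and_invalid_rate([2, None, 0, 3], [1, 2, 0, 1])
print(mae, invalid)  # 1.0 0.25
```

Reporting the two numbers side by side matters when comparing prompts: a template can appear to lower the error simply because the hardest items were the ones left unanswered.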
Key Findings
● Top-tier models perform best, but unevenly. Models such as GPT‑4 and Claude 3.5 achieved up to about 85% accuracy in disorder detection and over 90% accuracy on psychiatric knowledge tasks.
● Few-shot prompting matters. Providing a small number of examples significantly improved severity estimation, reducing error by up to 1.3 points, especially for smaller models.
● Prompt design is critical. Structured prompts and explicit output constraints consistently reduced errors and invalid responses.
● Safety filters limit evaluation. Some models refused to answer sensitive prompts, creating blind spots in benchmarking and highlighting tension between safety and measurability.
● Cost and access shape feasibility. High-performing models are often expensive, while smaller or open models offer practical trade-offs for experimentation and deployment.
Why It Matters
Our results show that while LLMs demonstrate real promise in mental health contexts, their performance is highly sensitive to prompting, task design, and safety constraints. Strong results on one task do not guarantee reliability across others.
Responsible deployment in mental health will require:
● Transparent evaluation,
● Careful prompt engineering,
● High-quality data,
● And close collaboration with clinical experts.
