In PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry, the authors examine how well large language models handle the kind of nuanced, safety-critical reasoning required in psychiatric settings. Clinical reasoning is central to psychiatric practice, requiring clinicians to interpret subjective symptoms, integrate patient history, and assess risk under uncertainty. As large language models advance, a key question is how well they can perform clinically grounded reasoning in mental health contexts. While recent models perform well on general reasoning and medical benchmarks, these evaluations often fail to reflect the ambiguity and safety-critical nature of psychiatry. To address this gap, the paper introduces PsychiatryBench, a multi-task benchmark that evaluates diagnostic reasoning, symptom interpretation, and risk awareness using realistic psychiatric case vignettes.

Introducing PsychiatryBench

Psychiatric reasoning requires interpreting subjective symptoms, integrating patient history, and assessing risk under uncertainty. As large language models advance, a key question is how well they can perform this kind of reasoning in mental health settings.

Although recent models achieve strong results on general and medical benchmarks, these evaluations often fail to capture the ambiguity and safety-critical nature of psychiatry. We introduce PsychiatryBench, a multi-task benchmark that evaluates diagnostic reasoning, symptom interpretation, and risk awareness using clinically grounded psychiatric case vignettes, providing a clearer view of model capabilities and limitations in mental health contexts.

What PsychiatryBench measures and how we built it

PsychiatryBench is designed to evaluate how large language models reason in psychiatric contexts, not just whether they produce correct final answers. The benchmark focuses on clinically meaningful behaviors that are central to mental health assessment and decision-making.

What PsychiatryBench measures

PsychiatryBench evaluates models across multiple psychiatric reasoning dimensions, including:

● Diagnostic reasoning: identifying plausible psychiatric diagnoses and supporting them with appropriate clinical justification.

● Differential diagnosis: considering alternative explanations and reasoning about why they are less likely.

● Symptom interpretation: mapping narrative descriptions to psychiatric constructs and symptom clusters.

● Risk awareness: recognizing indicators of suicide, self-harm, or harm to others and responding appropriately.

● Clinical appropriateness: avoiding unsafe, unethical, or misleading recommendations.

These dimensions are evaluated jointly to capture both reasoning quality and safety-relevant behavior.

How we built PsychiatryBench

PsychiatryBench is built from a curated set of psychiatric case vignettes inspired by real clinical training materials and published casebooks. Each case includes a presenting complaint, relevant psychiatric history, and mental status information and is intentionally designed to include ambiguity or incomplete information.

Cases span a broad range of psychiatric conditions, including mood disorders, psychotic disorders, anxiety disorders, trauma-related disorders, and neurocognitive conditions. Rather than optimizing for diagnostic clarity, cases are constructed to reflect the uncertainty and complexity encountered in real clinical practice.
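As a rough illustration of the case structure described above, a vignette could be represented as a small record holding the three components each case includes. This is only a sketch: the class name, field names, `withheld_details`, and the example case are hypothetical and invented for illustration; they are not the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CaseVignette:
    """Hypothetical container for one case; not the benchmark's real schema."""
    case_id: str
    domain: str  # e.g. "mood", "psychotic", "anxiety", "trauma", "neurocognitive"
    presenting_complaint: str
    psychiatric_history: str
    mental_status: str
    # Details deliberately left out of the case to preserve ambiguity.
    withheld_details: Tuple[str, ...] = ()

    def to_prompt(self, task_instruction: str) -> str:
        """Render the case and a task instruction as plain prompt text."""
        sections = [
            f"Presenting complaint: {self.presenting_complaint}",
            f"Psychiatric history: {self.psychiatric_history}",
            f"Mental status: {self.mental_status}",
            f"Task: {task_instruction}",
        ]
        return "\n".join(sections)


# Invented example case, for illustration only.
example = CaseVignette(
    case_id="demo-001",
    domain="mood",
    presenting_complaint="Six weeks of low mood and poor sleep.",
    psychiatric_history="One prior depressive episode; no record of past mania.",
    mental_status="Flat affect, slowed speech, oriented to time and place.",
    withheld_details=("substance use", "family history"),
)
prompt = example.to_prompt("List the most likely diagnosis and two alternatives.")
print(prompt)
```

Keeping the withheld details on the record but out of the prompt is one way to document which information was intentionally left incomplete.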

Model responses are evaluated using structured scoring rubrics that assess clinical accuracy, reasoning coherence, completeness, and safety. This approach allows us to distinguish between models that arrive at correct conclusions by chance and those that demonstrate clinically grounded reasoning.
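To make the rubric idea concrete, the sketch below combines the four rubric dimensions into one score. The 0 to 1 scale, the unweighted mean, and the rule that a low safety score caps the overall result are all assumptions made for illustration; the source does not specify how rubric dimensions are aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScore:
    """Hypothetical per-response rubric; each dimension is scored in [0, 1]."""
    clinical_accuracy: float
    reasoning_coherence: float
    completeness: float
    safety: float

    def overall(self, safety_floor: float = 0.5, unsafe_cap: float = 0.25) -> float:
        """Mean of the four dimensions, capped when safety is below the floor.

        The floor and cap values are illustrative assumptions, not benchmark
        parameters.
        """
        values = (
            self.clinical_accuracy,
            self.reasoning_coherence,
            self.completeness,
            self.safety,
        )
        for value in values:
            if not 0.0 <= value <= 1.0:
                raise ValueError("rubric scores must lie in [0, 1]")
        mean = sum(values) / len(values)
        if self.safety < safety_floor:
            # An unsafe answer cannot score well, however accurate it is.
            return min(mean, unsafe_cap)
        return mean


accurate_but_unsafe = RubricScore(1.0, 1.0, 1.0, 0.0)
moderate_and_safe = RubricScore(0.5, 0.5, 0.5, 0.5)
print(accurate_but_unsafe.overall(), moderate_and_safe.overall())
```

Under these assumed settings, a response that is accurate but unsafe scores below a mediocre but safe one, which is the kind of separation a safety-aware rubric is meant to produce.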

Task development pipeline

We developed PsychiatryBench tasks through an iterative, clinically grounded pipeline designed to reflect real psychiatric reasoning while maintaining consistency and evaluability.

First, we identified core psychiatric reasoning behaviors that are essential to clinical decision-making, including diagnostic formulation, differential consideration, symptom interpretation, and risk assessment. These behaviors guided the selection of task types and evaluation criteria.

Next, we constructed task prompts around realistic psychiatric case vignettes. Each task is designed to isolate a specific reasoning challenge while preserving the clinical context needed for meaningful evaluation. Prompts are structured to minimize ambiguity in task instructions without simplifying the underlying clinical complexity.

We then iteratively reviewed and refined tasks to ensure clinical plausibility, clarity, and coverage across psychiatric domains. This process included removing tasks that relied on surface-level pattern matching or that could be solved without genuine clinical reasoning.

Finally, we standardized task formats and scoring rubrics to enable consistent evaluation across models. This ensures that differences in performance reflect model behavior rather than prompt design or evaluation artifacts.

How we grade model performance

We grade models on PsychiatryBench using a combination of quantitative metrics and structured evaluation of open-ended responses. Classification tasks are scored using standard metrics such as F1 score and subset accuracy, while open-ended clinical reasoning is assessed based on accuracy, reasoning coherence, completeness, and clinical appropriateness. Safety is integrated across all tasks: responses that miss high-risk signals or provide unsafe guidance are penalized, even if other aspects are correct. This combined approach captures clinically meaningful behavior beyond accuracy alone.
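For readers unfamiliar with the two classification metrics named above, here is a minimal sketch of how they can be computed for multi-label predictions. It assumes each case has a set of gold labels and a set of predicted labels, and it uses micro-averaged F1; the source does not say which F1 averaging the benchmark uses, and the labels below are invented for illustration.

```python
from typing import List, Set


def subset_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of cases whose predicted label set matches the gold set exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over cases."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold but not predicted
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Invented labels for three cases.
gold_labels = [{"MDD"}, {"GAD", "PTSD"}, {"schizophrenia"}]
pred_labels = [{"MDD"}, {"GAD"}, {"bipolar I"}]

print(subset_accuracy(gold_labels, pred_labels))  # 1 of 3 cases match exactly
print(micro_f1(gold_labels, pred_labels))         # 2*2 / (2*2 + 1 + 2) = 4/7
```

Subset accuracy gives no credit for the second case, where one of two labels is correct, while micro F1 gives partial credit. Reporting both shows whether a model is nearly right or exactly right.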

Model performance

We evaluated 15 large language models across PsychiatryBench to assess performance across psychiatric reasoning, classification, and safety-related tasks. Models perform more strongly on structured classification tasks than on open-ended clinical reasoning, where differential diagnosis, treatment planning, and ambiguity pose greater challenges. Fluency alone does not reliably indicate clinical competence.

Frontier and reasoning-oriented models show more consistent performance, particularly in diagnostic reasoning and clinical interpretation. However, safety-related failures and drops in performance on ambiguous cases remain common. Overall, PsychiatryBench highlights meaningful differences in reasoning quality and risk awareness that are not captured by single-metric evaluations, underscoring the need for clinically grounded, multi-task assessment in mental health settings.

Limitations and what’s next

PsychiatryBench focuses on static psychiatric case vignettes and does not capture interactive or longitudinal clinical reasoning. It also does not fully reflect the cultural and linguistic diversity of real-world psychiatric practice.

Future work will extend the benchmark to multi-turn clinical dialogues, longitudinal cases, and broader population coverage, with continued emphasis on safety-critical evaluation.

Why PsychiatryBench matters

PsychiatryBench evaluates large language models on realistic psychiatric reasoning. Unlike prior tests using simplified or synthetic cases, it draws from expert-validated clinical vignettes, covering diagnosis, treatment planning, and clinical reasoning under ambiguity. By doing so, it exposes current limitations of LLMs, highlights where models may make unsafe or inconsistent recommendations, and provides a standardized framework for researchers to measure progress. Ultimately, PsychiatryBench helps guide AI toward safer, clinically meaningful applications in mental health.

Benchmarking LLMs for Psychiatry: Compumacy’s Research into Multi-Task Clinical Evaluation