Benchmark

PsychiatryBench

A multi-task benchmark for LLMs in psychiatry.

11 Tasks · 5,188 Items

Paper: DOI to be added · Dataset: contact the corresponding author

Summary

PsychiatryBench is a benchmark for evaluating large language models on psychiatric reasoning, built entirely from authoritative, expert-validated psychiatric textbooks and casebooks rather than social-media posts or synthetic dialogues. It contains 5,188 expert-annotated items spanning eleven task types: Diagnosis, Treatment, Treatment Follow-Up, Classification (categories and specific disorders), Management Plan, Clinical Approach, Mental QA, Sequential QA, MCQ, Extended Matching Items (EMI), and Exam Simulations.
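To make the task structure concrete, here is a minimal sketch of how a single item might be represented. The field names and example content are illustrative assumptions, not the published schema.

# Hypothetical item record; all field names and values are assumptions
# made for illustration, not the benchmark's actual data format.
item = {
    "task": "Diagnosis",                    # one of the eleven task types
    "source": "DSM-5-TR Clinical Cases",    # expert-authored source text
    "vignette": "A 34-year-old presents with ...",  # clinical case stem
    "question": "What is the most likely diagnosis?",
    "reference_answer": "Major depressive disorder, single episode",
}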

Items were curated by licensed psychiatrists from sources including DSM-5 and DSM-5-TR Clinical Cases, 100 Cases in Psychiatry, Case Files Psychiatry, Stahl's Essential Psychopharmacology, and others. Open-ended tasks are scored with an LLM-as-judge framework (Llama 3.3 70B) against expert-written reference answers; multi-label classification uses weighted F1 and subset accuracy; MCQ and EMI use accuracy and partial-credit scoring.
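As an illustration of the closed-form metrics named above, the snippet below computes weighted F1 and subset accuracy for a toy multi-label example with scikit-learn, plus plain accuracy for MCQ and one possible partial-credit rule for EMI. The toy labels, the predictions, and the emi_partial_credit function are assumptions for illustration; this is a sketch, not the benchmark's evaluation code.

# Minimal sketch of the closed-form metrics (not the official code).
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

# Multi-label classification: each item may carry several disorder labels.
# Rows are items, columns are hypothetical disorder labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

# Weighted F1: per-label F1 averaged by label support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# Subset accuracy: the predicted label set must match the gold set exactly.
subset_acc = accuracy_score(y_true, y_pred)

# MCQ: plain accuracy over single gold answers.
mcq_gold = ["B", "C", "A"]
mcq_pred = ["B", "C", "D"]
mcq_acc = sum(g == p for g, p in zip(mcq_gold, mcq_pred)) / len(mcq_gold)

# EMI partial credit: one common scheme (an assumption here) scores each
# item as the fraction of gold options the model selects.
def emi_partial_credit(gold: set, pred: set) -> float:
    return len(gold & pred) / len(gold) if gold else 0.0

emi_score = emi_partial_credit({"A", "C"}, {"A", "D"})  # 0.5

print(f"weighted F1={weighted_f1:.3f}, subset acc={subset_acc:.3f}, "
      f"MCQ acc={mcq_acc:.3f}, EMI partial credit={emi_score:.2f}")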

Importance

Most prior mental-health benchmarks for LLMs rely on unverified social-media text or synthetic dialogues, which fail to capture the layered diagnostic reasoning, comorbidity handling, and treatment planning that define actual psychiatric practice. PsychiatryBench grounds evaluation in expert-authored clinical material and probes both knowledge and multi-step reasoning across eleven task types. It exposes where frontier and specialized medical models still fail, especially in multi-label disorder classification and longitudinal follow-up, and provides a structured basis for safer deployment of LLMs in mental-health settings.
