Large Language Models (LLMs) are increasingly explored as tools for mental health research and support, from screening and monitoring to assisting clinical decision-making. At the same time, recent progress in multimodal modeling raises an important new question: does incorporating audio, alongside text, meaningfully improve how LLMs reason about mental health?
In our paper, “Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance,” we present a systematic evaluation of text-only and multimodal LLMs on clinically grounded mental health tasks. Rather than focusing on model scale alone, our study examines how different input modalities (text, audio, and their combination) affect performance, robustness, and error patterns in zero-shot settings.

Why This Study
Mental health disorders such as depression and post-traumatic stress disorder (PTSD) represent a growing global burden, while access to trained clinicians remains limited. Clinical assessments rely not only on what patients say, but also on how they say it: tone, pauses, and emotional intensity often carry critical diagnostic information.
Most existing computational approaches treat these signals separately, using handcrafted audio features and task-specific classifiers. Meanwhile, prior LLM-based studies largely focus on text alone, leaving the role of audio underexplored.
Our work addresses three key gaps in the literature:
● Modality bias: Most evaluations assess LLMs using transcripts only, ignoring acoustic information.
● Model diversity: Comparisons between text-only and multimodal LLMs are limited and often anecdotal.
● Measurement of complementarity: Few studies explicitly analyze when and how modalities help resolve model errors.
We aim to provide a clearer, more reproducible picture of how modern LLMs perform when mental health assessment is framed as a multimodal reasoning problem.
What We Evaluated
We evaluate LLMs on data from the Extended Distress Analysis Interview Corpus (E‑DAIC), which consists of semi-structured interviews conducted with a virtual interviewer. Each interview includes aligned raw audio, text transcripts, and clinically validated labels.
Our evaluation spans four task categories:
1. Binary Disorder Detection (Zero-Shot)
Models predict the presence or absence of depression or PTSD based on an interview.
2. Severity Estimation
Models assess symptom severity levels derived from standardized clinical scales.
3. Multiclass Diagnosis
Models distinguish between no disorder, depression only, PTSD only, or comorbid conditions.
4. Multi-Label Classification
Models predict overlapping conditions simultaneously, reflecting real-world clinical complexity.
All tasks are performed without task-specific fine-tuning, allowing us to isolate the intrinsic reasoning capabilities of each model.
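To make the relationship between the multiclass and multi-label tasks concrete, the sketch below derives both label formats from the same pair of binary flags (depression, PTSD). It is an illustration only: the names `Diagnosis`, `multiclass_label`, and `multilabel_vector` are hypothetical and do not come from the paper or the dataset.

```python
from enum import Enum
from typing import Tuple


class Diagnosis(Enum):
    """The four mutually exclusive classes of the multiclass task."""
    NONE = "none"
    DEPRESSION_ONLY = "depression_only"
    PTSD_ONLY = "ptsd_only"
    COMORBID = "comorbid"


def multiclass_label(depressed: bool, ptsd: bool) -> Diagnosis:
    """Collapse two binary flags into one of four exclusive classes."""
    if depressed and ptsd:
        return Diagnosis.COMORBID
    if depressed:
        return Diagnosis.DEPRESSION_ONLY
    if ptsd:
        return Diagnosis.PTSD_ONLY
    return Diagnosis.NONE


def multilabel_vector(depressed: bool, ptsd: bool) -> Tuple[int, int]:
    """Keep the two conditions as independent labels (depression, PTSD)."""
    return (int(depressed), int(ptsd))
```

The multiclass view forces a single choice per interview, while the multi-label view lets each condition be scored independently, so a model can be partially correct on a comorbid case.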
Data: Clinically Grounded and Multimodal
A central design principle of our study is data quality. We rely on interviews collected under controlled protocols and annotated using established clinical instruments, rather than weak proxies such as self-declared diagnoses or platform-level labels.
Each interview provides:
● Approximately 15–20 minutes of conversational audio
● High-quality transcripts
● Labels grounded in standardized assessments for depression and PTSD
This setup enables a direct comparison between text-based and audio-based reasoning under consistent conditions.
Models and Modalities
We benchmark a diverse set of large language models, including:
● Text-only LLMs (e.g., LLaMA, Mistral, Phi, DeepSeek families)
● Multimodal LLMs capable of ingesting both text and raw audio (e.g., Gemini variants)
For multimodal models, we evaluate three configurations:
● Text only
● Audio only
● Text and audio combined
This design allows us to disentangle the contribution of each modality and study whether fusion consistently improves performance.
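The three configurations amount to a simple ablation over the same interview. A minimal sketch, assuming each interview is available as a transcript string and an audio file path; `ModelInput` and `build_inputs` are hypothetical names used for illustration, not part of any model API or of our codebase.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelInput:
    """What a multimodal model is shown; None means the modality is withheld."""
    transcript: Optional[str]
    audio_path: Optional[str]


def build_inputs(transcript: str, audio_path: str) -> Dict[str, ModelInput]:
    """Return the three modality configurations for one interview."""
    return {
        "text_only": ModelInput(transcript=transcript, audio_path=None),
        "audio_only": ModelInput(transcript=None, audio_path=audio_path),
        "text_and_audio": ModelInput(transcript=transcript, audio_path=audio_path),
    }
```

Because the same interview and the same model are used in all three cases, any difference in predictions can be attributed to the modality that was added or withheld.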
Prompting as an Experimental Variable
Rather than fixing prompts arbitrarily, we treat prompt design as part of the experimental methodology.
Across tasks, prompts are:
● Structured and task-specific
● Constrained to produce short, unambiguous outputs
● Designed to minimize stylistic variability
We find that prompt phrasing materially affects outcomes, reinforcing the need for transparent and reproducible prompt reporting, especially in high-stakes domains like mental health.
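As an illustration of what a structured prompt with a constrained output can look like, here is a minimal sketch for the binary task. The prompt wording, the `build_prompt` and `parse_answer` helpers, and the yes/no answer format are assumptions made for this example; they are not the prompts used in the paper.

```python
from typing import Optional

ALLOWED_ANSWERS = ("yes", "no")


def build_prompt(condition: str, transcript: str) -> str:
    """Build a task-specific prompt that asks for a one-word answer."""
    return (
        f"Task: decide whether the interviewee shows signs of {condition}.\n"
        "Answer with exactly one word: yes or no.\n\n"
        f"Interview transcript:\n{transcript}\n\n"
        "Answer:"
    )


def parse_answer(raw: str) -> Optional[str]:
    """Normalize a model reply; return None if it is not a valid answer."""
    # Trim whitespace first, then surrounding punctuation and quotes.
    cleaned = raw.strip().strip(".!\"'").lower()
    return cleaned if cleaned in ALLOWED_ANSWERS else None
```

Returning `None` for anything other than an allowed answer keeps malformed replies visible as their own category instead of silently counting them as one of the classes.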
Key Findings
Text is a strong baseline, but incomplete.
Text-only LLMs achieve competitive performance in several tasks, confirming that linguistic content alone carries a substantial diagnostic signal.
Audio adds complementary information.
For models capable of processing audio directly, acoustic cues improve performance, particularly for PTSD-related tasks and severity estimation.
Multimodal fusion helps capable models.
When models natively support multimodal reasoning, combining text and audio often outperforms either modality alone, improving both accuracy and error consistency.
Gains are task-dependent.
Severity estimation and multiclass settings benefit more from multimodal input than simple binary classification.
Zero-shot performance is surprisingly strong.
Several models approach results reported by specialized pipelines, despite requiring no task-specific training or feature engineering.
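One simple way to quantify complementarity between modalities is to ask how often the fused configuration corrects the text-only model's mistakes, and how often it breaks predictions that were already right. The sketch below is one possible formulation under that assumption, not the exact analysis from the paper; `error_resolution` is a hypothetical helper and the example values are toy data, not our results.

```python
from typing import Dict, Sequence


def error_resolution(
    y_true: Sequence[int],
    pred_text: Sequence[int],
    pred_fused: Sequence[int],
) -> Dict[str, float]:
    """Compare fused predictions against text-only predictions per example."""
    fixed = broken = text_errors = text_correct = 0
    for truth, text, fused in zip(y_true, pred_text, pred_fused):
        if text != truth:
            text_errors += 1
            fixed += int(fused == truth)    # text was wrong, fusion is right
        else:
            text_correct += 1
            broken += int(fused != truth)   # text was right, fusion is wrong
    return {
        "fixed_rate": fixed / text_errors if text_errors else 0.0,
        "broken_rate": broken / text_correct if text_correct else 0.0,
    }


# Toy example with made-up labels and predictions.
rates = error_resolution(
    y_true=[1, 0, 1, 1, 0],
    pred_text=[1, 1, 0, 1, 0],
    pred_fused=[1, 0, 0, 1, 1],
)
```

Reporting both rates matters: a fused model that fixes many text-only errors but also breaks many correct predictions may show little net gain in accuracy even though the modalities carry different information.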
Why It Matters
Mental health assessment is inherently multimodal. Our results show that:
● Audio is not merely auxiliary; it provides a meaningful diagnostic signal.
● Multimodal LLMs can leverage complementary cues when properly designed.
● Careful evaluation, not just larger models, is critical for responsible use.
As multimodal LLMs continue to evolve, understanding when and why modalities help will be essential for building reliable, ethical mental health technologies.
