Automated Multi-Label Annotation for Mental Health Illnesses Using Large Language Models

The research addresses the challenge of accurately diagnosing co-occurring mental health conditions, such as depression and anxiety, from social media data, noting that existing datasets often focus on single-disorder labels. This paper proposes a novel methodology utilizing Large Language Models (LLMs) for creating versatile multi-label datasets.

Main Contributions

This paper offers several key contributions to the field of mental health diagnostics through LLMs and synthetic labeling:

Novel Methodology and Synthetic Labeling: Proposing a new methodology that includes cleaning, sampling, labeling, and combining data, along with introducing a synthetic labeling technique to transform single-label datasets into multi-label annotations, effectively capturing the complexity of overlapping mental health conditions.
Evaluation of LLM Prompting Strategies: Designing and evaluating various prompting strategies (single-label, multi-label, unrestricted) for LLMs in diagnosing multiple disorders.
Development of SPAADE-DR Dataset: Creating a new, comprehensive multi-label dataset encompassing six distinct mental disorders by applying the proposed approach to the RMHD dataset.
Analysis of Mental Disorder Comorbidities: Exploring associations and comorbidities among various mental disorders using the created multi-label dataset.

Methods

Data and Preprocessing

The study utilizes seven primary single-label datasets focusing on specific mental health conditions, as well as the RMHD dataset. The Depseverity and Dreaddit datasets, which contain identical posts, were merged into a unified multi-label dataset labeled for both depression and stress. The RMHD dataset, sourced from mental health and general interest subreddits, was used to create the SPAADE-DR dataset. Because RMHD labels are derived from each post's subreddit of origin and can therefore be incorrect, a cleaning and sampling process was applied to the dataset first.
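The merge of two single-label datasets that share the same posts can be sketched as a join on post text. This is a minimal illustration, not the authors' pipeline: the row layout, the binary labels, and the function name merge_single_label are assumptions made for the example.

```python
def merge_single_label(depression_rows, stress_rows):
    """Join two single-label datasets on identical post text.

    Each input is a list of (text, label) pairs; the layout is hypothetical.
    Posts that appear in only one dataset are dropped.
    """
    stress_by_text = {text: label for text, label in stress_rows}
    merged = []
    for text, depression_label in depression_rows:
        if text in stress_by_text:
            merged.append({
                "text": text,
                "depression": depression_label,
                "stress": stress_by_text[text],
            })
    return merged


# Toy rows standing in for the two datasets (invented for illustration).
depression_rows = [("I can't get out of bed", 1), ("Exams next week", 0)]
stress_rows = [("Exams next week", 1), ("I can't get out of bed", 0)]
merged = merge_single_label(depression_rows, stress_rows)
```

Each merged row then carries one binary value per condition, which is the shape a multi-label evaluation needs.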

LLMs and Prompting Strategies

Experiments were conducted using five LLMs: GPT-4o-mini, Llama-3 70b, Mistral NeMo 12b, Phi-3.5-MoE, and Gemma-2 9b. Three prompt template types were tested:

Single-Label Prompts: Designed to diagnose one mental illness at a time, prompting a binary (Yes/No) decision.
Multi-Label Prompts: Developed to evaluate the presence of multiple mental illnesses simultaneously, utilizing multi-class or multi-label classification approaches.
Unrestricted Prompts: Enabling the model to diagnose multiple mental illnesses without being restricted to predefined categories.
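The three template types can be illustrated with simple string builders. The wording below is invented for the example and is not the prompt text used in the paper; the function names and the yes/no parser are likewise assumptions.

```python
def build_single_label_prompt(post, disorder):
    """One disorder per call, asking for a binary Yes/No answer."""
    return (
        f"Post: {post}\n"
        f"Does the author show signs of {disorder}? Answer Yes or No."
    )


def build_multi_label_prompt(post, disorders):
    """All candidate disorders in one call, restricted to a fixed list."""
    options = ", ".join(disorders)
    return (
        f"Post: {post}\n"
        f"Which of these conditions does the author show signs of: {options}? "
        "List every condition that applies, or answer None."
    )


def build_unrestricted_prompt(post):
    """No predefined categories; the model names conditions freely."""
    return (
        f"Post: {post}\n"
        "List any mental health conditions the author shows signs of."
    )


def parse_yes_no(reply):
    """Map a model reply to True/False, or None if the first word is neither."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,!:;") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


prompt = build_single_label_prompt("I feel empty", "depression")
```

The single-label form needs one model call per disorder, but its output is trivial to parse, which is one practical difference between the three designs.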

Evaluation Metrics

LLM performance was evaluated using per-class, overall (multi-label), and multi-class metrics. These include Balanced Accuracy (BA), F1-Score, Precision, Recall, Overall Balanced Accuracy (OBA), Overall Precision (OP), Overall Recall (OR), Hamming Loss (HL), and multi-class balanced accuracy.
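As a rough sketch of how some of these metrics are computed, the following uses the standard definitions of balanced accuracy and Hamming loss. It assumes that the overall balanced accuracy is the unweighted mean of the per-label values; the summary does not state how the overall metrics are aggregated, so that averaging choice is an assumption.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(
        1
        for row_t, row_p in zip(Y_true, Y_pred)
        for t, p in zip(row_t, row_p)
        if t != p
    )
    return wrong / total


def overall_balanced_accuracy(Y_true, Y_pred):
    """Assumed aggregation: unweighted mean of per-label balanced accuracy."""
    n_labels = len(Y_true[0])
    per_label = [
        balanced_accuracy([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Rows are posts, columns are labels (toy values invented for illustration).
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_pred = [[1, 0], [0, 0], [1, 1], [1, 0]]
oba = overall_balanced_accuracy(Y_true, Y_pred)
hl = hamming_loss(Y_true, Y_pred)
```

Balanced accuracy is useful here because the positive class for any single disorder is typically a minority, and plain accuracy would reward always answering "No".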

Multi-label Labeling

The SPAADE-DR dataset was labeled using the most effective prompt and LLMs identified from the evaluation on the Depseverity-Dreaddit dataset. The single-label prompt was applied with Llama-3 70b, GPT-4o-mini, and Phi-3.5-MoE to annotate the presence of six different disorders. The figure below shows the distribution of the SPAADE-DR dataset.
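One way to turn per-disorder answers from several models into a single multi-label annotation is a majority vote. The summary does not say how the three models' outputs were combined, so the voting rule here is an assumption, and the keyword-based annotators are hypothetical stand-ins for real LLM calls.

```python
def majority_vote(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0


def annotate_post(post, disorders, annotators):
    """Ask every annotator about every disorder, one disorder at a time."""
    labels = {}
    for disorder in disorders:
        votes = [annotate(post, disorder) for annotate in annotators]
        labels[disorder] = majority_vote(votes)
    return labels


def keyword_annotator(keywords):
    """Hypothetical stub: answers 1 if any keyword for the disorder appears."""
    def annotate(post, disorder):
        text = post.lower()
        return 1 if any(k in text for k in keywords.get(disorder, [])) else 0
    return annotate


# Three stubs standing in for three LLMs (invented for illustration).
annotators = [
    keyword_annotator({"depression": ["hopeless"], "anxiety": ["panic"]}),
    keyword_annotator({"depression": ["hopeless", "empty"], "anxiety": []}),
    keyword_annotator({"depression": ["empty"], "anxiety": ["panic", "worried"]}),
]
labels = annotate_post(
    "I feel hopeless and worried", ["depression", "anxiety"], annotators
)
```

In a real run, each stub would be replaced by a function that sends a single-label prompt to one model and parses its Yes/No reply.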

Results

Evaluation on the Depseverity-Dreaddit dataset showed that the single-label prompt template consistently outperformed multi-label templates, with Llama-3 70b demonstrating the highest scores across most metrics, particularly in multi-label classification.

Upon re-evaluation on the SPAADE-DR dataset with an increased number of disorders (from 2 to 6), GPT-4o-mini outperformed the other models when using multi-label and unrestricted prompts, indicating its robustness. However, when using the single-label prompt, Llama-3 70b achieved the highest scores. Single-label prompts were found to be the most robust, as their performance remained unaffected by the number of labels. GPT-4o-mini and Phi-3.5-MoE also achieved a high overall balanced accuracy (OBA) of 0.86 with single-label prompts.

Analysis of comorbidities in the SPAADE-DR dataset, as shown in the figure below, revealed strong associations between depression and suicidal tendencies, high comorbidity between PTSD and anxiety, and a close relationship between depression and anxiety. Suicide and depression were the most positively associated disorders, while PTSD and eating disorders were the most negatively associated.

Contingency matrix showing associations between mental disorders (comorbidity)
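Pairwise association between two binary disorder labels can be measured from their 2x2 contingency table. The sketch below uses the phi coefficient as one common choice; the summary does not state which association measure the paper uses, so this is an assumption, and the label vectors are toy data.

```python
import math


def phi_coefficient(a, b):
    """Phi coefficient of two equal-length binary label vectors.

    Ranges from -1 (never co-occur) to +1 (always co-occur);
    returns 0.0 when either label is constant.
    """
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denominator


# One entry per post (toy values invented for illustration).
depression = [1, 1, 0, 0, 1, 0]
suicide = [1, 1, 0, 0, 0, 0]
phi = phi_coefficient(depression, suicide)
```

Computing this value for every pair of labels gives a symmetric matrix in which positive entries indicate disorders that tend to be annotated together and negative entries indicate disorders that rarely are.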

Future Directions

Future directions include improving model precision for specific mental health conditions through fine-tuning, extending multi-label annotation to other domains and languages, and integrating LLMs into real-time monitoring systems for early detection and intervention. Continued refinement and addressing current challenges could position LLMs as a key component in the future of mental health research and interventions.

How to cite this work

@misc{hassan2024automated,
  title={Automated Multi-Label Annotation for Mental Health Illnesses Using Large Language Models},
  author={Abdelrahaman A. Hassan and Radwa J. Hanafy and Mohammed E. Fouda},
  year={2024},
  eprint={2412.03796},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2412.03796}
}

Paper link: https://arxiv.org/abs/2412.03796