
SalamahBench

Standardized safety evaluation for Arabic language models.

12 Tasks · 8,170 Items

Summary

SalamahBench is a unified safety benchmark for Arabic language models, containing 8,170 prompts aligned with the MLCommons AI Safety Hazard Taxonomy across 12 categories: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Indiscriminate Weapons, Suicide and Self-Harm, Hate Speech, Privacy, Intellectual Property, Defamation, Sexual Content, and an "Others" bucket for harms that don't map cleanly to a single category.
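
As a rough sketch of how items pair with the taxonomy, each benchmark record can be thought of as an Arabic prompt tagged with one hazard category. The schema and field names below are hypothetical, not SalamahBench's actual format:

```python
from dataclasses import dataclass

# Hypothetical item schema for illustration only; SalamahBench's
# actual field names and label strings may differ.
@dataclass
class SafetyItem:
    prompt: str    # the Arabic prompt text
    category: str  # one of the 12 MLCommons hazard categories
    source: str    # originating corpus (e.g., AraSafe, RTP-LX)

item = SafetyItem(
    prompt="...",  # Arabic prompt elided
    category="Indiscriminate Weapons",
    source="AraSafe",
)
```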

The dataset was built by harmonizing multiple heterogeneous Arabic and translated safety corpora (RTP-LX, WildGuardMix/PGPrompts, AraSafe, X-Safety, LinguaSafe, AdvBench, ClearHarm, HarmBench, and others) through a three-stage pipeline: (1) dataset-specific preprocessing and taxonomy mapping; (2) dual AI-judge filtering with Claude Sonnet 4.5 and GPT-5; and (3) multi-stage human verification covering harm verification, disagreement adjudication, and category validation. Models are scored by Attack Success Rate (ASR): the proportion of responses classified as unsafe by safeguard models (Qwen3Guard, Llama Guard 4, PolyGuard, and a majority-vote ensemble).
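
A minimal sketch of the scoring step, assuming each safeguard model can be wrapped as a binary unsafe/safe classifier; the function names and interfaces here are placeholders, not the safeguard models' real APIs:

```python
from typing import Callable, List, Tuple

# Placeholder interface: a safeguard takes (prompt, response) and returns
# True if it judges the response unsafe. Real guards (Qwen3Guard,
# Llama Guard 4, PolyGuard) have their own prompt formats and parsers.
Safeguard = Callable[[str, str], bool]

def majority_unsafe(prompt: str, response: str, guards: List[Safeguard]) -> bool:
    """Majority-vote ensemble: unsafe if more than half the guards flag it."""
    votes = sum(guard(prompt, response) for guard in guards)
    return votes > len(guards) / 2

def attack_success_rate(records: List[Tuple[str, str]], guards: List[Safeguard]) -> float:
    """ASR: fraction of (prompt, response) pairs the ensemble calls unsafe."""
    flagged = sum(majority_unsafe(p, r, guards) for p, r in records)
    return flagged / len(records)
```

With three guards, the ensemble flags a response when at least two of them agree it is unsafe.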

Importance

Safety alignment does not generalize uniformly across languages. Models that reliably refuse harmful prompts in English often fail when the same request is expressed in Arabic, and existing Arabic safety evaluations rely on translated benchmarks or coarse labels that miss category-level vulnerabilities. SalamahBench is the first large-scale, category-aware safety benchmark for Arabic LMs built from native and carefully harmonized data. It reveals substantial model-to-model variation: some models achieve strong aggregate safety scores yet remain vulnerable in specific harm domains, such as Intellectual Property or Sexual Content, that a single aggregate number would hide.
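
To make the aggregate-versus-category point concrete, the helper below (a hypothetical sketch, not part of the benchmark's tooling) breaks ASR down per hazard category:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def per_category_asr(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """records: (category, is_unsafe) pairs; returns ASR per category."""
    totals: Dict[str, int] = defaultdict(int)
    unsafe: Dict[str, int] = defaultdict(int)
    for category, is_unsafe in records:
        totals[category] += 1
        unsafe[category] += int(is_unsafe)
    return {c: unsafe[c] / totals[c] for c in totals}

# Illustrative numbers only: a model with a low 5% aggregate ASR can
# still fail every prompt in one category.
records = [("Hate Speech", False)] * 95 + [("Intellectual Property", True)] * 5
overall = sum(u for _, u in records) / len(records)  # 0.05 aggregate
by_category = per_category_asr(records)
# {"Hate Speech": 0.0, "Intellectual Property": 1.0}
```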
