paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for safety evaluation of emotional-support chatbots. An auditor model role-plays help-seeking users, generating multi-turn conversations from 140 seed instructions and 34 personas. A judge model scores each full transcript against 19 metrics across five dimensions: crisis handling, therapeutic quality, conversational inte
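The auditor-judge loop described in the abstract can be illustrated with a minimal sketch. It assumes the auditor, the chatbot under test, and the judge are plain callables; every function name, the stub models, the placeholder metric names, and the per-dimension averaging are hypothetical and are not taken from the EMPATH release. Only the overall shape (seed instruction plus persona, multi-turn conversation, judging of the full transcript by dimension) comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    role: str  # "user" (simulated by the auditor) or "assistant" (chatbot under test)
    text: str


Transcript = List[Turn]

# Hypothetical callable signatures, assumed for illustration only.
Auditor = Callable[[str, str, Transcript], str]  # (seed, persona, history) -> user message
Target = Callable[[Transcript], str]             # history -> chatbot reply
Judge = Callable[[Transcript, str], float]       # (full transcript, metric name) -> score


def run_conversation(seed: str, persona: str, auditor: Auditor,
                     target: Target, max_turns: int = 3) -> Transcript:
    """Alternate auditor and chatbot messages for max_turns exchanges."""
    transcript: Transcript = []
    for _ in range(max_turns):
        transcript.append(Turn("user", auditor(seed, persona, transcript)))
        transcript.append(Turn("assistant", target(transcript)))
    return transcript


def score_transcript(transcript: Transcript, judge: Judge,
                     metrics: Dict[str, List[str]]) -> Dict[str, float]:
    """Score the whole transcript per metric, then average within each dimension.

    Averaging is an assumption of this sketch, not a documented aggregation rule.
    """
    return {
        dimension: sum(judge(transcript, name) for name in names) / len(names)
        for dimension, names in metrics.items()
    }


# Stub models so the sketch runs without any external service.
def stub_auditor(seed: str, persona: str, transcript: Transcript) -> str:
    return f"[{persona}] {seed} (turn {len(transcript) // 2 + 1})"


def stub_target(transcript: Transcript) -> str:
    return "I'm here with you. Can you tell me more?"


def stub_judge(transcript: Transcript, metric: str) -> float:
    return 1.0 if transcript else 0.0


# Dimension names are from the abstract; metric names are placeholders.
example_metrics = {
    "crisis handling": ["metric_a", "metric_b"],
    "therapeutic quality": ["metric_c"],
}
example_transcript = run_conversation(
    "example seed instruction", "example persona", stub_auditor, stub_target, max_turns=2
)
example_scores = score_transcript(example_transcript, stub_judge, example_metrics)
```

Passing the complete transcript to the judge, rather than scoring turn by turn, mirrors the abstract's point that failures surface across a whole conversation.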

Covers

news · Community fine-tune tops the open leaderboard

Covers (incoming)

news · You Can Now Sound the Alarm on AI Behaving Badly

Topics

cs.AI