Read original ↗
paperarXivTrust 82 · PrimaryPublished yesterdayLive · 19h ago

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

Safe completion requires models to provide useful assistance without enabling harm, but this behavior is difficult to evaluate with isolated prompts. We introduce OpenSafeIntent, a benchmark of controlled prompt-sets that vary intent while holding the underlying task fixed. Each datapoint contains benign, dual-use, and malicious variants of the same task. This design lets us evaluate whether models calibrate assistance across intent shifts, rather than merely appearing safe on average. Across a broad model suite, we find that prompt-level safety hides important failures: models often fail to r

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arxiv authorRheeya Uppaal

    OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

  • Linked via arxiv authorSeungwoo Lyu

    OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

  • Linked via arxiv authorSelina Sung

    OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

  • Linked via arxiv authorJunjie Hu

    OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

authored (incoming)

Related across the graph

Topics