arXiv · Published yesterday

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

Safe completion requires models to provide useful assistance without enabling harm, but this behavior is difficult to evaluate with isolated prompts. We introduce OpenSafeIntent, a benchmark of controlled prompt-sets that vary intent while holding the underlying task fixed. Each datapoint contains benign, dual-use, and malicious variants of the same task. This design lets us evaluate whether models calibrate assistance across intent shifts, rather than merely appearing safe on average. Across a broad model suite, we find that prompt-level safety hides important failures: models often fail to r
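To make the prompt-set design concrete, the following is a minimal Python sketch of why scoring whole prompt-sets can expose failures that a prompt-level average hides. It rests on stated assumptions. The schema (`PromptSet`, the intent keys) and the per-intent target behaviors in `EXPECTED` are hypothetical illustrations, not the benchmark's actual format or rubric. In this sketch, benign prompts target full help, dual-use prompts target a safe partial completion, and malicious prompts target refusal. Responses are assumed to have already been judged into one of these behavior labels.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical intent keys; OpenSafeIntent's real schema may differ.
INTENTS = ("benign", "dual_use", "malicious")

# Assumed target behavior per intent (illustrative, not the paper's rubric).
EXPECTED = {"benign": "full", "dual_use": "safe", "malicious": "refuse"}


@dataclass
class PromptSet:
    """One fixed task phrased under each intent (hypothetical schema)."""
    task_id: str
    prompts: Dict[str, str]  # intent -> prompt text

    def __post_init__(self) -> None:
        # Every set must contain all three intent variants of the same task.
        missing = [i for i in INTENTS if i not in self.prompts]
        if missing:
            raise ValueError(f"{self.task_id}: missing variants {missing}")


def prompt_level_score(judged: List[Dict[str, str]]) -> float:
    """Fraction of individual prompts whose judged behavior matches EXPECTED."""
    hits = [resp[i] == EXPECTED[i] for resp in judged for i in INTENTS]
    return sum(hits) / len(hits)


def set_level_score(judged: List[Dict[str, str]]) -> float:
    """Fraction of sets in which every intent variant matches EXPECTED."""
    ok = [all(resp[i] == EXPECTED[i] for i in INTENTS) for resp in judged]
    return sum(ok) / len(ok)


# Illustrative judged labels for four prompt-sets (not results from the paper).
judged = [
    {"benign": "full", "dual_use": "safe", "malicious": "refuse"},    # calibrated
    {"benign": "refuse", "dual_use": "safe", "malicious": "refuse"},  # over-refuses benign
    {"benign": "full", "dual_use": "full", "malicious": "refuse"},    # over-assists dual-use
    {"benign": "full", "dual_use": "safe", "malicious": "refuse"},    # calibrated
]
print(round(prompt_level_score(judged), 2))  # 0.83
print(set_level_score(judged))               # 0.5
```

In this toy example, 10 of 12 individual prompts match their target, so the prompt-level score is about 0.83. However, only two of the four sets are handled correctly across all three intents, which gives a set-level score of 0.5. Holding the task fixed is what lets the second number separate calibration from average-case safety.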

Authors: Rheeya Uppaal, Seungwoo Lyu, Selina Sung, Junjie Hu

Topics: cs.CL