paper · arXiv · Trust 82 · Primary · Published 8d ago · Live · 7d ago

Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label. To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with an intent description and harm label. We use AIMS to evaluate intent-aware training across supervised fine-tuning, preference learning, reasoning distillation, and reinforcement learning. Despite its size, AIMS enables competitive safety classifiers across training regimes: DPO from model-generated intent errors improves over SFT, and intent-conditioned distillation
