paperarXivTrust 82 · PrimaryPublished 4d agoLive · 3d ago

IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies

Large Language Models (LLMs) often fail to maintain instruction hierarchies (IH) when processing multi-source inputs with varying role-level priorities, paradoxically adhering to lower-priority directives during conflicts. While existing defenses mitigate this issue, they are largely restricted to single-turn scenarios and require expensive fine-tuning. In this paper, we formalize this failure mode in multi-turn contexts via a Jensen-Shannon Divergence (JSD) framework, uncovering a pervasive role-influence inversion phenomenon where subordinate inputs override superior roles. To rectify this w

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Topics

cs.CL