paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago
Harnessing Textual Refusal Directions for Multimodal Safety
To improve safety in Large Language Models (LLMs), we can either perform post-training alignment or exploit refusal directions in the activation space. Both strategies are less feasible in Multimodal LLMs (MLLMs), as they require unsafe multimodal data, which is harder to collect than its unimodal counterpart. In this work, we relax this constraint and investigate whether textual refusal directions, extracted directly from the LLM backbone, generalize across modalities (i.e., image, video). Preliminary findings confirm this ability, though effectiveness depends on layer selection, steering stren
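For readers unfamiliar with refusal directions, the idea of extracting a direction from activations and steering along it can be illustrated with a minimal sketch. It assumes the difference-of-means formulation that is common in the refusal-direction literature; this excerpt does not confirm that the paper uses exactly this procedure. The function names `refusal_direction` and `steer`, the strength parameter `alpha`, and the synthetic activations are all hypothetical stand-ins for hidden states that would be read from one chosen layer of the LLM backbone on text-only prompts.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean harmless to the mean harmful activation.

    Both inputs have shape (num_prompts, d_model) and are assumed to come
    from the same layer (an assumption of this sketch).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * direction to every token's hidden state (shape: tokens x d_model)."""
    return hidden + alpha * direction


# Synthetic stand-ins for real activations (hypothetical data, not paper results).
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[0] = 3.0  # pretend harmful prompts differ mainly along one axis
harmful = rng.normal(size=(32, d_model)) + offset
harmless = rng.normal(size=(32, d_model))

direction = refusal_direction(harmful, harmless)

# Hidden states of a (hypothetically multimodal) input at the same layer.
tokens = rng.normal(size=(5, d_model))
steered = steer(tokens, direction, alpha=4.0)

# Each token moves by exactly alpha along the unit direction.
shift = (steered - tokens) @ direction
```

In a real model the addition would typically be applied inside the forward pass at the selected layer, and both the layer index and `alpha` would need tuning, which matches the abstract's remark that effectiveness depends on layer selection and steering strength.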
