paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

Harnessing Textual Refusal Directions for Multimodal Safety

To improve safety in Large Language Models (LLMs), we can either perform post-training alignment or exploit refusal directions in the activation space. Both strategies are less feasible in Multimodal LLMs (MLLMs) because they require unsafe multimodal data, which is harder to collect than its unimodal counterpart. In this work, we relax this constraint and investigate whether textual refusal directions, extracted directly from the LLM backbone, generalize across modalities (i.e., image, video). Preliminary findings confirm this ability, though effectiveness is conditioned by layer selection, steering stren
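
As a rough illustration of what "exploiting a refusal direction in the activation space" can mean, the sketch below uses a difference-of-means construction, a common approach in the activation-steering literature. The abstract excerpt does not state how the paper extracts or applies its directions, so this is an assumption rather than the authors' method. The function names, the toy activations, and the `alpha` steering-strength parameter are all hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between two sets of activations.

    Assumption: both sets are hidden states taken at the same layer for
    harmful and harmless text prompts respectively.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("activation means coincide; no direction to extract")
    return [x / norm for x in diff]

def steer(hidden, direction, alpha):
    """Shift one hidden-state vector by alpha along the given direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-dimensional activations, made up for illustration only.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]

direction = refusal_direction(harmful, harmless)
steered = steer([1.0, 1.0], direction, alpha=0.5)
```

In a real model, the layer at which the hidden state is read and modified and the value of `alpha` would be the tunable choices, which corresponds to the layer-selection sensitivity the abstract mentions.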
