paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 18h ago

Fast Multi-dimensional Refusal Subspaces via RFM-AGOP

Steering and monitoring activations in Large Language Models (LLMs) are increasingly used for both safety and interpretability. Early work assumed that behaviours are encoded along single linear directions, but recent findings suggest that complex behaviours, such as refusing to answer harmful queries, live in multi-dimensional subspaces. However, existing methods for extracting these subspaces are computationally expensive, and the cost becomes prohibitive for reasoning models, which produce long reasoning traces. By adapting the Recursive Feature Machine (RFM) algorithm -- which can be computed efficiently --
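To make the contrast between a single linear direction and a multi-dimensional subspace concrete, the following is a minimal NumPy sketch of removing each from a batch of activations by orthogonal projection. It is not the paper's extraction method: the activations and the 3-dimensional basis are random placeholders, and the function names `ablate_direction` and `ablate_subspace` are hypothetical, chosen only for illustration.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation along a single direction."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

def ablate_subspace(acts, basis):
    """Remove the component of each activation inside span(basis).

    basis: (k, d) array of k spanning vectors (need not be orthonormal).
    """
    # Orthonormalise the spanning vectors so that q @ q.T is the projector.
    q, _ = np.linalg.qr(basis.T)  # q has shape (d, k)
    return acts - (acts @ q) @ q.T

# Toy data (assumed, not from the paper): 5 token activations, hidden size 16.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))
basis = rng.normal(size=(3, 16))  # hypothetical 3-dimensional behaviour subspace

single = ablate_direction(acts, basis[0])
multi = ablate_subspace(acts, basis)
```

Ablating only `basis[0]` leaves the components along the other two basis vectors in place, whereas ablating the full subspace removes all three. This is why a single-direction intervention can miss a behaviour that is spread over several dimensions.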
