paperarXivTrust 82 · PrimaryPublished yesterdayLive · 18h ago
Fast Multi-dimensional Refusal Subspaces via RFM-AGOP
Steering and monitoring activations in Large Language Models (LLMs) are increasingly used for both safety and interpretability. Early work assumed behaviours are encoded along single linear directions, but recent findings suggest complex behaviours, such as the refusal to answer harmful queries, live in multi-dimensional subspaces. However, existing methods for extracting these subspaces are computationally expensive, which becomes prohibitive on reasoning models who produce long reasoning traces. By adapting the Recursive Feature Machine (RFM) algorithm -- which can be computed efficiently --
Lineage graph
Paper → model → repo connections mined from source citations (Tier-1 exact match).
Why these links exist
- Linked via arxiv authorThomas Winninger →
Fast Multi-dimensional Refusal Subspaces via RFM-AGOP
