paper · arXiv

Refusal as a single linear direction

Evidence that model refusals are mediated by one interpretable direction in activation space.
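
To make the claim concrete, here is a minimal sketch of what "one direction in activation space" can mean in practice. It is an illustration on synthetic data, not the paper's procedure: the function names `mean_difference_direction` and `ablate_direction`, the dimensionality, and the planted direction are all assumptions made for this example. It estimates a candidate direction as the normalized difference between two groups' mean activations, then projects that direction out of every activation vector.

```python
import numpy as np

def mean_difference_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove each activation's component along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-ins for activations (assumption: no real model is used).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0  # hypothetical direction planted along axis 0

refused = rng.normal(size=(200, d_model)) + 4.0 * planted   # "refusal" group
complied = rng.normal(size=(200, d_model))                  # "compliance" group

r_hat = mean_difference_direction(refused, complied)
ablated = ablate_direction(refused, r_hat)

# After ablation, no activation has any component left along r_hat.
residual = np.abs(ablated @ r_hat).max()
```

If behavior really were carried by a single direction, removing that one component should be enough to change it while leaving the rest of the activation untouched, which is the kind of intervention the summary above alludes to.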
