paper · arXiv
Refusal as a single linear direction
Evidence that model refusals are mediated by one interpretable direction in activation space.
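To make "one direction in activation space" concrete, here is a minimal sketch on synthetic data. It assumes the direction is estimated as the difference between mean activations on harmful and harmless prompts, and that its effect is tested by projecting it out. The summary above does not confirm either choice. The function names `refusal_direction` and `ablate_direction`, the dimension `d_model`, and the random activations are all hypothetical stand-ins, not taken from the paper.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean to the harmful mean (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove each activation's component along the unit vector `direction`."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-ins for activations; real ones would come from a model.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)
```

If a behavior really is mediated by a single direction, removing that one component should be enough to change it while leaving the rest of the activation untouched. In this toy setup, `ablated @ r_hat` is zero up to floating-point error.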