Topic cluster · 1 item

interpretability

paper

Refusal as a single linear direction

Evidence that model refusals are mediated by one interpretable direction in activation space.

Related topics