Topic cluster · 1 item
interpretability
paper
Refusal as a single linear direction
Evidence that model refusals are mediated by one interpretable direction in activation space.
Related topics
safety (1)