Topic cluster · 6 items
safety
glossary_term
Alignment
Making a model's behavior match human intent and values.
model
Nano-Refuse-0.4B
A tiny safety classifier for fast content filtering.
glossary_term
Hallucination
When a model produces fluent, plausible-sounding output that is false.
paper
Refusal as a single linear direction
Evidence that model refusals are mediated by one interpretable direction in activation space.
paper
Constitutional methods for alignment
Training models to critique and revise their own outputs against principles.
company
Verisight
AI safety and evaluation tooling for production systems.