Topic cluster · 6 items

safety

glossary_term

Alignment

Making a model's behavior match human intent and values.

model

Nano-Refuse-0.4B

A tiny safety classifier for fast content filtering.

glossary_term

Hallucination

When a model produces fluent, plausible-sounding output that is false.

paper

Refusal as a single linear direction

Evidence that model refusals are mediated by one interpretable direction in activation space.

paper

Constitutional methods for alignment

Training models to critique and revise their own outputs against principles.

company

Verisight

AI safety and evaluation tooling for production systems.
