Topic cluster · 6 items

safety

Alignment

Making a model's behavior match human intent and values.

A tiny safety classifier for fast content filtering.

When a model fluently states something false.

Evidence that model refusals are mediated by a single interpretable direction in activation space.

Training models to critique and revise their own outputs against principles.

AI safety and evaluation tooling for production systems.