
paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 18h ago

Online Safety Monitoring for LLMs

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
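As a rough illustration of the thresholding idea in the abstract, the sketch below calibrates an alarm threshold on held-out unsafe runs and then applies it online. The abstract does not state the paper's exact risk function or calibration rule, so everything here is an assumption: the verifier is taken to emit a per-step unsafety score (higher means less safe), the controlled risk is the fraction of unsafe runs on which no alarm fires, and the finite-sample correction follows the conformal-risk-control style bound (k + 1) / (n + 1) <= alpha. The names `calibrate_threshold` and `first_alarm` are hypothetical.

```python
import math

import numpy as np


def calibrate_threshold(unsafe_peak_scores, alpha):
    """Pick the largest alarm threshold whose corrected miss rate is <= alpha.

    unsafe_peak_scores: for each unsafe calibration run, the maximum verifier
    score observed over the run (assumed setup, not taken from the paper).
    A run is "missed" when its peak score stays below the threshold.
    """
    scores = np.sort(np.asarray(unsafe_peak_scores, dtype=float))
    n = len(scores)
    # Using scores[k] as the threshold misses at most k calibration runs.
    # Require the corrected risk (k + 1) / (n + 1) to be at most alpha.
    k = math.floor(alpha * (n + 1) - 1)
    if k < 0:
        # Too few calibration runs for this alpha: alarm on every step.
        return -math.inf
    return float(scores[min(k, n - 1)])


def first_alarm(step_scores, tau):
    """Return the index of the first step whose score reaches tau, else None."""
    for t, score in enumerate(step_scores):
        if score >= tau:
            return t
    return None


# Toy calibration data (made up for illustration only).
peaks = [0.9, 0.5, 0.7, 0.6, 0.8, 0.95, 0.4]
tau = calibrate_threshold(peaks, alpha=0.25)
alarm_step = first_alarm([0.1, 0.3, 0.55, 0.9], tau)
```

With these toy numbers the threshold lands on the second-smallest peak score, so one of the seven calibration runs would be missed and the corrected risk is 2/8. A lower `alpha` pushes the threshold down and makes the monitor alarm earlier, at the cost of more false alarms on safe runs, which this sketch does not measure.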

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author: Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick → Online Safety Monitoring for LLMs

Related to

company: Verisight

Covers

news: IEEE Rolls Out Large Language Models Virtual Training Course

Implements

repo: eval-harness-plus

authored (incoming)

person: Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick


Topics