paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

LLM evaluation and AI safety face a shared measurement problem: benchmark scores, reward-model signals, and reported safety metrics can improve while the latent properties they are meant to represent remain difficult to verify. This paper combines a hybrid survey - a systematic search paired with narrative synthesis and separately tracked grey evidence - with a conceptual framework and a structured ten-model audit. The synthesis spans eight evidence streams: benchmark validity, dynamic evaluation, LLM-as-judge reliability, safety evaluation, jailbreak/refusal robustness, reward hacking, mechan

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Related to

company: Verisight

Covers

news: "Dangerous" AI models are coming no matter what
news: Prompt injection is exploiting enterprise AI's biggest design flaws by targeting agents, RAG pipelines and model routers
news: Investing in multi-agent AI safety research

Implements

repo: eval-harness-plus

Topics

cs.AI