news · Reddit r/MachineLearning · Trust 52 · Community · Published 12h ago · Live · 23m ago

What does "Safe AI" look like? [D]

For open-weight LLMs, how practical is it to study defenses against post-release fine-tuning that weakens refusal or safety behavior? I've been seeing “uncensored” or “heretic” variants of new models appear very quickly after release, which raises a question I’m curious about: is fine-tuning resistance a meaningful safety goal for open-weight releases, or is it too narrow because determined users can always modify weights, switch models, o

Covers

article: The case for open weights
paper: EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures
paper: Online Safety Monitoring for LLMs
paper: Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring
company: Verisight
