Reddit r/MachineLearning · Community · Published 12h ago
What does "Safe AI" look like? [D]
For open-weight LLMs, how practical is it to study defenses against post-release fine-tuning that weakens refusal or safety behavior? I've been seeing “uncensored” or “heretic” variants of new models appear very quickly after release, which raises a question I’m curious about: is fine-tuning resistance a meaningful safety goal for open-weight releases, or is it too narrow because determined users can always modify weights, switch models, o
