Research · Reddit r/MachineLearning · Community · Published 6/30/2026

Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset and weights [R]

Why it matters

This story from Reddit r/MachineLearning is relevant to the Research branch of the AI ecosystem and may affect models, products, or research direction.

Technical breakdown

Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between harmful and harmless activation caches, then project it out of the weight matrices. The problem with vanilla abliteration (as
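The vanilla recipe described above (difference of means, then projecting the direction out of the weights) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the post's norm-preserving variant, whose details are cut off in this excerpt. The function names, the `(d_model, d_in)` weight layout, and the toy activation caches are all assumptions made for the example.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between two activation caches.

    Both inputs are assumed to have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project_out(weight, direction):
    """Remove `direction` from the output space of `weight`.

    `weight` is assumed to have shape (d_model, d_in), i.e. it writes into
    the residual stream. Computes W - r r^T W for a unit vector r.
    """
    return weight - np.outer(direction, direction @ weight)


# Synthetic stand-ins for real activation caches (hypothetical data).
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
true_dir = np.zeros(d_model)
true_dir[3] = 1.0  # planted "refusal" axis for the toy example

harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)

# A toy matrix that writes into the residual stream.
W = rng.normal(size=(d_model, d_in))
W_abl = project_out(W, r_hat)

# After ablation, W_abl cannot write anything along r_hat.
print(np.abs(r_hat @ W_abl).max())
```

Note that plain projection like this generally shrinks the norm of the edited weights, since it deletes one component of every column.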

Business impact

Watch for product launches, funding moves, or policy shifts tied to this headline.