Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset and weights [R]
Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between harmfu
Why it matters
This story from Reddit r/MachineLearning is relevant to the Research branch of the AI ecosystem and may affect models, products, or research direction.
Technical breakdown
Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between harmful and harmless activation caches, then project it out of the weight matrices. The problem with vanilla abliteration (as
Business impact
Watch for product launches, funding moves, or policy shifts tied to this headline.
