Paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · yesterday

The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

Model organisms (MOs) - language models trained to exhibit undesired or unnatural behaviours - are frequently used as testbeds for evaluating white-box interpretability techniques. Current MOs are typically constructed via post-hoc supervised fine-tuning (SFT) on behavioural transcripts or synthetic documents. Prior research has shown that interpretability methods can easily identify hidden behaviours in these MOs. However, recent work suggests that such post-hoc training methods may make interpretability unrealistically easy. We investigate this claim by constructing a suite of 54 $\verb|OLMo
