paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven model
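To make the two API-compatible quantities mentioned above concrete, here is a minimal sketch of a log-odds readout and a leave-one-out attribution for a binary classification task. It is an illustration under stated assumptions, not the paper's implementation: `score_fn` is a hypothetical callable that returns log-probabilities for the two label tokens, `toy_score_fn` is a made-up stand-in for a real API call, and deleting whitespace-separated tokens is only one possible ablation scheme.

```python
import math
from typing import Callable, Dict, List, Tuple

LogProbs = Dict[str, float]


def log_odds(logprobs: LogProbs, pos: str, neg: str) -> float:
    """Scalar readout: log p(pos) - log p(neg) over the two label tokens."""
    return logprobs[pos] - logprobs[neg]


def leave_one_out(
    tokens: List[str],
    score_fn: Callable[[str], LogProbs],
    pos: str,
    neg: str,
) -> List[Tuple[str, float]]:
    """Attribute the log-odds to each token by deleting it and re-scoring.

    A positive value means the token pushed the prediction toward `pos`.
    """
    base = log_odds(score_fn(" ".join(tokens)), pos, neg)
    attributions = []
    for i, tok in enumerate(tokens):
        ablated = tokens[:i] + tokens[i + 1:]
        ablated_score = log_odds(score_fn(" ".join(ablated)), pos, neg)
        attributions.append((tok, base - ablated_score))
    return attributions


def toy_score_fn(text: str) -> LogProbs:
    """Hypothetical stand-in for an API that exposes label log-probabilities.

    NOT a real model: a fixed word-weight lookup passed through a sigmoid.
    """
    weights = {"great": 2.0, "awful": -2.0}
    z = sum(weights.get(word, 0.0) for word in text.split())
    p_pos = 1.0 / (1.0 + math.exp(-z))
    return {"positive": math.log(p_pos), "negative": math.log(1.0 - p_pos)}


if __name__ == "__main__":
    for token, score in leave_one_out(
        ["a", "great", "movie"], toy_score_fn, "positive", "negative"
    ):
        print(f"{token}: {score:+.3f}")
```

Because both functions need only output log-probabilities, the same code could in principle be run against an open surrogate and a closed model, and the resulting scores compared. How the paper itself measures that agreement is not specified in the text above.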
