paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven model
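As a rough illustration of the two API-compatible quantities named above, the sketch below computes a log-odds readout from label-token log-probabilities and a leave-one-out attribution per input token. It is not the paper's implementation: the label tokens `"yes"`/`"no"`, the helper names, and the toy scoring model (`toy_logprobs`, a stand-in for a real API call) are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, List


def log_odds(logprobs: Dict[str, float], pos: str = "yes", neg: str = "no") -> float:
    """Scalar readout: log p(pos) - log p(neg) from output-token log-probabilities."""
    return logprobs[pos] - logprobs[neg]


def leave_one_out(tokens: List[str], score: Callable[[List[str]], float]) -> List[float]:
    """Attribution of token i = score(full input) - score(input with token i removed)."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


# Hypothetical stand-in for a closed model that only returns log-probabilities.
# A real setup would replace this with an API request.
TOY_WEIGHTS = {"great": 2.0, "terrible": -2.0, "not": -0.5}


def toy_logprobs(tokens: List[str]) -> Dict[str, float]:
    z = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    log_norm = math.log1p(math.exp(z))  # log(1 + e^z), normalizer over the two labels
    return {"yes": z - log_norm, "no": -log_norm}


def score(tokens: List[str]) -> float:
    return log_odds(toy_logprobs(tokens))


sentence = ["the", "movie", "was", "great"]
attributions = leave_one_out(sentence, score)
for tok, attr in zip(sentence, attributions):
    print(f"{tok:>8s}  {attr:+.3f}")
```

Because both functions need only output log-probabilities, the same code could in principle be run against an open surrogate and a closed model, and the resulting scores and attributions compared.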

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Covers

news: Is it agentic enough? Benchmarking open models on your own tooling
news: Are there good closed vs open LLM rankings? Also, are 70B–350B models actually worth it?
news: New Server Hopes to Break Through AI’s “Memory Wall”
news: IEEE Rolls Out Large Language Models Virtual Training Course

Explains

tutorial: Evaluate a model properly

Related across the graph

news: New Server Hopes to Break Through AI’s “Memory Wall”
news: Are there good closed vs open LLM rankings? Also, are 70B–350B models actually worth it?
news: IEEE Rolls Out Large Language Models Virtual Training Course
tutorial: Evaluate a model properly
news: Is it agentic enough? Benchmarking open models on your own tooling

Topics

cs.LG