Paper · arXiv · Trust 82 · Primary · Published 3d ago · Live 2d ago
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven model
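The log-odds readout and leave-one-out attributions described above can be illustrated with a minimal sketch. It assumes a hypothetical `score_fn` that returns log-probabilities for two label tokens (here `"yes"` and `"no"`), and it ablates by deleting whitespace-separated tokens. The paper's exact labels, tokenization, and ablation scheme are not given in this excerpt, so `toy_score_fn` is only a stand-in for a real API call.

```python
import math
from typing import Callable, Dict, List

# Hypothetical API shape: prompt -> log-probabilities of candidate label tokens.
LogprobFn = Callable[[str], Dict[str, float]]


def log_odds(logprobs: Dict[str, float], pos: str = "yes", neg: str = "no") -> float:
    """Scalar readout: log p(pos) - log p(neg), computable from output logprobs alone."""
    return logprobs[pos] - logprobs[neg]


def leave_one_out(
    tokens: List[str], score_fn: LogprobFn, pos: str = "yes", neg: str = "no"
) -> List[float]:
    """Attribution of token i = log-odds of the full input minus log-odds with token i deleted."""
    base = log_odds(score_fn(" ".join(tokens)), pos, neg)
    attributions = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + tokens[i + 1:]
        attributions.append(base - log_odds(score_fn(" ".join(ablated)), pos, neg))
    return attributions


def toy_score_fn(prompt: str) -> Dict[str, float]:
    """Stand-in for a closed-model API (assumption): a keyword-weighted two-label scorer."""
    weights = {"great": 2.0, "terrible": -2.0}
    z = sum(weights.get(word, 0.0) for word in prompt.split())
    log_norm = math.log(math.exp(z) + 1.0)  # log-softmax over logits (z, 0)
    return {"yes": z - log_norm, "no": -log_norm}


if __name__ == "__main__":
    tokens = ["the", "movie", "was", "great"]
    for token, score in zip(tokens, leave_one_out(tokens, toy_score_fn)):
        print(f"{token:>8s}  {score:+.3f}")
```

Because both functions only need label log-probabilities, the same code can in principle be pointed at an open surrogate and at a closed model's API, which is what makes the comparison between the two possible at the attribution level.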
