Paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · 21h ago

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-bas

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author: Shayan Talaei

  • Linked via arXiv author: Abhinav Chinta

  • Linked via arXiv author: Devvrit Khatri

  • Linked via arXiv author: Amin Karbasi

  • Linked via arXiv author: Azalia Mirhoseini

  • Linked via arXiv author: Amin Saberi
