Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Language models deployed in high-stakes roles can favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base model on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-bas
Authors (linked via arXiv): Shayan Talaei, Abhinav Chinta, Devvrit Khatri, Amin Karbasi, Azalia Mirhoseini, Amin Saberi
