paper | arXiv | Trust 82 · Primary | Published 2d ago | Live · 21h ago

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-bas
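As a rough illustration of why a bias carried in soft logits can escape text-level inspection, the sketch below compares two next-token distributions that pick the same greedy token yet differ measurably in KL divergence. This is not the paper's method: the four-token vocabulary, the logit values, and the helper names (`softmax`, `kl_divergence`, `argmax`) are all assumptions made up for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    """Index of the largest entry, i.e. the greedy-decoded token."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical next-token logits over a toy 4-token vocabulary.
base_logits = [4.0, 1.0, 0.5, 0.0]
# Assumed "biased" model: token 1 is nudged upward, but not enough to win.
biased_logits = [4.0, 2.0, 0.5, 0.0]

p_base = softmax(base_logits)
p_biased = softmax(biased_logits)

# Greedy decoding emits the same token from both models...
same_greedy_token = argmax(p_base) == argmax(p_biased)
# ...while the soft distributions still differ, which is the kind of
# signal a logit-matching distillation loss would pick up.
soft_gap = kl_divergence(p_biased, p_base)

print("same greedy token:", same_greedy_token)
print("KL(biased || base):", soft_gap)
```

In this toy setting the generated text would look identical under greedy decoding, while a student trained to match the full distribution would still receive the shifted probability mass.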

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arxiv author Shayan Talaei → Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Linked via arxiv author Abhinav Chinta → Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Linked via arxiv author Devvrit Khatri → Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Linked via arxiv author Amin Karbasi → Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Linked via arxiv author Azalia Mirhoseini → Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Linked via arxiv author Amin Saberi → Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Covers

news: A system-level approach to prompt injection: separating instruction and data channels in LLM agents [P]
news: Prompt injection is exploiting enterprise AI's biggest design flaws by targeting agents, RAG pipelines and model routers
news: Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)

authored (incoming)

person: Shayan Talaei
person: Abhinav Chinta
person: Devvrit Khatri
person: Amin Karbasi
person: Azalia Mirhoseini
person: Amin Saberi

Implements (incoming)

repo: chrisliu298/awesome-on-policy-distillation
repo: nick7nlp/Awesome-LLM-On-Policy-Distillation

Covers (incoming)

news: Contrastive Decoding Diffing (CDD): recovering verbatim finetuning data from logits alone, no weight access needed [R]

Related across the graph

person: Azalia Mirhoseini
person: Shayan Talaei
repo: chrisliu298/awesome-on-policy-distillation
news: Prompt injection is exploiting enterprise AI's biggest design flaws by targeting agents, RAG pipelines and model routers
person: Amin Saberi
repo: nick7nlp/Awesome-LLM-On-Policy-Distillation
person: Abhinav Chinta
person: Amin Karbasi
news: Defending against Prompt Injection with Structured Queries (StruQ) and Preference Optimization (SecAlign)
person: Devvrit Khatri
news: A system-level approach to prompt injection: separating instruction and data channels in LLM agents [P]
news: Contrastive Decoding Diffing (CDD): recovering verbatim finetuning data from logits alone, no weight access needed [R]

Topics

cs.CL