news · Reddit r/MachineLearning · Trust 52 · Community · Published 4h ago · Live · 1h ago
Contrastive Decoding Diffing (CDD): recovering verbatim finetuning data from logits alone, no weight access needed [R]
We built a model diffing method that recovers verbatim content from narrowly finetuned LLMs using only grey-box logit access (no weights, no activations, no probe corpus). Recent work (Minder, Dumas et al., "Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences") showed that finetuning leaves detectable traces in activation differences between base and finetuned models. Their method, Activation Difference Lens (ADL), steers
Covers
paper: Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
paper: Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads
paper: Understanding Evaluation Illusion in Diffusion Large Language Models
paper: DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks
paper: On the Role of Directionality in Structural Generalization
