paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · yesterday

Condensing Large-Scale Datasets Directly with Minimal Information Loss

Recent advancements in scaling dataset distillation rely heavily on decoupled information extraction pipelines, comprising SQUEEZE, RECOVER, and RELABEL stages. Despite their scalability to large-scale datasets, these methods suffer from prohibitive computational overhead and poor cross-architecture generalization. In this paper, we reveal the root cause of these bottlenecks: the implicit dual-compression process, from data to model and back to images, inherently induces severe information loss. Crucially, we empirically and theoretically demonstrate that this loss creates a distribution shift

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Related to

glossary_term · Quantization

Covers

news · What if context compression is a diffusion noise function? Proposal + honest results from untrained-model experiments [R]
news · DeepSeek open-sources inference optimizations with 60–85% faster generation [pdf]

Topics

cs.CV