Paper · arXiv · Trust 82 · Primary · Published 2d ago · Live · yesterday
Condensing Large-Scale Datasets Directly with Minimal Information Loss
Recent advancements in scaling dataset distillation rely heavily on decoupled information extraction pipelines, comprising SQUEEZE, RECOVER, and RELABEL stages. Despite their scalability to large-scale datasets, these methods suffer from prohibitive computational overhead and poor cross-architecture generalization. In this paper, we reveal the root cause of these bottlenecks: the implicit dual-compression process, from data to model and back to images, inherently induces severe information loss. Crucially, we empirically and theoretically demonstrate that this loss creates a distribution shift
