arXiv paper · Published yesterday

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at one granularity; changing the resolution rebuilds the labels. We argue the bottleneck is the label system, not the mixer, and provide a hierarchical one. HERMES is a data-derived labeling substrate: a Learned Semantic Transform followed by 3-stage residual vector quantization annotates each document once into a coarse-
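To make the quantization step concrete, here is a minimal sketch of 3-stage residual vector quantization over document embeddings. It is not the authors' implementation. The abstract does not specify the Learned Semantic Transform, so this sketch assumes embeddings are already computed and uses random vectors in their place. The codebook size of 4 per stage, the plain k-means fitting, and the names `kmeans`, `nearest`, `fit_rvq`, and `encode_rvq` are all hypothetical choices made for illustration.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) codebook."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distances from every point to every center: shape (n, k).
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def nearest(x, codebook):
    """Index of the closest codebook entry for each row of x."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def fit_rvq(x, sizes=(4, 4, 4), seed=0):
    """Fit one codebook per stage, each on the residual of the previous stage."""
    codebooks = []
    residual = x.copy()
    for stage, k in enumerate(sizes):
        cb = kmeans(residual, k, seed=seed + stage)
        residual = residual - cb[nearest(residual, cb)]
        codebooks.append(cb)
    return codebooks


def encode_rvq(x, codebooks):
    """Return an (n, stages) array of codes and the final residual."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        idx = nearest(residual, cb)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=1), residual


# Stand-in for transformed document embeddings (assumption: 200 docs, 8 dims).
rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 8))

codebooks = fit_rvq(docs)
labels, residual = encode_rvq(docs, codebooks)
print(labels[:3])  # one 3-part code per document
```

Each document is encoded once and receives one code per stage. One plausible way to read such codes at different granularities is by prefix: `labels[:, :1]` groups documents by the first-stage code only, while `labels[:, :2]` and the full row split those groups further without re-labeling the corpus.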

Authors: Ziyun Qiao, Yue Min, Ruining Chen, Yujun Li

Topics

cs.AI