paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 13h ago

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at one granularity; changing the resolution rebuilds the labels. We argue the bottleneck is the label system, not the mixer, and provide a hierarchical one. HERMES is a data-derived labeling substrate: a Learned Semantic Transform followed by 3-stage residual vector quantization annotates each document once into a coarse-
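As a rough illustration of the residual vector quantization step named in the abstract, the sketch below encodes one vector through three stages, each stage quantizing what the previous stage left over. It is not the authors' implementation: the Learned Semantic Transform is omitted, and the 2-D embedding, the toy codebook values, and the names `nearest`, `rvq_encode`, and `CODEBOOKS` are all hypothetical. Reading a prefix of the resulting code as a coarser label is likewise an assumption about how such codes are commonly used.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def dist2(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[int, ...]:
    """Quantize x stage by stage; each stage encodes the previous stage's residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Hypothetical toy codebooks, one per stage (values chosen for illustration only).
CODEBOOKS = [
    [(0.0, 0.0), (10.0, 10.0)],   # stage 1: large-scale structure
    [(-2.0, -2.0), (2.0, 2.0)],   # stage 2: refinement of the stage-1 residual
    [(-0.5, 0.5), (0.5, -0.5)],   # stage 3: refinement of the stage-2 residual
]

doc_embedding = (11.6, 12.4)      # stand-in for a transformed document embedding
label = rvq_encode(doc_embedding, CODEBOOKS)
print(label)        # full three-stage code
print(label[:1])    # first stage only, read here as a coarser label (assumption)
```

Because every stage works on a residual, a document is encoded once and the same code can be truncated to fewer stages without re-running the quantizer.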

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arxiv author Ziyun Qiao

    HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

  • Linked via arxiv author Yue Min

    HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

  • Linked via arxiv author Ruining Chen

    HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

  • Linked via arxiv author Yujun Li

    HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures
