
Ruining Chen

Ruining Chen is a researcher or builder tracked in the Angestrom contributor network.

4 Connections

1 Papers

0 Models

0 Repos

0 News

Papers · 1

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at one granularity; changing the resolution rebuilds the labels. We argue the bottleneck is the label system, not the mixer, and provide a hierarchical one. HERMES is a data-derived labeling substrate: a Learned Semantic Transform followed by 3-stage residual vector quantization annotates each document once into a coarse-