paper · arXiv · Trust 82 · Primary · Published 5d ago · Live · 3d ago

Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies

While existing data attribution methods can identify which training examples build specific mechanistic circuits, they cannot explain how training data shapes the high-level behavioral decisions a model learns to make. To bridge this gap, we introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes training pairs to the interpretable symbolic policies governing model behavior. SMDA fits a closed-form Ridge regression over sparse autoencoder (SAE) features to model a target behavior, then analytically decomposes how each supervised fine-tuning example shifts that polic
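The closed-form Ridge fit mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the random matrix below merely stands in for SAE feature activations, and the function name `fit_ridge_closed_form`, the regularization strength `lam`, and the synthetic target are all assumptions made for illustration. The sketch uses the standard ridge solution w = (XᵀX + λI)⁻¹Xᵀy without an intercept term.

```python
import numpy as np


def fit_ridge_closed_form(features, targets, lam=1.0):
    """Solve the ridge normal equations (X^T X + lam * I) w = X^T y."""
    n_features = features.shape[1]
    gram = features.T @ features + lam * np.eye(n_features)
    # Solving the linear system is more stable than forming an explicit inverse.
    return np.linalg.solve(gram, features.T @ targets)


# Synthetic stand-in for SAE feature activations (assumption, not real data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))

# Hypothetical "behavior" that depends on only two of the eight features.
true_weights = np.zeros(8)
true_weights[[1, 5]] = [2.0, -1.5]
targets = features @ true_weights + 0.01 * rng.normal(size=200)

weights = fit_ridge_closed_form(features, targets, lam=1e-3)
print(np.round(weights, 2))
```

Because the solution is available in closed form, the fitted weights are an explicit function of the data, which is the property that makes an analytic treatment of the fit possible in principle.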

Topics

cs.CL