paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics

Matrix factorization (i.e., problems of the form $\min_{\mathbf{P},\mathbf{Q}} \|\mathbf{M}^\star - \mathbf{P}^\top\mathbf{Q}\|_\mathrm{F}^2$) is a minimal learning problem that exhibits both nonlinear parameter dynamics and representation learning. In this setting, we study how parameter trajectories under the Muon optimizer differ from those of gradient descent. We identify three main dynamical differences: 1) Muon avoids the slow saddle-to-saddle dynamics from small initialization. Muon instead learns all the top modes of $\mathbf{M}^\star$ at the same rate, with the smaller modes convergin
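As a rough illustration of the setting described above, the following Python sketch writes out the factorization loss, its gradients, and two update rules: a plain gradient descent step and an orthogonalized step in the spirit of Muon. Several things here are assumptions made for illustration and are not taken from the paper: the shapes ($\mathbf{P}$ is $r \times m$, $\mathbf{Q}$ is $r \times n$), the function names, the absence of momentum, and the use of an exact SVD polar factor in place of the approximate orthogonalization that practical Muon implementations typically use.

```python
import numpy as np

def loss(M, P, Q):
    """Squared Frobenius loss ||M - P^T Q||_F^2."""
    R = M - P.T @ Q
    return float(np.sum(R ** 2))

def grads(M, P, Q):
    """Gradients of the loss with respect to P (r x m) and Q (r x n)."""
    R = M - P.T @ Q                  # residual, shape (m, n)
    grad_P = -2.0 * Q @ R.T          # shape (r, m)
    grad_Q = -2.0 * P @ R            # shape (r, n)
    return grad_P, grad_Q

def orthogonalize(G):
    """Replace every nonzero singular value of G by 1 (polar factor via SVD).

    Assumption: an exact SVD stands in for the iterative approximation
    used by practical Muon implementations.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def gd_step(M, P, Q, lr):
    """One plain gradient descent step."""
    grad_P, grad_Q = grads(M, P, Q)
    return P - lr * grad_P, Q - lr * grad_Q

def orthogonalized_step(M, P, Q, lr):
    """One momentum-free, Muon-style step (hypothetical simplification)."""
    grad_P, grad_Q = grads(M, P, Q)
    return P - lr * orthogonalize(grad_P), Q - lr * orthogonalize(grad_Q)
```

The design point this sketch tries to expose is that the orthogonalized step discards the singular values of the gradient, so every direction in the update has the same magnitude, whereas the gradient descent step scales each direction by how strongly it currently appears in the gradient.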

Topics

cs.LG