paper · arXiv · Trust 82 · Primary · Published 8d ago · Live · 7d ago

Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization

Muon-type optimizers construct update directions for dense neural-network weights by applying a finite Newton-Schulz map to momentum-gradient matrices. For an $H \times W$ matrix, with $r=\min\{H,W\}$ and $s=\max\{H,W\}$, $K$ steps of the full-matrix Newton-Schulz update require $O(r^2 s K)$ work and couple all rows and columns through repeated Gram matrix products. We introduce Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimization. HiMuon partitions each momentum-gradient matrix into $T \times T$ tiles, applies the same finite Newton-Schulz map independently to e
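The tiling idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it reads "$T \times T$ tiles" as a $T \times T$ grid of near-equal blocks, it uses the classical cubic Newton-Schulz iteration $X \leftarrow 1.5X - 0.5XX^\top X$ in place of whichever finite map the authors use, and it simply writes each processed tile back in place. The names `newton_schulz` and `tiled_newton_schulz` are hypothetical.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Classical cubic Newton-Schulz iteration (an assumed stand-in map).

    Drives the singular values of G toward 1, approximating the
    orthogonal polar factor of G.
    """
    G = np.asarray(G, dtype=float)
    # Frobenius normalization keeps every singular value at or below 1,
    # inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so the Gram matrix is r x r
    for _ in range(steps):
        A = X @ X.T              # r x r Gram matrix
        X = 1.5 * X - 0.5 * A @ X
    return X.T if transposed else X

def tiled_newton_schulz(G, T, steps=5):
    """Apply newton_schulz independently to each block of a T x T grid.

    Assumes T <= min(G.shape) so that no tile is empty.
    """
    G = np.asarray(G, dtype=float)
    H, W = G.shape
    row_edges = np.linspace(0, H, T + 1).astype(int)
    col_edges = np.linspace(0, W, T + 1).astype(int)
    out = np.empty_like(G)
    for i in range(T):
        for j in range(T):
            rows = slice(row_edges[i], row_edges[i + 1])
            cols = slice(col_edges[j], col_edges[j + 1])
            # Each tile is processed on its own: no Gram product couples tiles.
            out[rows, cols] = newton_schulz(G[rows, cols], steps)
    return out

# Usage: an 8 x 12 momentum-gradient stand-in split into a 2 x 2 grid of 4 x 6 tiles.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 12))
update = tiled_newton_schulz(G, T=2, steps=5)
```

Under the grid assumption, each tile has dimensions of roughly $r/T$ and $s/T$, so one tile costs $O((r/T)^2 (s/T) K)$ and all $T^2$ tiles together cost $O(r^2 s K / T)$, compared with $O(r^2 s K)$ for the full matrix. Because the tiles share no data, the double loop could also be batched or run in parallel.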
