paper · arXiv
Scaling laws for mixture-of-experts models
How sparse expert routing changes the compute-optimal frontier for large models.