paper · arXiv

Scaling laws for mixture-of-experts models

How sparse expert routing changes the compute-optimal frontier for large models.
