New benchmark exposes reasoning gaps in top models

A harder evaluation suite shows even leading models struggle on multi-hop tasks.
