New benchmark exposes reasoning gaps in top models
A harder evaluation suite shows even leading models struggle on multi-hop tasks.