The state of agent benchmarks, 2026

Benchmarks have an evergreen problem: the ones that get cited are the ones that are easy to run, and the ones that are easy to run measure something narrower than the question they're asked to answer.

What current benchmarks tell you, and don't

Task-completion suites measure whether the agent finishes a defined task. They tell you very little about whether it would have stopped, asked, or escalated when the task was the wrong one.
Tool-use accuracy measures whether the right API was called with the right arguments. It does not measure what happens when the right API doesn't exist.
Adversarial robustness measures whether the agent resists a known attack class. It does not measure whether it stays sane under benign-but-unexpected input.

The benchmarks worth running alongside the cited ones are the boring ones: graceful-degradation rate, refusal-precision, and time-to-escalate.