The state of agent benchmarks, 2026
A field guide to the benchmarks people cite, the benchmarks people ought to cite, and the gap between what they measure and what matters in production.
- Manzia Editorial
Benchmarks have an evergreen problem: the ones that get cited are the ones that are easy to run, and the ones that are easy to run measure something narrower than the question they're asked to answer.
What current benchmarks tell you, and don't
- Task-completion suites measure whether the agent finishes a defined task. They tell you very little about whether it would have stopped, asked, or escalated when the task was the wrong one.
- Tool-use accuracy measures whether the right API was called with the right arguments. It does not measure what happens when the right API doesn't exist.
- Adversarial robustness measures whether the agent resists a known attack class. It does not measure whether the agent still behaves sensibly under benign-but-unexpected input.
The benchmarks worth running alongside the cited ones are the boring ones: graceful-degradation rate, refusal precision, and time-to-escalate.
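None of these three metrics has a standard definition, so the sketch below fixes one plausible reading of each to show how little machinery they need. Everything in it is an assumption made for illustration: the `Episode` record, its field names, the sample data, and the definitions themselves (graceful-degradation rate as the share of fault-injected runs that ended in an acceptable fallback, refusal precision as the share of refusals that were warranted, and time-to-escalate as the median delay among runs that escalated).

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Episode:
    """One evaluated agent run. All fields are hypothetical labels."""
    fault_injected: bool          # harness broke something (e.g. removed a tool)
    degraded_gracefully: bool     # agent asked, escalated, or reported partial work
    refused: bool                 # agent declined the task
    refusal_warranted: bool       # a reviewer judged the refusal correct
    seconds_to_escalate: Optional[float]  # None if the agent never escalated


def graceful_degradation_rate(episodes: List[Episode]) -> Optional[float]:
    """Share of fault-injected runs that ended in an acceptable fallback."""
    faulted = [e for e in episodes if e.fault_injected]
    if not faulted:
        return None
    return sum(e.degraded_gracefully for e in faulted) / len(faulted)


def refusal_precision(episodes: List[Episode]) -> Optional[float]:
    """Share of refusals that were judged warranted."""
    refusals = [e for e in episodes if e.refused]
    if not refusals:
        return None
    return sum(e.refusal_warranted for e in refusals) / len(refusals)


def median_time_to_escalate(episodes: List[Episode]) -> Optional[float]:
    """Median seconds to escalation, over runs that escalated at all."""
    times = [e.seconds_to_escalate for e in episodes
             if e.seconds_to_escalate is not None]
    return median(times) if times else None


# Made-up sample data, for illustration only.
EPISODES = [
    Episode(True, True, False, False, 12.0),
    Episode(True, False, False, False, None),
    Episode(False, False, True, True, None),
    Episode(False, False, True, False, None),
    Episode(True, True, True, True, 30.0),
    Episode(True, True, False, False, 20.0),
]

if __name__ == "__main__":
    print("graceful-degradation rate:", graceful_degradation_rate(EPISODES))
    print("refusal precision:", refusal_precision(EPISODES))
    print("median time-to-escalate (s):", median_time_to_escalate(EPISODES))
```

The hard part is not the arithmetic but the labels: someone has to decide which fallbacks count as graceful and which refusals were warranted, and that judgment is exactly what easy-to-run suites skip.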
Author
Manzia Editorial, Editorial team
The Manzia editorial team curates research, frameworks, and field reports on building, deploying, and benchmarking Trusted Agents.