Measuring agent reliability in production
Offline eval suites tell you whether your agent is good on the problems you thought to write down. Production telemetry tells you whether it's good on the problems you didn't.
- Manzia Editorial
Pre-deployment eval suites are necessary but not sufficient. They measure the agent against the problems an engineer thought to enumerate; they cannot measure it against the long tail it will actually see.
Three telemetry signals worth instrumenting
- Tool-call confidence drift. Track the distribution of the model's self-reported confidence on tool calls over time. A widening lower tail precedes regressions by days.
- Human-confirmation override rate. When a human is gating an irreversible action, how often do they reject the agent's plan? A rate that stays flat over time is fine; a rising one means the agent is silently getting worse.
- Replay divergence. Periodically replay a fixed slice of recent traces through the current model. If outputs diverge from the original handling, you have a behavior-change signal independent of any benchmark.
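As a rough illustration, the three signals above could be computed from logged traces along these lines. This is a minimal sketch rather than a reference implementation: the function names, the nearest-rank 5th percentile used to summarize the lower tail, and the exact-match comparison for replays are all assumptions made for brevity. Real agent outputs would likely need a semantic or structural comparison instead of string equality.

```python
import math

def lower_tail(confidences, pct=5):
    """Nearest-rank pct-th percentile of self-reported tool-call confidences.

    Hypothetical summary of the "lower tail": if this value falls over
    successive time windows, the tail is widening.
    """
    ordered = sorted(confidences)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def override_rate(rejections):
    """Fraction of gated actions where the human rejected the agent's plan.

    `rejections` is an assumed list of booleans, True meaning "rejected".
    """
    return sum(rejections) / len(rejections)

def replay_divergence(original_outputs, replayed_outputs):
    """Fraction of replayed traces whose output differs from the original.

    Assumes both lists are aligned by trace and uses exact equality,
    which is a deliberate simplification.
    """
    pairs = list(zip(original_outputs, replayed_outputs))
    return sum(1 for old, new in pairs if old != new) / len(pairs)

# Example with made-up values for one time window:
window_tail = lower_tail([0.9] * 19 + [0.2])
window_overrides = override_rate([True, False, False, False])
window_divergence = replay_divergence(["a", "b", "c", "d"], ["a", "x", "c", "d"])
```

Each function returns a single number per time window, so the result is a time series per signal rather than a one-off score.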
None of these is a "score." Each is a leading indicator you can alarm on.
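One simple way to alarm on a leading indicator is to compare a recent window against the window just before it. The sketch below assumes a per-day series for a single signal; the function name `should_alarm`, the window sizes, and the tolerance are hypothetical defaults, not recommended values.

```python
from statistics import mean

def should_alarm(series, baseline_n=7, recent_n=3, tolerance=0.05,
                 higher_is_worse=True):
    """Return True when the recent mean has moved in the bad direction
    by more than `tolerance` relative to the preceding baseline window.

    Set higher_is_worse=False for signals where a drop is the problem,
    such as the lower tail of tool-call confidence.
    """
    if len(series) < baseline_n + recent_n:
        return False  # not enough history to compare
    baseline = mean(series[-(baseline_n + recent_n):-recent_n])
    recent = mean(series[-recent_n:])
    shift = recent - baseline if higher_is_worse else baseline - recent
    return shift > tolerance

# Made-up daily override rates: flat for a week, then rising.
flat = [0.10] * 10
rising = [0.10] * 7 + [0.20, 0.22, 0.25]
flat_alarm = should_alarm(flat)
rising_alarm = should_alarm(rising)
```

A fixed tolerance is the bluntest possible rule; it is used here only to show the shape of the check, and the threshold would need tuning against each signal's normal variation.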
Author
Manzia Editorial, Editorial team
The Manzia editorial team curates research, frameworks, and field reports on building, deploying, and benchmarking Trusted Agents.