Measuring agent reliability in production

Pre-deployment eval suites are necessary but not sufficient. They measure the agent against the problems an engineer thought to enumerate; they cannot measure it against the long tail of inputs it will actually see in production.

Three telemetry signals worth instrumenting

Tool-call confidence drift. Track the distribution of the model's self-reported confidence on tool calls over time. A widening lower tail, meaning a growing share of calls made at low confidence, tends to precede visible regressions by days.
Human-confirmation override rate. When a human is gating an irreversible action, how often do they reject the agent's plan? A flat curve is fine; a rising one suggests the agent is silently getting worse.
Replay divergence. Periodically replay a fixed slice of recent traces through the current model. If outputs diverge from the original handling, you have a behavior-change signal independent of any benchmark.
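
As a rough illustration, the three signals above could each be computed over one window of traces (an hour, a day) along the following lines. This is a minimal sketch, not a reference implementation: the Trace record, its field names, the 5th-percentile choice, and the run_agent callable are all hypothetical, and exact string equality stands in for whatever comparison of outputs makes sense for your agent.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Callable, Optional, Sequence


@dataclass
class Trace:
    """One recorded agent step. All field names here are hypothetical."""
    prompt: str
    output: str
    confidence: float                      # self-reported confidence in [0, 1]
    human_rejected: Optional[bool] = None  # None when no human gate was involved


def lower_tail_confidence(traces: Sequence[Trace], pct: int = 5) -> float:
    """Return the pct-th percentile of self-reported confidence in this window.

    Needs at least two traces. A value that falls from window to window
    means the lower tail of the distribution is widening.
    """
    values = [t.confidence for t in traces]
    # quantiles(n=100) returns the 99 cut points p1..p99; index pct - 1 is p{pct}.
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def override_rate(traces: Sequence[Trace]) -> float:
    """Return the share of human-gated actions where the reviewer rejected the plan."""
    gated = [t for t in traces if t.human_rejected is not None]
    if not gated:
        return 0.0
    return sum(t.human_rejected for t in gated) / len(gated)


def replay_divergence(
    traces: Sequence[Trace],
    run_agent: Callable[[str], str],
    same: Callable[[str, str], bool] = lambda old, new: old == new,
) -> float:
    """Return the share of replayed traces whose new output differs from the recorded one.

    run_agent is an assumed callable wrapping the current model. The default
    comparator is exact string equality, which is crude; pass a looser one
    (for example, comparing tool names and arguments) via `same`.
    """
    if not traces:
        return 0.0
    diverged = sum(1 for t in traces if not same(t.output, run_agent(t.prompt)))
    return diverged / len(traces)
```

Each function returns a single number per window, so the result can be emitted as an ordinary time-series metric. For non-deterministic models the default comparator will overstate divergence; a task-specific notion of "same handling" is likely needed in practice.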

None of these is a "score." Each is a leading indicator you can alarm on.
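
To make "alarm on" concrete, one simple rule is to compare the most recent few windows of a signal against its earlier baseline and fire when the gap exceeds a tolerance. The sketch below assumes a per-window series such as daily override rates; the function name, the window count, and the 0.05 tolerance are illustrative assumptions, not recommended values.

```python
from statistics import mean
from typing import Sequence


def should_alarm(series: Sequence[float], recent: int = 3, max_rise: float = 0.05) -> bool:
    """Fire when the mean of the last `recent` points exceeds the mean of all
    earlier points by more than `max_rise`.

    Both defaults are placeholder values chosen for illustration only.
    """
    if len(series) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(series[:-recent])
    current = mean(series[-recent:])
    return current - baseline > max_rise


# Hypothetical daily override rates: flat for four days, then rising.
daily_override_rate = [0.10, 0.10, 0.11, 0.10, 0.18, 0.20, 0.22]
alarm = should_alarm(daily_override_rate)
```

The rule fires on a rise, which suits override rate and replay divergence directly. For a signal where a fall is the bad direction, such as a low percentile of confidence, the same check can be applied to the negated series.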