Measuring agent reliability in production
Offline eval suites tell you whether your agent is good on the problems you thought to write down. Production telemetry tells you whether it's good on the problems you didn't.