Tag
#evaluation
3 articles.
2026
- · Design
The Year We Taught a Machine to Tutor
A team built an ambitious AI tutor, watched it slowly degrade under its own safeguards, paused the pilot, and rebuilt it twice — the stubbornly simple rebuild won, decisively.
- · Evaluate
Measuring agent reliability in production
Offline eval suites tell you whether your agent is good on the problems you thought to write down. Production telemetry tells you whether it's good on the problems you didn't.
- · Benchmark
The state of agent benchmarks, 2026
A field guide to the benchmarks people cite, the benchmarks people ought to cite, and the gap between what they measure and what matters in production.