The Year We Taught a Machine to Tutor

How an AI tutor was built, broke, and was rebuilt

February – May 2026.

Overview

In one line: A team built an ambitious AI tutor, watched it slowly degrade under its own safeguards, paused the pilot, and rebuilt it twice — the stubbornly simple rebuild won, decisively.

In early 2026, a small team set out to build an AI tutor for secondary-school students in Maths and Geography. By spring they had something genuinely good — good enough for real classrooms. Then, over a few weeks, it quietly got worse. Not because anyone broke it, but because of how they tried to improve it: every mistake earned another safety net, until the nets tangled and fought each other. The pilot was paused. In late May, the team tried two ways of starting over — one clever and ambitious, one almost stubbornly simple. The simple one won. This is how that happened, and what it teaches.

Act One — The fast start (February to April)

The essence: They built fast and built a lot — and optimized for everything the tutor could do instead of the few things it could never get wrong.

The project began on 1 February 2026 as a blank slate, with a worthy goal: a patient, always-available tutor that could walk students through their real curriculum, check answers, and show teachers how their classes were doing.

In about ten weeks it grew from prototype to platform. It read teachers' uploaded textbooks and worksheets — even scanned pages and diagrams — and turned them into lessons. It generated quizzes, drew figures, spoke aloud, ran on a phone with patchy internet, tracked each student against learning objectives, and gave teachers a dashboard. It even had points and streaks. By March and April it was functionally good, and the team launched a real classroom pilot.

But the seed of the trouble was inside that sprint. The feature list kept growing: in a single week at the end of April the team shipped offline support, a mobile app, exams, a competency tracker, an in-app help assistant, and email verification, sometimes shipping a dozen things a day. Each was reasonable alone. Together they meant that the part that mattered most, the actual tutoring conversation, was getting harder to reason about. The single piece of code running that conversation had swollen past twelve thousand lines.

The seed: the team optimized for capability — what the tutor could do — when success would be decided by reliability — whether it could be trusted never to make certain mistakes.

Act Two — The slow slide (the spiral of safeguards)

The essence: The tutor degraded because of the team's fixes. Stacking AI checkers onto an AI failed — the checkers shared the same blind spots and made the system rigid. So the team paused the pilot, and reframed the three "unacceptable errors" as the entire spec.

The uncomfortable truth: the tutor degraded not in spite of the team's efforts, but because of them.

The key idea is the unacceptable error. Most mistakes are survivable — a clunky explanation, a bit of repetition, and the student moves on. But three are trust-destroying:

1. Telling a student their correct answer is wrong.
2. Telling a student their wrong answer is correct.
3. Asking a question that can't be answered, because needed information is missing from what the student can see.

A tutor that does any of these, even rarely, is worse than no tutor. A child told "that's wrong" when they were right learns to distrust themselves and the tool. These are the villains of the story.

The original tutor ran on one giant brief — picture a 460-line document handed to a single AI, telling it to do everything: run the conversation, judge answers, choose what to teach, stay safe, stay on topic, use the textbook, and never make the three fatal mistakes. One AI doing the job of a whole staffroom.

When that overloaded AI slipped, the instinct was natural: add a checker. A second AI to review the first. Then another, and another: one for facts, one for arithmetic, one for coherence, and one to re-run the answer and vote on the best version. Safeguards on safeguards. Two deep problems followed, and they are the heart of the report:

Problem one — the checkers shared the tutor's blind spots. The reviewers ran on the same technology as the tutor. Asking them to catch its mistakes was like asking someone to proofread their own writing: they missed the same things. When the tutor confidently called a wrong answer right, the reviewer often confidently agreed. The nets were woven from the same thread that was tearing.

Problem two — every safeguard made the system more rigid. More gates and checkpoints sat between a student's message and the reply, making the conversation mechanical — the opposite of the warm, adaptive experience that makes tutoring valuable. The machine was so busy checking itself it had forgotten how to teach.
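The first problem can be put in rough numbers. The sketch below is a back-of-the-envelope model, not a measurement from the project: the function name, its parameters, and every figure in the usage lines are illustrative assumptions. It assumes some share of the tutor's errors (`overlap`) falls in a blind spot the checker shares and is therefore always missed, while the remaining errors are missed at the checker's ordinary miss rate.

```python
def residual_error_rate(tutor_error_rate: float,
                        checker_miss_rate: float,
                        overlap: float) -> float:
    """Fraction of replies that are wrong AND slip past the checker.

    overlap is the assumed share of tutor errors that land in a blind
    spot the checker shares; those errors are always missed. The rest
    are missed at checker_miss_rate.
    """
    for value in (tutor_error_rate, checker_miss_rate, overlap):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all rates must be between 0 and 1")
    missed_share = overlap + (1.0 - overlap) * checker_miss_rate
    return tutor_error_rate * missed_share


# Hypothetical numbers: a tutor wrong 5% of the time, a checker that
# misses 10% of the errors it is capable of seeing.
independent = residual_error_rate(0.05, 0.10, overlap=0.0)
shared = residual_error_rate(0.05, 0.10, overlap=0.8)
print(f"independent checker: {independent:.3f}")  # 0.005
print(f"shared blind spots:  {shared:.3f}")       # 0.041
```

Under these assumed numbers, an independent checker would cut the error rate tenfold, while one sharing most of the tutor's blind spots barely moves it. The exact figures are invented; the shape of the result is the point.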

By early May the degradation was real, and the team made the hard, correct call: they paused the pilot. The trigger was not a noisy dashboard but the fact that the rarest, most damaging mistakes wouldn't stay fixed.

That pause reframed everything. The failed pilot wasn't a setback; it was evidence — an expensive measurement of exactly what was wrong. The three unacceptable errors weren't edge cases to patch. They were the entire specification. Anything that didn't help eliminate them was a distraction.

Act Three — Two roads out (late May)

The essence: Both rebuilds started on the same day from the same fix — stop asking an AI to do what plain code can do reliably. The ambitious one rebuilt the old disease in a prettier form; the simple one shrank the AI's job and worked.

With the old design discredited, the team did something wise: instead of one bet, they pursued two rebuilds in parallel, both starting on 25 May from the same hard-won insight — the most transferable lesson of all:

Stop asking one AI to do a job a simple program can do reliably.

Grading is the clearest case. Nine in ten curriculum questions are multiple-choice or single-number. Checking whether a student typed "B" or "42" needs no large, fallible, expensive AI — an ordinary, boring, utterly reliable piece of code does it instantly, the same way every time. The old design had handed that deterministic job to the AI, which sometimes got it wrong. Both rebuilds took correctness-checking away from the conversational AI and gave it to a dedicated checker — real arithmetic for maths, trusted course material for the rest — that was allowed to say "I'm not sure" rather than bluff.
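A minimal sketch of what such a deterministic grader might look like is shown here. It is not the team's code: the names `Verdict`, `grade_choice`, and `grade_number`, and the numeric tolerance, are assumptions made for illustration. It shows the two properties the paragraph describes: the same input always gets the same verdict, and anything the grader cannot parse is returned as "unsure" instead of being guessed at.

```python
import math
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    UNSURE = "unsure"  # the grader declines to judge rather than bluff


def grade_choice(student_answer: str, correct_option: str,
                 options: Sequence[str]) -> Verdict:
    """Grade a multiple-choice reply by exact match on the option label."""
    answer = student_answer.strip().upper()
    labels = {option.strip().upper() for option in options}
    if answer not in labels:
        # Not a recognisable option (e.g. "B or C"): hand off, don't guess.
        return Verdict.UNSURE
    if answer == correct_option.strip().upper():
        return Verdict.CORRECT
    return Verdict.INCORRECT


def grade_number(student_answer: str, expected: float,
                 tolerance: float = 1e-9) -> Verdict:
    """Grade a single-number reply with real arithmetic."""
    try:
        value = float(student_answer.strip())
    except ValueError:
        return Verdict.UNSURE  # e.g. "forty-two"
    if not math.isfinite(value):
        return Verdict.UNSURE  # "nan" and "inf" parse as floats; reject them
    if abs(value - expected) <= tolerance:
        return Verdict.CORRECT
    return Verdict.INCORRECT


print(grade_choice(" b ", "B", ["A", "B", "C", "D"]))  # Verdict.CORRECT
print(grade_number("41", 42))                          # Verdict.INCORRECT
print(grade_number("forty-two", 42))                   # Verdict.UNSURE
```

In a design like this, only the `UNSURE` cases would ever need to reach a slower or less predictable fallback.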

That was the agreement. Here is where the roads diverged.

Road one — the ambitious rebuild ("refactor")

Architecturally rigorous. It split the tutor into specialists: a grader, an AI "router" choosing the next teaching move (hint? worked example? name the misconception?), the tutor that talked, and a battery of automatic checks every response had to pass. A beautiful design — every piece reasoned and documented.

But evaluation after evaluation, the unacceptable errors kept returning, fixed in one run and resurfacing in another. The team's own notes after the seventh run say it plainly:

"After 7 runs we are just going round in circles… we have implemented so many 'fixes' but we still have [unacceptable] errors, in fact they have steadily increased… We obviously have the wrong design for this Tutor."

This is the turning point. Despite the right diagnosis, the ambitious rebuild had recreated the original disease — layers checking layers, gates controlling the flow, complexity where fixing one thing broke another. Setting out to escape the maze, they'd built a more elegant maze. The questions they started asking — do we even need this step? this gate? this loop? — were the sound of a team realizing the verb was never add. It was remove.

Road two — the stubbornly simple rebuild ("dev")

The second rebuild showed a different temperament from the very first line written down:

"Stay simple enough that a new contributor can read it in one sitting."

No team of specialist AIs. Essentially one AI call ran the conversation, with a small set of clear tools (record an answer, advance a step, show a figure, redirect off-topic chat) — and plain, deterministic code owned everything that didn't truly need intelligence. Grading ran programmatically first, with the AI consulted only for genuinely ambiguous cases. The current question was tracked by the server, not improvised by the AI. Far fewer moving parts, far fewer of the tangled safeguards that sank the original.
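The division of labour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's implementation: `Session`, `record_answer`, `advance_step`, `TOOLS`, and `handle_tool_call` are hypothetical names, only two of the four tools are shown, and the AI's tool request is simulated as a plain function call instead of going through any real model API. The point is that the server holds the current question and decides whether a requested move is allowed.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Server-side state: the AI never decides which question is open."""
    question_ids: list[str]  # fixed, pre-vetted sequence of question ids
    position: int = 0
    answers: dict[str, str] = field(default_factory=dict)

    @property
    def current_question(self) -> str:
        return self.question_ids[self.position]


def record_answer(session: Session, answer: str) -> str:
    # The answer is always filed against the server's current question.
    session.answers[session.current_question] = answer
    return f"recorded answer for {session.current_question}"


def advance_step(session: Session) -> str:
    if session.current_question not in session.answers:
        return "refused: current question has no recorded answer"
    if session.position + 1 >= len(session.question_ids):
        return "finished: no more questions"
    session.position += 1
    return f"now on {session.current_question}"


# The whole tool surface in one place (show-figure and redirect omitted).
TOOLS = {"record_answer": record_answer, "advance_step": advance_step}


def handle_tool_call(session: Session, name: str, **kwargs) -> str:
    """Run a tool the AI asked for, or refuse if it is not on the list."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"refused: unknown tool {name!r}"
    return tool(session, **kwargs)


session = Session(question_ids=["q1", "q2"])
print(handle_tool_call(session, "advance_step"))               # refused
print(handle_tool_call(session, "record_answer", answer="B"))  # recorded
print(handle_tool_call(session, "advance_step"))               # now on q2
```

Because every state change passes through a handful of small functions, a newcomer can read the full set of things the AI is able to do in one place.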

It wasn't glamorous. By design, a newcomer could understand it in an afternoon. And it worked.

The verdict — simplicity won, and it wasn't close

The team chose not by taste but by benchmark: 80 carefully built test conversations spanning six student types — the capable one, the error-prone one, the disengaged one, the one who resists probing, the struggler, the average learner. Every change scored against the same yardstick. Improvement stopped being a feeling and became a number.

The simple rebuild climbed from a shaky 69% to 78 out of 80 — 97.5%, handling four of the six student types perfectly. The two remaining failures were minor and in one narrow category; the evaluation was "largely saturated." By the end of May the decision made itself: the simple tutor became the new tutor, and the capable, feature-rich, safeguard-laden original was retired.
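For readers who want to build a similar yardstick, here is a minimal sketch of a fixed-benchmark scorer. The names `ScenarioResult`, `pass_rate`, and `summarize` are hypothetical, and the four results in the usage lines are toy data, not the team's scenarios. Only the final line uses numbers from this report: 78 passes out of 80.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScenarioResult:
    scenario_id: str
    student_type: str  # e.g. "capable", "struggler"
    passed: bool       # True if no unacceptable error occurred


def pass_rate(passed: int, total: int) -> float:
    """Percentage of scenarios passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def summarize(results: Iterable[ScenarioResult]) -> dict:
    """Overall pass rate plus (passed, total) per student type."""
    totals: Counter = Counter()
    passes: Counter = Counter()
    for result in results:
        totals[result.student_type] += 1
        passes[result.student_type] += int(result.passed)
    per_type = {t: (passes[t], totals[t]) for t in totals}
    overall = pass_rate(sum(passes.values()), sum(totals.values()))
    return {"overall_percent": overall, "per_student_type": per_type}


# Toy data for illustration only.
toy = [
    ScenarioResult("s1", "capable", True),
    ScenarioResult("s2", "capable", True),
    ScenarioResult("s3", "struggler", True),
    ScenarioResult("s4", "struggler", False),
]
print(summarize(toy)["overall_percent"])  # 75.0
print(pass_rate(78, 80))                  # 97.5
```

Scoring every change against the same scenario set is what makes two runs comparable; the per-type breakdown shows where the remaining failures cluster.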

The lessons — for anyone building an AI tutor

These cost a real team a paused pilot and weeks of work. They're offered so the next team doesn't pay the same price.

1. Decide what "unacceptable" means — and let that be the spec. The project found its footing only when it named the three errors it could never tolerate: a right answer called wrong, a wrong answer called right, an unanswerable question. Everything else is negotiable; these are not. ("Unacceptable errors are the design spec.")

2. Capability is not the goal — trustworthiness is. The original could do almost anything, and that was the problem. Every feature added weight, and the weight crushed reliability on the basics. A tutor students trust to get "right or wrong" correct every time beats a dazzling one that occasionally betrays them.

3. Don't make one AI do the whole job. One giant brief asking an AI to teach, grade, stay safe, stay on topic, and never err is a recipe for an unreliable everything-machine. Give the AI one clear role; hand every other job to the simplest tool that can do it.

4. Don't fight AI mistakes by stacking more AI on top. A reviewer built from the same technology shares the same blind spots — two of them tend to be wrong about the same things. Safety nets only work if they're made of different material than what they're catching.

5. Take work away from the AI wherever plain code can do it reliably. Checking a multiple-choice answer, doing arithmetic, remembering which question is open — these need reliability, which deterministic code gives for free and AI does not. Shrink the AI's job to only what genuinely requires judgment.

6. Prevent errors at the source rather than catching them afterwards. The old design generated answers and then frantically inspected them. The new one made bad outcomes structurally impossible: for example, it never let the AI invent a question, so an unanswerable one couldn't be asked. Prevention beats inspection.

7. When you're stuck, remove — don't add. Both the original and the ambitious rebuild died of accumulation. The winner's instinct was to take a piece out. If you're adding your seventh safeguard, the safeguards aren't the answer. The design is.

8. Measure, or you're guessing. The team escaped the circle only with a fixed benchmark — the same 80 scenarios, scored the same way. It turned "I think this is better" into "this is 9 points better." Build the measuring stick before you start improving.

9. Start simple, add only what the evidence demands. The winning tutor began as something a newcomer could read in one sitting, and grew only when a failing test forced it. Simplicity as default, complexity as justified exception — that's what kept it from becoming the thing it replaced.
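Lesson 6 is the easiest of these to show in code. The sketch below is one possible way to make an unanswerable question impossible to ask, under assumptions of our own: the `Question` fields, `build_bank`, `select_question`, and the sample Geography question are all hypothetical. Questions are validated once, when the bank is built, and the AI can only name the id of a vetted question; it never supplies question text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Question:
    question_id: str
    prompt: str
    options: tuple[str, ...]
    correct_option: str
    required_figures: tuple[str, ...] = ()  # figures the prompt refers to
    attached_figures: tuple[str, ...] = ()  # figures the student will see

    def __post_init__(self) -> None:
        # Reject anything a student could not answer from what they see.
        if not self.prompt.strip():
            raise ValueError(f"{self.question_id}: empty prompt")
        if self.correct_option not in self.options:
            raise ValueError(f"{self.question_id}: answer key not in options")
        missing = set(self.required_figures) - set(self.attached_figures)
        if missing:
            raise ValueError(f"{self.question_id}: missing {sorted(missing)}")


def build_bank(questions: Iterable[Question]) -> dict[str, Question]:
    """Index vetted questions by id, refusing duplicates."""
    bank: dict[str, Question] = {}
    for question in questions:
        if question.question_id in bank:
            raise ValueError(f"duplicate id {question.question_id}")
        bank[question.question_id] = question
    return bank


def select_question(bank: dict[str, Question], question_id: str) -> Question:
    """The AI names an id; an unknown id raises KeyError."""
    return bank[question_id]


try:
    # Refers to a map that is not attached, so it can never enter the bank.
    Question("g1", "Which river is marked X on the map?",
             ("Nile", "Amazon"), "Nile", required_figures=("map_3",))
except ValueError as error:
    print(error)  # g1: missing ['map_3']
```

Nothing here inspects the tutor's output after the fact. The bad question simply cannot be constructed, which is the difference between prevention and inspection.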

The moral

The team set out to build the most capable tutor they could, and learned the hard way that the goal was never capability. It was trust — a student believing that when the tutor says "yes, that's right," it's telling the truth. The path there ran not through more cleverness, features, or checking, but through the courage to throw most of that away and keep only what was simple enough to be reliable.

The most sophisticated thing the team ever did was decide to stop being sophisticated.