You Don't Have an Eval Problem

Imagine a Series B startup ships a customer-facing AI assistant trained on their internal docs. It works great in demos. Engineers kick the tires — ask it a few questions, looks right, ship it.

Six weeks later, their support queue is flooded with users citing company policies that don't exist. The model had been confidently making things up — filling in gaps with plausible-sounding fiction. Nobody knew because nobody was measuring.

The hallucinations were the symptom. The real problem was that no one could detect them — and that's what turned a fixable issue into a six-week catastrophe.

This is the pattern I see everywhere right now. Teams think they have an eval problem. They don't. They have a no-eval problem. Those are different diseases with different cures.

The myth that keeps teams stuck

Talk to enough AI engineers and you hear the same line: "We know we should have better evals. We're going to set that up soon."

Soon never comes.

Most teams think building evals means building infrastructure. A framework. A dashboard. Automated pipelines. A dedicated engineer. Something you need to plan before you can start.

So they put it on the roadmap. It stays there.

The confusion starts earlier than that, though. "Ask it a few questions, looks right, ship it" isn't an eval — it's vibes. An eval is a specific question with a binary answer, applied consistently to model outputs. No defined question, no defined pass criterion, no eval. Most teams that think they're testing are actually just sampling.

And once you see it that way, the fix gets a lot smaller. An eval doesn't have to start as a system. It just has to be a question you can answer yes or no to. Did the model stay within scope? Did the response contain a hallucinated fact? Did the model follow the required format? That's an eval. You can write ten of those in an afternoon and run them against your last 100 outputs by the end of the week.
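To make that concrete, here is a minimal sketch of two of those questions written as plain Python functions. The format rule (a JSON object with an "answer" key) and the scope rule (no legal or medical advice) are assumptions invented for illustration, as are the function names. Your own rules will come from your own product.

```python
import json

# Each eval is one yes/no question about a single model output.
# Returning True means the output passes.


def follows_required_format(output: str) -> bool:
    """Did the model follow the required format?

    Assumed format for this sketch: a JSON object with an "answer" key.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "answer" in parsed


def stays_within_scope(output: str) -> bool:
    """Did the model stay within scope?

    Assumed scope rule for this sketch: never offer legal or medical advice.
    """
    banned_phrases = ("legal advice", "medical advice")  # hypothetical rule
    lowered = output.lower()
    return not any(phrase in lowered for phrase in banned_phrases)


good = '{"answer": "You can reset your password from the settings page."}'
bad = "Sure! As legal advice, I would say you should sue."

print(follows_required_format(good), stays_within_scope(good))
print(follows_required_format(bad), stays_within_scope(bad))
```

Neither function is clever, and that is the point. A question only counts as an eval once someone has written down exactly what "pass" means.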

The system comes later — once you know what questions actually matter.

The 100 traces exercise

Before you build anything, do this:

1. Pull 100 real traces from your production logs. Actual inputs and outputs, not your test suite.
2. Read them. All 100. By hand. This is not optional.
3. Write down every failure mode you see. Not categories: specific failures. "Model suggested deleting user data when asked to 'clean up' the account." That specific.
4. Turn each failure mode into a binary test you can apply to one trace at a time. Pass or fail, no partial credit. "Does this response suggest deleting user data in reply to an account management query?" Yes means fail.
5. Run those tests on your next 100 traces. Now you have evals.
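If you would rather script the last step than fill in the spreadsheet by hand, it could look roughly like the sketch below. The trace shape (a dict with "input" and "output" keys), the check name, and the keyword match are all assumptions. The keyword match in particular is a crude stand-in for the judgment you applied while reading.

```python
import csv
import io
from typing import Callable, Dict, List

Trace = Dict[str, str]           # assumed shape: {"input": ..., "output": ...}
Check = Callable[[Trace], bool]  # returns True when the trace passes


def no_data_deletion_suggestion(trace: Trace) -> bool:
    """Passes unless the output talks about deleting something.

    Keyword matching is a crude placeholder for human judgment.
    """
    return "delete" not in trace["output"].lower()


def run_checks(traces: List[Trace], checks: Dict[str, Check]) -> List[Dict[str, str]]:
    """Apply every check to every trace and record pass/fail per cell."""
    rows = []
    for i, trace in enumerate(traces):
        row = {"trace_id": str(i)}
        for name, check in checks.items():
            row[name] = "pass" if check(trace) else "fail"
        rows.append(row)
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Render the results as CSV text that opens in any spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


traces = [
    {"input": "Please clean up my account",
     "output": "I can delete all of your user data to tidy things up."},
    {"input": "Please clean up my account",
     "output": "I can archive old projects. Nothing will be removed."},
]
rows = run_checks(traces, {"no_data_deletion_suggestion": no_data_deletion_suggestion})
print(to_csv(rows))
```

One row per trace, one column per test. That table is the whole eval suite at this stage.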

That's it. No framework. No tooling. A spreadsheet and your own judgment.

Once you have 10–20 of these, you wire them into CI/CD so every new model version gets checked before it ships. But that comes later. Step one is just reading your outputs.
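When you do get to that point, the CI hook can stay small. The sketch below assumes your results are a list of rows, each holding a trace ID plus one pass/fail verdict per test. The row layout, the check names, and the function name are hypothetical. It turns any failure into a non-zero exit code, which is what most CI systems treat as a failed build.

```python
from typing import Dict, List


def ci_exit_code(results: List[Dict[str, str]]) -> int:
    """Return 0 if every check passed on every trace, 1 otherwise."""
    failures = [
        (row["trace_id"], name)
        for row in results
        for name, verdict in row.items()
        if name != "trace_id" and verdict == "fail"
    ]
    for trace_id, name in failures:
        print(f"FAIL trace={trace_id} check={name}")
    return 1 if failures else 0


# Hypothetical results for a candidate model version.
results = [
    {"trace_id": "0", "stays_in_scope": "pass", "no_unsupported_claims": "pass"},
    {"trace_id": "1", "stays_in_scope": "pass", "no_unsupported_claims": "fail"},
]

exit_code = ci_exit_code(results)
print("exit code:", exit_code)
# In a real CI script you would end with: sys.exit(exit_code)
```

Blocking on any single failure is the strictest possible policy. Whether you start there or with a tolerance is a judgment call that depends on how noisy your tests are.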

One binary test could have caught the Series B startup's problem six weeks earlier: "Does this response contain a claim not supported by the source document?" You don't need an eval framework to write that. You need an afternoon.
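Answering that question properly takes a human reader, or a carefully checked judge of some kind. As a first-pass triage, though, even a crude word-overlap heuristic can surface sentences worth reading. The sketch below is exactly that and nothing more: the function names, the 0.6 threshold, and the refund policy text are all made up for illustration. It will miss paraphrased fabrications and wrong numbers (a response saying "90 days" where the source says "30 days" would sail through).

```python
import re
from typing import List


def unsupported_sentences(response: str, source: str, min_overlap: float = 0.6) -> List[str]:
    """Return response sentences whose words mostly do not appear in the source.

    A lexical-overlap heuristic for triage, not a hallucination detector.
    """
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if not words:
            continue
        overlap = sum(word in source_words for word in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


def is_grounded(response: str, source: str) -> bool:
    """Binary verdict: True when no sentence was flagged."""
    return not unsupported_sentences(response, source)


source = ("Refunds are available within 30 days of purchase. "
          "Contact support to start a refund.")
ok = "Refunds are available within 30 days of purchase."
made_up = ("Refunds are available within 30 days of purchase. "
           "Enterprise customers also receive lifetime free upgrades.")

print(is_grounded(ok, source))
print(unsupported_sentences(made_up, source))
```

Treat a flag as "go read this one", not as a verdict. The pass criterion still lives in your head until you have read enough flagged traces to trust something automated.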

Why this matters more than you think

Your eval suite is a living document of everything your model shouldn't do. Every failure mode you catch and codify becomes a regression test — institutional memory that survives engineer turnover, model upgrades, and prompt refactors.

Teams that skip this don't just ship buggy AI. They ship buggy AI and have no idea which change broke it, or when.
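Once results are recorded per version, finding what a change broke is a lookup rather than an investigation. A minimal sketch, assuming each run is stored as a mapping from (trace ID, check name) to a pass/fail boolean. That storage shape and the names below are assumptions.

```python
from typing import Dict, List, Tuple

Results = Dict[Tuple[str, str], bool]  # (trace_id, check_name) -> passed?


def regressions(before: Results, after: Results) -> List[Tuple[str, str]]:
    """Cases that passed under the old version and fail under the new one."""
    return sorted(
        key for key, passed in before.items()
        if passed and after.get(key) is False
    )


# Hypothetical results from two prompt versions on the same traces.
before = {
    ("t1", "stays_in_scope"): True,
    ("t1", "no_unsupported_claims"): True,
    ("t2", "no_unsupported_claims"): False,
}
after = {
    ("t1", "stays_in_scope"): True,
    ("t1", "no_unsupported_claims"): False,
    ("t2", "no_unsupported_claims"): False,
}

print(regressions(before, after))
```

Cases that were already failing are left out on purpose. They are known problems, not something the latest change introduced.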

You don't need perfect evals to start. You need some evals. The bar isn't comprehensive test coverage. The bar is "better than nothing" — and right now, for most teams, nothing is exactly what they have.

Start with 100 traces. This week.
