Evals that predict production behavior, not vibes
Why most eval suites pass while production regresses, how LLM-as-judge quietly lies to you, and the harness that actually catches what ships broken.
by BlackSpruce Lab
Most teams have an eval suite. Most eval suites are theater. They pass at ninety-something percent, the number trends up release over release, and then a model swap or a prompt edit ships a regression that the suite never saw. The suite was never measuring the thing that breaks. It was measuring whether the model could do the easy version of the task on a frozen set of inputs that stop resembling production the week after you write them.
An eval is only worth running if a drop in the score reliably predicts a drop in the experience users actually have. That is the whole bar. Everything else is a dashboard that feels like rigor and delivers none.
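One rough way to check that bar is to look back over past releases and ask how often the offline score and a production metric moved in the same direction. The sketch below assumes you have one offline score and one online metric (say, a thumbs-up rate) per release. The function name and the choice of metric are illustrative, not something prescribed here.

```python
def directional_agreement(offline: list[float], online: list[float]) -> float:
    """Fraction of release-to-release steps where both metrics moved the same way.

    Hypothetical helper: `offline` and `online` hold one value per release,
    in release order. A step counts as agreement only if both rose,
    both fell, or both stayed flat.
    """
    if len(offline) != len(online) or len(offline) < 2:
        raise ValueError("need two equal-length series with at least two releases")
    steps = len(offline) - 1
    same = 0
    for i in range(steps):
        d_off = offline[i + 1] - offline[i]
        d_on = online[i + 1] - online[i]
        # Compare the sign of each delta (positive, negative, or zero).
        if (d_off > 0) == (d_on > 0) and (d_off < 0) == (d_on < 0):
            same += 1
    return same / steps


# Offline score climbs every release while the production metric falls twice.
score = directional_agreement([0.90, 0.92, 0.94, 0.95], [0.80, 0.82, 0.78, 0.75])
```

A value near 1.0 suggests the suite tracks production. A value near 0.5 or below suggests the score has drifted away from what users experience. With only a handful of releases this is a smell test, not a statistic.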
Keep offline and online evals separate in your head, because conflating them is where teams go wrong. Offline evals run against a fixed dataset in CI. They are fast, deterministic-ish, and good for one job: catching regressions before they ship. Online evals run against live traffic (sampled, logged, scored after the fact), and they are the only thing that tells you whether your offline set still corresponds to reality.
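For the online side, the sampling step can be as small as a hash of the request ID. This is a minimal sketch, assuming each request carries a stable string ID. `should_sample` and the sampling rate are hypothetical names and values, not part of any particular logging stack.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically decide whether to log a request for online scoring.

    Hashing the ID (instead of calling random()) means the same request
    always gets the same decision, so retries and replays stay consistent.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Keep roughly 5% of traffic for after-the-fact scoring.
sampled = [rid for rid in (f"req-{i}" for i in range(1000)) if should_sample(rid, 0.05)]
```

Sampled requests would then be logged with their inputs and outputs and scored later by whatever grader you trust.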
The trap is treating a high offline score as a release gate and never closing the loop. Your offline set is a snapshot of last quarter’s traffic. User behavior drifts, your retrieval corpus grows, the upstream model gets silently updated. Within weeks the offline set is measuring a distribution nobody is in anymore. The only fix is to continuously mine production traffic — especially the failures, the thumbs-down, the escalations — back into the offline set. An eval dataset is a living thing or it is dead weight.
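The mining step might look something like the sketch below. It assumes production logs expose a thumbs-down flag and an escalation flag per interaction, and that the offline set is a dictionary keyed by a hash of the prompt. `LoggedInteraction`, `case_key`, and `mine_failures` are all hypothetical names for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedInteraction:
    """One production request/response pair with its user feedback signals."""
    prompt: str
    response: str
    thumbs_down: bool = False
    escalated: bool = False


def case_key(prompt: str) -> str:
    """Stable de-duplication key: hash of the lowercased, whitespace-normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def mine_failures(offline_set: dict[str, dict], logs: list[LoggedInteraction]) -> int:
    """Add failed production interactions to offline_set in place; return how many were added."""
    added = 0
    for item in logs:
        if not (item.thumbs_down or item.escalated):
            continue  # only failures are mined in this sketch
        key = case_key(item.prompt)
        if key in offline_set:
            continue  # this prompt is already covered
        offline_set[key] = {
            "prompt": item.prompt,
            "bad_response": item.response,
            "source": "production",
            "needs_label": True,  # a human still has to write the expected behavior
        }
        added += 1
    return added
```

New cases land with `needs_label` set because a failed response tells you what went wrong, not what the right answer was. Someone still has to write the expectation before the case can gate a release.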
Using a strong model to grade outputs is the only thing that scales to the volume and subjectivity of real tasks. It also fails in specific, learnable ways, and if you do not know them you will trust a number that is lying.
The discipline that fixes most of this: do not ask the judge for a 1–10 score. Fine-grained numeric scores are noise dressed as precision. Ask for a binary or small-ordinal decision against an explicit, concrete rubric, such as “does this response cite a source that actually contains the claimed fact: yes/no.” Then, and this is the part people skip, calibrate the judge against a few hundred human labels. Measure judge-human agreement. If your judge agrees with humans 65% of the time, your eval has the resolution of a coin with opinions.
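Measuring that agreement takes a few lines once the judge and the human labelers have answered the same yes/no question on the same items. The sketch below computes raw agreement plus Cohen's kappa. Kappa is an addition here, not something the calibration step above requires, but it corrects for the agreement two raters would reach by chance, which matters when one answer dominates. The function name is hypothetical.

```python
def agreement_and_kappa(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for paired binary labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge_yes = sum(judge) / n
    p_human_yes = sum(human) / n
    # Agreement expected if both raters answered independently at their own base rates.
    expected = p_judge_yes * p_human_yes + (1 - p_judge_yes) * (1 - p_human_yes)
    if expected == 1.0:
        # Both raters gave one constant label; kappa is undefined.
        # Reporting 0.0 is a convention chosen for this sketch.
        return observed, 0.0
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


judge_labels = [True, True, False, False]
human_labels = [True, False, False, False]
raw, kappa = agreement_and_kappa(judge_labels, human_labels)
# raw == 0.75, kappa == 0.5
```

If you already depend on scikit-learn, `sklearn.metrics.cohen_kappa_score` computes the same statistic. The hand-rolled version is shown only to make the arithmetic visible.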
Pick one task your product actually performs and build a real harness around it.
Ignore the leaderboard reflex — the public benchmarks are saturated, gamed, and contaminated by training data, and a model topping MMLU tells you nothing about whether it will follow your tool schema. Ignore any vendor selling a single “quality score” that compresses your whole application into one number; quality is per-slice or it is meaningless. And ignore the instinct to chase a higher offline number once it has decoupled from production outcomes — at that point you are optimizing the test, not the system.
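To make "per-slice" concrete, here is a small sketch that assumes every eval case is tagged with a slice name (a task type, a customer segment, a tool) and yields a pass/fail result. The function names and the two-point tolerance are illustrative assumptions.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_slice(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate for each slice from (slice_name, passed) pairs."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}


def regressed_slices(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """List slices whose candidate pass rate fell more than `tolerance` below baseline.

    A slice missing from the candidate run is treated as a pass rate of 0.0.
    """
    return sorted(
        name
        for name, rate in baseline.items()
        if candidate.get(name, 0.0) < rate - tolerance
    )


baseline = {"tool_calls": 0.90, "refunds": 0.80}
candidate = {"tool_calls": 0.95, "refunds": 0.60}
flagged = regressed_slices(baseline, candidate)
# flagged == ["refunds"]
```

Gating on the flagged list rather than on an average keeps a gain in one slice from hiding a collapse in another.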
The teams that ship reliable LLM products are not the ones with the most sophisticated judge prompts. They are the ones whose eval score, when it moves, tells them something true about what their users are about to feel. Build that, and the rest is tuning. Skip it, and every release is a coin flip you have dressed up as engineering.