January 14, 2025 · 8 min read

Evals that predict production behavior, not vibes

Why most eval suites pass while production regresses, how LLM-as-judge quietly lies to you, and the harness that actually catches what ships broken.

by BlackSpruce Lab

Most teams have an eval suite. Most eval suites are theater. They pass at ninety-something percent, the number trends up release over release, and then a model swap or a prompt edit ships a regression that the suite never saw. The suite was never measuring the thing that breaks. It was measuring whether the model could do the easy version of the task on a frozen set of inputs that stop resembling production the week after you write them.

An eval is only worth running if a drop in the score reliably predicts a drop in the experience users actually have. That is the whole bar. Everything else is a dashboard that feels like rigor and delivers none.

Offline evals tell you what changed; online evals tell you if it matters

Keep these separate in your head, because conflating them is where teams go wrong. Offline evals run against a fixed dataset in CI. They are fast, deterministic-ish, and good for one job: catching regressions before they ship. Online evals run against live traffic — sampled, logged, scored after the fact — and they are the only thing that tells you whether your offline set still corresponds to reality.

The trap is treating a high offline score as a release gate and never closing the loop. Your offline set is a snapshot of last quarter’s traffic. User behavior drifts, your retrieval corpus grows, the upstream model gets silently updated. Within weeks the offline set is measuring a distribution nobody is in anymore. The only fix is to continuously mine production traffic — especially the failures, the thumbs-down, the escalations — back into the offline set. An eval dataset is a living thing or it is dead weight.
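A minimal sketch of that feedback loop, assuming production logs arrive as dictionaries. The field names `feedback` and `escalated`, and the helpers `is_failure` and `merge_failures`, are hypothetical and should be adapted to whatever your logging actually records. It pulls failing production inputs into the offline set and skips near-duplicates, so the set grows with new failures rather than with copies of old ones.

```python
import hashlib


def _key(text: str) -> str:
    # Normalize case and whitespace so trivially different copies dedupe.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_failure(record: dict) -> bool:
    # Hypothetical log fields: a thumbs-down or an escalation counts as a failure.
    return bool(record.get("feedback") == "thumbs_down" or record.get("escalated", False))


def merge_failures(golden: list[dict], production_log: list[dict]) -> list[dict]:
    """Return the golden set extended with unseen failing production inputs."""
    seen = {_key(case["input"]) for case in golden}
    merged = list(golden)
    for record in production_log:
        if not is_failure(record):
            continue
        key = _key(record["input"])
        if key in seen:
            continue
        seen.add(key)
        # New cases still need a human-reviewed expected answer before they gate CI.
        merged.append({"input": record["input"], "source": "production", "needs_label": True})
    return merged
```

Exact-match deduplication on normalized text is the crudest option that works. Embedding-based near-duplicate detection is a reasonable upgrade once the set gets large.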

LLM-as-judge: useful, and quietly full of failure modes

Using a strong model to grade outputs is the only thing that scales to the volume and subjectivity of real tasks. It also fails in specific, learnable ways, and if you do not know them you will trust a number that is lying.

Position bias. Judges often favor an answer because of where it appears in a pairwise comparison, most commonly the first slot. Always randomize order, and run both orderings to measure the gap.
Length and verbosity bias. Judges reward longer, more confident answers even when they are wrong. A model that learns to pad will climb your eval while degrading your product.
Self-preference. A judge tends to prefer outputs from its own family. Grade GPT outputs with a Claude judge and vice versa when the stakes are real.
Sycophancy to the rubric. If your prompt telegraphs the desired answer, the judge agrees with you. Write rubrics that force a decision, not ones that beg for confirmation.
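Running both orderings can be sketched as follows. The `Judge` callable is a hypothetical stand-in for your own model call: it is assumed to take a prompt and two answers and return the string "first" or "second". The sketch counts a verdict only when it survives the swap, and reports how often it does not.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return 'a' or 'b' if the verdict survives swapping order, else 'inconsistent'."""
    forward = judge(prompt, a, b)  # a is shown first
    reverse = judge(prompt, b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_reverse = "b" if reverse == "first" else "a"
    if winner_forward == winner_reverse:
        return winner_forward
    return "inconsistent"


def position_flip_rate(judge: Judge, cases: list[tuple[str, str, str]]) -> float:
    """Fraction of (prompt, a, b) cases where the verdict changes with answer order."""
    if not cases:
        raise ValueError("need at least one case")
    verdicts = [judge_both_orders(judge, p, a, b) for p, a, b in cases]
    return sum(v == "inconsistent" for v in verdicts) / len(verdicts)
```

A flip rate near zero means position is not driving the verdicts on that set. A high one means the pairwise numbers are mostly measuring slot order, and the inconsistent cases are better treated as ties than as wins.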

The discipline that fixes most of this: do not ask the judge for a 1–10 score. Fine-grained numeric scores are noise dressed as precision. Ask for a binary or small-ordinal decision against an explicit, concrete rubric, for example “does this response cite a source that actually contains the claimed fact: yes/no.” Then, and this is the part people skip, calibrate the judge against a few hundred human labels. Measure judge-human agreement. If your judge agrees with humans 65% of the time on a yes/no question, where blind guessing on balanced labels already gets about 50%, your eval has the resolution of a coin with opinions.
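A small sketch of that calibration step, assuming the judge verdicts and the human labels are parallel lists over the same examples. Raw agreement is what the paragraph above asks for. Cohen's kappa is an addition here, not something the text prescribes: it corrects for the agreement two raters would reach by chance, which matters when one label dominates.

```python
def agreement_rate(judge: list, human: list) -> float:
    """Fraction of examples where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list, human: list) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    # Expected agreement if both raters labeled independently at their own base rates.
    p_expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

With balanced yes/no labels, 65% raw agreement corresponds to a kappa of 0.3, which is one way to see how little signal that judge carries.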

What to do Monday

Pick one task your product actually performs and build a real harness around it.

Collect a golden set from production, not your imagination. A few hundred real inputs, weighted toward the hard and failing cases. Synthetic data is fine for coverage of rare paths but must never be the whole set.
Write assertions, not just judge calls. Most “LLM quality” failures are not subtle. They are malformed JSON, a missing citation, a refused task, a hallucinated API field, an answer in the wrong language. Cheap deterministic checks catch the majority of regressions for free. Reserve the judge for the genuinely subjective remainder.
Gate CI on the score and alert on the diff. A regression on a single high-value slice should fail the build. Reporting an aggregate that quietly averages away a 20-point drop on your most important segment is how regressions ship.
Sample production and re-grade weekly. Feed the failures back into the golden set. The loop is the product.
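The deterministic checks in the second step can be as plain as the sketch below. It assumes the model is asked to return a JSON object with `answer` and `citations` fields. That schema, the refusal phrases, and the `run_checks` helper are all hypothetical, so swap in your own contract. Each check is cheap and needs no judge call.

```python
import json

# Hypothetical refusal phrases; tune these to what your model actually emits.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def run_checks(output: str, allowed_fields: set[str], required_fields: set[str]) -> list[str]:
    """Return the names of failed checks; an empty list means the output passed."""
    try:
        data = json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return ["malformed_json"]
    if not isinstance(data, dict):
        return ["malformed_json"]

    failures = []
    missing = required_fields - set(data)
    if missing:
        failures.append("missing_fields:" + ",".join(sorted(missing)))
    unknown = set(data) - allowed_fields  # fields the schema never defined
    if unknown:
        failures.append("unknown_fields:" + ",".join(sorted(unknown)))
    if not data.get("citations"):
        failures.append("missing_citation")
    answer = str(data.get("answer", "")).lower()
    if any(marker in answer for marker in REFUSAL_MARKERS):
        failures.append("refusal")
    return failures
```

Substring matching for refusals is a blunt heuristic and will miss rephrasings. Outputs that pass every check here are the ones worth sending to the judge.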

What’s hype to ignore

Ignore the leaderboard reflex — the public benchmarks are saturated, gamed, and contaminated by training data, and a model topping MMLU tells you nothing about whether it will follow your tool schema. Ignore any vendor selling a single “quality score” that compresses your whole application into one number; quality is per-slice or it is meaningless. And ignore the instinct to chase a higher offline number once it has decoupled from production outcomes — at that point you are optimizing the test, not the system.
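The per-slice point is easy to see with numbers. In the sketch below, the slice names, scores, traffic weights, and the 5-point threshold are all made up for illustration. One small slice drops 20 points while the traffic-weighted aggregate barely moves, and a per-slice comparison catches what the single number hides.

```python
def weighted_mean(scores: dict[str, float], weights: dict[str, int]) -> float:
    """Aggregate score, weighting each slice by its number of examples."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


def failing_slices(
    baseline: dict[str, float], candidate: dict[str, float], max_drop: float = 0.05
) -> dict[str, float]:
    """Slices whose score fell by more than max_drop relative to the baseline."""
    drops = {}
    for name, base in baseline.items():
        # A slice missing from the candidate run is scored 0.0, so it fails loudly.
        drop = base - candidate.get(name, 0.0)
        if drop > max_drop:
            drops[name] = drop
    return drops


# Illustrative numbers only: "refunds" is 10% of traffic but a high-value slice.
weights = {"faq": 900, "refunds": 100}
baseline = {"faq": 0.95, "refunds": 0.90}
candidate = {"faq": 0.97, "refunds": 0.70}

aggregate_change = weighted_mean(candidate, weights) - weighted_mean(baseline, weights)
regressed = failing_slices(baseline, candidate)
```

Here the aggregate moves from 0.945 to 0.943, which no one would block a release over, while `regressed` flags the 20-point drop on `refunds`. In CI, a non-empty `regressed` is the condition that would fail the build.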

The teams that ship reliable LLM products are not the ones with the most sophisticated judge prompts. They are the ones whose eval score, when it moves, tells them something true about what their users are about to feel. Build that, and the rest is tuning. Skip it, and every release is a coin flip you have dressed up as engineering.

#evals #llm-as-judge #regression #production