February 20, 2025 9 min

The agent reliability problem

Why the agent demo that wowed everyone falls apart in production: compounding error, brittle tool use, planning that doesn't replan, and the verification loops that actually make autonomy survivable.

by BlackSpruce Lab

The agent demo is the most reliably misleading artifact in our field. It runs a seven-step task end to end, the model picks the right tools, the output is correct, the room is impressed. Then it ships, and the same agent that nailed the demo succeeds maybe sixty percent of the time on real inputs, fails in ways nobody can reproduce, and burns a fortune in tokens deciding to do the wrong thing confidently. The demo was a sample of size one from a distribution with a long, ugly tail.

The core problem is not intelligence. The models are smart enough. The problem is that autonomy compounds error, and most agent architectures have no mechanism to stop it compounding.

Compounding error is the whole story

A single LLM call that is 95% reliable is a fine component. Chain ten of them where each step depends on the last, and your end-to-end reliability is 0.95^10, about 60%. Chain twenty and you are down to roughly 36%, well below a coin flip. This is the math that agent demos hide and production exposes. Every additional autonomous step is a multiplication, not an addition, and the multiplication is against you.
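The multiplication is easy to check for yourself. The snippet below is a minimal sketch that assumes every step has the same success rate and that failures are independent, which real agents only approximate; the function names are illustrative.

```python
import math


def end_to_end(per_step: float, steps: int) -> float:
    """Success probability of a chain in which every step must succeed."""
    return per_step ** steps


def max_steps(per_step: float, target: float) -> int:
    """Largest number of chained steps that still meets the target.

    Assumes 0 < per_step < 1 and 0 < target < 1.
    """
    return math.floor(math.log(target) / math.log(per_step))


for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 95% each: {end_to_end(0.95, n):.1%}")
```

Under those assumptions, max_steps(0.95, 0.5) returns 13: a fourteenth step at 95% is what pushes the chain below even odds.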

This single fact should drive your architecture. The reliable move is almost always to take autonomy away — to constrain the agent to the shortest path that solves the problem, not the most general one that could solve any problem. Most “agents” that work in production are closer to well-structured workflows with a small amount of bounded model judgement at the decision points, not open-ended loops free to wander.

Where agents actually break

Tool use is more fragile than it looks. The model calls the right tool with slightly wrong arguments. It hallucinates a parameter the API does not have. It misreads an error and retries the identical failing call. It succeeds at the call but misinterprets the result. Each is common and each is a silent correctness failure, not a crash. Your tools need strict schemas, validation before execution, and errors written for the model — “field user_id must be a UUID, you sent an email” — not raw stack traces it cannot parse.
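Validation before execution can be very small. The sketch below checks the arguments for a hypothetical get_user tool and returns errors phrased for the model rather than for a debugger. The tool name, its fields, and the message wording are assumptions made for illustration; a real system would more likely derive these checks from a schema library.

```python
import uuid

# Hypothetical schema for an illustrative get_user tool: field name -> expected type.
GET_USER_SCHEMA = {"user_id": str, "include_orders": bool}
REQUIRED = {"user_id"}


def validate_get_user(args: dict) -> list[str]:
    """Return model-readable error messages; an empty list means the call may run."""
    errors = []
    for name in args:
        if name not in GET_USER_SCHEMA:
            errors.append(
                f"unknown field {name}; allowed fields are {sorted(GET_USER_SCHEMA)}"
            )
    for name in sorted(REQUIRED - set(args)):
        errors.append(f"missing required field {name}")
    for name, expected in GET_USER_SCHEMA.items():
        if name in args and not isinstance(args[name], expected):
            errors.append(
                f"field {name} must be {expected.__name__}, "
                f"you sent {type(args[name]).__name__}"
            )
    user_id = args.get("user_id")
    if isinstance(user_id, str):
        try:
            uuid.UUID(user_id)  # raises ValueError on anything that is not a UUID
        except ValueError:
            kind = "an email" if "@" in user_id else "a non-UUID string"
            errors.append(f"field user_id must be a UUID, you sent {kind}")
    return errors


print(validate_get_user({"user_id": "alice@example.com"}))
```

The point of returning a list instead of raising is that every problem can go back to the model in one turn, so it can fix the call once rather than discovering the errors one retry at a time.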

Planning rarely survives contact with reality. Plan-then-execute looks clean and breaks the moment step three returns something step two did not anticipate, because nothing replans. A plan made before any tool has run is a guess. Agents that work interleave planning and acting — take a step, observe, decide the next step against what actually happened — rather than committing to a plan upfront and marching off a cliff.
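The difference between the two styles shows up in the shape of the loop. Here is a minimal sketch of the interleaved version, in which the next action is always chosen against the observations so far. The decide and act callables are hypothetical stand-ins for a model call and a tool call, and the toy versions at the bottom exist only to make the example runnable.

```python
from typing import Callable, Optional

Action = str
Observation = str
History = list[tuple[Action, Observation]]


def run_interleaved(
    decide: Callable[[str, History], Optional[Action]],
    act: Callable[[Action], Observation],
    goal: str,
    max_steps: int = 10,
) -> History:
    """Take one step, observe, then decide the next step from what happened."""
    history: History = []
    for _ in range(max_steps):
        action = decide(goal, history)  # sees every real observation so far
        if action is None:              # decide() signals that the goal is met
            break
        history.append((action, act(action)))
    return history


# Toy stand-ins: the first search comes back empty, so the next decision adapts.
def toy_decide(goal: str, history: History) -> Optional[Action]:
    if not history:
        return "search:exact"
    _, last_observation = history[-1]
    return "search:broad" if last_observation == "no results" else None


def toy_act(action: Action) -> Observation:
    return "no results" if action == "search:exact" else "3 results"


print(run_interleaved(toy_decide, toy_act, "find the docs"))
```

A plan-then-execute agent would have committed to both steps before seeing that the first search returned nothing; here the second action exists only because of that observation.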

Agents have no memory of their own failures. A naive loop will repeat the same failing action indefinitely because each turn starts fresh. You need explicit loop detection, step budgets, and state that records “I already tried this and it failed” so the model does not rediscover the same dead end forever.
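The “I already tried this and it failed” state does not need to be sophisticated. The sketch below keys each failed call on the tool name plus its canonicalized arguments, assuming those arguments are JSON-serializable. The class and method names are illustrative, not taken from any particular framework.

```python
import json
from typing import Optional


class AttemptLog:
    """Remembers failed tool calls so the loop does not repeat them verbatim."""

    def __init__(self) -> None:
        self._failed: dict[str, str] = {}

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        # Canonical form: the same arguments in a different order give the same key.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def record_failure(self, tool: str, args: dict, error: str) -> None:
        self._failed[self._key(tool, args)] = error

    def previous_error(self, tool: str, args: dict) -> Optional[str]:
        """Return the earlier error if this exact call already failed, else None."""
        return self._failed.get(self._key(tool, args))

    def as_prompt_note(self) -> str:
        """Render the failures as text that can be appended to the next prompt."""
        if not self._failed:
            return "No failed attempts so far."
        lines = [f"- {call} failed with: {err}" for call, err in self._failed.items()]
        return "Already tried and failed, do not repeat:\n" + "\n".join(lines)


log = AttemptLog()
log.record_failure("get_user", {"user_id": "alice@example.com"}, "user_id must be a UUID")
print(log.as_prompt_note())
```

Checking previous_error before executing a call gives you loop detection; feeding as_prompt_note into the next turn gives the model the memory it otherwise lacks.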

Bounded autonomy and verification loops

Two principles separate agents that ship from agents that demo.

Bound everything. Hard caps on steps, tool calls, tokens, and wall-clock time. A budget the agent cannot exceed. Define an explicit failure state — it is infinitely better for an agent to stop and say “I could not complete this” than to spend forty tool calls producing a confident wrong answer. Failing loud and cheap beats failing silent and expensive every time.
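A budget with an explicit give-up state fits in a few dozen lines. This is a sketch under assumptions: the cap values are placeholders, step is a hypothetical callable that runs one agent turn and reports its token use, and a separate tool-call cap would follow the same pattern.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class BudgetExceeded(Exception):
    """Raised when the agent has used up one of its hard caps."""


@dataclass
class Budget:
    max_steps: int = 20
    max_tokens: int = 50_000
    max_seconds: float = 120.0
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Call before every step; raises once any cap has been reached."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} reached")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded(f"token cap of {self.max_tokens} reached")
        if time.monotonic() - self.started >= self.max_seconds:
            raise BudgetExceeded(f"time cap of {self.max_seconds}s reached")

    def charge(self, tokens: int) -> None:
        """Record one completed step and the tokens it consumed."""
        self.steps += 1
        self.tokens += tokens


def run_bounded(step: Callable[[], tuple[bool, Any, int]], budget: Budget) -> dict:
    """Run step() until it reports done or the budget runs out.

    step() is a hypothetical callable returning (done, result, tokens_used).
    """
    while True:
        try:
            budget.check()
        except BudgetExceeded as exc:
            # The explicit failure state: stop and say why, cheaply.
            return {"status": "gave_up", "reason": str(exc)}
        done, result, used = step()
        budget.charge(used)
        if done:
            return {"status": "ok", "result": result}


print(run_bounded(lambda: (False, None, 100), Budget(max_steps=3)))
```

Because the check runs before each step, the token cap can be overshot by at most one step's worth of tokens. The caller always gets back either a result or a stated reason for stopping, never a silent timeout.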

Verify, do not trust. This is the one that matters most. Do not let the model be the sole judge of its own success. Wherever you can, check the work with something cheaper and more reliable than the model that produced it: did the code actually compile and pass tests; does the SQL parse and return rows; does the generated JSON validate against the schema; did the booking API return a confirmation number. A verification loop (act, check against ground truth, correct if the check fails) is what can turn a 60% agent into a 95% one, because a failed check triggers a correction instead of letting the error flow into the next step. The generator can be creative and unreliable as long as the verifier is strict and deterministic. The asymmetry is the design.
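The act, check, correct loop can be sketched with a deterministic verifier and an unreliable generator. Everything named here is an assumption for illustration: generate stands in for a model call that receives the previous error as feedback, and verify_booking checks a hypothetical JSON payload for the confirmation number mentioned above.

```python
import json
from typing import Callable, Optional


def verify_booking(raw: str) -> Optional[str]:
    """Deterministic check: return None if valid, else an error for the model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"output is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "output must be a JSON object"
    confirmation = data.get("confirmation_number")
    if not isinstance(confirmation, str) or not confirmation:
        return "field confirmation_number must be a non-empty string"
    return None


def act_and_verify(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> dict:
    """Generate, check against the verifier, and retry with the error as feedback."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = verify(candidate)
        if feedback is None:
            return {"status": "ok", "output": candidate, "attempts": attempt}
    return {"status": "failed", "last_error": feedback, "attempts": max_attempts}


# Toy generator: wrong on the first try, corrected once it sees the error.
def toy_generate(feedback: Optional[str]) -> str:
    if feedback is None:
        return '{"status": "booked"}'
    return '{"status": "booked", "confirmation_number": "ABC123"}'


print(act_and_verify(toy_generate, verify_booking))
```

As a rough guide, if attempts were independent and the verifier never erred, a generator that succeeds 60% of the time would pass within three attempts with probability 1 - 0.4^3, about 94%. Real retries are correlated, so treat that as an upper bound rather than a forecast.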

What to do Monday

Write down your agent’s reliability budget. Decompose the task, estimate per-step success, multiply. If the product is unacceptable, you have too many autonomous steps — cut them, not your standards.
Add a verifier to the highest-value step. Find the one action that, when wrong, costs the most, and put a deterministic check after it. Highest ROI change you can make this week.
Bound the loop. Step cap, token cap, explicit give-up state. Ship the give-up path before you ship the agent.
Make tool errors legible to the model and validate arguments before execution. Half your retries-on-the-same-failure problem disappears.
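The reliability budget from the first item is a few lines of arithmetic once the task is decomposed. In the sketch below the step names and success estimates are invented for illustration, and multiplying them assumes the steps fail independently; substitute numbers measured on your own traffic.

```python
import math

# Hypothetical per-step success estimates for an illustrative five-step task.
estimates = {
    "parse request": 0.99,
    "pick tool": 0.95,
    "build arguments": 0.90,
    "interpret result": 0.93,
    "write summary": 0.97,
}


def reliability(per_step: dict[str, float]) -> float:
    """End-to-end success if every step must succeed and failures are independent."""
    return math.prod(per_step.values())


def weakest_step(per_step: dict[str, float]) -> str:
    """The step with the lowest estimate, a natural first candidate for a verifier."""
    return min(per_step, key=per_step.get)


print(f"end to end: {reliability(estimates):.1%}")
print(f"weakest step: {weakest_step(estimates)}")
```

With these made-up numbers, five steps that each look fine on their own multiply out to roughly 76% end to end, and the lowest estimate is one place to look first, alongside the step that costs the most when it is wrong.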

What’s hype to ignore

Ignore “multi-agent” architectures as a default — a swarm of agents talking to each other multiplies the failure surface and the token bill, and for the vast majority of tasks a single well-bounded agent with good tools and a verifier beats it on reliability and cost. Ignore demos that do not report success rate over a real distribution; a video is a sample of one. And be deeply skeptical of any pitch for “fully autonomous” anything in a domain where a wrong action has real consequences — the entire engineering problem is deciding how much autonomy to remove, and the systems that work are the ones that removed the most while keeping the task solved.

Reliable agents are not the most capable ones. They are the most constrained ones that still get the job done — bounded, verified, and built by people who respect the math of compounding error instead of being surprised by it.

#agents #reliability #tool-use #verification