The agent reliability problem
Why the agent demo that wowed everyone falls apart in production: compounding error, brittle tool use, planning that doesn't replan, and the verification loops that actually make autonomy survivable.
by BlackSpruce Lab
The agent demo is the most reliably misleading artifact in our field. It runs a seven-step task end to end, the model picks the right tools, the output is correct, the room is impressed. Then it ships, and the same agent that nailed the demo succeeds maybe sixty percent of the time on real inputs, fails in ways nobody can reproduce, and burns a fortune in tokens deciding to do the wrong thing confidently. The demo was a sample of size one from a distribution with a long, ugly tail.
The core problem is not intelligence. The models are smart enough. The problem is that autonomy compounds error, and most agent architectures have no mechanism to stop it compounding.
A single LLM call that is 95% reliable is a fine component. Chain ten of them where each step depends on the last, and your end-to-end reliability, assuming failures are independent, is 0.95^10, about 60%. Chain twenty and you are down to roughly 36%, well below a coin flip. This is the math that agent demos hide and production exposes. Every additional autonomous step is a multiplication, not an addition, and the multiplication is against you.
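The arithmetic is worth running once yourself. A minimal sketch, assuming every step has the same reliability and that failures are independent (real agents are usually messier than that, and the function name is ours):

```python
def end_to_end_reliability(per_step: float, steps: int) -> float:
    """Probability that all `steps` dependent steps succeed,
    assuming independent failures and identical per-step reliability."""
    return per_step ** steps


for n in (1, 5, 10, 20, 50):
    rate = end_to_end_reliability(0.95, n)
    print(f"{n:>2} steps at 95% each: {rate:.1%} end to end")
```

Under these assumptions the ten-step chain lands just under 60% and the twenty-step chain near 36%.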
This single fact should drive your architecture. The reliable move is almost always to take autonomy away — to constrain the agent to the shortest path that solves the problem, not the most general one that could solve any problem. Most “agents” that work in production are closer to well-structured workflows with a small amount of bounded model judgement at the decision points, not open-ended loops free to wander.
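One way to picture "bounded model judgement at a decision point" is a fixed workflow in which the model may only pick from a closed set of routes. The sketch below is illustrative only: the ticket-routing scenario, the `Route` values, and the `classify` callable (standing in for a model call) are all assumptions, not something from a particular system.

```python
from enum import Enum
from typing import Callable


class Route(Enum):
    # Hypothetical routes; the point is that the set is closed.
    REFUND = "refund"
    SHIPPING = "shipping"
    ESCALATE = "escalate"


def route_ticket(ticket: str, classify: Callable[[str], str]) -> Route:
    """Ask the model for exactly one decision, then constrain it.

    `classify` is a stand-in for a model call that returns a label.
    Anything outside the allowed set falls back to a safe default
    instead of being acted on.
    """
    raw = classify(ticket).strip().lower()
    try:
        return Route(raw)
    except ValueError:
        return Route.ESCALATE


# Usage with fake classifiers in place of a model:
print(route_ticket("Where is my parcel?", lambda t: " Shipping "))
print(route_ticket("Where is my parcel?", lambda t: "delete the account"))
```

The code around the decision stays ordinary and deterministic. The model contributes one label, and an unexpected label degrades to escalation rather than to an arbitrary action.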
Tool use is more fragile than it looks. The model calls the right tool with slightly wrong arguments. It hallucinates a parameter the API does not have. It misreads an error and retries the identical failing call. It succeeds at the call but misinterprets the result. Each is common and each is a silent correctness failure, not a crash. Your tools need strict schemas, validation before execution, and errors written for the model (“field user_id must be a UUID, you sent an email”), not raw stack traces it cannot parse.
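A minimal sketch of validation before execution, using only the standard library. The tool (`get_user`) and its single `user_id` argument are hypothetical; in practice you would likely generate checks like this from a schema rather than hand-write them.

```python
import uuid


def validate_get_user_args(args: dict) -> list[str]:
    """Check arguments for a hypothetical `get_user` tool.

    Returns a list of error messages written for the model to read;
    an empty list means the call is safe to execute.
    """
    allowed = {"user_id"}
    errors = []

    # Reject hallucinated parameters instead of silently ignoring them.
    for name in sorted(set(args) - allowed):
        errors.append(
            f"unknown field {name!r}: this tool accepts only {sorted(allowed)}"
        )

    value = args.get("user_id")
    if value is None:
        errors.append("missing required field user_id (a UUID string)")
    elif not isinstance(value, str):
        errors.append(
            f"field user_id must be a UUID string, you sent {type(value).__name__}"
        )
    else:
        try:
            uuid.UUID(value)
        except ValueError:
            sent = "an email" if "@" in value else repr(value)
            errors.append(f"field user_id must be a UUID, you sent {sent}")
    return errors


print(validate_get_user_args({"user_id": "someone@example.com"}))
print(validate_get_user_args({"user_id": "123e4567-e89b-12d3-a456-426614174000"}))
```

The messages name the field, the expected type, and what was actually sent, which gives the model something concrete to correct on the next turn.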
Planning rarely survives contact with reality. Plan-then-execute looks clean and breaks the moment step three returns something step two did not anticipate, because nothing replans. A plan made before any tool has run is a guess. Agents that work interleave planning and acting — take a step, observe, decide the next step against what actually happened — rather than committing to a plan upfront and marching off a cliff.
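The interleaved shape is small enough to sketch. Here `decide` and `act` are hypothetical callables: `decide` stands in for the model choosing the next action from the goal plus everything observed so far, and `act` stands in for tool execution. The toy versions below exist only to make the loop runnable.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (action, observation) pairs


def run_interleaved(
    goal: str,
    decide: Callable[[str, History], Optional[str]],
    act: Callable[[str], str],
    max_steps: int = 10,
) -> History:
    """Take one step, observe, then decide the next step from what happened."""
    history: History = []
    for _ in range(max_steps):
        action = decide(goal, history)  # re-decided every turn, never precomputed
        if action is None:              # the decider says the goal is met
            break
        observation = act(action)
        history.append((action, observation))
    return history


# Toy stand-ins: the first lookup fails, so the next decision changes course.
def toy_decide(goal: str, history: History) -> Optional[str]:
    if not history:
        return "lookup_by_id"
    last_action, last_observation = history[-1]
    if last_action == "lookup_by_id" and last_observation == "not found":
        return "lookup_by_email"
    return None


def toy_act(action: str) -> str:
    return "not found" if action == "lookup_by_id" else "found"


print(run_interleaved("find the customer record", toy_decide, toy_act))
```

A plan-then-execute agent would have committed to its second step before seeing "not found". This loop only picks the second step after the first observation exists.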
Agents have no memory of their own failures. A naive loop will repeat the same failing action indefinitely because each turn starts fresh. You need explicit loop detection, step budgets, and state that records “I already tried this and it failed” so the model does not rediscover the same dead end forever.
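A sketch of the simplest version of that state: a log keyed on the tool name plus its arguments. The class and method names are ours, and the sketch assumes tool arguments are JSON-serializable so they can be normalized into a stable key.

```python
import json
from typing import Optional


class AttemptLog:
    """Remembers which exact tool calls have already failed."""

    def __init__(self) -> None:
        self._failures: dict[str, str] = {}

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} the same call.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def record_failure(self, tool: str, args: dict, error: str) -> None:
        self._failures[self._key(tool, args)] = error

    def previous_error(self, tool: str, args: dict) -> Optional[str]:
        """Return the earlier error if this identical call already failed."""
        return self._failures.get(self._key(tool, args))

    def as_prompt_notes(self) -> list[str]:
        """Lines to include in the next model turn."""
        return [
            f"I already tried {key} and it failed: {error}"
            for key, error in self._failures.items()
        ]


log = AttemptLog()
log.record_failure("get_user", {"user_id": "someone@example.com"}, "user_id must be a UUID")
print(log.previous_error("get_user", {"user_id": "someone@example.com"}))
print(log.as_prompt_notes())
```

Checking `previous_error` before executing a call lets the harness refuse an identical retry outright, and `as_prompt_notes` puts the dead ends back in front of the model. Exact-match detection will miss near-duplicates, so treat it as a floor, not a complete loop detector.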
Two principles separate agents that ship from agents that demo.
Bound everything. Hard caps on steps, tool calls, tokens, and wall-clock time. A budget the agent cannot exceed. Define an explicit failure state — it is infinitely better for an agent to stop and say “I could not complete this” than to spend forty tool calls producing a confident wrong answer. Failing loud and cheap beats failing silent and expensive every time.
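A sketch of what hard caps plus an explicit failure state can look like. The limits are placeholder numbers, and `step` is a hypothetical callable standing in for one agent turn that returns an answer (or `None` if it is not done) together with the tokens it used.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


class BudgetExceeded(Exception):
    """Raised when the agent hits a hard cap."""


@dataclass
class Budget:
    max_steps: int = 15          # placeholder limits, tune per task
    max_tokens: int = 50_000
    max_seconds: float = 120.0
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def start_step(self) -> None:
        """Check every cap before a new step is allowed to begin."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} reached")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded(f"token limit of {self.max_tokens} reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded(f"time limit of {self.max_seconds}s reached")
        self.steps += 1

    def record_tokens(self, used: int) -> None:
        # Usage is only known after a call, so overshoot is at most one step.
        self.tokens += used


def run_bounded(step: Callable[[], Tuple[Optional[str], int]], budget: Budget) -> dict:
    """Run `step` until it yields an answer or the budget runs out."""
    try:
        while True:
            budget.start_step()
            answer, tokens_used = step()
            budget.record_tokens(tokens_used)
            if answer is not None:
                return {"status": "ok", "answer": answer}
    except BudgetExceeded as exc:
        # The explicit failure state: stop loudly instead of guessing.
        return {"status": "failed", "reason": f"I could not complete this: {exc}"}


# A step that never finishes is cut off after three turns.
print(run_bounded(lambda: (None, 100), Budget(max_steps=3)))
```

The caller always gets either an answer or a named reason for stopping, which is what makes the failure cheap to detect downstream.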
Verify, do not trust. This is the one that matters most. Do not let the model be the sole judge of its own success. Wherever you can, check the work with something cheaper and more reliable than the model that produced it: did the code actually compile and pass tests; does the SQL parse and return rows; does the generated JSON validate against the schema; did the booking API return a confirmation number. A verification loop (act, check against ground truth, correct if the check fails) is what can turn a 60% agent into a 95% one, because a caught failure gets retried instead of silently propagating into every later step. The generator can be creative and unreliable as long as the verifier is strict and deterministic. The asymmetry is the design.
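The act, check, correct loop can be sketched in a few lines. Everything named here is illustrative: `generate` stands in for the model (it receives the verifier's last complaint as feedback), and the verifier checks a hypothetical booking payload for a `confirmation_number` field using only the standard library.

```python
import json
from typing import Callable, Optional


def verify_booking_json(text: str) -> Optional[str]:
    """Deterministic check. Returns None if the output passes,
    otherwise an error message to feed back to the generator."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"output is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "output must be a JSON object"
    number = data.get("confirmation_number")
    if not isinstance(number, str) or not number:
        return "field confirmation_number must be a non-empty string"
    return None


def generate_with_verification(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Act, check, correct. Returns a verified output or None."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(feedback)   # unreliable, creative
        feedback = verify(candidate)     # strict, deterministic
        if feedback is None:
            return candidate
    return None  # explicit failure, never an unverified answer


# A fake generator that gets it right on the third try.
attempts = iter(["sure, booked!", '{"confirmation_number": ""}', '{"confirmation_number": "ABC123"}'])
print(generate_with_verification(lambda feedback: next(attempts), verify_booking_json))
```

As a rough illustration of why this moves the numbers: under the idealized assumptions of a perfect verifier and independent attempts, a 95% step that is allowed one retry succeeds 99.75% of the time, and ten such steps in sequence succeed about 97.5% of the time, against roughly 60% without the check. Real verifiers are imperfect and retries are correlated, so expect less than that, but the direction holds.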
Ignore “multi-agent” architectures as a default — a swarm of agents talking to each other multiplies the failure surface and the token bill, and for the vast majority of tasks a single well-bounded agent with good tools and a verifier beats it on reliability and cost. Ignore demos that do not report success rate over a real distribution; a video is a sample of one. And be deeply skeptical of any pitch for “fully autonomous” anything in a domain where a wrong action has real consequences — the entire engineering problem is deciding how much autonomy to remove, and the systems that work are the ones that removed the most while keeping the task solved.
Reliable agents are not the most capable ones. They are the most constrained ones that still get the job done — bounded, verified, and built by people who respect the math of compounding error instead of being surprised by it.