February 2, 2025 · 9 min read

RAG is becoming retrieval-plus-reasoning

Naive vector RAG was a 2023 pattern. What works now is hybrid retrieval, re-ranking, context engineering, and treating retrieval as a step the model reasons over — not a lookup it trusts.

by BlackSpruce Lab

The naive RAG pattern — embed the query, pull the top-k nearest chunks, stuff them in the prompt, hope — was a reasonable first approximation in 2023. It is now the thing most underperforming systems are still doing. The frontier moved. Retrieval stopped being a lookup the model trusts blindly and became a step the model reasons over, queries iteratively, and is allowed to distrust.

If your RAG system is one embedding call and a top-k fetch, you are leaving most of the quality on the table, and no amount of prompt tinkering recovers it.

Why naive vector search underperforms

Dense embeddings are good at semantic similarity and bad at a list of things production queries actually need: exact identifiers, rare entity names, dates, SKUs, version numbers, negation. Ask a pure-vector system about “the bug fixed in v2.3.1 but not v2.3.0” and watch it return everything about the project and nothing about the version boundary. Embeddings smear precisely the tokens that carry the answer.

The fix is not a better embedding model. It is to stop pretending one retrieval method covers every query shape.

The stack that actually works

Hybrid retrieval. Run dense (vector) and sparse (BM25 / keyword) retrieval in parallel and fuse the results; reciprocal rank fusion is the boring, effective default. Sparse catches the exact-match and rare-term queries dense misses; dense catches the paraphrase and concept queries sparse misses. The fused ranking tends to beat either method alone on most real corpora. This is the single highest-leverage change most teams have not made.
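
To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion. It assumes each retriever returns an ordered list of document ids; the function name, the sample ids, and the constant k=60 (a commonly used default, not a tuned value) are illustrative assumptions rather than anything prescribed above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents missing from a list add nothing for it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a dense and a sparse retriever for the same query.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_d", "doc_a", "doc_b"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, while a document that only one retriever found still survives in the fused list. Because the method uses ranks rather than raw scores, it needs no score normalization between the dense and sparse sides.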

Re-ranking. Retrieval optimizes for recall — cast a wide net, fetch 50–100 candidates. Then a cross-encoder re-ranker, which actually attends to the query and document together rather than comparing precomputed vectors, scores that candidate set and you keep the top handful. The recall/precision split is the point: retrieve wide and cheap, re-rank narrow and accurate. A re-ranker is the cheapest large quality win available right now and most pipelines skip it.
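
The retrieve-wide, re-rank-narrow split can be sketched as below. This is a shape, not an implementation: `score_pair` is a hypothetical hook where a real cross-encoder model would be called, and `toy_overlap_score` is a deliberately crude stand-in so the example runs without any model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    keep: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the best `keep`."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the doc.

    A real pipeline would replace this with a cross-encoder forward pass.
    """
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


# Hypothetical candidate set from the wide first-stage retrieval.
candidates = [
    "release notes for the project",
    "bug fixed in v2.3.1 affecting login",
    "v2.3.0 known issues list",
]

top = rerank("bug fixed in v2.3.1", candidates, toy_overlap_score, keep=2)
print(top)
```

The useful property is that the scorer sees the query and the document together, so swapping the toy function for a real model changes the quality without changing the pipeline around it.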

Context engineering. Once you have the right chunks, how you assemble them into the context window matters as much as which ones they are. Models attend unevenly — the lost-in-the-middle effect is real — so order matters: most relevant material near the top and bottom, not buried. Deduplicate near-identical chunks. Include enough surrounding context that a chunk is self-contained. Carry structured metadata (source, date, section) so the model can reason about provenance and recency instead of treating a stale doc and a current one as equally true.
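
A rough sketch of those assembly steps follows: deduplicate, place the strongest chunks at the edges, and carry metadata inline. The `Chunk` fields, the 0.9 similarity threshold, the header format, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # relevance score, higher is better (e.g. from a re-ranker)
    source: str
    date: str     # ISO date string


def dedupe(chunks: List[Chunk], threshold: float = 0.9) -> List[Chunk]:
    """Drop chunks whose text is near-identical to a higher-scored chunk."""
    kept: List[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if all(SequenceMatcher(None, chunk.text, k.text).ratio() < threshold for k in kept):
            kept.append(chunk)
    return kept


def order_for_edges(chunks: List[Chunk]) -> List[Chunk]:
    """Put the best chunks at the start and end, the weakest in the middle."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def render(chunks: List[Chunk]) -> str:
    """Prefix each chunk with provenance so the model can weigh recency."""
    return "\n\n".join(f"[source: {c.source} | date: {c.date}]\n{c.text}" for c in chunks)


# Hypothetical re-ranked chunks; two of them are near-duplicates.
chunks = [
    Chunk("Install with pip.", 0.5, "a.md", "2024-01-10"),
    Chunk("Rotate keys every ninety days.", 0.8, "b.md", "2024-06-01"),
    Chunk("The cache is flushed on every deploy.", 0.9, "c.md", "2025-01-15"),
    Chunk("The cache is flushed on every deploy!", 0.85, "c_copy.md", "2023-03-02"),
    Chunk("Billing runs at midnight UTC.", 0.3, "d.md", "2022-11-20"),
]

ordered = order_for_edges(dedupe(chunks))
context = render(ordered)
print(context)
```

The pairwise comparison in `dedupe` is quadratic, which is fine for a handful of re-ranked chunks but would need a cheaper method such as hashing for larger sets.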

Retrieval as a reasoning step. The real shift: stop treating retrieval as a one-shot prefix. Let the model decompose a complex query into sub-queries, retrieve for each, and synthesize. Let it issue a follow-up retrieval when the first pass is insufficient. Let it decide a query needs no retrieval at all. This is where “agentic RAG” actually earns the name — not autonomy for its own sake, but the model treating retrieval as a tool it controls rather than a fixed step it is subjected to.
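
One way to picture this control flow is the loop below. It is a sketch under heavy assumptions: `plan_queries` stands in for a model call that looks at the question and the evidence so far and decides what to fetch next, `retrieve` stands in for the retrieval stack, and the toy planner, toy index, and round limit are hypothetical.

```python
from typing import Callable, List


def gather_evidence(
    question: str,
    plan_queries: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> List[str]:
    """Collect evidence over several rounds, letting the planner steer.

    An empty list from plan_queries means "enough evidence", or on the
    first round, "this question needs no retrieval at all".
    """
    evidence: List[str] = []
    for _ in range(max_rounds):
        queries = plan_queries(question, evidence)
        if not queries:
            break
        for query in queries:
            for passage in retrieve(query):
                if passage not in evidence:
                    evidence.append(passage)
    return evidence


# Toy stand-ins so the loop can run without a model or an index.
TOY_INDEX = {
    "v2.3.0 changelog": ["v2.3.0: known login timeout bug"],
    "v2.3.1 changelog": ["v2.3.1: fixed login timeout bug"],
}


def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


def toy_planner(question: str, evidence: List[str]) -> List[str]:
    # Ask for whichever version's changelog is still missing from the evidence.
    seen = " ".join(evidence)
    for version in ("v2.3.1", "v2.3.0"):
        if version + ":" not in seen:
            return [f"{version} changelog"]
    return []


evidence = gather_evidence(
    "What was fixed in v2.3.1 but not v2.3.0?", toy_planner, toy_retrieve
)
print(evidence)
```

The round limit matters: without it, a planner that never feels satisfied would loop indefinitely and the cost advantage of retrieval disappears.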

Long context did not kill retrieval

Every time the context window grows, someone declares RAG dead — just put the whole corpus in the prompt. This is wrong on three independent axes and the math does not care about the hype.

Cost and latency. A million-token prompt on every request is absurd economics when the answer lives in two thousand tokens. Retrieval is a cost-control mechanism, not just an accuracy one.
Attention degrades with length. Models do not use a 200k window uniformly. Effective recall over very long contexts is materially worse than over a tight, relevant context. More tokens is often less signal.
Freshness and scale. Your corpus is larger than any window and changes constantly. You cannot fine-tune or stuff your way out of a knowledge base that updates hourly.
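
The cost point is easy to check with arithmetic. The sketch below uses a made-up price and a made-up request volume purely for illustration; only the two prompt sizes (a million tokens versus two thousand) come from the text above.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical USD price, illustration only
REQUESTS_PER_DAY = 10_000              # hypothetical traffic


def daily_input_cost(prompt_tokens: int) -> float:
    """Daily spend on input tokens for a fixed prompt size."""
    per_request = prompt_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
    return per_request * REQUESTS_PER_DAY


full_corpus = daily_input_cost(1_000_000)  # stuff everything into the prompt
retrieved = daily_input_cost(2_000)        # send only the retrieved context
ratio = full_corpus / retrieved

print(f"full corpus: ${full_corpus:,.2f}/day")
print(f"retrieved:   ${retrieved:,.2f}/day")
print(f"ratio:       {ratio:,.0f}x")
```

Whatever the actual price, the ratio depends only on the token counts, so the gap stays at roughly 500x under these two prompt sizes.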

Long context and retrieval are complements. The right pattern is retrieve to a generous-but-bounded budget, then let the long window hold reasoning scratch space and multi-document synthesis. Use the window for thinking, not for storage.
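
"Generous but bounded" can be enforced with a simple packing step, sketched here. The whitespace token counter is a crude assumption (a real system would use the model's tokenizer), and the budget and sample chunks are hypothetical.

```python
from typing import Callable, List, Tuple


def pack_to_budget(
    ranked_chunks: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> Tuple[List[str], int]:
    """Greedily take chunks in rank order, skipping any that would not fit."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; a later chunk may fit
        selected.append(chunk)
        used += cost
    return selected, used


def rough_token_count(text: str) -> int:
    # Whitespace split as a stand-in for a real tokenizer.
    return len(text.split())


# Hypothetical chunks, already ordered best-first by the re-ranker.
ranked = [
    "one two three",
    "four five six seven eight",
    "nine ten",
]

selected, used = pack_to_budget(ranked, budget_tokens=5, count_tokens=rough_token_count)
print(selected, used)
```

Whatever remains of the window after the budget is spent is then free for the model's reasoning and synthesis rather than for storage.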

What to do Monday

Add sparse retrieval and fuse it with your vector search. If you only do one thing, do this. Reciprocal rank fusion takes only a few lines.
Put a cross-encoder re-ranker after retrieval. Retrieve 50, re-rank, keep 5. Measure the answer-quality delta — it is usually large.
Instrument retrieval quality independently of generation. You cannot fix what you cannot see. Log recall@k against a labeled set; most “the LLM is dumb” complaints are retrieval failures wearing a generation costume.
Let the model issue follow-up queries on hard questions before you reach for anything more exotic.
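
For the instrumentation item, recall@k against a labeled set can be computed with a few lines. The sketch below assumes a labeled set that maps each query to the ids of its relevant documents and a run that maps each query to the retriever's ranked output; the names and sample data are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant id")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled query; missing runs count as zero."""
    total = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in labels.items())
    return total / len(labels)


# Hypothetical labeled set and retriever output.
labels = {"q1": {"d1", "d2"}, "q2": {"d9"}}
runs = {"q1": ["d1", "d5", "d2", "d7"], "q2": ["d3", "d4", "d9"]}

recall_at_2 = mean_recall_at_k(runs, labels, k=2)
recall_at_3 = mean_recall_at_k(runs, labels, k=3)
print(recall_at_2, recall_at_3)
```

Tracking this number separately from answer quality is what lets you tell a retrieval failure from a generation failure.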

What’s hype to ignore

Ignore “just use a bigger context window” as a retrieval strategy — it is a reasoning aid, not a database. Ignore exotic chunking schemes before you have hybrid search and a re-ranker; you are optimizing a second-order term while the first-order one is broken. And be skeptical of GraphRAG and knowledge-graph pipelines until you have proven the simple stack is the bottleneck — they are real for a few genuinely relational domains and expensive overkill for the rest.

Retrieval is no longer the boring part of the system. It is where most of the quality, and most of the unrealized quality, actually lives.

#rag #retrieval #reranking #context-engineering