RAG is becoming retrieval-plus-reasoning
Naive vector RAG was a 2023 pattern. What works now is hybrid retrieval, re-ranking, context engineering, and treating retrieval as a step the model reasons over — not a lookup it trusts.
by BlackSpruce Lab
The naive RAG pattern — embed the query, pull the top-k nearest chunks, stuff them in the prompt, hope — was a reasonable first approximation in 2023. It is now the thing most underperforming systems are still doing. The frontier moved. Retrieval stopped being a lookup the model trusts blindly and became a step the model reasons over, queries iteratively, and is allowed to distrust.
If your RAG system is one embedding call and a top-k fetch, you are leaving most of the quality on the table, and no amount of prompt tinkering recovers it.
Dense embeddings are good at semantic similarity and bad at a list of things production queries actually need: exact identifiers, rare entity names, dates, SKUs, version numbers, negation. Ask a pure-vector system about “the bug fixed in v2.3.1 but not v2.3.0” and watch it return everything about the project and nothing about the version boundary. Embeddings smear precisely the tokens that carry the answer.
The fix is not a better embedding model. It is to stop pretending one retrieval method covers every query shape.
Hybrid retrieval. Run dense (vector) and sparse (BM25 / keyword) retrieval in parallel and fuse the results — reciprocal rank fusion is the boring, effective default. Sparse catches the exact-match and rare-term queries dense misses; dense catches the paraphrase and concept queries sparse misses. The union beats either alone on basically every real corpus. This is the single highest-leverage change most teams have not made.
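As a rough illustration of the fusion step, here is a minimal sketch of reciprocal rank fusion over two ranked lists of document ids. The document ids are made up for the example, and the constant k=60 is a commonly used default that is assumed here rather than taken from this article.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from a dense (vector) and a sparse (BM25) retriever.
dense_hits = ["doc_overview", "doc_changelog", "doc_faq"]
sparse_hits = ["doc_v231_fix", "doc_changelog", "doc_overview"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, while a document only one retriever found (here the exact-match hit from the sparse side) still stays in the candidate set instead of being dropped.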
Re-ranking. Retrieval optimizes for recall: cast a wide net and fetch 50–100 candidates. A cross-encoder re-ranker then scores each candidate against the query. Unlike a comparison of precomputed vectors, it attends to the query and document together. Keep only the top handful. The recall/precision split is the point: retrieve wide and cheap, re-rank narrow and accurate. A re-ranker is the cheapest large quality win available right now and most pipelines skip it.
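The retrieve-wide, re-rank-narrow shape can be sketched as below. The `score_pair` argument is where a real cross-encoder would plug in. The `toy_overlap_score` function is only a hypothetical stand-in so the example runs without a model; it is a word-overlap heuristic, not a cross-encoder.

```python
from typing import Callable, List, Tuple

def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float],
           keep: int = 5) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the best few."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]

def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the doc.

    A real pipeline would replace this with a cross-encoder model call.
    """
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

# Hypothetical wide candidate set from first-stage retrieval.
candidates = [
    "project overview and roadmap",
    "v2.3.1 release notes: bug fixed in parser",
    "v2.3.0 release notes",
]
top = rerank("bug fixed in v2.3.1", candidates, toy_overlap_score, keep=2)
print(top)
```

Keeping the scorer as a parameter means the first-stage retriever and the re-ranker can be swapped independently, which matches the recall/precision split described above.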
Context engineering. Once you have the right chunks, how you assemble them into the context window matters as much as which ones they are. Models attend unevenly — the lost-in-the-middle effect is real — so order matters: most relevant material near the top and bottom, not buried. Deduplicate near-identical chunks. Include enough surrounding context that a chunk is self-contained. Carry structured metadata (source, date, section) so the model can reason about provenance and recency instead of treating a stale doc and a current one as equally true.
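A minimal sketch of that assembly step, under stated assumptions: the `Chunk` fields, file names, dates, and scores are hypothetical, and the deduplication is a crude whitespace-and-case normalization standing in for real near-duplicate detection. The ordering places the strongest chunks at the two ends of the context and the weakest in the middle.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    source: str
    date: str
    score: float  # relevance score from retrieval or re-ranking

def dedupe(chunks: List[Chunk]) -> List[Chunk]:
    """Drop chunks whose text matches an earlier one after normalization."""
    seen, unique = set(), []
    for c in chunks:
        key = " ".join(c.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique

def edge_order(chunks: List[Chunk]) -> List[Chunk]:
    """Put the best chunk first, the second best last, weakest in the middle."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front, back = [], []
    for i, c in enumerate(ranked):
        (front if i % 2 == 0 else back).append(c)
    return front + back[::-1]

def assemble_context(chunks: List[Chunk]) -> str:
    """Dedupe (keeping the higher-scored copy), order, and attach metadata."""
    by_score = sorted(chunks, key=lambda c: c.score, reverse=True)
    ordered = edge_order(dedupe(by_score))
    blocks = [f"[source: {c.source} | date: {c.date}]\n{c.text}" for c in ordered]
    return "\n\n".join(blocks)

chunks = [
    Chunk("Parser crash fixed in v2.3.1.", "changelog.md", "2024-03-02", 0.9),
    Chunk("parser crash   fixed in v2.3.1.", "mirror.md", "2024-03-02", 0.7),
    Chunk("v2.3.0 shipped with a known parser crash.", "release-notes.md", "2024-01-15", 0.8),
    Chunk("General project overview.", "readme.md", "2023-06-01", 0.5),
]
context = assemble_context(chunks)
print(context)
```

The metadata line on each block is what lets the model weigh provenance and recency rather than treating every chunk as equally current.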
Retrieval as a reasoning step. The real shift: stop treating retrieval as a one-shot prefix. Let the model decompose a complex query into sub-queries, retrieve for each, and synthesize. Let it issue a follow-up retrieval when the first pass is insufficient. Let it decide a query needs no retrieval at all. This is where “agentic RAG” actually earns the name — not autonomy for its own sake, but the model treating retrieval as a tool it controls rather than a fixed step it is subjected to.
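The control flow described above might look roughly like this. Every callable passed in (`needs_retrieval`, `decompose`, `retrieve`, `is_sufficient`, `refine`) is a hypothetical hook where a model or search backend would sit; the lambdas and the tiny corpus below are toy stand-ins so the loop can run on its own.

```python
from typing import Callable, Dict, List

def gather_evidence(
    question: str,
    needs_retrieval: Callable[[str], bool],
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    refine: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Dict[str, List[str]]:
    """Collect evidence per sub-query, retrying with a refined query if needed."""
    evidence: Dict[str, List[str]] = {}
    if not needs_retrieval(question):
        return evidence  # the model decided no retrieval is needed
    for sub_query in decompose(question):
        query, found = sub_query, []
        for _ in range(max_rounds):  # bounded so the loop always terminates
            found.extend(retrieve(query))
            if is_sufficient(sub_query, found):
                break
            query = refine(sub_query, found)  # follow-up retrieval
        evidence[sub_query] = found
    return evidence

# Toy corpus and stand-in decisions (all hypothetical).
corpus = {
    "v2.3.1 fixes": ["v2.3.1 fixed the parser crash"],
    "v2.3.0 changelog": ["v2.3.0 shipped with the parser crash unfixed"],
}
evidence = gather_evidence(
    "Which bug was fixed in v2.3.1 but not v2.3.0?",
    needs_retrieval=lambda q: "v2." in q,
    decompose=lambda q: ["v2.3.1 fixes", "v2.3.0 fixes"],
    retrieve=lambda q: corpus.get(q, []),
    is_sufficient=lambda sub, found: len(found) > 0,
    refine=lambda sub, found: sub.replace("fixes", "changelog"),
)
print(evidence)
```

In this run the second sub-query finds nothing on the first pass, gets refined, and succeeds on the follow-up, which is the behavior a one-shot retrieval prefix cannot express.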
Every time the context window grows, someone declares RAG dead — just put the whole corpus in the prompt. This is wrong on three independent axes and the math does not care about the hype.
Long context and retrieval are complements. The right pattern is retrieve to a generous-but-bounded budget, then let the long window hold reasoning scratch space and multi-document synthesis. Use the window for thinking, not for storage.
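One way to make "generous-but-bounded" concrete is to pack ranked chunks greedily until a token budget is reached and leave the rest of the window free. This is a sketch under assumptions: the budget value is arbitrary, and the default whitespace word count is a rough proxy for tokens, not a real tokenizer.

```python
from typing import Callable, List, Tuple

def pack_to_budget(
    ranked_chunks: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Tuple[List[str], int]:
    """Greedily keep chunks in rank order while they fit in the budget."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget; a smaller one may still fit
        packed.append(chunk)
        used += cost
    return packed, used

# Chunks are assumed to arrive already ranked, best first.
ranked = ["a b c", "d e f g h", "i j"]
packed, used = pack_to_budget(ranked, budget_tokens=6)
print(packed, used)
```

Whatever the window holds beyond `used` stays available for multi-document synthesis and intermediate reasoning instead of being filled with lower-ranked text.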
Ignore “just use a bigger context window” as a retrieval strategy — it is a reasoning aid, not a database. Ignore exotic chunking schemes before you have hybrid search and a re-ranker; you are optimizing a second-order term while the first-order one is broken. And be skeptical of GraphRAG and knowledge-graph pipelines until you have proven the simple stack is the bottleneck — they are real for a few genuinely relational domains and expensive overkill for the rest.
Retrieval is no longer the boring part of the system. It is where most of the quality, and most of the unrealized quality, actually lives.