what we cover
Real signal on building with generative AI.
Evaluation
Evals that predict production behavior, not vibes. Offline versus online, LLM-as-judge failure modes, and catching regressions before they ship.
Retrieval
The shift from naive RAG to retrieval-plus-reasoning: hybrid search, re-ranking, context engineering, and the long-context tradeoffs that actually matter.
Agents
The reliability problem. Why demos do not survive production, and how bounded autonomy plus verification loops make tool use dependable.
latest
Articles
The agent reliability problem
Why the agent demo that wowed everyone falls apart in production: compounding error, brittle tool use, planning that doesn't replan, and the verification loops that actually make autonomy survivable.
RAG is becoming retrieval-plus-reasoning
Naive vector RAG was a 2023 pattern. What works now is hybrid retrieval, re-ranking, context engineering, and treating retrieval as a step the model reasons over, not a lookup it trusts.
Evals that predict production behavior, not vibes
Why most eval suites pass while production regresses, how LLM-as-judge quietly lies to you, and the harness that actually catches what ships broken.