Strategy papers for production LLM systems.
Working notes from the studio — evaluation, agents, retrieval and the infrastructure that turns prototypes into businesses. Written so the next person doing this work moves faster than we did.
- 01 Evaluation
Eval-driven development for production LLM systems
What the recent literature says about judging the judges — and how to ship around it.
Most production LLM systems are evaluated by other LLMs, against benchmarks that have probably leaked into the model's training data, with judges that systematically prefer their own kin. The recent literature is clear-eyed about all three problems. This paper is a working playbook for shipping evaluation infrastructure that survives those facts.
2026-05-12 · 16 min · Sal Anvarov
- 02 Agents
Agent architectures that actually ship
A 2026 pattern catalog grounded in the ReAct → Reflexion → SWE-agent literature.
The literature on production LLM agents has matured from "can we make this work at all?" to "which interface and memory structure makes it work reliably?" This paper walks through the dominant patterns (ReAct, Reflexion, AutoGen-style multi-agent systems, and SWE-agent's interface-as-product) and the failure modes that come with them.
2026-05-08 · 18 min · Sal Anvarov
- 03 Retrieval
RAG that actually retrieves
From Self-RAG to GraphRAG to Self-Route — what the 2024 literature changed about production retrieval.
Production RAG in 2026 looks nothing like the "embed, retrieve, generate" of 2023. The recent literature converges on retrieval as a routing problem: when to retrieve, how to verify retrieval quality, when to switch to graph indexing, and when to lean on long context instead. This paper is the practitioner-facing version.
2026-03-15 · 17 min · Sal Anvarov