
RAG that actually retrieves

From Self-RAG to GraphRAG to Self-Route — what the 2024 literature changed about production retrieval.

Sal Anvarov · March 15, 2026 · 17 min read

RAG
Self-RAG
CRAG
GraphRAG
Long context

The thesis

The default RAG stack (naive chunking, single-vector search, top-k by cosine) tends to cap out around 60% retrieval accuracy on real corpora. The 2024 literature converges on an explanation: production RAG is a routing problem, not a pipeline.

Four recent papers reshaped the field:

Self-RAG (Asai et al., ICLR 2024) — retrieve adaptively, not always.
CRAG (Yan et al., 2024) — verify retrieval quality with a classifier and fall back when retrieval fails.
GraphRAG (Edge et al., Microsoft Research, 2024) — graph indexing for global / sensemaking questions where vector retrieval breaks down.
RAG or Long-Context? (Li et al., Google DeepMind, 2024) — route queries between RAG and long-context based on the model's own self-reflection.

Combined with RULER's sobering results on real long-context capability, the picture is clear: a production RAG system in 2026 is a router with a quality classifier attached, not a fixed pipeline.

What the literature actually says

Adaptive retrieval — Self-RAG

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., arXiv 2023; ICLR 2024) trains a single LLM to emit "reflection tokens" that decide on demand whether retrieval is needed, whether a retrieved passage is relevant, and whether the generation is supported by it.

The authors are explicit about why the dominant pattern of 2023 was wrong: "indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility."

The 7B and 13B Self-RAG models outperform ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact verification, and the paper reports significant gains in factuality and citation accuracy on long-form generation.

Takeaway: stop always-retrieving. Make retrieval conditional on the model's own uncertainty.

Retrieval quality as a first-class signal — CRAG

Corrective Retrieval Augmented Generation (Yan et al., 2024) adds a lightweight retrieval evaluator that scores documents before they reach the generator. Low-confidence retrievals trigger fallbacks: web-search augmentation, query rewriting, or rejection.

The paper's framing: "a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered."

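CRAG is plug-and-play: the evaluator sits between the retriever and the generator, so it can be coupled with whatever base RAG approach you already run. The cheapest, highest-leverage RAG upgrade we deploy in 2026 is some descendant of this pattern.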

When vector retrieval is the wrong tool — GraphRAG

From Local to Global: A Graph RAG Approach to Query-Focused Summarization (Edge et al., Microsoft Research, 2024) clarifies when graph-based indexing actually helps: not for lookup-style QA, but for query-focused summarization — questions like "what are the main themes in this corpus?" that require synthesis across many documents.

The paper's framing is precise: "RAG fails on global questions directed at an entire text corpus, such as 'What are the main themes in the dataset?', since this is inherently a query-focused summarization (QFS) task."

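GraphRAG builds an LLM-extracted entity graph and pre-generated community summaries, then composes partial answers across communities into a final response. On global sensemaking questions over datasets in the 1-million-token range, the paper reports "substantial improvements" over a naive RAG baseline in both the comprehensiveness and diversity of answers.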

Practitioner takeaway: graph indexing is the right tool for global questions. For lookup, vanilla dense retrieval still usually wins.

RAG vs long-context — actually, both — Self-Route

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach (Li et al., Google DeepMind, 2024) tests RAG against long-context prompting on the same benchmarks, using Gemini-1.5-Pro, GPT-4O, and GPT-3.5-Turbo. Long-context wins on average when you can afford the tokens. RAG remains dramatically cheaper.

Their solution, Self-Route, lets the model route each query itself: it first attempts an answer from the retrieved chunks and falls back to the full long context only when it judges those chunks insufficient. That keeps quality comparable to long-context at a fraction of the cost. The framing: "when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage."

Don't pick RAG vs long-context. Route between them.

Don't trust the context-window marketing — RULER

RULER: What's the Real Context Size of Your Long-Context Language Models? (Hsieh et al., NVIDIA, 2024) shows that simple needle-in-a-haystack tests massively over-state real long-context capability. On harder multi-hop, aggregation, and multi-key retrieval tasks, most models claiming 32K+ contexts degrade sharply.

The bottom line from Hsieh et al.: "despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases."

For practitioners this is the strongest argument for keeping RAG in the stack: nominal context windows lie.

The new playbook (eight steps, in order)

We work in this order on every project. The order matters because each step's lift depends on the ones before it.

1. Measure first

Build an eval set of query → expected document ID pairs. Measure recall@k, MRR, and win-rate vs baseline. (See our paper on eval-driven development.)
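A minimal sketch of the two retrieval metrics named above, assuming the eval set is stored as a mapping from query ID to the set of expected document IDs and each retrieval run as a mapping from query ID to a ranked list of IDs. The function names and data layout are illustrative, not taken from any particular library.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant doc IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(run: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 5) -> Dict[str, float]:
    """Average recall@k and MRR over every query in the gold set."""
    n = len(gold)
    recall = sum(recall_at_k(run.get(q, []), rel, k) for q, rel in gold.items()) / n
    mrr = sum(reciprocal_rank(run.get(q, []), rel) for q, rel in gold.items()) / n
    return {f"recall@{k}": recall, "mrr": mrr}


# Toy eval set: two queries with hand-labelled expected documents.
gold = {"q1": {"doc_a"}, "q2": {"doc_b", "doc_c"}}
run = {"q1": ["doc_x", "doc_a", "doc_y"], "q2": ["doc_b", "doc_z", "doc_c"]}
scores = evaluate(run, gold, k=2)
print(scores)  # {'recall@2': 0.75, 'mrr': 0.75}
```

Win-rate against a baseline then falls out of running `evaluate` on two runs over the same gold set and comparing per-query results.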

2. Fix chunking

Default chunking splits text at fixed character counts, slicing paragraphs and functions in the middle. Chunk by structure first (headers, semantic boundaries), length second.
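One way to read "structure first, length second" is sketched below. It assumes the source documents use markdown-style `#` headers, which may not match your corpus, and the `max_chars` budget is an arbitrary illustrative value. Sections are split at headers first; only sections that exceed the budget are split further, and then only at paragraph boundaries.

```python
import re
from typing import List


def chunk_by_structure(text: str, max_chars: int = 1200) -> List[str]:
    """Split at header lines first, then pack paragraphs up to max_chars."""
    # Zero-width split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: pack whole paragraphs greedily, never cutting mid-paragraph.
        current = ""
        for para in re.split(r"\n\s*\n", section):
            para = para.strip()
            if not para:
                continue
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) <= max_chars or not current:
                # A single paragraph longer than max_chars is kept whole.
                current = candidate
            else:
                chunks.append(current)
                current = para
        if current:
            chunks.append(current)
    return chunks


doc = "# Intro\nShort intro.\n\n## Details\n" + "A" * 50 + "\n\n" + "B" * 50
chunks = chunk_by_structure(doc, max_chars=80)
```

For source code or HTML the same idea applies with a different boundary detector (function definitions, heading tags) in place of the header regex.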

3. Add hybrid search

Combine BM25 with dense retrieval and fuse the two rankings with reciprocal rank fusion. This is typically worth five to fifteen percentage points of accuracy for a very small code change.
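Reciprocal rank fusion itself is only a few lines. The sketch below assumes you already have two ranked lists of document IDs, one from BM25 and one from a dense retriever; how those lists are produced is outside the example. The constant `k = 60` is the commonly used default, not something tuned for any particular corpus.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]    # hypothetical BM25 ranking
dense_hits = ["c", "a", "d"]   # hypothetical dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

Because the fusion uses ranks rather than raw scores, the BM25 and cosine scores never need to be put on a common scale.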

4. Add a reranker

A cross-encoder that rescores the top 30 candidates and keeps the best 5 routinely lifts MRR further. The added latency is small relative to the LLM call that follows.
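The rerank step can be sketched independently of any model. Here `score_fn` is a placeholder for whatever cross-encoder you deploy, and `overlap_score` is a toy stand-in (plain token overlap) used only so the example runs without a model download. Neither name comes from a real library.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],       # (doc_id, passage text) from first-stage retrieval
    score_fn: Callable[[str, str], float],   # stand-in for a cross-encoder
    keep: int = 5,
) -> List[str]:
    """Score every (query, passage) pair jointly and keep the best `keep` doc IDs."""
    scored = [(score_fn(query, text), doc_id) for doc_id, text in candidates]
    # Stable sort: ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:keep]]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: share of query tokens that also occur in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    ("d1", "cats are mammals"),
    ("d2", "reset your password in settings"),
    ("d3", "password rules"),
]
top = rerank("how to reset password", candidates, overlap_score, keep=2)
print(top)  # ['d2', 'd3']
```

In a real pipeline `candidates` would hold the top 30 or so fused results and `score_fn` would wrap the cross-encoder's pairwise prediction call.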

5. Make retrieval adaptive (Self-RAG)

Stop always-retrieving. Condition retrieval on the model's own uncertainty. Asai et al.'s reflection-token framing translates to a prompt-level routing decision for non-trained pipelines.
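For a pipeline without a fine-tuned Self-RAG model, the routing decision can be approximated with a gating prompt. The sketch below is our prompt-level approximation, not the paper's trained reflection tokens. `llm` and `retrieve` are hypothetical callables you would supply, and the prompt wording is an assumption.

```python
from typing import Callable, List

GATE_PROMPT = (
    "Decide whether answering the question requires looking up external documents.\n"
    "Reply with exactly one word: RETRIEVE or NO_RETRIEVE.\n\n"
    "Question: {question}"
)


def needs_retrieval(question: str, llm: Callable[[str], str]) -> bool:
    """Ask the model whether it needs external evidence for this question."""
    reply = llm(GATE_PROMPT.format(question=question)).strip().upper()
    # Fail safe: anything other than an explicit NO_RETRIEVE triggers retrieval.
    return reply != "NO_RETRIEVE"


def answer(
    question: str,
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> str:
    """Retrieve only when the gate says so, then generate."""
    if needs_retrieval(question, llm):
        context = "\n\n".join(retrieve(question))
        return llm(f"Context:\n{context}\n\nQuestion: {question}")
    return llm(f"Question: {question}")
```

Defaulting to retrieval on an unparseable gate reply is a deliberate choice here: an unnecessary retrieval costs latency, while a skipped one can cost correctness.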

6. Add a retrieval quality classifier (CRAG)

Score retrievals before generation. Define a fallback path for low-confidence retrievals — query rewriting, web search, or abstaining.
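The decision logic downstream of the evaluator can be sketched as a three-way split on confidence, loosely following the correct / ambiguous / incorrect actions described in the CRAG paper. The evaluator that produces the scores is not shown, and the thresholds and action names below are illustrative assumptions rather than values from the paper.

```python
from enum import Enum
from typing import List


class Action(Enum):
    USE = "use_retrieved"            # at least one passage is confidently relevant
    AUGMENT = "combine_with_fallback"  # uncertain: keep passages, add fallback evidence
    FALLBACK = "fallback_only"       # nothing looks relevant: rewrite, web search, or abstain


def choose_action(scores: List[float], upper: float = 0.7, lower: float = 0.3) -> Action:
    """Map per-passage evaluator scores (assumed to lie in [0, 1]) to a retrieval action."""
    if not scores:
        return Action.FALLBACK
    best = max(scores)
    if best >= upper:
        return Action.USE
    if best < lower:
        return Action.FALLBACK
    return Action.AUGMENT


print(choose_action([0.9, 0.1]).value)  # use_retrieved
print(choose_action([0.5]).value)       # combine_with_fallback
print(choose_action([0.1, 0.2]).value)  # fallback_only
```

Both thresholds should be calibrated against the eval set from step 1 rather than chosen by hand.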

7. Decide: lookup or synthesis (GraphRAG vs vector)

Lookup-style questions → vector. Corpus-wide synthesis → graph indexing. Most teams over-deploy GraphRAG into the wrong regime; it is a tool for sensemaking, not a general upgrade.
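The lookup-versus-synthesis decision can start as something very simple. The sketch below uses a keyword heuristic purely as a placeholder: the cue list is our assumption, and a production router would more likely be a small trained classifier or an LLM prompt evaluated against labelled queries.

```python
import re

# Hypothetical cue phrases that suggest a corpus-wide, sensemaking question.
GLOBAL_CUES = re.compile(
    r"\b(main themes?|overall|across (the|all)|summari[sz]e|trends?)\b",
    re.IGNORECASE,
)


def choose_index(query: str) -> str:
    """Return 'graph' for corpus-wide synthesis questions, 'vector' for lookups."""
    return "graph" if GLOBAL_CUES.search(query) else "vector"


print(choose_index("What are the main themes in this corpus?"))         # graph
print(choose_index("What is the default timeout for the upload API?"))  # vector
```

Defaulting to the vector index keeps the cheaper path as the common case and reserves graph queries for questions that clearly ask for synthesis.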

8. Route RAG vs long-context (Self-Route)

For each query, classify whether the answer is more likely to come from a targeted retrieval or from streaming a large chunk into a long-context model. RULER's results remind us nominal contexts lie — test on multi-hop benchmarks before retiring retrieval.
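As we read the paper, Self-Route attempts the cheap RAG path first and lets the model decline when the retrieved chunks are not enough; only declined queries pay for long context. A minimal sketch of that control flow follows. `retrieve`, `llm`, and `load_full_context` are hypothetical callables, and the prompt wording is our assumption rather than the paper's.

```python
from typing import Callable, List, Tuple

UNANSWERABLE = "unanswerable"


def self_route(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    load_full_context: Callable[[], str],
) -> Tuple[str, str]:
    """Return (route, answer): try RAG first, fall back to long context if the model declines."""
    chunks = retrieve(question)
    rag_prompt = (
        "Answer using only the passages below. "
        f"If they are insufficient, reply exactly '{UNANSWERABLE}'.\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    reply = llm(rag_prompt).strip()
    # Tolerate trailing punctuation or quotes around the sentinel word.
    if reply.lower().strip(".'\"") != UNANSWERABLE:
        return "rag", reply
    # The model declined: pay for the full context only now.
    lc_prompt = f"{load_full_context()}\n\nQuestion: {question}"
    return "long_context", llm(lc_prompt)
```

Logging the returned route per query gives the share of traffic that actually needs long context, which is the number to watch when estimating cost.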

Practitioner takeaways

Treat retrieval quality as a measurable signal: train or prompt a lightweight evaluator (CRAG-style) and define a fallback (rewrite, web search, abstain) when scores are low.
Use graph-based indexing only for corpus-wide / sensemaking questions — keep vector retrieval for lookup-style QA.
Don't choose RAG vs long-context: route. Self-Route-style routing keeps quality comparable to long-context while paying long-context cost only on the queries that need it.
Distrust nominal context-window claims; benchmark your actual workloads on RULER-style multi-hop tasks before retiring retrieval.
Make retrieval adaptive (Self-RAG): "always retrieve K" is a worse default than "retrieve when the model is uncertain."

Closing

Most production RAG is built once and never measured again, which is why most production RAG is bad. The 2024 literature gives us a clear, defensible, citation-grounded playbook for what to build instead. Measure first. Route, don't pipeline. Treat retrieval quality as a signal. And remember that the marketing-claimed context window is not the real one.

Sources

Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, arXiv:2310.11511
Yan et al., Corrective Retrieval Augmented Generation, arXiv:2401.15884
Edge et al., From Local to Global: A Graph RAG Approach to Query-Focused Summarization, arXiv:2404.16130
Li et al., Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, arXiv:2407.16833
Hsieh et al., RULER: What's the Real Context Size of Your Long-Context Language Models?, arXiv:2404.06654
