Eval-driven development for production LLM systems

What the recent literature says about judging the judges — and how to ship around it.

Sal Anvarov | May 12, 2026 | 16 min read

Evaluation
LLM-as-judge
Benchmarking
Production systems

TL;DR

The evaluation literature from 2022 to 2025 converges on three uncomfortable points: LLM-as-judge works, but is biased in ways that compound silently (Zheng et al., 2023; Li et al., 2025); public benchmarks are contaminated and overstate model capability (Xu et al., 2024); and "quality" is a vector, not a scalar, with calibration and robustness often mattering more than peak accuracy (Liang et al., 2022).

This paper is the working playbook we use to ship eval infrastructure that survives those facts.

Why "vibe-checked" systems decay

A typical LLM system goes through three phases:

Prototype. An engineer hand-tunes prompts against a small set of inputs they trust. The output looks great. The thing ships.
Production. Real traffic hits the system. Inputs are messier than the prototype set. Some outputs are great. Some are bad in ways nobody notices for weeks.
Decay. Models update. Prompts drift. New features get bolted on. The system silently gets worse — sometimes catastrophically — and the team cannot tell, because there is no measurement.

The pathology is not the absence of an eval; it is the absence of an eval loop. Most "AI products" we audit ship without a single automated regression test on model output.

What the literature actually says

LLM-as-judge: useful, biased, and now leaky

The foundational result is Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023): a strong judge model (GPT-4) achieves over 80% agreement with both expert and crowdsourced human preferences — "the same level of agreement between humans" — making LLM-as-judge a defensible primitive for evaluation at scale.

The same paper catalogs the failure modes that practitioners need to design around: position bias (judges favor an answer because of where it appears, often the first option), verbosity bias (judges favor longer answers even when they are no better), and self-enhancement bias (judges may favor answers they generated themselves).

By 2025, a new failure mode entered the canon. Preference Leakage (Li et al., 2025) shows that when the same model, or even a kin model from the same family or with the same teacher, is used both to synthesize training data and to judge outputs, the judge systematically prefers its own kin in a way that is harder to detect than the older biases.

That means a common practitioner pipeline — generate fine-tuning data with GPT-4, evaluate the fine-tuned model with GPT-4 — is a closed loop that inflates scores invisibly. The fix is structural separation between data generation and judging.
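One way to make that separation enforceable is a pre-run check that refuses to start an eval when the judge shares a family with the data generator or the model under test. The sketch below is illustrative only: the prefix table, `modelFamily`, and `checkJudgeSeparation` are hypothetical names, and prefix matching cannot detect teacher-student relationships across vendors, so those still need to be recorded by hand.

```typescript
// Hypothetical prefix-to-family table; extend it for the models in your stack.
const FAMILY_PREFIXES: Array<[string, string]> = [
  ["gpt-", "openai"],
  ["claude-", "anthropic"],
  ["gemini-", "google"],
];

function modelFamily(modelId: string): string {
  const hit = FAMILY_PREFIXES.find(([prefix]) => modelId.startsWith(prefix));
  // Unknown models only match themselves, so they never pair up by accident.
  return hit ? hit[1] : `unknown:${modelId}`;
}

type PipelineRoles = {
  dataGenerator: string; // model that synthesized the fine-tuning data
  modelUnderTest: string; // base model being fine-tuned and evaluated
  judge: string; // model that grades the outputs
};

// Returns human-readable violations; an empty array means the judge is separated.
function checkJudgeSeparation(roles: PipelineRoles): string[] {
  const judgeFamily = modelFamily(roles.judge);
  const violations: string[] = [];
  if (judgeFamily === modelFamily(roles.dataGenerator)) {
    violations.push(`judge ${roles.judge} shares a family with data generator ${roles.dataGenerator}`);
  }
  if (judgeFamily === modelFamily(roles.modelUnderTest)) {
    violations.push(`judge ${roles.judge} shares a family with model under test ${roles.modelUnderTest}`);
  }
  return violations;
}
```

Running this as the first step of an eval job turns the closed loop into a failed build rather than a quietly inflated score.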

The broader survey, A Survey on LLM-as-a-Judge (Gu et al., 2024), reframes the entire problem as reliability engineering: the judge is a system that itself needs an eval harness before you trust it to grade your model.
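A minimal version of that harness is an agreement check between the judge and human labels on the same sample. The sketch below assumes binary pass/fail labels and reports raw agreement plus Cohen's kappa, which corrects for agreement expected by chance; the function name and the boolean encoding are our assumptions, not something the survey prescribes.

```typescript
type AgreementStats = { agreement: number; kappa: number };

// Compares human and judge pass/fail labels for the same cases, in the same order.
function judgeAgreement(human: boolean[], judge: boolean[]): AgreementStats {
  if (human.length === 0 || human.length !== judge.length) {
    throw new Error("Label arrays must be non-empty and the same length");
  }
  const n = human.length;
  let agree = 0;
  let humanYes = 0;
  let judgeYes = 0;
  for (let i = 0; i < n; i++) {
    if (human[i] === judge[i]) agree++;
    if (human[i]) humanYes++;
    if (judge[i]) judgeYes++;
  }
  const observed = agree / n;
  // Chance agreement: both say pass, or both say fail, if the raters were independent.
  const expected =
    (humanYes / n) * (judgeYes / n) + (1 - humanYes / n) * (1 - judgeYes / n);
  // Kappa is undefined when chance agreement is already 1; report NaN in that case.
  const kappa = expected === 1 ? NaN : (observed - expected) / (1 - expected);
  return { agreement: observed, kappa };
}
```

Raw agreement alone can look healthy when almost every case passes, which is why the chance-corrected number is worth logging next to it.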

Public benchmarks are contaminated

Benchmark Data Contamination of Large Language Models: A Survey (Xu et al., 2024) is the paper that ended the era of citing MMLU as if it meant something. Their review documents systematic leakage of evaluation benchmarks into model training corpora — including MMLU, GSM8K, HumanEval, and others — inflating reported scores via memorization rather than generalization.

The practical implication: because contamination inflates scores, public benchmark numbers should be treated as approximate ceilings on capability, not floors. Production systems need private, rotated holdout sets per use case. The benchmark you keep secret is the only benchmark that tells you whether the model generalized.
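Rotation can be as simple as deterministically selecting a different slice of a private case pool for each rotation period. The following is a sketch under our own assumptions: the rotation label (for example a quarter name), the `fraction` parameter, and the use of an FNV-1a hash for bucketing are illustrative choices, not part of any cited paper.

```typescript
// FNV-1a 32-bit hash: deterministic and dependency-free, adequate for bucketing.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Selects the cases that are "live" for a given rotation label.
// The same (rotation, id) pair always lands in the same bucket, so runs are reproducible.
function activeHoldout<T extends { id: string }>(
  pool: T[],
  rotation: string,
  fraction: number,
): T[] {
  return pool.filter(
    (c) => fnv1a(`${rotation}:${c.id}`) / 0x100000000 < fraction,
  );
}
```

Changing the rotation label reshuffles which cases are active, so a slice that leaks into a training corpus or gets over-fitted by prompt tuning stops being the one you grade against.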

Quality is a vector

Holistic Evaluation of Language Models (Liang et al., 2022) is old by AI standards but the argument has aged well: a single leaderboard number hides the trade-offs that matter in production. HELM measures seven axes — accuracy, calibration, robustness, fairness, bias, toxicity, efficiency — and shows that no model dominates on all of them.

For production teams the takeaway is structural: quality is a scorecard, not a score. Calibration and robustness usually matter more than peak accuracy for shipped systems.
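In code, a scorecard means the release gate compares every axis rather than one aggregate. A minimal sketch, assuming each axis is normalized so that higher is better and using a hypothetical `tolerance` for run-to-run noise:

```typescript
// Axis name -> score. Assumption: every axis is oriented so that higher is better.
type Scorecard = Record<string, number>;

// Returns the axes on which the candidate is worse than the baseline by more than
// the tolerance, or is missing entirely. An empty array means no axis regressed.
function regressedAxes(
  baseline: Scorecard,
  candidate: Scorecard,
  tolerance = 0.01,
): string[] {
  const regressed: string[] = [];
  for (const axis of Object.keys(baseline)) {
    const before = baseline[axis];
    const after = candidate[axis];
    if (before === undefined) continue;
    if (after === undefined || after < before - tolerance) {
      regressed.push(axis);
    }
  }
  return regressed;
}
```

A variant that gains three points of accuracy but loses ten points of calibration fails this gate, which a single averaged score would hide.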

The four kinds of evals you need (in order)

We build evals in this order on every project:

1. Reference evals

The dataset has a known correct answer. Compare model output via exact match, structural match, or a deterministic scoring function. Use for classification, extraction, structured output. These are cheap, fast, and the only kind of eval that gives you the green or red bar most engineering teams expect from CI.
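As an illustration, here are two deterministic scorers of the kind described: a normalized exact match for labels and a structural match for JSON output. The normalization rules (trim, collapse whitespace, lowercase) are our assumption; choose whatever your task treats as equivalent.

```typescript
// Exact match after light normalization, suitable for classification labels.
function normalizedExactMatch(actual: string, expected: string): boolean {
  const norm = (s: string) => s.trim().replace(/\s+/g, " ").toLowerCase();
  return norm(actual) === norm(expected);
}

// Structural match: the model's raw output must parse as JSON and deep-equal
// the expected value, ignoring key order and formatting.
function structuralMatch(actualJson: string, expected: unknown): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(actualJson);
  } catch {
    return false; // unparseable output is a failure, not an exception
  }
  return deepEqual(parsed, expected);
}

function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) =>
      k in b &&
      deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
  );
}
```

Both functions return a plain boolean, which is what lets a CI job map them directly onto a pass or fail.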

2. Rubric evals

The dataset has expected qualities, not expected outputs. A scoring function — usually another model — evaluates whether the output exhibits those qualities. The workhorse for content; also where the most methodological errors happen.
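One way to reduce those errors is to ask the judge for a separate verdict per criterion and aggregate deterministically, instead of requesting one free-form score. The sketch below makes several assumptions: `RubricJudge` is a hypothetical interface that you would back with a model call, it is synchronous only to keep the example short, and the default of requiring every criterion is our choice.

```typescript
type CriterionVerdict = { criterion: string; met: boolean; rationale: string };

// Hypothetical judge interface. A real judge would call a model and be async.
type RubricJudge = (output: string, criterion: string) => CriterionVerdict;

type RubricScore = { score: number; pass: boolean; failed: CriterionVerdict[] };

// Asks the judge about each criterion independently, then aggregates in plain code.
function scoreAgainstRubric(
  output: string,
  criteria: string[],
  judge: RubricJudge,
  passThreshold = 1, // fraction of criteria that must be met
): RubricScore {
  if (criteria.length === 0) {
    throw new Error("Rubric needs at least one criterion");
  }
  const verdicts = criteria.map((criterion) => judge(output, criterion));
  const failed = verdicts.filter((v) => !v.met);
  const score = (verdicts.length - failed.length) / verdicts.length;
  return { score, pass: score >= passThreshold, failed };
}
```

Keeping the per-criterion rationale makes failures debuggable: you can see which quality was missing rather than only that the overall score dropped.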

3. Pairwise evals

Two outputs, one input. The scoring function picks the better one (or calls it a tie). Aggregated over thousands of comparisons, you get a stable preference ranking. Robust to absolute scoring drift — graded on relative quality, which is easier to get right.
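Pairwise judging is exactly where position bias bites, so a common mitigation is to run each comparison in both orders and count a win only when the judge is consistent. The sketch below follows that idea; `PairwiseJudge` is a hypothetical interface (synchronous for brevity), and scoring a tie as half a win is our assumption.

```typescript
type Verdict = "A" | "B" | "tie";

// Hypothetical judge interface: sees two candidates in a fixed order.
type PairwiseJudge = (
  input: string,
  first: string,
  second: string,
) => "first" | "second" | "tie";

// Runs the comparison in both orders. A win counts only if the judge prefers
// the same answer regardless of position; anything else is recorded as a tie.
function compareDebiased(
  input: string,
  a: string,
  b: string,
  judge: PairwiseJudge,
): Verdict {
  const forward = judge(input, a, b); // a shown first
  const reversed = judge(input, b, a); // b shown first
  if (forward === "first" && reversed === "second") return "A";
  if (forward === "second" && reversed === "first") return "B";
  return "tie";
}

// Win rate of variant A over a batch of verdicts, with ties worth half a win.
function winRateOfA(verdicts: Verdict[]): number {
  if (verdicts.length === 0) return 0.5;
  let points = 0;
  for (const v of verdicts) {
    points += v === "A" ? 1 : v === "tie" ? 0.5 : 0;
  }
  return points / verdicts.length;
}
```

The cost is two judge calls per comparison. The benefit is that a judge that always prefers the first slot produces ties instead of a fake preference.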

4. Production evals

The same scoring functions, running on a sample of live traffic. Output quality is logged like latency. Regressions trigger alerts. This is what separates a project from a system.
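A rolling window over sampled results is enough to turn "logged like latency" into an alert. This is a sketch with hypothetical names and thresholds; in a real deployment the returned values would be emitted to whatever metrics and alerting system you already run.

```typescript
// Tracks the pass rate over the most recent N scored production samples.
class RollingPassRate {
  private readonly results: boolean[] = [];

  constructor(
    private readonly windowSize: number,
    private readonly alertBelow: number, // pass rate under this value triggers an alert
  ) {}

  // Records one scored sample and reports the current window state.
  record(pass: boolean): { passRate: number; alert: boolean } {
    this.results.push(pass);
    if (this.results.length > this.windowSize) {
      this.results.shift(); // drop the oldest sample
    }
    const passed = this.results.filter((r) => r).length;
    const passRate = passed / this.results.length;
    // Only alert on a full window, so a single early failure does not page anyone.
    const windowFull = this.results.length === this.windowSize;
    return { passRate, alert: windowFull && passRate < this.alertBelow };
  }
}
```

The window size trades detection speed against noise: a small window catches regressions quickly but fires on ordinary variance.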

The metric ladder

Layer	What it answers	Cadence
Unit	Structurally valid output?	Every commit
Reference	Known-answer cases passing?	Every PR
Rubric	Quality bar holding?	Pre-merge to main
Pairwise	New variant better than current?	Pre-deploy
Production	Live quality holding?	Continuous
User	Real users prefer this?	Weekly review

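The most expensive offline evals (rubric, pairwise) run least often, at merge and deploy gates. The cheapest ones (unit, reference) run constantly. Production evals run continuously, but only on a sample of live traffic. Most teams have this inverted.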

Anti-patterns

"GPT-4 is our judge." Calibrate the judge before scaling it. Run human-vs-judge agreement on a sample. Pin the judge version. Use structured rubrics, not free-form prompts. Re-audit periodically.

"GPT-4 wrote our training data and GPT-4 judges our model." This is preference leakage in textbook form. Split generator and judge — ideally across model families.

"We A/B test in production." A/B testing is downstream of evals, not upstream. By the time you A/B test, you have already decided which variants are worth showing users.

"Our prompts are too dynamic to test." Almost always wrong. Prompts may be templated, but the template is what you test.
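To make the last point concrete, here is a sketch of testing a template rather than any single prompt: render it against fixture inputs and fail loudly on a missing variable. The `{{name}}` placeholder syntax and the `renderTemplate` helper are assumptions for illustration; substitute your own templating layer.

```typescript
// Fills {{name}} placeholders from vars; throws if a variable is missing so a
// broken template fails in CI instead of shipping a prompt with a hole in it.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => {
    const value = vars[name];
    if (value === undefined) {
      throw new Error(`Missing template variable: ${name}`);
    }
    return value;
  });
}

// Renders the template for every fixture and returns the prompts to feed into evals.
function renderFixtures(
  template: string,
  fixtures: Array<Record<string, string>>,
): string[] {
  return fixtures.map((vars) => renderTemplate(template, vars));
}
```

The fixtures play the role of eval inputs: each rendered prompt goes through the same reference and rubric scorers as any static prompt would.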

When to build the harness

Before the first prompt. We mean this literally — the hour you spend writing the first reference eval is the hour that saves you the week of post-launch debugging when the rubric falls over.

A reference architecture

type EvalCase<TInput, TOutput> = {
  id: string;
  input: TInput;
  expected?: TOutput;
  rubric?: {
    criteria: string[];
    judge: 'gpt-4o' | 'claude-sonnet-4-6';
    // CRUCIAL: judge family must differ from the model under test's family
    judgeFamilyMustDifferFrom?: string;
  };
  tags: string[];
};
 
type EvalResult = {
  caseId: string;
  pass: boolean;
  score?: number;
  judgeRationale?: string;
  // Per HELM: log multiple axes, not just one
  calibration?: number;
  robustness?: number;
  latencyMs: number;
  costUsd: number;
};

Cases are versioned in git next to the prompts they test. Runs are stored in a small Postgres table. Comparisons across runs are a SQL query. The whole thing fits in under 1,000 lines of code for most projects.
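For illustration, the cross-run comparison can also be expressed in memory, which is handy in a CI step that has two result files but no database connection. The sketch uses a reduced `RunResult` shape (just `caseId` and `pass` from `EvalResult`), and the function name is hypothetical.

```typescript
// Reduced view of EvalResult: only the fields needed to compare two runs.
type RunResult = { caseId: string; pass: boolean };

type RunDiff = { regressed: string[]; fixed: string[]; added: string[] };

// Classifies each case in the candidate run relative to the baseline run.
function diffRuns(baseline: RunResult[], candidate: RunResult[]): RunDiff {
  const before = new Map<string, boolean>(
    baseline.map((r): [string, boolean] => [r.caseId, r.pass]),
  );
  const diff: RunDiff = { regressed: [], fixed: [], added: [] };
  for (const result of candidate) {
    const previous = before.get(result.caseId);
    if (previous === undefined) {
      diff.added.push(result.caseId); // case did not exist in the baseline
    } else if (previous && !result.pass) {
      diff.regressed.push(result.caseId); // passed before, fails now
    } else if (!previous && result.pass) {
      diff.fixed.push(result.caseId); // failed before, passes now
    }
  }
  return diff;
}
```

A non-empty `regressed` list is the natural thing to block a merge on, while `fixed` and `added` are worth printing in the PR summary.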

Closing

Evals are not the glamorous part of LLM engineering. They are the part that separates products that improve from products that drift. The literature is now mature enough to tell us how to do them honestly: calibrate the judge, separate the generator and judge, treat public benchmarks as ceilings, measure quality as a vector.

If you do those four things, you are already ahead of most of the teams we see shipping LLM software today.

Sources

Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv:2306.05685
Gu et al., A Survey on LLM-as-a-Judge, arXiv:2411.15594
Xu et al., Benchmark Data Contamination of Large Language Models: A Survey, arXiv:2406.04244
Li et al., Preference Leakage: A Contamination Problem in LLM-as-a-judge, arXiv:2502.01534
Liang et al., Holistic Evaluation of Language Models, arXiv:2211.09110
