2026-04-18

The eval set is the spec

For LLM-shaped products, the document that defines what the product does is the eval set, not the PRD. We have stopped writing PRDs for AI features and started writing the cases. The work is better for it.

Sal Anvarov · April 18, 2026 · 4 min read

Tags: LLM, Product, Evaluation

For traditional software, the spec is a document that describes what the system should do. For LLM-shaped software, the spec is the eval set.

This is not a metaphor. The eval set is literally what defines correct behavior, because there is no other source of truth.

Why the PRD fails

A product spec for an LLM feature reads something like:

"The assistant should accurately answer customer questions about the product, in a friendly tone, without making up information."

Every word in that sentence is doing work that the team has not done. What counts as accurate? Whose definition of friendly? What does "without making up information" mean when the model genuinely does not know?

The PRD cannot answer these questions. The eval set has to.

Writing cases instead of paragraphs

When we start an AI feature, the first artifact we ship is not a document. It is roughly 50 input/output pairs that demonstrate what good looks like, and another 20 that demonstrate what bad looks like.
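As a rough illustration of what such a first artifact could look like, here is a minimal sketch in Python. The `EvalCase` fields, the case ids, and the product details in the examples are all invented for illustration; the post does not prescribe a schema.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One input/output pair. All field names are hypothetical."""
    case_id: str
    user_input: str
    output: str
    label: Literal["good", "bad"]  # does `output` show what good or bad looks like?
    note: str = ""                 # why the team agreed on this label


# Two invented cases; a real set would hold roughly 50 good and 20 bad.
CASES = [
    EvalCase(
        case_id="refund-window-known",
        user_input="How long do I have to return an item?",
        output="You have 30 days from delivery to return it for a full refund.",
        label="good",
        note="Answers from documented policy in a friendly tone.",
    ),
    EvalCase(
        case_id="unknown-integration-invented",
        user_input="Does it integrate with FooCRM?",
        output="Yes, FooCRM is fully supported out of the box!",
        label="bad",
        note="The integration is not documented, so this is made up.",
    ),
]


def split_by_label(cases: List[EvalCase]) -> Tuple[List[EvalCase], List[EvalCase]]:
    """Separate the examples of good behavior from the examples of bad."""
    good = [c for c in cases if c.label == "good"]
    bad = [c for c in cases if c.label == "bad"]
    return good, bad
```

Keeping the `note` next to each case records the reasoning behind the label, which is the part a paragraph in a PRD tends to leave out.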

The interesting thing about writing those cases is that it forces the disagreements early. Product wants the assistant to answer in three sentences. Engineering thinks one is enough. Until you write a case where the difference matters, the disagreement is theoretical. The case makes it concrete.
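One way to turn the length disagreement into something checkable is to give a case an explicit sentence budget. The sketch below assumes a crude sentence counter and two invented candidate answers; `count_sentences` and `within_length_budget` are hypothetical helpers, not part of any library.

```python
import re


def count_sentences(text: str) -> int:
    """Rough count: split after '.', '!' or '?' when followed by whitespace.

    This is a heuristic and will miscount abbreviations such as "e.g. this".
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def within_length_budget(answer: str, max_sentences: int) -> bool:
    """True if the answer fits the sentence budget agreed for the case."""
    return count_sentences(answer) <= max_sentences


# Invented candidate answers to the same question.
ONE_SENTENCE = "Yes, you can export your data as CSV from Settings."
THREE_SENTENCES = (
    "Yes, you can export your data. "
    "Go to Settings and choose Export. "
    "The file arrives by email as a CSV."
)
```

With a budget of one sentence, `within_length_budget(THREE_SENTENCES, 1)` is `False`; with a budget of three it is `True`. Picking the number for a specific case is exactly the conversation the case is meant to force.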

By the time we have 70 cases agreed on, the team knows what we are building in a way no PRD ever delivers.

The cases stay alive

The other quiet benefit: the cases become the regression suite, the fine-tuning dataset (if we end up there), the customer support FAQ ground truth, and the QA script. One artifact, several uses.

Compare that to a PRD, which is read once, debated for two weeks, and then becomes wallpaper.
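To show how the same cases can double as a regression suite, here is a minimal sketch. It assumes each case carries phrases the answer must and must not contain, which is a deliberately simple stand-in for whatever grading a real team would use. `Case`, `run_regression`, and `stub_assistant` are hypothetical names, and the stub replaces the real model call so the example runs offline.

```python
from typing import Callable, List, NamedTuple, Sequence


class Case(NamedTuple):
    """A regression case. Field names are hypothetical."""
    case_id: str
    user_input: str
    must_contain: Sequence[str]
    must_not_contain: Sequence[str]


def run_regression(cases: Sequence[Case], assistant: Callable[[str], str]) -> List[str]:
    """Return the ids of the cases the assistant fails."""
    failures = []
    for case in cases:
        answer = assistant(case.user_input).lower()
        missing = [s for s in case.must_contain if s.lower() not in answer]
        forbidden = [s for s in case.must_not_contain if s.lower() in answer]
        if missing or forbidden:
            failures.append(case.case_id)
    return failures


def stub_assistant(user_input: str) -> str:
    """Stand-in for the real model call (an assumption for this sketch)."""
    if "return" in user_input.lower():
        return "You have 30 days from delivery to return it."
    return "I don't know, but I can connect you with support."


# Invented cases for illustration.
CASES = [
    Case("refund-window", "How long do I have to return an item?", ["30 days"], []),
    Case("unknown-feature", "Does it integrate with FooCRM?", ["don't know"], ["yes"]),
]
```

Calling `run_regression(CASES, stub_assistant)` returns an empty list, while an assistant that answers "Yes, absolutely!" to everything fails both cases. Substring checks are brittle, so a real suite would likely swap in a stronger grader while keeping the same cases.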

When the spec is the cases, planning gets honest

If you cannot write the cases, you cannot build the feature. We have used this as a forcing function on more than one project: when product cannot produce the cases, the feature is not ready to be built. Better to learn that in week one than in week ten.

The case writing itself takes a day or two. The PRD-then-build cycle loses weeks before anyone realizes the spec has holes.

Closing

For AI features, the eval set is the spec. Write the cases first. Disagree about them on purpose. Ship against them.

Everything else is wallpaper.
