DEV BY SAL

The eval set is the spec

For LLM-shaped products, the document that defines what the product does is the eval set, not the PRD. We have stopped writing PRDs for AI features and started writing the cases. The work is better for it.

Sal Anvarov · 4 min read
  • LLM
  • Product
  • Evaluation

For traditional software, the spec is a document that describes what the system should do. For LLM-shaped software, the spec is the eval set.

This is not a metaphor. The eval set is literally what defines correct behavior, because there is no other source of truth.

Why the PRD fails

A product spec for an LLM feature reads something like:

"The assistant should accurately answer customer questions about the product, in a friendly tone, without making up information."

Every word in that sentence is doing work that the team has not done. What counts as accurate? Whose definition of friendly? What does "without making up information" mean when the model genuinely does not know?

The PRD cannot answer these questions. The eval set has to.

Writing cases instead of paragraphs

When we start an AI feature, the first artifact we ship is not a document. It is roughly 50 input/output pairs that demonstrate what good looks like, and another 20 that demonstrate what bad looks like.

The interesting thing about writing those cases is that it forces the disagreements early. Product wants the assistant to answer in three sentences. Engineering thinks one is enough. Until you write a case where the difference matters, the disagreement is theoretical. The case makes it concrete.

By the time we have 70 cases agreed on, the team knows what we are building in a way no PRD ever delivers.
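A minimal sketch of what a pair of such cases might look like on disk, assuming a small Python record type. The `EvalCase` fields, the refund scenario, and the `count_by_label` helper are all hypothetical, made up for illustration rather than taken from a real project.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class EvalCase:
    """One agreed example of what good or bad output looks like (hypothetical schema)."""
    case_id: str
    user_input: str
    output: str
    label: Literal["good", "bad"]
    note: str = ""  # why the team labeled it this way


# Two illustrative cases for the same input. The refund policy is invented.
CASES = [
    EvalCase(
        case_id="refund-after-window-good",
        user_input="Can I get a refund after 40 days?",
        output="Refunds are available for 30 days after purchase, "
               "so a 40-day-old order is not eligible.",
        label="good",
        note="States the policy and applies it, in one sentence.",
    ),
    EvalCase(
        case_id="refund-after-window-bad",
        user_input="Can I get a refund after 40 days?",
        output="Sure, refunds are always available! Just reach out any time.",
        label="bad",
        note="Friendly tone, but it makes up a policy.",
    ),
]


def count_by_label(cases: list[EvalCase]) -> dict[str, int]:
    """Count good and bad cases, to track progress toward the agreed totals."""
    counts = {"good": 0, "bad": 0}
    for case in cases:
        counts[case.label] += 1
    return counts


print(count_by_label(CASES))
```

The `note` field is where the argument gets written down: if product and engineering disagree about one sentence versus three, the resolution lives next to the case that forced it.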

The cases stay alive

The other quiet benefit: the cases become the regression suite, the fine-tuning dataset (if we end up there), the customer support FAQ ground truth, and the QA script. One artifact, several uses.

Compare that to a PRD, which is read once, debated for two weeks, and then becomes wallpaper.
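As a rough sketch of the regression-suite use, the same cases can be replayed against whatever produces answers. Everything here is an assumption for illustration: the case dictionaries, the `stub_assistant` function standing in for a real model call, and the substring checks, which are a deliberately crude substitute for real grading.

```python
from typing import Callable

# Hypothetical cases: each pairs an input with phrases the answer must or must not contain.
CASES = [
    {
        "id": "unknown-spec",
        "input": "What is the battery life of the X200?",
        "must_contain": ["do not know"],
        "must_not_contain": ["hours"],  # no invented numbers
    },
    {
        "id": "greeting",
        "input": "hi",
        "must_contain": ["help"],
        "must_not_contain": [],
    },
]


def passes(case: dict, output: str) -> bool:
    """Case-insensitive substring check; a stand-in for a real grader."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case["must_contain"])
    has_forbidden = any(s.lower() in text for s in case["must_not_contain"])
    return has_required and not has_forbidden


def run_suite(cases: list[dict], assistant: Callable[[str], str]) -> list[str]:
    """Return the ids of failing cases; an empty list means no regressions."""
    return [c["id"] for c in cases if not passes(c, assistant(c["input"]))]


def stub_assistant(prompt: str) -> str:
    """Placeholder for the real model call (hypothetical)."""
    return "I do not know that yet, but I am happy to help you find out."


failures = run_suite(CASES, stub_assistant)
print("failing cases:", failures)
```

Swapping `stub_assistant` for the real call is the only change needed to run the same cases after a prompt edit or a model upgrade.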

When the spec is the cases, planning gets honest

If you cannot write the cases, you cannot build the feature. We have used this as a forcing function on more than one project: when product could not produce the cases, the feature was not ready to be built. Better to learn that in week one than week ten.

The case writing itself takes a day or two. The PRD-then-build cycle loses weeks before anyone realizes the spec has holes.

Closing

For AI features, the eval set is the spec. Write the cases first. Disagree about them on purpose. Ship against them.

Everything else is wallpaper.