Let’s be real: building LLM applications today feels like purgatory. Someone hacks together a quick demo with ChatGPT and LlamaIndex. Leadership gets excited. “We can answer any question about our docs!” But then… reality hits. The system is inconsistent, slow, hallucinating—and that amazing demo starts collecting digital dust. We call this “POC Purgatory”—that frustrating limbo where you’ve built something cool but can’t quite turn it into something real.
We’ve seen this across dozens of companies, and the teams that break out of this trap all adopt some version of Evaluation-Driven Development (EDD), where testing, monitoring, and evaluation drive every decision from the start.
The truth is, we’re in the earliest days of understanding how to build robust LLM applications. Most teams approach this like traditional software development but quickly discover it’s a fundamentally different beast. Check out the graph below—see how excitement for traditional software builds steadily while GenAI starts with a flashy demo and then hits a wall of challenges?

What makes LLM applications so different? Two big things:
- They bring the messiness of the real world into your system through unstructured data.
- They’re fundamentally nondeterministic—we call it the “flip-floppy” nature of LLMs: same input, different outputs. What’s worse: Inputs are rarely exactly the same. Tiny changes in user queries, phrasing, or surrounding context can lead to wildly different results.
This creates a whole new set of challenges that traditional software development approaches simply weren’t designed to handle. When your system is both ingesting messy real-world data AND producing nondeterministic outputs, you need a different approach.
The way out? Evaluation-driven development: A systematic approach where continuous testing and assessment guide every stage of your LLM application’s lifecycle. This isn’t anything new. People have been building data products and machine learning products for the past couple of decades. The best practices in those fields have always centered around rigorous evaluation cycles. We’re simply adapting and extending these proven approaches to address the unique challenges of LLMs.
We’ve been working with dozens of companies building LLM applications, and we’ve noticed patterns in what works and what doesn’t. In this article, we’re going to share an emerging SDLC for LLM applications that can help you escape POC Purgatory. We won’t be prescribing specific tools or frameworks (those will change every few months anyway) but rather the enduring principles that can guide effective development regardless of which tech stack you choose.
Throughout this article, we’ll explore real-world examples of LLM application development and then consolidate what we’ve learned into a set of first principles—covering areas like nondeterminism, evaluation approaches, and iteration cycles—that can guide your work regardless of which models or frameworks you choose.
FOCUS ON PRINCIPLES, NOT FRAMEWORKS (OR AGENTS)
A lot of people ask us: What tools should I use? Which multiagent frameworks? Should I be using multiturn conversations or LLM-as-judge?
Of course, we have opinions on all of these, but we think those aren’t the most useful questions to ask right now. We’re betting that lots of tools, frameworks, and techniques will disappear or change, but there are certain principles in building LLM-powered applications that will remain.
We’re also betting that this will be a time of software development flourishing. With the advent of generative AI, there’ll be significant opportunities for product managers, designers, executives, and more traditional software engineers to contribute to and build AI-powered software. One of the great aspects of the AI Age is that more people will be able to build software.
We’ve been working with dozens of companies building LLM-powered applications and have started to see clear patterns in what works. We’ve taught this SDLC in a live course with engineers from companies like Netflix, Meta, and the US Air Force—and recently distilled it into a free 10-email course to help teams apply it in practice.
IS AI-POWERED SOFTWARE ACTUALLY THAT DIFFERENT FROM TRADITIONAL SOFTWARE?
When building AI-powered software, the first question is: Should my software development lifecycle be any different from a more traditional SDLC, where we build, test, and then deploy?
AI-powered applications introduce more complexity than traditional software in several ways:
- The introduction of real-world entropy into the system through data.
- The introduction of nondeterminism or stochasticity into the system: The most obvious symptom here is what we call the flip-floppy nature of LLMs—that is, you can give an LLM the same input and get two different results.
- The cost of iteration—in compute, staff time, and ambiguity around product readiness.
- The coordination tax: LLM outputs are often evaluated by nontechnical stakeholders (legal, brand, support) not just for functionality, but for tone, appropriateness, and risk. This makes review cycles messier and more subjective than in traditional software or ML.
What breaks your app in production isn’t always what you tested for in dev!
This inherent unpredictability is precisely why evaluation-driven development becomes essential: Rather than an afterthought, evaluation becomes the driving force behind every iteration.
Evaluation is the engine, not the afterthought.
The first property is something we saw with data and ML-powered software. What this meant was the emergence of a new stack for ML-powered app development, often referred to as MLOps. It also meant three things:
- Software was now exposed to a potentially large amount of messy real-world data.
- ML apps needed to be developed through cycles of experimentation (as we’re no longer able to reason about how they’ll behave based on software specs).
- The skillset and the background of people building the applications were realigned: People who were at home with data and experimentation got involved!
Now with LLMs, AI, and their inherent flip-floppiness, an array of new issues arises:
- Nondeterminism: How can we build reliable and consistent software using models that are nondeterministic and unpredictable?
- Hallucinations and forgetting: How can we build reliable and consistent software using models that both forget and hallucinate?
- Evaluation: How do we evaluate such systems, especially when outputs are qualitative, subjective, or hard to benchmark?
- Iteration: We know we need to experiment with and iterate on these systems. How do we do so?
- Business value: Once we have a rubric for evaluating our systems, how do we tie our macro-level business value metrics to our micro-level LLM evaluations? This becomes especially difficult when outputs are qualitative, subjective, or context-sensitive—a challenge we saw in MLOps, but one that’s even more pronounced in GenAI systems.
Beyond the technical challenges, these complexities also have real business implications. Hallucinations and inconsistent outputs aren’t just engineering problems—they can erode customer trust, increase support costs, and lead to compliance risks in regulated industries. That’s why integrating evaluation and iteration into the SDLC isn’t just good practice, it’s essential for delivering reliable, high-value AI products.
A TYPICAL JOURNEY IN BUILDING AI-POWERED SOFTWARE
In this section, we’ll walk through a real-world example of an LLM-powered application struggling to move beyond the proof-of-concept stage. Along the way, we’ll explore:
- Why defining clear user scenarios and understanding how LLM outputs will be used in the product prevents wasted effort and misalignment.
- How synthetic data can accelerate iteration before real users interact with the system.
- Why early observability (logging and monitoring) is crucial for diagnosing issues.
- How structured evaluation methods move teams beyond intuition-driven improvements.
- How error analysis and iteration refine both LLM performance and system design.
By the end, you’ll see how this team escaped POC purgatory—not by chasing the perfect model, but by adopting a structured development cycle that turned a promising demo into a real product.
You’re not launching a product: You’re launching a hypothesis.
At its core, this case study demonstrates evaluation-driven development in action. Instead of treating evaluation as a final step, we use it to guide every decision from the start—whether choosing tools, iterating on prompts, or refining system behavior. This mindset shift is critical to escaping POC purgatory and building reliable LLM applications.
POC PURGATORY
Every LLM project starts with excitement. The real challenge is making it useful at scale.
The story doesn’t always start with a business goal. Recently, we helped an EdTech startup build an information-retrieval app.1 Someone realized they had tons of content a student could query. They hacked together a prototype in ~100 lines of Python using OpenAI and LlamaIndex. Then they slapped on a tool to search the web, saw low retrieval scores, called it an “agent,” and called it a day. Just like that, they landed in POC purgatory—stuck between a flashy demo and working software.
They tried various prompts and models and, based on vibes, decided some were better than others. They also realized that, although LlamaIndex was cool to get this POC out the door, they couldn’t easily figure out what prompt it was throwing to the LLM, what embedding model was being used, the chunking strategy, and so on. So they let go of LlamaIndex for the time being and started using vanilla Python and basic LLM calls. They used some local embeddings and played around with different chunking strategies. Some seemed better than others.
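To make this concrete, here’s a rough sketch of what that stripped-down, vanilla-Python setup might look like. It’s illustrative rather than the team’s actual code: the local embedding model, the fixed-size chunking, the top-k value, and the generation model are all placeholder assumptions you’d tune for your own system.

```python
# Minimal "vanilla Python" RAG sketch (illustrative; all model names and sizes are placeholders).
import numpy as np
from sentence_transformers import SentenceTransformer  # one option for local embeddings
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder local embedding model
client = OpenAI()

def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; chunk size is one of the knobs to experiment with."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(docs: list[str]) -> tuple[list[str], np.ndarray]:
    """Embed every chunk once so queries can be matched against them."""
    chunks = [c for d in docs for c in chunk(d)]
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def answer(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> str:
    """Retrieve the top-k chunks by cosine similarity and ask the LLM to answer from them."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    top_k = np.argsort(vectors @ q_vec)[-k:][::-1]  # normalized vectors -> dot product = cosine
    context = "\n\n".join(chunks[i] for i in top_k)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

The point isn’t this exact code; it’s that once the moving parts (prompt, embedding model, chunking) are visible in plain Python, you can actually reason about them and change them.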

EVALUATING YOUR MODEL WITH VIBES, SCENARIOS, AND PERSONAS
Before you can evaluate an LLM system, you need to define who it’s for and what success looks like.
They then decided to try to formalize some of these “vibe checks” into an evaluation framework (commonly called a “harness”), which they could use to test different versions of the system. But wait: What do they even want the system to do? Who do they want to use it? Eventually, they want to roll it out to students, but perhaps a first goal would be to roll it out internally.
Vibes are a fine starting point—just don’t stop there.
We asked them:
- Who are you building it for?
- In what scenarios do you see them using the application?
- How will you measure success?
The answers were:
- Our students.
- Any scenario in which a student is looking for information that the corpus of documents can answer.
- If the student finds the interaction helpful.
The first answer came easily, the second was a bit more challenging, and the team didn’t even seem confident with their third answer. What counts as success depends on who you ask.
We suggested:
- Keeping the goal of building it for students but orienting first around whether internal staff find it useful before rolling it out to students.
- Restricting the first goals of the product to something actually testable, such as giving helpful answers to FAQs about course content, course timelines, and instructors.
- Keeping the goal of finding the interaction helpful but recognizing that this contains a lot of other concerns, such as clarity, concision, tone, and correctness.
So now we have a user persona, several scenarios, and a way to measure success.

SYNTHETIC DATA FOR YOUR LLM FLYWHEEL
Why wait for real users to generate data when you can bootstrap testing with synthetic queries?
With traditional, or even ML, software, you’d then usually try to get some people to use your product. But we can also use synthetic data—starting with a few manually written queries, then using LLMs to generate more based on user personas—to simulate early usage and bootstrap evaluation.
So we did that. We had them generate ~50 queries. To do this, we needed logging, which they already had, and visibility into the traces (prompt + response). We also wanted nontechnical SMEs in the loop.
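A sketch of how that synthetic-query generation might look is below, assuming the OpenAI SDK. The persona text, seed queries, and model name are made up for illustration; in practice you’d write these from your own user persona and scenarios.

```python
# Sketch of bootstrapping synthetic queries from a persona (illustrative assumptions throughout).
from openai import OpenAI

client = OpenAI()

PERSONA = "A student enrolled in the course, asking about content, timelines, and instructors."

SEED_QUERIES = [
    "When is the next cohort's first live session?",
    "Which instructor covers the evaluation module?",
]

def generate_synthetic_queries(n: int = 50) -> list[str]:
    """Use an LLM to expand a few hand-written queries into a larger synthetic set."""
    seed_block = "\n- ".join(SEED_QUERIES)
    prompt = (
        f"You are simulating this user: {PERSONA}\n"
        f"Here are example questions they might ask:\n- {seed_block}\n"
        f"Write {n} more realistic questions, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,       # a higher temperature for more variety in the queries
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.lstrip("-• ").strip() for line in lines if line.strip()]
```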
Also, we’re now trying to develop our eval harness, so we need “some form of ground truth,” that is, examples of user queries + helpful responses.
This systematic generation of test cases is a hallmark of evaluation-driven development: Creating the feedback mechanisms that drive improvement before real users encounter your system.
Evaluation isn’t a stage, it’s the steering wheel.

LOOKING AT YOUR DATA, ERROR ANALYSIS, AND RAPID ITERATION
Logging and iteration aren’t just debugging tools, they’re the heart of building reliable LLM apps. You can’t fix what you can’t see.
To build trust with our system, we needed to confirm at least some of the responses with our own eyes. So we pulled them up in a spreadsheet and got our SMEs to label responses as “helpful or not” and to also give reasons.
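If you want to reproduce that labeling step, a sketch of exporting logged traces into a spreadsheet-friendly CSV might look like this. It assumes traces were logged as JSONL with “query” and “response” fields; the field names and file paths are illustrative.

```python
# Sketch: turn logged traces into a CSV that SMEs can label in a spreadsheet (illustrative).
import csv
import json

def export_for_labeling(traces_path: str, out_path: str) -> None:
    with open(traces_path) as f, open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["query", "response", "helpful", "reason"])
        writer.writeheader()
        for line in f:
            trace = json.loads(line)
            writer.writerow({
                "query": trace["query"],
                "response": trace["response"],
                "helpful": "",  # SME fills in: yes / no
                "reason": "",   # SME explains why
            })

export_for_labeling("traces.jsonl", "to_label.csv")
```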
Then we iterated on the prompt and noticed that it did well with course content but not as well with course timelines. Even this basic error analysis allowed us to decide what to prioritize next.
When playing around with the system, I tried a query that many people ask of LLM-powered IR systems but that few engineers think to handle: “What docs do you have access to?” RAG performs horribly with this most of the time. An easy fix involved engineering the system prompt.
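One way that system-prompt fix might look is sketched below. The corpus description and wording are illustrative, not the exact prompt we shipped; the idea is simply to tell the model what it has access to so it can answer that question directly instead of retrieving for it.

```python
# Illustrative system prompt that handles "What docs do you have access to?" directly.
CORPUS_DESCRIPTION = "course lesson notes, the course timeline and schedule, and instructor bios"

SYSTEM_PROMPT = f"""You answer questions using retrieved excerpts from: {CORPUS_DESCRIPTION}.
If the user asks what documents or data you have access to, describe the corpus above directly
instead of trying to answer from retrieved excerpts.
If the retrieved excerpts don't contain the answer, say so rather than guessing."""
```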
Essentially, what we did here was:
- Build
- Deploy (to only a handful of internal stakeholders)
- Log, monitor, and observe
- Evaluate and error analysis
- Iterate
Now it didn’t involve rolling out to external users; it didn’t involve frameworks; it didn’t even involve a robust eval harness yet, and the system changes involved only prompt engineering. It involved a lot of looking at your data!2 It was only by performing error analysis that we knew which prompt changes would have the biggest effect.
What we see here, though, is the emergence of the first iterations of the LLM SDLC: We’re not yet changing our embeddings, fine-tuning, or business logic; we’re not using unit tests, CI/CD, or even a serious evaluation framework, but we’re building, deploying, monitoring, evaluating, and iterating!
FIRST EVAL HARNESS
Evaluation must move beyond ‘vibes’: A structured, reproducible harness lets you compare changes reliably.
In order to build our first eval harness, we needed some ground truth, that is, a user query and an acceptable response with sources.
To do this, we could either have SMEs write acceptable responses + sources for user queries or have our AI system generate them and an SME accept or reject them. We chose the latter.
So we generated 100 user interactions and used the accepted ones as our test set for our evaluation harness. We tested retrieval quality (e.g., how well the system fetched relevant documents, measured with metrics like precision and recall), semantic similarity of responses, cost, and latency, in addition to performing heuristic checks, such as length constraints, hedging versus overconfidence, and hallucination detection.
We then used thresholding of the above to either accept or reject a response. However, looking at why a response was rejected helped us iterate quickly:
🚨 Low similarity to accepted response: Reviewer checks if the response is actually bad or just phrased differently.
🔍 Wrong document retrieval: Debug chunking strategy, retrieval method.
⚠️ Hallucination risk: Add stronger grounding in retrieval or prompt modifications.
🏎️ Slow response/high cost: Optimize model usage or retrieval efficiency.
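A minimal sketch of this kind of thresholding harness is below. The thresholds are illustrative (and were themselves iterated on, as noted later), the hallucination check is a crude proxy, and the semantic-similarity helper uses a local embedding model as just one possible implementation.

```python
# Sketch of a thresholding eval harness (all thresholds and checks are illustrative).
from dataclasses import dataclass

from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder local embedding model

def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts, via normalized embeddings."""
    va, vb = _embedder.encode([a, b], normalize_embeddings=True)
    return float(va @ vb)

@dataclass
class EvalResult:
    passed: bool
    reasons: list[str]

def evaluate_response(response: str, reference: str,
                      retrieved_ids: list[str], reference_ids: list[str],
                      latency_s: float, cost_usd: float) -> EvalResult:
    reasons = []
    if semantic_similarity(response, reference) < 0.75:  # low similarity -> route to human review
        reasons.append("low similarity to accepted response")
    recall = len(set(retrieved_ids) & set(reference_ids)) / max(len(reference_ids), 1)
    if recall < 0.5:                                     # wrong documents retrieved
        reasons.append("wrong document retrieval")
    if not retrieved_ids:                                # crude hallucination-risk proxy
        reasons.append("hallucination risk: no grounding documents retrieved")
    if latency_s > 5.0 or cost_usd > 0.05:               # slow response / high cost
        reasons.append("slow response or high cost")
    if len(response) > 2000:                             # heuristic length constraint
        reasons.append("response too long")
    return EvalResult(passed=not reasons, reasons=reasons)
```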
There are many parts of the pipeline one can focus on, and error analysis will help you prioritize. Depending on your use case, this might mean evaluating RAG components (e.g., chunking or OCR quality), basic tool use (e.g., calling an API for calculations), or even agentic patterns (e.g., multistep workflows with tool selection). For example, if you’re building a document QA tool, upgrading from basic OCR to AI-powered extraction—think Mistral OCR—might give the biggest lift to your system!
On the first several iterations here, we also needed to iterate on our eval harness by looking at its outputs and adjusting our thresholding accordingly.
And just like that, the eval harness becomes not just a QA tool but the operating system for iteration.

FIRST PRINCIPLES OF LLM-POWERED APPLICATION DESIGN
What we’ve seen here is the emergence of an SDLC distinct from the traditional SDLC and similar to the ML SDLC, with the added nuances of now needing to deal with nondeterminism and masses of natural language data.
The key shift in this SDLC is that evaluation isn’t a final step, it’s an ongoing process that informs every design decision. Unlike traditional software development where functionality is often validated after the fact with tests or metrics, AI systems require evaluation and monitoring to be built in from the start. In fact, acceptance criteria for AI applications must explicitly include evaluation and monitoring. This is often surprising to engineers coming from traditional software or data infrastructure backgrounds who may not be used to thinking about validation plans until after the code is written. Additionally, LLM applications require continuous monitoring, logging, and structured iteration to ensure they remain effective over time.
We’ve also seen the emergence of the first principles for generative AI and LLM software development. These principles are:
- We’re working with API calls: These have inputs (prompts) and outputs (responses); we can add memory, context, tool use, and structured outputs using both the system and user prompts; and we can turn knobs, such as temperature and top p (see the sketch after this list).
- LLM calls are nondeterministic: The same inputs can result in drastically different outputs. ← This is an issue for software!
- Logging, monitoring, tracing: You need to capture your data.
- Evaluation: You need to look at your data and results and quantify performance (a combination of domain expertise and binary classification).
- Iteration: Iterate quickly using prompt engineering, embeddings, tool use, fine-tuning, business logic, and more!
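To make the first few principles concrete, here’s a minimal sketch of a logged LLM call with its knobs exposed. The model name and the JSONL log destination are assumptions; the point is simply that every call’s inputs, outputs, settings, and latency get captured so they can be evaluated later.

```python
# Sketch: an LLM call with knobs (temperature, top_p) that logs every interaction (illustrative).
import json
import time
import uuid

from openai import OpenAI

client = OpenAI()

def logged_llm_call(system_prompt: str, user_prompt: str,
                    temperature: float = 0.2, top_p: float = 1.0) -> str:
    start = time.time()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",        # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,    # knobs you can turn to reduce output variance
        top_p=top_p,
    )
    output = resp.choices[0].message.content
    with open("traces.jsonl", "a") as f:  # capture inputs, outputs, and settings for evaluation
        f.write(json.dumps({
            "id": str(uuid.uuid4()),
            "system": system_prompt,
            "user": user_prompt,
            "output": output,
            "temperature": temperature,
            "top_p": top_p,
            "latency_s": round(time.time() - start, 3),
        }) + "\n")
    return output
```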

As a result, we get methods to help us through the challenges we’ve identified:
- Nondeterminism: Log inputs and outputs, evaluate logs, iterate on prompts and context, and use API knobs to reduce variance of outputs.
- Hallucinations and forgetting:
- Log inputs and outputs in dev and prod.
- Use domain-specific expertise to evaluate output in dev and prod.
- Build systems and processes to help automate assessment, such as unit tests, datasets, and product feedback hooks (see the unit-test sketch after this list).
- Evaluation: Same as above.
- Iteration: Build an SDLC that allows you to rapidly Build → Deploy → Monitor → Evaluate → Iterate.
- Business value: Align outputs with business metrics and optimize workflows to achieve measurable ROI.
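As one example of automating assessment (the unit tests mentioned above), a pytest-style check might look like the following. Here app_answer is a hypothetical entry point into your application, and the queries and expected keywords are made up; the point is cheap, deterministic checks that catch regressions automatically, leaving the nuanced judgments to the eval harness and SMEs.

```python
# Sketch of a unit-test-style automated check with pytest (illustrative).
import pytest

from my_app import app_answer  # hypothetical module and function standing in for your system

KNOWN_CASES = [
    ("Who teaches the evaluation module?", "instructor"),
    ("When does the next cohort start?", "cohort"),
]

@pytest.mark.parametrize("query,expected_keyword", KNOWN_CASES)
def test_answer_mentions_expected_keyword(query, expected_keyword):
    response = app_answer(query)
    # Deterministic checks only: catch regressions automatically, not full quality judgments.
    assert expected_keyword.lower() in response.lower()
    assert len(response) < 2000  # heuristic length constraint, mirroring the eval harness
```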
An astute and thoughtful reader may point out that the SDLC for traditional software is also somewhat circular: Nothing’s ever finished; you release 1.0 and immediately start on 1.1.
We don’t disagree with this but we’d add that, with traditional software, each version completes a clearly defined, stable development cycle. Iterations produce predictable, discrete releases.
By contrast:
- ML-powered software introduces uncertainty due to real-world entropy (data drift, model drift), making testing probabilistic rather than deterministic.
- LLM-powered software amplifies this uncertainty further. It isn’t just natural language that’s tricky; it’s the “flip-floppy” nondeterministic behavior, where the same input can produce significantly different outputs each time.
- Reliability isn’t just a technical concern, it’s a business one. Flaky or inconsistent LLM behavior erodes user trust, increases support costs, and makes products harder to maintain. Teams need to ask: What’s our business tolerance for that unpredictability and what kind of evaluation or QA system will help us stay ahead of it?
This unpredictability demands continuous monitoring, iterative prompt engineering, maybe even fine-tuning, and frequent updates just to maintain basic reliability.
Every AI system feature is an experiment—you just might not be measuring it yet.
So traditional software is iterative but discrete and stable, while LLM-powered software is genuinely continuous and inherently unstable without constant attention—it’s more of a continuous limit than distinct version cycles.
Getting out of POC purgatory isn’t about chasing the latest tools or frameworks: it’s about committing to evaluation-driven development through an SDLC that makes LLM systems observable, testable, and improvable. Teams that embrace this shift will be the ones that turn promising demos into real, production-ready AI products.
The AI age is here, and more people than ever have the ability to build. The question isn’t whether you can launch an LLM app. It’s whether you can build one that lasts—and drive real business value.
Want to go deeper? We created a free 10-email course that walks through how to apply these principles—from user scenarios and logging to evaluation harnesses and production testing. And if you’re ready to get hands-on with guided projects and community support, the next cohort of our Maven course kicks off April 7.
Many thanks to Shreya Shankar, Bryan Bischof, Nathan Danielsen, and Ravin Kumar for their valuable and critical feedback on drafts of this essay along the way.
Footnotes
- This consulting example is a composite scenario drawn from multiple real-world engagements and discussions, including our own work. It illustrates common challenges faced across different teams, without representing any single client or organization.
- Hugo Bowne-Anderson and Hamel Husain (Parlance Labs) recently recorded a livestreamed podcast for Vanishing Gradients about the importance of looking at your data and how to do it. You can watch the livestream here and listen to it here (or on your app of choice).