How do you regression test AI agent prompts?

Build a golden dataset of input-output pairs that represent expected agent behavior. Define behavioral assertions that test what the output does rather than its exact wording. Run these assertions automatically in CI on every prompt change, and block merges when scores drop below defined thresholds. Start with 20 to 50 test cases covering your most critical paths and grow the dataset over time from production failures.

What tools are used for prompt regression testing?

The three most widely adopted frameworks are Promptfoo (open source, CLI-first, acquired by OpenAI in March 2026), DeepEval (pytest-native with 50+ built-in metrics), and Braintrust (managed SaaS with native GitHub Action integration). All three support golden datasets, behavioral assertions, and CI pipeline gating. Promptfoo fits teams that want config-file simplicity, DeepEval fits Python-heavy workflows, and Braintrust fits teams that want managed infrastructure with PR comments out of the box.

How do you set up CI/CD for prompt testing?

Configure your CI workflow to trigger on changes to prompt files and eval configuration. The pipeline checks out the code, installs dependencies, and runs your eval framework against the golden dataset. Set per-metric pass rate thresholds, commonly 90% overall with 100% for safety checks, and block the merge if any metric falls below its threshold or regresses against the baseline on your main branch. Run 3 to 5 trials per test case to account for nondeterminism in LLM outputs.

What is a golden dataset for prompt testing?

A golden dataset is a versioned collection of test inputs paired with expected behaviors and evaluation criteria. It serves as the source of truth for measuring prompt quality. Effective golden datasets combine real production inputs that capture authentic user phrasing, manually crafted edge cases for known failure modes, and synthetic scenarios covering rare conditions. Each entry describes the expected behavior in natural language rather than requiring an exact output match.

How many test cases should a golden dataset start with?

Start with 20 to 50 carefully selected examples that cover your most critical use cases and known failure modes. Quality matters more than quantity. Each case should have clear success criteria and test a distinct behavior. Grow the dataset over time by converting every production failure into a new test case. A team that starts with 30 cases and adds two per week from real incidents will have over 130 cases within a year (30 + 2 × 52 = 134).

What pass rate threshold should you use for prompt regression tests?

A 90% overall pass rate is a common CI gate for merging prompt changes. Break this down by metric type: factual accuracy at 0.85 or higher, relevance at 0.90 or higher, and safety at 100% with zero tolerance. Run 3 to 5 trials per test case to account for nondeterminism, and use aggregate scores rather than individual trial results to distinguish real regressions from noise.

How to Build a Prompt Regression Testing Pipeline

What Prompt Regression Testing Actually Solves

You change a system prompt so your agent handles refund requests better. The next morning, you discover it stopped including order numbers in its responses. Nobody tested for that because the change seemed unrelated.

This is the core problem. Prompt regression testing tracks how changes to agent prompts affect response quality by comparing outputs against golden datasets and behavioral contracts rather than exact string matches. Unlike general evaluation, which asks "how good is this model?", regression testing asks a narrower question: "did this specific change make anything worse?"

The difference matters because prompt changes ripple unpredictably. A two-word edit to the system prompt can alter tone, formatting, tool-calling patterns, and safety compliance all at once. Traditional software regression tests work because code changes are scoped to functions with defined inputs and outputs. Prompt changes have blast radii that are hard to predict from the diff alone.

The field has shifted toward asserting behavioral contracts in natural language instead of matching exact strings. When you test a prompt, you don't check whether the output contains "Order #12345" verbatim. You check whether the response includes an order reference, maintains a professional tone, and stays within the agent's defined scope. This approach accounts for the inherent variability in LLM outputs while still catching meaningful regressions.

For multi-agent systems, the stakes compound. An agent that handles intake might feed results to an agent that generates reports. A regression in the first agent's output format can cascade into failures downstream that look nothing like the original prompt change. Testing each agent's behavioral contracts in isolation, then testing the handoff points between them, is the only reliable way to catch these failures before users do.

How to Curate a Golden Dataset That Catches Real Failures

A golden dataset is a versioned collection of inputs, expected behaviors, and evaluation criteria that becomes your source of truth for measuring prompt quality. Think of it as the test fixtures for your prompt engineering.

Start with 20 to 50 carefully selected examples. Anthropic's engineering team recommends building initial tasks from real failures, prioritized by user impact, with each task requiring unambiguous success criteria and a reference solution proving the task is solvable.

Pull test cases from three places:

Production logs. Real user inputs capture phrasing and intent that synthetic examples miss. Filter for cases where the agent struggled, where users rephrased their question, or where feedback flagged a bad response.

Manual edge cases. Write inputs that probe known failure modes: ambiguous requests, adversarial prompts, boundary conditions like empty inputs or unusually long context. Include cases where the correct behavior is to refuse or ask for clarification.

Synthetic inputs. Generate rare scenarios that production data hasn't surfaced yet. If your agent handles financial questions, create test cases for unusual currency formats, negative amounts, or multi-currency calculations.

A useful golden dataset entry looks like this:

- input: "Cancel my subscription and refund the last payment"
  context: "User has active monthly plan, last charged 3 days ago"
  expected_behavior: >
    Acknowledges cancellation request, explains refund
    policy, asks for confirmation before proceeding
  tags: ["billing", "cancellation", "refund"]
  difficulty: "medium"

Notice that expected_behavior describes what the response should do, not what it should say word-for-word. This is the behavioral contract that your assertions will check against.
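Because every assertion depends on these fields being present, it can help to validate entries when the dataset is loaded. The sketch below assumes each entry has already been parsed into a plain dictionary (for example by a YAML loader). The GoldenCase class, the parse_case helper, and the choice of required fields are illustrative, not part of any framework.

```python
from dataclasses import dataclass, field

# Assumed minimum schema; extend it to match your own dataset.
REQUIRED_FIELDS = ("input", "expected_behavior")


@dataclass
class GoldenCase:
    input: str
    expected_behavior: str
    context: str = ""
    tags: list[str] = field(default_factory=list)
    difficulty: str = "medium"


def parse_case(raw: dict) -> GoldenCase:
    """Turn one raw dataset entry into a GoldenCase, rejecting incomplete entries."""
    for name in REQUIRED_FIELDS:
        value = raw.get(name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"golden case is missing a non-empty '{name}' field")
    return GoldenCase(
        input=raw["input"].strip(),
        # Folded YAML text can carry stray newlines; collapse all whitespace runs.
        expected_behavior=" ".join(raw["expected_behavior"].split()),
        context=raw.get("context", ""),
        tags=list(raw.get("tags", [])),
        difficulty=raw.get("difficulty", "medium"),
    )
```

Failing at load time means a malformed entry shows up as a dataset error rather than as a confusing eval failure later in the run.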

Keep the dataset in version control. Store it in the same repository as your prompts, so a commit that changes a prompt also updates the golden dataset if new cases are needed. Every production failure that slips past your tests becomes a new entry. This feedback loop is what makes the dataset stronger over time. A team that starts with 30 test cases and adds two per week from production issues will have over 130 cases within a year, each one grounded in a real failure.

Behavioral Assertions: Testing Intent Instead of Exact Output

Exact string matching breaks on the first valid variation. If your test expects "I've processed your refund" and the model says "Your refund has been processed," a string comparison fails even though the behavior is correct. Behavioral assertions solve this by testing what the output does rather than what it says.

Organize your test suite around three categories, from cheapest to most expensive:

Deterministic checks run without an LLM call. They're fast, reproducible, and free. Use them for structural requirements: output is valid JSON, response length stays within a defined range, required fields are present, no forbidden strings appear (competitor names, internal system IDs), and response language matches the input language.
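As a rough illustration of this category, the sketch below bundles a few structural checks into one function using only the standard library. The function name, the 2000-character limit, and the example field names are assumptions made for the example, and it presumes the agent is expected to return a JSON object.

```python
import json


def deterministic_failures(
    output: str,
    required_fields: tuple[str, ...],
    forbidden_strings: tuple[str, ...],
    max_chars: int = 2000,
) -> list[str]:
    """Return human-readable failures; an empty list means the output passed."""
    failures = []
    if len(output) > max_chars:
        failures.append(f"response is {len(output)} chars, limit is {max_chars}")
    lowered = output.lower()
    for needle in forbidden_strings:
        # Case-insensitive match so "acmecorp" and "AcmeCorp" are both caught.
        if needle.lower() in lowered:
            failures.append(f"forbidden string present: {needle!r}")
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")
        return failures
    if not isinstance(payload, dict):
        failures.append("output JSON is not an object")
        return failures
    for name in required_fields:
        if name not in payload:
            failures.append(f"missing required field: {name!r}")
    return failures
```

Returning a list of reasons rather than a single boolean makes the CI log say which structural rule broke, which is usually the first thing a reviewer wants to know.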

Semantic checks use an LLM as a judge to evaluate meaning. They cost tokens but catch regressions that string matching misses:

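from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# agent_response is the string your agent returned for this input
metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output=agent_response,
    retrieval_context=["Password reset docs..."]
)
# Fails the test if the relevancy score falls below the threshold
assert_test(test_case, [metric])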

The LLM judge evaluates whether the response actually addresses the user's question, not whether it matches a template. You can define custom judges that check for specific behavioral contracts: "Does the response include a concrete next step?" or "Does the response avoid making promises about delivery dates?"
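One framework-independent way to picture a custom judge is a small function that wraps a behavioral contract in a grading prompt and parses the verdict. This is a minimal sketch, not any library's API: the prompt wording, the judge_contract name, and the ask_judge hook (any function that sends a prompt to your judge model and returns its text) are all assumptions.

```python
from typing import Callable

JUDGE_TEMPLATE = (
    "You are grading an AI agent's reply against one behavioral contract.\n"
    "Contract: {contract}\n"
    "Reply: {reply}\n"
    "Answer with exactly PASS or FAIL on the first line, then one sentence of reasoning."
)


def judge_contract(contract: str, reply: str, ask_judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether `reply` satisfies `contract`.

    `ask_judge` is a hypothetical hook supplied by the caller. Verdicts that
    cannot be parsed are treated as failures rather than silently passing.
    """
    verdict = ask_judge(JUDGE_TEMPLATE.format(contract=contract, reply=reply))
    lines = verdict.strip().splitlines()
    first_line = lines[0].strip().upper() if lines else ""
    return first_line == "PASS"
```

Injecting the model call as a parameter keeps the parsing logic testable with a stub, so the judge wrapper itself can be covered by ordinary unit tests that cost no tokens.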

Safety checks form a non-negotiable baseline. These test for harmful content, data leakage, prompt injection vulnerability, and compliance with your agent's defined boundaries. Safety assertions should have a 100% pass rate threshold with zero tolerance for regression.

Anthropic recommends maintaining separate evaluation suites: capability evals start at low pass rates and track improvement, while regression suites target near-100% and catch degradation. As your capability evals approach 100%, graduate them into the regression suite. When a regression eval saturates at 100% for weeks, it's no longer catching anything. Replace it with a harder test.

How to Wire Evals Into Your CI Pipeline

The mechanical part: make your CI system run evals on every prompt change and block the merge if scores drop. This turns prompt quality from a manual review step into an automated gate.

Store prompts in git. Every prompt your agent uses should live in a prompts/ directory, versioned alongside code. Checking out a commit should give you exactly the prompts that ran at that commit, so an eval run can be tied to a specific prompt version and repeated later. This is only possible if prompts are immutable per commit and explicitly referenced in pipeline metadata.
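One possible way to reference prompts explicitly in pipeline metadata is to record a content hash of the prompts directory next to each eval result. The sketch below is an assumption about how that could look. The prompt_fingerprint helper is hypothetical, and recording the git commit SHA alongside it would serve a similar purpose.

```python
import hashlib
from pathlib import Path


def prompt_fingerprint(prompts_dir: str) -> str:
    """Hash every file under prompts_dir (relative paths and contents) into one hex digest."""
    digest = hashlib.sha256()
    root = Path(prompts_dir)
    # Sort so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Storing this value with every score report lets you confirm later that two eval runs exercised byte-identical prompts, even if they came from different branches.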

Configure the trigger. Only run the full eval suite when prompt-related files change. Running evals on every commit wastes tokens and slows down unrelated PRs:

name: Prompt Regression Tests
on:
  pull_request:
    # Run only when prompts or eval configuration change
    paths:
      - "prompts/**"
      - "eval/**"
      - "promptfoo.config.yaml"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      # test:evals is the package.json script that runs the eval suite;
      # a non-zero exit code fails this check
      - run: npm run test:evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Set pass rate thresholds. A 90% overall pass rate is a common gate for merging prompt changes. But a single number hides important detail. Set per-metric thresholds that reflect your priorities:

Factual accuracy: 0.85 or higher
Relevance: 0.90 or higher
Safety: 100%, no exceptions
Format compliance: 0.95 or higher

If any metric falls below its threshold, or regresses against the baseline score on the main branch, the PR fails.
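A gate like this can be sketched as a small function that checks each metric against its floor and against the main-branch baseline. The metric keys, the gate function, and the 0.02 regression tolerance are assumptions for illustration. The thresholds are the ones listed above.

```python
THRESHOLDS = {
    "factual_accuracy": 0.85,
    "relevance": 0.90,
    "safety": 1.00,
    "format_compliance": 0.95,
}


def gate(scores: dict[str, float], baseline: dict[str, float], tolerance: float = 0.02) -> list[str]:
    """Return the reasons a PR should fail; an empty list means the gate passes."""
    reasons = []
    for metric, floor in THRESHOLDS.items():
        score = scores.get(metric, 0.0)  # a missing metric is treated as a failure
        if score < floor:
            reasons.append(f"{metric}: {score:.2f} is below the {floor:.2f} threshold")
        elif score < baseline.get(metric, 0.0) - tolerance:
            # Above the floor, but noticeably worse than main (tolerance is an assumed value).
            reasons.append(f"{metric}: {score:.2f} regressed from baseline {baseline[metric]:.2f}")
    return reasons
```

In CI, the calling script would print the reasons and exit with a non-zero status when the list is non-empty, which is what makes the check fail.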

Account for nondeterminism. LLM outputs vary between runs even with identical inputs. Run 3 to 5 trials per test case and use the aggregate score. A single failed trial out of five on one test case isn't a regression. A consistent 10-point drop across your dataset is.
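To make the aggregation concrete, here is a minimal sketch that turns per-trial pass/fail results into per-case pass rates and a suite-level score. The function names and the 0.6 per-case cutoff are assumptions. Under that cutoff, a case that passes four of five trials still counts as passing.

```python
from statistics import mean


def case_pass_rates(trials: dict[str, list[bool]]) -> dict[str, float]:
    """Map each test case ID to the fraction of its trials that passed."""
    return {
        case_id: mean(1.0 if ok else 0.0 for ok in results)
        for case_id, results in trials.items()
    }


def suite_score(trials: dict[str, list[bool]], min_case_rate: float = 0.6) -> float:
    """Fraction of cases whose per-case pass rate reaches min_case_rate."""
    rates = case_pass_rates(trials)
    return sum(rate >= min_case_rate for rate in rates.values()) / len(rates)
```

Comparing suite_score between the PR and the main branch, rather than comparing individual trials, is what separates a real regression from run-to-run noise.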

Keep eval costs predictable. Full eval suites with LLM-as-judge assertions get expensive at scale. Run deterministic checks on every PR, since they're free. Run the full semantic suite only when prompt files change. Use a smaller, faster model for judge calls when possible, and reserve your production model for the final pre-merge run.

Tracking Score Deltas Across Prompt Versions

Running evals is step one. Making the results useful over time requires tracking how scores change across prompt versions.

PR-level delta reports. The most immediate feedback loop is a comment on the pull request showing what changed. Braintrust's GitHub Action, for example, posts results directly to PRs, highlighting improvements and regressions per scorer. When a reviewer opens a prompt change PR, they see not just the text diff but the behavioral impact: relevance went up 3%, but factual accuracy dropped 2% on the billing test cases.

This visibility changes how teams review prompt changes. Instead of debating whether a prompt "looks right," the conversation centers on concrete metric movements. The review shifts from subjective judgment to data.
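If your framework does not post PR comments for you, a delta report of this kind is straightforward to assemble yourself. The sketch below only formats the text. The delta_report name and the line layout are assumptions, and posting the result to the pull request is left to your CI tooling.

```python
def delta_report(baseline: dict[str, float], candidate: dict[str, float]) -> str:
    """Render per-scorer score changes as plain-text lines for a PR comment."""
    lines = []
    for scorer in sorted(set(baseline) | set(candidate)):
        before = baseline.get(scorer)
        after = candidate.get(scorer)
        if before is None or after is None:
            lines.append(f"{scorer}: not comparable (missing on one side)")
            continue
        change = (after - before) * 100  # express the delta in percentage points
        label = "improved" if change > 0 else "regressed" if change < 0 else "unchanged"
        lines.append(f"{scorer}: {before:.2f} -> {after:.2f} ({change:+.1f} pts, {label})")
    return "\n".join(lines)
```

Sorting the scorers keeps the comment stable between runs, so reviewers can compare two reports line by line.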

Trending over time. Individual PR reports show local changes. A dashboard that plots eval scores across prompt versions shows the trajectory. You want to see whether your regression suite is getting tighter (fewer near-miss scores) or whether a slow drift is accumulating across multiple small changes that each passed the threshold individually.
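Slow drift of this kind can be flagged with a simple windowed comparison. The sketch below assumes you keep one aggregate score per prompt version, ordered oldest first. The function names, the five-version window, and the 0.03 drop limit are illustrative values rather than recommendations from any framework.

```python
def cumulative_drift(history: list[float], window: int = 5) -> float:
    """Score change between the latest version and the one `window` versions earlier."""
    if len(history) < 2:
        return 0.0
    # Fall back to the oldest score when fewer than window + 1 versions exist.
    start = history[max(0, len(history) - 1 - window)]
    return history[-1] - start


def is_drifting(history: list[float], window: int = 5, max_drop: float = 0.03) -> bool:
    """True when the score has slipped by more than max_drop over the window."""
    return cumulative_drift(history, window) < -max_drop
```

For example, five consecutive one-point drops from 0.95 to 0.90 would each pass a per-PR tolerance of two points, yet is_drifting reports the accumulated five-point slide.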

Graduated evaluation. Anthropic's engineering team describes a pattern where tests move between suites based on performance. A new test case starts in the capability suite with a low pass-rate expectation. As the agent improves, the test graduates to the regression suite with a high threshold. When a regression test stays at 100% for weeks, it no longer provides signal. Replace it or raise the difficulty.
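The movement between suites can be expressed as a small decision rule. This is a sketch under assumed cutoffs, not Anthropic's procedure: the next_suite name, the 0.95 graduation bar, the five-run stability window, and the twenty-run saturation window are placeholders to tune for your own cadence.

```python
def next_suite(
    suite: str,
    pass_rates: list[float],
    graduate_at: float = 0.95,
    stable_runs: int = 5,
    saturated_runs: int = 20,
) -> str:
    """Suggest where a test case belongs: 'capability', 'regression', or 'retire'."""
    if suite == "capability":
        recent = pass_rates[-stable_runs:]
        # Graduate only after a full window of consistently high pass rates.
        if len(recent) == stable_runs and min(recent) >= graduate_at:
            return "regression"
        return "capability"
    recent = pass_rates[-saturated_runs:]
    if len(recent) == saturated_runs and min(recent) == 1.0:
        return "retire"  # saturated: replace it with a harder case
    return "regression"
```

Running a rule like this on a schedule turns suite maintenance into a reviewable suggestion list instead of something that happens only when someone remembers to prune tests.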

Where to store eval artifacts. Your pipeline produces golden datasets, prompt versions, score histories, and detailed eval reports. These need to live somewhere accessible to both your CI system and the humans reviewing results. Git handles prompt versions and golden datasets well. For eval reports and score histories, you need persistent storage that supports querying and sharing.

Some teams use S3 or Google Cloud Storage with custom tooling around it. Fastio workspaces work well here because agents can read and write artifacts directly through the MCP server, and you can hand off the entire eval workspace to a human reviewer when it's time for manual inspection. Intelligence Mode auto-indexes uploaded eval reports so you can ask questions like "which prompt version caused the billing regression?" without digging through files manually. The free tier comes with 50GB of storage and a credit allowance, which covers most eval pipelines without a billing surprise.

Choosing an Eval Framework

Three frameworks dominate prompt regression testing in 2026. Each fits a different workflow.

Promptfoo

OpenAI acquired Promptfoo in March 2026, but the project remains open source under its existing license. It's config-file driven and CLI-first, which makes it straightforward to integrate into existing CI pipelines. More than 350,000 developers have used Promptfoo, and teams at over 25% of the Fortune 500 run it in production.

Promptfoo works with any model provider. You define test cases and assertions in a YAML config, run promptfoo eval from the command line, and get a pass/fail report. The config-file approach means your eval setup lives in version control alongside your prompts.

Best for: Teams that want a provider-agnostic, open-source CLI tool with minimal setup overhead.

DeepEval

DeepEval takes a pytest-native approach. If your team already writes Python tests, it feels familiar. You define test cases as LLMTestCase objects, apply metrics like AnswerRelevancyMetric or HallucinationMetric with configurable thresholds, and run them with deepeval test run. The framework includes over 50 built-in metrics covering RAG, safety, agent tool use, and conversational quality.

More than 150,000 developers use DeepEval, and the framework processes over 100 million daily evaluations. Confident AI, the company behind DeepEval, offers a hosted dashboard for tracking eval history, but the core framework is open source.

Best for: Python-heavy teams that want pytest integration and a wide library of pre-built metrics.

Braintrust

Braintrust is a managed SaaS platform that bundles evaluation, prompt management, and dataset curation in one interface. Its standout feature for regression testing is the native GitHub Action: open a PR, and Braintrust runs your eval suite and posts a comment showing exactly which cases improved and which regressed.

The platform includes a prompt playground for interactive testing, dataset management tools, and production monitoring that feeds failures back into your eval suite.

Best for: Teams that want a fully managed solution with built-in PR integration and don't want to assemble their own toolchain.

How they compare

Open source: Promptfoo (yes, post-acquisition), DeepEval core framework (yes), Braintrust (no).

Primary language: Promptfoo uses TypeScript and YAML configs. DeepEval is Python-native. Braintrust supports both TypeScript and Python.

CI integration: Promptfoo and DeepEval both use CLI commands in custom workflow steps. Braintrust provides a native GitHub Action that handles PR comments automatically.

Built-in metrics: DeepEval leads with 50+ metrics out of the box. Promptfoo and Braintrust offer moderate built-in options with strong extensibility.

Hosted dashboard: Braintrust includes one by default. Promptfoo and DeepEval offer optional hosted dashboards through their respective companies.

All three frameworks support the core workflow described in this guide: golden datasets, behavioral assertions, CI gating, and delta tracking. The choice depends on your team's language preferences, whether you want managed infrastructure, and how much custom configuration you're willing to maintain.
