Deterministic Replay Storage for AI Agents

Deterministic replay storage captures every input, tool response, and file state of an agent run so the run can be reproduced exactly later. This guide covers what to capture, where to store it, and how to wire replay into eval pipelines without breaking real-time agent performance.

Fast.io Editorial Team
Replay storage turns a one-shot agent run into a reproducible artifact.

What Deterministic Replay Storage Actually Means

The definition is a test: hand the stored trace to a replay engine, and it should produce the same sequence of model prompts, tool calls, and file outputs as the original run, without re-calling any external service.

That definition sounds simple. In practice, AI agents are nondeterministic in a long list of small ways: the model samples with temperature, the clock advances, tool calls return different data each hour, and the filesystem changes between steps. Replay storage exists to pin down each of those moving parts.

There are four capture points that matter:

  • Model inputs: the full prompt, system message, tool schemas, sampling parameters, and model identifier for every call.
  • Model outputs: the raw completion, including logprobs when available, plus any structured tool-call arguments the model emitted.
  • Tool responses: the exact bytes returned by each tool call, including HTTP status, headers where relevant, and the body.
  • File state: the contents of any files the agent read or wrote, captured at the moment of access, not end-of-run.

If any one of those is missing, replay drifts. Miss a tool response and the second run diverges at the first network hop. Miss file state and the agent reads a file that has since changed on disk and makes a different decision.
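To make the shape concrete, here is one captured step as a single event record. The field names and hash values are illustrative, not a standard schema; anything large lives in a blob store and is referenced by hash.

    # One event per agent step. Field names are illustrative; large
    # payloads are stored separately and referenced by SHA-256 hash.
    model_call_event = {
        "run_id": "run-001",                    # hypothetical identifiers
        "step": 3,
        "kind": "model_call",
        "wall_clock": "2026-03-01T09:30:00Z",
        "request": {
            "model": "example-model-v1",        # pinned model ID and version
            "prompt_blob": "sha256:ab12...",    # full prompt + tool schemas
            "temperature": 0.0,
            "seed": 42,
        },
        "response": {
            "completion_blob": "sha256:cd34...",
            "finish_reason": "tool_use",
            "usage": {"input_tokens": 812, "output_tokens": 64},
        },
    }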

Why Replay Matters for Evals, RLHF, and Debugging

Deterministic replay is core to RLHF and eval pipelines. When you train a reward model or run an offline eval suite, you want the same behavior every time you feed the same trajectory through the system. Without replay, your eval harness is measuring a moving target: model updates, tool drift, and file changes all contaminate the signal.

Three concrete use cases drive the investment:

Regression testing. You change a prompt or swap a model. You want to know if the agent still gets 47 out of 50 benchmark tasks right. Replay lets you rerun yesterday's traces against today's agent code and see exactly what changed. Without replay you are comparing two independent samples, and small accuracy swings get lost in noise.

Bug reproduction. A customer reports that the agent deleted the wrong file on Tuesday. You pull the trace, replay it locally, and step through each tool call. The bug becomes a debugger session instead of a guessing game. Teams that skip replay storage end up adding print statements to production agents, which is not a good place to be.

RLHF and preference data. If you are training a reward model on agent trajectories, you need those trajectories to be stable, and you need to feed them through the pipeline repeatedly without paying for fresh API calls. A Reddit thread on r/LocalLLaMA from March 2026 made the same case: teams building open-source replay engines cite zero-API-cost replays as the main reason to invest in storage up front.[^1]

The common thread is that replay turns a one-shot event into a reusable artifact. The run becomes a dataset row.

[Image: agent audit trail with expandable tool call details]

The Four Capture Points in Detail

Most teams underestimate how much has to be captured. Here is the breakdown.

Model Calls

For each model invocation, store the full request and response. That includes the model ID and version, system prompt, message history, tool schemas exactly as serialized, temperature, top_p, max tokens, stop sequences, seed if set, and any provider-specific flags.

On the response side, capture the completion text, any tool-call blocks with their raw argument JSON, usage counts, and the finish reason. If the provider exposes logprobs or raw token IDs, grab those too. You will want them later for offline scoring.

A subtle trap: providers sometimes rewrite your request before executing it. Claude and GPT both normalize whitespace and reorder tool definitions in some cases. Store what you sent and what the provider acknowledged receiving, if the API surfaces both.
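A minimal sketch of the capture wrapper, assuming `call_model` is whatever client function you already use, and `log` and `blobs` are in-memory stand-ins for the event log and blob store described below:

    import hashlib
    import json
    import time

    def record_model_call(log: list, blobs: dict, request: dict, call_model) -> dict:
        # The live call happens exactly once; everything around it is capture.
        response = call_model(request)

        def put(obj) -> str:
            data = json.dumps(obj, sort_keys=True).encode()
            digest = hashlib.sha256(data).hexdigest()
            blobs[digest] = data          # content-addressed: stored once per hash
            return digest

        log.append({
            "kind": "model_call",
            "wall_clock": time.time(),
            "request_blob": put(request),    # prompt, tool schemas, sampling params
            "response_blob": put(response),  # completion, tool-call args, usage
        })
        return response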

Tool Responses

Tool call responses are the hardest part to capture faithfully. A tool that calls a search API returns different results every hour. A tool that reads a database returns different rows after a write. A tool that fetches a URL returns different HTML after a deploy.

The only reliable approach is to record the full response at the moment of the call. For HTTP tools, that means status, headers you care about, and the body. For database tools, the serialized rows. For filesystem tools, the file contents plus modification time.

Do not try to replay by re-calling the tool. That approach works for about a week and then breaks when the upstream service changes.
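The record-at-call-time rule in code: a sketch assuming `run_tool` is your existing executor and returns the raw response bytes.

    import hashlib

    def record_tool_call(log: list, blobs: dict, tool: str, args: dict, run_tool) -> bytes:
        raw = run_tool(tool, args)        # e.g. status, headers, and body as bytes
        digest = hashlib.sha256(raw).hexdigest()
        blobs[digest] = raw               # the exact bytes, stored once per hash
        log.append({
            "kind": "tool_call",
            "tool": tool,
            "args": args,
            "response_blob": digest,      # on replay, return these bytes instead
        })
        return raw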

File State

This is the gap most tracing tools leave: they capture events but not file state, and replay storage has to fill it. Every time the agent reads a file, you need the exact bytes of that file as of the read. Every time the agent writes a file, you need the new bytes and the prior bytes so you can reconstruct the write later.

Content-addressed storage is the clean pattern here: hash each file blob with SHA-256, store the blob once keyed by hash, and record the hash in the trace. Two reads of the same unchanged file cost one blob of storage, not two.
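A minimal local sketch of content addressing; in production the same two-method interface would front S3, GCS, or a workspace API rather than local disk.

    import hashlib
    from pathlib import Path

    class BlobStore:
        """Content-addressed store: each unique blob is written once, keyed by hash."""

        def __init__(self, root: str):
            self.root = Path(root)
            self.root.mkdir(parents=True, exist_ok=True)

        def put(self, data: bytes) -> str:
            digest = hashlib.sha256(data).hexdigest()
            path = self.root / digest
            if not path.exists():   # dedupe: a hundred reads of one file cost one blob
                path.write_bytes(data)
            return digest

        def get(self, digest: str) -> bytes:
            return (self.root / digest).read_bytes()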

Fast.io handles this pattern natively through file versioning. Every write creates a new version, older versions stay retrievable, and audit trails record who touched what. When an agent works in a Fast.io workspace, the version history becomes a replay log for free. Other options for this layer include self-hosted content-addressed stores like Git LFS or S3 with versioning enabled.

Environment and Clock

The smallest capture point and the one most often skipped. Record the wall clock time of each step, the agent's working directory, relevant environment variables, and any random seeds the agent itself set. If the agent reads the date to decide what quarter it is, the replay has to hand back the same date.
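A sketch of the snapshot, with an illustrative key list; record whatever environment facts your agent actually reads.

    import os
    import time

    def snapshot_environment(env_keys=("TZ", "LANG")) -> dict:
        return {
            "wall_clock": time.time(),   # the date the agent will see on replay
            "cwd": os.getcwd(),
            "env": {k: os.environ.get(k) for k in env_keys},
        }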

Storage Architecture: What Goes Where

A replay storage system needs three layers: an append-only event log for the trace, a content-addressed blob store for file bytes and large tool responses, and an index so you can find traces later.

The Event Log

The event log is the source of truth. Each entry is a small JSON record describing one step: a model call, a tool call, a file read, a file write, or an internal decision point. Entries reference blob hashes for anything large.

Append-only matters. If you let agents edit prior entries, replay stops being a faithful record and becomes a narrative. Write once, read many.

Typical storage targets: a single JSONL file per run for small workloads, a Parquet dataset partitioned by day for larger ones, or a managed columnar store like ClickHouse or BigQuery when you need to query across runs.
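For the single-JSONL-file case, the writer can stay very small. A sketch:

    import json

    class EventLog:
        """Append-only JSONL: one JSON object per line, never rewritten."""

        def __init__(self, path: str):
            self._file = open(path, "a", encoding="utf-8")  # append mode only

        def append(self, event: dict) -> None:
            self._file.write(json.dumps(event, sort_keys=True) + "\n")
            self._file.flush()  # a crash mid-run still leaves a valid prefix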

The Blob Store

The blob store holds everything too big to inline: file contents, HTML bodies from scraped pages, full tool-call response payloads, screenshot images. Content addressing deduplicates naturally, which matters when an agent reads the same config file a hundred times per run.

Good targets: S3 with object versioning, Google Cloud Storage, or a workspace platform like Fast.io that exposes blob storage with versioning through an API or MCP server. Avoid local disk for anything you want to keep past a week. Agents produce a lot of data and local disks fill up.

The Index

The index is how you find traces later. At minimum, index by run ID, agent name, timestamp, and outcome. Add custom tags for eval set membership, user ID, or feature flag state. If you are running RAG-style queries against past runs, a vector index of prompts or outputs pays for itself quickly.
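A sketch of that minimum index, using SQLite as a stand-in; the schema is illustrative, not a standard.

    import sqlite3

    conn = sqlite3.connect("trace_index.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            run_id   TEXT PRIMARY KEY,
            agent    TEXT,
            started  TEXT,   -- ISO-8601 timestamp
            outcome  TEXT,   -- e.g. 'success' or 'failure'
            tags     TEXT    -- JSON-encoded custom tags (eval set, user, flags)
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_agent_time ON runs (agent, started)")
    conn.commit()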

Intelligence Mode on a Fast.io workspace handles this automatically: uploaded trace files get indexed for semantic search, and you can query them through chat or the MCP server without standing up a separate vector database.

Fast.io features

Store agent replay traces in a versioned workspace

Fast.io gives AI agents 50 GB of free storage, file versioning, audit trails, and MCP access for writing and reading replay traces. No credit card, no expiration.

Implementing Replay with Fast.io

A concrete sketch of how this fits together when you use Fast.io as the storage layer. The same pattern works with S3 or any other content-addressed store; Fast.io is convenient because it combines versioning, audit trails, and indexing in one workspace.

Create a workspace per agent environment: one for production, one for staging, one per eval suite. Every run writes its event log as a JSONL file at runs/{run_id}/trace.jsonl and its blob store under runs/{run_id}/blobs/{sha256}. File versioning handles the append-only guarantee; audit trails record who wrote what.

The agent itself writes through the Fast.io API or MCP server. Fast.io exposes Streamable HTTP at /mcp and legacy SSE at /sse for MCP clients, and the standard REST API for direct calls. The free agent plan gives you 50 GB of storage, 5,000 credits per month, 5 workspaces, and no credit card. That is enough room for thousands of small runs and a few hundred larger ones.

When you replay, a worker pulls the trace file, rehydrates each blob by hash, and feeds the sequence to the replay engine. Because Fast.io stores every version of every file, you can rewind the workspace to any point in the run and see the exact state the agent saw.
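A sketch of that worker loop, reusing the illustrative field names from the earlier capture sketches; `blob_get` is any hash-to-bytes resolver and `handlers` maps event kinds into your replay engine.

    import json

    def replay(trace_path: str, blob_get, handlers: dict) -> None:
        with open(trace_path, encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                # Rehydrate blob references so handlers see the original payloads.
                for key in list(event):
                    if key.endswith("_blob"):
                        event[key[:-5]] = blob_get(event[key])
                handlers[event["kind"]](event)   # never touches a live service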

Ownership transfer fits naturally here. An agent can build up a workspace of replay traces for a client, then hand the workspace over to the human team for audit, while keeping admin access for future runs. Webhooks notify your eval harness when a new trace lands, so you can kick off regression runs without polling.

For a deeper walkthrough of the agent-storage pattern, see /storage-for-agents/ and the MCP skill docs.

[Image: agent workspace with versioned files and audit trail]

Practical Tradeoffs and Common Mistakes

A few decisions come up on every replay project.

How much to store. Full capture is expensive. A multi-hour research agent can produce gigabytes of HTML. Most teams start with full capture during development, then move to sampled capture in production: every trace keeps the event log, but only 1 in N keeps the full blob payloads. Failed runs always get full capture, successful ones get sampled.
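A sketch of the sampling decision, derived deterministically from the run ID so retries of the same run agree:

    import hashlib

    def keep_full_blobs(run_id: str, failed: bool, n: int = 10) -> bool:
        # Failed runs always get full capture; successful runs keep full
        # blob payloads 1 in n times. Every run keeps the event log.
        if failed:
            return True
        return int(hashlib.sha256(run_id.encode()).hexdigest(), 16) % n == 0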

How long to keep it. Replay traces are most useful in the first 30 days after a run. After that, model and prompt changes usually mean a replay is interesting as historical evidence rather than a live test. A 30-day hot tier on fast storage plus a cheap archive tier for anything older is a reasonable default.

PII and secrets. Agents read and write sensitive data. Replay storage inherits whatever the agent saw, which means your replay store is now a compliance surface. Redact at capture time when you can, encrypt at rest, and apply granular permissions so only the replay worker can read blobs. Fast.io's permission model (org, workspace, folder, file) maps cleanly to this, and the same approach works with S3 bucket policies.

Nondeterminism in the model itself. Even with temperature zero, some providers do not guarantee bitwise identical outputs across regions or minor version bumps. Pin the model version explicitly in your trace, and accept that replay may produce tokens that differ slightly from the original even with perfect input capture. For most eval work this is fine; for strict bitwise replay you need a self-hosted model with a fixed seed.

Forgetting internal state. If your agent maintains scratchpad memory between steps, that memory is part of the input to the next model call. Capture it in the event log as its own field. Teams that only capture the prompt and forget the scratchpad end up with replays that diverge halfway through and cannot explain why.

The shortest version of all of this: treat every agent run as a data-generating process whose output is a reproducible trace, not just a result. Once you have that mindset, the storage layer follows naturally.

Frequently Asked Questions

What is deterministic replay for AI agents?

Deterministic replay is the ability to reproduce an agent run exactly by replaying stored inputs, tool responses, and file state instead of re-executing against live services. It turns a one-shot run into a reusable artifact for debugging, regression testing, and training.

How do you make an LLM agent run reproducible?

Pin the model version, set temperature to zero or record the seed, capture the full prompt and tool schema for every model call, record the exact bytes of every tool response, and version every file the agent reads or writes. Replay works by feeding those stored records back to the agent instead of calling live services.

What do you need to store to replay an agent?

At minimum: model inputs and outputs for every call, tool responses in full, file state at each read and write, and environment details like wall clock time and working directory. A typical implementation uses an append-only event log plus a content-addressed blob store for large payloads.

Why not just replay by re-calling the tools?

External tools change. Search results shift hourly, databases get written to, and web pages get redeployed. Re-calling tools during replay introduces drift that defeats the point of determinism. Store the original response and replay from storage.

How much storage does replay add?

It depends on the agent. A short code-writing run might produce a few hundred kilobytes. A multi-hour research agent that scrapes the web can produce gigabytes. Content-addressed deduplication helps a lot because agents re-read the same files many times per run.

Can Fast.io store replay traces?

Yes. Fast.io workspaces provide file versioning, audit trails, and an indexed storage layer that map well to replay needs. The free agent plan offers 50 GB of storage, 5,000 credits per month, and no credit card, which is enough for thousands of small runs.
