How to Handle Long-Running Tasks in AI Agents
AI agents that run for minutes or hours need more than a basic request-response loop. This guide covers five production strategies for keeping long-running agent tasks reliable: checkpointing state to persistent storage, decoupling work through message queues, using durable execution frameworks, setting timeout and retry policies, and reporting progress to humans.
Why Long-Running Agent Tasks Fail
Most agent frameworks assume tasks complete in seconds. The agent sends a prompt, gets a response, maybe calls a tool, and returns a result. That works for answering questions or summarizing a document. It breaks down when your agent needs to process 300 invoices, run a multi-step research pipeline, or coordinate a week-long content workflow.
The problem is straightforward: long tasks encounter more failure points. Network timeouts, API rate limits, server restarts, context window exhaustion, and simple infrastructure hiccups all become likely over a multi-hour run. A task that has a 99% chance of surviving any given minute still has only a 55% chance of completing a 60-minute job without interruption.
Addy Osmani draws a useful distinction between three types of "long-running" agent work. Long-horizon reasoning involves multi-step, dependent planning, where model quality matters most. Long-running execution covers processes that run for hours with thousands of model invocations, where the architecture matters most. Persistent agency means the agent's identity outlives individual tasks and it accumulates memory across sessions.
Each type demands different engineering. This guide focuses on the execution layer: the infrastructure patterns that keep an agent productive whether a task takes 5 minutes or 5 hours.
Strategy 1: Checkpoint Agent Progress
Checkpointing is the simplest and most effective pattern for long-running agents. The idea: save your agent's progress at regular intervals so it can resume after any interruption instead of starting over.
The implementation follows a loop. Your agent picks up the next unit of work, processes it, writes the result and its current position to storage, then moves on. If the process crashes at step 47 of 100, it reads the checkpoint file, sees that steps 1 through 46 are complete, and continues from step 47.
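A minimal sketch of that loop in Python, assuming a local JSON checkpoint file and a hypothetical process_item function standing in for the agent's unit of work:

import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # local file; could equally be a workspace or S3 object

def load_checkpoint():
    # Resume data: which items are done, plus intermediate results the agent needs.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"completed": [], "results": {}}

def save_checkpoint(state):
    # Write to a temp file and rename so a crash mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_PATH)

def run(items):
    state = load_checkpoint()
    for item_id in items:
        if item_id in state["completed"]:
            continue                         # already done in a previous run
        result = process_item(item_id)       # hypothetical unit of work (e.g. one document)
        state["results"][item_id] = result
        state["completed"].append(item_id)
        save_checkpoint(state)               # checkpoint after each meaningful unit of work

On restart, run() skips everything in the completed list and picks up at the first unfinished item.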
What to checkpoint
Save the minimum state needed to resume: which items have been processed, any intermediate results, the current position in a task list, and accumulated context the agent needs for decision-making. Avoid checkpointing the entire model context or conversation history. That grows too fast and makes recovery slow.
Checkpoint granularity
Checkpoint too often and you spend more time writing state than doing work. Checkpoint too rarely and you lose too much progress on failure. A good default: checkpoint after every meaningful unit of work. If you are processing documents, checkpoint after each document. If you are running a multi-stage pipeline, checkpoint after each stage.
Where to store checkpoints
Local files work for single-machine agents. For production systems where multiple agents coordinate or where you need human visibility into progress, use a shared workspace. Fast.io workspaces give agents persistent storage that humans can also access, so a team lead can open the workspace and see exactly where a long-running job stands. The Fast.io MCP server lets agents read and write checkpoint files through standard tool calls.
For teams already using cloud infrastructure, S3 or GCS work fine as checkpoint stores. The tradeoff: you get cheap, durable storage but lose the collaboration layer. Nobody can browse an S3 bucket as easily as opening a shared workspace.
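If you go the S3 route, the checkpoint store itself is only a few lines of boto3. A sketch, assuming a bucket named agent-checkpoints and the same state dictionary as the loop above:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "agent-checkpoints"  # assumed bucket name

def save_checkpoint_s3(task_id, state):
    # One object per task; overwriting keeps only the latest position.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"checkpoints/{task_id}.json",
        Body=json.dumps(state).encode("utf-8"),
    )

def load_checkpoint_s3(task_id):
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"checkpoints/{task_id}.json")
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return None  # no checkpoint yet; start from the beginning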
Example: the Ralph Loop
Osmani describes the "Ralph Loop" pattern, a surprisingly effective approach that uses a bash script and filesystem state:
1. Pick next unfinished task from a list file
2. Build prompt with task context and persistent notes
3. Invoke the agent
4. Run verification checks on the output
5. Append results to a progress file
6. Update task status
7. Repeat
State lives entirely outside the model's context window, on the filesystem. The agent can crash, restart, even switch to a different model, and pick up exactly where it left off.
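A compressed Python equivalent of that loop, assuming a tasks.json task list, a notes.md scratch file, and hypothetical agent.run and verify helpers:

import json

def ralph_loop():
    with open("tasks.json") as f:
        tasks = json.load(f)                # [{"id": ..., "prompt": ..., "status": ...}, ...]
    with open("notes.md") as f:
        notes = f.read()                    # persistent notes carried across iterations
    for task in tasks:
        if task["status"] == "done":
            continue                                      # 1. pick next unfinished task
        prompt = f"{notes}\n\nTask: {task['prompt']}"     # 2. build prompt with context and notes
        output = agent.run(prompt)                        # 3. invoke the agent (hypothetical client)
        if verify(output):                                # 4. verification checks on the output
            with open("progress.md", "a") as f:
                f.write(f"## {task['id']}\n{output}\n")   # 5. append results to a progress file
            task["status"] = "done"                       # 6. update task status
        with open("tasks.json", "w") as f:
            json.dump(tasks, f, indent=2)                 # persist status so a restart resumes here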
Strategy 2: Decouple Work With Message Queues
Checkpointing handles recovery. Message queues handle a different problem: separating the "decide what to do" step from the "do the work" step.
In a direct execution model, the orchestrator calls the agent, the agent does the work, and the orchestrator waits. If the work takes 30 minutes, the orchestrator is blocked for 30 minutes. If the orchestrator crashes, the work is lost.
With a queue, the orchestrator publishes a task message and moves on. A worker picks up the message, does the work, and publishes the result. The orchestrator and worker are fully independent. Either can crash and restart without affecting the other.
When queues help
Queues are valuable when you have multiple tasks that can run in parallel, when task execution time is unpredictable, or when you want to scale workers independently from orchestrators. If your agent pipeline processes 50 articles through research, writing, and review stages, a queue between each stage lets you run three researchers, two writers, and one reviewer simultaneously.
Queue options
For simple setups, Redis with BullMQ gives you reliable queues with retry support in a few lines of TypeScript. For larger deployments, RabbitMQ or Amazon SQS provide managed durability. If you are already in the Kubernetes ecosystem, NATS or Kafka work well.
The pattern for AI agent work is consistent regardless of the queue technology:
# Publisher (orchestrator)
queue.publish({
    "task_id": "doc-047",
    "action": "research",
    "input": {"topic": "quarterly earnings", "sources": 5},
    "checkpoint_workspace": "workspace-id"
})

# Worker (agent)
message = queue.consume()
result = agent.run(message.action, message.input)
storage.write(message.checkpoint_workspace, f"{message.task_id}.json", result)
queue.ack(message)
The worker writes results to shared storage (a Fast.io workspace, S3, or local disk) and acknowledges the message only after the result is persisted. If the worker crashes before acknowledging, the queue redelivers the message to another worker.
Combining queues with checkpointing
These strategies compose well. Each worker checkpoints its own progress within a task, so a redelivered message does not restart from scratch. The queue handles task-level reliability. Checkpointing handles step-level reliability within each task.
Give Your Agents a Workspace That Persists
Fast.io gives AI agents 50GB of free persistent storage with built-in intelligence, audit trails, and human collaboration. No credit card required.
Strategy 3: Use Durable Execution Frameworks
Durable execution takes checkpointing and wraps it in a framework that handles the mechanics automatically. You write normal-looking code with function calls, loops, and conditionals. The runtime intercepts each step, persists the result, and replays completed steps on recovery.
How durable execution works
When your agent calls an LLM or external API through a durable runtime, the framework logs the call and its result before returning. If the process restarts, the framework replays the log: instead of re-calling the LLM, it returns the cached result from the previous run. Execution continues from the exact point of failure without repeating any work or making duplicate API calls.
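Conceptually, the runtime behaves like a memoizing wrapper around every step. A toy sketch of the idea in Python, not any particular framework's API:

import json
import os

LOG_PATH = "execution_log.json"

def durable_step(step_id, fn, *args):
    # Replay: if this step already ran in a previous process, return the logged
    # result instead of calling the LLM or API again.
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            log = json.load(f)
    else:
        log = {}
    if step_id in log:
        return log[step_id]
    result = fn(*args)              # first execution: actually make the call
    log[step_id] = result
    with open(LOG_PATH, "w") as f:
        json.dump(log, f)           # persist before returning, so a crash after this point is safe
    return result

# Usage: each call site gets a stable step id, e.g.
# summary = durable_step("summarize-doc-047", llm.summarize, document_text)

Real frameworks add durable timers, retries, and distributed workers on top, but the replay-from-log mechanic is the core of all of them.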
Framework options in 2026
Temporal remains the most mature option for durable workflows. It handles retries, timeouts, and state persistence across language boundaries. The learning curve is steep, but for teams already running Temporal, adding agent workflows is straightforward.
Microsoft Durable Task provides a similar model for .NET, Python, Java, and TypeScript. The runtime automatically checkpoints every state transition, including LLM responses, tool call results, and control flow decisions. It is a strong choice for teams in the Azure ecosystem.
LangGraph offers durable execution for LangChain users. If you are using LangGraph with a checkpointer, you already have durable execution enabled. You can pause and resume workflows at any point, even after interruptions or failures.
Cloudflare Project Think introduced "fibers" in early 2026: durable invocations that can checkpoint their own instruction pointer. Unlike frameworks that rely on an external state store, Project Think uses a co-located SQLite database, so checkpointing is fast and local. The ctx.stash() API lets you save execution state mid-loop:
async function researchLoop(ctx, topics) {
  for (const topic of topics) {
    const findings = await agent.research(topic);
    ctx.stash({ topic, findings, completedAt: Date.now() });
  }
}
If the worker restarts, the onFiberRecovered hook triggers, and the agent resumes from the last stashed state.
When to use a framework vs. rolling your own
For tasks under 30 minutes with a single agent, the Ralph Loop or manual checkpointing is simpler and has fewer dependencies. For multi-agent systems, tasks spanning hours or days, or systems that need guaranteed exactly-once execution, a durable framework saves you from reimplementing what Temporal and its peers already handle well.
The decision framework Osmani suggests: what is the longest uninterrupted unit of work your agent needs to perform? If it is minutes, manual checkpointing works. If it is hours or days, reach for a durable execution framework.
Strategy 4: Set Timeout and Retry Policies
Long-running tasks interact with external services: LLM APIs, search engines, databases, file storage. Each call can fail. Your timeout and retry policies determine whether a transient failure kills a multi-hour job or gets handled quietly.
Timeout configuration
Set timeouts at two levels. Request-level timeouts cap individual API calls. A 10 to 20 second timeout for LLM API calls is a reasonable default based on production usage patterns. Task-level timeouts cap the total time an agent spends on a job. Without a task-level timeout, a stuck agent can burn compute indefinitely.
For LLM calls specifically, distinguish between time-to-first-token (how long before the response starts streaming) and total response time. A summarization call might start responding in 2 seconds but take 45 seconds to complete. Timing out at 20 seconds would kill a perfectly healthy response.
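A sketch of both levels in Python, using httpx for the request-level budgets; the timeout values are illustrative, and process_item stands in for a unit of work:

import time
import httpx

# Request level: separate budgets for connecting and for each chunk of a streamed
# response. With streaming, the read timeout bounds the gap between chunks, so a
# healthy 45-second response survives as long as tokens keep arriving.
llm_timeout = httpx.Timeout(connect=5.0, read=15.0, write=10.0, pool=5.0)
client = httpx.Client(timeout=llm_timeout)

# Task level: a wall-clock deadline for the whole job, checked between units of work.
TASK_DEADLINE_SECONDS = 2 * 60 * 60  # assumed 2-hour budget

def run_with_deadline(items):
    started = time.monotonic()
    for item in items:
        if time.monotonic() - started > TASK_DEADLINE_SECONDS:
            raise TimeoutError("task-level deadline exceeded; checkpoint and stop")
        process_item(item)   # hypothetical unit of work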
Retry policies
Not every failure deserves a retry. Classify errors before deciding:
Retry these: Rate limits (HTTP 429), server errors (500, 502, 503, 504), network timeouts, and connection resets. These are transient.
Do not retry these: Authentication failures (401, 403), bad requests (400), context window overflow, and content policy violations. These will fail again with the same input.
For retryable errors, use exponential backoff with jitter. Start with a 0.5 second delay, double it on each retry, and cap at 30 seconds. Add random jitter (plus or minus 20%) to prevent thundering herd problems when multiple agents hit the same rate limit simultaneously.
A production-ready retry configuration:
retry_config = {
    "max_retries": 5,
    "base_delay_seconds": 0.5,
    "max_delay_seconds": 30,
    "exponential_base": 2.0,
    "jitter_fraction": 0.2,
    "retryable_status_codes": [429, 500, 502, 503, 504]
}
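And the delay calculation that goes with it, as a sketch; RetryableError is a hypothetical wrapper around the retryable status codes and network errors:

import random
import time

class RetryableError(Exception):
    # Hypothetical: raise this for 429s, 5xx responses, timeouts, and connection resets.
    pass

def backoff_delay(attempt, cfg=retry_config):
    # Exponential backoff: 0.5s, 1s, 2s, 4s, ... capped at max_delay_seconds.
    delay = cfg["base_delay_seconds"] * (cfg["exponential_base"] ** attempt)
    delay = min(delay, cfg["max_delay_seconds"])
    # Jitter: +/- 20% so many agents hitting the same rate limit do not retry in lockstep.
    jitter = delay * cfg["jitter_fraction"]
    return delay + random.uniform(-jitter, jitter)

def call_with_retries(fn):
    for attempt in range(retry_config["max_retries"] + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == retry_config["max_retries"]:
                raise
            time.sleep(backoff_delay(attempt))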
Circuit breakers
When a service is down, retrying every request wastes time and can make the problem worse. A circuit breaker tracks failure rates and "opens" (stops sending requests) when failures exceed a threshold. After a cooldown period, it allows a single test request through. If that succeeds, the circuit closes and normal traffic resumes.
For agent workflows, implement circuit breakers per external service. If your LLM provider is down, the circuit breaker lets your agent pause that step and continue with work that does not require LLM calls, rather than blocking the entire pipeline.
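A minimal per-service breaker looks roughly like this; the thresholds are illustrative:

import time

class CircuitBreaker:
    # Open after N consecutive failures, allow one test request after the cooldown,
    # and close again as soon as a request succeeds.
    def __init__(self, failure_threshold=5, cooldown_seconds=60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: skipping call to degraded service")
            # Cooldown elapsed: let this single request through as the test probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None
        return result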
Fallback strategies
For critical paths, configure fallback providers. If your primary LLM times out after retries, fall back to a secondary model. The output quality might differ, but the task continues. Some teams run Claude as primary with GPT-4 as fallback, or vice versa, depending on the task type.
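A sketch of that fallback, assuming hypothetical primary and secondary clients with a generate method and reusing the call_with_retries helper sketched above:

def generate_with_fallback(prompt, primary, secondary):
    # Try the primary model first; fall back only after its retries are exhausted.
    try:
        return call_with_retries(lambda: primary.generate(prompt))
    except Exception:
        return secondary.generate(prompt)   # quality may differ, but the task continues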
Strategy 5: Report Progress and Notify Humans
Long-running tasks that operate in silence create anxiety. Is it still working? Did it crash? Is it stuck in a loop? Progress reporting solves this and also provides the data you need for debugging when things go wrong.
Structured progress artifacts
Rather than streaming logs, have your agent maintain structured files that track progress. A progress.json file that records completed steps, current step, estimated completion, and any errors gives both humans and monitoring systems a clear picture.
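A sketch of what the agent might write, with illustrative field names and values:

import json
from datetime import datetime, timezone

progress = {
    "task_id": "doc-047",
    "total_steps": 100,
    "completed_steps": 47,
    "current_step": "extracting terms from contract_048.pdf",
    "estimated_completion": "2026-02-12T16:30:00Z",   # illustrative value
    "errors": [],
    "updated_at": datetime.now(timezone.utc).isoformat(),
}

with open("progress.json", "w") as f:
    json.dump(progress, f, indent=2)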
Store these artifacts in a shared workspace where team members can check status without interrupting the agent. On Fast.io, agents can write progress files to a workspace, and team members see updates in real time through the web interface. The same files serve as both human-readable status reports and machine-parseable monitoring data.
Progress reporting patterns
Completion percentage: Simple and effective for batch jobs. "Processed 47 of 100 documents" is immediately understandable.
Stage-based reporting: For multi-stage pipelines, report the current stage and substage. "Research complete. Writing section 3 of 6" tells you both where the agent is and roughly how long it has left.
Anomaly flagging: Have the agent report when something unexpected happens, even if it handles it. "Rate limited on API call, backing off 30 seconds" is more useful than silence followed by "Done, took 45 minutes instead of 30."
Human notification
For tasks running longer than a few minutes, send notifications at key milestones. The minimum set: task started, task completed, and task failed. For longer runs, add periodic status updates and notifications when the agent needs human input.
Notification channels depend on your team's workflow. Slack webhooks are the most common integration. Email works for less time-sensitive updates. For agents running in Fast.io workspaces, the activity feed and comment system provide built-in notifications without external integrations. An agent can post a comment on a file when processing is complete, and anyone with workspace access sees the update through their normal workflow.
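A minimal Slack notifier is a single POST to an incoming-webhook URL; the URL below is a placeholder:

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # your incoming-webhook URL

def notify(text):
    # Slack incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

notify("contract batch doc-047: 47/100 documents complete")
notify("contract batch doc-047: finished in 38 minutes")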
The human-in-the-loop pause
Some long-running tasks hit decision points that require human judgment. A research agent might find conflicting information and need a human to choose the authoritative source. A content agent might need approval before publishing.
Durable execution frameworks handle this well. The agent pauses with its full execution state intact, including reasoning chain, working memory, and tool history. The human takes hours or days to respond. The agent consumes zero compute while waiting. When the human responds, the agent resumes with sub-second latency, picking up exactly where it left off. Osmani notes this is one of the strongest arguments for durable runtimes: the ability to pause indefinitely without losing state or burning resources.
Putting It All Together: A Production Architecture
These five strategies are not alternatives. They compose into a layered reliability system. Here is how they fit together in a production agent deployment.
The architecture stack
At the base, checkpointing saves progress after each meaningful unit of work. On top of that, message queues decouple task assignment from task execution, letting you scale workers and survive orchestrator failures. A durable execution framework wraps the agent's core logic, handling replay and recovery automatically.
Timeout and retry policies protect every external call.
Progress reporting keeps humans informed and provides debugging data.
Example: a document processing pipeline
Consider an agent that processes a batch of contracts, extracting key terms and generating summaries. The pipeline:
- The orchestrator reads the task list and publishes one message per document to a queue
- Worker agents pick up messages and process documents inside a durable execution runtime
- Each worker checkpoints after extracting terms and again after generating the summary
- LLM calls use exponential backoff with a circuit breaker
- Workers write results to a shared Fast.io workspace where the legal team can review them
- Progress updates post to the workspace activity feed after every 10 documents
- On completion, the agent notifies the team via a workspace comment
If a worker crashes mid-document, the durable runtime replays completed steps and the queue redelivers the message. If the LLM provider goes down, the circuit breaker pauses LLM-dependent work while document fetching and preprocessing continue. The legal team sees real-time progress in their workspace without pinging anyone for status updates.
Monitoring and observability
Track four metrics for long-running agent systems: task completion rate, median task duration, retry frequency per external service, and checkpoint recovery count. A spike in retries signals a degrading dependency. Increasing checkpoint recoveries suggest infrastructure instability. These metrics catch problems before they become incidents.
For audit and compliance requirements, the session-as-event-log pattern captures every thought, tool call, and observation in an append-only log. Fast.io audit trails record file operations and agent activity automatically, giving you a compliance-ready record of what the agent did and when.
Starting small
You do not need all five strategies from day one. Start with checkpointing. If your tasks take more than a few minutes, that single addition dramatically improves reliability. Add queues when you need parallelism. Add a durable framework when manual checkpointing becomes tedious to maintain. Add timeout and retry policies when your agent depends on external services. Add progress reporting when humans start asking "is it done yet?"
The goal is reliability that matches the task duration. A 5-minute task needs checkpointing. A 5-hour task needs the full stack.
Frequently Asked Questions
How do you handle long-running tasks in AI agents?
Use a combination of five strategies: checkpoint agent state to persistent storage after each unit of work, decouple task assignment from execution with message queues, wrap agent logic in a durable execution framework that replays completed steps on recovery, configure timeout and retry policies for every external API call, and report progress to humans through structured artifacts and notifications.
What happens when an AI agent times out?
Without safeguards, a timeout kills the agent process and all progress is lost. With checkpointing, the agent restarts and resumes from its last saved state. With a durable execution framework like Temporal or Cloudflare Project Think, the runtime automatically replays all completed steps and continues from the exact point of failure, without repeating API calls or losing any work.
How do agents checkpoint their work?
Agents save their current position and intermediate results to persistent storage at regular intervals. This can be as simple as writing a JSON file after each completed task, or as sophisticated as a durable execution runtime that automatically logs every function call and its result. The key is choosing the right granularity: checkpoint after each meaningful unit of work, not after every micro-step or only at completion.
Can AI agents run tasks in the background?
Yes. Message queues let you submit tasks that agents process asynchronously while the requesting system continues other work. The agent picks up the task, processes it, writes results to shared storage like a Fast.io workspace or S3 bucket, and notifies the requester on completion. This pattern is standard for production agent systems where tasks take minutes to hours.
What is durable execution for AI agents?
Durable execution is a runtime pattern that automatically checkpoints every step of an agent's workflow. If the agent crashes, the runtime replays the log of completed steps using cached results and continues from where it stopped. Frameworks like Temporal, LangGraph, Microsoft Durable Task, and Cloudflare Project Think all implement this pattern. It eliminates the need to write manual checkpoint and recovery code.
How long can AI agent tasks run in production?
Production agent tasks commonly run from 5 minutes to several hours. Anthropic has demonstrated Claude running autonomous coding tasks for over 30 hours. Google's Agent Runtime supports agents maintaining state for up to seven days. The practical limit depends on your reliability infrastructure: without checkpointing, long tasks have high failure rates. With proper durable execution, multi-day tasks are viable.