How to Build Retry Logic for Reliable AI Agents
Retry patterns for AI agents are fault-tolerance strategies that automatically re-attempt failed LLM calls, tool invocations, and API requests with backoff, jitter, and fallback logic to keep agents running reliably in production. This guide covers exponential backoff, circuit breakers, and agent-specific failure modes.
Why AI Agents Need Retry Patterns
AI agents fail differently than traditional software. LLM API calls fail 1-5% of the time due to rate limits, timeouts, and server errors. Unlike deterministic code, AI agents encounter non-deterministic failures: partial LLM responses, tool timeouts, context window overflow, and model unavailability.
A single failed LLM call can cascade through a multi-agent workflow. Without retry logic, one rate limit error stops the entire pipeline. Retry patterns catch these failures, wait, and try again with smarter strategies.
Common AI agent failure modes:
- Rate limit errors (HTTP 429) from LLM providers
- Timeout failures from slow tool invocations
- Partial responses from context window overflow
- Server errors (HTTP 500, 502, 503, 504)
- Network connection drops
- Tool execution failures (file locks, API downtime)
The goal is resilience: agents should handle transient failures gracefully without manual intervention. For a broader look at building reliable agent systems, see our guide on AI agent error handling.
Core Retry Patterns for AI Agents
Simple Retry
The basic pattern: try the operation, if it fails, wait a fixed amount, then retry up to N times.
When to use: Low-stakes operations, infrequent failures, or when you need predictable timing.
Limitations: Fixed delays don't adapt to system load. All agents retry simultaneously, creating retry storms.
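A minimal sketch of the fixed-delay pattern (the `operation` callable and the delay values are placeholders, not a specific SDK):

```python
import time

def simple_retry(operation, max_retries=3, delay=1.0):
    """Run operation; on failure, wait a fixed delay and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts, surface the error
            time.sleep(delay)  # fixed wait, regardless of system load
```

Note the limitation in code form: `delay` never changes, so every failing agent retries on the same schedule.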
Exponential Backoff
With exponential backoff, the agent sleeps briefly after a failure, then retries; if the request still fails, each subsequent sleep grows exponentially before the next attempt.
How it works: Start with a base delay (1 second). On the next retry, wait base_delay × 2. On the third retry, wait base_delay × 4. Continue doubling the wait time up to a maximum delay until the maximum number of retries is reached.
Why it works: This gives the external service (like the LLM API) more time to recover if it's experiencing sustained load or issues. According to AWS research on distributed systems, exponential backoff with jitter reduces retry storms by 60-80%.
When to use: Rate limits, server-side errors, network failures. This is the default pattern for LLM API retries.
Exponential Backoff with Jitter
Add a small random amount of time (jitter) to the exponential delay. This prevents a "thundering herd" problem where many clients retry simultaneously after a widespread transient failure, overwhelming the service again.
Implementation: wait_time = (base_delay * 2^attempt) + random(0, jitter_max)
Example: Instead of waiting exactly 2 seconds, wait 2.3 seconds. The randomness spreads retries over time.
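The formula above can be sketched as a plain retry loop (function and parameter names are illustrative, not from a specific library):

```python
import random
import time

def backoff_with_jitter(operation, base_delay=1.0, max_delay=60.0,
                        max_retries=5, jitter_max=1.0):
    """Retry with wait_time = (base_delay * 2^attempt) + random(0, jitter_max)."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted
            wait = min(base_delay * (2 ** attempt), max_delay)
            wait += random.uniform(0, jitter_max)  # jitter spreads retries over time
            time.sleep(wait)
```

In practice a library like Tenacity (shown later in this guide) handles this for you; the loop just makes the mechanics explicit.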
Circuit Breaker
A circuit breaker protects an agent from a dependency that keeps failing. The circuit has three states: Closed (normal operation), Open (failures detected, stop trying), Half-Open (testing if service recovered).
How it works:
- Monitor failure rate for a service (e.g., LLM API)
- If failures exceed threshold (e.g., 50% in 1 minute), open the circuit
- While open, fail fast without attempting the request
- After a cooldown period, enter Half-Open and test with one request
- If successful, close the circuit. If failed, reopen.
When to use: Protecting agents from cascading failures when a dependency is consistently down. Prevents wasting time and credits on requests that will fail.
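A minimal sketch of the state machine described above (the `CircuitBreaker` class and `CircuitOpenError` are illustrative names, matching the test examples later in this guide; this version counts consecutive failures rather than a failure rate over a time window):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold  # consecutive failures before opening
        self.cooldown = cooldown    # seconds to stay open before a probe
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "HALF_OPEN"  # cooldown elapsed, allow one probe request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"   # failed probe or threshold hit: reopen
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "CLOSED"         # success closes the circuit
        return result
```

While the circuit is open, the agent spends no time or API credits on requests that are going to fail anyway.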
Fallback Models
If the primary LLM fails repeatedly, switch to a backup model.
Strategy: Try your primary model first, then fall back to a faster or cheaper alternative if unavailable. Fast.io works with Claude, GPT, Gemini, LLaMA, and local models, making multi-LLM fallback straightforward.
When to use: High-availability agent systems that can tolerate reduced quality over complete failure.
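A fallback chain can be as simple as trying clients in priority order. This sketch assumes each client exposes a `.chat(prompt)` method (a hypothetical interface, not a specific SDK):

```python
def call_with_fallback(prompt, clients):
    """Try each model client in priority order; return the first success."""
    last_error = None
    for client in clients:
        try:
            return client.chat(prompt)
        except Exception as exc:
            last_error = exc  # remember the error, try the next model
    raise last_error          # every model in the chain failed
```

Ordering the list as primary model first, then cheaper or faster backups, gives you graceful degradation instead of a hard failure.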
Human Escalation
Some failures can't be resolved automatically. After N retries, escalate to a human.
How it works: Agent detects repeated failures, creates a notification or task for a human operator, pauses the workflow until resolved.
When to use: Agent job failures where correctness matters more than speed (document processing, invoice generation, contract analysis).
Give Your AI Agents Persistent Storage
Fast.io's agent tier gives you 50GB free storage, built-in RAG, 251 MCP tools, and file locks for multi-agent systems. Store checkpoints, query state with AI, and keep agents running reliably.
How to Implement Exponential Backoff in Python
Tenacity is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything.
Basic exponential backoff with Tenacity:
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
)
def call_llm_api(prompt):
    response = llm_client.chat(prompt)
    return response
This configuration:
- Waits 2 seconds, then 4, 8, and 16 between the five attempts (capped at 60 seconds)
- Stops after 5 attempts
- Automatically retries on exceptions
Adding jitter to prevent retry storms:
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(5),
)
def call_llm_with_jitter(prompt):
    response = llm_client.chat(prompt)
    return response
Retry only on specific errors:
from tenacity import retry, retry_if_exception_type, stop_after_attempt

# RateLimitError comes from your LLM provider's SDK
@retry(
    retry=retry_if_exception_type((RateLimitError, TimeoutError)),
    stop=stop_after_attempt(5),
)
def safe_llm_call(prompt):
    response = llm_client.chat(prompt)
    return response
This only retries on rate limit and timeout errors, not on other exceptions like authentication failures.
How to Choose a Retry Strategy by Failure Type
Different failure types need different retry strategies.
Rate Limits (HTTP 429):
- Pattern: Exponential backoff with jitter
- Base delay: 1-2 seconds
- Max retries: 5-7
- Why: Rate limits are temporary. Backoff gives the API time to reset.
Server Errors (HTTP 500, 502, 503, 504):
- Pattern: Exponential backoff
- Base delay: 2 seconds
- Max retries: 3-5
- Why: Server issues may resolve quickly, but don't retry indefinitely.
Network Timeouts:
- Pattern: Simple retry with fixed delay
- Delay: 5 seconds
- Max retries: 2-3
- Why: Network issues are often transient but may indicate a deeper problem.
Tool Execution Failures:
- Pattern: Simple retry with backoff
- Delay: Depends on tool (file lock: 1s, API call: 5s)
- Max retries: 3
- Why: Tool failures can be idempotent (safe to retry) or non-idempotent (dangerous to retry).
Context Window Overflow:
- Pattern: Fallback to model with larger context
- No retry: Context is deterministic, retrying won't help
- Why: Switch to a model with a larger context window, or truncate input.
Partial LLM Responses:
- Pattern: Resume generation with continuation prompt
- Max attempts: 2
- Why: Partial responses often mean the model hit token limits mid-generation.
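The recommendations above can be collapsed into a lookup table that a retry layer consults before deciding how to handle an error (the error-kind names and policy keys are illustrative):

```python
RETRY_POLICY = {
    "rate_limit":       {"pattern": "exponential_jitter", "base_delay": 2, "max_retries": 6},
    "server_error":     {"pattern": "exponential",        "base_delay": 2, "max_retries": 4},
    "network_timeout":  {"pattern": "fixed",              "base_delay": 5, "max_retries": 3},
    "tool_failure":     {"pattern": "backoff",            "base_delay": 1, "max_retries": 3},
    "context_overflow": {"pattern": "fallback_model",     "base_delay": 0, "max_retries": 0},
    "partial_response": {"pattern": "continuation",       "base_delay": 0, "max_retries": 2},
}

def policy_for(status_code=None, error_kind=None):
    """Map an HTTP status code or a named error kind to a retry policy."""
    if status_code == 429:
        return RETRY_POLICY["rate_limit"]
    if status_code in (500, 502, 503, 504):
        return RETRY_POLICY["server_error"]
    return RETRY_POLICY.get(error_kind)
```

Centralizing the policy like this keeps retry tuning in one place instead of scattered across every agent.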
Multi-Agent Retry Coordination
In multi-agent systems, retry patterns need coordination to prevent cascading failures.
Pattern 1: Centralized Retry Queue
- Failed tasks go into a shared retry queue
- A coordinator agent re-dispatches after delay
- Prevents individual agents from clogging the system with retries
Pattern 2: Agent-Level Circuit Breakers
- Each agent tracks its own failure rate
- If agent A's LLM calls fail 50% of the time, agent A stops making calls
- Other agents continue working normally
Pattern 3: Shared State with File Locks
- When multiple agents access shared files, use file locks to prevent conflicts
- Fast.io supports file locks for concurrent access in multi-agent systems
- Agents acquire locks, retry if locked, release when done
Pattern 4: Idempotent Operations
- Design agent operations to be safely retryable
- Use unique task IDs to detect duplicate work
- Store completed task IDs to prevent re-execution
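Pattern 4 can be sketched with a completed-task set (held in memory here for brevity; in production you would persist it to a file, database, or workspace):

```python
completed_tasks = set()  # in production, persist this across restarts

def run_once(task_id, operation):
    """Execute operation only if task_id hasn't completed yet,
    so a retry after a crash never repeats finished work."""
    if task_id in completed_tasks:
        return "skipped"          # duplicate dispatch, safe no-op
    result = operation()
    completed_tasks.add(task_id)  # record only after success
    return result
```

With this guard in place, a coordinator can freely re-dispatch any task whose outcome it is unsure about.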
Fast.io's workspace model supports multi-agent collaboration with granular permissions, file locks, and audit logs to track which agent performed which action.
Storing Agent State for Retries
Effective retry patterns need persistent state. If an agent crashes mid-workflow, where does it resume?
Checkpointing Strategy:
- Save workflow state after each successful step
- On retry, load the last checkpoint and resume
- Avoid re-executing completed work
Where to store state:
- File-based: Write JSON state files to a workspace (Fast.io's Intelligence Mode auto-indexes them for retrieval)
- Database: SQLite for local agents, PostgreSQL for distributed systems
- Object storage: S3 or Fast.io workspaces for large state objects
State structure:
{
  "task_id": "generate-report-2026-02-14",
  "status": "in_progress",
  "completed_steps": ["fetch_data", "analyze"],
  "pending_steps": ["generate_pdf", "upload"],
  "retry_count": 2,
  "last_error": "Rate limit exceeded",
  "last_checkpoint": "2026-02-14T10:30:00Z"
}
Fast.io's agent tier includes 50GB free storage with built-in RAG. Agents can query their own state files using natural language: "Show me all tasks that failed with rate limits in the last hour."
Testing Retry Logic
Simulate failures in tests:
import pytest
from unittest.mock import Mock

def test_retry_on_rate_limit():
    mock_api = Mock()
    mock_api.chat.side_effect = [
        RateLimitError("Too many requests"),
        RateLimitError("Too many requests"),
        {"response": "Success"},
    ]
    result = call_llm_with_retry(mock_api, "test prompt")
    assert result["response"] == "Success"
    assert mock_api.chat.call_count == 3
Test exponential backoff timing:
import time

def test_backoff_timing():
    start = time.time()
    with pytest.raises(RateLimitError):  # assumes the retry config re-raises the original error (e.g. Tenacity's reraise=True)
        call_llm_with_retry(always_fails_api, "test")
    duration = time.time() - start
    assert duration > 31  # sum of waits: 1 + 2 + 4 + 8 + 16
    assert duration < 35  # allow some margin
Test circuit breaker state transitions:
def test_circuit_breaker():
    circuit = CircuitBreaker(threshold=3)
    for _ in range(3):  # cause 3 failures to open the circuit
        with pytest.raises(Exception):
            circuit.call(failing_function)
    assert circuit.state == "OPEN"
    with pytest.raises(CircuitOpenError):  # verify fast-fail
        circuit.call(failing_function)
Production Monitoring for Retries
Track retry metrics to understand agent reliability:
Key metrics:
- Retry rate (retries / total requests)
- Success after retry rate (successful retries / total retries)
- Average retry delay (time spent waiting)
- Failure types (rate limits vs server errors vs timeouts)
- Circuit breaker state changes
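The key metrics above can be tracked with a small in-process counter before wiring them into a full monitoring system (class and method names are illustrative):

```python
from collections import Counter

class RetryMetrics:
    """Track retry rate, success-after-retry, and failure types."""
    def __init__(self):
        self.total_requests = 0
        self.total_retries = 0
        self.successful_retries = 0
        self.failure_types = Counter()

    def record(self, retries, succeeded, failure_type=None):
        """Call once per request with how many retries it took."""
        self.total_requests += 1
        self.total_retries += retries
        if retries and succeeded:
            self.successful_retries += 1
        if failure_type:
            self.failure_types[failure_type] += 1

    @property
    def retry_rate(self):
        return self.total_retries / self.total_requests if self.total_requests else 0.0
```

Exporting these counters to your alerting system is what makes thresholds like "retry rate > 10%" actionable.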
Alerts to set:
- Retry rate > 10% (something's wrong upstream)
- Circuit breaker open for > 5 minutes
- Max retries exhausted > 5% of the time
- Agent stuck in retry loop for > 30 minutes
Fast.io provides audit logs for all agent actions, including file access, API calls, and workspace changes. Use webhooks to send retry events to your monitoring system in real time.
Frequently Asked Questions
What is the best retry strategy for LLM APIs?
Exponential backoff with jitter is the industry standard for LLM APIs. It handles rate limits gracefully by doubling wait time between retries and adding randomness to prevent retry storms. Start with a 1-2 second base delay, double on each retry, and stop after 5-7 attempts. Libraries like Tenacity (Python) make this trivial to implement.
How do I handle AI agent failures?
Use a layered approach: exponential backoff for transient errors, circuit breakers for persistent failures, fallback models for LLM unavailability, and human escalation for unrecoverable errors. Design operations to be idempotent so retries are safe. Store agent state in persistent storage like Fast.io workspaces so you can resume workflows after crashes.
What is exponential backoff for AI agents?
Exponential backoff is a retry pattern where the wait time between retries doubles after each failure. For example: retry 1 waits 1 second, retry 2 waits 2 seconds, retry 3 waits 4 seconds, retry 4 waits 8 seconds. This gives failing services time to recover without overwhelming them with immediate retries. Adding jitter (random variation) prevents retry storms when many agents fail simultaneously.
How do you make AI agents fault tolerant?
Combine retry patterns (exponential backoff, circuit breakers), fallback strategies (use backup LLMs), persistent state (checkpointing), and monitoring (track retry rates and failure types). Design agent workflows to be resumable after crashes. Use file locks for concurrent access. Test failure scenarios explicitly. Fast.io's agent tier provides persistent storage, audit logs, and file locks to support fault-tolerant multi-agent systems.
Should I retry on all LLM API errors?
No. Only retry on transient errors like rate limits (HTTP 429), server errors (HTTP 500, 502, 503, 504), and network timeouts. Don't retry on authentication failures (HTTP 401, 403), bad requests (HTTP 400), or context window overflow. Use retry_if_exception_type to filter which errors trigger retries.
How many retries should an AI agent attempt?
Start conservative with 3-5 retries. For rate limits, 5-7 retries with exponential backoff is common. For server errors, 3 retries is sufficient. Always set a maximum retry count and total timeout to prevent infinite loops. Monitor retry success rates and adjust based on actual failure patterns.
What's the difference between retry patterns and circuit breakers?
Retries handle individual request failures by waiting and trying again. Circuit breakers handle systemic failures by detecting when a service is consistently down and failing fast without attempting requests. Use retries for transient errors (rate limits, timeouts). Use circuit breakers to protect against cascading failures when a dependency is unavailable.