How to Build Retry Logic for Reliable AI Agents
Retry patterns for AI agents are fault-tolerance strategies that automatically re-attempt failed LLM calls, tool invocations, and API requests with backoff, jitter, and fallback logic to keep agents running reliably in production. This guide covers exponential backoff, circuit breakers, and agent-specific failure modes.
Why AI Agents Need Retry Patterns
AI agents fail differently than traditional software. LLM API calls fail 1-5% of the time due to rate limits, timeouts, and server errors. Unlike deterministic code, AI agents encounter non-deterministic failures: partial LLM responses, tool timeouts, context window overflow, and model unavailability.
A single failed LLM call can cascade through a multi-agent workflow. Without retry logic, one rate limit error stops the entire pipeline. Retry patterns catch these failures, wait, and try again with smarter strategies.
Common AI agent failure modes:
- Rate limit errors (HTTP 429) from LLM providers
- Timeout failures from slow tool invocations
- Partial responses from context window overflow
- Server errors (HTTP 500, 502, 503, 504)
- Network connection drops
- Tool execution failures (file locks, API downtime)
The goal is resilience: agents should handle transient failures gracefully without manual intervention. For a broader look at building reliable agent systems, see our guide on AI agent error handling.
Core Retry Patterns for AI Agents
Simple Retry
The basic pattern: try the operation, if it fails, wait a fixed amount, then retry up to N times.
When to use: Low-stakes operations, infrequent failures, or when you need predictable timing.
Limitations: Fixed delays don't adapt to system load. All agents retry simultaneously, creating retry storms.
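A minimal sketch of the fixed-delay pattern (the `operation` callable and the delay values are placeholders, not a specific SDK):

```python
import time

def simple_retry(operation, max_retries=3, delay=1.0):
    """Run operation; on failure, wait a fixed delay and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts, surface the error
            time.sleep(delay)  # fixed wait, regardless of system load
```

Note the limitation in code form: `delay` never changes, so every failing agent retries on the same schedule.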
Exponential Backoff
With exponential backoff, the agent sleeps briefly after a failure, then retries; if the request still fails, each subsequent sleep grows exponentially before the next attempt.
How it works: Start with a base delay (1 second). On the next retry, wait base_delay × 2. On the third retry, wait base_delay × 4. Continue doubling the wait time up to a maximum delay until the maximum number of retries is reached.
Why it works: This gives the external service (like the LLM API) more time to recover if it's experiencing sustained load or issues. According to AWS research on distributed systems, exponential backoff with jitter reduces retry storms by 60-80%.
When to use: Rate limits, server-side errors, network failures. This is the default pattern for LLM API retries.
Exponential Backoff with Jitter
Add a small random amount of time (jitter) to the exponential delay. This prevents a "thundering herd" problem where many clients retry simultaneously after a widespread transient failure, overwhelming the service again.
Implementation: wait_time = (base_delay * 2^attempt) + random(0, jitter_max)
Example: Instead of waiting exactly 2 seconds, wait 2.3 seconds. The randomness spreads retries over time.
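The formula above can be sketched as a plain retry loop (function and parameter names are illustrative, not from a specific library):

```python
import random
import time

def backoff_with_jitter(operation, base_delay=1.0, max_delay=60.0,
                        max_retries=5, jitter_max=1.0):
    """Retry with wait_time = (base_delay * 2^attempt) + random(0, jitter_max)."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted
            wait = min(base_delay * (2 ** attempt), max_delay)
            wait += random.uniform(0, jitter_max)  # jitter spreads retries over time
            time.sleep(wait)
```

In practice a library like Tenacity (shown later in this guide) handles this for you; the loop just makes the mechanics explicit.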
Circuit Breaker
A circuit breaker protects an agent from a dependency that keeps failing. The circuit has three states: Closed (normal operation), Open (failures detected, stop trying), Half-Open (testing if service recovered).
How it works:
- Monitor failure rate for a service (e.g., LLM API)
- If failures exceed threshold (e.g., 50% in 1 minute), open the circuit
- While open, fail fast without attempting the request
- After a cooldown period, enter Half-Open and test with one request
- If successful, close the circuit. If failed, reopen.
When to use: Protecting agents from cascading failures when a dependency is consistently down. Prevents wasting time and credits on requests that will fail.
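A minimal sketch of the state machine described above (the `CircuitBreaker` class and `CircuitOpenError` are illustrative names, matching the test examples later in this guide; this version counts consecutive failures rather than a failure rate over a time window):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold  # consecutive failures before opening
        self.cooldown = cooldown    # seconds to stay open before a probe
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "HALF_OPEN"  # cooldown elapsed, allow one probe request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"   # failed probe or threshold hit: reopen
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "CLOSED"         # success closes the circuit
        return result
```

While the circuit is open, the agent spends no time or API credits on requests that are going to fail anyway.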
Fallback Models
If the primary LLM fails repeatedly, switch to a backup model.
Strategy: Try your primary model first, then fall back to a faster or cheaper alternative if unavailable. Fast.io works with Claude, GPT, Gemini, LLaMA, and local models, making multi-LLM fallback straightforward.
When to use: High-availability agent systems that can tolerate reduced quality over complete failure.
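A fallback chain can be as simple as trying clients in priority order. This sketch assumes each client exposes a `.chat(prompt)` method (a hypothetical interface, not a specific SDK):

```python
def call_with_fallback(prompt, clients):
    """Try each model client in priority order; return the first success."""
    last_error = None
    for client in clients:
        try:
            return client.chat(prompt)
        except Exception as exc:
            last_error = exc  # remember the error, try the next model
    raise last_error          # every model in the chain failed
```

Ordering the list as primary model first, then cheaper or faster backups, gives you graceful degradation instead of a hard failure.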
Human Escalation
Some failures can't be resolved automatically. After N retries, escalate to a human.
How it works: Agent detects repeated failures, creates a notification or task for a human operator, pauses the workflow until resolved.
When to use: Agent job failures where correctness matters more than speed (document processing, invoice generation, contract analysis).
Give Your AI Agents Persistent Storage
Fast.io's agent tier gives you 50GB free storage, built-in RAG, 251 MCP tools, and file locks for multi-agent systems. Store checkpoints, query state with AI, and keep agents running reliably.
How to Implement Exponential Backoff in Python
Tenacity is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything.
Basic exponential backoff with Tenacity:
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
)
def call_llm_api(prompt):
    response = llm_client.chat(prompt)
    return response
This configuration:
- Waits 2 seconds, then 4, 8, and 16 between the five attempts (capped at 60 seconds)
- Stops after 5 attempts
- Automatically retries on exceptions
Adding jitter to prevent retry storms:
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(5),
)
def call_llm_with_jitter(prompt):
    response = llm_client.chat(prompt)
    return response
Retry only on specific errors:
from tenacity import retry, retry_if_exception_type, stop_after_attempt

# RateLimitError comes from your LLM provider's SDK
@retry(
    retry=retry_if_exception_type((RateLimitError, TimeoutError)),
    stop=stop_after_attempt(5),
)
def safe_llm_call(prompt):
    response = llm_client.chat(prompt)
    return response
This only retries on rate limit and timeout errors, not on other exceptions like authentication failures.
How to Choose a Retry Strategy by Failure Type
Different failure types need different retry strategies.
Rate Limits (HTTP 429):
- Pattern: Exponential backoff with jitter
- Base delay: 1-2 seconds
- Max retries: 5-7
- Why: Rate limits are temporary. Backoff gives the API time to reset.
Server Errors (HTTP 500, 502, 503, 504):
- Pattern: Exponential backoff
- Base delay: 2 seconds
- Max retries: 3-5
- Why: Server issues may resolve quickly, but don't retry indefinitely.
Network Timeouts:
- Pattern: Simple retry with fixed delay
- Delay: 5 seconds
- Max retries: 2-3
- Why: Network issues are often transient but may indicate a deeper problem.
Tool Execution Failures:
- Pattern: Simple retry with backoff
- Delay: Depends on tool (file lock: 1s, API call: 5s)
- Max retries: 3
- Why: Tool failures can be idempotent (safe to retry) or non-idempotent (dangerous to retry).
Context Window Overflow:
- Pattern: Fallback to model with larger context
- No retry: Context is deterministic, retrying won't help
- Why: Switch to a model with a larger context window, or truncate input.
Partial LLM Responses:
- Pattern: Resume generation with continuation prompt
- Max attempts: 2
- Why: Partial responses often mean the model hit token limits mid-generation.
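The recommendations above can be collapsed into a lookup table that a retry layer consults before deciding how to handle an error (the error-kind names and policy keys are illustrative):

```python
RETRY_POLICY = {
    "rate_limit":       {"pattern": "exponential_jitter", "base_delay": 2, "max_retries": 6},
    "server_error":     {"pattern": "exponential",        "base_delay": 2, "max_retries": 4},
    "network_timeout":  {"pattern": "fixed",              "base_delay": 5, "max_retries": 3},
    "tool_failure":     {"pattern": "backoff",            "base_delay": 1, "max_retries": 3},
    "context_overflow": {"pattern": "fallback_model",     "base_delay": 0, "max_retries": 0},
    "partial_response": {"pattern": "continuation",       "base_delay": 0, "max_retries": 2},
}

def policy_for(status_code=None, error_kind=None):
    """Map an HTTP status code or a named error kind to a retry policy."""
    if status_code == 429:
        return RETRY_POLICY["rate_limit"]
    if status_code in (500, 502, 503, 504):
        return RETRY_POLICY["server_error"]
    return RETRY_POLICY.get(error_kind)
```

Centralizing the policy like this keeps retry tuning in one place instead of scattered across every agent.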
Multi-Agent Retry Coordination
In multi-agent systems, retry patterns need coordination to prevent cascading failures.
Pattern 1: Centralized Retry Queue
- Failed tasks go into a shared retry queue
- A coordinator agent re-dispatches after delay
- Prevents individual agents from clogging the system with retries
Pattern 2: Agent-Level Circuit Breakers
- Each agent tracks its own failure rate
- If agent A's LLM calls fail 50% of the time, agent A stops making calls
- Other agents continue working normally
Pattern 3: Shared State with File Locks
- When multiple agents access shared files, use file locks to prevent conflicts
- Fast.io supports file locks for concurrent access in multi-agent systems
- Agents acquire locks, retry if locked, release when done
Pattern 4: Idempotent Operations
- Design agent operations to be safely retryable
- Use unique task IDs to detect duplicate work
- Store completed task IDs to prevent re-execution
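Pattern 4 can be sketched with a completed-task set (held in memory here for brevity; in production you would persist it to a file, database, or workspace):

```python
completed_tasks = set()  # in production, persist this across restarts

def run_once(task_id, operation):
    """Execute operation only if task_id hasn't completed yet,
    so a retry after a crash never repeats finished work."""
    if task_id in completed_tasks:
        return "skipped"          # duplicate dispatch, safe no-op
    result = operation()
    completed_tasks.add(task_id)  # record only after success
    return result
```

With this guard in place, a coordinator can freely re-dispatch any task whose outcome it is unsure about.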
Fast.io's workspace model supports multi-agent collaboration with granular permissions, file locks, and audit logs to track which agent performed which action.
Storing Agent State for Retries
Effective retry patterns need persistent state. If an agent crashes mid-workflow, where does it resume?
Checkpointing Strategy:
- Save workflow state after each successful step
- On retry, load the last checkpoint and resume
- Avoid re-executing completed work
Where to store state:
- File-based: Write JSON state files to a workspace (Fast.io's Intelligence Mode auto-indexes them for retrieval)
- Database: SQLite for local agents, PostgreSQL for distributed systems
- Object storage: S3 or Fast.io workspaces for large state objects
State structure:
{
  "task_id": "generate-report-2026-02-14",
  "status": "in_progress",
  "completed_steps": ["fetch_data", "analyze"],
  "pending_steps": ["generate_pdf", "upload"],
  "retry_count": 2,
  "last_error": "Rate limit exceeded",
  "last_checkpoint": "2026-02-14T10:30:00Z"
}
Fast.io's agent tier includes 50GB free storage with built-in RAG. Agents can query their own state files using natural language: "Show me all tasks that failed with rate limits in the last hour."
Testing Retry Logic
Simulate failures in tests:
import pytest
from unittest.mock import Mock

def test_retry_on_rate_limit():
    mock_api = Mock()
    mock_api.chat.side_effect = [
        RateLimitError("Too many requests"),
        RateLimitError("Too many requests"),
        {"response": "Success"},
    ]
    result = call_llm_with_retry(mock_api, "test prompt")
    assert result["response"] == "Success"
    assert mock_api.chat.call_count == 3
Test exponential backoff timing:
import time

def test_backoff_timing():
    start = time.time()
    with pytest.raises(RateLimitError):  # assumes the retry config re-raises the original error (e.g. Tenacity's reraise=True)
        call_llm_with_retry(always_fails_api, "test")
    duration = time.time() - start
    assert duration > 31  # sum of waits: 1 + 2 + 4 + 8 + 16
    assert duration < 35  # allow some margin
Test circuit breaker state transitions:
def test_circuit_breaker():
    circuit = CircuitBreaker(threshold=3)
    for _ in range(3):  # cause 3 failures to open the circuit
        with pytest.raises(Exception):
            circuit.call(failing_function)
    assert circuit.state == "OPEN"
    with pytest.raises(CircuitOpenError):  # verify fast-fail
        circuit.call(failing_function)
Production Monitoring for Retries
Track retry metrics to understand agent reliability:
Key metrics:
- Retry rate (retries / total requests)
- Success after retry rate (successful retries / total retries)
- Average retry delay (time spent waiting)
- Failure types (rate limits vs server errors vs timeouts)
- Circuit breaker state changes
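The key metrics above can be tracked with a small in-process counter before wiring them into a full monitoring system (class and method names are illustrative):

```python
from collections import Counter

class RetryMetrics:
    """Track retry rate, success-after-retry, and failure types."""
    def __init__(self):
        self.total_requests = 0
        self.total_retries = 0
        self.successful_retries = 0
        self.failure_types = Counter()

    def record(self, retries, succeeded, failure_type=None):
        """Call once per request with how many retries it took."""
        self.total_requests += 1
        self.total_retries += retries
        if retries and succeeded:
            self.successful_retries += 1
        if failure_type:
            self.failure_types[failure_type] += 1

    @property
    def retry_rate(self):
        return self.total_retries / self.total_requests if self.total_requests else 0.0
```

Exporting these counters to your alerting system is what makes thresholds like "retry rate > 10%" actionable.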
Alerts to set:
- Retry rate > 10% (something's wrong upstream)
- Circuit breaker open for > 5 minutes
- Max retries exhausted > 5% of the time
- Agent stuck in retry loop for > 30 minutes
Fast.io provides audit logs for all agent actions, including file access, API calls, and workspace changes. Use webhooks to send retry events to your monitoring system in real time.
Frequently Asked Questions
What is the best retry strategy for LLM APIs?
Exponential backoff with jitter is the industry standard for LLM APIs. It handles rate limits gracefully by doubling wait time between retries and adding randomness to prevent retry storms. Start with a 1-2 second base delay, double on each retry, and stop after 5-7 attempts. Libraries like Tenacity (Python) make this trivial to implement.
How do I handle AI agent failures?
Use a layered approach: exponential backoff for transient errors, circuit breakers for persistent failures, fallback models for LLM unavailability, and human escalation for unrecoverable errors. Design operations to be idempotent so retries are safe. Store agent state in persistent storage like Fast.io workspaces so you can resume workflows after crashes.
What is exponential backoff for AI agents?
Exponential backoff is a retry pattern where the wait time between retries doubles after each failure. For example: retry 1 waits 1 second, retry 2 waits 2 seconds, retry 3 waits 4 seconds, retry 4 waits 8 seconds. This gives failing services time to recover without overwhelming them with immediate retries. Adding jitter (random variation) prevents retry storms when many agents fail simultaneously.
How do you make AI agents fault tolerant?
Combine retry patterns (exponential backoff, circuit breakers), fallback strategies (use backup LLMs), persistent state (checkpointing), and monitoring (track retry rates and failure types). Design agent workflows to be resumable after crashes. Use file locks for concurrent access. Test failure scenarios explicitly. Fast.io's agent tier provides persistent storage, audit logs, and file locks to support fault-tolerant multi-agent systems.
Should I retry on all LLM API errors?
No. Only retry on transient errors like rate limits (HTTP 429), server errors (HTTP 500, 502, 503, 504), and network timeouts. Don't retry on authentication failures (HTTP 401, 403), bad requests (HTTP 400), or context window overflow. Use retry_if_exception_type to filter which errors trigger retries.
How many retries should an AI agent attempt?
Start conservative with 3-5 retries. For rate limits, 5-7 retries with exponential backoff is common. For server errors, 3 retries is sufficient. Always set a maximum retry count and total timeout to prevent infinite loops. Monitor retry success rates and adjust based on actual failure patterns.
What's the difference between retry patterns and circuit breakers?
Retries handle individual request failures by waiting and trying again. Circuit breakers handle systemic failures by detecting when a service is consistently down and failing fast without attempting requests. Use retries for transient errors (rate limits, timeouts). Use circuit breakers to protect against cascading failures when a dependency is unavailable.