How to Add Caching to Your MCP Server
MCP server caching stores and reuses tool call results, resource responses, and prompt outputs to reduce latency, lower API costs, and improve agent performance. This guide walks through three caching strategies, shows you how to implement each one, and covers the invalidation patterns that keep your cached data fresh.
What Is MCP Server Caching?
MCP server caching intercepts requests to a Model Context Protocol server and returns stored responses instead of re-executing the underlying logic. When an AI agent calls the same tool with the same arguments twice, a cached server returns the stored result in milliseconds rather than running the full computation again. Three types of MCP data benefit from caching:
- Tool results: Output from expensive tools like database queries, API calls to external services, or document processing. These are the highest-value cache targets.
- Resource responses: Static or semi-static content served to the LLM, such as documentation files, configuration data, or reference material.
- Prompt outputs: Pre-computed prompt templates or context blocks that don't change between requests.

Without caching, every agent interaction triggers a full round-trip execution. Consider a tool like summarize_large_pdf. Without a cache, the server re-processes the same megabytes of text for every follow-up question about that document. With caching, the summary is computed once and reused until the source file changes.

The MCP specification itself doesn't mandate a caching mechanism. That's intentional: you pick the strategy that fits your infrastructure, whether that's a simple in-process dictionary or a distributed Redis cluster shared across multiple server instances.
Why Caching Matters for AI Agents
The performance gap between a cached and uncached MCP server is often the difference between a responsive assistant and a frustrating one. Benchmarks from production MCP deployments show that result caching can reduce server response times by 80-95% for repetitive tasks. Uncached servers can cost more in underlying API calls because agents redundantly fetch the same data across conversation turns. Caching is important in the MCP context for these reasons:
Latency compounds across tool chains. AI agents often call multiple tools in sequence. Without caching, each tool call takes several seconds, causing multi-tool chains to take much longer. With caching, those tools respond in tens of milliseconds, completing chains in under a second. Users notice immediately.
Agents repeat themselves. During a single conversation, an agent might call list_files or get_user_context dozens of times. Each call hits the same endpoint with identical arguments. Without caching, every call pays full cost.
External API rate limits hit fast. If your MCP tools call third-party APIs (search engines, financial data providers, weather services), you'll hit rate limits quickly under heavy agent usage. A cache layer acts as a buffer, absorbing repeated requests before they reach the external service.
Token costs add up. Every uncached tool call that returns large payloads increases the token count flowing through the LLM. Cached responses arrive faster and often prevent redundant context from piling up in the conversation window.
Give Your AI Agents Persistent Storage
Fast.io gives AI agents 50GB of free cloud storage, 251 MCP tools, and built-in file versioning. Use it as a cache backend that survives restarts.
Three Caching Strategies Compared
Choosing the right caching backend depends on your server's deployment model, persistence needs, and whether multiple agents share state. Here's how the three main approaches compare:
In-Memory Caching
- Best for: Local development, single-instance servers, short-lived sessions
- Speed: Sub-millisecond lookups (the fastest of the three)
- Setup: A Python dict or Node.js Map with TTL tracking
- Tradeoff: Data is lost on server restart. Memory consumption grows with cache size. Not shared across server instances.

A basic in-memory cache works well when you're building and testing an MCP server locally. It's the simplest approach: store results in a hash map keyed by the tool name and serialized arguments.
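A minimal version of that hash map might look like the sketch below. Function names are illustrative, not part of any MCP SDK:

```python
import hashlib
import json
import time

_cache: dict = {}

def _key(tool_name: str, arguments: dict) -> str:
    # Serialize with sorted keys so argument order doesn't change the key.
    payload = json.dumps(arguments, sort_keys=True)
    return tool_name + ":" + hashlib.sha256(payload.encode()).hexdigest()

def get_cached(tool_name: str, arguments: dict, ttl: float = 300.0):
    """Return the stored result, or None on a miss or an expired entry."""
    entry = _cache.get(_key(tool_name, arguments))
    if entry is not None and time.time() - entry[0] < ttl:
        return entry[1]
    return None

def set_cached(tool_name: str, arguments: dict, result) -> None:
    _cache[_key(tool_name, arguments)] = (time.time(), result)
```

Expired entries are simply skipped on read here; a production version would also evict them to cap memory growth.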
File-Based Caching
- Best for: Persistent agents, local workflows, single-server production
- Speed: 1-10ms reads depending on file size and disk type
- Setup: Write JSON or binary files to a local directory or cloud storage
- Tradeoff: Slower than memory. Requires disk space management and cleanup routines. Not natively shared across instances.

File-based caching adds persistence without the complexity of a database. Cache entries survive server restarts, which matters for agents that run intermittently. You can use Fast.io's agent storage as a persistent file-based cache backend. The free agent tier provides 50GB of cloud storage, and you can organize cache files into workspaces with automatic versioning.
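A minimal file-backed variant, sketched here with a local directory (the `CACHE_DIR` location and function names are assumptions for illustration):

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("./mcp-cache")  # could be a locally synced cloud folder

def _path_for(key: str) -> Path:
    # Colons in "mcp:tool:hash" keys aren't filesystem-safe everywhere.
    return CACHE_DIR / (key.replace(":", "_") + ".json")

def file_cache_get(key: str, ttl_seconds: int = 600):
    """Read a cached entry from disk; None on a miss or an expired entry."""
    path = _path_for(key)
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    if time.time() - entry["timestamp"] > ttl_seconds:
        path.unlink(missing_ok=True)  # clean up the stale file
        return None
    return entry["result"]

def file_cache_set(key: str, result) -> None:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    _path_for(key).write_text(
        json.dumps({"timestamp": time.time(), "result": result})
    )
```

One JSON file per cache entry keeps cleanup simple: a cron job or startup sweep can delete files older than the longest TTL.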
Distributed Caching (Redis)
- Best for: Production deployments, multi-agent systems, horizontal scaling
- Speed: 1-5ms over network (usually)
- Setup: Redis or Memcached instance with client library
- Tradeoff: Additional infrastructure cost. Network latency on each lookup. Requires connection management and error handling.

Redis is the standard choice when multiple MCP server instances need to share cached state. It handles TTL expiration natively, supports atomic operations, and scales horizontally. For teams running multiple agents at once, distributed caching prevents duplicate work across agents.
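A sketch of the Redis path, written against the redis-py client interface (`get`/`setex`); `client` can be any object exposing those two calls, and the function name is illustrative:

```python
import hashlib
import json

def redis_cached_call(client, tool_name: str, arguments: dict,
                      execute, ttl_seconds: int = 300):
    """Check Redis before executing; store misses with a native TTL.

    `client` is anything with the redis-py `get`/`setex` methods.
    `execute` is the underlying tool handler.
    """
    payload = json.dumps(arguments, sort_keys=True)
    key = f"mcp:{tool_name}:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"
    hit = client.get(key)
    if hit is not None:
        return json.loads(hit)
    result = execute(arguments)
    # setex stores the value and lets Redis expire it after ttl_seconds.
    client.setex(key, ttl_seconds, json.dumps(result))
    return result
```

Because Redis handles expiration itself, there's no timestamp bookkeeping in application code, and every server instance sharing the connection sees the same cache.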
Which one should you pick? Start with in-memory for development. Move to file-based when you need persistence. Switch to Redis when you're running multiple server instances or serving multiple agents at the same time.
How to Implement Response Caching Step by Step
Here's a practical walkthrough for adding caching to an MCP server. These examples use Python, but the patterns apply to any language.
Step 1: Identify Cacheable Tools
Not every tool should be cached. The rule is simple: cache reads, skip writes.
Cache these (deterministic, read-only):
- get_file_contents: same file, same contents
- search_documents: same query, same results (within a time window)
- fetch_weather: same location, same forecast (for a few minutes)
- list_directory: same path, same listing (until files change)
Skip these (side effects, non-deterministic):
- send_email: must execute every time
- create_file: write operation
- get_random_quote: different result expected each time
Step 2: Generate Cache Keys
Build a unique key from the tool name and its arguments. The key must be deterministic, meaning identical inputs always produce the same key:
```python
import hashlib, json

def cache_key(tool_name: str, arguments: dict) -> str:
    payload = json.dumps(arguments, sort_keys=True)
    arg_hash = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"mcp:{tool_name}:{arg_hash}"
```
Sorting keys before hashing ensures that {"a": 1, "b": 2} and {"b": 2, "a": 1} produce the same cache key.
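A quick way to sanity-check that property, restating cache_key so the snippet runs on its own:

```python
import hashlib, json

def cache_key(tool_name: str, arguments: dict) -> str:
    payload = json.dumps(arguments, sort_keys=True)
    arg_hash = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"mcp:{tool_name}:{arg_hash}"

# Argument order differs, but the sorted serialization is identical.
k1 = cache_key("search", {"a": 1, "b": 2})
k2 = cache_key("search", {"b": 2, "a": 1})
```

Note that this only covers key order; nested values with non-deterministic serialization (sets, floats formatted differently) would still need normalizing before hashing.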
Step 3: Add the Cache Wrapper
Wrap your tool handler with check-then-execute logic:
```python
from functools import wraps
from time import time

cache_store = {}

def cached(ttl_seconds: int = 300):
    def decorator(func):
        @wraps(func)
        async def wrapper(arguments: dict):
            key = cache_key(func.__name__, arguments)
            # Check cache
            if key in cache_store:
                entry = cache_store[key]
                if time() - entry["timestamp"] < ttl_seconds:
                    return entry["result"]
            # Execute and store
            result = await func(arguments)
            cache_store[key] = {
                "result": result,
                "timestamp": time()
            }
            return result
        return wrapper
    return decorator

@cached(ttl_seconds=600)
async def search_documents(arguments: dict):
    # Expensive search logic here
    ...
```
Step 4: Set TTLs by Data Volatility
Assign different time-to-live values based on how frequently the underlying data changes:
- **Static data** (documentation, historical records): 24-48 hours
- **Slow-moving data** (user profiles, org settings): 15-60 minutes
- **Moderate data** (search results, file listings): 5-15 minutes
- **Volatile data** (stock prices, live metrics): 30 seconds to 2 minutes
When in doubt, start with a shorter TTL and extend it once you've confirmed the data doesn't change often.
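One way to encode these tiers is a per-tool TTL table with a conservative default; the tool names and values here are illustrative:

```python
# TTLs in seconds, grouped by how fast the underlying data changes.
TOOL_TTLS = {
    "get_documentation": 24 * 3600,   # static data
    "get_user_profile": 30 * 60,      # slow-moving data
    "search_documents": 10 * 60,      # moderate data
    "fetch_stock_price": 60,          # volatile data
}

DEFAULT_TTL = 5 * 60  # short fallback for tools not yet classified

def ttl_for(tool_name: str) -> int:
    """Look up a tool's TTL, falling back to the conservative default."""
    return TOOL_TTLS.get(tool_name, DEFAULT_TTL)
```

Keeping TTLs in one table rather than scattered across decorators makes it easy to audit and tune them as you learn how each data source behaves.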
Cache Invalidation Patterns for MCP
Cache invalidation is the hard part. Stale data in an MCP cache is worse than no cache at all, because the agent makes decisions based on outdated information without knowing it. Here are three invalidation patterns that work well for MCP servers.
Time-Based Invalidation (TTL)
The simplest approach. Every cached entry has an expiration timestamp. After the TTL expires, the next request triggers a fresh execution. This works well when "eventually consistent" is acceptable. Set TTLs based on the data source, not the tool. A get_weather tool calling a weather API should expire within minutes. A get_documentation tool pulling from a static docs site can cache for hours.
Event-Based Invalidation
Clear specific cache keys when a related write action occurs. If an agent calls update_file("report.pdf"), the server should immediately invalidate any cached read_file("report.pdf") results.

```python
def invalidate_on_write(tool_name: str, arguments: dict):
    """Clear related cache entries after a write operation."""
    if tool_name == "update_file":
        path = arguments.get("path")
        read_key = cache_key("read_file", {"path": path})
        cache_store.pop(read_key, None)
        # Also invalidate the parent directory listing
        parent = "/".join(path.split("/")[:-1])
        list_key = cache_key("list_directory", {"path": parent})
        cache_store.pop(list_key, None)
```
This pattern requires mapping write tools to their related read tools, but it keeps the cache accurate in real time.
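One way to maintain that mapping is a small table from each write tool to the read tools it affects, then dropping every cached key with a matching prefix. The table entries below are illustrative, and the key format assumes the "mcp:tool:hash" scheme from Step 2:

```python
cache_store: dict = {}

# Hypothetical write-to-read mapping; extend it as you add tools.
WRITE_INVALIDATES = {
    "update_file": ("read_file", "list_directory"),
    "create_file": ("list_directory",),
    "delete_file": ("read_file", "list_directory"),
}

def invalidate_for(write_tool: str) -> int:
    """Drop every cached entry produced by a read tool related to this write."""
    removed = 0
    for read_tool in WRITE_INVALIDATES.get(write_tool, ()):
        prefix = f"mcp:{read_tool}:"
        for key in [k for k in cache_store if k.startswith(prefix)]:
            del cache_store[key]
            removed += 1
    return removed
```

Prefix-based invalidation is coarser than the per-key example above (it clears all cached reads for the affected tools, not just the touched path), but it needs no knowledge of each tool's argument shape.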
Versioned Invalidation
Tag cached items with a version number. When a major change happens (schema migration, bulk data update), increment the global version to force all cache entries to refresh:
```python
CACHE_VERSION = 3  # Bump this to invalidate everything

def versioned_cache_key(tool_name: str, arguments: dict) -> str:
    base = cache_key(tool_name, arguments)
    return f"v{CACHE_VERSION}:{base}"
```
This is the "nuclear option." Use it sparingly for situations like deploying a new data model or recovering from a corrupted cache.
Combining Patterns
Production MCP servers usually use all three together. TTL provides a safety net so nothing stays cached forever. Event-based invalidation handles the common case of writes that affect reads. Versioned invalidation handles exceptional situations. Layer them and you get a cache that's both fast and accurate.
Using Persistent Storage as a Cache Backend
In-memory caches are fast but temporary. For agents that run intermittently or need to share state across sessions, persistent storage makes a better cache backend. Fast.io's MCP server provides 251 tools for file operations, and these tools work well with a caching strategy. Here's how:
Store cache files in a dedicated workspace. Create a workspace called "agent-cache" and write serialized cache entries as JSON files. Each file represents one cached tool result, named by its cache key. When your MCP server starts up, it reads existing cache files to warm the cache instead of starting cold.
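The warm-up step looks the same for any file-backed store; here is a sketch with a plain local directory standing in for the synced workspace (the one-entry-per-JSON-file layout matches the file-based example earlier):

```python
import json
from pathlib import Path
from time import time

def warm_cache(cache_dir: Path, cache_store: dict,
               ttl_seconds: int = 3600) -> int:
    """Load still-fresh entries from disk into the in-memory cache at startup."""
    loaded = 0
    for path in cache_dir.glob("*.json"):
        entry = json.loads(path.read_text())
        # Skip entries that would already be expired.
        if time() - entry["timestamp"] < ttl_seconds:
            cache_store[path.stem] = entry  # filename stem doubles as the key
            loaded += 1
    return loaded
```

Warming on startup means the first agent request after a restart hits the cache instead of paying full execution cost again.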
Use file versioning as a cache history. Fast.io automatically versions files on update, so you get a history of past cache states. If a cache entry looks wrong, you can inspect or restore previous versions.
Set up webhooks for cross-agent invalidation. If multiple agents share the same cache workspace, webhooks notify other agents when cache files change. Agent A updates a cache entry, and agents B and C receive the webhook and refresh their local copies. The free agent tier includes 50GB of storage, 5,000 monthly credits, and 5 workspaces with no credit card required. That covers most caching scenarios, and the data persists indefinitely (no expiration).
Frequently Asked Questions
How do I add caching to an MCP server?
Add a middleware layer that intercepts tool calls before execution. Generate a cache key from the tool name and serialized arguments, check your storage backend (in-memory dict, local files, or Redis) for a matching key, and return the stored result if it exists and hasn't expired. If there's no cache hit, execute the tool normally and store the result with a TTL.
What should I cache in MCP?
Cache read-only, deterministic tool results such as database queries, document fetches, search results, and API responses. Avoid caching write operations (create, update, delete), non-deterministic outputs (random generators), or tools where real-time freshness is required. The highest-value cache targets are tools that call expensive external APIs.
How do I invalidate MCP cache?
Use a combination of TTL-based expiration, event-based invalidation, and versioned keys. TTLs automatically expire entries after a set period. Event-based invalidation clears specific keys when related write operations occur. Versioned keys let you force a full cache refresh by incrementing a version number in the key prefix.
Does caching improve MCP server performance?
Yes. Caching can reduce MCP server response times by 80-95% for repetitive tool calls. Instead of re-executing expensive logic on every request, the server returns a pre-computed result in milliseconds. This also lowers API costs, reduces token consumption, and protects against rate limits on external services.