How to Add Caching to Your MCP Server
MCP server caching stores and reuses tool call results, resource responses, and prompt outputs to reduce latency, lower API costs, and improve agent performance. This guide walks through three caching strategies, shows you how to implement each one, and covers the invalidation patterns that keep your cached data fresh.
What Is MCP Server Caching?
MCP server caching intercepts requests to a Model Context Protocol server and returns stored responses instead of re-executing the underlying logic. When an AI agent calls the same tool with the same arguments twice, a cached server returns the stored result in milliseconds rather than running the full computation again. Three types of MCP data benefit from caching:
- Tool results: Output from expensive tools like database queries, API calls to external services, or document processing. These are the highest-value cache targets.
- Resource responses: Static or semi-static content served to the LLM, such as documentation files, configuration data, or reference material.
- Prompt outputs: Pre-computed prompt templates or context blocks that don't change between requests.

Without caching, every agent interaction triggers a full round-trip execution. Consider a tool like summarize_large_pdf. Without a cache, the server re-processes the same megabytes of text for every follow-up question about that document. With caching, the summary is computed once and reused until the source file changes.

The MCP specification itself doesn't mandate a caching mechanism. That's intentional: you pick the strategy that fits your infrastructure, whether that's a simple in-process dictionary or a distributed Redis cluster shared across multiple server instances.
Why Caching Matters for AI Agents
The performance gap between a cached and uncached MCP server is often the difference between a responsive assistant and a frustrating one. Benchmarks from production MCP deployments show that result caching can reduce server response times by 80-95% for repetitive tasks. Uncached servers can cost more in underlying API calls because agents redundantly fetch the same data across conversation turns. Caching is important in the MCP context for these reasons:
Latency compounds across tool chains. AI agents often call multiple tools in sequence. Without caching, each tool call takes several seconds, causing multi-tool chains to take much longer. With caching, those tools respond in tens of milliseconds, completing chains in under a second. Users notice immediately.
Agents repeat themselves. During a single conversation, an agent might call list_files or get_user_context dozens of times. Each call hits the same endpoint with identical arguments. Without caching, every call pays full cost.
External API rate limits hit fast. If your MCP tools call third-party APIs (search engines, financial data providers, weather services), you'll hit rate limits quickly under heavy agent usage. A cache layer acts as a buffer, absorbing repeated requests before they reach the external service.
Token costs add up. Every uncached tool call that returns large payloads increases the token count flowing through the LLM. Cached responses arrive faster and often prevent redundant context from piling up in the conversation window.
Give Your AI Agents Persistent Storage
Fast.io gives AI agents 50GB of free cloud storage, 251 MCP tools, and built-in file versioning. Use it as a cache backend that survives restarts.
Three Caching Strategies Compared
Choosing the right caching backend depends on your server's deployment model, persistence needs, and whether multiple agents share state. Here's how the three main approaches compare:
In-Memory Caching
- Best for: Local development, single-instance servers, short-lived sessions
- Speed: Sub-millisecond lookups (the fastest of the three)
- Setup: A Python dict or Node.js Map with TTL tracking
- Tradeoff: Data is lost on server restart. Memory consumption grows with cache size. Not shared across server instances.

A basic in-memory cache works well when you're building and testing an MCP server locally. It's the simplest approach: store results in a hash map keyed by the tool name and serialized arguments.
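A minimal version of that hash map might look like the sketch below. Function names are illustrative, not part of any MCP SDK:

```python
import hashlib
import json
import time

_cache: dict = {}

def _key(tool_name: str, arguments: dict) -> str:
    # Serialize with sorted keys so argument order doesn't change the key.
    payload = json.dumps(arguments, sort_keys=True)
    return tool_name + ":" + hashlib.sha256(payload.encode()).hexdigest()

def get_cached(tool_name: str, arguments: dict, ttl: float = 300.0):
    """Return the stored result, or None on a miss or an expired entry."""
    entry = _cache.get(_key(tool_name, arguments))
    if entry is not None and time.time() - entry[0] < ttl:
        return entry[1]
    return None

def set_cached(tool_name: str, arguments: dict, result) -> None:
    _cache[_key(tool_name, arguments)] = (time.time(), result)
```

Expired entries are simply skipped on read here; a production version would also evict them to cap memory growth.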
File-Based Caching
- Best for: Persistent agents, local workflows, single-server production
- Speed: 1-10ms reads depending on file size and disk type
- Setup: Write JSON or binary files to a local directory or cloud storage
- Tradeoff: Slower than memory. Requires disk space management and cleanup routines. Not natively shared across instances.

File-based caching adds persistence without the complexity of a database. Cache entries survive server restarts, which matters for agents that run intermittently. You can use Fast.io's agent storage as a persistent file-based cache backend. The free agent tier provides 50GB of cloud storage, and you can organize cache files into workspaces with automatic versioning.
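A minimal file-backed variant, sketched here with a local directory (the `CACHE_DIR` location and function names are assumptions for illustration):

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("./mcp-cache")  # could be a locally synced cloud folder

def _path_for(key: str) -> Path:
    # Colons in "mcp:tool:hash" keys aren't filesystem-safe everywhere.
    return CACHE_DIR / (key.replace(":", "_") + ".json")

def file_cache_get(key: str, ttl_seconds: int = 600):
    """Read a cached entry from disk; None on a miss or an expired entry."""
    path = _path_for(key)
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    if time.time() - entry["timestamp"] > ttl_seconds:
        path.unlink(missing_ok=True)  # clean up the stale file
        return None
    return entry["result"]

def file_cache_set(key: str, result) -> None:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    _path_for(key).write_text(
        json.dumps({"timestamp": time.time(), "result": result})
    )
```

One JSON file per cache entry keeps cleanup simple: a cron job or startup sweep can delete files older than the longest TTL.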
Distributed Caching (Redis)
- Best for: Production deployments, multi-agent systems, horizontal scaling
- Speed: 1-5ms over network (usually)
- Setup: Redis or Memcached instance with client library
- Tradeoff: Additional infrastructure cost. Network latency on each lookup. Requires connection management and error handling.

Redis is the standard choice when multiple MCP server instances need to share cached state. It handles TTL expiration natively, supports atomic operations, and scales horizontally. For teams running multiple agents at once, distributed caching prevents duplicate work across agents.
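A sketch of the Redis path, written against the redis-py client interface (`get`/`setex`); `client` can be any object exposing those two calls, and the function name is illustrative:

```python
import hashlib
import json

def redis_cached_call(client, tool_name: str, arguments: dict,
                      execute, ttl_seconds: int = 300):
    """Check Redis before executing; store misses with a native TTL.

    `client` is anything with the redis-py `get`/`setex` methods.
    `execute` is the underlying tool handler.
    """
    payload = json.dumps(arguments, sort_keys=True)
    key = f"mcp:{tool_name}:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"
    hit = client.get(key)
    if hit is not None:
        return json.loads(hit)
    result = execute(arguments)
    # setex stores the value and lets Redis expire it after ttl_seconds.
    client.setex(key, ttl_seconds, json.dumps(result))
    return result
```

Because Redis handles expiration itself, there's no timestamp bookkeeping in application code, and every server instance sharing the connection sees the same cache.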
Which one should you pick? Start with in-memory for development. Move to file-based when you need persistence. Switch to Redis when you're running multiple server instances or serving multiple agents at the same time.
How to Implement Response Caching Step by Step
Here's a practical walkthrough for adding caching to an MCP server. These examples use Python, but the patterns apply to any language.
Step 1: Identify Cacheable Tools
Not every tool should be cached. The rule is simple: cache reads, skip writes.
Cache these (deterministic, read-only):
- get_file_contents: same file, same contents
- search_documents: same query, same results (within a time window)
- fetch_weather: same location, same forecast (for a few minutes)
- list_directory: same path, same listing (until files change)
Skip these (side effects, non-deterministic):
- send_email: must execute every time
- create_file: write operation
- get_random_quote: different result expected each time
Step 2: Generate Cache Keys
Build a unique key from the tool name and its arguments. The key must be deterministic, meaning identical inputs always produce the same key:
```python
import hashlib, json

def cache_key(tool_name: str, arguments: dict) -> str:
    payload = json.dumps(arguments, sort_keys=True)
    arg_hash = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"mcp:{tool_name}:{arg_hash}"
```
Sorting keys before hashing ensures that {"a": 1, "b": 2} and {"b": 2, "a": 1} produce the same cache key.
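A quick way to sanity-check that property, restating cache_key so the snippet runs on its own:

```python
import hashlib, json

def cache_key(tool_name: str, arguments: dict) -> str:
    payload = json.dumps(arguments, sort_keys=True)
    arg_hash = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"mcp:{tool_name}:{arg_hash}"

# Argument order differs, but the sorted serialization is identical.
k1 = cache_key("search", {"a": 1, "b": 2})
k2 = cache_key("search", {"b": 2, "a": 1})
```

Note that this only covers key order; nested values with non-deterministic serialization (sets, floats formatted differently) would still need normalizing before hashing.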
Step 3: Add the Cache Wrapper
Wrap your tool handler with check-then-execute logic:
```python
from functools import wraps
from time import time

cache_store = {}

def cached(ttl_seconds: int = 300):
    def decorator(func):
        @wraps(func)
        async def wrapper(arguments: dict):
            key = cache_key(func.__name__, arguments)
            # Check cache
            if key in cache_store:
                entry = cache_store[key]
                if time() - entry["timestamp"] < ttl_seconds:
                    return entry["result"]
            # Execute and store
            result = await func(arguments)
            cache_store[key] = {
                "result": result,
                "timestamp": time()
            }
            return result
        return wrapper
    return decorator

@cached(ttl_seconds=600)
async def search_documents(arguments: dict):
    # Expensive search logic here
    ...
```
Step 4: Set TTLs by Data Volatility
Assign different time-to-live values based on how frequently the underlying data changes:
- **Static data** (documentation, historical records): 24-48 hours
- **Slow-moving data** (user profiles, org settings): 15-60 minutes
- **Moderate data** (search results, file listings): 5-15 minutes
- **Volatile data** (stock prices, live metrics): 30 seconds to 2 minutes
When in doubt, start with a shorter TTL and extend it once you've confirmed the data doesn't change often.
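One way to encode these tiers is a per-tool TTL table with a conservative default; the tool names and values here are illustrative:

```python
# TTLs in seconds, grouped by how fast the underlying data changes.
TOOL_TTLS = {
    "get_documentation": 24 * 3600,   # static data
    "get_user_profile": 30 * 60,      # slow-moving data
    "search_documents": 10 * 60,      # moderate data
    "fetch_stock_price": 60,          # volatile data
}

DEFAULT_TTL = 5 * 60  # short fallback for tools not yet classified

def ttl_for(tool_name: str) -> int:
    """Look up a tool's TTL, falling back to the conservative default."""
    return TOOL_TTLS.get(tool_name, DEFAULT_TTL)
```

Keeping TTLs in one table rather than scattered across decorators makes it easy to audit and tune them as you learn how each data source behaves.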
Cache Invalidation Patterns for MCP
Cache invalidation is the hard part. Stale data in an MCP cache is worse than no cache at all, because the agent makes decisions based on outdated information without knowing it. Here are three invalidation patterns that work well for MCP servers.
Time-Based Invalidation (TTL)
The simplest approach. Every cached entry has an expiration timestamp. After the TTL expires, the next request triggers a fresh execution. This works well when "eventually consistent" is acceptable. Set TTLs based on the data source, not the tool. A get_weather tool calling a weather API should expire within minutes. A get_documentation tool pulling from a static docs site can cache for hours.
Event-Based Invalidation
Clear specific cache keys when a related write action occurs. If an agent calls update_file("report.pdf"), the server should immediately invalidate any cached read_file("report.pdf") results.

```python
def invalidate_on_write(tool_name: str, arguments: dict):
    """Clear related cache entries after a write operation."""
    if tool_name == "update_file":
        path = arguments.get("path")
        read_key = cache_key("read_file", {"path": path})
        cache_store.pop(read_key, None)
        # Also invalidate the parent directory listing
        parent = "/".join(path.split("/")[:-1])
        list_key = cache_key("list_directory", {"path": parent})
        cache_store.pop(list_key, None)
```
This pattern requires mapping write tools to their related read tools, but it keeps the cache accurate in real time.
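One way to maintain that mapping is a small table from each write tool to the read tools it affects, then dropping every cached key with a matching prefix. The table entries below are illustrative, and the key format assumes the "mcp:tool:hash" scheme from Step 2:

```python
cache_store: dict = {}

# Hypothetical write-to-read mapping; extend it as you add tools.
WRITE_INVALIDATES = {
    "update_file": ("read_file", "list_directory"),
    "create_file": ("list_directory",),
    "delete_file": ("read_file", "list_directory"),
}

def invalidate_for(write_tool: str) -> int:
    """Drop every cached entry produced by a read tool related to this write."""
    removed = 0
    for read_tool in WRITE_INVALIDATES.get(write_tool, ()):
        prefix = f"mcp:{read_tool}:"
        for key in [k for k in cache_store if k.startswith(prefix)]:
            del cache_store[key]
            removed += 1
    return removed
```

Prefix-based invalidation is coarser than the per-key example above (it clears all cached reads for the affected tools, not just the touched path), but it needs no knowledge of each tool's argument shape.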
Versioned Invalidation
Tag cached items with a version number. When a major change happens (schema migration, bulk data update), increment the global version to force all cache entries to refresh:
```python
CACHE_VERSION = 3  # Bump this to invalidate everything

def versioned_cache_key(tool_name: str, arguments: dict) -> str:
    base = cache_key(tool_name, arguments)
    return f"v{CACHE_VERSION}:{base}"
```
This is the "nuclear option." Use it sparingly for situations like deploying a new data model or recovering from a corrupted cache.
Combining Patterns
Production MCP servers usually use all three together. TTL provides a safety net so nothing stays cached forever. Event-based invalidation handles the common case of writes that affect reads. Versioned invalidation handles exceptional situations. Layer them and you get a cache that's both fast and accurate.
Using Persistent Storage as a Cache Backend
In-memory caches are fast but temporary. For agents that run intermittently or need to share state across sessions, persistent storage makes a better cache backend. Fast.io's MCP server provides 251 tools for file operations, and these tools work well with a caching strategy. Here's how:
Store cache files in a dedicated workspace. Create a workspace called "agent-cache" and write serialized cache entries as JSON files. Each file represents one cached tool result, named by its cache key. When your MCP server starts up, it reads existing cache files to warm the cache instead of starting cold.
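The warm-up step looks the same for any file-backed store; here is a sketch with a plain local directory standing in for the synced workspace (the one-entry-per-JSON-file layout matches the file-based example earlier):

```python
import json
from pathlib import Path
from time import time

def warm_cache(cache_dir: Path, cache_store: dict,
               ttl_seconds: int = 3600) -> int:
    """Load still-fresh entries from disk into the in-memory cache at startup."""
    loaded = 0
    for path in cache_dir.glob("*.json"):
        entry = json.loads(path.read_text())
        # Skip entries that would already be expired.
        if time() - entry["timestamp"] < ttl_seconds:
            cache_store[path.stem] = entry  # filename stem doubles as the key
            loaded += 1
    return loaded
```

Warming on startup means the first agent request after a restart hits the cache instead of paying full execution cost again.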
Use file versioning as a cache history. Fast.io automatically versions files on update, so you get a history of past cache states. If a cache entry looks wrong, you can inspect or restore previous versions.
Set up webhooks for cross-agent invalidation. If multiple agents share the same cache workspace, webhooks notify other agents when cache files change. Agent A updates a cache entry, and agents B and C receive the webhook and refresh their local copies. The free agent tier includes 50GB of storage, 5,000 monthly credits, and 5 workspaces with no credit card required. That covers most caching scenarios, and the data persists indefinitely (no expiration).
Frequently Asked Questions
How do I add caching to an MCP server?
Add a middleware layer that intercepts tool calls before execution. Generate a cache key from the tool name and serialized arguments, check your storage backend (in-memory dict, local files, or Redis) for a matching key, and return the stored result if it exists and hasn't expired. If there's no cache hit, execute the tool normally and store the result with a TTL.
What should I cache in MCP?
Cache read-only, deterministic tool results such as database queries, document fetches, search results, and API responses. Avoid caching write operations (create, update, delete), non-deterministic outputs (random generators), or tools where real-time freshness is required. The highest-value cache targets are tools that call expensive external APIs.
How do I invalidate MCP cache?
Use a combination of TTL-based expiration, event-based invalidation, and versioned keys. TTLs automatically expire entries after a set period. Event-based invalidation clears specific keys when related write operations occur. Versioned keys let you force a full cache refresh by incrementing a version number in the key prefix.
Does caching improve MCP server performance?
Yes. Caching can reduce MCP server response times by 80-95% for repetitive tool calls. Instead of re-executing expensive logic on every request, the server returns a pre-computed result in milliseconds. This also lowers API costs, reduces token consumption, and protects against rate limits on external services.