How to Deploy AI Agents on Kubernetes
Learn how to deploy AI agents on Kubernetes for scalable, production-ready systems. This guide covers container orchestration, auto-scaling with HPA, multi-agent coordination, and persistent file storage using Fastio workspaces. Topics include YAML examples, deployment checklists, security hardening, and monitoring setup. Written for developers building AI agent Kubernetes deployment infrastructure.
What Is AI Agent Kubernetes Deployment?
Kubernetes deployment for AI agents enables scalable, orchestrated multi-agent systems with persistent state. Containers run agent code, while Kubernetes handles orchestration, scaling, and resilience.
According to the CNCF Annual Survey 2024, 60% of organizations run container workloads on Kubernetes, making it the standard for production AI systems. AI agent deployments have grown sharply year-over-year as teams move from local scripts to distributed systems.
Key components include:
- Deployments: Manage replica pods with rolling updates.
- Services: Expose agents for inter-pod communication.
- PersistentVolumes (PV): Store stateful data like model weights or conversation history.
- ConfigMaps/Secrets: Handle API keys for LLMs like OpenAI or Anthropic.
Benefits include automatic scaling based on CPU, memory, or custom metrics like queue length. Rollouts happen without downtime, and failed pods restart automatically. This reliability matters for agents handling customer queries or processing streams.
Fastio workspaces complement Kubernetes by providing shared file storage across pods. Agents use its 251 MCP tools for file operations, with built-in RAG for querying workspace documents. See Fastio AI for agent-specific features.
In practice, start with a simple agent that summarizes documents. Dockerize it, deploy to Minikube, then scale to EKS. Measure pod uptime and response latency to validate the setup.
Real-world example: A content generation pipeline uses three agent types: a researcher agent pulling web data, a writer agent creating drafts, and an editor agent reviewing output. Each runs in separate pods, coordinated through a Kafka message queue. Input documents land in a Fastio workspace, all three agents access them via MCP tools, and final output syncs back for human review. This pattern scales from a few pods to dozens without code changes, just HPA adjustments.
Prerequisites for Kubernetes AI Agent Setup
Set up a Kubernetes cluster first. For development, use Minikube or Kind on your laptop. For production, choose managed services: Amazon EKS, Google GKE, or Azure AKS. These handle control plane scaling and upgrades.
Development Cluster Options:
- Minikube: Single-node cluster, runs locally with VirtualBox or Docker driver. Good for initial testing.
- Kind (Kubernetes in Docker): Faster startup, runs containers as nodes. Preferred for CI/CD pipelines.
- k3s: Lightweight Kubernetes for edge or resource-constrained environments.
Production Cluster Options:
- Amazon EKS: Managed control plane, works alongside AWS services like IAM, S3, and CloudWatch.
- Google GKE: Autopilot mode handles node provisioning automatically. Strong GPU support.
- Azure AKS: Integrates with Azure AD for authentication. Good for Microsoft-heavy shops.
- Self-managed: Lower cost but requires dedicated ops effort for upgrades and security patches.
Install prerequisites:
- kubectl for cluster interaction. Configure with ~/.kube/config or cloud provider credentials.
- helm for package management. Add repos with helm repo add.
- k9s or Lens for visual management. Useful for debugging pod issues quickly.
- kubectx and kubens for switching between clusters and namespaces.
Dockerize your AI agent. The agent should expose health endpoints and handle graceful shutdowns:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
ENV PORT=8080
CMD ["uvicorn", "agent:app", "--host", "0.0.0.0", "--port", "8080"]
Build with docker build -t your-registry/agent:v1 . and push to ECR, GCR, or Docker Hub. Use tagging strategies like :latest for development and immutable semantic tags (for example, :v1.2.3) for production releases.
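The agent process this Dockerfile runs should expose the health endpoints and graceful shutdown mentioned above. A stdlib-only sketch of those routes and SIGTERM handling (a FastAPI app would expose the same /health and /ready paths; names here are illustrative):

```python
import json
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

READY = threading.Event()

class AgentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and serving
            self._reply(200, {"status": "alive"})
        elif self.path == "/ready":
            # Readiness: only route traffic once init (model load, etc.) is done
            if READY.is_set():
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "starting"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of stdout

def serve(port=8080):
    server = ThreadingHTTPServer(("0.0.0.0", port), AgentHandler)
    if threading.current_thread() is threading.main_thread():
        # Kubernetes sends SIGTERM before killing the pod; shut the server
        # down from a helper thread so in-flight requests can finish
        signal.signal(
            signal.SIGTERM,
            lambda *_: threading.Thread(target=server.shutdown).start(),
        )
    READY.set()  # in a real agent, set this only after initialization completes
    server.serve_forever()
```

The liveness and readiness split matters: a pod can be alive but not yet ready, and Kubernetes will hold traffic until /ready returns 200.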
Useful tools:
- KServe: For scalable ML inference with auto-scaling.
- Kubeflow: Pipelines for agent training and deployment.
- Helm charts: Community charts for LangGraph, CrewAI, or AutoGen frameworks.
- Argo CD: GitOps-based continuous delivery for agent updates.
- Istio: Service mesh for secure agent-to-agent communication with mTLS.
Resource planning: Agents need CPU for logic, GPU for inference (request nvidia.com/gpu: 1). Use Vertical Pod Autoscaler for memory adjustments. Test with kubectl top pods to monitor usage. Memory needs scale with context size: the 256Mi-512Mi range used later in this guide suits typical LLM request handling, with more headroom for large context windows.
Fastio integration starts here: Agents can pull files via URL import during init containers, avoiding local storage limits. This eliminates the need for persistent volumes on the Kubernetes side for file storage. Agents reference remote Fastio workspaces.
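That init-container step amounts to a small download script. A stdlib-only sketch of the idea (the actual import URLs would come from your Fastio workspace configuration; the destination path is illustrative):

```python
import pathlib
import urllib.request

def fetch_inputs(urls, dest_dir="/inputs"):
    """Download each URL into dest_dir (e.g. an emptyDir shared with the
    agent container); returns the local paths written."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        # Derive a filename from the last URL path segment
        name = url.rstrip("/").rsplit("/", 1)[-1] or "input.dat"
        target = dest / name
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            out.write(resp.read())
        paths.append(target)
    return paths
```

Run in an init container, this populates a shared emptyDir before the agent container starts, so the agent never needs a PersistentVolume for inputs.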
Step-by-Step Guide to Deploy a Single AI Agent
Follow this step-by-step process to deploy your first AI agent pod.
Step 1: Create Secrets for Sensitive Data
kubectl create secret generic agent-secrets \
--from-literal=api-key=sk-... \
--from-literal=anthropic-key=...
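Inside the container, these secrets arrive as environment variables (wired up via secretKeyRef in the Deployment of Step 2). A small fail-fast accessor so a misconfigured pod crashes at startup instead of at the first LLM call (helper name is illustrative):

```python
import os

def require_env(name):
    """Return the env var's value, or fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required env var {name}")
    return value

# Example: the Deployment maps agent-secrets/api-key to OPENAI_API_KEY
# api_key = require_env("OPENAI_API_KEY")
```

Crashing early lets the liveness probe and restart policy surface the misconfiguration immediately in kubectl get pods.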
Step 2: Deployment YAML
Expand the basic Deployment with liveness/readiness probes and resource limits:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
      - name: agent
        image: your-registry/agent:latest
        ports:
        - containerPort: 8080
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: agent-secrets
              key: api-key
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
Step 3: Service for Exposure
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-service
spec:
  selector:
    app: ai-agent
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-agent-ingress
spec:
  rules:
  - host: agent.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ai-agent-service
            port:
              number: 80
Step 4: Apply and Verify
kubectl apply -f agent-manifests.yaml
Check with kubectl get pods, kubectl logs, and kubectl port-forward svc/ai-agent-service 8080:80 (the Service's port 80 maps to the container's 8080).
Deployment Checklist:
- Secrets created securely
- Pods in Running state
- Probes passing (no restarts)
- Endpoint responds to curl /health
- Ingress routes traffic (if used)
Common issues: Image pull errors (check registry auth), OOM kills (increase limits), probe failures (adjust delays). Validate manifests locally with kubectl run agent-test --image=your-registry/agent:v1 --dry-run=client -o yaml before applying. Verify the agent responds to health checks before exposing via Ingress.
Resource tuning: Start with the 500m CPU / 512Mi memory limits above, monitor actual usage with kubectl top pods after a few hours of real traffic, then adjust. Agents handling long LLM context windows need more memory. GPU workloads require nvidia.com/gpu resources and the NVIDIA Device Plugin installed on the cluster.
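That tuning loop can be partly scripted. A sketch that parses kubectl top pods output (sample column layout matches kubectl's NAME / CPU(cores) / MEMORY(bytes) table) and flags pods approaching the 512Mi limit:

```python
def pods_near_memory_limit(top_output, limit_mi=512, threshold=0.8):
    """Return pod names whose memory usage is at or above threshold * limit.

    `top_output` is the text printed by `kubectl top pods`; memory values
    are assumed to be reported in Mi, as kubectl does for these sizes.
    """
    flagged = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        name, _cpu, mem = line.split()
        mem_mi = int(mem.rstrip("Mi"))
        if mem_mi >= threshold * limit_mi:
            flagged.append(name)
    return flagged
```

Pods that show up here repeatedly are candidates for a higher memory limit before the OOM killer finds them.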
Scaling to Multi-Agent Kubernetes Systems
Multi-agent systems require scaling and coordination. Kubernetes Horizontal Pod Autoscaler (HPA) handles replica growth based on metrics.
Basic HPA YAML:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
Advanced Scaling:
- Custom metrics: Use Prometheus Adapter for queue length or request rate.
- Vertical Pod Autoscaler (VPA): Auto-tune CPU/memory requests.
- Cluster Autoscaler: Add nodes during peaks.
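The HPA controller behind the YAML above scales by the documented formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to the min/max bounds. A sketch for reasoning about thresholds before a load test:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=20):
    """Kubernetes HPA scaling formula, clamped to the configured bounds.

    Utilization values are percentages, matching averageUtilization targets.
    """
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# With the 50% CPU target above: 3 pods at 100% utilization scale to 6
# desired_replicas(3, 100, 50) -> 6
```

Plugging in load-test numbers this way shows whether a target of 50% leaves enough headroom, or whether bursts will hit maxReplicas.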
For stateful multi-agents:
- StatefulSets: Ordered pods with stable identities (e.g., agent-0, agent-1).
- DaemonSets: One agent per node for distributed tasks like monitoring.
Coordination patterns:
- Leader election: Kubernetes Leases API.
- Message queues: Kafka for task distribution, Redis for pub/sub.
- Service mesh (Istio): Traffic management, mTLS between agents.
Example: Deploy a supervisor agent that routes tasks to worker agents via Kafka topics. Workers pull from Fastio workspaces using MCP tools for input files. For teams building this pattern, see Fastio AI product page for workspace configuration.
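The supervisor's routing logic can be sketched in-process, with queue.Queue standing in for Kafka topics (topic and task names are illustrative; a real deployment swaps in a Kafka producer and per-topic consumer groups):

```python
import queue

# One queue per worker type; in production these would be Kafka topics
TOPICS = {"research": queue.Queue(), "write": queue.Queue(), "edit": queue.Queue()}

def supervisor_route(task):
    """Route a task dict to the topic matching its declared type."""
    TOPICS[task["type"]].put(task)

def worker_poll(topic_name):
    """One iteration of a worker pod's loop: pull a task from its topic."""
    try:
        return TOPICS[topic_name].get_nowait()
    except queue.Empty:
        return None  # nothing queued; worker sleeps or long-polls
```

Because routing is keyed only on task type, adding worker replicas is purely an HPA change: more consumers pull from the same topic with no supervisor changes.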
Monitor scaling with kubectl get hpa, adjust targets based on load tests. Run load tests with tools like k6 or Locust to simulate concurrent users before going production. Track metrics like requests per second, average latency, and error rate at different replica counts to find optimal HPA thresholds.
For GPU workloads, configure NVIDIA Device Plugin and set GPU limits in the Deployment spec. Use time-slicing for cost optimization if full GPUs aren't always needed.
Give Your AI Agents Persistent Storage
Fastio workspaces: 50GB free, 251 MCP tools, locks, RAG. Agents and humans work together. Built for AI agent Kubernetes deployment workflows.
Persistent File Sharing for Multi-Agent Systems
Standard PVCs provide block storage but struggle with multi-pod concurrency and features like search. Fastio workspaces solve this for AI agents.
Agents access files via 251 MCP tools over Streamable HTTP or SSE, no Kubernetes volumes needed. Key advantages:
- File locks: Acquire/release to prevent race conditions in multi-writes.
- Webhooks: Real-time notifications on uploads/changes, trigger pod restarts or new tasks.
- URL Import: Pull from Google Drive, OneDrive, etc., via OAuth without pod storage.
- Intelligence Mode: Auto-index files for RAG queries with citations, no external vector DB.
- Free Agent Tier: 50GB storage, 5,000 credits/month, 5 workspaces, no credit card.
Integration example with OpenClaw:
clawhub install dbalve/fast-io
# Now use natural language: "Upload report.pdf to workspace/project"
Or direct HTTP calls from agent code. A minimal sketch; the endpoint URL and payload shape here are illustrative stand-ins, not the documented Fastio API:

import requests

with open("report.pdf", "rb") as file_stream:
    response = requests.post(
        "https://api.example.com/storage-for-agents/",  # hypothetical upload endpoint
        data={"workspace_id": "ws_123"},
        files={"file": file_stream},
    )
response.raise_for_status()
Persistence survives pod evictions. Agents checkpoint state to Fastio, query via RAG for context. Humans review outputs in the same workspace UI. See Fastio workspaces for collaboration features.
Compared to S3: Fastio adds agent-native tools, collaboration, and intelligence without custom Lambda glue.
Fastio audit logs track file access across agents, aiding debugging. Use webhooks to notify on anomalies.
Agent Workspace Integration Patterns
Most Kubernetes AI agent tutorials stop at pod deployment. They miss the critical piece: how agents share files, coordinate work, and hand off to humans. This gap costs teams weeks of integration work.
The Multi-Pod File Problem
When multiple agent pods need to access the same files, typical solutions fall short. NFS volumes require complex provisioning. S3 buckets need custom sync logic. Neither provides file locking or real-time notifications. Agents overwrite each other's work, miss updated inputs, or poll endlessly for changes.
Fastio Workspace Pattern
Fastio workspaces provide a different model. Instead of mounting storage into pods, agents access files through 251 MCP tools. Each pod runs independently, calling the Fastio API for file operations. This eliminates shared filesystem complexity entirely.
Implementation pattern for a document processing pipeline:
- Supervisor pod receives incoming documents via webhook, writes to workspace.
- Worker pods poll workspace for new files (or receive notifications via webhook).
- Each worker acquires file lock before processing, releases after.
- Output writes back to workspace, triggers next stage or human review.
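The lock-then-process step above can be sketched as a context manager. Here an in-memory set stands in for the workspace lock store; a real worker would call the Fastio MCP lock tools instead (all names are illustrative):

```python
import contextlib
import threading

_held = set()            # stand-in for the workspace's lock store
_guard = threading.Lock()

@contextlib.contextmanager
def file_lock(path):
    """Acquire an exclusive lock on `path`; always release on exit."""
    with _guard:
        if path in _held:
            raise RuntimeError(f"{path} is locked by another worker")
        _held.add(path)
    try:
        yield path
    finally:
        with _guard:
            _held.discard(path)

def process(path):
    # Lock is held only for the duration of the work, then released
    with file_lock(path):
        return f"processed {path}"
```

The try/finally release is the important part: a worker that crashes mid-task must not leave the file locked forever (real lock APIs typically add a lease timeout for the pod-eviction case).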
This pattern works across clouds. The agent pods run on EKS, GKE, or self-hosted Kubernetes. Fastio handles storage separately, avoiding cloud-specific volume drivers.
Human-Agent Collaboration
Kubernetes deployments typically separate agent output from human review. Fastio bridges this by giving agents and humans the same workspace. A developer builds an agent that generates code, tests it, and writes results to a shared workspace. The human opens the same workspace in their browser, reviews outputs, and leaves comments. The agent sees the comments on its next run and adjusts. This tight loop (agent builds, human gives feedback, agent iterates) takes minutes rather than hours of file transfer.
The free agent tier includes five workspaces, enough to separate staging from production or split projects. Agents and humans see the same files, the same version history, the same audit log. No sync scripts, no S3 bucket juggling.
Ownership Transfer
Build agents create workspaces, populate them with generated content, then transfer ownership to humans. The agent keeps admin access for ongoing maintenance, but the human owns the data. This matters for agencies delivering client work. The agent does the production work; the client receives the final artifacts without seeing the build process.
This workspace-centric architecture is what competitors miss. They focus on container orchestration but skip the file layer that makes multi-agent systems actually work.
Production Best Practices and Monitoring
Production deployments need security, monitoring, and safe updates.
Security:
- Secrets: External Vault or Sealed Secrets over base64 Kubernetes Secrets.
- Network: NetworkPolicies to restrict agent-to-agent traffic.
- RBAC: Limit service accounts to necessary resources.
- Pod Security Standards: Enforce non-root, read-only FS.
Monitoring:
- Prometheus for metrics (CPU, latency, error rates).
- Grafana dashboards for agent-specific views (tokens used, tasks completed).
- ELK or Loki for structured logs.
- Alerts: PagerDuty on high error rates or OOM.
Updates:
- Argo Rollouts: Canary/blue-green with traffic shifting.
- Flux or Argo CD for GitOps.
Frequently Asked Questions
How to deploy AI agents on Kubernetes?
Containerize your agent code with LLM clients, create Deployment and Service YAMLs with secrets for API keys, apply with kubectl, and expose via Ingress. Test with port-forward and health checks. Scale later with HPA.
Best practices for multi-agent K8s deployments?
Use StatefulSets for ordered agents, leader election via Leases API, shared storage like Fastio workspaces with file locks, HPA for scaling, NetworkPolicies for security, and Prometheus for monitoring.
What is the best storage for AI agents on Kubernetes?
PersistentVolumes for basic state, but Fastio agent workspaces excel with 251 MCP tools, file locks, webhooks, RAG intelligence, and free 50GB tier. Access via API without volume mounts.
How do you scale AI agents on Kubernetes?
Deploy HorizontalPodAutoscaler targeting CPU (for example, 50% average utilization), memory, or custom metrics like queue depth via Prometheus Adapter. Set bounds such as minReplicas: 3 and maxReplicas: 20. Use Cluster Autoscaler for nodes.
What tools help with Kubernetes AI agent setup?
KServe for inference serving, Kubeflow for pipelines, Helm for frameworks like CrewAI, Argo CD for GitOps, and Fastio for persistent file sharing across pods.
How to handle secrets in AI agent deployments?
Use Kubernetes Secrets or HashiCorp Vault. Reference via envFrom or secretKeyRef in Deployment spec. Rotate keys with external secrets operator and avoid hardcoding.
Can AI agents share files across Kubernetes pods?
Yes, with shared storage like NFS PVCs or cloud volumes, but Fastio provides agent-optimized features: concurrent locks, URL imports, webhooks, and RAG without managing infrastructure.
What monitoring for Kubernetes AI agents?
Prometheus scrapes metrics (latency, errors, tokens), Grafana dashboards, Loki for logs. Alert on high error rates or pod restarts. Fastio audit logs track file operations.