AI & Agents

How to Run Hermes Agent with Ollama and Local LLMs

Hermes Agent from Nous Research can run entirely on your own hardware using Ollama as the inference backend, with no API keys, no cloud calls, and no per-token costs. This guide covers installation, model selection by VRAM budget, context window configuration, performance tuning, and how to persist agent-generated files in a shared workspace for handoff to humans.

Fast.io Editorial Team 10 min read
AI agent operating within a shared workspace environment

What Running Hermes Agent Locally Gets You

Hermes Agent is an open-source (MIT-licensed) autonomous AI agent from Nous Research. It ships with persistent memory, over 70 built-in skills, scheduled automations, and connections to messaging platforms like Telegram, Discord, Slack, WhatsApp, Signal, and email. By default it connects to cloud LLM providers like OpenRouter, Anthropic, or OpenAI, but its architecture treats the model as a swappable component. Any service that implements the OpenAI-compatible /v1/chat/completions endpoint works as a drop-in backend.

Ollama is the simplest way to serve models locally. It downloads quantized model weights, manages GPU memory allocation, and exposes an OpenAI-compatible API at http://localhost:11434/v1. Once Ollama is running a model, pointing Hermes at it takes one configuration change.

The practical benefits of local inference break down into three categories. Cost: there are zero API fees, which matters when an agent session can burn through hundreds of thousands of tokens during extended reasoning chains. Privacy: no prompts, tool outputs, or file contents leave your network. Availability: the agent works offline, on airgapped machines, or in environments where outbound API traffic is restricted. The tradeoff is hardware. You need a GPU with enough VRAM to hold the model weights and a 32K-to-64K token context window in memory simultaneously. The next section covers exactly how much.

Hardware Requirements by Model Size

Hermes Agent's tool-calling, memory, and compression features consume context aggressively. The official docs recommend at least 32,768 tokens of context, and 64K or higher gives the agent more room for multi-step reasoning. That context window lives in GPU memory alongside the model weights, so VRAM is the binding constraint.

Here is what to expect at Q4_K_M quantization, the sweet spot between quality and memory use:

8B parameter models (Hermes 3 8B, Qwen 2.5 Coder 7B, Gemma 3 9B)

  • VRAM: 6 GB minimum, 8 GB comfortable
  • Context: 32K tokens fits in 8 GB total. 64K tokens needs 10-12 GB.
  • Cards: RTX 3060 12GB, RTX 4060 8GB (tight at 64K), any card with 12GB+
  • Speed: 40-60 tokens/sec on RTX 4060, fast enough for interactive use

27B parameter models (Qwen 2.5 Coder 32B, Gemma 4 27B)

  • VRAM: 16 GB minimum, 24 GB recommended
  • Context: 32K fits in 20 GB. 64K needs 24 GB+.
  • Cards: RTX 3090, RTX 4090, any 24GB card
  • Speed: 15-25 tokens/sec on RTX 4090

70B parameter models (Llama 3.1 70B, Hermes 3 70B)

  • VRAM: 48 GB+ for full GPU inference
  • Cards: A6000, dual RTX 3090 (via tensor parallelism in vLLM), A100
  • Speed: 5-10 tokens/sec on single A6000

CPU-only fallback: Ollama can run models on CPU with 8 GB of system RAM. Expect 15-20 tokens/sec on a modern desktop processor. Usable for testing and light tasks, too slow for production agent loops that chain dozens of tool calls.

For most developers starting out, an 8B model with 32K context on a 12 GB GPU is the right entry point. You get fast inference, low cost, and enough capability for file operations, web searches, and code generation.

Neural network visualization representing local AI model inference

Step-by-Step: Install Ollama and Pull a Model

Install Ollama Download Ollama from ollama.com for macOS, Linux, or Windows. On Linux, the one-liner works:

curl -fsSL https://ollama.com/install.sh | sh

Verify the install by checking the version:

ollama --version

Ollama 0.5 or later is required for the context length environment variable used below.

Pull a Model

Choose a model based on your VRAM budget from the previous section. For a 12 GB GPU, qwen2.5-coder:32b is too large, but qwen2.5-coder:7b fits with room to spare:

ollama pull qwen2.5-coder:7b

Other solid choices for agent work:

  • hermes3:8b: Nous Research's own model, fine-tuned for function calling
  • gemma3:12b: Google's compact model with strong instruction following
  • qwen2.5-coder:32b: Best code quality, but needs 24 GB VRAM

Start Ollama with Extended Context

Ollama defaults to 4,096 tokens of context, which is far too small for Hermes Agent. Set the context length before starting the server:

OLLAMA_CONTEXT_LENGTH=32768 ollama serve

For 64K context (recommended if your VRAM allows):

OLLAMA_CONTEXT_LENGTH=65536 ollama serve

On Linux with systemd, make this permanent:

sudo systemctl edit ollama.service

Add the following override:

[Service]
Environment="OLLAMA_CONTEXT_LENGTH=32768"

Then restart the service:

sudo systemctl restart ollama

Verify the model is loaded with the right context:

ollama ps

The CONTEXT column should show your configured value, not the 4K default.

Fastio features

Persist your local agent's output in a shared workspace

Fast.io gives your Hermes Agent 50 GB of free cloud storage with MCP access, file versioning, and one-click handoff to humans. No credit card, no trial expiration.

Configure Hermes Agent to Use Ollama

Install Hermes Agent

If you don't have Hermes installed yet:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.bashrc

Or install via pip if you prefer managing Python environments directly:

pip install hermes-agent

Interactive Setup

The fast path is the interactive model selector:

hermes model

Select Custom OpenAI-compatible endpoint when prompted. Enter these values:

  • API Base URL: http://localhost:11434/v1
  • API Key: leave blank (Ollama doesn't require one locally)
  • Model name: the exact tag you pulled, like qwen2.5-coder:7b
  • Context length: 32768 (or 65536 if you configured Ollama for 64K)

Hermes saves this to ~/.hermes/config.yaml and secrets to ~/.hermes/.env automatically.

Manual Configuration

If you prefer editing the config file directly:

### ~/.hermes/config.yaml
model:
  default: qwen2.5-coder:7b
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 32768

No API key entry is needed in ~/.hermes/.env for local Ollama.

Verify the Connection Start Hermes and confirm it connects to your local model:

hermes

Type a simple prompt like "What model are you running?" The agent should respond using your local model with no network calls. If you see connection errors, confirm Ollama is running (ollama ps) and that port 11434 is not blocked by a firewall.

Windows WSL2 Note

Hermes Agent requires a Unix environment. On Windows, install WSL2 with wsl --install, then run Hermes inside WSL. If Ollama runs on the Windows host, WSL2's NAT networking can't reach localhost. Two fixes:

Enable mirrored networking (Windows 11 22H2+) by adding to .wslconfig:

[wsl2]
networkingMode=mirrored

Or find the Windows host IP with ip route show | grep default | awk '{print $3}' and use that IP in your Hermes config instead of localhost. Make sure Ollama on Windows binds to 0.0.0.0 by setting OLLAMA_HOST=0.0.0.0.

Performance Tuning and Troubleshooting

Local models behave differently from cloud APIs in ways that affect agent reliability. Here are the common issues and how to fix them.

Tool Calls Appear as Raw JSON Text

This means Ollama is serving the model without proper tool-call parsing. Hermes expects structured tool-call responses, not JSON dumped into the chat stream. The fix depends on your setup:

  • For Ollama: upgrade to 0.5+ where tool-call support is built in for supported model families
  • For vLLM: add --enable-auto-tool-choice --tool-call-parser hermes flags
  • For llama.cpp: add the --jinja flag, which enables the chat template's tool-call formatting

Incoherent or Truncated Responses

Almost always a context window problem. Check ollama ps to verify the context column shows your configured value. If it shows 4096, Ollama didn't pick up the environment variable. Restart the Ollama service after setting OLLAMA_CONTEXT_LENGTH.

Another cause is running out of VRAM mid-session. As the conversation grows and fills the context window, the KV cache expands. A model that fits fine at session start can OOM at 50K tokens. Monitor GPU memory with nvidia-smi and consider dropping to a smaller model or lower context if you see OOM kills.

Slow Response Times

If generation speed drops below 10 tokens/sec, the model is likely spilling from GPU to system RAM. Options:

  • Use a smaller quantization (Q3_K_M instead of Q4_K_M) to reduce VRAM footprint
  • Lower the context length from 64K to 32K
  • Switch to a smaller model (8B instead of 27B)

For Hermes specifically, enable context compression in your config to keep the active window smaller:

compression:
  enabled: true
  threshold: 0.50
  target_ratio: 0.20

This compresses older messages when the context hits 50% capacity, preserving the most recent 20% verbatim. It reduces the effective VRAM demand during long sessions.

Switching Models Mid-Session Inside an active Hermes chat, use the /model command to swap between any configured provider without losing conversation history:

/model custom:qwen2.5-coder:7b

This is useful for testing different models against the same task, or for switching to a larger model when a smaller one can't handle a complex multi-step operation.

AI-powered analysis and audit trail visualization

Persist Agent Output in a Shared Workspace

Running Hermes Agent locally solves the inference problem, but creates a new one: file isolation. The agent generates files, code, reports, and artifacts that live on your local machine. If you're working with a team, handing off that output means zipping folders, emailing attachments, or pushing to a Git repo manually.

A more practical pattern is to connect your local Hermes instance to a cloud workspace that both agents and humans can access. Fast.io works well here because it exposes an MCP server that agents can call directly, and humans access the same files through a browser.

The workflow looks like this: Hermes Agent runs locally with Ollama for private, zero-cost inference. When it produces output worth sharing, it writes files to a Fast.io workspace via the MCP server or REST API. Team members browse, comment on, and download those files through the web interface. The agent keeps admin access even after transferring ownership to a human.

This separation matters because it preserves local inference privacy while giving you a collaboration layer. Your prompts and reasoning chains never leave your machine. Only the finished artifacts go to the shared workspace.

What Fast.io adds to a local Hermes setup:

  • Persistent storage that survives machine restarts and agent session resets (50 GB free, no credit card)
  • Workspace-level permissions so different team members see different outputs
  • Built-in Intelligence Mode that auto-indexes uploaded files for semantic search and AI chat
  • Metadata Views for extracting structured data from uploaded documents
  • Branded shares for delivering agent output to external clients
  • Audit trails showing who accessed or modified each file

You can also use local storage, S3 buckets, or Google Drive for persistence. The advantage of a workspace designed for agent-human handoff is that it handles permissions, versioning, and discoverability without you building that plumbing yourself.

For the full setup guide on connecting any agent to Fast.io workspaces, see Storage for Agents.

Frequently Asked Questions

Can Hermes Agent run completely offline?

Yes. With Ollama serving a local model, Hermes Agent operates with no internet connection. All inference, memory, tool execution, and skill usage happen on your machine. The only features that require connectivity are web search, messaging gateways (Telegram, Discord, etc.), and cloud storage integrations. If you disable those, the agent is fully airgapped.

What models work best with Hermes Agent locally?

For 8-12 GB VRAM, Hermes 3 8B and Qwen 2.5 Coder 7B are the strongest options for agent tasks that involve tool calling and code generation. With 24 GB VRAM, Qwen 2.5 Coder 32B and Gemma 4 27B give noticeably better reasoning quality. The key constraint is context length, not model size alone. Any model needs at least 32K tokens of context to work reliably with Hermes, and 64K is better for multi-step sessions.

How much VRAM does Hermes Agent need?

Hermes Agent itself uses negligible resources. The VRAM requirement comes from the local model. At Q4_K_M quantization, an 8B model needs about 6 GB for weights plus 2-4 GB for a 32K-token context window, totaling 8-10 GB. A 27B model needs 16-20 GB for weights plus 4-5 GB for context. Budget your GPU memory for model weights plus the KV cache for your target context length.

Is Hermes Agent free when using Ollama?

Yes. Hermes Agent is MIT-licensed open-source software, and Ollama is free. There are no API costs, subscription fees, or usage limits when running both locally. The only cost is your hardware and electricity. Cloud provider integrations (OpenRouter, Anthropic, OpenAI) have their own pricing if you choose to use them instead.

How do I switch between local and cloud models?

Run `hermes model` to configure a new provider, or use the `/model` command inside an active chat session to swap without losing conversation history. You can also set up fallback providers in config.yaml so Hermes automatically tries a cloud API if the local model fails or is unavailable.

Can I use Hermes Agent with Apple Silicon Macs?

Yes. Ollama supports Apple Silicon natively and uses the unified memory architecture for inference. An M1 with 16 GB unified memory can run 8B models at reasonable speeds. M2 Pro/Max/Ultra with 32-96 GB of unified memory can handle 27B or even 70B models since the GPU and CPU share the same memory pool. Set context length the same way as on Linux.

Related Resources

Fastio features

Persist your local agent's output in a shared workspace

Fast.io gives your Hermes Agent 50 GB of free cloud storage with MCP access, file versioning, and one-click handoff to humans. No credit card, no trial expiration.