How to Configure Multi-Model LLM Routing in Hermes Agent
Hermes Agent separates heavyweight reasoning from specialized side-tasks through a dual-model architecture with eight auxiliary slots. This guide walks through configuring main and auxiliary models, building fallback chains across providers, pooling credentials for rate-limit resilience, and using the Pareto Code Router to auto-select the cheapest coding model that meets your quality bar.
How Multi-Model Routing Works in Hermes Agent
Most AI coding agents bind to a single model for everything. Hermes Agent from Nous Research takes a different approach: it runs a primary reasoning model for core conversation and tool-call loops, then routes eight specialized task types to independent model slots. Each slot can point to a different provider, model, and set of credentials.
This matters because not every task needs the same model. Session titles don't require a reasoning-heavy model. Image analysis needs vision capabilities your main model might lack. Context compression burns tokens on expensive models when a smaller one handles summarization just as well. By splitting these concerns, you reduce cost without sacrificing quality where it counts.
The routing architecture breaks into four layers:
- Main model handles user messages, tool calls, and streamed responses
- Auxiliary slots route vision, compression, session search, approval, web extraction, skills hub, MCP, and title generation independently
- Fallback chains switch to backup providers when the primary fails
- Credential pools rotate multiple API keys for the same provider to survive rate limits
Hermes supports over 100 models across OpenAI, Anthropic, Google Gemini, OpenRouter, xAI, Ollama, vLLM, SGLang, llama.cpp, LM Studio, and regional providers like Kimi, MiniMax, Alibaba DashScope, and Tencent TokenHub. Any OpenAI-compatible endpoint works as a custom provider.
How to Configure the Main Model and Auxiliary Slots
The fast way to set your main model is the interactive wizard:
hermes model
This walks through provider selection, authentication (OAuth or API key), and model picking. The result persists to ~/.hermes/config.yaml. For non-interactive setup, edit the config directly:
model:
provider: openrouter
default: anthropic/claude-opus-4.7
base_url: ''
api_mode: chat_completions
You can also switch models mid-session without restarting:
/model gpt-5.4 --provider openrouter # session-only
/model gpt-5.4 --provider openrouter --global # persists to config
Auxiliary Model Overrides
Eight task slots default to auto, which delegates to whatever your main model is. Override individual slots when a cheaper or more capable model fits:
auxiliary:
vision:
provider: openrouter
model: google/gemini-2.5-flash
base_url: ''
api_key: ''
timeout: 120
compression:
provider: openrouter
model: google/gemini-3-flash-preview
title_gen:
provider: openrouter
model: google/gemini-3-flash-preview
approval:
provider: anthropic
model: claude-haiku-4.5
Each auxiliary entry accepts provider, model, base_url, api_key, timeout, extra_body, and download_timeout fields. Leave a slot on provider: auto with an empty model to keep it on the main model.
Which Slots to Override
The dashboard's Models page shows usage analytics with token counts and costs per model. Each model card has a "Use as" dropdown for one-click assignment to main, all auxiliary, or individual slots.
Persist Hermes Agent outputs across sessions and model switches
50GB free workspace with MCP-native access. Your agent writes files, Intelligence Mode indexes them, humans pick up the results. No credit card, no expiration.
How to Build Fallback Chains Across Providers
Fallback chains keep your session alive when a provider goes down. Rather than stopping mid-conversation, Hermes switches to a backup provider automatically and restores the primary on the next user message.
Basic Fallback Configuration
fallback_model:
provider: openrouter
model: anthropic/claude-sonnet-4
For multiple sequential fallbacks, use the fallback_providers list:
hermes fallback add
hermes fallback list
hermes fallback remove
What Triggers Fallback
Hermes activates fallback on these conditions:
- Rate limits (429) after exhausting retries
- Server errors (500, 502, 503) after exhausting retries
- Auth failures (401, 403) immediately
- Not found (404) immediately
- Malformed responses repeated empty or invalid outputs
Fallback is turn-scoped: each new user message starts fresh with the primary model. Within a single turn, fallback fires at most once. If the fallback provider also fails, normal error handling takes over rather than cascading indefinitely.
Custom Endpoint Fallback
Point fallback to a local model for zero-cost resilience:
fallback_model:
provider: custom
model: my-local-model
base_url: http://localhost:8000/v1
key_env: MY_LOCAL_KEY
Auxiliary Task Fallback
Each auxiliary slot has its own independent fallback chain. For text tasks, the auto-detection order is:
- OpenRouter
- Nous Portal
- Custom endpoints
- Codex
- API-key providers
Vision tasks follow a different chain: Main provider, then OpenRouter, Nous Portal, Codex, and Anthropic. If compression becomes unavailable entirely, Hermes degrades gracefully by dropping middle conversation turns without summaries rather than failing the session.
Credential Pools for Rate-Limit Resilience
Credential pools solve a different problem than fallback chains. Instead of switching providers when one fails, pools rotate multiple API keys for the same provider. This is critical for teams running multiple concurrent agents or hitting per-key rate limits on providers like OpenRouter or Anthropic.
Adding Keys to a Pool
hermes auth add openrouter --type api-key --api-key sk-or-v1-second-key
hermes auth add anthropic --type api-key --api-key sk-ant-api03-second-key
hermes auth add openrouter --type oauth
Each hermes auth add call appends to the pool for that provider. View all pools with hermes auth list.
Rotation Strategies
Configure how Hermes picks keys from a pool in config.yaml:
credential_pool_strategies:
openrouter: round_robin
anthropic: least_used
Available strategies:
fill_first(default): Use the first healthy key until it hits limits, then move to the nextround_robin: Cycle through keys evenly across requestsleast_used: Always pick the key with the lowest request countrandom: Random selection among healthy keys
Error Handling and Cooldowns
The rotation logic handles different failure modes:
- 429 rate limit: Retry the same key once (handles transient blips), rotate on second 429, apply 1-hour cooldown
- 402 billing/quota: Immediately rotate to next key, apply 24-hour cooldown
- 401 auth expired: Attempt OAuth token refresh first, rotate if refresh fails
- All keys exhausted: Fall through to
fallback_modelif configured
Subagent Credential Sharing
When Hermes spawns subagents via delegate_task, the parent's credential pool extends to children automatically. Per-task credential leasing prevents conflicts when multiple subagents rotate keys concurrently. The pool implementation uses threading locks for all state mutations, so concurrent access stays safe.
Pool Management Commands
Pareto Code Router for Cost-Optimized Coding
The Pareto Code Router is an OpenRouter feature that Hermes Agent integrates natively. Instead of committing to a specific coding model, you express a quality threshold and the router picks the cheapest model that meets it.
How It Works
OpenRouter maintains a curated shortlist of coding models ranked by their Artificial Analysis coding percentile. You set a min_coding_score between 0 and 1. The router maps your score to a quality tier, filters available models, and picks the cheapest one in that tier.
Configuration in Hermes
Set the Pareto Code Router as your main model in ~/.hermes/config.yaml:
model:
provider: openrouter
default: openrouter/pareto-code
Tune the quality threshold:
model:
provider: openrouter
default: openrouter/pareto-code
extra_body:
min_coding_score: 0.8
The default min_coding_score is 0.65 if you don't specify one. Higher values mean stronger (and more expensive) coders. Lower values optimize for cost.
What Makes This Useful
Selection is deterministic for a given score on any particular day, but the actual model behind it shifts as the Pareto frontier moves. New models get benchmarked, existing models get repriced, and the router adapts. You never need to update your config when a better option appears at the same price point.
The :nitro variant (openrouter/pareto-code:nitro) optimizes for speed instead of price, selecting by p50 throughput rather than cost per token.
No additional routing fees apply. You only pay the underlying model's per-token price. The response includes a model field revealing which concrete model handled your request, so you always know what you're getting.
Combining with Auxiliary Slots
A practical setup: use Pareto Code Router for your main model (coding tasks) while pinning auxiliary slots to known-cheap models for non-coding work:
model:
provider: openrouter
default: openrouter/pareto-code
extra_body:
min_coding_score: 0.7
auxiliary:
title_gen:
provider: openrouter
model: google/gemini-3-flash-preview
compression:
provider: openrouter
model: google/gemini-3-flash-preview
vision:
provider: openrouter
model: google/gemini-2.5-flash
approval:
provider: anthropic
model: claude-haiku-4.5
This gives you frontier-quality coding on the main thread and minimal spend on everything else.
Persistent Storage for Multi-Model Agent Workflows
Multi-model routing solves the compute layer. But agents running complex workflows still need persistent file storage that survives across sessions, model switches, and provider outages.
When Hermes Agent switches between models mid-session or falls back to a different provider, the conversation context transfers. Files the agent has been working with need the same continuity. Local filesystem works for single-machine setups, but breaks down when you run agents on remote servers, Docker containers, or Modal instances.
Options for Persistent Agent Storage
Local filesystem works if your agent runs on one machine and you never need to share results. Simple, zero-config, but fragile.
S3 or GCS gives durability and access control, but requires infrastructure setup, IAM configuration, and custom tooling for your agent to read and write files.
Fast.io workspaces provide a middle path: 50GB free storage with an MCP server that any LLM-powered agent can call directly. Files uploaded to a workspace are auto-indexed for semantic search through Intelligence Mode, so your agent can query its own outputs later without building a separate retrieval pipeline.
The handoff pattern works well with multi-model setups: your Hermes Agent generates artifacts (reports, code, analyses) using whichever model the router selects, stores them in a shared workspace, and a human collaborator picks up the finished work through the same workspace UI. No manual file transfers, no "which server was that on?" questions.
For teams running multiple Hermes instances with credential pools, Fast.io's workspace permissions let each agent write to designated folders while humans retain read access everywhere. File locks prevent conflicts when parallel agents touch the same documents.
The free agent tier includes 50GB storage, 5,000 AI credits per month, 5 workspaces, and MCP access with no credit card required.
Frequently Asked Questions
How do I switch models in Hermes Agent?
Use the /model slash command in any active session. For example, /model claude-sonnet-4 --provider anthropic switches immediately. Add --global to persist the change to config.yaml for future sessions. You can also run hermes model for an interactive wizard that walks through provider and model selection.
Can Hermes Agent use multiple AI providers at once?
Yes. The auxiliary model system lets you assign different providers to different task types simultaneously. Your main model might run on Anthropic while vision uses Google Gemini, compression uses a cheap OpenRouter model, and approval scoring runs on a fast Haiku instance. Each slot operates independently with its own authentication.
What is the Pareto Code Router in Hermes Agent?
The Pareto Code Router (openrouter/pareto-code) auto-selects the cheapest coding model that meets your quality threshold. Set a min_coding_score between 0 and 1 in your config, and OpenRouter picks from a curated shortlist ranked by coding benchmarks. The selection updates automatically as new models appear or pricing changes, so you always get optimal price-to-performance without manual updates.
How do I set up fallback models in Hermes Agent?
Add a fallback_model block to ~/.hermes/config.yaml with provider and model fields. Hermes switches to this model automatically when the primary fails due to rate limits, server errors, or auth issues. Fallback is turn-scoped, meaning each new user message retries the primary first. Use hermes fallback add for interactive setup, or define multiple fallbacks in the fallback_providers list for sequential failover.
What rotation strategies does Hermes Agent support for credential pools?
Four strategies are available: fill_first (default, uses one key until exhausted), round_robin (cycles evenly), least_used (picks lowest request-count key), and random. Configure per-provider in config.yaml under credential_pool_strategies. Keys that hit rate limits get a 1-hour cooldown; billing errors trigger 24-hour cooldowns with immediate rotation.
Do auxiliary model changes apply to existing sessions?
No. Config changes only apply to new sessions. Existing chat sessions retain their model assignments until you explicitly use the /model command to hot-swap within that session. For gateway integrations (Telegram, Discord, Slack), restart the gateway process to force all new sessions onto the updated config.
Related Resources
Persist Hermes Agent outputs across sessions and model switches
50GB free workspace with MCP-native access. Your agent writes files, Intelligence Mode indexes them, humans pick up the results. No credit card, no expiration.