
How to Configure Multi-Model LLM Routing in Hermes Agent

Hermes Agent separates heavyweight reasoning from specialized side-tasks through a dual-model architecture with eight auxiliary slots. This guide walks through configuring main and auxiliary models, building fallback chains across providers, pooling credentials for rate-limit resilience, and using the Pareto Code Router to auto-select the cheapest coding model that meets your quality bar.

Fast.io Editorial Team 11 min read

How Multi-Model Routing Works in Hermes Agent

Most AI coding agents bind to a single model for everything. Hermes Agent from Nous Research takes a different approach: it runs a primary reasoning model for core conversation and tool-call loops, then routes eight specialized task types to independent model slots. Each slot can point to a different provider, model, and set of credentials.

This matters because not every task needs the same model. Session titles don't require a reasoning-heavy model. Image analysis needs vision capabilities your main model might lack. Context compression burns tokens on expensive models when a smaller one handles summarization just as well. By splitting these concerns, you reduce cost without sacrificing quality where it counts.

The routing architecture breaks into four layers:

  • Main model handles user messages, tool calls, and streamed responses
  • Auxiliary slots route vision, compression, session search, approval, web extraction, skills hub, MCP, and title generation independently
  • Fallback chains switch to backup providers when the primary fails
  • Credential pools rotate multiple API keys for the same provider to survive rate limits

Hermes supports over 100 models across OpenAI, Anthropic, Google Gemini, OpenRouter, xAI, Ollama, vLLM, SGLang, llama.cpp, LM Studio, and regional providers like Kimi, MiniMax, Alibaba DashScope, and Tencent TokenHub. Any OpenAI-compatible endpoint works as a custom provider.


How to Configure the Main Model and Auxiliary Slots

The fastest way to set your main model is the interactive wizard:

hermes model

This walks through provider selection, authentication (OAuth or API key), and model picking. The result persists to ~/.hermes/config.yaml. For non-interactive setup, edit the config directly:

model:
  provider: openrouter
  default: anthropic/claude-opus-4.7
  base_url: ''
  api_mode: chat_completions

You can also switch models mid-session without restarting:

/model gpt-5.4 --provider openrouter              # session-only
/model gpt-5.4 --provider openrouter --global     # persists to config

Auxiliary Model Overrides

Eight task slots default to auto, which delegates to whatever your main model is. Override individual slots when a cheaper or more capable model fits:

auxiliary:
  vision:
    provider: openrouter
    model: google/gemini-2.5-flash
    base_url: ''
    api_key: ''
    timeout: 120
  compression:
    provider: openrouter
    model: google/gemini-3-flash-preview
  title_gen:
    provider: openrouter
    model: google/gemini-3-flash-preview
  approval:
    provider: anthropic
    model: claude-haiku-4.5

Each auxiliary entry accepts provider, model, base_url, api_key, timeout, extra_body, and download_timeout fields. Leave a slot on provider: auto with an empty model to keep it on the main model.
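
To make the override rule concrete, here is a minimal Python sketch of how a slot could resolve to a provider and model. It is an illustration of the behavior described above, not Hermes Agent's actual code: the resolve_slot function is hypothetical, and it assumes the config has already been loaded into a dict with the same keys as config.yaml.

```python
def resolve_slot(config: dict, slot: str) -> dict:
    """Return the provider/model a slot would use (hypothetical helper).

    Assumption: a slot that is missing, set to provider "auto", or has an
    empty model delegates to the main model, as the text describes.
    """
    main = config["model"]
    entry = config.get("auxiliary", {}).get(slot, {})
    provider = entry.get("provider", "auto")
    model = entry.get("model", "")
    if provider == "auto" or not model:
        return {"provider": main["provider"], "model": main["default"]}
    return {"provider": provider, "model": model}


# Example config mirroring the YAML shown earlier in this section.
config = {
    "model": {"provider": "openrouter", "default": "anthropic/claude-opus-4.7"},
    "auxiliary": {
        "title_gen": {"provider": "openrouter", "model": "google/gemini-3-flash-preview"},
        "mcp": {"provider": "auto", "model": ""},
    },
}

print(resolve_slot(config, "title_gen"))  # overridden slot
print(resolve_slot(config, "mcp"))        # stays on the main model
```

The point of the sketch is that an override is all-or-nothing per slot: anything left on auto inherits both the provider and the model of the main entry.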

Which Slots to Override

Slot | When to Override | Recommended Model Type
Title Gen | Almost always | Flash/chat models (cheap, fast)
Vision | When main model lacks vision | Gemini 2.5 Flash, GPT-4o-mini
Compression | When main model burns reasoning tokens | Any summarization-capable model
Session Search | High-volume recall queries | Cheap models, runs 3 concurrent
Approval | Smart approval mode only | Haiku, Flash, GPT-5-mini
Web Extract | Heavy web summarization | Any chat model
Skills Hub | Rarely | Leave on auto
MCP | Rarely | Leave on auto

The dashboard's Models page shows usage analytics with token counts and costs per model. Each model card has a "Use as" dropdown for one-click assignment to main, all auxiliary, or individual slots.

Fastio features

Persist Hermes Agent outputs across sessions and model switches

50GB free workspace with MCP-native access. Your agent writes files, Intelligence Mode indexes them, humans pick up the results. No credit card, no expiration.

How to Build Fallback Chains Across Providers

Fallback chains keep your session alive when a provider goes down. Rather than stopping mid-conversation, Hermes switches to a backup provider automatically and restores the primary on the next user message.

Basic Fallback Configuration

fallback_model:
  provider: openrouter
  model: anthropic/claude-sonnet-4

For multiple sequential fallbacks, define them in the fallback_providers list. The hermes fallback subcommands handle interactive setup:

hermes fallback add
hermes fallback list
hermes fallback remove

What Triggers Fallback

Hermes activates fallback on these conditions:

  • Rate limits (429) after exhausting retries
  • Server errors (500, 502, 503) after exhausting retries
  • Auth failures (401, 403) immediately
  • Not found (404) immediately
  • Malformed responses: repeated empty or invalid outputs

Fallback is turn-scoped: each new user message starts fresh with the primary model. Within a single turn, fallback fires at most once. If the fallback provider also fails, normal error handling takes over rather than cascading indefinitely.
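
The trigger rules above can be summarized as a small decision function. This is a hedged sketch rather than the real implementation: should_fall_back and its parameters are hypothetical names, and malformed-response detection is left out because the text does not say how many repeats count.

```python
# Status codes taken from the trigger list above.
IMMEDIATE = {401, 403, 404}            # fall back without retrying
AFTER_RETRIES = {429, 500, 502, 503}   # fall back only once retries run out


def should_fall_back(status: int, retries_exhausted: bool, already_fell_back: bool) -> bool:
    """Decide whether to switch to the fallback provider for this turn.

    Assumption: 'already_fell_back' tracks the at-most-once-per-turn rule
    and is reset when the next user message arrives.
    """
    if already_fell_back:
        return False  # a second failure goes to normal error handling
    if status in IMMEDIATE:
        return True
    if status in AFTER_RETRIES:
        return retries_exhausted
    return False


print(should_fall_back(401, retries_exhausted=False, already_fell_back=False))
print(should_fall_back(429, retries_exhausted=False, already_fell_back=False))
```

Keeping the once-per-turn flag as an explicit input makes the turn-scoped behavior easy to reason about: the caller clears it on each new user message, which is also when the primary model is tried again.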

Custom Endpoint Fallback

Point fallback to a local model for zero-cost resilience:

fallback_model:
  provider: custom
  model: my-local-model
  base_url: http://localhost:8000/v1
  key_env: MY_LOCAL_KEY

Auxiliary Task Fallback

Each auxiliary slot has its own independent fallback chain. For text tasks, the auto-detection order is:

  1. OpenRouter
  2. Nous Portal
  3. Custom endpoints
  4. Codex
  5. API-key providers

Vision tasks follow a different chain: the main provider first, then OpenRouter, Nous Portal, Codex, and Anthropic. If compression becomes unavailable entirely, Hermes degrades gracefully by dropping middle conversation turns without summaries rather than failing the session.
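
As a rough illustration of the two detection orders, the sketch below walks each chain and returns the first provider that has credentials. The provider labels and the first_available function are hypothetical placeholders chosen for this example, not identifiers from Hermes Agent.

```python
from typing import Optional

# Orders copied from the text above; the string labels are illustrative only.
TEXT_CHAIN = ["openrouter", "nous_portal", "custom", "codex", "api_key_provider"]
VISION_CHAIN = ["main", "openrouter", "nous_portal", "codex", "anthropic"]


def first_available(task: str, available: set) -> Optional[str]:
    """Return the first provider in the task's chain that is configured."""
    chain = VISION_CHAIN if task == "vision" else TEXT_CHAIN
    for provider in chain:
        if provider in available:
            return provider
    return None  # nothing usable; the caller has to degrade or fail


configured = {"codex", "anthropic"}
print(first_available("compression", configured))
print(first_available("vision", configured))
```

A None result corresponds to the degraded path described above, where compression drops middle turns instead of summarizing them.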

Credential Pools for Rate-Limit Resilience

Credential pools solve a different problem than fallback chains. Instead of switching providers when one fails, pools rotate multiple API keys for the same provider. This is critical for teams running multiple concurrent agents or hitting per-key rate limits on providers like OpenRouter or Anthropic.

Adding Keys to a Pool

hermes auth add openrouter --type api-key --api-key sk-or-v1-second-key
hermes auth add anthropic --type api-key --api-key sk-ant-api03-second-key
hermes auth add openrouter --type oauth

Each hermes auth add call appends to the pool for that provider. View all pools with hermes auth list.

Rotation Strategies

Configure how Hermes picks keys from a pool in config.yaml:

credential_pool_strategies:
  openrouter: round_robin
  anthropic: least_used

Available strategies:

  • fill_first (default): Use the first healthy key until it hits limits, then move to the next
  • round_robin: Cycle through keys evenly across requests
  • least_used: Always pick the key with the lowest request count
  • random: Random selection among healthy keys
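
The four strategies differ only in how they choose among healthy keys. The following Python sketch shows one plausible way to express them; the Key and Pool classes are hypothetical and are not the pool implementation that ships with Hermes Agent.

```python
import random
from dataclasses import dataclass


@dataclass
class Key:
    name: str
    healthy: bool = True
    request_count: int = 0


class Pool:
    """Toy credential pool illustrating the four selection strategies."""

    def __init__(self, keys, strategy="fill_first", rng=None):
        self.keys = keys
        self.strategy = strategy
        self._cursor = 0                      # position for round_robin
        self._rng = rng or random.Random()

    def pick(self):
        healthy = [k for k in self.keys if k.healthy]
        if not healthy:
            return None                       # pool exhausted
        if self.strategy == "fill_first":
            key = healthy[0]                  # stick with the first healthy key
        elif self.strategy == "round_robin":
            key = healthy[self._cursor % len(healthy)]
            self._cursor += 1
        elif self.strategy == "least_used":
            key = min(healthy, key=lambda k: k.request_count)
        elif self.strategy == "random":
            key = self._rng.choice(healthy)
        else:
            raise ValueError(f"unknown strategy: {self.strategy}")
        key.request_count += 1
        return key


pool = Pool([Key("key-1"), Key("key-2"), Key("key-3")], strategy="round_robin")
print([pool.pick().name for _ in range(4)])
```

In this sketch fill_first concentrates traffic on one key, which keeps the others fresh as spares, while round_robin and least_used spread load so that no single key reaches its limit early.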

Error Handling and Cooldowns

The rotation logic handles different failure modes:

  • 429 rate limit: Retry the same key once (handles transient blips), rotate on second 429, apply 1-hour cooldown
  • 402 billing/quota: Immediately rotate to next key, apply 24-hour cooldown
  • 401 auth expired: Attempt OAuth token refresh first, rotate if refresh fails
  • All keys exhausted: Fall through to fallback_model if configured
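
The failure handling above amounts to a small state machine per key. Here is a hedged sketch of that logic: KeyState, handle_error, and the returned action strings are hypothetical, and the OAuth refresh is reduced to a boolean the caller supplies.

```python
HOUR = 3600


class KeyState:
    """Per-key bookkeeping for this sketch."""

    def __init__(self):
        self.cooldown_until = 0.0   # timestamp before which the key is skipped
        self.rate_limit_hits = 0    # consecutive 429s seen on this key


def handle_error(state: KeyState, status: int, now: float, refresh_ok: bool = False) -> str:
    """Return 'retry' (same key) or 'rotate' (next key) for an error status."""
    if status == 429:
        state.rate_limit_hits += 1
        if state.rate_limit_hits == 1:
            return "retry"                      # tolerate one transient blip
        state.rate_limit_hits = 0
        state.cooldown_until = now + 1 * HOUR   # second 429: 1-hour cooldown
        return "rotate"
    if status == 402:
        state.cooldown_until = now + 24 * HOUR  # billing/quota: 24-hour cooldown
        return "rotate"
    if status == 401:
        return "retry" if refresh_ok else "rotate"
    raise ValueError(f"status {status} is not handled by the pool in this sketch")


def is_available(state: KeyState, now: float) -> bool:
    return now >= state.cooldown_until


state = KeyState()
print(handle_error(state, 429, now=0.0))
print(handle_error(state, 429, now=1.0))
print(is_available(state, now=10.0))
```

When every key in the pool reports unavailable, the caller would hand off to fallback_model, which is the last rule in the list above.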

Subagent Credential Sharing

When Hermes spawns subagents via delegate_task, the parent's credential pool extends to children automatically. Per-task credential leasing prevents conflicts when multiple subagents rotate keys concurrently. The pool implementation uses threading locks for all state mutations, so concurrent access stays safe.

Pool Management Commands

Command | Function
hermes auth list | Show all pools and credential status
hermes auth add <provider> | Add a credential (interactive)
hermes auth remove <provider> <index> | Remove by index
hermes auth reset <provider> | Clear cooldowns and exhaustion flags

Pareto Code Router for Cost-Optimized Coding

The Pareto Code Router is an OpenRouter feature that Hermes Agent integrates natively. Instead of committing to a specific coding model, you express a quality threshold and the router picks the cheapest model that meets it.

How It Works

OpenRouter maintains a curated shortlist of coding models ranked by their Artificial Analysis coding percentile. You set a min_coding_score between 0 and 1. The router maps your score to a quality tier, filters available models, and picks the cheapest one in that tier.

Score Range | Tier | What You Get
0.66 or above | High | Top coding models on the current frontier
0.33 to 0.66 | Medium | Strong flagships below the top tier
Below 0.33 | Low | Capable models exceeding median performance
Omitted | High | Defaults to strongest available
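
The score-to-tier mapping in the table above can be written as a short function. This is a sketch under stated assumptions, not OpenRouter's routing code: the tier_for and pick_cheapest helpers, the candidate model names, and the prices are all made up for illustration, and a score of exactly 0.66 is treated as High.

```python
from typing import Optional


def tier_for(min_coding_score: Optional[float]) -> str:
    """Map a min_coding_score to a tier, following the table above."""
    if min_coding_score is None:
        return "high"                 # omitted: strongest available
    if not 0 <= min_coding_score <= 1:
        raise ValueError("min_coding_score must be between 0 and 1")
    if min_coding_score >= 0.66:
        return "high"
    if min_coding_score >= 0.33:
        return "medium"
    return "low"


def pick_cheapest(models: list, min_coding_score: Optional[float]) -> Optional[dict]:
    """Filter to the requested tier, then take the lowest-priced model."""
    tier = tier_for(min_coding_score)
    candidates = [m for m in models if m["tier"] == tier]
    return min(candidates, key=lambda m: m["price"]) if candidates else None


# Hypothetical shortlist; names and prices are placeholders.
shortlist = [
    {"name": "model-a", "tier": "high", "price": 15.0},
    {"name": "model-b", "tier": "high", "price": 9.0},
    {"name": "model-c", "tier": "medium", "price": 2.0},
    {"name": "model-d", "tier": "low", "price": 0.3},
]

print(pick_cheapest(shortlist, 0.8)["name"])
print(pick_cheapest(shortlist, 0.5)["name"])
```

Because selection happens in two steps, raising the score within the same tier does not change the result in this sketch; only crossing a tier boundary does.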

Configuration in Hermes

Set the Pareto Code Router as your main model in ~/.hermes/config.yaml:

model:
  provider: openrouter
  default: openrouter/pareto-code

Tune the quality threshold:

model:
  provider: openrouter
  default: openrouter/pareto-code
  extra_body:
    min_coding_score: 0.8

The default min_coding_score is 0.65 if you don't specify one. Higher values mean stronger (and more expensive) coders. Lower values optimize for cost.

What Makes This Useful

Selection is deterministic for a given score on any particular day, but the actual model behind it shifts as the Pareto frontier moves. New models get benchmarked, existing models get repriced, and the router adapts. You never need to update your config when a better option appears at the same price point.

The :nitro variant (openrouter/pareto-code:nitro) optimizes for speed instead of price, selecting by p50 throughput rather than cost per token.

No additional routing fees apply. You only pay the underlying model's per-token price. The response includes a model field revealing which concrete model handled your request, so you always know what you're getting.

Combining with Auxiliary Slots

A practical setup: use Pareto Code Router for your main model (coding tasks) while pinning auxiliary slots to known-cheap models for non-coding work:

model:
  provider: openrouter
  default: openrouter/pareto-code
  extra_body:
    min_coding_score: 0.7

auxiliary:
  title_gen:
    provider: openrouter
    model: google/gemini-3-flash-preview
  compression:
    provider: openrouter
    model: google/gemini-3-flash-preview
  vision:
    provider: openrouter
    model: google/gemini-2.5-flash
  approval:
    provider: anthropic
    model: claude-haiku-4.5

This gives you frontier-quality coding on the main thread and minimal spend on everything else.

Persistent Storage for Multi-Model Agent Workflows

Multi-model routing solves the compute layer. But agents running complex workflows still need persistent file storage that survives across sessions, model switches, and provider outages.

When Hermes Agent switches between models mid-session or falls back to a different provider, the conversation context transfers. Files the agent has been working with need the same continuity. A local filesystem works for single-machine setups, but it breaks down when you run agents on remote servers, Docker containers, or Modal instances.

Options for Persistent Agent Storage

Local filesystem works if your agent runs on one machine and you never need to share results. Simple, zero-config, but fragile.

S3 or GCS gives durability and access control, but requires infrastructure setup, IAM configuration, and custom tooling for your agent to read and write files.

Fast.io workspaces provide a middle path: 50GB free storage with an MCP server that any LLM-powered agent can call directly. Files uploaded to a workspace are auto-indexed for semantic search through Intelligence Mode, so your agent can query its own outputs later without building a separate retrieval pipeline.

The handoff pattern works well with multi-model setups: your Hermes Agent generates artifacts (reports, code, analyses) using whichever model the router selects, stores them in a shared workspace, and a human collaborator picks up the finished work through the same workspace UI. No manual file transfers, no "which server was that on?" questions.

For teams running multiple Hermes instances with credential pools, Fast.io's workspace permissions let each agent write to designated folders while humans retain read access everywhere. File locks prevent conflicts when parallel agents touch the same documents.

The free agent tier includes 50GB storage, 5,000 AI credits per month, 5 workspaces, and MCP access with no credit card required.

Frequently Asked Questions

How do I switch models in Hermes Agent?

Use the /model slash command in any active session. For example, /model claude-sonnet-4 --provider anthropic switches immediately. Add --global to persist the change to config.yaml for future sessions. You can also run hermes model for an interactive wizard that walks through provider and model selection.

Can Hermes Agent use multiple AI providers at once?

Yes. The auxiliary model system lets you assign different providers to different task types simultaneously. Your main model might run on Anthropic while vision uses Google Gemini, compression uses a cheap OpenRouter model, and approval scoring runs on a fast Haiku instance. Each slot operates independently with its own authentication.

What is the Pareto Code Router in Hermes Agent?

The Pareto Code Router (openrouter/pareto-code) auto-selects the cheapest coding model that meets your quality threshold. Set a min_coding_score between 0 and 1 in your config, and OpenRouter picks from a curated shortlist ranked by coding benchmarks. The selection updates automatically as new models appear or pricing changes, so you keep getting the cheapest qualifying model without manual config updates.

How do I set up fallback models in Hermes Agent?

Add a fallback_model block to ~/.hermes/config.yaml with provider and model fields. Hermes switches to this model automatically when the primary fails due to rate limits, server errors, or auth issues. Fallback is turn-scoped, meaning each new user message retries the primary first. Use hermes fallback add for interactive setup, or define multiple fallbacks in the fallback_providers list for sequential failover.

What rotation strategies does Hermes Agent support for credential pools?

Four strategies are available: fill_first (default, uses one key until exhausted), round_robin (cycles evenly), least_used (picks lowest request-count key), and random. Configure per-provider in config.yaml under credential_pool_strategies. Keys that hit rate limits get a 1-hour cooldown; billing errors trigger 24-hour cooldowns with immediate rotation.

Do auxiliary model changes apply to existing sessions?

No. Config changes only apply to new sessions. Existing chat sessions retain their model assignments until you explicitly use the /model command to hot-swap within that session. For gateway integrations (Telegram, Discord, Slack), restart the gateway process to force all new sessions onto the updated config.
