Hermes LLM: Nous Research's Open-Source Model Family Explained
Hermes LLM is Nous Research's family of open-source large language models, fine-tuned for instruction following, function calling, and agentic workflows. This guide covers every generation from the original Hermes 13B through Hermes 4, explains how the models differ from their Llama and Qwen base weights, and walks through running them locally, selecting the right variant for agent pipelines, and connecting model output to a shared workspace.
What Hermes LLM Is
Hermes LLM is a family of open-source large language models fine-tuned by Nous Research. The models start from open base weights (Meta's Llama series, Mistral, ByteDance Seed, or Alibaba Qwen) and add specialized training data for instruction following, structured output, function calling, and multi-turn conversation. Every Hermes model is released with open weights on HuggingFace, and the training methodology is documented in published technical reports.
The name "Hermes" applies specifically to the fine-tune, not the base model. When someone says "Hermes 3 405B," they mean Nous Research's fine-tuned version of Meta's Llama 3.1 405B. The base model provides raw language capabilities. The Hermes fine-tune adds reliable tool use, structured JSON output, roleplay coherence, and a prompt format (ChatML) that makes the model easier to integrate into pipelines.
Nous Research also builds Hermes Agent, an open-source autonomous agent framework released in early 2026. Hermes Agent uses Hermes LLMs as its default models, but the agent framework is model-agnostic. You can run it with Claude, GPT-4, Gemini, or any OpenAI-compatible endpoint. The model family and the agent framework share a name but serve different purposes: one is a set of weights you can download and run, the other is a Python application that orchestrates those weights (or any other model) into a persistent agent.
The practical appeal of Hermes models comes down to three things. First, they are genuinely open: weights, quantizations, and technical reports are all public. Second, the function-calling fine-tune is production-grade, offering the kind of structured output and tool-call reliability that competing commercial models charge for. Third, the model family spans parameter counts from 3B to 405B, so you can pick the right size for your hardware and latency budget.
Every Hermes Generation, from 13B to 405B
Nous Research has released four major Hermes generations since 2023, each built on whichever open base model offered the best foundation at the time.
Hermes 1 (2023)
The original release fine-tuned Meta's Llama at 7B and 13B parameter counts. Training data was primarily GPT-4 generated instruction-response pairs. These models proved that a small lab could produce fine-tunes competitive with much larger teams, and they established the ChatML prompt format that every subsequent Hermes version has used. Hermes 1 was a general-purpose instruction follower without dedicated tool-calling support.
Hermes 2 and Hermes 2 Pro (2024)
Hermes 2 scaled to Llama 2 and Mistral base models, training on roughly one million entries of synthetic and curated data. Hermes 2 Pro, released in May 2024, was the inflection point for the family. Nous Research built a dedicated function-calling and JSON-mode dataset in-house. The result: Hermes 2 Pro scored 90% on function-calling evaluations (built in partnership with Fireworks.AI) and 84% on structured JSON output evaluation. The model introduced the <tool_call> token and a reliable parsing format that let downstream systems extract function calls without fragile regex.
Hermes 3 (August 2024)
Hermes 3 moved to Llama 3.1 as the base, with variants at 3B, 8B, 70B, and 405B parameters. The technical report (arXiv:2408.11857) describes training on a primarily synthetic dataset that "aggressively encourages the model to follow system and instruction prompts exactly." Hermes 3 brought advanced agentic capabilities, better multi-turn coherence, long-context retention, and structured output via <tool_call> XML tags. The 405B variant achieved state-of-the-art performance among open-weight models on several public benchmarks. GGUF quantizations for every size shipped the same week, making local deployment practical from day one.
A reasoning-focused variant called DeepHermes 3 Preview followed in February 2025. Built from the Hermes 3 data mix plus 150,000 chain-of-thought examples, it introduced toggleable <think> tags that let users control whether the model reasons internally before answering. DeepHermes 3 scored 67% on MATH benchmarks at 8B parameters, trading raw math performance for broader conversational versatility compared to specialist reasoning models.
Hermes 4 (August 2025)
Hermes 4 again reached beyond the Llama base (Hermes 2 had already used Mistral), shipping variants at 14B, 70B, and 405B parameters. The training dataset grew to roughly 50x more tokens than Hermes 3. The defining feature is hybrid reasoning: the model can deliberate internally using <think> traces or respond directly, and developers can toggle reasoning off for faster, cheaper inference in production. With reasoning enabled, the 405B variant hit 96% on MATH-500 (up from 93.1% in direct mode) and 81.9% on AIME 2024. Direct mode improved latency by up to 28% in early benchmarks.
Hermes 4.3, released alongside the main Hermes 4 line, is built on ByteDance's Seed 36B rather than a Meta model. It delivers 70B-class performance in a 36B dense architecture with a 512K token context window.
How Hermes Models Handle Function Calling
Function calling is where Hermes models distinguish themselves from generic instruction-following fine-tunes. Starting with Hermes 2 Pro, every release includes dedicated training data for tool use, and the format has been consistent enough that downstream frameworks can rely on it.
The mechanism works through the ChatML prompt format. You define available tools as JSON schemas inside <tools> XML tags in the system prompt. When the model decides to call a function, it emits a <tool_call> tag containing a JSON object with the function name and arguments. The calling system parses that tag, executes the function, and returns the result inside a <tool_response> tag. The model then incorporates the result into its next response.
Here is what that looks like in practice:
<|im_start|>system
You are a function calling AI model.
<tools>
[{"type": "function", "function": {"name": "get_weather",
"parameters": {"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]}}}]
</tools><|im_end|>
<|im_start|>user
What's the weather in Tokyo?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>
This format is intentionally simple. There are no proprietary tokens or model-specific encoding. Any system that can parse XML tags and JSON can work with Hermes function calling. That simplicity is why Hermes models work well with generic agent frameworks: they do not require the framework to implement provider-specific parsing logic.
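As a rough illustration of how little parsing the format needs, here is a minimal Python sketch that pulls tool calls out of a raw completion. The function name extract_tool_calls and the decision to skip malformed JSON are assumptions made for this example, not part of any Nous Research library.

```python
import json

OPEN_TAG = "<tool_call>"
CLOSE_TAG = "</tool_call>"


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call found in a model completion."""
    calls = []
    cursor = 0
    while True:
        start = text.find(OPEN_TAG, cursor)
        if start == -1:
            break
        end = text.find(CLOSE_TAG, start)
        if end == -1:
            break  # unterminated tag: stop scanning
        body = text[start + len(OPEN_TAG):end].strip()
        cursor = end + len(CLOSE_TAG)
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # assumption: silently skip malformed JSON
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


# The assistant turn from the example above, minus the ChatML wrappers.
sample = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(sample)
```

A production parser would probably log or surface malformed calls instead of dropping them, so the model can be asked to retry.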
Hermes 3 and 4 added reliability improvements. The models learned to emit well-formed JSON consistently, handle nested function schemas, chain multiple tool calls in a single turn, and incorporate tool results into coherent follow-up responses. In agentic loops where the model calls tools dozens of times per session, that consistency matters more than raw benchmark scores.
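The other half of the loop, executing a parsed call and handing the result back, can be sketched as follows. The stub get_weather, the TOOLS registry, and run_tool_call are hypothetical names. The "tool" role and the name/content payload shape are assumptions modeled on the <tool_response> convention described above, so check the model card for the exact template your variant expects.

```python
import json


def get_weather(city: str) -> dict:
    """Stub tool for illustration; a real tool would call a weather API."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_call(call: dict) -> str:
    """Execute one parsed tool call and wrap the result as a ChatML turn."""
    func = TOOLS.get(call["name"])
    if func is None:
        result = {"error": f"unknown tool: {call['name']}"}
    else:
        try:
            result = func(**call.get("arguments", {}))
        except TypeError as exc:
            # Wrong or missing arguments: report back instead of crashing.
            result = {"error": str(exc)}
    payload = json.dumps({"name": call["name"], "content": result})
    # Assumption: tool results go back in a turn with the "tool" role.
    return (
        "<|im_start|>tool\n"
        f"<tool_response>\n{payload}\n</tool_response><|im_end|>\n"
    )


turn = run_tool_call({"name": "get_weather", "arguments": {"city": "Tokyo"}})
```

Returning errors as ordinary tool responses, rather than raising, lets the model see the failure and correct its next call, which matters in long agentic sessions.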
Structured JSON output works similarly. You provide a JSON schema in the system prompt and the model constrains its output to match. This is useful for extraction tasks, form filling, and any workflow where you need predictable output structure from an LLM.
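Even with a model trained for JSON mode, it is reasonable to verify replies before passing them downstream. The sketch below is a deliberately small check written for this article. SCHEMA and check_against_schema are hypothetical, and the check covers only required fields and primitive types, not the full JSON Schema specification.

```python
import json

# Maps JSON Schema primitive type names to Python types.
# Note: Python treats bool as a subclass of int, so True passes as "integer".
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}

# Hypothetical schema you might place in the system prompt.
SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "temp_c": {"type": "number"}},
    "required": ["city"],
}


def check_against_schema(reply: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the reply passed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if key in data and expected and not isinstance(data[key], expected):
            problems.append(f"wrong type for field: {key}")
    return problems


problems = check_against_schema('{"city": "Tokyo", "temp_c": 21}', SCHEMA)
```

For anything beyond flat objects, a full validator library is the safer choice. This sketch only shows where the check sits in the pipeline.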
Persist Hermes Agent Files Across Sessions
Free 50 GB workspace with auto-indexing, MCP server access, and ownership transfer. No credit card, no expiration.
Running Hermes Models Locally
Every Hermes model ships with GGUF quantizations on HuggingFace, which means you can run them locally using Ollama, llama.cpp, LM Studio, or any GGUF-compatible runtime. The 8B variant alone has 47 quantized versions available, covering every precision level from Q2_K through F16.
Ollama (simplest path)
Install Ollama from ollama.com, then pull a Hermes model:
ollama pull hermes3:8b
Ollama handles weight downloading, GPU memory allocation, and serves an OpenAI-compatible API at http://localhost:11434/v1. For agent frameworks like Hermes Agent, you point the model configuration at that endpoint and you are running fully local inference with zero API costs.
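To show what pointing a client at that endpoint involves, here is a minimal standard-library Python sketch. It assumes Ollama is running locally with hermes3:8b already pulled and that the server exposes the usual OpenAI-style /chat/completions route. The helper names build_chat_request and chat are made up for this example.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_chat_request(
    prompt: str,
    model: str = "hermes3:8b",
    base_url: str = OLLAMA_BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Usage, with the Ollama server running:
#   print(chat("Summarize what a GGUF file is in one sentence."))
```

Because the request shape is the OpenAI-compatible one, switching to another compatible server should only require changing base_url and model.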
llama.cpp (more control)
For direct GGUF loading with fine-grained control over context length, batch size, and GPU layers:
llama-server -hf NousResearch/Hermes-3-Llama-3.1-8B-GGUF -c 32768
This gives you an OpenAI-compatible server with explicit context window sizing. For production deployments, vLLM and SGLang offer higher throughput with continuous batching.
Hardware sizing
The right model size depends on your GPU memory. At Q4_K_M quantization (the sweet spot between quality and memory use), unless noted otherwise:
- 3B (Hermes 3 Llama 3.2 3B): runs on 4 GB VRAM, fast enough for edge devices
- 8B (Hermes 3 Llama 3.1 8B): needs 6-8 GB VRAM, runs at 40-60 tokens/sec on an RTX 4060
- 70B (Hermes 3 Llama 3.1 70B): needs 48 GB+ VRAM, requires an A6000 or dual-GPU setup
- 405B (Hermes 3 Llama 3.1 405B): needs 430 GB+ VRAM even with FP8 quantization, multi-node territory
For most developers, the 8B model on a consumer GPU is the right starting point. It handles function calling, code generation, and multi-turn conversation without noticeable quality degradation for typical agent tasks. Move to 70B when you need stronger reasoning over complex tool chains or longer context windows.
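The sizing figures above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The bits-per-weight values in this sketch are approximate averages assumed for illustration, and the result covers weights only. KV cache and runtime overhead come on top, which is why the practical VRAM numbers are higher.

```python
# Approximate average bits per weight for each format (assumed values).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
    "FP8": 8.0,
    "F16": 16.0,
}


def weight_memory_gb(params_billion: float, quant: str = "Q4_K_M") -> float:
    """Estimate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # params_billion * 1e9 weights * bits / 8 bytes each, divided by 1e9.
    return params_billion * bits / 8


for size, quant in [(3, "Q4_K_M"), (8, "Q4_K_M"), (70, "Q4_K_M"), (405, "FP8")]:
    print(f"{size}B at {quant}: ~{weight_memory_gb(size, quant):.1f} GB of weights")
```

Under these assumptions the 70B model needs a little over 42 GB for weights alone, which lines up with the 48 GB+ recommendation once context and overhead are added.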
Cloud inference
If local hardware is not available, Hermes models are served on OpenRouter, Nous Portal, and several other inference providers. OpenRouter aggregates over 200 models including every Hermes variant, so you can compare pricing and latency across providers without changing your API integration code.
Choosing a Hermes Model for Agent Workflows
Agent frameworks call models hundreds of times per session. The model needs to handle tool calls reliably, maintain context across long conversations, and produce structured output on demand. Not every Hermes variant is equally suited to that workload.
For local development and prototyping
Hermes 3 Llama 3.1 8B at Q4_K_M quantization is the workhorse. It fits on consumer GPUs, runs fast enough for interactive development (40-60 tokens/sec), and its function-calling reliability is high enough for most agent tasks. The 3B variant is usable for simple workflows but struggles with complex multi-step tool chains.
For production agentic pipelines
Hermes 4 70B or 405B, served through a cloud provider or on dedicated inference hardware. The hybrid reasoning mode lets you toggle chain-of-thought on for complex planning steps and off for routine tool calls, balancing quality and cost within the same session. The 405B variant's MATH-500 score of 96% with reasoning enabled reflects meaningfully better complex problem-solving compared to smaller variants.
For long-context workflows
Hermes 4.3 36B offers a 512K token context window, which is far beyond the 128K context of the Llama-based variants. If your agent needs to process large documents, maintain long conversation histories, or work with extensive codebases in a single session, this variant is worth the tradeoff of using a less common base model.
For reasoning-heavy tasks
DeepHermes 3 or Hermes 4 with reasoning enabled. The <think> tag format lets the model show its work on difficult problems. For agent tasks that involve multi-step planning, debugging, or mathematical reasoning, explicit chain-of-thought produces noticeably better results than direct responses.
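When reasoning is enabled, an agent usually wants to log the trace but act only on the final answer. A minimal Python sketch of that split follows. The function name split_reasoning is hypothetical, and the sketch assumes at most one <think> block per reply.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_reasoning(text: str) -> tuple[str, str]:
    """Split a reply into (reasoning, answer); reasoning is "" if absent."""
    start = text.find(OPEN_TAG)
    end = text.find(CLOSE_TAG)
    if start == -1 or end == -1 or end < start:
        # No complete trace: treat the whole reply as the answer.
        return "", text.strip()
    reasoning = text[start + len(OPEN_TAG):end].strip()
    answer = (text[:start] + text[end + len(CLOSE_TAG):]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2+2=4</think>\nThe answer is 4.")
```

Keeping the trace out of the conversation history you send back to the model is one way to hold token usage down in long sessions, though whether to do so depends on the framework.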
Model selection in Hermes Agent
Hermes Agent supports any OpenAI-compatible model provider. You can switch models mid-session with the /model command, and the agent uses separate model slots for its main reasoning loop and auxiliary tasks like compression, vision, and web summarization. The recommended pattern is to use a capable model (70B+ or a commercial model like Claude) for the main loop and a cheaper model for auxiliary tasks that do not require deep reasoning.
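The main-versus-auxiliary pattern can be pictured as a small routing table. The sketch below is not Hermes Agent's configuration format or API. ModelSlot, SLOTS, slot_for, the model identifiers, and the example.invalid URL are all hypothetical placeholders meant only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSlot:
    """One configured backend: a model name plus an OpenAI-compatible URL."""
    model: str
    base_url: str


# Hypothetical slots: a capable model for the main loop, a cheap local one
# for auxiliary work. Both names and URLs are placeholders.
SLOTS = {
    "main": ModelSlot("hermes-4-70b", "https://example.invalid/v1"),
    "auxiliary": ModelSlot("hermes3:8b", "http://localhost:11434/v1"),
}

# Task kinds routed to the cheaper slot, following the pattern above.
AUXILIARY_TASKS = {"compression", "vision", "web_summarization"}


def slot_for(task: str) -> ModelSlot:
    """Pick the auxiliary slot for cheap tasks, the main slot otherwise."""
    return SLOTS["auxiliary" if task in AUXILIARY_TASKS else "main"]


chosen = slot_for("compression")
```

The design choice here is that routing depends on the task type rather than on prompt length or cost estimates, which keeps the rule easy to reason about.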
The agent framework is not locked to Hermes models. Teams running Claude, GPT-4, Gemini, or local models through Ollama get the same skill system, persistent memory, and tool-calling pipeline. The Hermes models are defaults, not requirements.
Where Agent Output Goes After Generation
Models generate text. Agents turn that text into files, code, reports, and data. The gap between "the model produced a good response" and "the team can use what the agent built" is a storage and handoff problem.
Local LLM deployments write output to the local filesystem by default. That works for single-developer workflows, but it breaks down when agents run on remote servers (Docker, SSH, Modal), when multiple agents collaborate, or when agent output needs to reach people who are not on the same machine.
Fast.io solves this as a persistent workspace layer. Agents write files to Fast.io workspaces through the MCP server or REST API. Those files are immediately available to other agents and to humans through the web interface. When an agent finishes a project, the workspace owner can transfer ownership to a client or team member while the agent retains admin access for future updates.
The practical workflow for Hermes Agent deployments: the agent runs on your server (local, Docker, or cloud), uses whatever LLM backend you configure, and stores its persistent output in a Fast.io workspace. Files uploaded to the workspace are auto-indexed by Intelligence Mode, which means team members can search agent output by meaning, ask questions about generated documents with citations, and use Metadata Views to extract structured data from agent-generated files without manual review.
For teams running multiple agents, Fast.io workspaces provide file locks to prevent concurrent write conflicts and granular permissions so each agent accesses only its designated workspace. Webhooks notify downstream systems when agents create or update files, enabling reactive pipelines without polling.
The free agent tier includes 50 GB of storage, 5,000 AI credits per month, and 5 workspaces with no credit card and no expiration. That is enough capacity to run several Hermes Agent instances with persistent file storage across sessions.
Frequently Asked Questions
What is Hermes LLM?
Hermes LLM is a family of open-source large language models created by Nous Research. The models are fine-tuned versions of open base models (like Meta's Llama) with specialized training for instruction following, function calling, structured JSON output, and agentic workflows. The family spans multiple parameter counts from 3B to 405B and includes variants optimized for reasoning (DeepHermes, Hermes 4 with hybrid reasoning). All weights are freely available on HuggingFace.
Is Hermes LLM open source?
Yes. Every Hermes model is released with open weights on HuggingFace. The models inherit the license of their base model, which for Llama-based variants is the Llama Community License. The Hermes function-calling dataset and training code are also publicly available. Hermes Agent, the companion agent framework, is released under the MIT license.
How does Hermes LLM compare to Llama?
Most Hermes models start from Llama base weights and add fine-tuning for instruction following, function calling, and structured output. The base Llama models are general-purpose pre-trained models without specialized tool-use training. Hermes 3, for example, uses Llama 3.1 as its foundation but adds the ChatML prompt format, reliable tool-call parsing via XML tags, and training data that improves multi-turn conversation coherence. Nous Research reports that Hermes 3 matches or exceeds Llama 3.1 Instruct on general benchmarks while adding capabilities the base model lacks.
Can I use Hermes LLM locally?
Yes. Every Hermes model ships with GGUF quantizations on HuggingFace, compatible with Ollama, llama.cpp, LM Studio, and other local inference tools. The 8B variant runs on consumer GPUs with 8 GB of VRAM. Install Ollama and run "ollama pull hermes3:8b" to get started. For larger models, you need more GPU memory, but quantization options from Q2_K through F16 let you trade quality for memory across a wide range of hardware.
What is the difference between Hermes LLM and Hermes Agent?
Hermes LLM refers to the model weights you download and run for inference. Hermes Agent is a Python-based autonomous agent framework that uses those models (or any other model) to perform tasks with persistent memory, scheduled automations, and messaging integrations. The agent framework is model-agnostic and works with Claude, GPT-4, Gemini, or local models through Ollama. The two share a name because Nous Research builds both, but they serve different purposes.
Which Hermes model should I use for AI agent tasks?
For local development, Hermes 3 Llama 3.1 8B at Q4_K_M quantization offers the best balance of speed, quality, and hardware requirements. For production pipelines, Hermes 4 70B or 405B with hybrid reasoning provides stronger tool-call reliability and planning capability. Hermes 4.3 36B is the best choice when you need a long context window (512K tokens). If you are using a cloud provider, all variants are available on OpenRouter and Nous Portal.