How to Set Up RAG Document Retrieval in Hermes Agent

Hermes Agent's qmd skill provides local RAG document retrieval by combining BM25 keyword matching, vector search, and LLM reranking in a single hybrid pipeline. This guide walks through configuring collections, generating embeddings, querying your knowledge base, and enabling auto-retrieval during conversations, all running locally with sqlite-vec and FTS5.

Fast.io Editorial Team 9 min read

What qmd Does and Why It Matters

Most RAG setups assume you'll spin up a cloud-hosted vector database like Pinecone or Weaviate. Hermes Agent takes a different approach. Its qmd skill (Query Markup Documents) runs a complete hybrid retrieval engine locally, storing everything in a single SQLite file at ~/.cache/qmd/index.sqlite.

The retrieval pipeline combines three search methods:

  • BM25 keyword matching for exact terms, code identifiers, and fast prefix queries (~0.2s response time)
  • Vector semantic search using locally-hosted GGUF embedding models (~3s on first load, faster when warm)
  • Hybrid search with LLM reranking, which merges results from both approaches using reciprocal rank fusion and query expansion, then rescores them with an LLM (2-3s warm, ~19s cold start)

This means you get accurate retrieval for both "find me the function called parseConfig" (keyword) and "how does the authentication flow work" (semantic) without switching tools or configuring separate services.

All processing stays on your machine. No API keys for embedding services, no data leaving your network, no monthly vector DB bills. The tradeoff is ~2GB of disk space for the three GGUF models that download automatically on first run.

Prerequisites and Installation

Before configuring qmd, verify your environment meets these requirements:

System requirements:

  • Node.js 22 or later
  • SQLite with extension support (on macOS, install via Homebrew: brew install sqlite)
  • Approximately 2GB free disk space for embedding models

Install the qmd skill:

If you installed Hermes Agent from the official repository, qmd lives under optional-skills/research/qmd/. Enable it by adding the skill to your configuration:

# ~/.hermes/config.yaml
skills:
  optional:
    - research/qmd

On first run, qmd downloads three GGUF models automatically. This takes a few minutes depending on your connection speed, but only happens once.

Verify installation:

qmd status

This reports your index health, collection count, and embedding coverage. If the command isn't found, ensure the Hermes Agent binary directory is on your PATH.

How to Configure Collections and Index Documents

qmd organizes documents into named collections. Each collection points to a directory on your filesystem. You can index meeting notes, project documentation, personal notes, and code references as separate collections, then search across all of them simultaneously.

Add your first collection:

qmd collection add ~/notes --name notes
qmd collection add ~/projects/docs --name project-docs
qmd collection add ~/meetings --name transcripts

Add context descriptions:

This step is often skipped, but it improves retrieval accuracy. Context descriptions tell the engine what each collection contains, which helps it route queries to the right documents:

qmd context add qmd://notes "Personal research notes, reading highlights, and ideas"
qmd context add qmd://project-docs "Technical documentation for active projects"
qmd context add qmd://transcripts "Meeting notes and call transcripts from 2025-2026"

Generate embeddings:

qmd embed

This processes all documents across all collections. Documents are split at natural break points (headings, code blocks, blank lines) targeting approximately 900 tokens per chunk with 15% overlap. Code blocks remain intact during chunking to preserve syntax.
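
To make the chunking behavior concrete, here is a minimal sketch of overlap-aware chunking. It is not qmd's implementation: it approximates tokens by whitespace-separated words, treats every blank line as a break point (so a code block containing blank lines would be split, unlike in qmd), and the names chunk_blocks, target, and overlap are hypothetical.

```python
def chunk_blocks(text, target=900, overlap=0.15):
    """Group blank-line-separated blocks into chunks of roughly `target` tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    Each new chunk starts with trailing blocks of the previous chunk worth
    up to `overlap * target` tokens. Blocks are never split, so a single
    oversized block becomes its own oversized chunk.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    budget = int(target * overlap)
    chunks, current, size = [], [], 0

    for block in blocks:
        n = len(block.split())
        if current and size + n > target:
            chunks.append("\n\n".join(current))
            # Carry over trailing blocks that fit in the overlap budget.
            carried, carried_size = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_size + m > budget:
                    break
                carried.insert(0, prev)
                carried_size += m
            current, size = carried, carried_size
        current.append(block)
        size += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With the defaults, the overlap budget is 135 tokens, so consecutive chunks share whatever whole trailing blocks fit in that budget.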

Re-run qmd embed after adding new documents or collections. The engine only processes changed files, so subsequent runs are faster than the initial indexing pass.
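
How qmd detects changed files is not described here. One common approach, sketched below as an assumption rather than qmd's actual mechanism, is to record a content hash per file and re-embed only the files whose hash differs from the last run. The function names and the manifest dictionary are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(root, manifest, pattern="*.md"):
    """Return (files needing re-embedding, updated manifest).

    `manifest` maps path strings to the digest recorded at the last run.
    New and modified files are reported; unchanged files are skipped.
    """
    updated, stale = {}, []
    for path in sorted(Path(root).rglob(pattern)):
        digest = file_digest(path)
        updated[str(path)] = digest
        if manifest.get(str(path)) != digest:
            stale.append(str(path))
    return stale, updated
```

Hashing content rather than comparing modification times avoids re-embedding files that were touched but not edited.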

Supported file types:

qmd indexes markdown and text-based files natively. For broader document support, including PDF, DOCX, and PPTX files, GitHub issue #844 proposes a workspace RAG feature that would add text extraction from binary formats using fastembed.

Fastio features

Persist Hermes Agent files across sessions and teams

Free 50GB workspace with auto-indexing for semantic search. No credit card, no vector DB setup. Add the Fast.io MCP server and your agent reads, writes, and shares files in one call.

Querying Your Knowledge Base

qmd exposes three search modes, each suited to different retrieval needs:

BM25 keyword search (fastest):

qmd search "parseConfig function signature"

Returns results in ~0.2s. Best for exact terms, function names, variable references, and known phrases. Supports phrase matching with quotes and exclusion with -term.
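
For intuition about what keyword ranking rewards, here is a small sketch of the standard Okapi BM25 formula. It is illustrative only: qmd relies on SQLite FTS5 for keyword search, while this version tokenizes on whitespace, and the function name and the k1 and b defaults are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    Expects at least one non-empty document.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Long documents are penalized relative to the average length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that never mentions a query term scores zero, which is why keyword search misses content phrased with different vocabulary.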

Vector semantic search:

qmd vsearch "how does the authentication flow handle expired tokens"

Returns results in ~3s. Ideal for natural language questions where you don't know the exact terminology used in your documents. Uses HyDE (hypothetical document embeddings) to improve recall on conceptual queries.
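
Under the hood, vector search compares an embedding of the query (or, with HyDE, of a generated hypothetical answer) against the stored chunk embeddings. The sketch below shows that ranking step with toy vectors. Real embeddings come from the GGUF model and have far more dimensions, and the names cosine and top_k are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the ids of the k chunks most similar to the query vector.

    `chunk_vecs` maps chunk ids to embedding vectors.
    """
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]
```

Because ranking depends on direction in embedding space rather than shared words, a chunk can match even when it uses none of the query's terms.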

Hybrid search with LLM reranking (highest quality):

qmd query "best practices for error handling in the payment module"

Runs both BM25 and vector search in parallel, applies reciprocal rank fusion to merge results, then uses an LLM reranker to score relevance. Takes 2-3s when models are warm. This is the mode to use when you need the best possible answer and can tolerate slightly higher latency.
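
Reciprocal rank fusion is simple enough to sketch in a few lines. The version below is a generic illustration, not qmd's code: the function name is hypothetical, and k=60 is the constant commonly used with RRF, which may differ from what qmd uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing a keyword ranking ["a", "b", "c"] with a vector ranking ["b", "d", "a"] yields the order b, a, d, c: documents that both methods rank highly rise to the top, and the fusion only needs rank positions, not comparable scores.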

Retrieve a full document:

qmd get <docid>

After finding relevant chunks, use qmd get with the document ID to pull the complete source file.

Practical guidance: Use qmd search when you know what you're looking for (a specific term, config key, or file name). Use qmd query when you have a conceptual question and want comprehensive results. Reserve qmd vsearch for cases where BM25 misses because your query uses different vocabulary than your documents.

MCP Integration and Auto-Retrieval

The most useful way to run qmd is as an MCP server integrated directly into Hermes Agent. This gives the agent access to your knowledge base tools without requiring you to manually run search commands.

Stdio mode (simple setup):

Add to your Hermes Agent configuration:

# ~/.hermes/config.yaml
mcp_servers:
  qmd:
    command: "qmd"
    args: ["mcp"]
    timeout: 30

This registers five tools the agent can call: mcp_qmd_search, mcp_qmd_vsearch, mcp_qmd_deep_search, mcp_qmd_get, and mcp_qmd_status.

HTTP daemon mode (recommended for frequent use):

Start the daemon separately to keep embedding models warm in memory:

qmd mcp --http --daemon

This runs on localhost:8181 by default. Query latency drops from ~19s (cold start) to 2-3s because the models stay loaded between queries.
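
If you script around the daemon, it can help to confirm that something is listening before sending queries. This sketch only checks that the TCP port accepts connections. It does not speak MCP, and the function name is hypothetical.

```python
import socket


def daemon_reachable(host="localhost", port=8181, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A True result means the port is open, not that the process behind it is qmd, so treat it as a quick sanity check only.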

For persistent operation, set up a launchd plist on macOS (~/Library/LaunchAgents/) or a systemd user service on Linux (~/.config/systemd/user/). This ensures qmd starts automatically on login and restarts if it crashes.

Auto-retrieval in conversations:

With MCP integration active, Hermes Agent can query your knowledge base automatically when your questions relate to indexed content. The agent decides when retrieval is useful based on your query, then pulls relevant chunks into its context window before generating a response. You don't need to prefix questions with special commands or file references.

The planned knowledgebase RAG system (tracked in GitHub issue #844) proposes deeper integration: automatic context injection with relevance scoring above 0.5, limited to 5-8 chunks (~4,000 tokens) to avoid context bloat. This would eliminate even the agent's decision step, making retrieval fully transparent.
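
The limits in that proposal translate into a small selection step. The sketch below is one way to apply them, under stated assumptions: tokens are approximated by whitespace-separated words, the function and parameter names are hypothetical, and the actual design in issue #844 may differ.

```python
def select_context(chunks, min_score=0.5, max_chunks=8, token_budget=4000):
    """Pick the highest-scoring chunks that fit the limits.

    `chunks` is a list of (text, score) pairs. Only chunks scoring above
    `min_score` are considered, at most `max_chunks` are returned, and
    their combined size stays within `token_budget`.
    """
    selected, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        if score <= min_score or len(selected) == max_chunks:
            break
        cost = len(text.split())
        if used + cost > token_budget:
            continue  # Too large for the remaining budget; a smaller chunk may still fit.
        selected.append(text)
        used += cost
    return selected
```

Capping both the chunk count and the token total keeps retrieved context from crowding out the conversation itself.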

Workspace Storage and Persistence

Hermes Agent's workspace at ~/.hermes/workspace/ provides persistent document storage across sessions. Unlike ephemeral caches that auto-clean after 24 hours, files in the workspace directory remain available indefinitely.

Recommended directory structure:

~/.hermes/workspace/
├── docs/          # PDFs, markdown, reference documents
├── data/          # CSV, JSON, YAML data files
├── uploads/       # Files received from messenger platforms
├── code/          # Code snippets and reference implementations
└── notes/         # Quick notes and scratchpad
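
If you want to create this layout from a script, a short helper is enough. This is a convenience sketch, not part of Hermes Agent: the function name is hypothetical and the subdirectory names simply mirror the tree above.

```python
from pathlib import Path

SUBDIRS = ("docs", "data", "uploads", "code", "notes")


def create_workspace(root=Path.home() / ".hermes" / "workspace"):
    """Create the workspace subdirectories if missing and return their paths."""
    root = Path(root)
    paths = [root / name for name in SUBDIRS]
    for path in paths:
        # exist_ok makes the call safe to repeat on an existing workspace.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```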

Point qmd at your workspace to make it searchable:

qmd collection add ~/.hermes/workspace --name workspace
qmd context add qmd://workspace "Persistent workspace with project docs, data, and notes"
qmd embed

Connecting workspace to external storage:

For teams sharing documents across agent deployments, local workspace directories have an obvious limitation: they live on one machine. If you need documents accessible across multiple Hermes instances or want to hand off agent-generated files to human collaborators, a shared workspace layer fills that gap.

Fast.io provides persistent workspaces where agents and humans share the same file layer. With Intelligence Mode enabled, uploaded documents are automatically indexed for semantic search and AI chat, giving you a cloud-backed RAG layer without configuring separate vector infrastructure. The MCP server at /storage-for-agents/ exposes workspace operations that Hermes Agent can call directly alongside qmd for local files.

The free agent tier includes 50GB storage, 5,000 AI credits per month, and 5 workspaces with no credit card required. This works well as a complement to local qmd indexing: keep sensitive documents local and searchable via qmd, while sharing team resources and agent outputs through Fast.io workspaces.

How to Fix Common qmd Issues

Cold start latency:

The first query after a fresh boot takes ~19s because qmd loads three GGUF models into memory. Run the HTTP daemon to eliminate this. If you only need keyword matching, qmd search skips model loading entirely and returns in ~0.2s.

Embedding failures on macOS:

If qmd embed fails with SQLite extension errors, you likely have the system SQLite (which lacks extension support) taking precedence over Homebrew's version. Fix by ensuring Homebrew's SQLite is first on your PATH:

export PATH="/opt/homebrew/opt/sqlite/bin:$PATH"

Stale results after file changes:

qmd doesn't watch for file changes automatically. After editing documents in an indexed collection, re-run qmd embed to update the index. Only modified files are reprocessed, so this is fast for incremental updates.

Large collections (10,000+ files):

Initial embedding of large collections can take 30-60 minutes. Consider breaking them into focused collections rather than indexing entire home directories. Targeted collections with good context descriptions produce better retrieval results than broad, unfocused indexes.

Memory usage:

The three GGUF models consume approximately 2GB RAM when loaded. On systems with limited memory, use stdio MCP mode (models load on demand and unload after timeout) rather than the persistent HTTP daemon.

Index location and backup:

Everything lives in ~/.cache/qmd/index.sqlite. Back up this file to preserve your embeddings. Restoring it on another machine with the same document paths skips the embedding step entirely.
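
Copying the file while the daemon is writing to it risks an inconsistent snapshot. One safer option is SQLite's online backup API, available through Python's sqlite3 module. The sketch below assumes the index is an ordinary SQLite database. The function name and destination path are hypothetical.

```python
import sqlite3
from pathlib import Path


def backup_index(src=Path.home() / ".cache" / "qmd" / "index.sqlite",
                 dest="index-backup.sqlite"):
    """Copy a SQLite database to `dest` using the online backup API."""
    # mode=rw opens an existing file and raises if the path is wrong,
    # instead of silently creating an empty database.
    uri = Path(src).resolve().as_uri() + "?mode=rw"
    source = sqlite3.connect(uri, uri=True)
    target = sqlite3.connect(dest)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()
    return Path(dest)
```

The backup copies database pages directly, so it does not need the sqlite-vec extension loaded to produce a complete copy.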

Frequently Asked Questions

Does Hermes Agent support RAG?

Yes. Hermes Agent supports RAG through its qmd optional skill, which provides hybrid retrieval combining BM25 keyword search, vector semantic search, and LLM reranking. All processing runs locally using SQLite with the FTS5 and sqlite-vec extensions. A deeper knowledgebase RAG system with automatic context injection is proposed in GitHub issue #844.

How do I add documents to Hermes Agent's knowledge base?

Add directories as named collections using qmd collection add ~/path --name collection-name, then run qmd embed to generate vector embeddings. You can also place files in ~/.hermes/workspace/ for persistent storage across sessions and index that directory as a collection.

What search methods does Hermes Agent use for retrieval?

Hermes Agent's qmd skill uses three search methods. BM25 keyword matching handles exact terms and identifiers in about 0.2 seconds. Vector semantic search uses local GGUF embedding models for natural language queries. Hybrid mode runs both in parallel, merges the results with reciprocal rank fusion, and then applies LLM reranking for the highest-quality results.

Can Hermes Agent search local files?

Yes. qmd indexes markdown and text-based files from any local directory you configure as a collection. The index and all embeddings are stored locally in ~/.cache/qmd/index.sqlite with no cloud dependencies. Documents are chunked at natural break points targeting 900 tokens with 15% overlap.

How do I enable auto-retrieval in Hermes Agent conversations?

Configure qmd as an MCP server in ~/.hermes/config.yaml under the mcp_servers key. Once active, Hermes Agent can call qmd search tools automatically when your questions relate to indexed content. For best performance, run the qmd daemon in HTTP mode to keep models warm.

What file types does qmd support?

Currently qmd natively indexes markdown and text-based files. The planned knowledgebase RAG system (issue #844) proposes adding support for PDFs, DOCX, and PPTX through text extraction. For binary document formats today, convert to markdown before indexing.
