How to Use Hermes Agent Voice Mode for Real-Time Spoken Interaction
Hermes Agent voice mode turns a text-based autonomous agent into a spoken conversation partner. It supports push-to-talk in the CLI, voice memo transcription on Telegram, and live voice channel participation on Discord, all while retaining the agent's full tool calling, persistent memory, and skill capabilities. This guide covers installation, provider configuration, and platform-specific setup for each mode.
What Hermes Agent Voice Mode Actually Does
Nous Research Hermes Agent is an open-source (MIT-licensed) autonomous AI agent with persistent memory, installable skills, subagent delegation, and connections to messaging platforms including Telegram, Discord, Slack, WhatsApp, and Signal. Voice mode adds a speech layer on top of this existing agent pipeline. You speak, the agent transcribes your audio, processes it through the same reasoning and tool-calling engine that handles text, then synthesizes a spoken response.
The key distinction from generic voice assistants is that Hermes Agent retains everything it can do in text mode. When you ask it a question by voice, it can call tools, read files, query APIs, delegate to subagents, and reference its persistent memory, then speak the answer back. Voice is an interface layer, not a separate product.
Voice mode works across four surfaces: the local CLI with push-to-talk, Telegram voice memos, Discord text channels with voice replies, and Discord voice channels where the agent joins as a live participant. Each surface shares the same STT and TTS pipeline but has different setup requirements and interaction patterns.
The speech pipeline connects three stages. Transcription converts your audio to text using OpenAI Whisper (running locally or via cloud API). The agent processes the transcribed text through its normal reasoning loop. Synthesis converts the agent's text response to audio using your configured TTS provider. Whisper's multilingual model handles input in 90+ languages without extra configuration.
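The three stages can be sketched as a small pipeline object. This is an illustrative outline only, not Hermes Agent's internal code: the `VoicePipeline` class and its field names are hypothetical, and the stages are stand-in lambdas so the example runs without any speech libraries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Three-stage speech loop: transcribe -> reason -> synthesize."""
    transcribe: Callable[[bytes], str]   # STT stage: audio in, text out
    run_agent: Callable[[str], str]      # the same text pipeline used for chat
    synthesize: Callable[[str], bytes]   # TTS stage: text in, audio out

    def handle_utterance(self, audio: bytes) -> tuple[str, bytes]:
        text = self.transcribe(audio)
        reply = self.run_agent(text)
        return reply, self.synthesize(reply)


# Stand-in stages (hypothetical) that treat bytes as already-decoded text.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    run_agent=lambda text: f"You said: {text}",
    synthesize=lambda reply: reply.encode("utf-8"),
)
reply_text, reply_audio = pipeline.handle_utterance(b"what time is it")
print(reply_text)
```

Because the middle stage is just the text pipeline, swapping an STT or TTS provider changes only the first or last callable.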
How to Install Voice Mode Dependencies
Voice mode requires additional Python packages and system-level audio libraries beyond the base Hermes Agent installation. Start from a working Hermes setup where text chat already responds correctly. If you haven't done that yet, run `pip install hermes-agent` and configure your LLM provider with `hermes model` first.
Python Packages
Install the voice extras based on which platforms you need:
pip install "hermes-agent[voice]" # CLI microphone + local STT
pip install "hermes-agent[messaging]" # Discord and Telegram gateways
pip install "hermes-agent[tts-premium]" # ElevenLabs premium voices
pip install "hermes-agent[all]" # Everything at once
For local transcription without cloud API costs, also install faster-whisper:
pip install faster-whisper
The base Whisper model downloads automatically on first use (around 150 MB).
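To check that local transcription works on its own, faster-whisper can be called directly. The sketch below is a standalone use of the library, not necessarily how Hermes invokes it. The `join_segments` and `transcribe_file` helpers are names invented for this example, and the CPU/int8 settings are an assumption you may want to change.

```python
from types import SimpleNamespace


def join_segments(segments) -> str:
    """Concatenate the .text of each transcription segment into one string."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe_file(path: str, model_size: str = "base") -> str:
    # Imported lazily so join_segments works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # int8 on CPU is a low-memory choice; adjust device/compute_type for a GPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)
    return join_segments(segments)


# Fake segments to show the joining step without loading a model.
fake = [SimpleNamespace(text=" Hello there."), SimpleNamespace(text=" How are you?")]
print(join_segments(fake))
```

Calling `transcribe_file("sample.wav")` on a short recording triggers the model download the first time it runs.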
System Dependencies
macOS:
brew install portaudio ffmpeg opus espeak-ng
Ubuntu/Debian:
sudo apt install portaudio19-dev ffmpeg libopus0 espeak-ng
PortAudio handles microphone input for CLI recording. FFmpeg converts audio formats, which Telegram voice memos require. The Opus codec is necessary for Discord voice channels. espeak-ng provides phoneme processing for certain local TTS engines.
If you only need CLI voice mode, you can skip the Opus and espeak-ng packages. If you only need messaging platform voice, you can skip PortAudio. Install everything if you plan to use voice across all surfaces.
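A quick way to confirm the right system packages are present is a small check script. The mapping below follows the skip rules just described. The surface labels `cli` and `messaging` and the helper names are made up for this sketch.

```python
import shutil
from ctypes.util import find_library

# System packages per surface, following the skip rules in the text above.
REQUIRED = {
    "cli": {"portaudio", "ffmpeg"},
    "messaging": {"ffmpeg", "opus", "espeak-ng"},
}


def required_packages(surfaces):
    """Union of packages needed for the given surfaces, sorted for display."""
    needed = set()
    for surface in surfaces:
        needed |= REQUIRED[surface]
    return sorted(needed)


def is_installed(name: str) -> bool:
    # Executables are looked up on PATH, shared libraries via the system loader.
    if name in ("ffmpeg", "espeak-ng"):
        return shutil.which(name) is not None
    return find_library(name) is not None


for pkg in required_packages(["cli"]):
    print(f"{pkg}: {'found' if is_installed(pkg) else 'missing'}")
```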
How to Configure Speech-to-Text and Text-to-Speech Providers
Voice mode configuration lives in ~/.hermes/config.yaml for provider settings and ~/.hermes/.env for API keys. The two halves of the pipeline, transcription (STT) and synthesis (TTS), are configured independently.
Speech-to-Text Options
Three STT providers are available, each with different tradeoffs:
Local Whisper runs the model on your machine. No API key, no recurring cost, and your audio never leaves the device. Speed depends on your hardware. The base model is fast and accurate for English. The large-v3 model handles multilingual input and accented speech better but needs more CPU or a GPU.
Groq Whisper sends audio to Groq's inference API using the whisper-large-v3-turbo model. Transcription typically returns in under half a second. Groq offers a free tier with rate limits that work fine for conversational use.
OpenAI Whisper uses OpenAI's hosted whisper-1 or the newer gpt-4o-transcribe model. Reliable and fast, but paid per minute of audio.
Add your provider choice to config.yaml:
stt:
  provider: "local"
  local:
    model: "base"
For cloud providers, add the API key to ~/.hermes/.env:
GROQ_API_KEY=your-key-here
VOICE_TOOLS_OPENAI_KEY=your-key-here
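A small preflight check can catch a missing key before the first recording. This is a sketch under assumptions: the provider strings `groq` and `openai` are guesses (only `local` appears in the config above), and `missing_stt_key` is a hypothetical helper, not part of Hermes.

```python
import os

# Env var each STT provider needs; "local" needs none.
# "groq" and "openai" as provider names are assumptions for this sketch.
STT_KEYS = {
    "local": None,
    "groq": "GROQ_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
}


def missing_stt_key(provider: str, env=os.environ):
    """Return the name of the env var that is required but unset, else None."""
    key = STT_KEYS[provider]
    if key is None or env.get(key):
        return None
    return key


print(missing_stt_key("local"))
print(missing_stt_key("groq", env={}))
```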
Text-to-Speech Options
Hermes supports ten TTS providers. The practical starting points:
Edge TTS is free, requires no API key, and offers 322 voices across 74 languages. Latency is around one second. This is the default and a solid choice for getting started.
ElevenLabs produces the most natural-sounding output with configurable voice profiles. It requires a paid API key and has slightly higher latency (around two seconds per utterance).
OpenAI TTS supports voices like alloy, echo, fable, onyx, nova, and shimmer through the gpt-4o-mini-tts model. Paid, but consistent quality.
Local options including NeuTTS, KittenTTS, and Piper run entirely on-device. No API key, no cost, and no network dependency. Quality is good but not at the level of ElevenLabs.
Configure your TTS provider in config.yaml:
tts:
  provider: "edge"
  speed: 1.0
  edge:
    voice: "en-US-AriaNeural"
For ElevenLabs or OpenAI, add the corresponding key to .env:
ELEVENLABS_API_KEY=your-key-here
VOICE_TOOLS_OPENAI_KEY=your-key-here
Choosing a Provider Combination
For zero cost: local Whisper (base model) plus Edge TTS. Neither needs an API key. Transcription runs on your machine, while Edge TTS calls an online service for synthesis, so this combination still needs a network connection.
For best quality: Groq Whisper (fast cloud transcription on a free tier) plus ElevenLabs TTS. Natural speech output with sub-second transcription.
For balanced speed: the same local Whisper (base) plus Edge TTS pairing responds within a few seconds on modern hardware. For a fully offline pipeline, swap Edge TTS for one of the local engines such as Piper.
Use Voice Mode in the CLI
CLI voice mode uses your computer's microphone for push-to-talk recording. Start the Hermes Agent CLI and enable voice:
hermes
/voice on
Press Ctrl+B (the default record key) to start recording. A beep confirms that the microphone is active. Speak naturally. When you stop talking, the system detects 3 seconds of continuous silence and plays two beeps to confirm the recording ended. Your speech is transcribed and sent to the agent, which processes it through the normal pipeline and optionally speaks the response.
To hear spoken responses, enable TTS output:
/voice tts
Available CLI Voice Commands
- /voice on enables voice input
- /voice off disables voice mode
- /voice tts toggles spoken output on or off
- /voice status shows the current voice configuration
Tuning Recording Behavior
The recording settings in config.yaml control how the push-to-talk system detects your speech:
voice:
  record_key: "ctrl+b"
  max_recording_seconds: 120
  auto_tts: false
  beep_enabled: true
  silence_threshold: 200
  silence_duration: 3.0
If recordings cut off too early, increase silence_duration to 4 or 5 seconds. If the system picks up background noise as speech, raise silence_threshold above 200. You can change the record key to something like ctrl+space if Ctrl+B conflicts with your terminal emulator.
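To see how these two settings interact, here is a minimal sketch of threshold-based end-of-speech detection. It assumes the threshold is compared against the RMS amplitude of 16-bit samples, which may not match what Hermes does internally. The function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def recording_length(frames, frame_seconds, silence_threshold=200,
                     silence_duration=3.0, max_recording_seconds=120):
    """Return how many seconds are recorded before the stop condition fires."""
    elapsed = quiet = 0.0
    for frame in frames:
        elapsed += frame_seconds
        # A frame below the threshold extends the quiet run; a loud one resets it.
        quiet = quiet + frame_seconds if rms(frame) < silence_threshold else 0.0
        if quiet >= silence_duration or elapsed >= max_recording_seconds:
            break
    return elapsed


loud, quiet_frame = [1000, -1000] * 50, [10, -10] * 50
# 2 s of speech followed by 5 s of silence, in 0.5 s frames.
frames = [loud] * 4 + [quiet_frame] * 10
print(recording_length(frames, frame_seconds=0.5))
```

With the defaults the recording stops 3 seconds into the silence. Raising `silence_duration` to 4.0 keeps the microphone open for one more second, which is the effect you want when pauses mid-sentence cut you off.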
The hallucination filter is worth knowing about. Whisper sometimes generates phantom transcriptions from silence or ambient noise, producing phrases like "Thank you for watching" or "Subscribe to my channel." Hermes filters 26 known phantom phrases automatically, so these false transcriptions never reach the agent.
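A filter of this kind can be as simple as a normalized set lookup. The sketch below is an assumption about the general approach, not the Hermes implementation, and it contains only the two phrases quoted above rather than the full list.

```python
import string

# Only the two phrases quoted in the text; the real list is longer.
PHANTOM_PHRASES = {"thank you for watching", "subscribe to my channel"}


def normalize(text: str) -> str:
    """Lowercase, trim, and drop punctuation so near-identical variants match."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def is_phantom(transcript: str) -> bool:
    norm = normalize(transcript)
    # Empty transcripts are dropped too; exact matches only, never substrings.
    return norm == "" or norm in PHANTOM_PHRASES


for t in ["Thank you for watching.", "Thank you for watching the build logs"]:
    print(repr(t), "->", "dropped" if is_phantom(t) else "kept")
```

Matching the whole normalized transcript, rather than searching for the phrase inside it, avoids discarding real requests that happen to contain the same words.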
How to Set Up Voice on Telegram and Discord
Messaging platform voice works through the Hermes gateway process, a single daemon that connects to all your configured messaging platforms simultaneously.
Telegram Voice Memos
If you already have the Hermes Telegram bot running, voice memos work with minimal additional configuration. Send a voice message in your Telegram chat with the bot, and Hermes transcribes it automatically using your configured STT provider, processes the request, and responds. If TTS output is enabled, the agent sends a voice bubble alongside its text response.
Enable voice replies in the Telegram chat:
/voice on
In /voice on mode, the agent sends voice replies only when you send a voice message. Switch to /voice tts to get spoken replies for every message, including text.
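The difference between the two modes comes down to one decision per reply. As a sketch, with mode labels chosen for this example:

```python
def reply_with_voice(mode: str, incoming_is_voice: bool) -> bool:
    """Decide whether a reply should include audio.

    mode is "off", "on", or "tts" (labels chosen for this sketch).
    """
    if mode == "tts":
        return True               # speak every reply, including replies to text
    if mode == "on":
        return incoming_is_voice  # mirror the user: voice in, voice out
    return False


for mode in ("off", "on", "tts"):
    print(mode, reply_with_voice(mode, True), reply_with_voice(mode, False))
```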
FFmpeg must be installed on the server running the gateway. Telegram voice memos use the Opus codec in OGG containers, and FFmpeg handles the conversion pipeline.
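If transcription of voice memos fails, it helps to run the conversion step by hand. The sketch below builds a standard FFmpeg command that decodes an OGG/Opus file to 16 kHz mono WAV, the sample rate Whisper models work with. The exact arguments Hermes passes are not documented here, so treat these as an assumption. The file names are placeholders.

```python
import subprocess


def ogg_to_wav_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that decodes OGG/Opus to mono PCM WAV."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it exists
        "-i", src,                # input voice memo
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the target rate
        dst,
    ]


def convert(src: str, dst: str) -> None:
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(ogg_to_wav_command(src, dst), check=True, capture_output=True)


print(" ".join(ogg_to_wav_command("memo.ogg", "memo.wav")))
```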
Discord Text Channel Voice
The Discord gateway supports the same /voice on and /voice tts commands in text channels and DMs. Send a voice message attachment in Discord, and the agent transcribes and responds. With /voice tts enabled, every response includes a voice attachment.
Discord Voice Channels
This is where voice mode becomes most distinctive. The agent joins a Discord voice channel as a participant in the group conversation: it listens to everyone speaking, transcribes each person's speech independently, processes their requests, and speaks responses back into the channel.
Bot Permissions
Your Discord bot needs additional permissions for voice channels beyond what text messaging requires:
- Connect to join voice channels
- Speak to play audio in the channel
- Use Voice Activity for speech detection
The permissions integer that covers both text and voice capabilities is 274881432640. Re-invite the bot with the updated permissions URL:
https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&scope=bot+applications.commands&permissions=274881432640
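The invite URL can be generated rather than edited by hand, and the integer can be checked against the voice permission bits. In this sketch the bit positions for Connect and Speak come from Discord's published permission flags, and `invite_url` is a helper written for this example.

```python
from urllib.parse import urlencode

PERMISSIONS = 274881432640
CONNECT = 1 << 20  # Discord permission bit for joining voice channels
SPEAK = 1 << 21    # Discord permission bit for playing audio


def invite_url(app_id: str, permissions: int = PERMISSIONS) -> str:
    query = urlencode({
        "client_id": app_id,
        "scope": "bot applications.commands",  # urlencode turns the space into "+"
        "permissions": permissions,
    })
    return f"https://discord.com/oauth2/authorize?{query}"


print(invite_url("YOUR_APP_ID"))
print("Connect:", bool(PERMISSIONS & CONNECT), "Speak:", bool(PERMISSIONS & SPEAK))
```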
Privileged Intents
In the Discord Developer Portal under Bot settings, enable:
- Presence Intent
- Server Members Intent (if using username-based access control)
- Message Content Intent (required)
Voice Channel Commands
Issue these commands in a text channel within the same server:
- /voice join tells the bot to join your current voice channel
- /voice leave disconnects the bot
- /voice status shows which channel the bot is connected to
When active in a voice channel, the bot listens to individual audio streams, detects when each person finishes speaking (1.5 seconds of silence after at least 0.5 seconds of speech), transcribes the audio, processes it through the agent pipeline, and speaks the response. Transcripts appear in the associated text channel as [Voice] @username: transcribed text.
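The end-of-utterance rule can be illustrated per speaker. The sketch below applies the two timings from the paragraph above to pre-labelled audio chunks. The data layout and the `utterance_complete` name are invented for the example, and a real implementation would also reset the speech counter after discarding noise.

```python
def utterance_complete(events, min_speech=0.5, end_silence=1.5):
    """events: list of (duration_seconds, is_speech) chunks for one speaker."""
    speech = silence = 0.0
    for duration, is_speech in events:
        if is_speech:
            speech += duration
            silence = 0.0          # any speech resets the trailing-silence timer
        else:
            silence += duration
        if speech >= min_speech and silence >= end_silence:
            return True
    return False


# Each speaker's stream is tracked independently.
streams = {
    "alice": [(0.8, True), (1.5, False)],   # spoke, then paused long enough
    "bob": [(0.2, True), (2.0, False)],     # too short to count as speech
    "carol": [(1.0, True), (1.0, False)],   # still mid-pause
}
for user, events in streams.items():
    print(user, utterance_complete(events))
```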
The echo prevention system automatically pauses audio listening while the bot plays TTS replies, so it doesn't transcribe its own speech and create a feedback loop.
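Conceptually this is a gate around the listener. A minimal sketch, assuming a simple playing flag (the `EchoGate` class is hypothetical and not taken from the Hermes source):

```python
class EchoGate:
    """Drops incoming audio while the bot's own TTS playback is in progress."""

    def __init__(self):
        self.playing = False
        self.accepted = []

    def start_playback(self):
        self.playing = True

    def finish_playback(self):
        self.playing = False

    def on_audio(self, chunk: bytes) -> bool:
        if self.playing:
            return False           # ignore: this may be the bot hearing itself
        self.accepted.append(chunk)
        return True


gate = EchoGate()
gate.on_audio(b"user question")
gate.start_playback()
gate.on_audio(b"bot reply leaking into the mic")
gate.finish_playback()
gate.on_audio(b"user follow-up")
print(gate.accepted)
```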
Store and Share Agent Output With Persistent Workspaces
Voice interactions generate the same kinds of output as text conversations: files, code, analysis, research notes. The challenge with any agent deployment is that this output lives on whatever server runs the agent. If that server resets or you want to share results with a team, the files are stuck in a local directory.
This is where a persistent workspace layer makes the difference. Rather than relying on the agent's local filesystem, you connect Hermes to an external workspace that outlasts any single session and provides sharing, access control, and search across everything the agent produces.
Local storage works for solo experimentation. A folder on the same machine running Hermes gives you zero-latency reads and writes. The limitation is obvious: nothing survives a machine reset, and sharing means manual file transfers.
Cloud storage services like S3, Google Drive, or Dropbox provide durability but lack the intelligence layer. Files are stored but not indexed, not searchable by meaning, and not accessible through a unified API that an agent can call directly.
Fast.io provides workspaces built for this pattern. The free agent plan includes 50 GB of storage, 5,000 monthly credits, and 5 workspaces with no credit card required. Files uploaded to a Fast.io workspace are automatically indexed for semantic search through Intelligence Mode, which means you can ask questions about documents your agent has produced and get cited answers. The MCP server exposes 19 consolidated tools that an agent can call directly for storage, search, sharing, and workflow operations.
The ownership transfer model is particularly useful for agent-generated content. Hermes creates files during voice sessions, stores them in a Fast.io workspace, and you access the same files through the web UI or API. When you want to hand the workspace to a client or team member, transfer ownership without losing your admin access. The agent builds the workspace, the human receives it.
For teams running Hermes Agent across multiple platforms, Fast.io workspaces act as the shared layer where voice session transcripts, generated documents, and research outputs all land in one searchable location. Whether someone triggered the agent by voice on Discord, by text on Telegram, or through the CLI, the output ends up in the same workspace.
Frequently Asked Questions
Does Hermes Agent support voice interaction?
Yes. Hermes Agent supports voice input and output across four surfaces. The CLI offers push-to-talk recording with configurable hotkeys. Telegram and Discord support voice memos that are automatically transcribed. Discord voice channels allow the agent to join as a live participant, listening to and speaking with everyone in the channel. All voice interactions pass through the same agent pipeline as text, so the agent retains full access to tools, memory, and skills.
How do I enable voice mode in the Hermes Agent CLI?
Start the CLI with the `hermes` command, then type `/voice on` to enable voice input. Press Ctrl+B to start recording (configurable in config.yaml). The system detects silence automatically and transcribes your speech. Add `/voice tts` to hear spoken responses. You need the voice extras installed: `pip install "hermes-agent[voice]"` plus PortAudio for microphone access.
Can Hermes Agent join Discord voice channels?
Yes. The agent connects to a Discord voice channel using the `/voice join` command issued in a text channel. It listens to individual speakers, transcribes their speech in real time, processes requests through the full agent pipeline, and speaks responses back into the channel. The bot needs Connect, Speak, and Use Voice Activity permissions, plus the Opus codec installed on the server.
What platforms support Hermes Agent voice?
Voice works on four surfaces across three platforms. The CLI uses push-to-talk with your computer's microphone. Telegram transcribes voice memos sent to the bot and can reply with voice bubbles. Discord supports voice messages in text channels and live participation in voice channels. The underlying STT and TTS pipeline is shared across all of them, so your provider configuration applies everywhere.
What speech-to-text providers does Hermes Agent support?
Three providers are available. Local Whisper runs on your machine using the faster-whisper library with no API key or cost. Groq offers cloud-hosted whisper-large-v3-turbo with a free tier and sub-second transcription latency. OpenAI provides hosted whisper-1 and gpt-4o-transcribe models on a paid basis. All three support multilingual transcription.
Is Hermes Agent voice mode free to use?
The agent itself is open-source under the MIT license. Voice mode can run at zero cost using local Whisper for transcription and Edge TTS for speech synthesis. Both work without API keys, though Edge TTS needs a network connection. Premium providers like ElevenLabs and OpenAI TTS produce higher-quality audio but require paid API keys. The choice depends on your quality requirements and whether you want fully offline operation, which calls for a local TTS engine instead of Edge TTS.