Best OpenClaw Skills for Voice AI Developers

OpenClaw ships with native support for 14 TTS providers and 8 STT engines, but the real power comes from community skills on ClawHub that extend those capabilities into voice calling, podcast generation, speaker diarization, and voice cloning. This guide covers the best voice AI skills for OpenClaw developers, organized by function, with setup details and practical use cases for each.

By the Fast.io Editorial Team (9 min read)

How OpenClaw Handles Voice Natively

Before installing any skills, it helps to understand what OpenClaw already provides out of the box. The platform treats voice as a first-class I/O channel with two built-in subsystems: text-to-speech (TTS) for outbound audio and speech-to-text (STT) for inbound voice notes.

On the TTS side, OpenClaw supports over a dozen providers including OpenAI, ElevenLabs, Azure Speech, and Google Gemini, plus several local CLI options that need no API key. TTS is off by default and enabled per-agent through your config. The platform automatically selects the right audio format based on the delivery channel, so voice notes on messaging platforms get compressed formats while telephony connections receive raw streams.

For STT, OpenClaw offers both cloud and local processing paths. Cloud providers include OpenAI, Deepgram, Google, and ElevenLabs among others. Local options let you run transcription entirely on your own hardware with no API calls. When no provider is set explicitly, OpenClaw auto-detects the first working option from a built-in fallback chain.
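The fallback behavior described above can be sketched roughly as follows. This is an illustration of the general "first working option" pattern, not OpenClaw's actual implementation: the provider names, the probe functions, and the order of the chain are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical (name, availability probe) pairs. OpenClaw's real chain,
# its order, and its provider identifiers are not shown here.
Probe = Tuple[str, Callable[[], bool]]


def first_working(chain: Sequence[Probe]) -> Optional[str]:
    """Return the name of the first provider whose probe succeeds."""
    for name, is_available in chain:
        try:
            if is_available():
                return name
        except Exception:
            # A probe that raises is treated the same as "not available".
            continue
    return None


def _missing_binary() -> bool:
    raise RuntimeError("binary not found")


chain = [
    ("local-cli", _missing_binary),  # e.g. a local tool that is not installed
    ("cloud-a", lambda: False),      # e.g. no API key configured
    ("cloud-b", lambda: True),       # first option that actually works
]
selected = first_working(chain)
print(selected)
```

Setting a provider explicitly bypasses this kind of probing, which is usually preferable in production where you want the same engine on every run.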

The native audio subsystem also handles file size limits, transcript trimming, and chat-type gating for transcription. This layer covers basic voice I/O. The skills below extend it into more specialized territory.

Best Speech-to-Text Skills for OpenClaw

Native STT covers the common case of transcribing a single voice note. These skills go further with batch processing, speaker identification, local privacy, and specialized transcription models.

1. faster-whisper

The faster-whisper skill uses a reimplementation of OpenAI's Whisper model built on CTranslate2, which runs roughly 4 to 6 times faster than the original at comparable accuracy. On a GPU, expect around 20x realtime transcription, so a 10-minute recording finishes in about 30 seconds.

Key strengths:

  • Batch transcription across directories or glob patterns
  • Speaker diarization for multi-speaker recordings
  • SRT/VTT subtitle generation with word-level timestamps
  • 99+ languages with automatic detection
  • Fully offline, no API costs

Best for: Developers processing meeting recordings, interviews, or lecture archives locally. The batch processing and diarization features make it the strongest choice for high-volume transcription work.

Install: Available on ClawHub; add it through the OpenClaw skill manager.
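To make the subtitle output concrete, here is a minimal sketch of how timestamped segments turn into SRT text. It is not the skill's own code: the `(start, end, text)` tuples are a stand-in for whatever the transcription backend returns, and the function names are illustrative.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp form SRT uses."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cue = f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        blocks.append(cue)
    return "\n".join(blocks)


print(to_srt([(0.0, 1.25, " Hello "), (1.25, 2.0, "world")]))
```

VTT differs mainly in using a period instead of a comma before the milliseconds and a `WEBVTT` header line, so the same segment data can feed both formats.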

2. OpenAI Whisper (steipete/openai-whisper)

This skill runs OpenAI's Whisper model locally on your machine. Audio never leaves your system, which makes it the go-to option for privacy-sensitive workflows. It handles single-file transcription and subtitle generation without requiring an API key for inference once the model is downloaded.

Key strengths:

  • Complete offline operation after initial model download
  • Video subtitle and caption generation
  • No ongoing API costs

Best for: Privacy-first voice workflows where data cannot leave the local machine. Simpler than faster-whisper but with a smaller feature surface.

3. assemblyai-transcribe

For cloud-based transcription with advanced features, the AssemblyAI skill provides access to their Universal-3 Pro model. One developer tutorial on building voice agents with OpenClaw specifically used AssemblyAI for its keyterm prompting, which lets you teach the model custom names and terminology before transcription.

Key strengths:

  • Keyterm prompting for domain-specific vocabulary
  • Audio event tagging (laughter, background noise)
  • PII redaction built into the transcription pipeline
  • Speaker attribution for multi-party audio

Best for: Production voice agents handling sensitive data where you need both accuracy on domain terms and automatic PII handling.


Text-to-Speech Skills

OpenClaw's native TTS covers basic reply-to-audio. These skills add voice cloning, sound effects, multilingual personas, and podcast-style output.

4. elevenlabs-skill (odrobnik/elevenlabs-skill)

The most feature-complete TTS skill on ClawHub. It wraps the full ElevenLabs API surface into a single skill with text-to-speech, sound effects generation, music creation, voice management, and usage tracking.

Key strengths:

  • 18 curated voice personas for different use cases
  • 32 languages via the multilingual v2 model
  • Streaming mode for real-time audio output
  • AI-generated sound effects from text prompts
  • Voice design to create and save custom voices
  • Pronunciation dictionary for custom word rules
  • Character usage tracking for cost management

Best for: Developers building voice-first applications who need production-quality voices with fine-grained control over pronunciation, persona, and language.

Requires: ELEVENLABS_API_KEY in environment

5. elevenlabs-agents (PennyroyalTea/elevenlabs-agents)

While the elevenlabs-skill focuses on audio generation, the elevenlabs-agents skill integrates ElevenLabs with OpenClaw's agent loop for voice calling. Composio lists it among the top 10 OpenClaw skills. The skill implements a failsafe pattern: if email or text delivery fails, the agent automatically initiates a phone call instead.

Key strengths:

  • Voice-based task automation (reservations, customer service calls)
  • Automatic fallback from text to voice call on delivery failure
  • Hands-free task summaries and status updates

Best for: Agents that need to reach humans by voice when text channels fail. The failsafe call pattern is particularly useful for alert systems and appointment reminders.

6. podcastifier

This skill converts incoming text, such as emails or newsletters, into short TTS podcasts. It parses plain text or HTML input, extracts key points, generates TTS audio per chunk with character-limit safety, and concatenates the segments with ffmpeg.

Best for: Turning long-form written content into listenable audio briefings. Useful for agents that digest information on behalf of users who prefer audio.
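The chunk-then-concatenate pipeline can be sketched as below. This is an assumption-laden illustration rather than podcastifier's code: the sentence-splitting rule, the function names, and the file names are made up for the example. The ffmpeg arguments use the concat demuxer with stream copy, which only works when every segment shares the same codec and parameters.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def concat_list(paths: List[str]) -> str:
    """Build the list file that ffmpeg's concat demuxer reads."""
    return "".join(f"file '{p}'\n" for p in paths)


def concat_command(list_file: str, output: str) -> List[str]:
    """Arguments for joining segments without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output]


chunks = chunk_text("One. Two. Three.", max_chars=9)
print(chunks)
print(concat_list(["seg_000.mp3", "seg_001.mp3"]), end="")
print(" ".join(concat_command("segments.txt", "episode.mp3")))
```

Each chunk would be sent to the TTS provider separately, so the character limit should sit comfortably below whatever per-request cap that provider enforces.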

Fastio features

Give your voice agents a persistent workspace

Store transcripts, audio files, and agent output in a shared workspace with built-in search and version control. 50GB free, no credit card, MCP-ready.

How to Add Voice Calling to OpenClaw Agents

Voice calling turns an OpenClaw agent from a text-based assistant into something closer to a phone-accessible colleague. These skills handle the telephony layer.

7. DeepClaw (Deepgram)

Deepgram's official OpenClaw integration gives your agent a phone number and the ability to make and receive calls. DeepClaw uses Deepgram's Flux model for speech-to-text with semantic turn detection, which recognizes when a speaker finishes a thought rather than just detecting silence. For TTS, it uses Aura-2 with 90ms time-to-first-byte, fast enough to feel conversational.

Key strengths:

  • Dedicated phone number for inbound and outbound calls
  • Cross-channel memory linking calls and text messages in the same agent instance
  • Proactive callback functionality where the agent initiates contact
  • Full tool access during calls, so the agent can search the web, run code, or execute other skills while on the phone

Best for: Developers who want users to interact with their agent by voice call. Semantic turn detection keeps conversations natural, where a fixed silence threshold tends to either cut speakers off or leave awkward gaps.

DeepClaw Hosted is available as a free experimental offering from Deepgram Labs, letting you test the integration without provisioning your own telephony infrastructure.
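For contrast, the fixed-silence approach that semantic turn detection is meant to improve on can be sketched in a few lines. This is the simple baseline, not Deepgram's method; the energy values, threshold, and frame counts are invented for illustration.

```python
from typing import Optional, Sequence


def silence_endpoint(
    energies: Sequence[float], threshold: float, min_silent_frames: int
) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None.

    The turn ends once speech has been heard and is followed by
    min_silent_frames consecutive frames below the energy threshold.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i
    return None


# Speech, a one-frame pause mid-thought, more speech, then real silence.
frames = [0.0, 0.8, 0.9, 0.1, 0.7, 0.1, 0.1, 0.1]
print(silence_endpoint(frames, threshold=0.5, min_silent_frames=1))
print(silence_endpoint(frames, threshold=0.5, min_silent_frames=3))
```

A short window ends the turn at the mid-thought pause, while a long window waits through the full silence before responding. That trade-off between interrupting and lagging is what a detector based on meaning rather than silence length tries to avoid.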

8. donotify-voice-call-reminder

A simpler voice calling skill that sends immediate voice call reminders or schedules future calls through the DoNotify service. Where DeepClaw provides full bidirectional voice conversation, this skill handles one-directional outbound voice notifications.

Best for: Reminder systems, appointment confirmations, and alert escalation where the agent needs to deliver a message by phone without a full conversation.


Voice Cloning and Audio Generation

Beyond standard TTS, these skills handle voice replication and creative audio output.

9. clonev

Voice cloning and speech generation using XTTS v2. This skill lets you create a synthetic voice from a sample recording and then generate speech in that voice. The output maintains the tonal characteristics and speaking patterns of the source material.

Best for: Developers building branded voice experiences where a consistent, recognizable voice identity matters. Also useful for accessibility applications where a user's own voice needs to be preserved digitally.

10. eachlabs-voice-audio

A multi-function audio skill that combines TTS, STT, and voice conversion in one package. The voice conversion capability sets it apart: you can transform audio from one voice into another while preserving the original speech content and timing.

Best for: Creative and media workflows where voice transformation, not just voice synthesis, is the goal.

Comparison Table

| Skill | Function | Cloud/Local | Key Differentiator |
| --- | --- | --- | --- |
| faster-whisper | STT | Local | Batch processing, diarization, 4-6x faster |
| OpenAI Whisper | STT | Local | Simplest privacy-first option |
| assemblyai-transcribe | STT | Cloud | PII redaction, keyterm prompting |
| elevenlabs-skill | TTS | Cloud | 18 personas, 32 languages, sound effects |
| elevenlabs-agents | Voice calling | Cloud | Failsafe text-to-call pattern |
| podcastifier | TTS | Local+Cloud | Text-to-podcast conversion |
| DeepClaw | Voice calling | Cloud | Bidirectional phone calls, semantic turn detection |
| donotify-voice-call-reminder | Voice calling | Cloud | Scheduled outbound call reminders |
| clonev | Voice cloning | Local | XTTS v2 voice replication |
| eachlabs-voice-audio | Multi-function | Cloud | Voice conversion between speakers |

Storing and Sharing Voice Agent Output with Fast.io

Voice AI agents generate files constantly: transcriptions, audio recordings, subtitle files, podcast episodes. Keeping those artifacts organized and accessible to both agents and humans is a separate problem from generating them.

Local storage works during development, but breaks down when agents run across sessions, when multiple agents collaborate, or when a human needs to review what the agent produced. S3 or Google Cloud Storage handles raw persistence, but requires custom code for access control, search, and handoff.

Fast.io provides a workspace layer designed for this pattern. Agents write files through the MCP server, which exposes 19 consolidated tools via Streamable HTTP at /mcp. Files uploaded to a workspace are automatically indexed by Intelligence Mode, making transcripts and audio searchable by meaning without a separate vector database.
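As a rough picture of what talking to an MCP endpoint over HTTP involves, the sketch below builds (but does not send) the JSON-RPC 2.0 message a client would POST to list the available tools. The base URL is a placeholder, authentication is omitted, and a real client would first complete the MCP initialize handshake; in practice an MCP client library handles all of this.

```python
import json
import urllib.request


def build_tools_list_request(base_url: str, request_id: int = 1) -> urllib.request.Request:
    """Build the POST request that asks an MCP server which tools it exposes."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/mcp",  # endpoint path named in the text
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP servers may answer with JSON or an event stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


# Placeholder host; substitute the real server address and add credentials.
req = build_tools_list_request("https://example.invalid")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

The request is only constructed here, so the example runs without network access or credentials.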

For voice AI workflows specifically:

  • Transcript storage: STT output lands in a shared workspace where team members search by content, not filename
  • Audio file management: Generated audio from TTS skills gets versioned automatically. Previous takes persist alongside the latest version
  • Ownership transfer: An agent builds a workspace of transcripts, recordings, and summaries, then transfers ownership to a human. The agent keeps admin access for ongoing updates
  • Audit trail: Every file operation is logged, which matters for voice data that may contain sensitive content

The free agent tier includes 50GB storage, 5,000 credits per month, and 5 workspaces with no credit card required. That is enough capacity for most voice AI development workflows, since even high-quality audio files are relatively small compared to video or large datasets.
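A quick back-of-the-envelope calculation shows how far 50GB goes for audio. The bitrates below are common illustrative values, not figures from Fast.io or any particular skill, and the calculation assumes decimal gigabytes and ignores transcripts and file overhead.

```python
def hours_of_audio(storage_bytes: float, bitrate_kbps: float) -> float:
    """Hours of constant-bitrate audio that fit in the given storage."""
    bytes_per_hour = bitrate_kbps * 1000 / 8 * 3600
    return storage_bytes / bytes_per_hour


FREE_TIER_BYTES = 50 * 10**9  # 50 GB, assuming decimal gigabytes

# 128 kbps is a typical MP3 bitrate; 256 kbps equals 16-bit, 16 kHz mono PCM.
for label, kbps in [("128 kbps MP3", 128), ("16 kHz 16-bit mono WAV", 256)]:
    print(f"{label}: about {hours_of_audio(FREE_TIER_BYTES, kbps):.0f} hours")
```

Even uncompressed speech-quality recordings come to hundreds of hours under these assumptions, which is why storage is rarely the bottleneck during development.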

OpenClaw agents can connect to Fast.io workspaces through the MCP server and read, write, and query files alongside other tools in their skill stack. The workspace becomes the shared surface where agent-generated voice content meets human review.

Frequently Asked Questions

How do I add voice capabilities to an OpenClaw agent?

OpenClaw has built-in support for 14 TTS providers and 8 STT engines. For basic voice output, set messages.tts.provider in your config to a supported provider like elevenlabs or openai, along with the matching API key. For STT, configure tools.media.audio with your preferred provider. For more advanced voice features like calling, cloning, or podcast generation, install a specialized skill from ClawHub through the OpenClaw skill manager.
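The dotted names above are paths into a nested config structure. The sketch below shows how such paths expand, using Python only to illustrate the shape. The key `messages.tts.provider` comes from the text; the `provider` sub-key under `tools.media.audio`, the chosen provider values, and the JSON rendering are assumptions, so check the OpenClaw configuration reference for the real schema.

```python
import json
from typing import Any, Dict


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a value at a dotted path, creating intermediate dicts as needed."""
    parts = dotted_key.split(".")
    node = config
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value


config: Dict[str, Any] = {}
set_dotted(config, "messages.tts.provider", "elevenlabs")
# Assumed sub-key: the text only names the tools.media.audio section.
set_dotted(config, "tools.media.audio.provider", "deepgram")
print(json.dumps(config, indent=2))
```

API keys are deliberately left out of the structure; the skills in this guide read them from environment variables such as ELEVENLABS_API_KEY.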

What is the best text-to-speech skill for OpenClaw?

The elevenlabs-skill by odrobnik is the most feature-complete TTS option on ClawHub, with 18 voice personas, 32 languages, streaming mode, sound effects generation, and pronunciation dictionaries. For simpler needs, OpenClaw's built-in TTS with any of its 14 supported providers works without installing additional skills. If you need offline TTS on macOS, the mac-tts skill uses the built-in say command with no API key required.

Can OpenClaw agents make voice calls?

Yes. The DeepClaw skill from Deepgram gives your agent a dedicated phone number for inbound and outbound calls, with semantic turn detection that makes conversations feel natural. The elevenlabs-agents skill adds a failsafe calling pattern where the agent falls back to a phone call when text delivery fails. For simpler outbound-only calls, the donotify-voice-call-reminder skill handles scheduled voice notifications.

What is the difference between faster-whisper and OpenAI Whisper skills?

Both run locally without cloud API costs, but faster-whisper uses a CTranslate2 reimplementation that runs 4 to 6 times faster than the original Whisper model. It also adds batch processing across directories, speaker diarization, and SRT/VTT subtitle generation with word-level timestamps. The OpenAI Whisper skill is simpler to set up and focuses on single-file transcription with strong privacy guarantees.

Which OpenClaw voice skills work completely offline?

The faster-whisper skill, OpenAI Whisper skill, and mac-tts skill all work without cloud API calls. faster-whisper and OpenAI Whisper handle speech-to-text locally using downloaded models, while mac-tts uses the macOS built-in say command for text-to-speech. For STT, OpenClaw also supports local CLI providers like whisper-cli and sherpa-onnx-offline without any skill installation.

How do I store transcripts and audio files from OpenClaw voice agents?

For team workflows, a shared workspace like Fast.io lets agents write transcripts and audio through MCP tools, with automatic indexing that makes content searchable by meaning. Files are versioned, access-controlled, and transferable to human team members. For local development, OpenClaw writes output to the local filesystem by default, and you can configure specific output directories in your agent's workspace settings.
