voicebox


An open‑source, local AI voice studio that clones voices, generates speech in 23 languages across seven TTS engines, and provides dictation and agent output—all running on your own machine.


TypeScript · Latest v0.5.0 · 3mo ago

Features

Clone any voice from a few seconds of audio
Generate speech in 23 languages using seven integrated TTS engines
Dictate into any app with global hotkeys and Whisper‑based STT
Give AI agents a custom voice via an API‑first MCP server

Recent releases


v0.5.0 · New feature · 3mo ago

Notable features

Global hotkey dictation with auto-paste to focused app
Voice profiles with personality LLM for in-character speech and composition
MCP server for agent integration (Claude Code, Cursor, Windsurf, Cline)

Full changelog

The Capture release.

Voicebox stops being just a voice-cloning studio and becomes a full AI voice studio. Hold a key anywhere on your machine, speak, release — the transcript lands in the focused text field. Flip the primitive around and any MCP-aware agent — Claude Code, Cursor, Spacebot — speaks back through an on-screen pill in one of your cloned voices. A local LLM sits between the two, so transcripts come out clean and voice profiles can carry a personality that reshapes what the agent says before it gets spoken.

Dictation — speak anywhere, paste anywhere

Global hotkey capture. Hold a customizable chord anywhere on your machine (defaults: right-Cmd + right-Option on macOS, right-Ctrl + right-Shift on Windows), speak, release. A floating on-screen pill walks through recording → transcribing → refining → done with a live elapsed timer. The transcript lands as clean text.
Push-to-talk and toggle modes, each with its own chord. The default toggle chord adds Space to the push-to-talk chord. Holding PTT and tapping Space mid-hold upgrades a hold into a hands-free session without a gap in the recording.
Auto-paste into the focused app. Once transcription finishes, Voicebox synthesizes a paste into whatever text field had focus when you started the chord — not wherever focus drifted while you were talking. Works across Dvorak / AZERTY layouts. Your clipboard is saved before and restored after.
Chord picker UI. Customize either chord from Settings → Captures by holding the keys you want. Left/right modifier badges show whether a key is the left or right variant.
Defaults stay out of your way. macOS defaults avoid left-hand Cmd+Option chords so the system shortcuts they collide with stay yours. Windows defaults route around AltGr collisions on German / French / Spanish layouts.
Accessibility permission is scoped. If macOS Accessibility isn't granted, dictation still runs and transcripts still land in the Captures tab — only synthetic paste is disabled. The permission prompt lives inline next to the auto-paste toggle, not as a global banner.
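The pill's recording → transcribing → refining → done progression can be pictured as a small state machine. The sketch below is illustrative only: the state names follow the release notes, but the event names, the `idle` state, and both function names are assumptions, not Voicebox's actual types.

```typescript
// Hypothetical names throughout; only the four visible states come from the notes.
type PillState = "idle" | "recording" | "transcribing" | "refining" | "done";
type PillEvent = "chordDown" | "chordUp" | "transcribed" | "refined" | "dismiss";

// Allowed transitions, mirroring recording → transcribing → refining → done.
const pillTransitions: Record<PillState, Partial<Record<PillEvent, PillState>>> = {
  idle: { chordDown: "recording" },
  recording: { chordUp: "transcribing" },
  transcribing: { transcribed: "refining" },
  refining: { refined: "done" },
  done: { dismiss: "idle", chordDown: "recording" },
};

// Returns the next state, or the current state when the event does not apply.
function nextPillState(state: PillState, event: PillEvent): PillState {
  return pillTransitions[state][event] ?? state;
}

// Formats elapsed milliseconds as m:ss for the live timer.
function formatElapsed(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${seconds < 10 ? "0" : ""}${seconds}`;
}
```

Ignoring events that do not apply to the current state keeps a stray key release during transcription from moving the pill backwards.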

Personality — voice profiles that speak for themselves

Voice profiles now carry an optional personality — a free-form description of who this voice is, up to 2000 characters. When set, two new controls appear next to the generate button, each powered by a new Qwen3 LLM running entirely locally:

Compose — the shuffle button drops a fresh in-character line into the textarea. Click again for variety, edit before speaking.
Speak in character — the wand toggle runs your input through the personality LLM before TTS, preserving every idea but delivering it in the character's voice.

The same LLM doubles as the refinement model, so there's one local LLM in the app, not two.

API surface. POST /generate, POST /speak, and the MCP voicebox.speak tool accept personality: bool. POST /profiles/{id}/compose powers the shuffle button. MCP client bindings carry a default_personality: bool that applies when personality isn't passed explicitly.
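As a rough illustration of how the personality flag might be sent to POST /speak, the sketch below builds a request descriptor. Only the route and the personality field come from the release notes; the text and profile field names, the base URL, and the helper names are assumptions made for the example.

```typescript
// Assumed request shape: only `personality` is documented in the notes.
interface SpeakBody {
  text: string;          // assumed field name
  profile?: string;      // assumed field name
  personality?: boolean; // documented flag
}

// Models the documented rule: a binding's default_personality applies only
// when personality is not passed explicitly.
function effectivePersonality(explicit: boolean | undefined, bindingDefault: boolean): boolean {
  return explicit ?? bindingDefault;
}

interface RequestDescriptor {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a POST /speak request. `baseUrl` is an assumption:
// the notes give the MCP endpoint, not the REST base URL.
function speakRequest(baseUrl: string, body: SpeakBody): RequestDescriptor {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/speak`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}
```

The descriptor can be handed to any HTTP client; keeping construction separate from sending makes the payload easy to inspect.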

Agents — any MCP-aware agent gets a voice

Voicebox ships a built-in Model Context Protocol server at http://127.0.0.1:17493/mcp so Claude Code, Cursor, Windsurf, Cline, VS Code MCP extensions — any MCP-aware agent — can call into your local Voicebox install. Four tools ship with dotted names:

voicebox.speak — speak text in any voice profile, with optional personality: true to run through the profile's personality LLM first
voicebox.transcribe — Whisper transcription of a base64 blob or an absolute local path. Path mode is restricted to loopback callers so a Voicebox bound on 0.0.0.0 doesn't double as an unauthenticated arbitrary-local-file read primitive.
voicebox.list_captures — recent captures with their transcripts
voicebox.list_profiles — available voice profiles (cloned + preset)
Streamable HTTP as primary transport. Cursor, Windsurf, VS Code, and Claude Code all support it out of the box: add an mcpServers block with the URL and an X-Voicebox-Client-Id header.
Stdio shim for clients that don't speak HTTP MCP. A voicebox-mcp binary ships inside the app bundle as a Tauri sidecar. The Settings page renders the install snippet with the right absolute path pre-filled.
Per-client voice binding. Pin Claude Code to Morgan, Cursor to Scarlett, Cline to its own voice — the X-Voicebox-Client-Id header resolves to a bound voice whenever speak is called without an explicit profile. Managed in Settings → MCP.
Profile resolution precedence. Explicit profile arg (name or id, case-insensitive) → per-client binding → global default from capture_settings.default_playback_voice_id → error with a pointer to Settings.
Speaking pill. Agent-initiated speech surfaces the same on-screen pill as dictation, in a speaking state with the profile name and an elapsed timer. Silent background TTS is a trust hazard — the pill always shows what's coming out of your machine.
POST /speak REST wrapper. Same code path and voice resolution for shell scripts, ACP, A2A, GitHub Actions, or anything else that isn't MCP-native.
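The profile resolution precedence described above can be sketched as a single function. This is an assumption-laden illustration, not Voicebox's code: the data shapes and names are hypothetical, and treating an unknown explicit profile as an error (rather than falling through) is a guess.

```typescript
interface VoiceProfile {
  id: string;
  name: string;
}

interface ResolutionContext {
  profiles: VoiceProfile[];
  clientBindings: Record<string, string>; // client id → profile id (hypothetical shape)
  defaultPlaybackVoiceId?: string;        // stands in for capture_settings.default_playback_voice_id
}

// Case-insensitive lookup by name or id.
function findProfile(profiles: VoiceProfile[], nameOrId: string): VoiceProfile | undefined {
  const needle = nameOrId.toLowerCase();
  return profiles.find((p) => p.id.toLowerCase() === needle || p.name.toLowerCase() === needle);
}

// Explicit profile → per-client binding → global default → error.
function resolveProfile(ctx: ResolutionContext, explicit?: string, clientId?: string): VoiceProfile {
  if (explicit !== undefined) {
    const match = findProfile(ctx.profiles, explicit);
    // Assumption: an unknown explicit profile fails instead of falling through.
    if (match === undefined) throw new Error(`Unknown voice profile: ${explicit}`);
    return match;
  }
  const bound: string | undefined = clientId !== undefined ? ctx.clientBindings[clientId] : undefined;
  for (const id of [bound, ctx.defaultPlaybackVoiceId]) {
    if (id === undefined) continue;
    const match = findProfile(ctx.profiles, id);
    if (match !== undefined) return match;
  }
  throw new Error("No voice could be resolved; choose a default voice in Settings.");
}
```

With Claude Code bound to Morgan and Scarlett as the global default, a call from an unbound client such as Cline would resolve to Scarlett under this sketch.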

Claude Code one-liner:

claude mcp add voicebox --transport http --url http://127.0.0.1:17493/mcp --header "X-Voicebox-Client-Id: claude-code"
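For clients configured through a JSON mcpServers block rather than a CLI, the equivalent entry might look like the output of the sketch below. The endpoint URL and header name come from the release notes; the `url` and `headers` key names are an assumption based on common client conventions, so check each client's documentation before relying on them.

```typescript
interface McpServerEntry {
  url: string;
  headers: Record<string, string>;
}

// Builds an mcpServers block for one client id. Key names are assumed,
// not taken from Voicebox's Settings page.
function mcpServersBlock(clientId: string): { mcpServers: Record<string, McpServerEntry> } {
  return {
    mcpServers: {
      voicebox: {
        url: "http://127.0.0.1:17493/mcp",
        headers: { "X-Voicebox-Client-Id": clientId },
      },
    },
  };
}

// Pretty-printed JSON, ready to paste into a client's MCP config file.
const cursorConfigJson: string = JSON.stringify(mcpServersBlock("cursor"), null, 2);
```

The client id passed here is what per-client voice binding keys on, so each agent should use a distinct value.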

Refinement

A clean transcript needs more than Whisper. Each capture flows through a small Qwen3 LLM that strips fillers, fixes punctuation, and optionally rewrites self-corrections — all on-device.

Loop-stripping before the LLM sees the transcript. Whisper's "thanks for watching thanks for watching thanks for watching…" hallucination loops are collapsed at a six-identical-tokens threshold (case-insensitive) so a small refinement model can't echo them back. Coverage spans single-word runs, multi-word phrases, CJK character runs, and Japanese emphasis patterns; legitimate repetition ("no, no, no, no, no") doesn't cross the threshold.
Per-capture flag snapshot. smart_cleanup, self_correction, and preserve_technical are stored on each capture, so refinement can be re-run later with different flags without losing the raw transcript.
Model picker. Qwen3 0.6B (400 MB, very fast), 1.7B (1.1 GB, fast), or 4B (2.5 GB, full quality). 0.6B is the default; 1.7B is the sweet spot for transcripts with code identifiers.
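The loop-stripping step can be approximated with a short token-run collapser. The sketch below is an illustration under stated assumptions, not the shipped implementation: it reads the threshold as six consecutive repeats of the same unit, handles whitespace-separated words and phrases of up to four words, and does not cover CJK character runs or Japanese emphasis patterns.

```typescript
// Lowercases and drops trailing punctuation so "Watching," matches "watching".
function normalizeToken(token: string): string {
  return token.toLowerCase().replace(/[.,!?;:…]+$/, "");
}

// True when the n-gram starting at `a` equals the n-gram starting at `b`.
function sameGram(norm: string[], a: number, b: number, n: number): boolean {
  if (b + n > norm.length) return false;
  for (let k = 0; k < n; k++) {
    if (norm[a + k] !== norm[b + k]) return false;
  }
  return true;
}

// Collapses any unit (1..maxPhraseLen words) repeated `threshold` or more
// times in a row down to a single occurrence.
function collapseLoops(text: string, threshold = 6, maxPhraseLen = 4): string {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const norm = tokens.map(normalizeToken);
  const out: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    let collapsed = false;
    for (let n = 1; n <= maxPhraseLen && !collapsed; n++) {
      let repeats = 1;
      while (sameGram(norm, i, i + repeats * n, n)) repeats++;
      if (repeats >= threshold) {
        out.push(...tokens.slice(i, i + n)); // keep one copy of the unit
        i += repeats * n;
        collapsed = true;
      }
    }
    if (!collapsed) {
      out.push(...tokens.slice(i, i + 1));
      i++;
    }
  }
  return out.join(" ");
}
```

Under these assumptions, seven repeats of "thanks for watching" collapse to one, while "no, no, no, no, no" has only five repeats and passes through unchanged.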

Captures tab + settings

Settings → Captures is now the home for the whole dictation flow:

Dictation: global shortcut toggle, push-to-talk chord picker, toggle chord picker, live pill preview, auto-paste into focused field (with inline accessibility prompt).
Transcription: model picker (Whisper Base / Small / Medium / Large / Turbo), language lock.
Refinement: auto-refine toggle, model picker, smart cleanup, remove self-corrections, preserve technical terms.
Playback: default voice for the Captures tab's "Play as" action — picking a voice from the split-button persists the choice across tab switches and restarts.
Storage: captures folder quick-open.

Stories — timeline editor

The Stories tab graduates from a TTS sequencer into a real timeline editor. Same generation-row backing, but clips now compose with imported audio, per-clip levels, and a flexible track stack.

Import external audio. Drag a music file onto the story content area or pick one from the new "Import audio" entry in the add-clip popover. Accepted formats: wav / mp3 / flac / ogg / m4a / aac / webm, capped at 200 MB. Imported clips show their filename instead of a profile name and skip the regenerate / version-picker controls — there's nothing to regenerate.
Per-clip volume. A Volume2 icon in the clip-edit toolbar opens a 0–200% slider. Adjustments apply live and to exports. Split and duplicate carry the volume forward into the new clips.
Regenerate from both the clip's chat-list dropdown and the track-editor toolbar. Re-runs the underlying generation through the same path the History tab uses, with completion tracked in the global pending set.
Add empty tracks above or below the timeline via tiny + strips at the top of the topmost label cell and the bottom of the bottommost. Sticky in the label column so they follow horizontal scroll.
Zoom bar tracks the project. The minimum visible scope is 10 seconds (the zoom-in cap), the maximum is the entire project (the zoom-out cap), and the default is 60 seconds. Both the +/− buttons and the scrollbar edge-drag handles clamp to those dynamic bounds.
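The zoom bounds and the per-clip volume range both reduce to simple clamping. The sketch below uses the numbers from the notes (10 s minimum, 60 s default, 0–200% volume); the function names are hypothetical, and the behavior for projects shorter than 10 seconds is an assumption.

```typescript
const MIN_VISIBLE_SECONDS = 10;
const DEFAULT_VISIBLE_SECONDS = 60;

// Clamps the visible window between 10 s and the full project length.
// Assumption: a project shorter than 10 s still shows a 10 s window.
function clampVisibleSeconds(requested: number, projectSeconds: number): number {
  const max = Math.max(projectSeconds, MIN_VISIBLE_SECONDS);
  return Math.min(Math.max(requested, MIN_VISIBLE_SECONDS), max);
}

// Initial zoom: 60 s, or the whole project when it is shorter than that.
function initialVisibleSeconds(projectSeconds: number): number {
  return clampVisibleSeconds(DEFAULT_VISIBLE_SECONDS, projectSeconds);
}

// Converts a 0–200% slider value to a linear gain multiplier.
function clipGain(volumePercent: number): number {
  return Math.min(Math.max(volumePercent, 0), 200) / 100;
}
```

Because the upper bound depends on the project length, the same clamp can serve both the +/− buttons and the edge-drag handles.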

Interface

Theme selector. Light / dark / system in Settings → General, persisted across sessions. System mode listens for OS-level appearance changes and flips live without a restart.
Scrubbable waveform player on captures. The capture detail card now embeds a WaveSurfer waveform with click-to-seek and a current / total timestamp pair, replacing the static duration label.
Capture pill light mode. The on-screen pill gets a dedicated light palette so it stays legible against bright windows.
Readiness checklist in the Captures settings sidebar. The same six-gate checklist the Captures empty state uses mirrors into Settings → Captures so a red gate can't hide behind a green toggle. Hidden once every gate is green. macOS-only rows (Input Monitoring, Accessibility) hide entirely on Windows and Linux.

Windows parity

Same dictation flow on Windows. The right-hand default chord (right-Ctrl + right-Shift) avoids AltGr collisions on layouts where Ctrl+Alt doubles as AltGr. Focus is captured at chord-start, so the paste lands in the original field even if focus drifts during transcription or refinement.


v0.4.5 · Breaking risk · 3mo ago

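Second hotfix for the "offline mode is enabled" crash on model load; also removes the ~30s pause when generating without a network.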

Full changelog

Second hotfix for the "offline mode is enabled" crash on model load. 0.4.4 reverted the inference-path offline guards but kept the same trap on the load path, so users who updated to 0.4.4 kept hitting the exact error the release was supposed to fix (#526). This release removes the load-path guards and patches the transformers tokenizer load to be robust to HuggingFace metadata failures at the source, so the class of bug can't recur.

Reliability

Load no longer fails with "offline mode is enabled" (#530, fixes #526). transformers 4.57.x added an unconditional huggingface_hub.model_info() call inside AutoTokenizer.from_pretrained (via _patch_mistral_regex) that runs for every non-local repo load, regardless of cache state or whether the target model is actually a Mistral variant. The load-time HF_HUB_OFFLINE guard from 0.4.2 turned that into a hard crash for cached online users the moment 0.4.4 removed the inference-path guard that had been masking the problem. Fix wraps _patch_mistral_regex so any exception from the HF metadata check is caught and the tokenizer is returned unchanged — matching the success-path behavior for non-Mistral repos. The wrapper installs at backend.backends import time so it covers Qwen Base, Qwen CustomVoice, TADA, and every other transformers-backed engine on Windows, Linux, and CUDA alike. The load-time force_offline_if_cached guards were removed — with the wrapper in place they provide zero value and only risk re-introducing the same failure mode.
No more 30s pause when generating without a network. The HuggingFace metadata timeout called out as a known caveat in 0.4.4 is covered by the same patch; offline users no longer wait for the check to time out before load completes.
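The actual fix lives in the Python backend, but the wrap-and-fall-through pattern it describes is language-neutral. The TypeScript sketch below only illustrates that pattern: every name in it is hypothetical, and the failing lookup is simulated rather than a real HuggingFace call.

```typescript
// Wraps `patch` so that any exception returns the input unchanged.
function tolerant<T>(patch: (value: T) => T): (value: T) => T {
  return (value: T): T => {
    try {
      return patch(value);
    } catch {
      return value; // same result as the success path for inputs that need no patch
    }
  };
}

interface TokenizerLike {
  name: string;
  patched: boolean;
}

// Simulated patch step that depends on a metadata lookup which may fail.
function makeMetadataPatch(metadataAvailable: boolean): (tok: TokenizerLike) => TokenizerLike {
  return (tok: TokenizerLike): TokenizerLike => {
    if (!metadataAvailable) throw new Error("offline mode is enabled");
    return { name: tok.name, patched: true };
  };
}

// Installed once, the wrapper covers every caller of the patched function.
const safeOfflinePatch = tolerant(makeMetadataPatch(false));
const safeOnlinePatch = tolerant(makeMetadataPatch(true));
```

The design choice mirrored here is that a failed optional check degrades to a no-op instead of aborting the load.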


v0.4.4 · Bug fix · 3mo ago

Fixes regression where generation and transcription failed with 'offline mode' error while online.

Full changelog

Hotfix for a regression in 0.4.3 where generation and transcription could fail outright with "offline mode is enabled" even when the user was online.

Reliability

Inference no longer fails with "offline mode is enabled" while online (#524, reverts the inference-path guards from #503). 0.4.3 wrapped every inference body (generate, transcribe, create_voice_clone_prompt) with a process-wide HF_HUB_OFFLINE flip to stop lazy HuggingFace lookups from hanging when the network drops mid-inference (#462). That flag also blocks legitimate metadata calls (e.g. HfApi().model_info for revision resolution) so online users started seeing generation fail outright. Inference now runs with the process's default HF state. Load-time offline guards — which weren't the source of the regression — stay in place.

Known caveat: users generating without an internet connection may see brief pauses during inference while HuggingFace metadata lookups time out (typically ~30s, after which the library recovers). A proper offline-mode toggle is planned for 0.4.5.


v0.4.3 · Bug fix · 3mo ago

Fixed macOS 15 Sequoia installation blocks and Gatekeeper dialogs via DMG notarization.

Full changelog

A patch focused on two user-impacting reliability fixes: macOS DMG notarization (unblocks brew install voicebox on macOS 15 Sequoia and fixes spurious "app isn't signed" Gatekeeper dialogs on older Intel Macs) and Kokoro Japanese voice initialization on fresh installs.

macOS

DMGs are now notarized and stapled (#523). Tauri's bundler notarizes the .app inside the DMG but ships the DMG wrapper itself unnotarized. Gatekeeper rejects that on macOS 15 Sequoia (confirmed by Homebrew Cask CI failing on both arm and intel Sequoia runners) and causes the "the app is not signed" dialog on older Intel Macs when Apple's notarization servers are slow or unreachable (#509). The release workflow now submits each DMG to notarytool, staples the ticket, verifies with spctl, and overwrites the draft-release asset tauri-action uploaded. Adds ~5-10 min per macOS job.
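The submit, staple, and verify steps map onto three standard macOS commands. The sketch below only assembles their argument lists and is not the project's workflow: authenticating through a notarytool keychain profile and the exact spctl flags are assumptions, and the final step of overwriting the draft-release asset is omitted.

```typescript
interface NotarizationStep {
  label: string;
  argv: string[];
}

// Returns the commands to run, in order, for one DMG.
// Assumption: credentials are stored in a notarytool keychain profile.
function notarizationPlan(dmgPath: string, keychainProfile: string): NotarizationStep[] {
  return [
    {
      label: "submit",
      argv: ["xcrun", "notarytool", "submit", dmgPath, "--keychain-profile", keychainProfile, "--wait"],
    },
    {
      label: "staple",
      argv: ["xcrun", "stapler", "staple", dmgPath],
    },
    {
      label: "verify",
      argv: ["spctl", "--assess", "--type", "open", "--context", "context:primary-signature", "--verbose", dmgPath],
    },
  ];
}
```

Each argv could be passed to a process runner in sequence, stopping at the first non-zero exit so an unstapled DMG never replaces the release asset.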

Backend

Kokoro Japanese voices no longer crash on fresh installs (#521, fixes #514). misaki[ja] pulls in fugashi, which needs a MeCab dictionary on disk. The unidic package that was being installed ships no data and expects a ~526MB runtime download that "just setup" doesn't run (and which wouldn't survive PyInstaller anyway). Swapped to unidic-lite, which bundles a MeCab-compatible dictionary inside the wheel (~50MB). It is collected in build_binary.py so frozen builds pick up unidic_lite/dicdir/.


v0.4.2 · Breaking risk · 3mo ago

Notable features

Full internationalization for English, Simplified Chinese, Traditional Chinese, Japanese (559 translation keys across all UI surfaces)
Offline mode now truly stays offline with HF_HUB_OFFLINE guarding all inference paths
Chatterbox reference samples are resampled to match sample rate and channel layout instead of rejected

Full changelog

This release localizes the entire app. English, Simplified Chinese (zh-CN), Traditional Chinese (zh-TW), and Japanese (ja) are wired up end-to-end across every tab, modal, dialog, and toast — 559 translation keys per locale, parity verified. Plus a batch of reliability fixes: offline-mode now actually stays offline, Chatterbox accepts reference samples it used to reject, MLX Qwen 0.6B points at the right repo, and macOS system audio survives backgrounding.

Internationalization (#508)

i18next foundation with an in-app language switcher that re-renders the tree on change — lazy-loaded components were holding stale strings without an explicit key-bump on the React root.
Four locales at full coverage: English, Simplified Chinese, Traditional Chinese, Japanese. No partial/English-fallback surfaces.
Every user-visible surface translated: Stories (list, content editor, dialogs, toasts), Effects (list, detail, chain editor, built-in preset names), Voices (table, search, inspector, Create/Edit modal, audio sample panels), Audio Channels (list, dialogs, device picker), history + story dropdown menus, ProfileCard / ProfileList / HistoryTable, and the unsupported-model note.
Relative dates localize via date-fns locale objects (3 days ago → 3 天前 / 3 日前) — Intl.RelativeTimeFormat doesn't produce the phrasing we use in the history table.
Dev-build version suffix (v0.4.2 (dev) / (开发版) / (開發版) / (開発版)) is now locale-aware.
559 translation keys across all four locales.
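Key parity across locales can be checked mechanically by flattening each translation tree and comparing key sets. The sketch below is one plausible way to do that; the helper names and the nested-object layout are assumptions, and the notes do not say how parity was actually verified.

```typescript
// Nested translation resource, e.g. { stories: { title: "Stories" } }.
interface TranslationTree {
  [key: string]: string | TranslationTree;
}

// Flattens a tree into dotted key paths such as "stories.title".
function flattenKeys(tree: TranslationTree, prefix = ""): string[] {
  const keys: string[] = [];
  for (const key of Object.keys(tree)) {
    const value = tree[key];
    const path = prefix ? `${prefix}.${key}` : key;
    if (typeof value === "string") {
      keys.push(path);
    } else if (value !== undefined) {
      keys.push(...flattenKeys(value, path));
    }
  }
  return keys;
}

// Keys present in `reference` but absent from `candidate`.
function missingKeys(reference: TranslationTree, candidate: TranslationTree): string[] {
  const have = new Set(flattenKeys(candidate));
  return flattenKeys(reference).filter((k) => !have.has(k));
}
```

Running the comparison in both directions for each locale pair would flag both missing and stale keys.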

Reliability

HF_HUB_OFFLINE now guards every inference path (#503) — some engines were still attempting a HuggingFace metadata roundtrip on first load when offline mode was enabled, causing hangs on airgapped or flaky networks.
Chatterbox reference samples are preprocessed instead of rejected (#502) — samples outside the expected sample rate or channel layout are resampled to match, rather than failing with an opaque error.
MLX Qwen 0.6B repo path fixed (#501) — now points at the published mlx-community repo so the model actually downloads on Apple Silicon.
macOS system audio survives backgrounding (#486, closes #41) — WKWebView was tearing down the audio session when the app lost focus, silently killing system-audio capture.
MLX backend miniaudio dependency pinned (#506) — mlx_audio.stt needs it at runtime and nothing else transitively pulled it in, so --no-deps installs were breaking on first use.
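To make the Chatterbox preprocessing concrete, the sketch below downmixes a multi-channel sample to mono and resamples it by linear interpolation. This is a stand-in technique chosen for brevity, not the project's method: the notes do not name the resampler or the expected rate, and a production resampler would low-pass filter before downsampling.

```typescript
// Averages all channels into one. Assumes equal-length channels.
function downmixToMono(channels: Float32Array[]): Float32Array {
  const first = channels[0];
  if (first === undefined) return new Float32Array(0);
  const mono = new Float32Array(first.length);
  for (let i = 0; i < first.length; i++) {
    let sum = 0;
    for (const channel of channels) sum += channel[i] ?? 0;
    mono[i] = sum / channels.length;
  }
  return mono;
}

// Linear-interpolation resampler (no anti-aliasing filter).
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = (input[left] ?? 0) * (1 - frac) + (input[right] ?? 0) * frac;
  }
  return out;
}
```

Applying the downmix first keeps the resampling cost independent of the source channel count.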

Landing / Docs

New /download page (#487) — no more dumping first-time visitors onto the GitHub releases list. The API example snippet on the landing page also got an accuracy pass.
Download redirects work behind reverse proxies (#498) — uses the public origin instead of localhost when resolving platform-specific installer URLs.
MDX docs audited against the multi-engine backend (#484) — stale single-engine assumptions removed.
Three more tutorials + mobile navbar / hero CTA fixes (#483).
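For the reverse-proxy redirect fix, one common way to recover the public origin is to read the forwarding headers a proxy sets. The sketch below assumes that approach; the notes only say the public origin is used instead of localhost, so the header names, function names, and installer path here are illustrative, and forwarded headers should only be trusted from a known proxy.

```typescript
type HeaderMap = Record<string, string | undefined>;

// Takes the first entry of a possibly comma-separated header value.
function firstHeaderValue(raw: string | undefined): string | undefined {
  if (raw === undefined) return undefined;
  const trimmed = (raw.split(",")[0] ?? "").trim();
  return trimmed.length > 0 ? trimmed : undefined;
}

// Prefers the proxy-reported origin; falls back to the server's own origin.
function publicOrigin(headers: HeaderMap, fallbackOrigin: string): string {
  const host = firstHeaderValue(headers["x-forwarded-host"]);
  if (host === undefined) return fallbackOrigin;
  const proto = firstHeaderValue(headers["x-forwarded-proto"]) ?? "https";
  return `${proto}://${host}`;
}

// Builds an absolute redirect target for a platform installer path.
function installerRedirect(headers: HeaderMap, fallbackOrigin: string, installerPath: string): string {
  return `${publicOrigin(headers, fallbackOrigin)}${installerPath}`;
}
```

Without the forwarded headers the function behaves exactly as a direct, unproxied deployment would.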

Linux

Still not shipping. The re-enable attempt (#488) landed on main but CI still hangs in the tauri-action bundler step on ubuntu-22.04 — no output for 25+ minutes after rpm bundling, even with createUpdaterArtifacts: false and --bundles deb,rpm. The matrix entry is disabled again for 0.4.2; the ubuntu-specific setup steps stay in the workflow so re-enabling is a one-line change once we identify the hang. Next release will take another pass.