Goal
Broken out of #245 (Phase 2). The wow moment currently requires all of: Claude Code subscription ∩ tmux fluency ∩ self-signed TLS certs ∩ GPU/RunPod for TTS. That intersection leaves too few users at the bottom of the funnel. Ship a degraded-but-instant mode: ~60 seconds from install to first voice command, with zero infrastructure prerequisites beyond Claude Code itself. As part of this, restructure voice backends into a clean two-tier model: default (built-in, zero setup) and custom (bring your own shim).
Design (locked 2026-06-07, validated by spike — see comment)
Two-tier backend model for both STT and TTS: default / custom.
# Tier 1: zero setup, what a fresh install gets (and what an empty config means)
stt:
  backend: default   # Chrome Web Speech in the portal + jargon correction map
tts:
  backend: default   # browser speechSynthesis

# Tier 2: bring your own model behind a small shim
stt:
  backend: custom
  url: http://localhost:8101   # any HTTP shim implementing the documented contract
tts:
  backend: custom
  url: http://localhost:8102
| | default (Chrome) | custom (your shim) |
|---|---|---|
| STT | browser SpeechRecognition (cloud-backed by Google), transcript → existing /api/sessions/<s>/send | audio upload → /api/transcribe → your HTTP endpoint (audio in, text out) |
| TTS | browser speechSynthesis (robotic, free, offline) | text in, audio out per contract |
| Setup | none; certs not needed (127.0.0.1 = secure context) | run your shim, point config at it |
| Blessed browser | Chrome (stated explicitly in docs/UI) | any |
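As an illustration of the two-tier shape, here is a minimal sketch of how a config section might be resolved so that an empty config means default. The names `BackendConfig`, `resolve_backend`, and `resolve_voice_config` are hypothetical, and rejecting unknown backend names outright is an assumption, not agentwire's confirmed behavior.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class BackendConfig:
    backend: str                 # "default" or "custom"
    url: Optional[str] = None    # only meaningful for "custom"


def resolve_backend(section: Optional[Mapping[str, Any]]) -> BackendConfig:
    """Resolve one section (stt or tts); a missing or empty section means default."""
    section = section or {}
    backend = section.get("backend", "default")
    if backend == "default":
        return BackendConfig("default")
    if backend == "custom":
        url = section.get("url")
        if not url:
            raise ValueError("backend: custom requires a url")
        return BackendConfig("custom", url)
    # Assumption: anything else is an error rather than a silently accepted legacy shape.
    raise ValueError(f"unknown backend {backend!r}; expected 'default' or 'custom'")


def resolve_voice_config(config: Optional[Mapping[str, Any]]) -> dict[str, BackendConfig]:
    """Resolve both tiers from a parsed config mapping (possibly empty or None)."""
    config = config or {}
    return {kind: resolve_backend(config.get(kind)) for kind in ("stt", "tts")}
```

With this shape, `resolve_voice_config({})` yields the default tier for both STT and TTS, and the two sides can be switched to custom independently.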
Key points:
- The shim contract is the extension surface. STTServerBackend already speaks HTTP-audio-in/text-out; this work documents that contract as a public, stable interface so anyone can wrap any model (a Deepgram key, a local whisper.cpp, whatever) in ~30 lines and set backend: custom.
- The bundled moonshine/faster-whisper STT server is repositioned as the reference shim: shipped, documented, and pointed to from the contract doc as the example implementation. WhisperKit (in-process, Mac) becomes either a bundled shim or a named preset; decide during implementation.
- The portal picks the input path from the server-reported backend: default → browser STT, custom → today's audio-upload pipeline.
- Pre-launch, no backwards compat: old config shapes are migrated by hand, not shimmed.
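To give a feel for the "~30 lines" claim, here is a minimal sketch of a custom STT shim using only the Python standard library. The contract doc does not exist yet, so the wire format here is an assumption: the `/transcribe` route, the raw-audio POST body, and the `{"text": ...}` JSON response are all hypothetical, and `transcribe` is a placeholder for a real model call.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def transcribe(audio: bytes) -> str:
    """Placeholder: call your model here (a local whisper.cpp, a hosted API, ...)."""
    return f"<{len(audio)} bytes of audio>"


class STTShimHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/transcribe":  # hypothetical route, not the real contract
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)  # audio in
        body = json.dumps({"text": transcribe(audio)}).encode("utf-8")  # text out
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep the sketch quiet


def make_server(port: int = 8101) -> HTTPServer:
    """Bind the shim to localhost; call .serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), STTShimHandler)
```

Running `make_server().serve_forever()` and setting `stt.url` to `http://localhost:8101` would be the whole integration under these assumptions.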
Capability pass-through (envelope, not vocabulary)
Models differ in what they can receive and produce — e.g., expressive TTS models (Qwen/CosyVoice family) accept emotion tags / style instructions that plain backends cannot. The contract must not flatten every model to the lowest common denominator, and must not force shim authors into an agentwire-defined tag schema:
- Minimal required core: TTS = text (+ optional voice); STT = audio bytes. That is the whole mandatory surface.
- Opaque pass-through fields: instructions (free text, effectively a prompt riding along with the transaction: "speak warmly, slightly amused") and options (opaque JSON). agentwire transports them verbatim; only the shim interprets them. Same fields on the STT side (language hints, vocabulary biasing, timestamps).
- Inline markup is the shim's business: if a model wants [laughter] or emotion tags embedded in text, the agent may emit them and the shim consumes them. Shims for plain models strip unknown markup rather than speaking it literally.
- Graceful degradation: the default (browser) tier ignores instructions/options and strips inline tags. A capability-blind caller must never produce audibly broken output.
- Capability discovery: shims may expose GET /capabilities (voices, emotion-tag support, etc.); absence of the endpoint means plain text only.
- Tooldef injection: the loop must close at the producer. Discovery is useless if the agent never learns it: the running session only emits tags it was told about. The /capabilities response therefore includes a shim-authored tool_prompt (free text, written by the dev who wrote the shim): "This TTS supports inline [laughter], [sigh], and <emotion:happy|sad|excited> tags; use them sparingly for expressive delivery." agentwire appends it verbatim to the say MCP tool description (and the voice role prompt) at MCP-server start. Shim dev writes the prompt → tooldef teaches the agent → agent emits tags → envelope passes them through → shim renders them. agentwire stays a dumb pipe at every step; the shim author programs both ends.
This is how emotion-tag-capable models stay first-class through a custom shim without agentwire ever encoding model-specific schemas.
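The envelope and the default-tier degradation can be sketched as follows. The field names text, voice, instructions, and options come from the list above; the function names and the tag regex are hypothetical. The regex only covers the two tag styles used as examples in this issue and would also remove legitimate bracketed words, so treat it as a starting point rather than a rule.

```python
import re
from typing import Any, Optional

# Assumed tag syntaxes, taken from this issue's examples: [laughter] and <emotion:happy>.
_INLINE_TAG = re.compile(r"\[[a-z_ ]+\]|<[a-z]+:[a-z|]+>", re.IGNORECASE)


def build_tts_request(
    text: str,
    voice: Optional[str] = None,
    instructions: Optional[str] = None,
    options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build the envelope: text is mandatory, everything else rides along untouched."""
    request: dict[str, Any] = {"text": text}
    if voice is not None:
        request["voice"] = voice
    if instructions is not None:
        request["instructions"] = instructions  # opaque free text, never parsed here
    if options is not None:
        request["options"] = options            # opaque JSON, never parsed here
    return request


def strip_inline_tags(text: str) -> str:
    """Remove inline tags so a plain voice never reads them aloud."""
    without_tags = _INLINE_TAG.sub(" ", text)
    collapsed = re.sub(r"\s+", " ", without_tags).strip()
    return re.sub(r"\s+([.,!?;:])", r"\1", collapsed)  # no stray space before punctuation


def for_default_tier(request: dict[str, Any]) -> dict[str, Any]:
    """Degrade a request for the browser tier: drop pass-through fields, strip tags."""
    degraded: dict[str, Any] = {"text": strip_inline_tags(request["text"])}
    if "voice" in request:
        degraded["voice"] = request["voice"]
    return degraded
```

A custom shim would receive the output of `build_tts_request` unchanged, while the browser tier would only ever see the output of `for_default_tier`.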
Note: MCP tooldefs are read at MCP-server start (a separate process launched by Claude Code), so a shim swap needs a session restart to re-teach running agents — document this in the contract doc.
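A minimal sketch of the startup-time injection, assuming `/capabilities` returns a JSON object with an optional tool_prompt string. `STOCK_SAY_DESCRIPTION`, `fetch_tool_prompt`, and `build_say_description` are hypothetical names, and the stock description text is a placeholder rather than agentwire's real wording.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

STOCK_SAY_DESCRIPTION = "Speak text aloud to the user."  # placeholder wording


def fetch_tool_prompt(shim_url: Optional[str], timeout: float = 2.0) -> Optional[str]:
    """Return the shim-authored tool_prompt, or None if there is no shim or no endpoint."""
    if not shim_url:
        return None
    try:
        url = f"{shim_url.rstrip('/')}/capabilities"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            capabilities = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable shim, missing endpoint (404), or malformed JSON: plain text only.
        return None
    prompt = capabilities.get("tool_prompt") if isinstance(capabilities, dict) else None
    return prompt if isinstance(prompt, str) and prompt.strip() else None


def build_say_description(stock: str, tool_prompt: Optional[str]) -> str:
    """Append the shim's prompt verbatim; fall back to the stock description."""
    if tool_prompt is None:
        return stock
    return f"{stock}\n\n{tool_prompt}"
```

Because this runs once when the MCP server starts, the description stays fixed for the life of the session, which is why a shim swap needs a restart.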
Work items
- Browser STT path: Web Speech API capture; final transcript lands in the message input (edit-before-send default in instant mode) (~1 day, core work)
- Jargon correction map: ~20 deterministic client-side rewrites (team up→tmux, pie test→pytest, worker pain→worker pane, agent wire→agentwire, get status→git status, …). Chrome has no usable vocabulary-biasing API; post-correction is the only lever. User-extensible.
- Browser TTS out: speechSynthesis for agent replies under tts.backend: default
- default/custom config refactor: backend selection keyed on the two-tier shape above; empty config ≡ default
- Shim contract doc: request/response shapes for STT (audio in → text out) and TTS (text in → audio out), with the bundled moonshine/whisper server as the reference implementation. The contract defines the envelope, not the vocabulary (see Capability pass-through above)
- First-run path: fresh install + empty config boots straight into instant mode: no cert prompts, no service setup, one command to a working portal
- Tooldef capability injection: MCP server fetches GET /capabilities from the configured shim at startup and appends the shim's tool_prompt to the say tool description + voice role prompt; no shim or no endpoint → stock description
- Upsell banner: in-portal "default mode" notice: what custom adds (real voices, whisper accuracy, phone-from-anywhere via 0.0.0.0 + token auth) and a pointer to the shim contract doc
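The jargon correction map is a deterministic phrase rewrite applied to the final transcript. The real implementation would run client-side in the portal; the sketch below uses Python only to show the logic, with `JARGON_MAP` seeded from the examples above and `compile_corrections` as a hypothetical helper name.

```python
import re
from typing import Callable, Mapping

# Seed entries copied from the work item; the full ~20-entry map is not shown here.
JARGON_MAP = {
    "team up": "tmux",
    "pie test": "pytest",
    "worker pain": "worker pane",
    "agent wire": "agentwire",
    "get status": "git status",
}


def compile_corrections(mapping: Mapping[str, str]) -> Callable[[str], str]:
    """Build a corrector that rewrites whole-phrase, case-insensitive matches."""
    if not mapping:
        return lambda transcript: transcript
    # Longest phrases first so overlapping entries resolve predictably.
    phrases = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b", re.IGNORECASE
    )
    lookup = {phrase.lower(): fix for phrase, fix in mapping.items()}

    def correct(transcript: str) -> str:
        return pattern.sub(lambda m: lookup[m.group(1).lower()], transcript)

    return correct
```

User extensions could be merged in before compiling, for example `compile_corrections({**JARGON_MAP, **user_map})`, so a user entry overrides a built-in one with the same phrase.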
Known residual risks
- Confidently-wrong jargon: Chrome reported 0.93–0.97 confidence on every miss in the spike — confidence cannot gate anything. Edit-before-send + correction map are the mitigations.
- The auth→off class of misses (real-word substitutions) survives the correction map; acceptable, since Claude asks or the glance catches it.
- Browser STT needs internet (Google cloud recognition) — fine for the target user, stated honestly in docs.
Verification
- A stranger (not the owner) goes from pip install / clone to first successful voice command in <10 minutes with no GPU and no cert generation
- 5/5 test subjects complete it before this is called done
- agentwire portal with a fresh, empty config reaches a usable portal with browser STT + browser TTS: no errors, no setup prompts beyond the essentials
- backend: custom pointed at the bundled moonshine/whisper server behaves exactly as today's pipeline (regression check)
- A from-scratch shim written against only the contract doc works without reading agentwire source
- An instructions/inline-tag-bearing request through the default tier produces clean audio (tags stripped, never spoken literally)
- With a tool_prompt-bearing shim configured, a fresh session's say tooldef visibly contains the shim's usage prompt, and the agent emits the documented tags unprompted
Built by dotdev.dev