Skip to content

friction collapse: degraded-but-instant first-run mode (<60s to first voice command) #251

@dotdevdotdev

Description

@dotdevdotdev

Goal

Broken out of #245 (Phase 2). The wow moment currently requires: Claude Code subscription ∩ tmux fluency ∩ self-signed TLS certs ∩ GPU/RunPod for TTS. That intersection is too small a funnel bottom. Ship a degraded-but-instant mode: ~60 seconds from install to first voice command, zero infrastructure prerequisites beyond Claude Code itself — restructured so voice backends are a clean two-tier model: default (built-in, zero setup) and custom (bring your own shim).

Design (locked 2026-06-07, validated by spike — see comment)

Two-tier backend model for both STT and TTS: default / custom.

# Tier 1 — zero setup, what a fresh install gets (and what an empty config means)
stt:
  backend: default     # Chrome Web Speech in the portal + jargon correction map
tts:
  backend: default     # browser speechSynthesis

# Tier 2 — bring your own model behind a small shim
stt:
  backend: custom
  url: http://localhost:8101   # any HTTP shim implementing the documented contract
tts:
  backend: custom
  url: http://localhost:8102
default (Chrome) custom (your shim)
STT browser SpeechRecognition (cloud-backed by Google), transcript → existing /api/sessions/<s>/send audio upload → /api/transcribe → your HTTP endpoint (audio in, text out)
TTS browser speechSynthesis (robotic, free, offline) text in, audio out per contract
Setup none; certs not needed (127.0.0.1 = secure context) run your shim, point config at it
Blessed browser Chrome (stated explicitly in docs/UI) any

Key points:

  • The shim contract is the extension surface. STTServerBackend already speaks HTTP-audio-in/text-out; this work documents that contract as a public, stable interface so anyone can wrap any model (a Deepgram key, a local whisper.cpp, whatever) in ~30 lines and set backend: custom.
  • The bundled moonshine/faster-whisper STT server is repositioned as the reference shim — shipped, documented, and pointed to from the contract doc as the example implementation. WhisperKit (in-process, Mac) either becomes a bundled shim or a named preset — decide during implementation.
  • The portal picks the input path from server-reported backend: default → browser STT, custom → today's audio-upload pipeline.
  • Pre-launch, no backwards compat: old config shapes are migrated by hand, not shimmed.

Capability pass-through (envelope, not vocabulary)

Models differ in what they can receive and produce — e.g., expressive TTS models (Qwen/CosyVoice family) accept emotion tags / style instructions that plain backends cannot. The contract must not flatten every model to the lowest common denominator, and must not force shim authors into an agentwire-defined tag schema:

  • Minimal required core: TTS = text (+ optional voice); STT = audio bytes. That is the whole mandatory surface.
  • Opaque pass-through fields: instructions (free text — effectively a prompt riding along with the transaction: "speak warmly, slightly amused") and options (opaque JSON). agentwire transports them verbatim; only the shim interprets them. Same fields on the STT side (language hints, vocabulary biasing, timestamps).
  • Inline markup is the shim's business: if a model wants [laughter] or emotion tags embedded in text, the agent may emit them and the shim consumes them. Shims for plain models strip unknown markup rather than speaking it literally.
  • Graceful degradation: the default (browser) tier ignores instructions/options and strips inline tags. A capability-blind caller must never produce audibly broken output.
  • Capability discovery: shims may expose GET /capabilities (voices, emotion-tag support, etc.); absence of the endpoint means plain text only.
  • Tooldef injection — the loop must close at the producer. Discovery is useless if the agent never learns it: the running session only emits tags it was told about. The /capabilities response therefore includes a shim-authored tool_prompt (free text, written by the dev who wrote the shim): "This TTS supports inline [laughter], [sigh], and <emotion:happy|sad|excited> tags; use them sparingly for expressive delivery." agentwire appends it verbatim to the say MCP tool description (and the voice role prompt) at MCP-server start. Shim dev writes the prompt → tooldef teaches the agent → agent emits tags → envelope passes them through → shim renders them. agentwire stays a dumb pipe at every step; the shim author programs both ends.

This is how emotion-tag-capable models stay first-class through a custom shim without agentwire ever encoding model-specific schemas.

Note: MCP tooldefs are read at MCP-server start (a separate process launched by Claude Code), so a shim swap needs a session restart to re-teach running agents — document this in the contract doc.

Work items

  1. Browser STT path — Web Speech API capture; final transcript lands in the message input (edit-before-send default in instant mode) (~1 day, core work)
  2. Jargon correction map — ~20 deterministic client-side rewrites (team up→tmux, pie test→pytest, worker pain→worker pane, agent wire→agentwire, get status→git status, …). Chrome has no usable vocabulary-biasing API; post-correction is the only lever. User-extensible.
  3. Browser TTS outspeechSynthesis for agent replies under tts.backend: default
  4. default/custom config refactor — backend selection keyed on the two-tier shape above; empty config ≡ default
  5. Shim contract doc — request/response shapes for STT (audio in → text out) and TTS (text in → audio out), with the bundled moonshine/whisper server as the reference implementation. The contract defines the envelope, not the vocabulary (see Capability pass-through above)
  6. First-run path — fresh install + empty config boots straight into instant mode: no cert prompts, no service setup, one command to a working portal
  7. Tooldef capability injection — MCP server fetches GET /capabilities from the configured shim at startup and appends the shim's tool_prompt to the say tool description + voice role prompt; no shim or no endpoint → stock description
  8. Upsell banner — in-portal "default mode" notice: what custom adds (real voices, whisper accuracy, phone-from-anywhere via 0.0.0.0 + token auth) and a pointer to the shim contract doc

Known residual risks

  • Confidently-wrong jargon: Chrome reported 0.93–0.97 confidence on every miss in the spike — confidence cannot gate anything. Edit-before-send + correction map are the mitigations.
  • auth→off class of misses (real-word substitutions) survive the correction map; acceptable — Claude asks or the glance catches it.
  • Browser STT needs internet (Google cloud recognition) — fine for the target user, stated honestly in docs.

Verification

  • A stranger (not the owner) goes from pip install / clone to first successful voice command in <10 minutes with no GPU and no cert generation
  • 5/5 test subjects complete it before this is called done
  • agentwire portal with a fresh, empty config reaches a usable portal with browser STT + browser TTS — no errors, no setup prompts beyond the essentials
  • backend: custom pointed at the bundled moonshine/whisper server behaves exactly as today's pipeline (regression check)
  • A from-scratch shim written against only the contract doc works without reading agentwire source
  • An instructions/inline-tag-bearing request through the default tier produces clean audio (tags stripped, never spoken literally)
  • With a tool_prompt-bearing shim configured, a fresh session's say tooldef visibly contains the shim's usage prompt, and the agent emits the documented tags unprompted

Built by dotdev.dev

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions