friction collapse: degraded-but-instant first-run mode (<60s to first voice command)

## Goal

Broken out of #245 (Phase 2). The wow moment currently requires: Claude Code subscription ∩ tmux fluency ∩ self-signed TLS certs ∩ GPU/RunPod for TTS. That intersection is too small a funnel bottom. Ship a **degraded-but-instant mode**: ~60 seconds from install to first voice command, zero infrastructure prerequisites beyond Claude Code itself — restructured so voice backends are a clean two-tier model: `default` (built-in, zero setup) and `custom` (bring your own shim).

## Design (locked 2026-06-07, validated by spike — see comment)

**Two-tier backend model for both STT and TTS: `default` / `custom`.**

```yaml
# Tier 1 — zero setup, what a fresh install gets (and what an empty config means)
stt:
  backend: default     # Chrome Web Speech in the portal + jargon correction map
tts:
  backend: default     # browser speechSynthesis

# Tier 2 — bring your own model behind a small shim
stt:
  backend: custom
  url: http://localhost:8101   # any HTTP shim implementing the documented contract
tts:
  backend: custom
  url: http://localhost:8102
```

| | `default` (Chrome) | `custom` (your shim) |
|---|---|---|
| STT | browser `SpeechRecognition` (cloud-backed by Google), transcript → existing `/api/sessions/<s>/send` | audio upload → `/api/transcribe` → your HTTP endpoint (audio in, text out) |
| TTS | browser `speechSynthesis` (robotic, free, offline) | text in, audio out per contract |
| Setup | none; certs not needed (127.0.0.1 = secure context) | run your shim, point config at it |
| Blessed browser | Chrome (stated explicitly in docs/UI) | any |

Key points:

- **The shim contract is the extension surface.** `STTServerBackend` already speaks HTTP-audio-in/text-out; this work documents that contract as a public, stable interface so anyone can wrap *any* model (a Deepgram key, a local whisper.cpp, whatever) in ~30 lines and set `backend: custom`.
- **The bundled moonshine/faster-whisper STT server is repositioned as the reference shim** — shipped, documented, and pointed to from the contract doc as the example implementation. WhisperKit (in-process, Mac) either becomes a bundled shim or a named preset — decide during implementation.
- The portal picks the input path from server-reported backend: `default` → browser STT, `custom` → today's audio-upload pipeline.
- Pre-launch, no backwards compat: old config shapes are migrated by hand, not shimmed.

## Capability pass-through (envelope, not vocabulary)

Models differ in what they can receive and produce — e.g., expressive TTS models (Qwen/CosyVoice family) accept emotion tags / style instructions that plain backends cannot. The contract must not flatten every model to the lowest common denominator, and must not force shim authors into an agentwire-defined tag schema:

- **Minimal required core**: TTS = `text` (+ optional `voice`); STT = audio bytes. That is the whole mandatory surface.
- **Opaque pass-through fields**: `instructions` (free text — effectively a prompt riding along with the transaction: "speak warmly, slightly amused") and `options` (opaque JSON). agentwire transports them verbatim; only the shim interprets them. Same fields on the STT side (language hints, vocabulary biasing, timestamps).
- **Inline markup is the shim's business**: if a model wants `[laughter]` or emotion tags embedded in `text`, the agent may emit them and the shim consumes them. Shims for plain models strip unknown markup rather than speaking it literally.
- **Graceful degradation**: the `default` (browser) tier ignores `instructions`/`options` and strips inline tags. A capability-blind caller must never produce audibly broken output.
- **Capability discovery**: shims may expose `GET /capabilities` (voices, emotion-tag support, etc.); absence of the endpoint means plain text only.
- **Tooldef injection — the loop must close at the producer.** Discovery is useless if the agent never learns it: the running session only emits tags it was told about. The `/capabilities` response therefore includes a shim-authored `tool_prompt` (free text, written by the dev who wrote the shim): *"This TTS supports inline `[laughter]`, `[sigh]`, and `<emotion:happy|sad|excited>` tags; use them sparingly for expressive delivery."* agentwire appends it verbatim to the `say` MCP tool description (and the voice role prompt) at MCP-server start. Shim dev writes the prompt → tooldef teaches the agent → agent emits tags → envelope passes them through → shim renders them. agentwire stays a dumb pipe at every step; the shim author programs both ends.

This is how emotion-tag-capable models stay first-class through a custom shim without agentwire ever encoding model-specific schemas.

Note: MCP tooldefs are read at MCP-server start (a separate process launched by Claude Code), so a shim swap needs a session restart to re-teach running agents — document this in the contract doc.

## Work items

1. **Browser STT path** — Web Speech API capture; final transcript lands in the message input (edit-before-send default in instant mode) (~1 day, core work)
2. **Jargon correction map** — ~20 deterministic client-side rewrites (`team up→tmux`, `pie test→pytest`, `worker pain→worker pane`, `agent wire→agentwire`, `get status→git status`, …). Chrome has no usable vocabulary-biasing API; post-correction is the only lever. User-extensible.
3. **Browser TTS out** — `speechSynthesis` for agent replies under `tts.backend: default`
4. **`default`/`custom` config refactor** — backend selection keyed on the two-tier shape above; empty config ≡ `default`
5. **Shim contract doc** — request/response shapes for STT (audio in → text out) and TTS (text in → audio out), with the bundled moonshine/whisper server as the reference implementation. **The contract defines the envelope, not the vocabulary** (see Capability pass-through above)
6. **First-run path** — fresh install + empty config boots straight into instant mode: no cert prompts, no service setup, one command to a working portal
7. **Tooldef capability injection** — MCP server fetches `GET /capabilities` from the configured shim at startup and appends the shim's `tool_prompt` to the `say` tool description + voice role prompt; no shim or no endpoint → stock description
8. **Upsell banner** — in-portal "default mode" notice: what custom adds (real voices, whisper accuracy, phone-from-anywhere via `0.0.0.0` + token auth) and a pointer to the shim contract doc

## Known residual risks

- **Confidently-wrong jargon**: Chrome reported 0.93–0.97 confidence on every miss in the spike — confidence cannot gate anything. Edit-before-send + correction map are the mitigations.
- **`auth→off`** class of misses (real-word substitutions) survive the correction map; acceptable — Claude asks or the glance catches it.
- Browser STT needs internet (Google cloud recognition) — fine for the target user, stated honestly in docs.

## Verification

- A stranger (not the owner) goes from `pip install` / clone to first successful voice command in **<10 minutes** with no GPU and no cert generation
- **5/5 test subjects complete it** before this is called done
- `agentwire portal` with a fresh, empty config reaches a usable portal with browser STT + browser TTS — no errors, no setup prompts beyond the essentials
- `backend: custom` pointed at the bundled moonshine/whisper server behaves exactly as today's pipeline (regression check)
- A from-scratch shim written against only the contract doc works without reading agentwire source
- An `instructions`/inline-tag-bearing request through the `default` tier produces clean audio (tags stripped, never spoken literally)
- With a `tool_prompt`-bearing shim configured, a fresh session's `say` tooldef visibly contains the shim's usage prompt, and the agent emits the documented tags unprompted

Built by [dotdev.dev](https://dotdev.dev)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

friction collapse: degraded-but-instant first-run mode (<60s to first voice command) #251

Goal

Design (locked 2026-06-07, validated by spike — see comment)

Capability pass-through (envelope, not vocabulary)

Work items

Known residual risks

Verification

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

	`default` (Chrome)	`custom` (your shim)
STT	browser `SpeechRecognition` (cloud-backed by Google), transcript → existing `/api/sessions/<s>/send`	audio upload → `/api/transcribe` → your HTTP endpoint (audio in, text out)
TTS	browser `speechSynthesis` (robotic, free, offline)	text in, audio out per contract
Setup	none; certs not needed (127.0.0.1 = secure context)	run your shim, point config at it
Blessed browser	Chrome (stated explicitly in docs/UI)	any

Uh oh!

friction collapse: degraded-but-instant first-run mode (<60s to first voice command) #251

Description

Goal

Design (locked 2026-06-07, validated by spike — see comment)

Capability pass-through (envelope, not vocabulary)

Work items

Known residual risks

Verification

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions