Problem
Multi-session orchestration's core promise — spawn worktree workers, they report back when done, the orchestrator reviews — is broken in practice. Finished workers push their PRs and go idle, but the orchestrator is never driven to review them. The human has to manually prompt "what's done?" every time. This is the #1 multi-session reliability failure.
Root cause (confirmed)
The worktree-session role instructs every worker to report back via the POLITE channel:
agentwire msg send --to <creator> --kind done "<session>: <PR URL>"
agentwire msg only injects when the recipient's input box is EMPTY (prompt_router.prompt_is_empty) — by design, so it never clobbers a human's half-typed draft. But an orchestrator session's box is frequently non-empty (mid-turn / agent busy / residual text), so the done report defers (box_not_empty), bumps attempts, and dead-letters after 40 attempts — silently.
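The failure path described above can be reproduced in a few lines. This is a minimal sketch, not agentwire's actual code: the `Message` class, its field names, and `polite_attempt` are hypothetical, and only the empty-box rule, the `box_not_empty` reason, and the 40-attempt cap come from the description above.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 40  # dead-letter cap described above


@dataclass
class Message:
    kind: str
    body: str
    attempts: int = 0
    status: str = "pending"  # pending | delivered | dead
    last_reason: str = ""


def polite_attempt(msg: Message, box_is_empty: bool) -> str:
    """One delivery attempt under the polite rule: inject only into an empty box."""
    if msg.status != "pending":
        return msg.status
    if box_is_empty:
        msg.status = "delivered"
        return msg.status
    # Box occupied: defer, count the attempt, and give up at the cap.
    msg.attempts += 1
    msg.last_reason = "box_not_empty"
    if msg.attempts >= MAX_ATTEMPTS:
        msg.status = "dead"  # nothing notifies the sender or the recipient
    return msg.status


report = Message(kind="done", body="<session>: <PR URL>")
# An orchestrator that is busy on every retry never presents an empty box.
for _ in range(MAX_ATTEMPTS):
    polite_attempt(report, box_is_empty=False)

print(report.status, report.attempts, report.last_reason)
```

Under these assumptions the report ends as `dead` after 40 deferrals with reason `box_not_empty`, and no code path tells anyone that it happened.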
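Evidence: agentwire msg dead for the orchestrator shows 18 dead-lettered done reports, including:
- issue-490-watchdog → "draft PR #505 ... Closes #490", died after 40 attempts (box_not_empty). #505: Isolate watchdog tick stages so one subsystem's exception can't starve the others. #490: Wrap the watchdog tick stages so one subsystem's exception can't starve the others.
- issue-491-lock-race → "draft PR #508 ... Closes #491", died after 40 attempts (box_not_empty). #508: fix(locking): drop unsafe unlink-based stale-lock recovery (race #491). #491: Fix lock auto-cleanup race: a waiter can unlink a freshly-acquired live lock, breaking mutual exclusion.
- fix-delivery-race → PR #479, fix(sessions): verify first-message delivery against scrollback + consumed signal (#478)
- harden-damage-control → PR #500, Harden the damage-control matcher
- briefing-mode → PR #432, Research: Briefing Mode feasibility (asymmetric-verbosity orchestration)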
The workers DID report back. The polite channel never delivered their reports, and the dead-lettering was silent, so the orchestrator looked like it was ignoring finished work when really it never received the knock.
Why the polite channel is the wrong default for report-back
msg is polite to protect a HUMAN's draft. But a --kind done report-back to a creator is exactly the signal the creator must ACT on — deferring it forever defeats the purpose. safe_deliver conflates "a human is typing, protect the draft" with "an agent orchestrator is busy, deliver at the next safe boundary."
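One way to picture the distinction is a three-way classification of the recipient's box instead of a single empty/non-empty check. The sketch below is illustrative only: `BoxState`, `classify`, `plan_delivery`, and the `agent_turn_active` signal are hypothetical names, and how agentwire would actually detect an active agent turn is an assumption this sketch does not settle.

```python
from enum import Enum


class BoxState(Enum):
    EMPTY = "empty"
    HUMAN_DRAFT = "human_draft"  # a person is typing; never clobber
    AGENT_BUSY = "agent_busy"    # agent mid-turn; a turn boundary will come


def classify(box_text: str, agent_turn_active: bool) -> BoxState:
    """Classify the recipient's input box (assumed signals, see lead-in)."""
    if agent_turn_active:
        return BoxState.AGENT_BUSY
    if box_text.strip():
        # Conservative: leftover text with no active turn is treated as a draft.
        return BoxState.HUMAN_DRAFT
    return BoxState.EMPTY


def plan_delivery(state: BoxState) -> str:
    """Map a box state to a delivery plan."""
    if state is BoxState.EMPTY:
        return "inject_now"
    if state is BoxState.AGENT_BUSY:
        return "deliver_at_turn_end"
    return "defer"  # human draft stays protected


for text, busy in [("", False), ("half-typed thou", False), ("", True)]:
    print(repr(text), busy, plan_delivery(classify(text, busy)))
```

The design point is that only the human-draft case needs open-ended deferral; an agent-busy box has a predictable safe moment, so the message can be queued for it rather than retried blindly.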
Proposed fix (defense in depth)
- Escalate report-backs instead of dead-lettering. A --kind done / --kind escalation message to a creator should, after K polite deferrals, fall back to a DRIVING delivery at the next safe boundary (the orchestrator's turn-end is itself a safe boundary) rather than dying silently at 40 attempts.
- Never silently dead-letter a report-back. Surface dead-lettered done messages loudly: agentwire doctor flags them, and the orchestrator sees them. A dead-letter that reads as "worker ghosted" is the worst failure mode.
- Orchestrator-side pull backstop. Add agentwire worktree --watch (or a review-driver) so the orchestrator can discover finished worktree PRs by polling, never depending on push alone.
- Distinguish a human-occupied box from an agent-busy box in safe_deliver, so agent orchestrators get driven at turn boundaries while human drafts stay protected.
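The first two items could combine roughly as follows. This is a sketch under stated assumptions, not a proposed patch: `Report`, `Router`, `ESCALATE_AFTER` (standing in for K, with a placeholder value), the `at_turn_boundary` flag, and the `note` message kind are all hypothetical. Only the done/escalation kinds and the 40-attempt cap come from the text above.

```python
from dataclasses import dataclass, field

ESCALATE_AFTER = 5   # "K" from the proposal; this value is a placeholder
MAX_ATTEMPTS = 40    # existing dead-letter cap
ACTIONABLE_KINDS = {"done", "escalation"}


@dataclass
class Report:
    sender: str
    kind: str
    body: str
    attempts: int = 0
    status: str = "pending"  # pending | delivered | dead


@dataclass
class Router:
    alerts: list = field(default_factory=list)  # loud record of dead letters

    def attempt(self, msg: Report, box_is_empty: bool, at_turn_boundary: bool) -> str:
        if msg.status != "pending":
            return msg.status
        if box_is_empty:
            msg.status = "delivered"
            return msg.status
        msg.attempts += 1
        escalated = msg.kind in ACTIONABLE_KINDS and msg.attempts >= ESCALATE_AFTER
        if escalated and at_turn_boundary:
            # Driving delivery: the recipient's turn just ended, so it is safe.
            msg.status = "delivered"
        elif msg.attempts >= MAX_ATTEMPTS:
            msg.status = "dead"
            # Never die silently: record something a doctor check can surface.
            self.alerts.append(
                f"dead-lettered {msg.kind} from {msg.sender} "
                f"after {msg.attempts} attempts"
            )
        return msg.status


router = Router()
done = Report("issue-490-watchdog", "done", "<session>: <PR URL>")
for _ in range(ESCALATE_AFTER):
    router.attempt(done, box_is_empty=False, at_turn_boundary=False)
router.attempt(done, box_is_empty=False, at_turn_boundary=True)
print(done.status, done.attempts)

chatter = Report("worker-x", "note", "fyi")  # hypothetical non-actionable kind
for _ in range(MAX_ATTEMPTS):
    router.attempt(chatter, box_is_empty=False, at_turn_boundary=False)
print(chatter.status, router.alerts)
```

In this sketch the done report is driven in at the first turn boundary after K deferrals, while a non-actionable message can still dead-letter at the cap, but it leaves an alert behind instead of vanishing.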
Impact
This is the single highest-leverage multi-session reliability fix: without it, autonomous orchestration silently requires constant human prompting. The whole "spawn workers, they report back, you review" loop is only as reliable as the report-back, and today it dead-letters.