Requirements: Add full suite of live integration tests for CLI validation #3

@tinkermonkey · 2026-06-27T15:28:21Z

orchestrator-bot[bot]
Bot Jun 27, 2026

Requirements Analysis

Auto-created from Issue #2

User Request

Overview

Add a comprehensive suite of live integration tests that validate the code-wrapper CLI is functional end-to-end. Tests should exercise real invocations against a live environment rather than mocked dependencies.

Goals

Validate all CLI entry points execute without error
Confirm correct output format and exit codes for success and failure cases
Cover core wrapping behaviors (input passthrough, output capture, error propagation)
Ensure the test suite can run in CI and locally

Acceptance Criteria

Integration test runner configured (pytest or equivalent)
Tests invoke the CLI binary directly, not internal Python modules
At least one test per CLI command/subcommand
Tests pass against a clean environment with no pre-existing state
CI workflow updated to run integration tests on PR and main push
README updated with instructions for running integration tests locally

Labels: None
Requested by: @tinkermonkey

The orchestrator will analyze this request and develop detailed requirements.
When complete, Issue #2 will be updated with final requirements.

tinkermonkey · 2026-06-27T16:32:06Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Idea Research

Integration Tests for code-wrapper: Research & Analysis

Executive Summary

The ticket as written carries a significant framing mismatch against the actual codebase. code-wrapper is a TypeScript library (not a CLI application) with no binary entry points, no subcommands, and no Python anywhere in the stack. The acceptance criteria reference "CLI binary," "CLI command/subcommand," and "pytest" — none of which apply. However, the intent behind the ticket is sound: the project needs tests that exercise the library's public API against a real Claude Code CLI rather than the existing fake-claude.mjs fixture.

The existing v0.3 test suite (70 tests, vitest, forks pool) already covers what most projects would call "integration tests" — spawning a real subprocess, exercising timeouts, signal handling, and session persistence. What's missing is live validation against the actual claude binary in an authenticated environment.

Idea Exploration

What "live integration tests" means for this project

code-wrapper has three independently importable layers:

Layer	Current test coverage	What "live" would add
`CliProcess`	Full coverage via `fake-claude.mjs` (7 scenarios)	Validate that real `claude` CLI output still conforms to the event contracts code-wrapper depends on
`EventParser`	18 tests covering all known event types	Confirm parser handles the actual event shapes emitted by current Claude CLI versions (schema drift detection)
`SessionManager`	8 tests with MemoryStore and FileStore	Validate `--session-id` / `--resume` round-trip against real CLI session state

The core question: are the existing fake-claude tests sufficient, or does the project need tests that hit the real Claude CLI?

Arguments for live tests:

The Claude Code CLI's stream-json output format is undocumented and can change between versions. The fake-claude.mjs fixture encodes assumptions about event shapes that may drift from reality.
--session-id and --resume behavior can only be validated against the real CLI.
CLAUDECODE environment variable handling, OAuth/API key credential selection, and --permission-mode bypassPermissions all interact with real CLI internals.
Both downstream consumers (Switchyard, Documentation Robotics) have been bitten by behavior changes that unit tests didn't catch.

Arguments for caution:

Live tests require API credentials and consume API credits on every run.
LLM output is non-deterministic — assertions must be structural (event types, field presence), not content-based.
Test latency: a single Claude CLI invocation takes 5–30 seconds depending on prompt complexity and model.
CI environments need credential injection (secrets management).

The "CLI binary" confusion

The ticket says "Tests invoke the CLI binary directly, not internal Python modules." This needs reinterpretation:

code-wrapper has no CLI binary. It's an npm library (@tinkermonkey/code-wrapper) with three subpath exports.
The "CLI binary" it wraps is the Claude Code CLI (claude), which is an external dependency not shipped with this package.
"Not internal Python modules" is inapplicable — the project is TypeScript.

What the ticket likely intends: tests should exercise the library's public API surface (CliProcess.run(), SessionManager, parseCliLine()) against a real Claude CLI installation, not just the fake fixture.

Potential Directions

Direction 1: Live smoke test suite (minimal, CI-friendly)

A small set of tests (5–10) that invoke the real claude CLI through CliProcess.run() with trivial prompts. These validate:

The event stream contract (system/init → assistant → result flow)
ReadyEvent contains sessionId, model, tools
DoneEvent contains usage with all four token fields
Session resume works (--session-id then --resume)
Exit code 0 on clean completion

Tradeoffs: Requires credentials in CI. Costs ~$0.01–0.05 per run (trivial prompts). Takes 30–60 seconds. Tests are gated behind an env flag (e.g., RUN_LIVE_TESTS=1) so npm test still works offline.

Direction 2: CLI contract snapshot tests

Record real Claude CLI output once, commit the snapshots, then run EventParser against them. This validates parser correctness against real output without requiring credentials on every CI run.

Tradeoffs: Snapshots go stale when the CLI updates. Doesn't catch runtime integration issues (process spawn, signal handling, credential selection). But it's free, fast, and fully deterministic.
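A minimal sketch of the snapshot-replay idea, assuming a capture is stored as newline-delimited JSON. The top-level event types (system, assistant, result) come from the lifecycle described in this thread; `replaySnapshot`, `KNOWN_TYPES`, and the sample capture are hypothetical names and data for illustration, not part of code-wrapper.

```typescript
// Hypothetical sketch: replay a recorded stream-json capture and report drift.
// Nothing here calls code-wrapper; a real test would feed each line to the
// library's parser instead of the inline JSON.parse below.

type RawEvent = { type: string };

// Event types the parser is assumed to understand today.
const KNOWN_TYPES = new Set(["system", "assistant", "result"]);

interface ReplayReport {
  types: string[];        // top-level event types, in stream order
  unknownTypes: string[]; // types outside KNOWN_TYPES (possible schema drift)
  malformed: number;      // lines that are not JSON objects with a string `type`
}

function replaySnapshot(ndjson: string): ReplayReport {
  const report: ReplayReport = { types: [], unknownTypes: [], malformed: 0 };
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines in the capture
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      report.malformed += 1;
      continue;
    }
    if (
      typeof parsed !== "object" ||
      parsed === null ||
      typeof (parsed as RawEvent).type !== "string"
    ) {
      report.malformed += 1;
      continue;
    }
    const type = (parsed as RawEvent).type;
    report.types.push(type);
    if (!KNOWN_TYPES.has(type) && !report.unknownTypes.includes(type)) {
      report.unknownTypes.push(type);
    }
  }
  return report;
}

// Invented capture, standing in for a committed snapshot file.
const sampleCapture = [
  '{"type":"system","subtype":"init","session_id":"abc"}',
  '{"type":"assistant","message":{"content":[{"type":"text","text":"hello"}]}}',
  '{"type":"result","subtype":"success"}',
].join("\n");

const replayResult = replaySnapshot(sampleCapture);
```

A snapshot test would then assert that `unknownTypes` is empty and `malformed` is zero. Re-recording the capture after a CLI upgrade turns schema drift into a visible diff.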

Direction 3: Extended fake-claude scenarios

Rather than going live, expand the fake-claude.mjs fixture to cover more edge cases observed in production by Switchyard and Documentation Robotics:

Multi-turn conversations with --resume
server_tool_use and redacted_thinking block types
Large output (100+ events in a single run)
Interleaved ProgressEvent timing
rate_limit_event inline (not just stderr)
Mixed error_detail + error_event sequences

Tradeoffs: Doesn't validate against real CLI. But covers the most common regression vectors without cost or credential dependencies.

Direction 4: Dual-track approach (recommended framing)

Combine Directions 1, 2, and 3:

Fast path (npm test): Extended fake-claude scenarios + snapshot tests. Runs in CI on every PR. No credentials needed.
Live path (npm run test:live): Smoke tests against real CLI. Runs on main push or on-demand. Requires ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN.

This mirrors Switchyard's pattern: unit tests in tests/unit/ run everywhere; integration tests in tests/integration/ run only in environments with Docker and Redis available.

References and Prior Art

Documentation Robotics CLI test infrastructure

The most directly applicable reference. Key patterns worth adopting:

runDr() helper: Spawns the CLI as a subprocess via spawnSync, returns {exitCode, stdout, stderr}. code-wrapper's equivalent would be a runLive() helper that wraps CliProcess.run() and collects all events into an array.
createTempWorkdir(): Creates isolated temp directories per test with cleanup. code-wrapper's CliProcess.test.ts already does this with mkdtempSync + rmSync for the fake binary — the same pattern applies for session persistence directories.
Golden copy optimization: DR uses read-only baseline copies cloned per test to avoid repeated initialization. For code-wrapper, a similar optimization could cache a known-good session store state.
stripAnsi() for output comparison: Not needed here (code-wrapper parses JSON, not terminal output), but the principle of normalizing test output before assertion applies.
CI workflow: DR's cli-tests.yml separates "smoke tests" (always run) from "comprehensive tests" (on-demand). This exactly matches the dual-track approach.

Switchyard integration test patterns

Switchyard's tests/integration/ directory shows how to test CLI wrappers in a Docker-mediated context:

Event stream capture: Tests use callback fixtures to collect streaming events, then assert on the collected set. code-wrapper's collect() helper in CliProcess.test.ts already implements this pattern.
Skip conditions: Tests check for environment prerequisites (Docker, Redis, orchestrator environment) and skip gracefully when unavailable. Live tests should similarly gate on claude CLI availability.
Resilience tests: Switchyard tests fallback storage, result persistence, and crash recovery. code-wrapper's recovery.test.ts partially covers this; live tests could validate the full runWithRecovery() flow.
Container lifecycle: Not directly applicable (code-wrapper spawns processes, not containers), but the two-phase wait pattern (signal → grace period → force kill) is already implemented and tested.

vitest workspace pattern

For projects that need multiple test configurations (fast unit tests vs slow live tests), vitest supports workspace mode where separate configs can define different includes, timeouts, and environment requirements. This is worth investigating as an alternative to separate npm scripts.

Technical Considerations

Credential management

Live tests require one of:

ANTHROPIC_API_KEY — API key authentication
CLAUDE_CODE_OAUTH_TOKEN — OAuth token authentication

code-wrapper's CliProcess already handles the credential selection logic (prefers OAuth, falls back to API key). Tests should validate both paths, which means at least one test per auth method.

In CI, these would be injected as GitHub Actions secrets. The test runner should detect missing credentials and skip live tests with a clear message, not fail.
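The selection and skip rules above can be sketched as one pure function. This is an illustration of the described behavior (prefer OAuth, fall back to the API key, skip when neither is set), not CliProcess's actual code; `chooseClaudeCredential` and `CredentialChoice` are hypothetical names.

```typescript
// Hypothetical sketch of the credential rules described in this section.
type Env = Record<string, string | undefined>;

type CredentialChoice =
  | { kind: "oauth" | "apiKey"; env: Env }   // env to hand to the subprocess
  | { kind: "none"; skipReason: string };    // live suite should skip, not fail

function chooseClaudeCredential(env: Env): CredentialChoice {
  const childEnv: Env = { ...env }; // never mutate the caller's environment
  if (env["CLAUDE_CODE_OAUTH_TOKEN"]) {
    // OAuth wins: drop the API key so the CLI cannot pick it up instead.
    delete childEnv["ANTHROPIC_API_KEY"];
    return { kind: "oauth", env: childEnv };
  }
  if (env["ANTHROPIC_API_KEY"]) {
    // API key fallback: make sure no empty OAuth variable is passed along.
    delete childEnv["CLAUDE_CODE_OAUTH_TOKEN"];
    return { kind: "apiKey", env: childEnv };
  }
  return {
    kind: "none",
    skipReason:
      "live Claude tests skipped: set CLAUDE_CODE_OAUTH_TOKEN or ANTHROPIC_API_KEY",
  };
}
```

A live suite could call this once with `process.env` and skip with `skipReason` when `kind` is `"none"`. The per-auth-method tests would each pass an environment containing only the credential under test.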

Cost and rate limiting

Each live test invocation costs API tokens. With trivial prompts ("respond with exactly 'hello'"), costs are negligible (~100 input + ~10 output tokens per call). But the rate limit concern is real — running 10 live tests in parallel could trigger the rate_limit_event path unintentionally.

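Mitigation: run live tests serially and set maxTimeout: 60 for each test. With vitest, pool: 'forks' plus fileParallelism: false runs test files one at a time; maxConcurrency: 1 alone is not enough, because it only limits tests marked test.concurrent.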

Deterministic assertions

LLM output is non-deterministic. Live test assertions must be structural:

"The event stream contains at least one TextEvent" (not "the text says X")
"A ReadyEvent was emitted with a non-empty sessionId"
"A DoneEvent was emitted with usage.inputTokens > 0"
"The seq values are strictly monotonically increasing"
"No ErrorEvent was emitted"

CI workflow

No .github/workflows/ directory exists in the project today. Any CI configuration is net-new. The workflow should:

Run npm test (existing fast tests) on every PR and push to main
Run live tests on push to main only (or via workflow_dispatch)
Cache node_modules for speed
Inject API credentials from repository secrets
Set a job timeout (5 minutes for fast tests, 10 minutes for live)

The pytest question

The ticket mentions "pytest or equivalent." The project uses vitest — this is the correct choice and is already configured. There is no reason to introduce pytest or any Python tooling. The acceptance criteria should be updated to reflect vitest.

`isAvailable()` as a gate

CliProcess.isAvailable() already checks whether the claude binary exists in PATH. Live tests should call this in a beforeAll hook and skip the entire suite if the CLI is not installed. This is the same pattern Switchyard uses with its Docker availability checks.

Test file organization

Current structure places all tests in src/__tests__/. Live tests warrant a separate directory or naming convention to keep them out of the default npm test run:

src/__tests__/*.test.ts — existing fast tests (default npm test)
src/__tests__/live/*.live.test.ts — live tests (opt-in via npm run test:live)

The vitest config can include live/*.live.test.ts only when a specific config is used or an env var is set.
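A sketch of the "specific config" option, written as two plain objects so the example carries no imports. The file names in the comments are assumptions. The option names (`include`, `exclude`, `pool`, `fileParallelism`, `testTimeout`, `passWithNoTests`) are standard vitest settings.

```typescript
// Sketch only: in a real project each object would be the default export of
// its own config file (optionally wrapped in defineConfig for typing).

// vitest.config.ts (assumed name): the fast path behind `npm test`.
const fastConfig = {
  test: {
    include: ["src/**/*.test.ts"],
    // The include glob above also matches *.live.test.ts files, so the live
    // directory has to be excluded explicitly to keep `npm test` offline.
    exclude: ["**/node_modules/**", "src/__tests__/live/**"],
    pool: "forks",
  },
};

// vitest.live.config.ts (assumed name): the live path behind `npm run test:live`.
const liveConfig = {
  test: {
    include: ["src/__tests__/live/**/*.live.test.ts"],
    pool: "forks",
    fileParallelism: false, // run test files one at a time
    testTimeout: 60_000,    // milliseconds; real CLI calls can take 5-30 s
    passWithNoTests: true,  // an empty live directory should not fail the run
  },
};
```

Under these assumptions the live script would be `vitest run --config vitest.live.config.ts`. Tests inside one file already run in order unless they are marked `test.concurrent`, so `fileParallelism: false` is the setting that makes the whole live run serial.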

Generated by Orchestrator Bot 🤖
Processed by the idea_researcher agent

4 replies

tinkermonkey Jun 27, 2026
Maintainer

Let's take the hybrid approach of having both quick fake testing and real live testing. This needs to cover both the Claude Code CLI and GitHub Copilot.

orchestrator-bot[bot] Bot Jun 27, 2026
Author

Idea Research

Great call on the hybrid approach. The Copilot dimension adds meaningful complexity worth unpacking, since the two CLIs have fundamentally different output contracts.

Where things stand today: code-wrapper already declares CliBackend = 'claude' | 'copilot' and dispatches on it in CliProcess, but the Copilot path throws "not yet implemented". The binary selection logic (claude vs gh) and the isAvailable() check are stubbed in. What's missing is buildArgs() for Copilot and an EventParser codepath for its output format.

The key architectural difference is that Claude Code emits structured stream-json events (typed, parseable, with system/init → assistant → result lifecycle), while Copilot emits plain text with markdown formatting. Documentation Robotics already solved this — their CopilotClient uses a heuristic CopilotOutputParser that detects code blocks, system messages, and tool invocations from unstructured text. The invocation pattern also differs:

	Claude Code	Copilot
Binary	`claude`	`copilot` (standalone npm)
Output format	`--output-format stream-json`	Plain text/markdown
Session start	`--session-id <id>`	(first invocation)
Session resume	`--resume <id>`	`--continue`
Permission bypass	`--permission-mode bypassPermissions`	`--allow-all-tools`
Agent selection	`--agent <name>`	`--agent <name>`

What this means for the test matrix:

The hybrid test suite needs two axes: backend (claude / copilot) × execution mode (fake / live).

Fake-claude tests (existing): keep and extend with more edge cases as discussed.
Fake-copilot tests (new): need a fake-copilot.mjs fixture that emits plain-text markdown output mimicking real Copilot responses. This validates the Copilot parser and --continue session logic without requiring the real binary.
Live-claude tests: smoke tests against real claude CLI with trivial prompts, gated on ANTHROPIC_API_KEY.
Live-copilot tests: smoke tests against real copilot CLI, gated on the binary being available and authenticated. The DR codebase shows this requires either a GitHub token or an active Copilot subscription.

Each live suite gates independently — if only claude is available, only claude live tests run. The isAvailable() method already handles this dispatch.

The ordering question: The Copilot backend isn't implemented yet (it's roadmapped for v0.4). There are two ways to sequence this:

Tests first, then implementation — write the fake-copilot fixture and live test shells now, have them assert against the expected event contract, and use them as a specification that drives the v0.4 implementation. This is TDD-style and ensures the Copilot integration is testable from day one.
Implementation first, then tests — implement the Copilot buildArgs() and parser in v0.4, then backfill tests. Faster to ship but risks the same "mocked tests pass, real CLI breaks" problem the ticket was created to prevent.

I'd lean toward option 1 — the DR codebase gives us enough information about Copilot's CLI contract to write accurate fixtures now, and having the test infrastructure ready means v0.4 development gets immediate feedback loops. The fake-copilot fixture can be derived directly from DR's CopilotOutputParser patterns: markdown code blocks, Running: / Searching: tool prefixes, and plain-text assistant responses.

One thing I want to confirm: when you say "GitHub Copilot," are you referring to the standalone copilot CLI (the npm package @github/copilot that DR uses), or gh copilot (the GitHub CLI extension)? The DR reference uses the standalone binary, and that's what code-wrapper's CliProcess stubs target. The gh copilot extension has a different invocation model and is more limited. This matters for how the fake fixture and live tests are structured.

tinkermonkey Jun 27, 2026
Maintainer

The standalone Copilot CLI is what we want to use.

I've also merged the Copilot CLI implementation in from its feature branch, so it is now part of the main branch codebase. Update main and you'll see the changes.

orchestrator-bot[bot] Bot Jun 27, 2026
Author

Idea Research

Thanks for the clarification — that reframes things significantly. Now that the copilot backend (ACP protocol via copilot --acp --stdio) is on main, the integration test story becomes much cleaner.

What the copilot backend gives us for testing:

The implementation speaks ACP (Agent Client Protocol) — a structured NDJSON JSON-RPC protocol over stdin/stdout. This is fundamentally different from the Claude Code CLI's stream-json output. The key pieces are already in place:

buildCopilotArgs() produces ['--acp', '--stdio'] plus optional --allow-all-tools, --agent, --resume=<uuid>
The ACP handshake (initialize → session/new → session/prompt) is written to stdin in run()
createCopilotAcpParser() is a stateful factory that maps NDJSON responses to normalized ClaudeEvents
fake-copilot.mjs already covers 5 scenarios (golden-path, stall, ignore-sigterm, nonzero-exit, permission-request) with 13 tests
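To make the argument list and handshake framing concrete, here is a rough sketch. It is not the library's `buildCopilotArgs()`. `sketchCopilotArgs`, `rpcLine`, and the option names are hypothetical, the two-token `--agent <name>` form is assumed from the earlier comparison table, and the request `params` reflect one reading of ACP rather than anything confirmed in this thread.

```typescript
// Hypothetical re-creation of the argument list described above.
interface CopilotArgOptions {
  allowAllTools?: boolean;
  agent?: string;
  resumeSessionId?: string;
}

function sketchCopilotArgs(options: CopilotArgOptions = {}): string[] {
  const args = ["--acp", "--stdio"];
  if (options.allowAllTools) args.push("--allow-all-tools");
  if (options.agent) args.push("--agent", options.agent);
  if (options.resumeSessionId) args.push(`--resume=${options.resumeSessionId}`);
  return args;
}

// NDJSON framing: one JSON-RPC 2.0 request per line, newline-terminated.
function rpcLine(id: number, method: string, params: Record<string, unknown>): string {
  return JSON.stringify({ jsonrpc: "2.0", id, method, params }) + "\n";
}

// First two handshake requests. The params are assumptions about ACP.
function handshakeOpening(cwd: string): string[] {
  return [
    rpcLine(1, "initialize", { protocolVersion: 1 }),
    rpcLine(2, "session/new", { cwd, mcpServers: [] }),
  ];
}

// The prompt needs the session ID returned by session/new, so it is built
// separately once that response has been read from stdout.
function promptLine(id: number, sessionId: string, text: string): string {
  return rpcLine(id, "session/prompt", {
    sessionId,
    prompt: [{ type: "text", text }],
  });
}
```

A live test would compare what the real `copilot --acp --stdio` process accepts against this shape, which is the unchecked ROADMAP item discussed further down.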

What "live integration tests" means with the copilot CLI:

The copilot binary is the standalone GitHub Copilot CLI. Live tests would invoke code-wrapper's CliProcess with backend: 'copilot' against the real copilot --acp --stdio process. The assertions stay structural — the same approach I outlined in the original analysis — but now they're validating the ACP contract rather than the Claude stream-json contract:

Assertion	What it validates
`ReadyEvent` emitted with non-empty `sessionId`	ACP `session/new` response parses correctly
At least one `TextEvent` with content	`session/update` / `assistant.message_delta` streaming works
`DoneEvent` emitted with matching `sessionId`	`session.idle` notification fires and parser captures UUID
`seq` values monotonically increasing	Parser sequencing is consistent
No `ErrorEvent` emitted	Clean end-to-end path
Session resume with `--resume=<uuid>`	ACP session continuity works

Credential and cost considerations shift too:

Auth: Copilot CLI uses GitHub authentication (OAuth device flow or GITHUB_TOKEN), not ANTHROPIC_API_KEY. CI needs a GitHub token with Copilot access.
Cost model: Copilot is subscription-based (not per-token), so live test invocations don't incur per-call API costs. This removes the biggest objection to running live tests frequently.
Rate limiting: Copilot has rate limits, but they're higher and less aggressive than Anthropic's API limits. Serial test execution is still prudent.

The dual-track approach still applies, but the economics are better:

Since Copilot is subscription-priced, the "live path" (npm run test:live) could potentially run on every PR rather than only on main pushes — no token cost concern. The fast path (npm test) still uses fake-copilot.mjs for offline/no-auth development.

One important gap in the current implementation:

The ROADMAP explicitly flags that "Copilot backend tested against a real copilot --acp --stdio invocation" is still unchecked. The fake-copilot.mjs fixture validates the parser and process lifecycle, but nobody has yet confirmed that the real ACP protocol output matches the assumptions baked into createCopilotAcpParser(). That's exactly what this ticket should deliver — and it's the highest-value test you can write, since ACP is relatively new (GA February 2026) and the protocol surface could still shift.

Also worth noting: the ROADMAP mentions @github/copilot-sdk (Node.js, GA June 2, 2026). If the SDK provides typed ACP client bindings, it might be worth investigating whether live tests could use the SDK to validate the protocol independently of code-wrapper's parser — a "trust but verify" approach where you compare SDK output against createCopilotAcpParser() output for the same session.

2026-06-27T17:16:57Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Business Analysis

Executive Summary

The @tinkermonkey/code-wrapper library wraps AI coding agent CLIs (Claude Code and GitHub Copilot) and normalizes their output into a typed event stream. The existing test suite (4 files, vitest, src/**/*.test.ts) exercises all three public layers — CliProcess, EventParser, and SessionManager — but exclusively through a fake-claude.mjs fixture injected into PATH. No tests run against either real CLI binary. No CI workflow exists. No test:live script exists. The Copilot backend (CliBackend = 'copilot') currently throws "Copilot backend is not yet implemented" and has no fixture, tests, or CI coverage.

The ticket's acceptance criteria reference "CLI binary," "subcommands," and "pytest" — all inapplicable to a TypeScript library with no binary entry points. The intent, confirmed by research, is to validate the library's public API (CliProcess.run(), SessionManager, parseCliLine()) against real CLI processes rather than the fake fixture.

The agreed-upon approach is a dual-track test structure: a fast path (npm test) using fake fixtures for both Claude and Copilot that runs offline without credentials, and a live path (npm run test:live) that exercises real CLI binaries, gated independently per backend on binary availability and credential presence.

The Copilot backend implementation is a prerequisite for all Copilot-related test requirements. This dependency must be resolved before Copilot test stories can be accepted.

Functional Requirements

FR-1: Live Test Runner Configuration

The project must have a dedicated test configuration and npm script that:

Runs files matching src/__tests__/live/**/*.live.test.ts (distinct from the existing src/**/*.test.ts pattern)
Does not run as part of npm test
Uses vitest with serial execution (fileParallelism: false; maxConcurrency: 1 only limits tests marked test.concurrent) to prevent concurrency-induced rate limit errors
Sets a per-test timeout appropriate for real CLI latency (minimum 60 seconds)
Is invocable via npm run test:live

FR-2: Claude Fake Fixture Extensions

The existing fake-claude.mjs fixture must be extended with scenarios not currently covered:

A permission-request scenario emitting a server_tool_use block to validate permission-gating behavior
A rate-limit scenario emitting a rate_limit_event inline in the stream (not only via stderr)
A session-resume scenario that accepts a --resume flag and emits a system/init event with a known session ID, validating that the isFirstMessage: false code path is exercised
A multi-block scenario where a single assistant turn contains multiple content blocks (text + tool_use) in one emission

FR-3: Claude Live Test Suite

A new test file src/__tests__/live/claude.live.test.ts must:

Skip the entire suite via beforeAll if CliProcess.isAvailable() returns false for the Claude backend
Skip the entire suite if neither ANTHROPIC_API_KEY nor CLAUDE_CODE_OAUTH_TOKEN is present in the environment
Delete the unused credential before each test (OAuth preferred over API key, mirroring CliProcess.run() behavior)
Exercise CliProcess with backend: 'claude' using trivial prompts that minimize token consumption
Assert on event stream structure only — event types and field presence — not on LLM-generated content
Cover:
- Golden path: ReadyEvent emitted with non-empty sessionId and model; at least one TextEvent; DoneEvent with usage.inputTokens > 0 and usage.outputTokens > 0
- seq monotonicity across all events in a run
- DoneEvent.sessionId matching ReadyEvent.sessionId
- Session resume: two sequential runs where the second run passes --resume (using the cliSessionId recorded from the first) and emits a ReadyEvent with the same session ID
- OAuth auth path: one test explicitly using CLAUDE_CODE_OAUTH_TOKEN (skipped if not present)
- API key auth path: one test explicitly using ANTHROPIC_API_KEY (skipped if not present)
- AbortSignal mid-run: cancel after ReadyEvent, confirm ErrorEvent { code: 'aborted' } is the final event

FR-4: Copilot Fake Fixture (Prerequisite: Copilot Backend Implementation)

A new fixture src/__tests__/fixtures/fake-copilot.mjs must be created, driven by a FAKE_SCENARIO environment variable, covering:

golden-path: full ACP session lifecycle — handshake, one assistant response turn, idle
stall: ACP handshake completes, then process stalls; exits cleanly on SIGTERM
ignore-sigterm: stalls and ignores SIGTERM, exits only on SIGKILL
nonzero-exit: ACP handshake, then exits with code 1
permission-request: ACP handshake, followed by a permission request notification before the assistant turn

The fixture must consume stdin (reading ACP request JSON-RPC messages written by CliProcess.run()) and respond with ACP-compliant NDJSON on stdout.
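The request/response core of such a fixture might look like the sketch below, for the golden-path scenario only. The notification names (`assistant.message_delta`, `session.idle`) are the ones used in this thread. `fakeCopilotReply`, the payload fields, and the error branch are assumptions, and the real fixture is a `.mjs` script that would wrap this logic around stdin and stdout.

```typescript
// Hypothetical golden-path responder: maps one ACP request line read from
// stdin to the NDJSON lines the fixture would write to stdout.
function fakeCopilotReply(requestLine: string, sessionId: string): string[] {
  const request = JSON.parse(requestLine) as { id: number; method: string };

  // A JSON-RPC response echoes the request id; a notification has no id.
  const reply = (result: unknown): string =>
    JSON.stringify({ jsonrpc: "2.0", id: request.id, result });
  const notify = (method: string, params: unknown): string =>
    JSON.stringify({ jsonrpc: "2.0", method, params });

  switch (request.method) {
    case "initialize":
      return [reply({ protocolVersion: 1 })]; // payload shape is an assumption
    case "session/new":
      return [reply({ sessionId })];
    case "session/prompt":
      // One assistant delta, then idle. A response to session/prompt itself
      // is omitted here; whether the real CLI sends one is left to live tests.
      return [
        notify("assistant.message_delta", { sessionId, delta: "hello" }),
        notify("session.idle", { sessionId }),
      ];
    default:
      // -32601 is the standard JSON-RPC "Method not found" error code.
      return [
        JSON.stringify({
          jsonrpc: "2.0",
          id: request.id,
          error: { code: -32601, message: `Method not found: ${request.method}` },
        }),
      ];
  }
}
```

The stall, ignore-sigterm, and nonzero-exit scenarios would reuse the first two branches and then diverge in process behavior (signal handlers and exit codes) rather than in message content.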

FR-5: Copilot Fake Test Suite (Prerequisite: FR-4 and Copilot Backend Implementation)

A new test file src/__tests__/CliProcess.copilot.test.ts must:

Mirror the structure of CliProcess.test.ts, injecting fake-copilot.mjs as the copilot binary into PATH via beforeAll
Instantiate CliProcess with backend: 'copilot'
Cover the same scenario categories as the Claude fake suite: golden path, stall, SIGTERM/SIGKILL, nonzero exit, AbortSignal, idle timeout, max timeout
Assert that ReadyEvent, TextEvent, and DoneEvent are all emitted with the same field contracts as the Claude backend (same normalized types, not Copilot-specific types)

FR-6: Copilot Live Test Suite (Prerequisite: FR-5 and Copilot Backend Implementation)

A new test file src/__tests__/live/copilot.live.test.ts must:

Skip the entire suite via beforeAll if CliProcess.isAvailable() returns false for the Copilot backend
Skip the entire suite if the Copilot binary is not authenticated (detect via a dry-run invocation or an environment flag)
Cover:
- Golden path: ReadyEvent, TextEvent, DoneEvent structural assertions identical to FR-3 golden path
- seq monotonicity
- DoneEvent.sessionId matching ReadyEvent.sessionId
- Session resume using Copilot's --resume=<uuid> flag (if supported by the backend)
- AbortSignal mid-run cancellation

FR-7: CI Workflow

A GitHub Actions workflow file (.github/workflows/ci.yml) must be created that:

Triggers on: pull_request (all branches) and push to main
Runs npm ci and caches node_modules
Runs npm run typecheck and npm test on every PR and every push to main
Runs npm run test:live on push to main only, injecting ANTHROPIC_API_KEY from repository secrets
Sets a 5-minute job timeout for the fast path and a 10-minute job timeout for the live path
Uses Node.js 20 (matching engines.node: ">=20.0.0")

FR-8: README Instructions

The project README must include a section documenting:

How to run the fast test suite: npm test
How to run the live test suite: npm run test:live, including which environment variables must be set (ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN for Claude; Copilot authentication requirements)
The behavior when prerequisites are missing (suites skip gracefully rather than fail)
How to run a specific test file or scenario in isolation using vitest's filter flag

User Stories

Story 1: Live Test Runner

As a library contributor,
I want a dedicated npm run test:live script with its own vitest configuration,
so that I can run tests against real CLI binaries without affecting the offline npm test suite.

Acceptance Criteria:

Given the project has no live tests yet, when npm run test:live is invoked, then vitest runs files matching src/__tests__/live/**/*.live.test.ts and exits with code 0 (no tests found is not a failure; vitest exits non-zero in that case by default, so the live configuration needs passWithNoTests)
Given a live test file exists, when npm test is invoked, then no files from src/__tests__/live/ are executed
Given a live test runs against a CLI that takes up to 30 seconds to respond, when the test is executing, then vitest does not report a timeout until at least 60 seconds have elapsed
Given multiple live tests run against the same backend, when the suite executes, then tests run serially (not concurrently) to avoid triggering rate limit errors

Story 2: Extended Claude Fake Scenarios

As a developer working on the Claude event stream parser,
I want fake-claude.mjs to cover session resume, rate-limit-inline, permission-request, and multi-block scenarios,
so that regression tests can detect parser drift without requiring real API credentials.

Acceptance Criteria:

Given FAKE_SCENARIO=session-resume is set, when fake-claude.mjs receives a --resume <id> argument, then it emits a system/init event with that same session ID and exits cleanly
Given FAKE_SCENARIO=rate-limit, when fake-claude.mjs runs, then it emits a rate_limit_event line on stdout before exiting with a non-zero code
Given FAKE_SCENARIO=permission-request, when fake-claude.mjs runs, then it emits a server_tool_use event after system/init before the result event
Given FAKE_SCENARIO=multi-block, when fake-claude.mjs runs, then it emits a single assistant event whose message.content array contains both a text block and a tool_use block
Given any unknown value for FAKE_SCENARIO, when fake-claude.mjs runs, then it writes an error to stderr and exits with code 1
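The scenario dispatch these criteria describe could be modeled as a pure function, sketched here in TypeScript even though the fixture itself is a `.mjs` script. `runFakeClaudeScenario` and `FixtureOutput` are hypothetical, the JSON field names follow the stream-json shapes referenced in this thread, and the permission-request scenario is left out because its exact event shape is not pinned down above.

```typescript
// Hypothetical model of fake-claude.mjs: what it would print and how it exits.
interface FixtureOutput {
  stdout: string[]; // NDJSON lines
  stderr: string[];
  exitCode: number;
}

function runFakeClaudeScenario(scenario: string, argv: string[]): FixtureOutput {
  const line = (value: unknown): string => JSON.stringify(value);
  const init = (sessionId: string): string =>
    line({ type: "system", subtype: "init", session_id: sessionId });
  const result = line({ type: "result", subtype: "success" });

  switch (scenario) {
    case "session-resume": {
      const flagIndex = argv.indexOf("--resume");
      const resumed: string | undefined =
        flagIndex >= 0 ? argv[flagIndex + 1] : undefined;
      if (!resumed) {
        // Assumption: a missing --resume argument is treated as a usage error.
        return { stdout: [], stderr: ["session-resume requires --resume <id>"], exitCode: 1 };
      }
      return { stdout: [init(resumed), result], stderr: [], exitCode: 0 };
    }
    case "rate-limit":
      // Inline rate limit on stdout, then a non-zero exit.
      return {
        stdout: [init("fake-session"), line({ type: "rate_limit_event" })],
        stderr: [],
        exitCode: 1,
      };
    case "multi-block":
      // One assistant event whose content holds a text and a tool_use block.
      return {
        stdout: [
          init("fake-session"),
          line({
            type: "assistant",
            message: {
              content: [
                { type: "text", text: "working" },
                { type: "tool_use", id: "tool-1", name: "Read", input: {} },
              ],
            },
          }),
          result,
        ],
        stderr: [],
        exitCode: 0,
      };
    default:
      return { stdout: [], stderr: [`unknown FAKE_SCENARIO: ${scenario}`], exitCode: 1 };
  }
}
```

Keeping the dispatch separate from process I/O lets the scenarios be checked directly, while the script reduces to reading `FAKE_SCENARIO`, printing `stdout` and `stderr`, and exiting with `exitCode`.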

Story 3: Claude Live Golden Path

As a CI pipeline operator,
I want a live test that confirms the full Claude CLI event stream round-trip works end-to-end,
so that I can detect Claude Code protocol changes before they reach downstream consumers.

Acceptance Criteria:

Given claude is not found in PATH, when the live Claude test suite runs, then all tests in the suite are skipped (not failed) with a message indicating the binary is unavailable
Given neither ANTHROPIC_API_KEY nor CLAUDE_CODE_OAUTH_TOKEN is set, when the live Claude test suite runs, then all tests in the suite are skipped with a message indicating missing credentials
Given valid credentials and claude in PATH, when CliProcess.run() completes with a trivial prompt, then the collected event stream contains exactly one ReadyEvent, at least one TextEvent, and exactly one DoneEvent
Given a completed run, when the ReadyEvent and DoneEvent are compared, then ReadyEvent.sessionId === DoneEvent.sessionId
Given a completed run, when DoneEvent.usage is inspected, then inputTokens > 0 and outputTokens > 0
Given a completed run, when all event seq values are inspected, then each seq value is strictly greater than the previous one
Given no ErrorEvent should be emitted on a clean run, when all events are collected, then zero events with type === 'error' are present
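These criteria are all structural, so they can be collected into one checker that returns a list of violations. The sketch below assumes simplified event shapes: the `type` discriminators `ready`, `text`, and `done` are guesses (only `type === 'error'` is stated in this thread), and `structuralViolations` is a hypothetical helper, not part of code-wrapper.

```typescript
// Hypothetical, simplified stand-ins for the library's normalized events.
interface Usage { inputTokens: number; outputTokens: number }
type StreamEvent =
  | { type: "ready"; seq: number; sessionId: string }
  | { type: "text"; seq: number; text: string }
  | { type: "done"; seq: number; sessionId: string; usage: Usage }
  | { type: "error"; seq: number; code: string };

function structuralViolations(events: StreamEvent[]): string[] {
  const problems: string[] = [];
  const readySessionIds: string[] = [];
  const doneSessionIds: string[] = [];
  let textCount = 0;
  let previousSeq = Number.NEGATIVE_INFINITY;

  for (const current of events) {
    // seq must be strictly greater than the previous event's seq.
    if (!(current.seq > previousSeq)) {
      problems.push(`seq ${current.seq} is not greater than ${previousSeq}`);
    }
    previousSeq = current.seq;

    switch (current.type) {
      case "ready":
        readySessionIds.push(current.sessionId);
        break;
      case "text":
        textCount += 1;
        break;
      case "done":
        doneSessionIds.push(current.sessionId);
        if (current.usage.inputTokens <= 0) problems.push("usage.inputTokens is not > 0");
        if (current.usage.outputTokens <= 0) problems.push("usage.outputTokens is not > 0");
        break;
      case "error":
        problems.push(`unexpected ErrorEvent (${current.code})`);
        break;
    }
  }

  if (readySessionIds.length !== 1) problems.push(`expected exactly one ReadyEvent, got ${readySessionIds.length}`);
  if (doneSessionIds.length !== 1) problems.push(`expected exactly one DoneEvent, got ${doneSessionIds.length}`);
  if (textCount === 0) problems.push("no TextEvent emitted");
  if (readySessionIds[0] === "") problems.push("ReadyEvent.sessionId is empty");
  if (
    readySessionIds.length === 1 &&
    doneSessionIds.length === 1 &&
    readySessionIds[0] !== doneSessionIds[0]
  ) {
    problems.push("ReadyEvent.sessionId does not match DoneEvent.sessionId");
  }
  return problems;
}
```

A live test can then make a single assertion that the returned array is empty. On failure the array names every contract violation at once, which helps when each rerun costs a real CLI call.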

Story 4: Claude Live Session Resume

As a developer using SessionManager to implement multi-turn conversations,
I want a live test that validates --session-id and --resume round-trip against the real Claude CLI,
so that session continuity bugs are caught before they affect users.

Acceptance Criteria:

Given valid credentials, when a first run completes and DoneEvent.sessionId is captured, then a second run passing that ID via isFirstMessage: false emits a ReadyEvent without an ErrorEvent { code: 'stale_session' }
Given the second run completes, when its ReadyEvent.sessionId is compared to the captured ID from the first run, then the values match
Given the session resume test scenario, when the test completes both turns, then no ErrorEvent with any code is present in either run's event collection

Story 5: Claude Live Auth Path Coverage

As a library integrator configuring credentials for deployment,
I want live tests that validate both OAuth token and API key auth paths independently,
so that I can confirm that credential selection logic is correct for each deployment scenario.

Acceptance Criteria:

Given CLAUDE_CODE_OAUTH_TOKEN is set and ANTHROPIC_API_KEY is also set, when a live test run executes, then the process is spawned with ANTHROPIC_API_KEY removed from the subprocess environment and CLAUDE_CODE_OAUTH_TOKEN present, and the run completes without ErrorEvent
Given only ANTHROPIC_API_KEY is set (no OAuth token), when a live test run executes, then CLAUDE_CODE_OAUTH_TOKEN is absent from the subprocess environment and the run completes without ErrorEvent
Given CLAUDE_CODE_OAUTH_TOKEN is not set in the test environment, when the OAuth-specific test runs, then that individual test is skipped, not failed
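
The credential selection these criteria describe can be sketched as a pure function over an environment map. This is an illustration under stated assumptions: `buildClaudeEnv` and `EnvMap` are hypothetical names, and the precedence shown (OAuth token wins over API key) is taken from the first criterion above.

```typescript
type EnvMap = Record<string, string | undefined>;

// Hypothetical helper: derives the subprocess environment from the parent one.
function buildClaudeEnv(parent: EnvMap): EnvMap {
  const child: EnvMap = { ...parent }; // copy so the caller's environment is never mutated
  if (child['CLAUDE_CODE_OAUTH_TOKEN']) {
    // OAuth takes precedence: remove the API key so the CLI cannot pick it up.
    delete child['ANTHROPIC_API_KEY'];
  } else {
    // No usable OAuth token: make sure an empty value is not passed through.
    delete child['CLAUDE_CODE_OAUTH_TOKEN'];
  }
  return child;
}
```

A live test can then assert on the derived map directly (key present or absent) before asserting that the run completes without an ErrorEvent.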

Story 6: Claude Live AbortSignal Cancellation

As a caller using AbortSignal to cancel in-flight requests,
I want a live test confirming mid-run cancellation works against the real Claude CLI,
so that I can trust the SIGTERM/SIGKILL escalation behaves as expected outside the fake fixture.

Acceptance Criteria:

Given a live Claude run is in progress and the abort controller fires after ReadyEvent is received, when the cancellation completes, then the final event in the collected stream has type === 'error' and code === 'aborted'
Given the abort fires mid-run, when the test completes, then the CliProcess instance's activeProc is null (no leaked subprocess)

Story 7: Copilot Fake Fixture and Test Suite

As a developer implementing the Copilot ACP backend,
I want a fake-copilot.mjs fixture and corresponding test suite that exercises CliProcess with backend: 'copilot',
so that the Copilot event stream contract is testable offline and regressions are caught without live API calls.

Prerequisite: Copilot backend implementation must be merged to main before this story can be accepted.

Acceptance Criteria:

Given FAKE_SCENARIO=golden-path, when fake-copilot.mjs runs with ACP arguments (--acp --stdio), then it reads the ACP initialize request from stdin and writes a valid ACP initialize response on stdout, followed by a session/new response containing a session UUID, followed by an assistant.message_delta notification, followed by a session.idle notification, then exits with code 0
Given fake-copilot.mjs is injected into PATH as copilot, when CliProcess.run() is called with backend: 'copilot' and FAKE_SCENARIO=golden-path, then the collected events include ReadyEvent, TextEvent, and DoneEvent with the same field contracts as the Claude backend
Given FAKE_SCENARIO=stall with backend: 'copilot', when idleTimeout: 1 is set, then ErrorEvent { code: 'idle_timeout' } is the final event
Given FAKE_SCENARIO=ignore-sigterm with backend: 'copilot', when idle timeout fires, then SIGKILL escalation terminates the process and ErrorEvent { code: 'idle_timeout' } is yielded
Given FAKE_SCENARIO=nonzero-exit with backend: 'copilot', when the run completes, then ErrorEvent { code: 'nonzero_exit', exitCode: 1 } is yielded
Given a pre-flight aborted AbortController, when CliProcess.run() is called with backend: 'copilot', then a single ErrorEvent { code: 'aborted' } is yielded without spawning the process

Story 8: Copilot Live Test Suite

As a developer validating the Copilot ACP integration,
I want live tests that run CliProcess against the real copilot --acp --stdio process,
so that ACP protocol compatibility is continuously verified and drift from the real Copilot CLI is detected.

Prerequisite: Story 7 must be complete.

Acceptance Criteria:

Given the copilot binary is not found in PATH, when the live Copilot test suite runs, then all tests are skipped (not failed)
Given the Copilot binary is present but not authenticated, when the live Copilot test suite runs, then all tests are skipped with a message indicating authentication is required
Given a valid authenticated Copilot environment, when CliProcess.run() with backend: 'copilot' completes, then the event stream contains ReadyEvent with a non-empty sessionId, at least one TextEvent, and DoneEvent
Given a completed live Copilot run, when seq values are inspected, then they are strictly monotonically increasing
Given a completed live Copilot run, when ReadyEvent.sessionId and DoneEvent.sessionId are compared, then they match

Story 9: CI Workflow

As a repository maintainer,
I want a GitHub Actions workflow that runs the fast test suite on every PR and the live test suite on pushes to main,
so that regressions are caught before merge and the live integration is continuously validated against real CLI binaries.

Acceptance Criteria:

Given a pull request is opened or updated, when the CI workflow triggers, then npm run typecheck and npm test both run and must pass before the PR can be merged
Given a push to the main branch, when the CI workflow triggers, then npm test runs first and npm run test:live runs after, with ANTHROPIC_API_KEY injected from the repository's ANTHROPIC_API_KEY secret
Given the live test job starts, when credentials are missing from repository secrets, then the live tests skip gracefully and the job exits with code 0 (not a failure)
Given the fast test job, when it runs, then it completes within 5 minutes; given the live test job, when it runs, then it completes within 10 minutes
Given node_modules is installed on a previous run, when the workflow runs again on the same branch, then the cache is restored rather than re-downloading all dependencies
Given the CLAUDE_CODE_OAUTH_TOKEN secret is not configured in the repository, when the CI workflow runs, then no test or job fails solely due to its absence

Story 10: README Test Documentation

As a first-time contributor or downstream integrator,
I want the README to explain how to run both the fast and live test suites locally,
so that I can validate the library against my environment without reading source code to understand the test infrastructure.

Acceptance Criteria:

Given a developer has no credentials configured, when they read the README's test section, then they can find the command to run the offline suite (npm test) and understand it requires no API keys
Given a developer has ANTHROPIC_API_KEY set, when they read the README's test section, then they can find the exact command and environment variable name needed to run the live Claude tests
Given a developer has the Copilot CLI installed and authenticated, when they read the README's test section, then they can find the command to run the live Copilot tests
Given the prerequisites for live tests are not met, when the README is consulted, then it states that suites skip gracefully and do not fail when the binary or credentials are absent
Given a developer wants to run a single test in isolation, when the README is consulted, then it shows how to run one test file by passing its path to vitest run (or one test by name via -t), without running the full suite

Generated by Orchestrator Bot 🤖
Processed by the business_analyst agent

2026-06-27T18:43:08Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Architecture Design

I now have a complete understanding of every relevant component. Here is the architecture design.

Architecture Design: Live Integration Tests for code-wrapper

ADR-001: Dual-Track Test Architecture

Status: Proposed

Context: code-wrapper's existing test suite (~70 tests) exercises all three layers via fake-claude.mjs and fake-copilot.mjs fixtures injected into PATH. These tests validate parser logic, timeout/signal handling, and session persistence — but only against synthetic output. Neither backend has ever been validated against a real CLI binary. The Claude Code stream-json format and Copilot ACP protocol are both undocumented and subject to change between versions. Downstream consumers (Switchyard, Documentation Robotics) have been bitten by behavior changes that fixture-based tests couldn't detect.

Decision: Adopt a dual-track test architecture:

Fast track (npm test): Existing fixture-based tests plus new extended scenarios. Runs offline, no credentials, no external binaries. Targets src/**/*.test.ts.
Live track (npm run test:live): Smoke tests against real CLI binaries. Requires the binary in PATH and valid credentials. Targets src/__tests__/live/**/*.live.test.ts. Runs serially. Independent gating per backend.

Consequences:

Contributors can develop and run tests offline with zero credential setup.
CI catches real protocol drift on main pushes without blocking PR workflows.
Two vitest configs to maintain (acceptable — they share the project's tsconfig.test.json).
Live tests incur API costs for Claude (~$0.01–0.05 per run with trivial prompts); Copilot is subscription-priced with no per-call cost.

System Architecture

Test File Layout

src/__tests__/
├── fixtures/
│   ├── fake-claude.mjs          # existing — 7 scenarios
│   └── fake-copilot.mjs         # from feat/v0.4-copilot-backend — 5 scenarios
├── helpers/
│   └── live-helpers.ts          # shared utilities for live test suites
├── live/
│   ├── claude.live.test.ts      # live Claude Code smoke tests
│   └── copilot.live.test.ts     # live Copilot ACP smoke tests
├── CliProcess.test.ts           # existing Claude + Copilot fake tests
├── EventParser.test.ts          # existing
├── SessionManager.test.ts       # existing
└── recovery.test.ts             # existing

Vitest Configuration

Two configs, selected by npm script:

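vitest.config.ts (existing; needs an explicit exclude for the live directory):

import { configDefaults, defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/**/*.test.ts'],
    // The glob above also matches *.live.test.ts, so live files are excluded explicitly.
    exclude: [...configDefaults.exclude, 'src/__tests__/live/**'],
    pool: 'forks',
    testTimeout: 15_000,
  },
});

vitest.config.live.ts (new):

import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/__tests__/live/**/*.live.test.ts'],
    pool: 'forks',
    testTimeout: 120_000,             // 2 minutes per test, to absorb LLM latency
    fileParallelism: false,           // run test files one at a time (Vitest 1.1+)
    maxConcurrency: 1,                // also cap test.concurrent tests at one
    retry: 0,                         // no retries: LLM non-determinism makes retries unreliable
  },
});

The two suites must not overlap. The live config's include only matches files under src/__tests__/live/. The fast config's glob src/**/*.test.ts would also match claude.live.test.ts (the * wildcard matches "claude.live"), so it relies on the exclude entry to keep live tests out of npm test.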

npm Scripts

{
  "scripts": {
    "test": "vitest run",
    "test:watch": "vitest",
    "test:live": "vitest run --config vitest.config.live.ts"
  }
}

Component Design

Live Test Helper Module (`src/__tests__/helpers/live-helpers.ts`)

A single module providing shared utilities for both live test suites. This avoids duplicating gating, collection, and assertion logic across claude.live.test.ts and copilot.live.test.ts.

Responsibilities:

Backend availability check — wraps CliProcess.isAvailable() and skips the suite if the binary is missing.
Credential gating — checks for ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN (Claude) or Copilot authentication status. Skips with a diagnostic message when missing.
Event collection — collectLive(backend, options): Promise<ClaudeEvent[]> — instantiates CliProcess(backend), drains all events, returns the array. Identical to the collect() pattern in CliProcess.test.ts but parameterized by backend.
Structural assertion helpers — reusable functions that encode the common structural assertions (seq monotonicity, ReadyEvent/DoneEvent sessionId match, no ErrorEvents). These are not custom matchers — they are plain functions that call expect() internally, keeping them framework-familiar.

import { expect } from 'vitest';
// Import path is an assumption based on src/events/types.ts; adjust to the real layout.
import type { ClaudeEvent, DoneEvent, ReadyEvent } from '../../events/types';

export function assertEventStreamStructure(events: ClaudeEvent[]): void {
  // At least one of each lifecycle event
  expect(events.some(e => e.type === 'ready')).toBe(true);
  expect(events.some(e => e.type === 'text')).toBe(true);
  expect(events.some(e => e.type === 'done')).toBe(true);
  // No errors on clean runs
  expect(events.filter(e => e.type === 'error')).toHaveLength(0);
  // Seq monotonicity
  for (let i = 1; i < events.length; i++) {
    expect(events[i].seq).toBeGreaterThan(events[i - 1].seq);
  }
  // Session ID consistency (the casts are safe: presence was asserted above)
  const ready = events.find(e => e.type === 'ready') as ReadyEvent;
  const done = events.find(e => e.type === 'done') as DoneEvent;
  expect(ready.sessionId).toBe(done.sessionId);
  expect(ready.sessionId).toBeTruthy();
}

Design rationale: Switchyard's tests/utils/assertions.py demonstrates that domain-specific assertion helpers (e.g., assert_state_transition(), assert_comment_posted()) make test intent clearer without duplicating assertion logic. The same pattern applies here. Documentation Robotics' assertCliSuccess()/assertCliFailed() similarly abstracts repeated exit-code checks into readable helpers.

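Claude Live Test Suite (`src/__tests__/live/claude.live.test.ts`)

Gating: the test module evaluates new CliProcess('claude').isAvailable() at load time (top-level await), checks the credential environment variables, and passes the combined result to describe.skipIf(). The check cannot live in beforeAll, because describe.skipIf() evaluates its condition during collection, before any hook runs. If the binary is missing, or neither ANTHROPIC_API_KEY nor CLAUDE_CODE_OAUTH_TOKEN is set, the entire describe block is skipped with a message.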

Environment isolation: Each test spawns CliProcess('claude') in a fresh tmpdir(). The helper ensures CLAUDECODE is not set (already handled by CliProcess.run() internally).

Test cases:

Test	What it validates	Prompt	Assertions
Golden path	Full event lifecycle	`"respond with hello"`	`assertEventStreamStructure()`, DoneEvent has `usage.inputTokens > 0` and `usage.outputTokens > 0`
ReadyEvent fields	Model and sessionId populated	`"respond with hello"`	`ReadyEvent.sessionId` is non-empty string, `ReadyEvent.model` is truthy
Session resume	`--session-id` → `--resume` round-trip	Two sequential runs: first with `isFirstMessage: true`, second with `isFirstMessage: false` and `sessionId` from first run's DoneEvent	Second run's ReadyEvent.sessionId matches first run's DoneEvent.sessionId; no ErrorEvents in either run
AbortSignal mid-run	Cancellation against real CLI	`"write a long essay"`	Abort after ReadyEvent; final event is `ErrorEvent { code: 'aborted' }`
OAuth auth path	OAuth credential selection	`"respond with hello"`	Skipped if `CLAUDE_CODE_OAUTH_TOKEN` not set; run with only OAuth token in env; completes without ErrorEvent
API key auth path	API key credential selection	`"respond with hello"`	Skipped if `ANTHROPIC_API_KEY` not set; run with only API key in env; completes without ErrorEvent

Prompt design: All prompts are trivial and directive — "respond with exactly the word hello" — to minimize token consumption and maximize determinism. The session resume test uses two short prompts. Estimated cost: ~$0.01–0.03 per full suite run.

Timeout strategy: Each test gets maxTimeout: 60 on the ProcessOptions (not just vitest's test timeout). This ensures the spawned CLI process is killed even if LLM latency is unusually high, preventing zombie processes in CI.

Copilot Live Test Suite (`src/__tests__/live/copilot.live.test.ts`)

Gating: beforeAll calls CliProcess('copilot').isAvailable(). Then attempts a dry-run ACP handshake (initialize only) to verify the binary is authenticated. If either check fails, skips the suite.

Authentication verification: The Copilot CLI requires GitHub authentication. Unlike Claude (where credential presence can be checked via environment variables), Copilot authentication state is opaque — the only reliable check is attempting an invocation. The beforeAll hook runs a minimal CliProcess('copilot').run() with a trivial prompt and a 30-second timeout. If it emits a ReadyEvent, authentication is confirmed. If it errors, the suite skips.

Test cases:

Test	What it validates	Assertions
Golden path	ACP event lifecycle	`assertEventStreamStructure()`
ReadyEvent session UUID	ACP `session/new` response parsed	`ReadyEvent.sessionId` matches UUID format
Text content emitted	`session/update` notifications stream	At least one `TextEvent` with non-empty `text`
Session resume	`--resume=<uuid>` round-trip	Second run with `sessionId` from first; ReadyEvent.sessionId matches
AbortSignal mid-run	Cancellation of ACP session	Abort after ReadyEvent; final event is `ErrorEvent { code: 'aborted' }`
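
The "matches UUID format" assertion in the table above can be sketched as a small type guard. This assumes the canonical 8-4-4-4-12 hexadecimal form and deliberately does not check version or variant bits, since the thread does not say which UUID version Copilot issues; `looksLikeUuid` is a hypothetical helper name.

```typescript
// Canonical textual UUID form: 8-4-4-4-12 hex digits, case-insensitive.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: true only for strings in canonical UUID form.
function looksLikeUuid(value: unknown): value is string {
  return typeof value === 'string' && UUID_PATTERN.test(value);
}
```

Checking the shape rather than a specific value keeps the test structural, consistent with the rule that live tests never assert on generated content.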

No per-token cost: Copilot is subscription-priced. This suite could run on every PR if credentials are available in CI.

Extended Fake Fixture Scenarios

The business analysis identified four additional scenarios for fake-claude.mjs. These strengthen the fast-track suite without requiring credentials.

Scenario	`FAKE_SCENARIO` value	What it tests
Session resume	`session-resume`	Accepts `--resume <id>` flag, emits `system/init` with the same session ID. Validates the `isFirstMessage: false` code path in `CliProcess.buildArgs()`.
Rate limit inline	`rate-limit`	Emits a `rate_limit_event` line on stdout (not stderr). Validates `EventParser` handles inline rate limits and `CliProcess` yields `ErrorEvent { code: 'rate_limit' }`.
Permission request	`permission-request`	Emits a `server_tool_use` block inside an `assistant` event. Validates `EventParser` maps unrecognized block types to `RawEvent { rawSubtype: 'server_tool_use' }`.
Multi-block assistant	`multi-block`	Emits a single `assistant` event with `content: [text_block, tool_use_block]`. Validates `EventParser` yields two events (TextEvent + ToolUseEvent) from one line.
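
The multi-block and permission-request rows above describe one mapping: each content block in an assistant message becomes its own event, and unrecognized block types fall back to a raw event. A minimal sketch of that mapping follows; the type names (`AssistantBlock`, `BlockEvent`) and the event field names other than `rawSubtype` are hypothetical, not the library's actual definitions.

```typescript
// Hypothetical shapes for illustration only.
type AssistantBlock = { type: string; [key: string]: unknown };
type BlockEvent =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; name: string; input: unknown }
  | { type: 'raw'; rawSubtype: string; payload: AssistantBlock };

// One event per content block, in order; unknown blocks are preserved as raw events.
function eventsFromAssistantContent(content: AssistantBlock[]): BlockEvent[] {
  return content.map((block): BlockEvent => {
    const text = block['text'];
    const toolName = block['name'];
    if (block.type === 'text' && typeof text === 'string') {
      return { type: 'text', text };
    }
    if (block.type === 'tool_use' && typeof toolName === 'string') {
      return { type: 'tool_use', name: toolName, input: block['input'] };
    }
    // Anything else (for example server_tool_use) is kept rather than dropped.
    return { type: 'raw', rawSubtype: block.type, payload: block };
  });
}
```

With this shape, the `multi-block` scenario expects two events from one line, and the `permission-request` scenario expects a raw event whose `rawSubtype` is `server_tool_use`.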

The Copilot fixture (fake-copilot.mjs) already covers 5 scenarios and does not need extension at this stage — its ACP protocol surface is smaller and more constrained than Claude's stream-json format.

Scalability Design

Test Isolation

Each live test creates its own CliProcess instance and runs in its own forked vitest worker. There is no shared state between tests — no shared session IDs, no shared temp directories, no shared credential state. This mirrors the isolation model used by the existing fake tests and follows Switchyard's pattern of per-test temp_workspace fixtures.

Serial Execution for Live Tests

Live tests run with maxConcurrency: 1. This is not a performance optimization — it is a correctness requirement:

Rate limiting: Concurrent Claude API calls from the same key can trigger rate_limit_event responses, turning golden-path tests into failure-path tests non-deterministically.
Copilot session contention: The standalone Copilot CLI may not support concurrent ACP sessions from the same authenticated user.
Predictable CI billing: Serial execution makes per-run API costs deterministic and auditable.

Estimated live suite duration: roughly 1.5–4.5 minutes (6 Claude tests × 10–30s each + 5 Copilot tests × 5–15s each = 85–255s).

Future Backend Extension

The architecture supports adding future backends (e.g., cursor, windsurf) by:

Adding a new .live.test.ts file under src/__tests__/live/
Adding a new fake fixture under src/__tests__/fixtures/
Extending CliBackend type (already a union)
Using collectLive(backend, options) from the shared helper — no new collection infrastructure needed

The vitest include pattern src/__tests__/live/**/*.live.test.ts automatically picks up new files.

Established Patterns

Pattern: PATH Injection for Fake Binaries

Already established in CliProcess.test.ts. The test's beforeAll creates a temp directory, writes the fixture script as the binary name (claude or copilot), makes it executable, and prepends the directory to PATH. This is the project's canonical pattern for testing subprocess spawning without real binaries.

Live tests do NOT use this pattern. They rely on the real binary being in the system PATH, validated by CliProcess.isAvailable().

Pattern: Event Collection via Async Generator Drain

const events: ClaudeEvent[] = [];
for await (const ev of proc.run(options)) { events.push(ev); }

Used in every test that exercises CliProcess.run(). The live helper's collectLive() encapsulates this, matching how collect() works in CliProcess.test.ts.

Pattern: Scenario-Driven Fixtures

FAKE_SCENARIO env var selects fixture behavior. Both fake-claude.mjs and fake-copilot.mjs follow this pattern. New scenarios are added as case branches in a switch statement. Unknown scenarios write to stderr and exit with code 1.
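
The scenario switch can be sketched as a pure function that plans what a fixture would print and which exit code it would use. This is a simplified model, not the contents of fake-claude.mjs: `planScenario`, `FixtureOutput`, and the JSON payloads are placeholders chosen for illustration.

```typescript
interface FixtureOutput {
  stdout: string[]; // newline-delimited JSON lines the fixture would print
  stderr: string[];
  exitCode: number;
}

// Hypothetical planner: maps a FAKE_SCENARIO value to the fixture's behavior.
function planScenario(scenario: string | undefined): FixtureOutput {
  switch (scenario) {
    case 'golden-path':
      return {
        stdout: [
          // Placeholder payloads; the real fixture defines the exact fields.
          JSON.stringify({ type: 'system', subtype: 'init', session_id: 'fake-session' }),
          JSON.stringify({ type: 'assistant', message: { content: [{ type: 'text', text: 'hello' }] } }),
          JSON.stringify({ type: 'result', subtype: 'success', session_id: 'fake-session' }),
        ],
        stderr: [],
        exitCode: 0,
      };
    case 'nonzero-exit':
      return { stdout: [], stderr: ['simulated failure'], exitCode: 1 };
    default:
      // Unknown scenarios fail loudly so a typo in a test cannot pass silently.
      return { stdout: [], stderr: [`unknown FAKE_SCENARIO: ${String(scenario)}`], exitCode: 1 };
  }
}
```

A real fixture would write these lines to its own stdout and stderr and then exit with the planned code; adding a scenario means adding one `case` branch.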

Pattern: Skip-When-Unavailable Gating

Switchyard's integration tests skip when Docker, Redis, or API keys are unavailable. Documentation Robotics' CLI tests skip comprehensive tests unless manually triggered. The live test suites follow this same pattern: check prerequisites at module load, before the describe block is registered, and skip the entire block if they are unmet.

vitest provides describe.skipIf(condition) for this:

const claudeAvailable = await new CliProcess('claude').isAvailable();
const hasCredentials = !!process.env.ANTHROPIC_API_KEY || !!process.env.CLAUDE_CODE_OAUTH_TOKEN;
describe.skipIf(!claudeAvailable || !hasCredentials)('claude live', () => { ... });
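
The gating condition above can be pulled into a small pure function so that the reason for a skip is available for logging. A minimal sketch, assuming a hypothetical `claudeLiveGate` helper and `GateDecision` shape; the environment variable names are the ones used throughout this design.

```typescript
type GateEnv = Record<string, string | undefined>;

interface GateDecision {
  skip: boolean;
  reason?: string; // present only when skip is true
}

// Hypothetical helper: decides whether the Claude live suite should run.
function claudeLiveGate(binaryAvailable: boolean, env: GateEnv): GateDecision {
  if (!binaryAvailable) {
    return { skip: true, reason: 'claude binary not found in PATH' };
  }
  const hasApiKey = Boolean(env['ANTHROPIC_API_KEY']);
  const hasOauthToken = Boolean(env['CLAUDE_CODE_OAUTH_TOKEN']);
  if (!hasApiKey && !hasOauthToken) {
    return { skip: true, reason: 'neither ANTHROPIC_API_KEY nor CLAUDE_CODE_OAUTH_TOKEN is set' };
  }
  return { skip: false };
}
```

The test module would pass `decision.skip` to `describe.skipIf()` and print `decision.reason` when skipping, so a developer who sees skipped tests can tell which prerequisite was missing.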

Pattern: Structural Assertions Over Content Assertions

Live tests never assert on LLM-generated text content. They assert on:

Event type presence (events.some(e => e.type === 'ready'))
Field existence and type (ReadyEvent.sessionId is a non-empty string)
Numeric invariants (seq monotonicity, usage.inputTokens > 0)
Absence of errors (expect(events.filter(e => e.type === 'error')).toHaveLength(0))

This is a load-bearing constraint. Any test that asserts on specific generated text will be non-deterministic and will eventually flake.

Component Reuse

From Existing Test Infrastructure

Component	Reuse in live tests
`CliProcess` class	Instantiated directly with `'claude'` or `'copilot'` backend — no wrapper needed
`ProcessOptions` interface	Used as-is; live tests set `cwd`, `prompt`, `maxTimeout`, `sessionId`, `isFirstMessage`
`ClaudeEvent` type union	All assertions use the existing event type discriminators
`collect()` pattern	Extracted to shared `collectLive()` helper
vitest `describe.skipIf`	Used for prerequisite gating
`pool: 'forks'`	Reused in live config for process isolation

From Reference Repositories

Pattern	Source	Adaptation
Assertion helpers	Switchyard `tests/utils/assertions.py`	`assertEventStreamStructure()` — same concept, TypeScript implementation
Skip-when-unavailable	Switchyard `conftest.py` markers	`describe.skipIf()` with prerequisites checked at module load
Smoke/comprehensive split	Documentation Robotics `cli-tests.yml`	Fast path (every PR) / Live path (main push only)
Separate CI jobs by speed	Documentation Robotics `build-and-smoke` vs `test-comprehensive`	Unit/typecheck job vs live test job
Event stream capture callbacks	Switchyard `test_claude_code_integration.py`	`collectLive()` drains the async generator identically

CI Architecture

Workflow Design (`.github/workflows/ci.yml`)

Two jobs, one workflow file:

Job 1: test (runs on every PR and push to main)

steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with: { node-version: '20', cache: 'npm' }
  - run: npm ci
  - run: npm run typecheck
  - run: npm test
timeout-minutes: 5

Job 2: test-live (runs on push to main only)

if: github.event_name == 'push' && github.ref == 'refs/heads/main'
needs: test
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with: { node-version: '20', cache: 'npm' }
  - run: npm ci
  - run: npm run test:live
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
timeout-minutes: 10

Design decisions:

test-live depends on test via needs: test. If fast tests fail, live tests don't run — saves API costs on broken builds.
No CLAUDE_CODE_OAUTH_TOKEN in CI initially. OAuth token management is more complex (refresh flows). API key is sufficient for CI smoke testing. OAuth auth path test skips gracefully.
Copilot live tests skip in CI unless copilot binary is installed and authenticated in the runner. This is acceptable — Copilot auth requires an interactive GitHub OAuth device flow that can't be automated in standard GitHub Actions runners. Copilot live tests are primarily for local developer validation.
continue-on-error: false for the live job. If live tests fail on main, it's a signal that a protocol change has occurred and downstream consumers are at risk. This should be a blocking signal, not a soft warning.
No workflow_dispatch trigger for live tests. Adding it later is trivial if needed. The current design keeps the workflow minimal.

Secret Management

Secret	Required by	Fallback
`ANTHROPIC_API_KEY`	Claude live tests	Suite skips; job exits 0
`CLAUDE_CODE_OAUTH_TOKEN`	OAuth-specific test only	Individual test skips
Copilot auth	Copilot live tests	Suite skips

If ANTHROPIC_API_KEY is not configured as a repository secret, the live test job runs but all Claude tests skip and the job succeeds. This follows Documentation Robotics' pattern where test-comprehensive runs but individual tests skip based on availability.

Implementation Plan

Prerequisites

P0: Merge feat/v0.4-copilot-backend to main.
The Copilot backend implementation (ACP protocol support, createCopilotAcpParser(), fake-copilot.mjs, 13 Copilot fake tests) is complete on the feature branch but not yet merged. All Copilot-related work in this plan depends on it. The user has indicated intent to merge — this should happen before implementation begins.

Phase 1: Test Infrastructure (no new tests yet)

1.1. Create vitest.config.live.ts
New vitest config file targeting src/__tests__/live/**/*.live.test.ts with 120s timeout, maxConcurrency: 1, and pool: 'forks'.

1.2. Add test:live npm script
"test:live": "vitest run --config vitest.config.live.ts" in package.json.

1.3. Create src/__tests__/live/ directory
Empty directory (with a placeholder or the first test file from Phase 2).

1.4. Create src/__tests__/helpers/live-helpers.ts
Shared utilities: collectLive(), assertEventStreamStructure(), credential-check helpers, and skipUnlessAvailable() for beforeAll hooks.

Phase 2: Claude Live Tests

2.1. Create src/__tests__/live/claude.live.test.ts
Six test cases as specified in the Component Design section: golden path, ReadyEvent fields, session resume, AbortSignal, OAuth auth path, API key auth path. All gated on binary availability and credential presence.

2.2. Validate locally
Run ANTHROPIC_API_KEY=<key> npm run test:live and confirm all six tests pass (or skip appropriately when credentials are absent).

Phase 3: Copilot Live Tests

3.1. Create src/__tests__/live/copilot.live.test.ts
Five test cases as specified: golden path, ReadyEvent UUID, text content, session resume, AbortSignal. Gated on binary availability and authentication via dry-run handshake.

3.2. Validate locally
Run npm run test:live with copilot installed and authenticated. Confirm tests pass or skip when the binary is absent.

Phase 4: Extended Fake Scenarios

4.1. Add four new scenarios to fake-claude.mjs
session-resume, rate-limit, permission-request, multi-block as specified in the Component Design section.

4.2. Add corresponding tests to CliProcess.test.ts and/or EventParser.test.ts
One test per new scenario, following the existing pattern (set FAKE_SCENARIO, collect events, assert structure).

Phase 5: CI Workflow

5.1. Create .github/workflows/ci.yml
Two-job workflow as specified: test (all PRs, all pushes) and test-live (main pushes only, needs test).

5.2. Configure repository secret
ANTHROPIC_API_KEY must be added to the repository's GitHub Actions secrets. This is a manual step outside the codebase.

Phase 6: Documentation

6.1. Update README.md
Add a "Testing" section covering:

npm test — offline fast tests, no prerequisites
npm run test:live — live smoke tests, prerequisites (binary in PATH, credentials)
Environment variables: ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN
Copilot: copilot CLI installed and authenticated via gh auth login
Behavior when prerequisites are missing (suites skip, don't fail)
Running a specific test: npx vitest run src/__tests__/live/claude.live.test.ts --config vitest.config.live.ts

6.2. Update ROADMAP.md
Mark live integration tests as completed under the appropriate milestone.

Risks and Mitigations

Risk	Likelihood	Impact	Mitigation
Claude CLI `stream-json` format changes silently	Medium	High — parser breaks, downstream consumers affected	Live tests on `main` push detect drift within hours
Copilot ACP protocol changes between versions	Medium	High — ACP is GA but young (Feb 2026)	Live tests catch JSON-RPC contract changes; `RawEvent` fallback preserves unknown events
Live tests flake due to LLM non-determinism	Low	Medium — false CI failures	Structural assertions only; no content matching; no retries (retries mask real issues)
API cost accumulation	Low	Low — trivial prompts cost <$0.05 per run	Only runs on main push; Copilot is subscription-priced
CI runner lacks `copilot` binary	Certain	Low — tests skip gracefully	Copilot live tests designed for local validation; `describe.skipIf` prevents CI failures
Rate limiting during live tests	Low	Medium — test appears to fail	Serial execution (`maxConcurrency: 1`); trivial prompts minimize token usage

Generated by Orchestrator Bot 🤖
Processed by the software_architect agent

2026-06-27T18:51:07Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Work Breakdown Complete — created 7 sub-issue(s):

Phase 1: Test infrastructure — live vitest config and shared helpers #4
Phase 2: Extended fake-claude.mjs scenarios and unit tests #5
Phase 3: Claude live test suite #6
Phase 4: Copilot fake fixture and offline test suite #7
Phase 5: Copilot live test suite #8
Phase 6: CI workflow #9
Phase 7: README test documentation #10

2026-06-27T22:22:48Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

PR Code Reviewer

All five review agents have completed. Let me now compile the findings into the required structured format, filtering to only Critical and High Priority issues.

PR Review Findings

Critical Issues

.repair_cycle.log (5412 lines) committed to repository: A large internal orchestration/CI debug log file is checked into the repo. The .dockerignore already excludes it, confirming it's not intended for version control. .gitignore lacks a *.log entry, which allowed this file through. Must be removed via git rm and *.log added to .gitignore. /workspace/.repair_cycle.log:1 (NEW)
.mcp.json developer config committed to repository: A developer-local MCP server configuration file (pointing to https://mcp.context7.com/mcp) is checked in. The .dockerignore already excludes it. Should be removed via git rm and .mcp.json added to .gitignore. /workspace/.mcp.json:1 (NEW)
Broad catch block masks all exceptions as spawn_error: The catch-all at /workspace/src/process/CliProcess.ts:274-279 converts every exception (including programmer bugs like TypeError, ReferenceError) into a generic spawn_error ErrorEvent. A bug in the event parser would be reported as "spawn_error" — actively misleading. Should use a distinct error code (e.g., internal_error) or re-throw unexpected errors. (NEW)
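
For the third issue, one way to separate genuine spawn failures from programmer errors is to inspect the error before choosing a code. This is a sketch of that idea, not the fix applied in the PR: `classifyRunError` is a hypothetical helper, `internal_error` is the example code suggested above, and the check relies on Node.js attaching a `syscall` property beginning with `spawn` to errors raised by a failed spawn.

```typescript
type ClassifiedCode = 'spawn_error' | 'internal_error';

// Hypothetical helper: only errors that originate from a failed spawn keep the
// spawn_error code; everything else is reported as an internal error.
function classifyRunError(err: unknown): ClassifiedCode {
  if (typeof err === 'object' && err !== null) {
    const syscall = (err as { syscall?: unknown }).syscall;
    if (typeof syscall === 'string' && syscall.startsWith('spawn')) {
      return 'spawn_error';
    }
  }
  return 'internal_error';
}
```

With this split, a TypeError thrown inside the event parser would surface as `internal_error` instead of being mislabeled as a spawn failure.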

High Priority Issues

Copilot ACP parser silently drops unrecognized message structures: At /workspace/src/events/EventParser.ts:300-309, if a parsed JSON-RPC message doesn't match any recognized pattern, the function returns an empty events array — zero events emitted, zero logging. This violates the "no events are ever silently lost" principle explicitly documented in the Claude parser (line 200). Any new ACP message type would be silently discarded. Should add a fallback RawEvent emission matching the Claude parser's approach. (NEW)
Module-level copilot auth check swallows all error types as "not authenticated": At /workspace/src/__tests__/live/copilot.live.test.ts:15-23, the auth probe treats ANY error (crash, timeout, rate limit, parse error) as "not authenticated" and silently skips the entire test suite. No diagnostic output is produced, so developers seeing "6 skipped" cannot distinguish between missing copilot, broken auth, CLI crashes, or rate limits. Should log the actual error events when auth check fails. (NEW)
README documents ProgressEvent as "Defined; not yet emitted" — it is actively emitted: /workspace/README.md:108 states progress events are not yet emitted, but they are emitted at spawn (elapsed: 0) and on every watchdog tick. Multiple tests in this PR verify this behavior. The documentation directly contradicts the implementation. (NEW)
README missing aborted ErrorCode and signal ProcessOption: /workspace/README.md:121 lists 8 ErrorCode values but src/events/types.ts defines 9 — aborted is missing. The signal field is also absent from the ProcessOptions table. Both are key public API features exercised extensively by this PR's tests. (NEW)
README says 'gh copilot' but code spawns 'copilot': /workspace/README.md:5 says "spawn 'claude' or 'gh copilot'" and line 12 says "or 'gh' for Copilot, once implemented." The code at /workspace/src/process/CliProcess.ts:82 spawns a binary named copilot, and Copilot IS implemented. Both the binary name and "once implemented" are stale. (NEW)
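
For the ACP parser finding in this list, a fallback along the following lines would keep unrecognized messages visible. This is a minimal sketch, not the code in EventParser.ts: the event shapes, the `parseAcpMessage` name, the simplified `session/update` payload, and the `acp/unknown` subtype value are all assumptions for illustration; only the `rawSubtype` field and the "emit a RawEvent instead of nothing" idea come from the review.

```typescript
// Assumed event shapes, for illustration only.
interface RawEvent {
  type: 'raw';
  rawSubtype: string;
  payload: unknown;
}
interface TextEvent {
  type: 'text';
  text: string;
}
type ParsedEvent = RawEvent | TextEvent;

// Parse one already-decoded JSON-RPC message into zero or more events,
// falling back to a RawEvent so that nothing is dropped silently.
function parseAcpMessage(msg: Record<string, unknown>): ParsedEvent[] {
  const events: ParsedEvent[] = [];

  // Hypothetical, heavily simplified "recognized" shape.
  const params = msg['params'] as { text?: unknown } | undefined;
  if (msg['method'] === 'session/update' && typeof params?.text === 'string') {
    events.push({ type: 'text', text: params.text });
  }

  // Fallback: any message that matched no known pattern is surfaced as raw.
  if (events.length === 0) {
    events.push({ type: 'raw', rawSubtype: 'acp/unknown', payload: msg });
  }
  return events;
}
```

With this shape, a new ACP message type shows up in the event stream as a raw event that callers can log, rather than vanishing.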

Clean Areas

DistributiveOmit type utility: Correctly solves the TypeScript union-collapse problem with Omit. Well-named following community convention, correctly placed, and properly used. Minor recommendation to add a JSDoc comment.
Event parser (parseCliLine): Thorough handling of all raw event types with explicit fallback to RawEvent so no events are silently lost. Well-structured and accurate JSDoc.
Fake binary test fixtures: Clean scenario-driven design with proper signal handling and unknown-scenario fallbacks. Consistent with existing patterns.
Live test infrastructure: vitest.config.live.ts with proper isolation settings (forks, maxConcurrency: 1, 120s timeout, passWithNoTests). live-helpers.ts with well-factored shared helpers.
CI pipeline: Correctly gates live tests to push-to-main only, properly uses secrets for ANTHROPIC_API_KEY, reasonable timeout settings.
Test isolation: Proper use of pool: 'forks', beforeAll/afterAll PATH management, afterEach env cleanup.
CliProcess exit precedence logic: Clearly documented priority ordering with ACP-specific caveats.

Generated by Orchestrator Bot 🤖
Processed by the pr_code_reviewer agent

0 replies

2026-06-27T22:24:43Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Requirements Verifier

Now I have everything I need to assess the implementation against the acceptance criteria.

Parent Issue Requirements Verification

Gaps Found

Live tests not running on PRs in CI: The acceptance criterion requires "CI workflow updated to run integration tests on PR and main push." The test-live job in .github/workflows/ci.yml includes if: github.event_name == 'push' && github.ref == 'refs/heads/main', which gates it to main-branch pushes only. Pull request runs are excluded entirely, meaning PRs never trigger the live integration suite in CI.

Deviations

None found.

Verified

Integration test runner configured: Vitest is configured with a dedicated vitest.config.live.ts (the Node.js/TypeScript equivalent of pytest).
Tests invoke the CLI binary directly: CliProcess spawns the actual claude and copilot binaries; no internal module mocking occurs in the live suite.
At least one test per CLI command/subcommand: Both supported backends (claude, copilot) have dedicated live test files with multiple tests each (golden path, session resume, AbortSignal, event field validation).
Tests pass against a clean environment with no pre-existing state: Tests use tmpdir() as working directory and gracefully skip entire describe blocks when the CLI binary or credentials are absent.
README updated with local run instructions: The Testing section covers npm run test:live, per-backend prerequisites, graceful skipping behavior, and single-file execution commands.

Generated by Orchestrator Bot 🤖
Processed by the requirements_verifier agent

0 replies

2026-06-27T22:28:21Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Requirements Verifier

Now I have the full picture of the PR. Let me compare each functional requirement against what was implemented.

Business Analyst Output Verification

Gaps Found

FR-4 — fake-copilot.mjs Fixture Entirely Absent: FR-4 specifies a new src/__tests__/fixtures/fake-copilot.mjs fixture covering five scenarios (golden-path, stall, ignore-sigterm, nonzero-exit, permission-request), consuming ACP JSON-RPC from stdin and producing ACP-compliant NDJSON on stdout. This file does not appear in the PR at all. The PR does add src/__tests__/CliProcess.copilot.test.ts (FR-5) which injects fake-copilot.mjs into PATH in beforeAll, but the fixture binary it depends on is never created — making the entire offline Copilot test suite non-functional as committed.
FR-3 — Per-Test Credential Stripping Not Implemented: FR-3 explicitly requires "Delete the unused credential before each test (OAuth preferred over API key, mirroring CliProcess.run() behavior)". The live-helpers.ts helper getLiveCredentials() checks whether credentials exist, and the auth-specific tests are gated via skipIf, but there is no beforeEach (or equivalent) that removes the non-preferred credential from the subprocess environment before spawning. This means when both ANTHROPIC_API_KEY and CLAUDE_CODE_OAUTH_TOKEN are present in CI, the golden-path and session-resume tests run with both credentials in the subprocess environment, leaving the credential-selection logic untested and the behavior unspecified per the spec.
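
The FR-3 gap above could be closed with a small helper that removes the non-preferred credential from the environment handed to the subprocess. The sketch below assumes the OAuth-over-API-key preference described in the finding; the `stripUnusedCredential` name and the plain-object `Env` type are hypothetical and not part of live-helpers.ts.

```typescript
// Plain-object stand-in for process.env so the sketch needs no Node typings.
type Env = Record<string, string | undefined>;

// Return a copy of `env` with at most one Claude credential left in it.
// OAuth is preferred over the API key, per the behavior FR-3 describes.
function stripUnusedCredential(env: Env): Env {
  const result: Env = { ...env };
  if (result['CLAUDE_CODE_OAUTH_TOKEN']) {
    // OAuth token present: drop the API key so only one credential is visible.
    delete result['ANTHROPIC_API_KEY'];
  }
  return result;
}
```

A beforeEach hook could apply this to the environment before each spawn and restore the original values in afterEach, so that runs where both secrets are set still exercise a single, known credential path.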

Deviations

None found.

Verified

FR-1 (Live Test Runner): vitest.config.live.ts includes src/__tests__/live/**/*.live.test.ts, maxConcurrency: 1, testTimeout: 120_000 (≥ 60 s minimum), passWithNoTests: true; vitest.config.ts excludes the live directory; package.json adds test:live script. All criteria met.
FR-2 (Claude Fake Fixture Extensions): fake-claude.mjs adds session-resume (reads --resume arg, echoes session ID), rate-limit (emits rate_limit_event on stdout, exits 1), permission-request (emits server_tool_use content block), and multi-block (text + tool_use array). Tests for these scenarios, including unknown-scenario exit-1 behavior, are present in CliProcess.test.ts.
FR-3 (Claude Live Test Suite — structure): claude.live.test.ts skips suite when binary unavailable or credentials absent; covers golden path (usage tokens, ReadyEvent fields), seq monotonicity, sessionId round-trip, session resume, OAuth-only and API-key-only paths with skipIf, and AbortSignal mid-run producing error:aborted.
FR-5 (Copilot Fake Test Suite — structure): CliProcess.copilot.test.ts mirrors Claude fake suite structure, covering golden path, isAvailable, nonzero exit, idle/max timeout, SIGKILL escalation, pre-flight abort, mid-run abort, and permission request. (Non-functional until FR-4 is resolved.)
FR-6 (Copilot Live Test Suite): copilot.live.test.ts skips when binary unavailable or unauthenticated; covers golden path, UUID session ID, text content, session resume, and AbortSignal.
FR-7 (CI Workflow): .github/workflows/ci.yml triggers on pull_request and push to main; fast job (5-minute timeout, Node 20, npm ci, typecheck, npm test) runs on all triggers; live job (10-minute timeout) runs only on push to main with ANTHROPIC_API_KEY from secrets.
FR-8 (README): Testing section documents npm test, npm run test:live, required environment variables for both backends, graceful-skip behavior, and single-file isolation via vitest filter flag.

Generated by Orchestrator Bot 🤖
Processed by the requirements_verifier agent

0 replies

2026-06-27T22:32:15Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Requirements Verifier

Now I have everything I need for a complete analysis. Let me compile the findings.

Software Architect Output Verification

Gaps Found

None found.

Deviations

None found.

Verified

Dual-Track Test Architecture: Both npm test (fixture-based, src/**/*.test.ts) and npm run test:live (live, src/__tests__/live/**/*.live.test.ts) are implemented with independent vitest configs.
vitest.config.live.ts: All specified properties present — include, pool: 'forks', testTimeout: 120_000, maxConcurrency: 1, retry: 0. Also adds passWithNoTests: true (harmless CI convenience, not a deviation).
vitest.config.ts (existing): Correctly excludes live tests. The spec claimed the include: ['src/**/*.test.ts'] pattern was sufficient, but *.live.test.ts does match *.test.ts in glob semantics. The implementation correctly adds exclude: ['src/__tests__/live/**'] — a sound correction to a spec oversight.
npm scripts: test, test:watch, and test:live all present as specified.
live-helpers.ts: All three specified exports implemented — collectLive(backend, options), assertEventStreamStructure(events), and getLiveCredentials(backend). assertEventStreamStructure body matches the spec exactly (ready/text/done presence, zero error events, seq monotonicity, sessionId consistency between ReadyEvent and DoneEvent).
Claude live test suite (claude.live.test.ts): All 6 specified test cases present — golden path (with usage.inputTokens/outputTokens > 0), ReadyEvent fields (sessionId + model), session resume (sessionId from first DoneEvent → second ReadyEvent), AbortSignal mid-run (abort after ReadyEvent → final ErrorEvent { code: 'aborted' }), OAuth auth path (it.skipIf(!CLAUDE_CODE_OAUTH_TOKEN)), API key auth path (it.skipIf(!ANTHROPIC_API_KEY)). Gating via describe.skipIf(!claudeAvailable || !hasCredentials) as specified. All tests use maxTimeout: 60.
Copilot live test suite (copilot.live.test.ts): All 5 specified test cases present — golden path, ReadyEvent session UUID (UUID regex validation), text content emitted (non-empty TextEvents), session resume, AbortSignal mid-run. Copilot authentication verified via dry-run invocation (30s timeout) before the suite runs, exactly as specified. Gating via describe.skipIf(!isAvailable || !isAuthenticated).
fake-claude.mjs extended scenarios: All 4 new scenarios implemented — session-resume (accepts --resume <id>, echoes session ID in init), rate-limit (emits rate_limit_event on stdout), permission-request (emits server_tool_use block inside assistant event), multi-block (single assistant event with [text_block, tool_use_block]).
Fast-track tests for new scenarios: CliProcess.test.ts contains dedicated test cases exercising all 4 new fake-claude scenarios — session resume, rate limit → ErrorEvent { code: 'rate_limit' }, permission request → RawEvent { rawSubtype: 'server_tool_use' }, and multi-block → TextEvent + ToolUseEvent.
Test file layout: All specified files present under src/__tests__/ — helpers/live-helpers.ts, live/claude.live.test.ts, live/copilot.live.test.ts.
Serial execution: maxConcurrency: 1 in live config as specified.
Skip-when-unavailable pattern: Both suites use describe.skipIf() and gracefully skip when prerequisites are unmet, consistent with the specified pattern.
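
The checks attributed to assertEventStreamStructure in the list above (ready/text/done presence, zero error events, seq monotonicity, sessionId consistency) can be sketched as follows. The event shapes and the `type` discriminator are assumptions, and "monotonic" is read here as strictly increasing; the helper in live-helpers.ts may differ on both points.

```typescript
// Hypothetical event shapes: `seq`, `sessionId`, and the four event kinds come
// from the thread; the field layout is assumed for this example.
type StreamEvent =
  | { type: 'ready'; seq: number; sessionId: string }
  | { type: 'text'; seq: number; text: string }
  | { type: 'done'; seq: number; sessionId: string }
  | { type: 'error'; seq: number; code: string };

// Throw if the collected stream violates any of the structural invariants.
function assertEventStreamStructure(events: StreamEvent[]): void {
  let ready: { sessionId: string } | undefined;
  let done: { sessionId: string } | undefined;
  let sawText = false;
  let lastSeq = Number.NEGATIVE_INFINITY;

  for (const e of events) {
    // seq must strictly increase across the whole stream (assumption).
    if (e.seq <= lastSeq) {
      throw new Error(`seq did not increase: ${lastSeq} then ${e.seq}`);
    }
    lastSeq = e.seq;

    switch (e.type) {
      case 'ready':
        ready = e;
        break;
      case 'done':
        done = e;
        break;
      case 'text':
        sawText = true;
        break;
      case 'error':
        throw new Error(`unexpected error event: ${e.code}`);
    }
  }

  if (!ready || !sawText || !done) {
    throw new Error('stream is missing a ready, text, or done event');
  }
  if (ready.sessionId !== done.sessionId) {
    throw new Error('sessionId differs between ReadyEvent and DoneEvent');
  }
}
```

Throwing plain errors keeps the helper independent of the test runner; inside vitest the same checks could be written with expect.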

Generated by Orchestrator Bot 🤖
Processed by the requirements_verifier agent

0 replies

2026-06-27T22:33:25Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

PR Code Reviewer

Groups:

Item:
- Name: Repository Hygiene
- Severity: critical
- Findings:
  - .repair_cycle.log (5412 lines) committed to repository: A large internal orchestration/CI debug log is checked into version control. .dockerignore already excludes it, confirming it is not intended for the repo. .gitignore lacks a *.log entry, which allowed it through. Must be removed via git rm and *.log added to .gitignore. /workspace/.repair_cycle.log:1
  - .mcp.json developer config committed to repository: A developer-local MCP server configuration file (pointing to https://mcp.context7.com/mcp) is checked in. .dockerignore already excludes it. Must be removed via git rm and .mcp.json added to .gitignore. /workspace/.mcp.json:1
Item:
- Name: Error Handling and Event Parsing
- Severity: critical
- Findings:
  - Broad catch block masks all exceptions as spawn_error: The catch-all at /workspace/src/process/CliProcess.ts:274-279 converts every exception (including programmer bugs like TypeError and ReferenceError) into a generic spawn_error ErrorEvent. A bug in the event parser would be reported as "spawn_error", which is actively misleading. Should use a distinct error code (e.g., internal_error) or re-throw unexpected errors.
  - Copilot ACP parser silently drops unrecognized message structures: At /workspace/src/events/EventParser.ts:300-309, if a parsed JSON-RPC message doesn't match any recognized pattern, the function returns an empty events array: zero events emitted, zero logging. This violates the "no events are ever silently lost" principle explicitly documented in the Claude parser (line 200). Should add a fallback RawEvent emission matching the Claude parser's approach.
Item:
- Name: Copilot Test Infrastructure
- Severity: critical
- Findings:
  - fake-copilot.mjs fixture entirely absent (FR-4): The spec requires src/__tests__/fixtures/fake-copilot.mjs covering five scenarios (golden-path, stall, ignore-sigterm, nonzero-exit, permission-request), consuming ACP JSON-RPC from stdin and producing ACP-compliant NDJSON on stdout. This file does not appear in the PR. CliProcess.copilot.test.ts (FR-5) injects fake-copilot.mjs into PATH in beforeAll, but the fixture it depends on is never created, making the entire offline Copilot test suite non-functional as committed.
  - Module-level copilot auth check swallows all error types as "not authenticated": At /workspace/src/__tests__/live/copilot.live.test.ts:15-23, the auth probe treats ANY error (crash, timeout, rate limit, parse error) as "not authenticated" and silently skips the entire test suite. Developers seeing "6 skipped" cannot distinguish missing copilot from broken auth, CLI crashes, or rate limits. Should log the actual error when the auth check fails.
Item:
- Name: Live Test Coverage
- Severity: high
- Findings:
  - Per-test credential stripping not implemented (FR-3): FR-3 explicitly requires "Delete the unused credential before each test (OAuth preferred over API key)." live-helpers.ts checks whether credentials exist via getLiveCredentials(), but there is no beforeEach that removes the non-preferred credential from the subprocess environment. When both ANTHROPIC_API_KEY and CLAUDE_CODE_OAUTH_TOKEN are present in CI, golden-path and session-resume tests run with both credentials, leaving credential-selection logic untested.
  - CI live tests gated to main-push only, PRs excluded: The acceptance criterion requires "CI workflow updated to run integration tests on PR and main push." The test-live job in .github/workflows/ci.yml includes if: github.event_name == 'push' && github.ref == 'refs/heads/main', which excludes pull request triggers entirely. PRs never run the live integration suite.
Item:
- Name: README Documentation
- Severity: high
- Findings:
  - ProgressEvent documented as "not yet emitted" but it is actively emitted: /workspace/README.md:108 states progress events are not yet emitted, but they are emitted at spawn (elapsed: 0) and on every watchdog tick. Multiple tests in this PR verify this behavior. The documentation directly contradicts the implementation.
  - Missing aborted ErrorCode and signal ProcessOption: /workspace/README.md:121 lists 8 ErrorCode values but src/events/types.ts defines 9; aborted is missing. The signal field is also absent from the ProcessOptions table. Both are key public API features exercised extensively by this PR's tests.
  - README says 'gh copilot' but code spawns 'copilot': /workspace/README.md:5 says "spawn 'claude' or 'gh copilot'" and line 12 says "or 'gh' for Copilot, once implemented." The code at /workspace/src/process/CliProcess.ts:82 spawns a binary named copilot, and Copilot IS implemented. Both the binary name and "once implemented" are stale.
Filtered Out:
- Clean code review areas (DistributiveOmit type, Claude event parser, fake-claude fixtures, live test infra, CI pipeline structure, test isolation, exit precedence logic): no issues found, positive observations only
- Software Architect verification: all architectural requirements verified with no gaps or deviations
- Business Analyst verified items (FR-1 test runner, FR-2 fake-claude extensions, FR-5/FR-6 copilot test structure, FR-7 CI triggers, FR-8 README testing section): all confirmed implemented correctly
- JSDoc comment suggestion on DistributiveOmit: filtered as style preference, not a real gap or bug

Generated by Orchestrator Bot 🤖
Processed by the pr_code_reviewer agent

0 replies

2026-06-27T23:23:02Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

PR Code Reviewer

Now I have all the data I need. Let me compile the final review findings. The prior cycle #13 addressed "Error Handling and Event Parsing" and #14 addressed "Copilot Test Infrastructure" — the fixes for those are visible in the current code (the acp/unknown fallback and internal_error code were the fixes). The issues I'm seeing now are genuinely new findings that weren't surfaced in prior cycles.

PR Review Findings

Critical Issues

None found

High Priority Issues

[ACP Handshake Sends session/new Unconditionally on Resume]: The ACP handshake in /workspace/src/process/CliProcess.ts:96-98 always sends initialize, session/new, and session/prompt JSON-RPC messages regardless of whether this is a new session or a resume. Meanwhile, buildCopilotArgs() at line 342 correctly appends --resume=<sessionId> for resumed sessions. This creates a contradictory state: the CLI args say "resume" but the stdin protocol says "create new session." The live copilot session-resume test passes today (likely because the --resume flag takes precedence), but this is fragile: a future ACP version could honor session/new over --resume, silently discarding conversation history. The handshake should send session/new only for new sessions and skip it when options.sessionId is set and options.isFirstMessage is false. (NEW)
[ACP Parser Does Not Distinguish Malformed JSON from Plaintext]: In /workspace/src/events/EventParser.ts:239-242, the ACP parser's JSON.parse catch block emits a TextEvent for all unparseable lines, including lines that look like truncated JSON (starting with {). In contrast, the Claude parser at lines 67-74 correctly distinguishes these — JSON-like lines that fail to parse become ErrorEvent { code: 'parse_error' } while other lines become TextEvent. This inconsistency means a corrupted or truncated JSON-RPC message from Copilot would be silently downgraded to innocuous text content, making protocol corruption issues significantly harder to debug. The same line.trimStart().startsWith('{') check should be applied to the ACP parser's catch block. (NEW)
[Copilot Auth Probe Treats All Errors as Auth Failures]: In /workspace/src/__tests__/live/copilot.live.test.ts:22-29, the module-level auth probe filters for any ErrorEvent and sets isAuthenticated = false if any are found, causing the entire test suite to be skipped. Non-auth errors like idle_timeout (the probe uses maxTimeout: 30 which is tight), rate_limit, nonzero_exit (CLI crash), or internal_error would all incorrectly trigger the "auth check failed" warning and silently skip the suite. Developers would waste time investigating credentials when the real issue is a timeout or rate limit. The probe should either filter for auth-specific error codes (spawn_error, cli_error) or at minimum differentiate auth vs non-auth errors in the warning message. (NEW)
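
One way to act on the auth-probe finding above is to classify the probe's events into three outcomes instead of two, so that timeouts and rate limits are not reported as credential problems. This is a sketch under assumptions: the `ProbeEvent` shape, the `classifyProbe` name, and the three outcome labels are hypothetical, and treating spawn_error and cli_error as the auth-specific codes simply follows the suggestion in the finding.

```typescript
type ProbeOutcome = 'authenticated' | 'auth_failed' | 'inconclusive';

// Assumed minimal event shape: only `type` and an optional error `code`.
interface ProbeEvent {
  type: string;
  code?: string;
}

// Codes the review suggests treating as auth-specific.
const AUTH_ERROR_CODES: ReadonlySet<string> = new Set(['spawn_error', 'cli_error']);

// Decide what the pre-suite probe actually tells us.
function classifyProbe(events: ProbeEvent[]): ProbeOutcome {
  const errors = events.filter((e) => e.type === 'error');
  if (errors.length === 0) return 'authenticated';

  // Only call it an auth failure when every error is an auth-specific code.
  const allAuth = errors.every(
    (e) => e.code !== undefined && AUTH_ERROR_CODES.has(e.code),
  );
  return allAuth ? 'auth_failed' : 'inconclusive';
}
```

On an inconclusive result the suite could still be skipped, but with a warning that names the observed error codes, which addresses the "wasted time investigating credentials" concern.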

Clean Areas

Type design: The DistributiveOmit<T, K> utility type is the canonical, correct solution for preserving discriminated union structure when omitting shared fields. The internal_error addition to ErrorCode fills a genuine gap in error categorization with clear documentation.
Event parser fundamentals: The Claude parseCliLine function handles all event types comprehensively with no silent data loss. The ACP acp/unknown fallback ensures zero-loss for unrecognized structures.
Test structure: The three-tier test approach (EventParser unit tests, CliProcess integration tests with fake binaries, live tests) is well-designed. Test isolation (pool: 'forks'), environment cleanup (afterEach/afterAll), and graceful skip guards (describe.skipIf) are properly implemented.
CliProcess async queue: The event queue implementation is correctly designed — the array buffer prevents event loss even when pushEvent is called before the consumer loop starts.
Credential handling logic: The OAuth-over-API-key preference with mutual exclusion is sound design.
Fake binary fixtures: Both fake-claude.mjs and fake-copilot.mjs emit realistic protocol-accurate event structures with proper scenario coverage.
README documentation: Accurate, comprehensive, and well-structured with correct examples for all testing modes.

Generated by Orchestrator Bot 🤖
Processed by the pr_code_reviewer agent

0 replies

2026-06-27T23:25:37Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Requirements Verifier

Parent Issue Requirements Verification

Gaps Found

None found.

Deviations

None found.

Verified

Integration test runner configured — Vitest is configured with a dedicated vitest.config.live.ts (120s timeout, fork pool, passWithNoTests: true) and a test:live npm script in package.json.
Tests invoke the CLI binary directly, not internal Python modules — This is a TypeScript/Node.js project with no Python. Tests exercise CliProcess which spawns the real claude/copilot binaries. No mocks or test doubles stand in for the process.
At least one test per CLI command/subcommand — Both supported backends are covered. claude.live.test.ts has 6 tests (golden path, ReadyEvent fields, session resume, AbortSignal, OAuth auth, API key auth). copilot.live.test.ts has 5 tests (golden path, ReadyEvent session UUID, text content, session resume, AbortSignal).
Tests pass against a clean environment with no pre-existing state — Tests use tmpdir() as cwd and describe.skipIf to skip entire suites when binaries or credentials are absent, rather than failing.
CI workflow updated to run integration tests on PR and main push — .github/workflows/ci.yml adds a test-live job with condition (github.event_name == 'push' && github.ref == 'refs/heads/main') || github.event_name == 'pull_request', running after the fast suite passes.
README updated with instructions for running integration tests locally — README.md has a "Live suite" section documenting npm run test:live, prerequisites for both Claude and Copilot, graceful-skipping behavior, and how to run a single test file.

Generated by Orchestrator Bot 🤖
Processed by the requirements_verifier agent

0 replies

2026-06-27T23:27:33Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Requirements Verifier

Now I have enough information to compare the implementation against the functional requirements.

Business Analyst Output Verification

Gaps Found

FR-7: CI Job Timeouts Missing: The spec explicitly requires "Sets a 5-minute job timeout for the fast path and a 10-minute job timeout for the live path." The diff description of .github/workflows/ci.yml describes the two-job pipeline without any mention of job-level timeout-minutes configuration. This is a concrete, explicitly specified requirement that appears absent.
FR-7: Node Modules Caching Not Mentioned: FR-7 explicitly requires "Runs npm ci and caches node_modules." The diff summary mentions npm ci being run but does not indicate any caching step (e.g., actions/cache) is present in the workflow.

Deviations

FR-7: Live Test Trigger Condition: The spec states npm run test:live should run "on push to main only, injecting ANTHROPIC_API_KEY from repository secrets." The diff explicitly states the test-live job "runs npm run test:live on all PRs and main-branch pushes." Running live tests (and injecting API key secrets) on every pull request deviates from the specified trigger condition, introduces unnecessary API usage on PRs, and is a security concern for fork-sourced PRs.

Verified

FR-1 — vitest.config.live.ts targets src/__tests__/live/**/*.live.test.ts, sets maxConcurrency: 1, 120s timeout (≥60s minimum), passWithNoTests: true; vitest.config.ts excludes the live/ directory; package.json adds test:live script.
FR-2 — fake-claude.mjs extended with all four required scenarios: session-resume (with --resume flag and session ID echo), rate-limit (inline rate_limit_event), permission-request (server_tool_use block), and multi-block (mixed text + tool_use content array).
FR-3 — claude.live.test.ts skips on missing binary or credentials; covers golden path, seq monotonicity, ReadyEvent/DoneEvent sessionId match, usage token counts, session resume, OAuth path, API key path, and AbortSignal mid-run cancellation.
FR-4 — fake-copilot.mjs created with all five ACP scenarios: golden-path, stall, ignore-sigterm, nonzero-exit, permission-request; reads ACP JSON-RPC from stdin, responds with NDJSON on stdout.
FR-5 — CliProcess.copilot.test.ts mirrors CliProcess.test.ts structure, injects fake-copilot.mjs into PATH, covers golden path, timeouts, AbortSignal, SIGTERM/SIGKILL, nonzero exit, permission scenarios, and asserts normalized ReadyEvent/TextEvent/DoneEvent field contracts.
FR-6 — copilot.live.test.ts skips when binary unavailable or unauthenticated; covers golden path, seq monotonicity, sessionId continuity, session resume, and AbortSignal.
FR-8 — README updated with a Testing section documenting npm test, npm run test:live, required environment variables, graceful skip behavior, and isolation with vitest filter flags.

Generated by Orchestrator Bot 🤖
Processed by the requirements_verifier agent

0 replies

2026-06-27T23:31:07Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

Requirements Verifier

Now I have a complete picture. Let me compile the verification report.

Software Architect Output Verification

Gaps Found

None found.

Deviations

None found.

Verified

Dual-track test architecture: npm test targets *.test.ts (offline, no credentials); npm run test:live targets *.live.test.ts via separate config. Because *.live.test.ts files also match the *.test.ts glob, the include patterns alone do not separate the two tracks; the main config's explicit exclude: ['src/__tests__/live/**'] is what keeps live tests out of npm test.
File layout: All specified paths are present — src/__tests__/helpers/live-helpers.ts, src/__tests__/live/claude.live.test.ts, src/__tests__/live/copilot.live.test.ts, fixtures/fake-claude.mjs, fixtures/fake-copilot.mjs.
vitest.config.live.ts: All required fields match exactly — include, pool: 'forks', testTimeout: 120_000, maxConcurrency: 1, retry: 0.
npm scripts: test, test:watch, and test:live match the spec verbatim.
live-helpers.ts: Exports collectLive(backend, options), assertEventStreamStructure(events), and credential/availability gating via getLiveCredentials(). The assertEventStreamStructure body is an exact match to the spec (ready/text/done presence, no error events, seq monotonicity, sessionId consistency).
Claude live test cases (all 6): golden path with usage assertions, ReadyEvent field validation, session resume round-trip (first run defaults isFirstMessage to true per CliProcess.buildArgs(), second explicitly passes isFirstMessage: false and sessionId from first run's DoneEvent), AbortSignal mid-run yielding ErrorEvent { code: 'aborted' }, OAuth auth path (it.skipIf(!CLAUDE_CODE_OAUTH_TOKEN)), API key auth path (it.skipIf(!initialApiKey)).
Copilot live test cases (all 5): golden path, ReadyEvent session UUID format (UUID regex), text content emitted, session resume round-trip, AbortSignal mid-run.
Copilot authentication gating: Pre-suite probe run with maxTimeout: 30 checks for error events; skips suite if any are found. Functionally equivalent to the spec's "confirmed if ReadyEvent emitted."
Skip-when-unavailable gating: Both suites use describe.skipIf(!isAvailable || !hasCredentials) at module level — vitest's describe.skipIf is exactly the specified mechanism, driven by top-level await rather than beforeAll, which is functionally equivalent.
Extended fake-claude.mjs scenarios (all 4): session-resume (reads --resume <id> from process.argv), rate-limit (emits rate_limit_event on stdout), permission-request (emits server_tool_use block in assistant event), multi-block (emits [text, tool_use] in single assistant message).
Fast-track tests for extended scenarios: All four are covered in CliProcess.test.ts under describe('extended scenarios') — validating the isFirstMessage: false code path, rate_limit ErrorEvent, RawEvent { rawSubtype: 'server_tool_use' }, and dual TextEvent + ToolUseEvent emission from one line.
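
To make the include/exclude relationship in the dual-track item concrete, the sketch below checks a live test path against the three glob patterns quoted in this thread. The `globToRegExp` helper is a deliberately tiny, hypothetical matcher that only understands `*` and `**`; it is not the matcher vitest uses and is only meant to show why an explicit exclude is needed.

```typescript
// Minimal glob-to-RegExp conversion supporting only `*`, `**`, and `**/`.
// Illustrative only; real glob libraries handle many more cases.
function globToRegExp(glob: string): RegExp {
  const pattern = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .split('**/').join('\u0000')          // placeholder: zero or more directories
    .split('**').join('\u0001')           // placeholder: anything, including slashes
    .split('*').join('[^/]*')             // single star: anything within one segment
    .split('\u0000').join('(?:.*/)?')
    .split('\u0001').join('.*');
  return new RegExp(`^${pattern}$`);
}

const liveTest = 'src/__tests__/live/claude.live.test.ts';
const unitTest = 'src/__tests__/CliProcess.test.ts';

const mainInclude = globToRegExp('src/**/*.test.ts');
const liveInclude = globToRegExp('src/__tests__/live/**/*.live.test.ts');
const mainExclude = globToRegExp('src/__tests__/live/**');

// The live test matches BOTH include patterns, so include alone cannot separate the tracks.
const liveMatchesMain = mainInclude.test(liveTest);
const liveMatchesLive = liveInclude.test(liveTest);

// The explicit exclude is what removes it from the main run.
const liveInMainRun = liveMatchesMain && !mainExclude.test(liveTest);
const unitInMainRun = mainInclude.test(unitTest) && !mainExclude.test(unitTest);
```

Under these assumptions the live file matches both include patterns, and only the exclude keeps it out of the offline run while the ordinary unit test stays in.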

Generated by Orchestrator Bot 🤖
Processed by the requirements_verifier agent

0 replies

2026-06-27T23:32:32Z

orchestrator-bot[bot]
Bot Jun 27, 2026
Author

PR Code Reviewer

Verification complete. Two BA findings are false positives — the CI file has both timeout-minutes (lines 12, 27) and cache: npm (lines 19, 40). The remaining four findings are confirmed.

{
  "groups": [
    {
      "name": "ACP Protocol Handling",
      "severity": "high",
      "findings": "- ACP Handshake Sends session/new Unconditionally on Resume: src/process/CliProcess.ts:96-98 always sends initialize, session/new, and session/prompt JSON-RPC messages regardless of whether this is a new session or a resume. Meanwhile buildCopilotArgs() at line 342 correctly appends --resume=<sessionId> for resumed sessions, creating a contradictory state. The handshake should conditionally skip session/new when options.sessionId is set and options.isFirstMessage is false.\n- ACP Parser Does Not Distinguish Malformed JSON from Plaintext: src/events/EventParser.ts:239-242 — the ACP parser's JSON.parse catch block emits a TextEvent for all unparseable lines, including lines that look like truncated JSON (starting with {). The Claude parser at lines 67-74 correctly distinguishes these by emitting ErrorEvent { code: 'parse_error' } for JSON-like lines. The same line.trimStart().startsWith('{') check should be applied to the ACP parser's catch block."
    },
    {
      "name": "CI/CD Pipeline",
      "severity": "high",
      "findings": "- Live Test Trigger Condition Deviates from Spec: .github/workflows/ci.yml:29 — the test-live job runs on all PRs and main-branch pushes via (github.event_name == 'push' && github.ref == 'refs/heads/main') || github.event_name == 'pull_request'. The spec (FR-7) requires live tests run on push to main only. Running on every PR introduces unnecessary API usage; while GitHub Actions does not expose secrets to fork PRs (mitigating the security concern), the pull_request trigger should be removed to match the spec."
    },
    {
      "name": "Test Infrastructure",
      "severity": "high",
      "findings": "- Copilot Auth Probe Treats All Errors as Auth Failures: src/__tests__/live/copilot.live.test.ts:22-29 — the module-level auth probe filters for any ErrorEvent and sets isAuthenticated = false, causing the entire suite to be skipped. Non-auth errors like idle_timeout (the probe uses maxTimeout: 30 which is tight), rate_limit, or nonzero_exit would all incorrectly trigger the auth-failure skip path. The probe should filter for auth-specific error codes (e.g., spawn_error, cli_error) or at minimum differentiate auth vs non-auth errors in the warning message."
    }
  ],
  "filtered_out": [
    "FR-7 CI Job Timeouts Missing — false positive: .github/workflows/ci.yml has timeout-minutes: 5 (line 12) and timeout-minutes: 10 (line 27), matching the spec exactly.",
    "FR-7 Node Modules Caching Not Mentioned — false positive: both jobs use actions/setup-node@v4 with cache: npm (lines 19 and 40), which is the standard npm dependency caching mechanism.",
    "All medium/low severity findings excluded per filtering criteria.",
    "Positive/clean-area assessments from all phases excluded (not actionable issues)."
  ]
}

Generated by Orchestrator Bot 🤖
Processed by the pr_code_reviewer agent

0 replies

Requirements: Add full suite of live integration tests for CLI validation #3

Uh oh!

orchestrator-bot[bot] Bot Jun 27, 2026

Requirements Analysis

User Request

Overview

Goals

Acceptance Criteria

Replies: 14 comments · 4 replies
