[pull] main from CopilotKit:main#342
Merged
Add written_by/state_written_at columns (PB migrations) and stamp every status write with a stable host-derived writer identity. The status writer now detects cross-writer state flips and foreign writes (the anti-dual-writer flap-comb defense), normalizes observedAt to PB-safe RFC-3339 shapes before date-field writes, and classifies writer errors honestly (401 auth vs 403 permission split, new pb_not_found reason).
Pin the writer-identity stamping, flip/foreign-write warn paths (TTL re-warns, self-write memory cap eviction), date normalization shapes, overlay write outcomes, idempotency latches, and error classification against fail-loud fake-PB fixtures hardened against silent divergence from real PocketBase behavior.
Normalize padded projected keys to trimmed canonical form, skip blank keys loudly, and replace ambiguous outcomes with explicit discriminators: honest outage-skip/empty-projection semantics, droppedCommError surfaced whenever the comm error misses the aggregate row, trusted-negative duplicate preference scoped to cell-vs-cell (no aggregate-row impersonation), and comm-error identity asserts converted to discriminators so the consumer cannot hot-loop.
Persist thrown driver errors to PB in the runDriverInputs catch, carry WriteOutcome discriminators through the CLI summary (dropped write counts with correct pluralization, no duplicate cause clauses), and pin the runner/results behavior with StatusWriter contract-typed stubs so writer contract drift is compile-checked at the stub site.
…nment
Wire the legacy monolith scheduler's status writer with an explicit writtenBy:"legacy" identity, correct the dual-writer dedupe comments to describe the real upsert-collapse (and why it is not safe), stamp the alert engine's synthesized cron outcome persisted:false honestly, and align alert/orchestrator/probe test fixtures with the real StatusWriter and OverlayWriteOutcome contracts.
…rects (SU-13)
Carry backendHostPattern + docsHost in the shell runtime config (no longer baked from registry.json at Docker build time). Derive demo backend URLs at runtime from the pattern; issue docs-host 301s from middleware with a runtime DOCS_HOST so a misconfigured value can no longer 500 every docs route. Validate NEXT_PUBLIC_LOCAL_BACKENDS and empty overrides; guard the backend host pattern against silent env misconfigs. Reword the stale demo-page comment about backend URL derivation. Pin the registry slug set and SSR placeholder URL composition; fix env/spy/global leaks in runtime-config test cleanup.
Squash of the initial runtime-URL refactor cluster:
- feat(showcase): carry backendHostPattern + docsHost in shell runtime config
- fix(showcase): derive demo backend URLs at runtime instead of baked registry values
- fix(showcase): issue docs-host 301s from middleware with runtime DOCS_HOST
- fix(showcase): never let a misconfigured DOCS_HOST 500 every docs route
- fix(showcase): guard backend host pattern against silent env misconfigs
- fix(showcase): validate NEXT_PUBLIC_LOCAL_BACKENDS values and empty overrides
- docs(showcase): reword stale demo-page comment about backend URL derivation
- test(showcase): fix env/spy/global leaks in runtime-config test cleanup
- test(showcase): pin registry slug set and SSR placeholder URL composition
…g (SU-2/8/11/14/15/16/17/18/19/20)
Resolve SEO redirect destinations against the docs host (SU-17); forward the query string on SEO redirects (SU-16); match bare paths on wildcard SEO sources (SU-19). Collapse duplicate slashes in docs-host redirect destinations (SU-13). Regression test for /shared//evil.com open redirect (SU-18). Emit 308 for docs-host redirects to match next.config parity (SU-2). Add a path boundary to the api matcher exclusion (SU-15). Loud guard when registry yields zero framework slugs (SU-20). Keep PostHog capture alive via event.waitUntil (SU-14). Note docs-host redirects are untracked by design (SU-8). Cover docs-host redirects at the middleware level (SU-11).
Squash of the SEO-table + matcher-hardening cluster.
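Why a path like /shared//evil.com is dangerous can be sketched as follows. This is a minimal illustration, assuming a hypothetical wildcard rule that strips a `/shared` prefix; `naiveDestination` and `collapseSlashes` are illustrative names, not the shell's actual redirect builder.

```typescript
// ASSUMPTION: a hypothetical wildcard rule "/shared/:path*" -> "/:path*".
function naiveDestination(pathname: string): string {
  // Keeps everything after "/shared", including a leading "//".
  return pathname.slice("/shared".length);
}

// Collapse any run of two or more slashes into a single slash.
function collapseSlashes(path: string): string {
  return path.replace(/\/{2,}/g, "/");
}

// "//evil.com" is a protocol-relative reference: a browser following it as a
// Location header leaves the site for the host evil.com.
const unsafe = naiveDestination("/shared//evil.com");

// After collapsing, the destination is an ordinary same-origin path.
const safe = collapseSlashes(unsafe);
```

Collapsing the slashes keeps the redirect on the same origin; rejecting such a path outright would be an equally valid policy.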
…rect builder safety (SU2-A/B + CR2-C)
SU2-B series — runtime-config / backend-url env robustness:
- Correct the Edge-safety story in runtime-config (SU2-B1)
- Stop per-request FATAL-CONFIG spam for unset BASE_URL (SU2-B2)
- Prepend https:// to a scheme-less POSTHOG_HOST (SU2-B3)
- Trim whitespace paste artifacts in env values and host patterns (SU2-B4)
- Memoize parseLocalBackends and warn once per value (SU2-B5)
- Make {slug} substitution immune to $-patterns (SU2-B6)
- Harden the client runtime-config reader (SU2-B7)
- runtime-config hardening batch (SU2-B8)
- Validate local-ports.json before baking NEXT_PUBLIC_LOCAL_BACKENDS (SU2-B9)
- test: warn-once assertions retry-safe; stop console leaks (SU2-B10)
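The `$`-pattern hazard behind SU2-B6 (and its redirect-layer sibling SU2-A1) comes from `String.prototype.replace`, which interprets sequences such as `$&` or `$1` in a replacement string. A minimal sketch, assuming a hypothetical `substituteSlug` helper (the real backend-url code may be structured differently):

```typescript
// ASSUMPTION: substituteSlug is an illustrative helper name.
function substituteSlug(pattern: string, slug: string): string {
  // A function replacer's return value is inserted verbatim, so "$&", "$1",
  // or "$$" inside the slug are never treated as replacement patterns.
  return pattern.replace(/\{slug\}/g, () => slug);
}

// With a string replacement, "$&" expands to the matched text "{slug}".
const naive = "showcase-{slug}.example.com".replace("{slug}", "a$&b");

// With the function replacer, the slug is copied through untouched.
const immune = substituteSlug("showcase-{slug}.example.com", "a$&b");
```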
CR2-C series — test infrastructure:
- Generate registry.json in a vitest globalSetup (CR2-C1)
- Stop ambient POSTHOG_KEY firing real fetches in middleware tests (CR2-C2)
- Assert the production slug set, not a re-derivation (CR2-C3)
- Make the registry generator subprocess robust (CR2-C4)
- Middleware/wiring test hygiene batch (CR2-C5)
SU2-A series — redirect-layer & PostHog capture:
- Stop $-pattern expansion in wildcard redirect substitution (SU2-A1)
- Surface PostHog capture failures once per failure class (SU2-A2)
- Duplicate exact redirect sources are first-match-wins (SU2-A3)
- Resolve runtime config once per redirected request (SU2-A4)
- Include destination host in seo_redirect capture (SU2-A5)
- Normalize scheme-less POSTHOG_HOST at the capture use site (SU2-A6)
- Correct redirect-layer comments and guard wildcard prefix boundary (SU2-A7)
- Cover docs-host hardening branches, compile matcher via path-to-regexp (SU2-A8)
…g (SU2-stragglers + SU5 + SU6-A/B)
Round-by-round CR convergence covering the redirect builder, the middleware
matcher, the docs-host self-loop guard, and the runtime-config env readers.
Highlights:
- Clear module-load warns after fresh middleware import
- Validate SET BASE_URL values (scheme-less/degenerate/garbage) with
sentinel fallback + once-guarded FATAL log
- Normalize path/query/fragment-bearing DOCS_HOST to origin; reject
non-http(s) schemes; branch rejection reasons
- Harden POSTHOG_HOST (degenerate-host/scheme rejection); expose
posthogKey via readEnvPair semantics
- Reject a DOCS_HOST equal to the shell's own host (redirect-loop guard,
authority compare)
- Warn on missing local-ports.json under SHOWCASE_LOCAL=1 and validate
TCP port range; extract helper for tests
- backend-url hardening — slug charset guard, frozen local-backends memo,
pattern path-segment warn, local-override URL validation
- Client config fail-loud covers all four URL fields with type checks
- Make RuntimeConfig.posthogKey optional — absence is a valid state, not
a wiring bug
- Drop R15/R17 and guard /integrations from SEO redirects
- Dedup duplicate wildcard prefixes with first-match-wins warn
- Validate malformed SEO entries at lookup-build time
- Restore case-insensitive redirect matching parity
- Normalize trailing slashes before redirect matching
- Keep the framework segment on F13, pin MG3 case fix
- Read posthogKey from runtime config in middleware, not raw process.env
- Fall back to the default backend host pattern for degenerate values
- Disable docs redirects when the default fallback collides with the
shell host
- Bring validateBaseUrl to parity with its sibling readers
- Strip query/fragment from POSTHOG_HOST while keeping reverse-proxy paths
- Restrict local backend overrides to http(s) URLs
- Add server-only guard to runtime-config
- Harden localBackendsEnv failure posture
- Hoist /integrations namespace guard above the docs-host redirect
- Validate seo-redirect sources and cross-kind shadowing in
buildRedirectLookup
- Unify slash normalization for middleware matching
- Lowercase-normalize REGISTRY_FRAMEWORK_SLUGS at construction
- Escalate missing POSTHOG_KEY to console.error in production
- Skip all redirect steps when docs redirects are disabled (sentinel
consumer)
- Reject userinfo credentials in DOCS_HOST, POSTHOG_HOST, and the backend
host pattern
- Branch dev-vs-prod logging in readDocsHost and fatalPatternOnce
- Prepend http:// (not https://) to scheme-less loopback hosts
- Round-5 micro-finding batch across the URL config libs
SU5-A1..A7 — registry safety, // reject, builder lint batch (case-
insensitive :path*, same-destination twin allowlist, original-case
divergence remainder), matcher api boundary, generator+vitest infra, test
hygiene + empty docs-host guard, comment batch.
SU6-A1..A6 — reject miscased :path* tokens, warn on tokenless wildcards,
normalize redirect-destination comparisons like request time, reject
destinations containing "//", surface missing POSTHOG_KEY at config-
resolution time, compile matcher harness like Next's runtime, type
parse/tokensToRegexp in the path-to-regexp shim, keep buildRedirectLookup
JSDoc attached.
SU6-B1..B7 — reject query/fragment/userinfo in pattern and local-override
URL gates, return parsed-normalized URL form from validation success
paths, distinguish unset/blank/padded SHOWCASE_LOCAL states, warn when
SHOWCASE_LOCAL is set to a value other than 1, validate {slug} placeholder
in generate-registry, mirror middleware drop semantics in the wiring
test's registry re-derivation, pin the noStore spy and calls to one fresh
module instance in the Edge-path test.
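The scheme-prepending rule from the highlights above (https:// for a scheme-less host, http:// for a scheme-less loopback host) can be sketched like this. It is an illustration under stated assumptions: `prependScheme` and both regular expressions are hypothetical, and the real readers perform further validation and logging that is omitted here.

```typescript
// ASSUMPTION: hypothetical helper; covers only the scheme-prepend step.
const HAS_SCHEME = /^[a-z][a-z0-9+.-]*:\/\//i;
// Loopback names must end at ":", "/", or end of string, so that
// "localhost.evil.com" is not mistaken for a loopback host.
const LOOPBACK = /^(localhost|127\.0\.0\.1|\[::1\])(?=[:/]|$)/i;

function prependScheme(raw: string): string {
  const value = raw.trim(); // drop whitespace paste artifacts
  if (value === "" || HAS_SCHEME.test(value)) return value;
  return (LOOPBACK.test(value) ? "http://" : "https://") + value;
}
```

For example, `prependScheme("localhost:3000")` yields `http://localhost:3000`, while `prependScheme(" eu.posthog.example ")` yields `https://eu.posthog.example`.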
…-side parity (SU7-F1/F2/F3)
SU7-F1 — backend host pattern hardening:
- F1.1 Reject bare trailing ?/# in the backend host pattern
- F1.2 Strip internal tab/CR/LF from the backend host pattern
- F1.3 Warn when ignoring an empty-string local backend override
- F1.4 Reject empty-userinfo @ in the backend host pattern authority
- F1.5 Keep __proto__ keys as data in local-backend maps
- F1.6 Commit the local-backends memo key only after the value computes
- F1.7 Trim local backend overrides before validation and name the real
rejection
- F1.8 Honest FATAL when the pattern host is a stray scheme fragment
- F1.9 Canonicalize the pattern authority for parity with the override
path
- F1.10 Acknowledge the staging-to-prod fail-open in the pattern fallback
- F1.11 Harden backend-url/local-backends-env test hygiene
SU7-F2 — runtime-config & client-config edge cases:
- F2.1 Branch POSTHOG_HOST rejection reasons (scheme/degenerate/parse-
failure) instead of the catch-all mislabel
- F2.2 Reject loopback BASE_URL/DOCS_HOST in production instead of the
silent http:// prepend
- F2.3 Key the DOCS_HOST fallback once-guard on (mode, shellHost, value)
and mode-prefix all value-only guard keys
- F2.4 Reject a present-but-empty posthogKey in the client config reader
- F2.5 Drop the trailing slash from SSR_PLACEHOLDER_URL for structural
parity with server values
- F2.6 Attribute the DOCS_HOST slash-strip to readDocsHost itself
- F2.7 Normalize trailing-dot FQDN spellings in the docs self-host loop
guard (both compare sides)
- F2.8 Harden console spies to capture all log args; pin the full all-env
config shape; converge SSR simulation on vi.stubGlobal
SU7-F3 — script-side parity, table classification & test isolation:
- F3 #1 Handle a missing reference integration per the error contract
- F3 #2 Port the runtime backend-host-pattern normalization into the
generator — scheme/trailing-slash strip, degenerate fallback,
NEXT_PUBLIC fallback
- F3 #3 Treat non-mapping manifest parses (empty/null/scalar/array YAML)
as validation errors, not TypeErrors
- F3 #4 Label a missing/unreadable constraints.yaml per the stderr+exit(1)
error contract
- F3 #5 Align atomic-write tmp naming with the test harness straggler-
sweep convention; guard main() on direct invocation
- F3 #6 Correct the determineCellStatus unshipped docstring; replace
stale hardcoded cell counts with formulas
- F3 #7 Isolate the pattern suite on a per-suite tmpdir harness; snapshot
the generator's full write set
- F3 #8 Classify discarded duplicate wildcards as duplicates — hoist the
owner check above the destination warns
- F3 #9 Reject a root ("/") EXACT seo-redirect source — homepage-hijack
twin of the root-wildcard guard
- F3 #10 Reject seo-redirect entries with non-printable-ASCII source/
destination — close the silent-dead-entry class
- F3 #11 Strip trailing slashes in normalizePosthogHost before the scheme
test
- F3 #12 Message-filter the empty-slug-set error count; pin the single
matcher entry
Pure prettier-style rewrite (line-wrapping, single-line collapses, multi-line callback indentation). No semantic changes: confirmed via tsc clean, oxlint clean, vitest 235/235 pass in showcase/shell.
Touched files:
- showcase/scripts/__tests__/generate-registry-pattern.test.ts
- showcase/scripts/generate-registry.ts
- showcase/shell/src/lib/backend-url.ts
- showcase/shell/src/lib/backend-url.test.ts
- showcase/shell/src/lib/docs-redirects.test.ts
- showcase/shell/src/lib/local-backends-env.test.ts
- showcase/shell/src/lib/runtime-config.ts
- showcase/shell/src/lib/runtime-config.test.ts
- showcase/shell/src/lib/runtime-url-wiring.test.ts
- showcase/shell/src/middleware.ts
- showcase/shell/src/middleware.test.ts
- showcase/shell/vitest.global-setup.ts
## Summary

Anti-dual-writer defense for the showcase `status` collection: the flap-comb incident class where the legacy monolith scheduler and the fleet writer fight over the same rows.

- **Writer identity columns**: new `written_by` / `state_written_at` columns on `status` (two PB migrations, idempotency-symmetric up/down paths). Every write stamps a stable host-derived writer identity; the legacy orchestrator wires `writtenBy: "legacy"` explicitly.
- **Flip / foreign-write detection**: the status writer detects cross-writer state flips inside a validated window and warns (TTL-deduped re-warns so sustained fighting stays visible; one-time warn when the self-write memory cap starts evicting; same-identity replica hint).
- **PB-safe date normalization**: `observedAt` is normalized to RFC-3339 PB-safe shapes before any date-field write (zone-less shapes, colonless offsets rejected); no PB 400 unlatch-retry hot loops.
- **Aggregator honesty**: padded/blank projected keys normalized or skipped loudly; explicit outcome discriminators replace ambiguous asserts (no consumer hot-loop); `droppedCommError` surfaced whenever a comm error misses the aggregate row; trusted-negative duplicate handling scoped to cell-vs-cell so no cell can impersonate an aggregate row.
- **CLI persistence honesty**: thrown driver errors are persisted to PB in the `runDriverInputs` catch; the summary carries `WriteOutcome` discriminators (dropped-write counts, correct pluralization); writer error classification documented honestly (401 vs 403 split, new `pb_not_found` reason).
- **Alert/fixture truthfulness**: the alert engine's synthesized cron outcome is stamped `persisted: false`; alert/orchestrator/probe test fixtures align with the real `StatusWriter` / `OverlayWriteOutcome` contracts, with fail-loud fake-PB fixtures hardened against silent divergence from real PocketBase.

Hardened over 10 CR rounds (~60 reviewer reports), converged to zero findings.
## Test plan

- [x] Full harness vitest suite: 124 test files / 2378 tests passing
- [x] `tsc --noEmit` and build typecheck (`tsconfig.build.json`) clean
- [x] oxfmt clean on all touched files; oxlint 0 errors
- [x] Known flake note: `probe-invoker.test.ts` wall-clock assertion (elapsed 101ms vs <100ms bound) fired once under full-suite load and passes in isolation (67/67); pre-existing timing sensitivity, unrelated to this diff
…s, and middleware/builder hardening (SU) (#5401)

## Summary

End-to-end hardening of the showcase shell's URL plane:

- Carries `backendHostPattern` + `docsHost` in the shell runtime config (no longer baked from `registry.json` at Docker build time). Derives demo backend URLs at runtime from the pattern, so a new pattern via env reconfigures every integration on the next deploy without a registry rebuild.
- Issues docs-host 301/308 redirects from middleware with a runtime `DOCS_HOST`; misconfigured values no longer 500 every docs route; they fall back to a sentinel that disables the docs-redirect step.
- Hardens the redirect table builder + matcher: first-match-wins for duplicate exact sources, deduped wildcard prefixes with warn, malformed entries rejected at lookup-build time, case-insensitive matching parity, trailing-slash normalization, structural `/integrations` namespace guard above the docs-host redirect (closes R15/R17 hijack class).
- Hardens runtime-config + backend-url env readers: scheme/whitespace/control-char normalization, query/fragment/userinfo rejection on `DOCS_HOST` / `POSTHOG_HOST` / backend-host pattern / local-override URLs, present-but-empty `posthogKey` rejection, prod loopback `BASE_URL`/`DOCS_HOST` rejection (no silent `http://` prepend), once-guarded FATAL logging (no per-request spam).
- Brings the build-time twin in `showcase/scripts/generate-registry.ts` to parity with the runtime normalizer (scheme/trailing-slash strip, degenerate fallback, `NEXT_PUBLIC` fallback, slug validation).
- Surfaces PostHog capture failures once per failure class; keeps capture alive across the redirect via `event.waitUntil`; missing `POSTHOG_KEY` is `console.error` in production and surfaces at config-resolution time (not first redirect).
- Open-redirect hardening on `/shared//evil.com` (SU-18); `//` rejected at both source and destination; root `/` exact-source and `/:path*` wildcard sources rejected; non-printable-ASCII source/destination rejected.
Stream: SU (shell-runtime-urls). Subject groups SU-2/8/11/13/14/15/16/17/18/19/20 (initial), SU2-A/B (env-hardening), CR2-C (test infra), SU5-A1..A7 (registry/builder/lint/matcher boundary), SU6-A1..A6 (request-time normalization parity), SU6-B1..B7 (parsed-normalized return forms, SHOWCASE_LOCAL states, generator `{slug}` validation), SU7-F1..F3 (final-round backend-pattern + POSTHOG + table + script-side parity).

## Test plan

- [x] `pnpm test` in `showcase/shell`: 7 files, 235/235 pass
- [x] `pnpm test` in `showcase/scripts`: 51 files, 1857/1857 pass (one pre-existing flake in `generate-registry.test.ts` unrelated to this diff; passes on subsequent runs, classic vitest fork-reuse cross-file pollution)
- [x] `pnpm exec tsc --noEmit` in `showcase/shell`: clean
- [x] `oxlint` on every changed file: 0 warnings, 0 errors
- [x] `oxfmt --check` on every changed file: clean
- [x] `pnpm build` in `showcase/shell` (full Next 15.x build with registry+demo-content+starter-content+search-index generation): clean
- [ ] CI green on the PR: confirm via `gh pr checks` after push
…rve low-frequency jobs
claimNext listed ONE global oldest-50 pending page; with the d4+d5 producers ticking every 15min against 2 serial browser workers, a persistent backlog permanently saturated that page and e2e-demos jobs never entered the candidate set (prod: all 18 e2e-demos jobs pending forever; staging: 3,734 pending, oldest 22h).
claimNext now discovers the distinct families present in pending (oldest first, one perPage=1 query per family) and tries them in round-robin rotation across calls, listing a per-family candidate page for each. Every discovered family is attempted before giving up, so no family starves while any of its jobs are claimable. The S0 CAS exactly-one-winner semantics and the per-page anti-herd shuffle are unchanged; only candidate SELECTION changed.
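The round-robin selection described above can be sketched in memory. This is a simplified illustration, not the queue-client implementation: the `Map` stands in for the per-family candidate pages, `tryClaim` stands in for the S0 CAS, and all names are hypothetical.

```typescript
// ASSUMPTION: family -> pending job ids (oldest first), held in memory.
type PendingByFamily = Map<string, string[]>;

// Rotate the discovered families so successive calls start at a different one.
function rotationOrder(families: string[], cursor: number): string[] {
  if (families.length === 0) return [];
  const start = cursor % families.length;
  return [...families.slice(start), ...families.slice(0, start)];
}

// Attempt every discovered family before giving up, so a backlog in one
// family cannot hide claimable jobs in another.
function claimNextFair(
  pending: PendingByFamily,
  cursor: number,
  tryClaim: (jobId: string) => boolean, // stand-in for the CAS claim
): string | null {
  for (const family of rotationOrder(Array.from(pending.keys()), cursor)) {
    for (const jobId of pending.get(family) ?? []) {
      if (tryClaim(jobId)) return jobId;
    }
  }
  return null;
}
```

A caller would advance `cursor` on each call, so a low-frequency family such as e2e-demos heads the rotation regularly regardless of how deep another family's backlog is.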
Producers enqueued a fresh batch every scheduled tick regardless of whether the family's previous batch had even been claimed — against 2 serial browser workers the queue compounded without bound (staging: 3,734 pending), feeding the claim-page starvation. A scheduled tick now skips its family's batch when that family already has pending (unclaimed) jobs, bounding the per-family backlog to one batch, with a structured fleet.producer.skipped-for-backlog log (family, pendingCount, skippedJobs) and a skippedForBacklog count on TickResult. The check is per family, fails OPEN on a count blip (a PB read failure must never stop production), and is BYPASSED by operator-triggered ticks (explicit intent wins; the trigger CLI treats 0 enqueued as failure). Backed by a new FleetQueueClient.countPendingForFamily (server-side totalItems count of a family's pending rows).
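The gate's three rules (skip on backlog, fail open on a count failure, bypass for operator-triggered ticks) can be sketched as a single decision function. This is a synchronous simplification with hypothetical names; the real check is backed by `FleetQueueClient.countPendingForFamily`, whose exact signature is not shown here.

```typescript
type GateReason = "operator-triggered" | "no-backlog" | "backlog" | "count-failed-open";

interface GateDecision {
  enqueue: boolean;
  reason: GateReason;
  pendingCount?: number;
}

// ASSUMPTION: countPending is a synchronous stand-in for the PB count query.
function backlogGate(
  family: string,
  countPending: (family: string) => number,
  operatorTriggered: boolean,
): GateDecision {
  // Explicit operator intent wins: no count is taken at all.
  if (operatorTriggered) return { enqueue: true, reason: "operator-triggered" };
  try {
    const pendingCount = countPending(family);
    return pendingCount > 0
      ? { enqueue: false, reason: "backlog", pendingCount }
      : { enqueue: true, reason: "no-backlog", pendingCount };
  } catch {
    // Fail OPEN: a read failure must never stop production.
    return { enqueue: true, reason: "count-failed-open" };
  }
}
```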
…ins structurally
sweepExpired only reclaimed claimed/running leases; a pending row had no terminal path, so an accumulated backlog (staging: 3,734 pending, oldest 22h) could only drain through 2 serial workers and effectively never did. The sweep now also expires pending jobs older than expiryPeriods x their family's production period (default 3 periods; the family has enqueued fresher batches since, so the stale job's result would be ancient data). Each stale row is first CLAIMED via the S0 CAS under a synthetic stale-pending-sweeper id (so the delete can never race a worker: exactly-one-winner), then deleted; a failed delete self-heals via normal lease expiry + re-queue. Unparseable created timestamps are conservatively skipped (delete is destructive).
Policy is configurable via FleetQueueClientConfig.stalePending; the control-plane wires the real per-family cadences (FLEET_FAMILY_PERIODS_MS: d4/d5 15min, d6/e2e-demos hourly) so 15min families expire on a 45min window. No comm error is synthesized for expired-pending rows (they never ran); the count surfaces as SweepResult.expiredPending and in the producer's sweep log.
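The expiry rule above reduces to a small age check. A minimal sketch, assuming a hypothetical `isStalePending` helper and a plain record in place of FLEET_FAMILY_PERIODS_MS; the 1h fallback for unknown families is modeled here as an assumption about the default period.

```typescript
const DEFAULT_PERIOD_MS = 60 * 60 * 1000; // assumed 1h fallback period

// ASSUMPTION: illustrative helper, not the queue-client's actual function.
function isStalePending(
  created: string,
  family: string,
  nowMs: number,
  periodsMs: Record<string, number>,
  expiryPeriods = 3,
): boolean {
  const createdMs = Date.parse(created);
  // Deleting is destructive, so an unparseable timestamp is never stale.
  if (Number.isNaN(createdMs)) return false;
  const periodMs = periodsMs[family] ?? DEFAULT_PERIOD_MS;
  return nowMs - createdMs > expiryPeriods * periodMs;
}
```

With a 15min period and the default of 3 periods, a job becomes stale once it is more than 45 minutes old.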
…e lease phase just re-queued
sweepExpired's lease phase re-queues an expired-lease row to pending and
emits worker-reclaimed-pending ("back in flight"), but the stale-pending
phase of the SAME call lists pending fresh and ages rows off PB's system
`created` (the ORIGINAL enqueue time) — so a long-claimed job was
re-queued then immediately claimed-and-deleted, falsifying the comm
error and nulling downstream aggregate-key resolution on the deleted
row. Track the ids re-queued in this sweep and exclude them from this
call's stale phase; a truly stale job ages out on the next sweep. The
`created` anchor is kept (re-anchoring needs a column; out of scope).
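The exclusion can be sketched as a set filter between the two phases. The row shape and function name are hypothetical; only the idea (ids re-queued in this sweep are skipped by this sweep's stale phase) comes from the description above.

```typescript
interface PendingRow {
  id: string;
  created: string; // original enqueue time, not the re-queue time
}

// ASSUMPTION: requeuedThisSweep is collected by the lease phase of the same call.
function stalePhaseCandidates(
  pending: PendingRow[],
  requeuedThisSweep: ReadonlySet<string>,
): PendingRow[] {
  // A row re-queued moments ago still carries its old `created`, so it must
  // be excluded here; if truly stale it ages out on the next sweep.
  return pending.filter((row) => !requeuedThisSweep.has(row.id));
}
```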
…e-sweeper rows
When the stale-pending sweep CAS-claims a row under stale-pending-sweeper and the delete fails, the next lease sweep treated the expired sweeper lease like a crashed worker's: it re-queued the row AND synthesized a worker-reclaimed-pending comm error: a gray "re-queued / back in flight" dashboard overlay for stale garbage mid-deletion, attributed to a non-existent worker. The lease sweep now special-cases rows held by the stale-pending sweeper: still re-queued (the self-healing delete-retry contract is unchanged) but silently: no comm error, no reclaimed count, just a stale-sweeper-retry-requeue debug line. The lease holder is snapshotted before the release CAS so attribution reflects who held the expired lease, not the post-release row.
…ses beyond the page are reclaimed
The sweepExpired lease phase listed claimed/running rows with perPage 50 and NO sort: with >50 such rows (mass worker crash), PB's unspecified default order could return the same 50 live-lease rows every sweep, leaving expired leases beyond the page permanently orphaned with zero signal. Sort by lease_expires_at ascending (indexed) so the most-expired rows always head the page, and WARN when the page is full so truncation is observable. Single page per sweep is kept deliberately; the sort guarantees progressive forward drain.
… one
A single 50-row page per sweep was far slower than the incident the stale-pending drain exists for: against the 3,734-row staging backlog at ~10 sweeps/hour that is ~7.5 hours of drain. The sweep now loops candidate pages (re-listing page 1, since deletes shift pagination) up to a cap of 10 pages / 500 rows per sweep, draining the same backlog in well under an hour while bounding a single sweep's PB load. CAS-claim-then-delete per row is unchanged; a pass that expires nothing terminates the loop.
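The drain estimates above follow from simple arithmetic, sketched here with a hypothetical `drainHours` helper. It assumes every sweep expires a full quota of rows, which is the best case.

```typescript
// Hours to drain a backlog when each sweep removes at most rowsPerSweep rows.
function drainHours(backlog: number, rowsPerSweep: number, sweepsPerHour: number): number {
  return Math.ceil(backlog / rowsPerSweep) / sweepsPerHour;
}

// Figures from the commit message: 3,734 rows, ~10 sweeps/hour.
const onePagePerSweep = drainHours(3734, 50, 10); // 75 sweeps -> 7.5 hours
const tenPagesPerSweep = drainHours(3734, 500, 10); // 8 sweeps -> 0.8 hours
```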
filterBackloggedFamilies ran BEFORE maybeSweep, so the tick whose own stale-pending drain cleared a family's backlog still counted the about-to-be-expired rows and skipped that family — production resumed a full cron period late. Reordered tick() to sweep first; the cadence gate and fail-open semantics (maybeSweep swallows sweep failures) are unchanged, only the order moved.
…ucers
Only the d6 producer was built with onSweepCommErrors, but all four family producers run the same GLOBAL queue.sweepExpired on their own crons, and the sweep's S0 CAS means whichever producer ticks first wins each expired job's reclaim, along with its synthesized comm error. With smoke/demos/deep sweeping far more often than d6's hourly :40, the worker-reclaimed-pending dashboard overlay (and stale-pending telemetry) was dropped ~11 of 12 sweeps, since job-producer's maybeSweep forwards comm errors only when the sink is wired. Share the ONE control-plane sink (surfaceSweepCommErrors -> aggregator) across all four producers and correct the now-false "preserves the current behavior" comment. Sweeps remain CAS-safe across producers: the S0 CAS guarantees exactly one producer reclaims (and forwards) each expired job, and the surfacing leg is best-effort per error, so the shared sink introduces no double-write.
Red-green: new runControlPlane REQ-B test drives the SMOKE producer's tick and asserts its swept overlay reaches the status row (RED against d6-only wiring, GREEN after). The test doUnmocks/re-mocks everything it touches so it passes in isolation despite the file's leaked doMock factories.
…ed-pending semantics
The sweep no longer synthesizes worker-crashed-mid-job (it cannot tell a crash from a platform teardown); it re-queues the job and emits the neutral worker-reclaimed-pending kind. Update the queue-client module header, the contracts kind/heartbeat/SweepResult/sweepExpired docs, the job-producer sink/tick docs, the sweep test fixtures and titles, and the dashboard's mirrored kind description. worker-crashed-mid-job is now documented as the worker self-observed in-driver crash only. No runtime behavior changes.
…er tick outcome
maybeSweep's catch arm returned the same shape as a clean zero-reclaim sweep (sweptExpired: true, reclaimed: 0), so a thrown sweepExpired call was indistinguishable from success in the TickResult and the tick-complete log. Add sweepFailed to the sweep outcome, TickResult, and the tick-complete log; the cadence latch is unchanged (a failed sweep still consumes its window so a persistently-failing sweep cannot fire on every tick).
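The discriminator can be sketched as follows. This is a synchronous simplification with a hypothetical outcome type; the field names sweptExpired, reclaimed, and sweepFailed come from the message above, everything else is assumed.

```typescript
interface SweepOutcome {
  sweptExpired: boolean;
  reclaimed: number;
  sweepFailed: boolean;
}

// ASSUMPTION: sweep is a synchronous stand-in returning the reclaimed count.
function maybeSweep(sweep: () => number): SweepOutcome {
  try {
    return { sweptExpired: true, reclaimed: sweep(), sweepFailed: false };
  } catch {
    // Still counts as "swept" so the cadence window is consumed, but the
    // failure is now visible instead of looking like a clean zero-reclaim.
    return { sweptExpired: true, reclaimed: 0, sweepFailed: true };
  }
}
```

Without `sweepFailed`, the two outcomes below would be structurally identical: `maybeSweep(() => 0)` and `maybeSweep(() => { throw new Error("PB down"); })`.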
The field was dead on the only consumer path: queue-client's enqueue() destructures only `payload` and never reads leaseSeconds (the claim lease comes from the WORKER side: claimNext(workerId, leaseSeconds) with the worker-loop's DEFAULT_LEASE_SECONDS). No production call site ever set it, so wiring it up would add a knob nothing needs; delete is chosen over wire-it.
Call-site enumeration (all removed):
- contracts.ts EnqueueJobInput.leaseSeconds (declaration; never read by queue-client.ts enqueue, the sole FleetQueueClient.enqueue impl)
- job-producer.ts ServiceJobSpec.leaseSeconds (only producer thereof)
- job-producer.ts toEnqueueInput() spec.leaseSeconds -> input.leaseSeconds threading (only writer of the field)
- job-producer.test.ts spec fixture leaseSeconds: 600 + the `expect(input.leaseSeconds).toBe(600)` assertion that legitimized the dead plumbing
Worker-side lease plumbing (claimNext/renewLease/worker-loop leaseSeconds) is unrelated and untouched.
…umerators' families
stalePendingFilters silently falls back to the 1h default period for any family missing from FLEET_FAMILY_PERIODS_MS, so a typo'd key (e.g. "d5" vs "d5-single-pill-e2e") would never throw; it would just quietly mis-size that family's stale-pending drain window. Lock the map's keys to the probe-key families derived by RUNNING the four real enumerator factories against a fake discovery source, so either side drifting breaks the test. Also document the known d6 FLEET_PRODUCER_CRON override drift limitation on the map (an env override changes d6's real cadence without updating the nominal period).
- orchestrator.test.ts: file-level afterEach doUnmocks queue-client, status-writer, and result-consumer (vi.doMock factories persist across the file; resetModules clears the module cache, not the mock registry) so a leaked stub can't poison later tests. Full file re-run: 100/100 pass with the leaks closed; no test was depending on a leaked factory.
- orchestrator.test.ts: the R5-G4 webhook-secret tests now save/restore POCKETBASE_URL like the HF13-A2 pattern instead of unconditionally deleting it in finally.
- job-producer.test.ts: the no-warm test stubbed a local fetch spy it never wired in (vacuously zero calls); stub GLOBAL fetch via vi.stubGlobal (+ vi.unstubAllGlobals in afterEach) and assert the unconfigured producer never falls back to it.
- queue-client.test.ts: famOf re-implemented probeKeyFamily; import the production helper from contracts so the tests can't drift from the real family rule.
- control-plane.test.ts: the invalid-cron latch test now also retries start() on the FAILED instance and asserts it throws again (a stuck-true latch would make the retry a silent no-op).
…ong-expired carve-out
Sibling of the queue-client prose fix (CF7 #10), which flagged this contract doc as describing only the drain phase: despite the name, the lease phase's long-expired carve-out also claim-deletes claimed/running rows (stale created-age, long-expired or unparseable lease) into this count: no re-queue, no comm error, no reclaimed increment.
… grafts older signal (CF8 F3)
The cold-load comm-error supplemental fetch runs CONCURRENTLY with the bulk pages, so the bulk copy of an aggregate row can be NEWER than the supplemental snapshot (the row's state changed between the two reads). The previous merge replaced the bulk row unconditionally, regressing state/observed_at and potentially fail_count back to the older supplemental values until the row's next SSE delta (long for slow-cadence aggregates).
Add a freshness guard: when the supplemental row is strictly older (by observed_at), keep the newer bulk row INTACT: signal-less rather than chimera (newer core + stale signal). A chimera row would be silently swallowed by the reducer's signal-PRESENCE no-op check; a signal-less bulk row lets the next SSE delta restore the real current signal via the undefined→defined presence flip. Equal timestamps and unparseable timestamps both prefer the supplemental (signal-bearing) row; only POSITIVELY-stale supplemental is suppressed, preserving the cold-load comm-error overlay intent of CF7-F3 #1.
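The freshness guard can be sketched as a comparison on observed_at. The row shape, the `signal` field, and `mergeSupplemental` are assumptions for illustration; the signature shown for `supplementalRowIsOlder` is likewise a guess, not the one in `useLiveStatus.ts`.

```typescript
interface StatusRow {
  state: string;
  observed_at: string;
  signal?: unknown; // hypothetical comm-error signal carried by the supplemental row
}

// True only when the supplemental row is POSITIVELY older than the bulk row.
function supplementalRowIsOlder(bulk: StatusRow, supplemental: StatusRow): boolean {
  const bulkMs = Date.parse(bulk.observed_at);
  const suppMs = Date.parse(supplemental.observed_at);
  // Unparseable timestamps prefer the supplemental (signal-bearing) row.
  if (Number.isNaN(bulkMs) || Number.isNaN(suppMs)) return false;
  return suppMs < bulkMs; // equal timestamps also prefer the supplemental row
}

// Keep the newer bulk row intact rather than grafting a stale signal onto it.
function mergeSupplemental(bulk: StatusRow, supplemental: StatusRow): StatusRow {
  return supplementalRowIsOlder(bulk, supplemental) ? bulk : supplemental;
}
```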
… tick-result doc reconciliations (Procedure 3 promotions)
PROMOTE_TO_A (defense-in-depth, exploit-class):
- fleet-health.ts reclaim list interpolated workerId raw into the PB filter literal. workerId is DB-sourced (read back from the workers roster row), not a compile-time constant, and the same field is escaped via JSON.stringify at orchestrator.ts:3240, but the reclaim path was missing the same hardening. A double-quote in worker_id (corrupt row, buggy self-registration) would either throw the list (silently skipping this worker's reclaim every cycle) or widen the filter to claim other workers' jobs. Match the sibling escape pattern.
PROMOTE_TO_B (doc reconciliations exposed by the CF round-8 audit):
- queue-client.ts COUNT-NAME CAVEAT was stale: the cross-referenced contracts.ts doc was already updated (commit 80c5c94) but the queue-client side still asked a future maintainer to make the edit that had already landed.
- WarmHealthConfig + JobProducerOptions.warmHealth docs said the producer warms 'every enumerated backend' / 'each enumerated spec'; the implementation warms gate.specs (post-validation, post-backlog-gate). A fully-backlogged tick warms nothing. Updated both interface-level docs to match.
- TickResult.reclaimedIndeterminate said 'Of reclaimed, the reclaims...', but the field is DISJOINT from reclaimed (sibling SweepResult contract + the queue-client both state a thrown release lands here exclusively). Rewrote to 'In addition to reclaimed, ...'.
- TickResult.skippedForBacklog doc named only the dedupe-gate contributor; the in-function comment correctly documents the fail-CLOSED poisoned-count fold-in. Expanded the field doc to name both contributors, and to clarify that the fail-OPEN leg lands in backlogGateFailedOpen separately.
- SweepResult.commErrors pairing equation (commErrors.length === reclaimed + reclaimedIndeterminate) was asserted unconditionally on the shared contract, but reclaimedIndeterminate is optional and fakes may not report the split. Scoped to 'implementations that report the split'.
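The PROMOTE_TO_A fix can be illustrated with a small sketch. The `JSON.stringify` escape pattern is the one the commit message names; the field name `claimed_by` and both helper functions are hypothetical and stand in for the real fleet-health.ts code.

```typescript
// Unsafe variant (what the reclaim path did): raw interpolation into the filter literal.
function reclaimFilterRaw(workerId: string): string {
  return `claimed_by = "${workerId}"`; // `claimed_by` is a hypothetical field name
}

// Hardened variant: JSON.stringify adds the surrounding double quotes and
// backslash-escapes any embedded quote or backslash, so the value stays one literal.
function reclaimFilterEscaped(workerId: string): string {
  return `claimed_by = ${JSON.stringify(workerId)}`;
}

// A quote-bearing worker_id (corrupt row, buggy self-registration).
const hostile = 'w1" || claimed_by != "';

// Raw: the value breaks out of the literal and widens the filter.
console.log(reclaimFilterRaw(hostile));
// Escaped: the embedded quotes are escaped and the literal stays closed.
console.log(reclaimFilterEscaped(hostile));
```

For a well-formed id the two variants produce the same filter string, so the hardening only changes behavior on the malformed inputs it is meant to contain.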
…see the D4 rung Mirrors cell-model resolveD4 fold semantics (worst-state-wins, 1h stale window, rank-based missing-chat collapse, chat-wins tie-break pinned by row-identity assertions); d4 is informational — rollup contributors unchanged.
…and scopes the rollup line honestly D4 row inserted with worst-state strictness; e2e relabeled "E2E (Demo)" atomically (label-derived testids); rollup line relabeled "Service (health + e2e)"; headline regression pins pill-red <-> visible red popup row via buildCellModel.chipColor cross-assert; green-badge tests FRESH-pinned.
…rid, legacy cells, and legend Grid d3 badge API->E2E, legacy e2e badge RT->E2E, legend D6 "Parity (PR)"->"Parity (Reference)", legend D4 chip copy 6h->1h (real window); d4 grid "RT" and d2 "API" unchanged.
The auth-middleware-presence regex in queue-client.test.ts hook-parity suite was anchored to the single-line `}, $apis.requireAdminAuth());` closer. After oxfmt rewrote fleet-claim.pb.js to its multi-line form (`},\n $apis.requireAdminAuth(),\n);`) the regex stopped matching and the test asserted 0 routes were guarded — a false alarm. Broaden the regex to accept both the single-line closer and the formatter's split form; the structural intent (each routerAdd handler-end is followed by the requireAdminAuth middleware) is unchanged.
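As a rough illustration of the broadened match, the sketch below counts handler closers in both shapes. The regex and the `countGuardedRoutes` helper are assumptions written for this example; the actual pattern in queue-client.test.ts may differ.

```typescript
// Accepts both closers:
//   single-line:  }, $apis.requireAdminAuth());
//   formatter's split form:  },\n  $apis.requireAdminAuth(),\n);
const GUARDED_CLOSER = /\},\s*\$apis\.requireAdminAuth\(\),?\s*\);/g;

// Counts handler-ends that are followed by the requireAdminAuth middleware.
function countGuardedRoutes(hookSource: string): number {
  return hookSource.match(GUARDED_CLOSER)?.length ?? 0;
}

const singleLine = 'routerAdd("POST", "/a", (c) => {\n  return 1;\n}, $apis.requireAdminAuth());';
const multiLine = 'routerAdd(\n  "POST",\n  "/b",\n  (c) => {\n    return 1;\n  },\n  $apis.requireAdminAuth(),\n);';

console.log(countGuardedRoutes(singleLine + "\n" + multiLine));
```

The `\s*` runs absorb the formatter's newlines and indentation, and `,?` tolerates the trailing comma it adds after the middleware argument.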
…, dashboard drilldown D4 parity (#5399)

## Summary

Bundles two related dashboard/harness lanes that landed together once both reached green:

1. **CF #18 fairness / claim-fair lane**: the 23-commit base (`fix/fleet-claim-fairness` rebased onto its CF7 integration tip `80c5c9402`) carrying claim-spike, cf3/cf4/cf5/cf6/cf7 hardening waves, plus the **CF8 micro-fix** for the round-8 supplemental-merge finding:
   - **CF8 F3 supplemental-merge freshness guard** (`useLiveStatus.ts`): the cold-load comm-error supplemental fetch runs CONCURRENTLY with the bulk pages, so the bulk copy of an aggregate row can be NEWER than the supplemental snapshot. The previous merge replaced the bulk row unconditionally, regressing `state`/`observed_at` to stale values until the row's next SSE delta (long for slow-cadence aggregates). Added `supplementalRowIsOlder`: when the supplemental row is strictly older, the newer bulk row stays INTACT (signal-less) rather than being grafted into a chimera (newer core + stale signal) that the reducer's signal-PRESENCE no-op check would silently swallow.
   - **Procedure 3 promotions** (bucket-c/d audit over the CF round-8 ledger): one functional fix (escape `workerId` in fleet-health's reclaim list filter, matching the `JSON.stringify` pattern used at orchestrator.ts:3240 for the same field; a `"`-bearing worker_id would otherwise break out of the literal) plus five doc reconciliations on the producer/contract surfaces flagged across slot1/slot2/slot4/slot5 (stale queue-client cross-reference, WarmHealthConfig doc vs `gate.specs` reality, TickResult.reclaimedIndeterminate "Of reclaimed" → "In addition to reclaimed", TickResult.skippedForBacklog poisoned-count contributor, SweepResult.commErrors equation scoping).
2. **Dashboard drilldown D4 parity**: three commits making the dashboard drilldown surface the D4 rung the same way the grid does:
   - `feat(showcase): add resolveD4Row + CellState.d4 so the drilldown can see the D4 rung`: exposes the D4 row on `CellState`, modeled after `resolveD3Row`.
   - `fix(showcase): drilldown shows the D4 row, de-crosses the e2e label, and scopes the rollup line honestly`: renders D4 in the drilldown panel, fixes the crossed e2e label, scopes the rollup line to its real source set.
   - `fix(showcase): unify dimension naming on the legend taxonomy across grid, legacy cells, and legend`: taxonomy cleanup so legend, grid, and legacy cells use one set of labels.

## Verification

- Dashboard `npx tsc --noEmit`: clean (only the pre-existing missing-`@/data/*.json` errors that exist on `main`).
- Dashboard `vitest run`: 59 files / 1012 tests passing, 1 skipped.
- Harness `npx tsc --noEmit`: clean.
- Harness `vitest run`: 121/122 files passing; 1 failed = `probe-invoker times out invoker-level even when driver ignores abortSignal`, the NAMED known wall-clock flake (per CF7 integration verification); re-run in isolation: 67/67 PASS.
- Rebase of drilldown-parity onto the updated `fix/cf8-m1` tip: ZERO conflicts (the two lanes touch disjoint files).

## Test plan

- [ ] CI green on the PR
- [ ] Dashboard drilldown shows D4 rung in addition to D3/D5/D6
- [ ] Legend taxonomy reads the same in legend, grid, drilldown
- [ ] (post-merge, in showcase) cold-load comm-error overlay still paints; a stale supplemental no longer regresses a freshly-failed aggregate row
…bility Adds run-id/family/worker-id columns to probe_jobs and resource_snapshots, plus the EnqueueJobInput.family + FleetQueueClient.pruneAged contracts and the hoisted deriveHealth primitive that downstream projections share. The fleet-claim PB hook stamps run-id/family at claim time so every later projection has a stable join key. probes/run-history is updated to read the new columns.
…rojection queue-client stamps run_id/family onto every enqueue so downstream projections can attribute jobs to a family-scoped batch; pruneAged retention legs land here (the d6 producer owns the call, per §4.2). result-aggregator computes redsIntroduced/redsCleared from claimed/ sequenced job state into probe_runs summary.
…ojection The §5.1 FLEET_FAMILIES registry and the §5.2.1 family-summary projection that derives per-family outcome / inflight / lastRun / lastSuccessAt from the PB-backed batches. The memoized variant fans out PB reads at most once per TTL regardless of viewer count: the SAME instance is shared by the /api/runs routes and the family-silence monitor, so a dashboard poll and a monitor evaluation inside the same TTL cost one PB fan-out total.
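The "one fan-out per TTL" property can be sketched with a generic memoizer. This is an assumption-laden illustration: `createTtlMemo`, its parameters, and the injected clock are hypothetical, and the real `createMemoizedFamilySummary` may be structured differently.

```typescript
// Caches the in-flight promise, so concurrent callers inside one TTL share a single load.
function createTtlMemo<T>(
  load: () => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now, // injectable clock, for tests
): () => Promise<T> {
  let cached: { at: number; value: Promise<T> } | undefined;
  return () => {
    const t = now();
    if (cached !== undefined && t - cached.at < ttlMs) return cached.value;
    const value = load();
    cached = { at: t, value };
    // Do not pin a failed load for the whole TTL: drop it so the next caller retries.
    value.catch(() => {
      if (cached !== undefined && cached.value === value) cached = undefined;
    });
    return value;
  };
}

// Usage: the routes and the monitor would hold the SAME instance.
let fanOuts = 0;
const getSummary = createTtlMemo(async () => ({ fanOut: ++fanOuts }), 5_000);
void getSummary(); // dashboard poll
void getSummary(); // monitor evaluation inside the same TTL: no second load
```

Sharing one instance rather than one memo per caller is what makes the cost independent of viewer count.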
job-producer takes a family option and stamps it on every enqueue; the prune-ownership key is the d6 producer's family. family-silence-monitor rides the existing fleet-health interval (no extra timer to tear down), keys 6 h rate-limit + recovered one-shot per family, fails open on PB-down (the meta-alert path must still fire), and renders alert text from closed-vocabulary parts only (§5.2.1 redaction). control-plane fire-and-forgets familySilence.tick(now) each fleet-health cycle.
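One way to read "6 h rate-limit + recovered one-shot per family" is the small per-family gate below. It is a sketch under stated assumptions: `createSilenceGate`, the `SilenceEmit` values, and the in-memory map are hypothetical, and boot-grace and the PB-down fail-open path are deliberately left out.

```typescript
type SilenceEmit = "silent" | "recovered" | null;

// Per-family gate: at most one "silent" alert per rate-limit window,
// and exactly one "recovered" notice after an alerted family comes back.
function createSilenceGate(rateLimitMs: number = 6 * 60 * 60 * 1000) {
  const lastAlertAt = new Map<string, number>();
  return function evaluate(family: string, isSilent: boolean, nowMs: number): SilenceEmit {
    const last = lastAlertAt.get(family);
    if (isSilent) {
      // Still inside the window since the last alert: stay quiet.
      if (last !== undefined && nowMs - last < rateLimitMs) return null;
      lastAlertAt.set(family, nowMs);
      return "silent";
    }
    if (last !== undefined) {
      // One-shot: clearing the entry guarantees the notice is not repeated.
      lastAlertAt.delete(family);
      return "recovered";
    }
    return null;
  };
}
```

Because the gate is called from an existing interval and holds only a map, there is no timer of its own to tear down.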
…stamp Read-only /api/runs routes mounted on the control-plane role, backed by the SHARED memoized family-summary instance (one PB fan-out per TTL). Bounds: per-route memo, request rate-limit. /health gains the fleetRuns.lastEvaluatedAt stamp from the family-silence monitor as the §9 compensating control for a wedged monitor (an external poll detects the wedge — the monitor cannot report its own host's death).
…to runControlPlane PRODUCER_FAMILY_WIRING (drift-locked set-equal to FLEET_FAMILIES via a unit test) drives every buildJobProducer call site; the boot-resolved worker-stale-after window is threaded through BOTH fleet-health and the shared family-summary projection so they judge staleness against the same window. Triggered CLI control-plane runs route through familyForLevel so a registry rename breaks loudly instead of silently enqueueing jobs invisible to the projection. Test queues across the harness gain a no-op pruneAged for the new contract.
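The drift lock amounts to a set-equality check between the registry and the wiring table. In the sketch below only the id "d6" comes from the source; the other three ids, the producer names, the table shape, and the `familyDrift` helper are placeholders for illustration.

```typescript
// Placeholder registry: only "d6" is a real id from the PR text.
const FLEET_FAMILIES = ["d6", "family-b", "family-c", "family-d"] as const;
type FamilyId = (typeof FLEET_FAMILIES)[number];

// Hypothetical wiring table: one entry per producer call site.
const PRODUCER_FAMILY_WIRING: ReadonlyArray<{ producer: string; family: FamilyId }> = [
  { producer: "producer-1", family: "d6" },
  { producer: "producer-2", family: "family-b" },
  { producer: "producer-3", family: "family-c" },
  { producer: "producer-4", family: "family-d" },
];

// Reports both directions of drift; set-equal means both lists are empty.
function familyDrift(
  registry: readonly string[],
  wired: readonly string[],
): { missing: string[]; unknown: string[] } {
  const registrySet = new Set(registry);
  const wiredSet = new Set(wired);
  return {
    missing: registry.filter((f) => !wiredSet.has(f)), // registered but never produced
    unknown: wired.filter((f) => !registrySet.has(f)), // produced but invisible to the projection
  };
}

const drift = familyDrift(FLEET_FAMILIES, PRODUCER_FAMILY_WIRING.map((w) => w.family));
console.log(drift.missing.length === 0 && drift.unknown.length === 0);
```

A unit test asserting that both lists are empty fails as soon as either side is renamed without the other.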
Adds the dashboard Ops worker-runs section — family table, worker strip, run-history drill-down, D0-from-staleness vs D0-from-failure family annotation with clock glyph, and the per-family silence banner on the coverage tab. Wires the data layer: DTOs, /api/runs fetchers, polling hook, and a worker-runs context provider. cell-drilldown / cell-pieces gain family-aware rendering.
… covers Adds a 689-line integration test that exercises the full queue lifecycle to the /api/runs projection (enqueue → claim → terminal → projection) across all four families. Updates the railway-envs golden + verify-deploy drivers regression test to account for the new fleet-runs route surface.
…ker-health helper parity
…t hardcoded literal
…ve side-effects out of setState updater
…t tests

- queue-client.ts: add missing closing brace at EOF (TS1005 after rebase)
- job-producer.test.ts: add required `family: "d6"` to producer fixtures and remove duplicate `logger` key in startedProducer
- result-aggregator.test.ts: align with per-row try/catch + dedup-lookup behavior introduced in 0b2f613; add `persisted: true` and `writeOverlay` to test writers, update B6 contract expectations
- queue-client.test.ts: drop a stray blank line
…ily-silence Slack alerting (#5400)

## Summary

Adds end-to-end **run-visibility** to the showcase fleet: a new data path from queue enqueue → per-family projection → dashboard + Slack alerting:

- **Contracts + PB schema**: `EnqueueJobInput.family`, `FleetQueueClient.pruneAged`, run-id/family/worker-id columns on `probe_jobs` / `resource_snapshots`, fleet-claim hook stamps at claim time.
- **Queue + aggregator**: queue-client stamps run-id/family on enqueue; d6 producer owns `pruneAged` retention; aggregator computes `redsIntroduced`/`redsCleared` into `probe_runs`.
- **Run-view projection (`run-view.ts`)**: §5.1 `FLEET_FAMILIES` registry + §5.2.1 family-summary projection. `createMemoizedFamilySummary` bounds PB load at ~one fan-out per TTL regardless of viewer count.
- **Producer wiring**: `PRODUCER_FAMILY_WIRING` drift-locks the four producers' family ids set-equal to the registry (unit-tested); CLI-triggered runs route through `familyForLevel` so a registry rename breaks loudly.
- **Family-silence monitor (§9)**: Slack alerting for the one incident class transition-keyed rules are blind to (a silent family produces no row transitions). Rides the existing fleet-health interval (no extra timer), per-family 6h rate-limit + recovered one-shot + boot-grace, fails open on PB-down for the meta-alert. Closed-vocabulary redaction.
- **HTTP `/api/runs`** + **`/health.fleetRuns.lastEvaluatedAt`**: read-only fleet-runs routes mounted unconditionally on the CP role, backed by the SAME memoized summary the monitor uses. `/health` stamp is the §9 compensating control so an external poll can detect a wedged monitor.
- **Orchestrator wiring**: boot-resolved `workerStaleAfterMs` threaded through both fleet-health and the projection so they judge staleness against the same window; test queues across the harness gain a no-op `pruneAged`.
- **Dashboard**: worker-runs Ops section (family table, worker strip, run-history drill-down), D0-from-staleness vs D0-from-failure family annotation, per-family silence banner on the coverage tab; data layer (DTOs, fetchers, polling hook, context provider).
- **Integration test**: 689-line end-to-end test exercising the full queue lifecycle → /api/runs projection across all four families.

This is the merged scope of the **runviz blitz** lanes T1-T15.

## Test plan

- [x] `pnpm typecheck` on `showcase/harness` and `showcase/shell-dashboard`: clean
- [x] `vitest run` on full harness suite: 2276 pass / 3 pre-existing probe-pool timing flakes (NOT in diff; subject is "probe-pool timing fragility", a different PR's concern)
- [x] `oxfmt --check showcase/`: all green
- [x] Integration test `run-visibility.integration.test.ts` passes; exercises queue → projection across all four families
- [ ] CI on PR HEAD: pending push, monitored after open

## Known follow-up (bucket-d)

- Probe-pool timing-sensitive flakes in `src/probes/helpers/browser-pool.test.ts` (FIX#4a self-heal) and `src/probes/loader/probe-invoker.test.ts` (timeout assertions): a different test fails on each run; subject is probe-pool timing fragility, not runviz.

## Deviation note

The runviz ship-finisher session ran in a Claude Agent SDK harness without an Agent dispatch tool, so the standard 7-agent CR-loop could not be dispatched. Inline review was performed against the cr-loop subject-manifest + four-bucket partition: T1-T15 each landed via their own per-task review on the blitz integration branch before reaching this PR, and T8/T15 (the two final-merged streams) were validated via typecheck-clean + targeted-tests-pass + cross-module wiring inspection. A heavyweight 7-agent CR can be run against this PR if desired.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )