Problem
agentv serve runs evals as child processes and tracks them in an in-memory activeRuns map. It has no SIGTERM handler (serve.ts:2144 blocks on await new Promise(() => {})). When Kubernetes sends SIGTERM during a rolling deployment:
- agentv serve exits immediately; the running eval child process is orphaned briefly, then killed by SIGKILL
- /data is ephemeral overlay (no PVC in this deployment), so the partial index.jsonl and per-test artifacts are lost
- On pod restart the eval is gone, with no way to resume
The partial index.jsonl is written to disk incrementally as each test completes (via JsonlWriter.append), so recovery is technically possible: the native --resume flag reads the existing records and skips already-completed tests. What's missing is (a) persisting the partial state off-pod before exit, and (b) re-launching the interrupted runs on the new pod.
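To illustrate why a partial index.jsonl is enough to resume, here is a minimal sketch of reading the completed test ids back from disk. It is not the actual --resume implementation, which this issue does not show: the `test_id` field and the `readCompletedTestIds` helper are hypothetical names chosen for the example.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical record shape: the real index.jsonl schema may differ.
interface IndexRecord {
  test_id: string;
}

/** Collect the ids of tests already recorded in a partial index.jsonl. */
function readCompletedTestIds(indexPath: string): Set<string> {
  const done = new Set<string>();
  if (!existsSync(indexPath)) return done;
  for (const line of readFileSync(indexPath, "utf8").split("\n")) {
    if (line.trim() === "") continue;
    try {
      const record = JSON.parse(line) as Partial<IndexRecord>;
      if (typeof record.test_id === "string") done.add(record.test_id);
    } catch {
      // A process killed mid-write can leave a truncated final line; skip it.
    }
  }
  return done;
}

// Usage: two complete records followed by a truncated third line.
const dir = mkdtempSync(join(tmpdir(), "agentv-resume-"));
const indexPath = join(dir, "index.jsonl");
writeFileSync(indexPath, '{"test_id":"t001"}\n{"test_id":"t002"}\n{"test_id":"t0');
const completed = readCompletedTestIds(indexPath);
const remaining = ["t001", "t002", "t003"].filter((id) => !completed.has(id));
console.log(remaining); // only t003 is left to run
```

Skipping unparseable lines matters here because a SIGKILL can land in the middle of an append, leaving the last record incomplete.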
Proposed solution
Three phases, handled entirely within agentv serve:
Phase 1 — write run-params.json at spawn time (eval-runner.ts)
When spawning an eval child process (the POST /api/eval/run path), write a small manifest to the output directory before the child starts. This gives the shutdown handler (and restore logic) everything it needs to re-launch:
// apps/cli/src/commands/results/eval-runner.ts
// written to <outputDir>/run-params.json immediately after outputDir is determined
interface RunParamsFile {
  request: RunEvalRequest; // the full original API request body
  cwd: string;             // project root
  started_at: string;      // ISO timestamp
}
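As a rough sketch of the write step, the manifest could be persisted as follows. `writeRunParams` is a hypothetical helper name, and `RunEvalRequest` is stubbed as a generic record because its real definition is not shown in this issue.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Stand-in for the real RunEvalRequest type (assumption for this example).
type RunEvalRequest = Record<string, unknown>;

interface RunParamsFile {
  request: RunEvalRequest;
  cwd: string;
  started_at: string;
}

/** Hypothetical helper: write run-params.json before the child is spawned. */
function writeRunParams(outputDir: string, request: RunEvalRequest, cwd: string): string {
  mkdirSync(outputDir, { recursive: true });
  const manifest: RunParamsFile = { request, cwd, started_at: new Date().toISOString() };
  const manifestPath = join(outputDir, "run-params.json");
  writeFileSync(manifestPath, JSON.stringify(manifest, null, 2) + "\n", "utf8");
  return manifestPath;
}

// Usage: write the manifest into a temporary run dir and read it back.
const outputDir = join(mkdtempSync(join(tmpdir(), "agentv-run-")), "run-1");
const manifestPath = writeRunParams(outputDir, { suite: "example" }, process.cwd());
const roundTripped = JSON.parse(readFileSync(manifestPath, "utf8")) as RunParamsFile;
console.log(roundTripped.started_at);
```

Writing synchronously before the spawn keeps the ordering simple: if the child exists, its manifest exists too.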
Also export a helper for the SIGTERM handler:
export function getInterruptibleRuns(): Array<{
  outputDir: string;
  process: ChildProcess | undefined;
}> {
  return [...activeRuns.values()]
    .filter(r => r.status === 'running' && r.outputDir)
    .map(r => ({ outputDir: r.outputDir!, process: r.process }));
}
Phase 2 — SIGTERM handler in serve.ts
Replace await new Promise(() => {}) with a proper shutdown path:
// apps/cli/src/commands/results/serve.ts (around line 2144)
let shuttingDown = false;

process.once('SIGTERM', async () => {
  if (shuttingDown) return;
  shuttingDown = true;
  console.log('[serve] SIGTERM: checkpointing in-progress evals…');

  const runs = getInterruptibleRuns();

  // SIGTERM each child so its JsonlWriter flushes and closes cleanly
  for (const { process: child } of runs) {
    try { child?.kill('SIGTERM'); } catch {}
  }
  await new Promise(r => setTimeout(r, 2000)); // brief flush window

  // Push each partial run dir to a checkpoint branch on the results repo
  await Promise.allSettled(
    runs.map(({ outputDir }) =>
      maybeCheckpointRunArtifacts({ cwd, outputDir })
        .catch(err => console.warn(`[serve] checkpoint failed: ${err.message}`))
    )
  );

  process.exit(0);
});

await new Promise<never>(() => {});
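The fixed 2000 ms flush window does not confirm that each child has actually exited. One possible refinement, sketched below and not part of the proposal itself, is to wait for the child's exit event with an upper bound. `waitForExit` is a hypothetical helper; the stand-in child is a Node process that just sleeps.

```typescript
import { ChildProcess, spawn } from "node:child_process";

/** Resolve true once the child has exited, or false if timeoutMs elapses first. */
function waitForExit(child: ChildProcess, timeoutMs: number): Promise<boolean> {
  // Already exited (normally or by signal): nothing to wait for.
  if (child.exitCode !== null || child.signalCode !== null) return Promise.resolve(true);
  return new Promise<boolean>((resolve) => {
    const onExit = () => {
      clearTimeout(timer);
      resolve(true);
    };
    const timer = setTimeout(() => {
      child.off("exit", onExit);
      resolve(false);
    }, timeoutMs);
    child.once("exit", onExit);
  });
}

// Usage: a stand-in child that sleeps until it is signalled.
const child = spawn(process.execPath, ["-e", "setTimeout(() => {}, 60000)"], { stdio: "ignore" });
child.kill("SIGTERM");
waitForExit(child, 5000).then((exited) => {
  console.log(exited ? "child exited" : "timed out waiting for child");
});
```

Whatever bound is chosen, the wait plus the checkpoint push has to fit inside the pod's terminationGracePeriodSeconds (30 seconds by default in Kubernetes), after which the kubelet sends SIGKILL.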
Phase 3 — restore and auto-resume on startup (serve.ts)
During agentv serve startup, before the HTTP server binds, check the configured results repo for checkpoint branches and re-launch any interrupted runs:
// Before startServer(...):
// `config` is the loaded ResultsConfig, per the restoreCheckpointedRuns signature
const restored = await restoreCheckpointedRuns(config, cwd);
for (const run of restored) {
  console.log(`[serve] Resuming interrupted run from checkpoint: ${run.outputDir}`);
  await launchResumedRun(app, cwd, run.outputDir, run.params);
}
launchResumedRun reuses the same spawn path as POST /api/eval/run but adds resume: true and an explicit output dir. It also deletes the checkpoint branch after the run completes successfully.
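A minimal sketch of how the resumed request might be derived from the stored one, assuming the request body accepts `resume` and `output` fields. `buildResumedRequest` is a hypothetical helper, and `RunEvalRequest` is stubbed as a generic record.

```typescript
// Stand-in for the real RunEvalRequest type (assumption for this example).
type RunEvalRequest = Record<string, unknown>;

/** Hypothetical helper: copy the original request, forcing resume and the restored output dir. */
function buildResumedRequest(original: RunEvalRequest, outputDir: string): RunEvalRequest {
  return { ...original, resume: true, output: outputDir };
}

// Usage: the original request object is left untouched.
const original: RunEvalRequest = { suite: "example", resume: false };
const resumed = buildResumedRequest(original, "/data/runs/run-1");
console.log(resumed);
```

Returning a copy rather than mutating the stored request keeps run-params.json a faithful record of what was originally asked for.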
New core functions (packages/core/src/evaluation/results-repo.ts)
// Push a partial run dir to a dedicated checkpoint branch (force-push, non-default branch)
export async function pushCheckpointBranch(params: {
  config: ResultsConfig;
  sourceDir: string;  // <outputDir> with partial index.jsonl
  branchName: string; // e.g. "inflight/pod-abc/2026-06-12T10-30-00Z"
}): Promise<void>

// Scan remote branches matching "inflight/*/*", download each, return resume params
export async function restoreCheckpointedRuns(
  config: ResultsConfig,
  targetRoot: string, // where to restore run dirs (project cwd)
): Promise<Array<{ outputDir: string; params: RunEvalRequest; cwd: string }>>

// Delete a checkpoint branch after successful resume
export async function deleteCheckpointBranch(
  config: ResultsConfig,
  branchName: string,
): Promise<void>
Branch / state layout
results repo (e.g. WiseTechAcademy.EvalResults)
  refs/heads/main                    ← completed runs (existing)
  refs/heads/inflight/               ← checkpoints (new, non-default, not pushed to main)
    pod-agentv-abc123/
      2026-06-12T10-30-00-000Z/      ← one branch per interrupted run
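The naming scheme above can be sketched as a pair of pure functions. Both `checkpointBranchName` and `parseCheckpointBranch` are hypothetical names, and the assumption is that the timestamp segment is the run's ISO start time with ":" and "." rewritten to "-", which matches the example branch shown in the layout.

```typescript
const CHECKPOINT_PREFIX = "inflight";

/** Build "inflight/<pod>/<timestamp>" from a pod name and a start time. */
function checkpointBranchName(podName: string, startedAt: Date): string {
  // Git ref names cannot contain ":"; "." is also rewritten to match the layout above.
  const stamp = startedAt.toISOString().replace(/[:.]/g, "-");
  return `${CHECKPOINT_PREFIX}/${podName}/${stamp}`;
}

/** Split a checkpoint branch name back into its parts, or return null if it does not match. */
function parseCheckpointBranch(branch: string): { podName: string; stamp: string } | null {
  const parts = branch.split("/");
  if (parts.length !== 3 || parts[0] !== CHECKPOINT_PREFIX) return null;
  const [, podName, stamp] = parts;
  if (!podName || !stamp) return null;
  return { podName, stamp };
}

// Usage
const branch = checkpointBranchName("pod-agentv-abc123", new Date("2026-06-12T10:30:00.000Z"));
console.log(branch); // inflight/pod-agentv-abc123/2026-06-12T10-30-00-000Z
console.log(parseCheckpointBranch(branch));
console.log(parseCheckpointBranch("main")); // null
```

Keeping the parser strict about the three-segment shape means the restore scan ignores main and any unrelated branches.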
Branch contents mirror the run dir structure:
index.jsonl ← partial results (completed tests only)
run-params.json ← original request params for re-launch
task/ ← bundled eval + targets files (written at run start)
t001/ t002/ … ← per-test artifacts for completed tests
console.log ← captured stdout/stderr up to interruption
Security
- Pushed to private results repos (EvalResults) only — never to this deploy repo
- index.jsonl and per-test artifacts contain model outputs and test prompts; they should not be pushed to a public repo
- No API keys, GitHub App private keys, or other secrets appear in run artifacts — those are env-only or volume-mounted
Scope
| File | Change | Est. lines |
| --- | --- | --- |
| apps/cli/src/commands/results/eval-runner.ts | write run-params.json at spawn; export getInterruptibleRuns() | ~25 |
| apps/cli/src/commands/results/serve.ts | SIGTERM handler; restore-on-startup call | ~40 |
| packages/core/src/evaluation/results-repo.ts | pushCheckpointBranch, restoreCheckpointedRuns, deleteCheckpointBranch | ~80 |
Total: ~145 lines across 3 existing files.
Non-goals (v1)
- Periodic mid-run checkpoints: not needed for the SIGTERM case; can add later if OOM / node-eviction losses prove material
- Multi-pod awareness: deployment is replicas=1, so no concurrent-writer collision concern today
- entrypoint changes: not required if serve.ts handles restore on startup itself