feat(serve): SIGTERM handler — checkpoint partial eval state to GitHub branch, restore and auto-resume on startup #1368

@christso

Problem

agentv serve runs evals as child processes and tracks them in an in-memory activeRuns map. It has no SIGTERM handler (serve.ts:2144 blocks on await new Promise(() => {})). When Kubernetes sends SIGTERM during a rolling deployment:

  1. agentv serve exits immediately — the running eval child process is orphaned briefly then killed by SIGKILL
  2. /data is ephemeral overlay (no PVC in this deployment) — the partial index.jsonl and per-test artifacts are lost
  3. On pod restart the eval is gone — no way to resume

The partial index.jsonl is written to disk incrementally as each test completes (via JsonlWriter.append), so recovery is technically possible: the native --resume flag reads the existing records and skips already-completed tests. What's missing is (a) persisting the partial state off-pod before exit, and (b) re-launching the interrupted runs on the new pod.
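The resume behaviour described above amounts to a set difference over test IDs. The following is a minimal sketch, assuming each line of index.jsonl is a JSON object carrying a `test_id` field (a hypothetical field name; the real record schema is not shown in this issue) and that an abrupt kill may leave a truncated final line:

```typescript
// Sketch only: `test_id` is an assumed field name, not taken from the agentv schema.
interface IndexRecord {
  test_id: string;
}

/** Collect the IDs of tests that already have a complete record in index.jsonl. */
function completedTestIds(jsonl: string): Set<string> {
  const ids = new Set<string>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const record = JSON.parse(trimmed) as Partial<IndexRecord>;
      if (typeof record.test_id === "string") ids.add(record.test_id);
    } catch {
      // A line cut off mid-write is not valid JSON; treat that test as not completed.
    }
  }
  return ids;
}

/** Tests that still need to run after a resume. */
function remainingTests(allTestIds: string[], jsonl: string): string[] {
  const done = completedTestIds(jsonl);
  return allTestIds.filter(id => !done.has(id));
}
```

Skipping unparseable lines means a test whose record was only half written is simply re-run, which is the safe default for a partial file.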

Proposed solution

Three phases, handled entirely within agentv serve:

Phase 1 — write run-params.json at spawn time (eval-runner.ts)

When spawning an eval child process (the POST /api/eval/run path), write a small manifest to the output directory before the child starts. This gives the shutdown handler (and restore logic) everything it needs to re-launch:

// apps/cli/src/commands/results/eval-runner.ts
// written to <outputDir>/run-params.json immediately after outputDir is determined

interface RunParamsFile {
  request: RunEvalRequest;  // the full original API request body
  cwd: string;              // project root
  started_at: string;       // ISO timestamp
}
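One way the manifest write could look is sketched below. It assumes a write-to-temp-then-rename approach so that a crash never leaves a half-written run-params.json. The helper names `writeRunParams` and `readRunParams`, the example request body, and the `unknown` typing of `request` are illustrative only; the real RunEvalRequest type lives elsewhere in agentv.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// `request` is typed as unknown because RunEvalRequest is defined elsewhere in agentv.
interface RunParamsFile {
  request: unknown;
  cwd: string;
  started_at: string;
}

/** Write <outputDir>/run-params.json via a temp file followed by a rename. */
function writeRunParams(outputDir: string, request: unknown, cwd: string, now: Date = new Date()): string {
  mkdirSync(outputDir, { recursive: true });
  const manifest: RunParamsFile = { request, cwd, started_at: now.toISOString() };
  const finalPath = join(outputDir, "run-params.json");
  const tempPath = `${finalPath}.tmp`;
  writeFileSync(tempPath, JSON.stringify(manifest, null, 2));
  renameSync(tempPath, finalPath); // a rename within one filesystem replaces the file in a single step
  return finalPath;
}

/** Read the manifest back, as the restore path would. */
function readRunParams(outputDir: string): RunParamsFile {
  return JSON.parse(readFileSync(join(outputDir, "run-params.json"), "utf8")) as RunParamsFile;
}

// Usage: a temp directory stands in for the real output dir; the request body is made up.
const demoDir = mkdtempSync(join(tmpdir(), "agentv-run-"));
writeRunParams(demoDir, { eval: "example.yaml" }, "/workspace", new Date("2026-06-12T10:30:00.000Z"));
const restoredManifest = readRunParams(demoDir);
```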

Also export a helper for the SIGTERM handler:

export function getInterruptibleRuns(): Array<{
  outputDir: string;
  process: ChildProcess | undefined;
}> {
  return [...activeRuns.values()]
    .filter(r => r.status === 'running' && r.outputDir)
    .map(r => ({ outputDir: r.outputDir!, process: r.process }));
}

Phase 2 — SIGTERM handler in serve.ts

Replace await new Promise(() => {}) with a proper shutdown path:

// apps/cli/src/commands/results/serve.ts  (around line 2144)

let shuttingDown = false;

process.once('SIGTERM', async () => {
  if (shuttingDown) return;
  shuttingDown = true;
  console.log('[serve] SIGTERM — checkpointing in-progress evals…');

  const runs = getInterruptibleRuns();

  // SIGTERM each child so its JsonlWriter flushes and closes cleanly
  for (const { process: child } of runs) {
    try { child?.kill('SIGTERM'); } catch {}
  }
  await new Promise(r => setTimeout(r, 2000)); // brief flush window

  // Push each partial run dir to a checkpoint branch on the results repo
  await Promise.allSettled(
    runs.map(({ outputDir }) =>
      maybeCheckpointRunArtifacts({ cwd, outputDir })
        .catch(err => console.warn(`[serve] checkpoint failed: ${err.message}`))
    )
  );

  process.exit(0);
});

await new Promise<never>(() => {});
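Kubernetes sends SIGKILL once terminationGracePeriodSeconds (30 seconds by default) has elapsed after SIGTERM, so the flush window plus the checkpoint push have to fit inside that budget. The sketch below shows one way to bound the push; `checkpointBudgetMs`, `withTimeout`, and the 3-second safety margin are hypothetical and not part of the proposal above.

```typescript
/** Milliseconds left for the checkpoint push once the flush window and a safety margin are reserved. */
function checkpointBudgetMs(gracePeriodSeconds: number, flushWindowMs: number, safetyMarginMs: number): number {
  return Math.max(0, gracePeriodSeconds * 1000 - flushWindowMs - safetyMarginMs);
}

/** Resolve with the work's result, or with "timeout" if it takes longer than `ms`. */
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | "timeout"> {
  return new Promise<T | "timeout">((resolve, reject) => {
    const timer = setTimeout(() => resolve("timeout"), ms);
    work.then(
      value => {
        clearTimeout(timer);
        resolve(value);
      },
      error => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// With the Kubernetes default grace period of 30 s and the 2 s flush window used in the handler:
const pushBudgetMs = checkpointBudgetMs(30, 2000, 3000); // 25000
```

In the handler, the `Promise.allSettled(...)` call could be wrapped as `await withTimeout(Promise.allSettled(...), pushBudgetMs)` so that `process.exit(0)` is always reached before the kubelet escalates to SIGKILL.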

Phase 3 — restore and auto-resume on startup (serve.ts)

During agentv serve startup, before the HTTP server binds, check the configured results repo for checkpoint branches and re-launch any interrupted runs:

// Before startServer(...):
// `config` is the ResultsConfig for the results repo, matching the restoreCheckpointedRuns signature below.
const restored = await restoreCheckpointedRuns(config, cwd);
for (const run of restored) {
  console.log(`[serve] Resuming interrupted run from checkpoint: ${run.outputDir}`);
  await launchResumedRun(app, cwd, run.outputDir, run.params);
}

launchResumedRun reuses the same spawn path as POST /api/eval/run but adds resume: true and an explicit output dir. It also deletes the checkpoint branch after the run completes successfully.
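The request handed to the shared spawn path might be derived from the stored manifest roughly as follows. This is a sketch under assumptions: the `output_dir` field name and the `RunEvalRequestLike` shape are hypothetical, since the real RunEvalRequest fields are not shown in this issue; only `resume: true` comes from the description above.

```typescript
// Hypothetical shape standing in for RunEvalRequest.
interface RunEvalRequestLike {
  [key: string]: unknown;
  resume?: boolean;
  output_dir?: string;
}

/** Derive the request for a resumed run from the original request stored in run-params.json. */
function toResumedRequest(original: RunEvalRequestLike, outputDir: string): RunEvalRequestLike {
  // Copy rather than mutate so the stored manifest still reflects the original request.
  return { ...original, resume: true, output_dir: outputDir };
}
```

Keeping the original request untouched means a run that is interrupted a second time can be checkpointed and resumed from the same manifest.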

New core functions (packages/core/src/evaluation/results-repo.ts)

// Push a partial run dir to a dedicated checkpoint branch (force-push, non-default branch)
export async function pushCheckpointBranch(params: {
  config: ResultsConfig;
  sourceDir: string;      // <outputDir> with partial index.jsonl
  branchName: string;     // e.g. "inflight/pod-agentv-abc123/2026-06-12T10-30-00-000Z"
}): Promise<void>

// Scan remote branches matching "inflight/*/*", download each, return resume params
export async function restoreCheckpointedRuns(
  config: ResultsConfig,
  targetRoot: string,     // where to restore run dirs (project cwd)
): Promise<Array<{ outputDir: string; params: RunEvalRequest; cwd: string }>>

// Delete a checkpoint branch after successful resume
export async function deleteCheckpointBranch(
  config: ResultsConfig,
  branchName: string,
): Promise<void>
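If these functions shell out to git, the branch scan and the force-push could be built from two small pure helpers. The sketch below assumes the scan parses the output of `git ls-remote --heads <remote>` and that the push uses a plain `git push --force` with an explicit refspec; `parseInflightBranches` and `pushCheckpointArgs` are hypothetical names, and the actual transport (git CLI or GitHub API) is not specified in this issue.

```typescript
/** Extract checkpoint branch names from the output of `git ls-remote --heads <remote>`. */
function parseInflightBranches(lsRemoteOutput: string): string[] {
  const prefix = "refs/heads/";
  const branches: string[] = [];
  for (const line of lsRemoteOutput.split("\n")) {
    // Each line has the form "<object id>\t<ref name>".
    const ref = line.split("\t")[1]?.trim();
    if (!ref || !ref.startsWith(prefix)) continue;
    const branch = ref.slice(prefix.length);
    // Keep only "inflight/<pod>/<timestamp>", i.e. exactly three non-empty path segments.
    const parts = branch.split("/");
    if (parts.length === 3 && parts[0] === "inflight" && parts[1] !== "" && parts[2] !== "") {
      branches.push(branch);
    }
  }
  return branches;
}

/** argv (after "git") for force-pushing the current commit to a checkpoint branch. */
function pushCheckpointArgs(remoteUrl: string, branchName: string): string[] {
  return ["push", "--force", remoteUrl, `HEAD:refs/heads/${branchName}`];
}
```

Filtering on the client side keeps the scan independent of how the remote interprets glob patterns.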

Branch / state layout

results repo (e.g. WiseTechAcademy.EvalResults)
  refs/heads/main            ← completed runs (existing)
  refs/heads/inflight/       ← checkpoints (new; non-default branches, never merged into main)
    pod-agentv-abc123/
      2026-06-12T10-30-00-000Z/   ← one branch per interrupted run
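A branch name in this layout can be derived from the pod name and the run's start time. The sketch below assumes the timestamp is the ISO string stored as started_at in run-params.json and that the pod name is supplied by the caller (inside a Kubernetes pod the hostname normally equals the pod name); `checkpointBranchName` is a hypothetical helper.

```typescript
/** Build "inflight/<pod>/<timestamp>" with a timestamp that is legal in a git ref name. */
function checkpointBranchName(podName: string, startedAt: string): string {
  // Git ref names cannot contain ":"; "." is replaced as well to match the layout shown above.
  const safeTimestamp = startedAt.replace(/[:.]/g, "-");
  return `inflight/${podName}/${safeTimestamp}`;
}

const exampleBranch = checkpointBranchName("pod-agentv-abc123", "2026-06-12T10:30:00.000Z");
// exampleBranch === "inflight/pod-agentv-abc123/2026-06-12T10-30-00-000Z"
```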

Branch contents mirror the run dir structure:

index.jsonl          ← partial results (completed tests only)
run-params.json      ← original request params for re-launch
task/                ← bundled eval + targets files (written at run start)
t001/ t002/ …        ← per-test artifacts for completed tests
console.log          ← captured stdout/stderr up to interruption

Security

  • Pushed to private results repos (EvalResults) only — never to this deploy repo
  • index.jsonl and per-test artifacts contain model outputs and test prompts; they should not be pushed to a public repo
  • No API keys, GitHub App private keys, or other secrets appear in run artifacts — those are env-only or volume-mounted

Scope

| File | Change | Est. lines |
| --- | --- | --- |
| apps/cli/src/commands/results/eval-runner.ts | write run-params.json at spawn; export getInterruptibleRuns() | ~25 |
| apps/cli/src/commands/results/serve.ts | SIGTERM handler; restore-on-startup call | ~40 |
| packages/core/src/evaluation/results-repo.ts | pushCheckpointBranch, restoreCheckpointedRuns, deleteCheckpointBranch | ~80 |

Total: ~145 lines across 3 existing files.

Non-goals (v1)

  • Periodic mid-run checkpoints: not needed for the SIGTERM case; can add later if OOM / node-eviction losses prove material
  • Multi-pod awareness: deployment is replicas=1, so no concurrent-writer collision concern today
  • entrypoint changes: not required if serve.ts handles restore on startup itself
