P1: event-driven keepalive loop enforces none of agents:paused / needs-human / run-cap; operator stop-controls do not stop it #2267

@stranske

Why

The event-driven keepalive loop is the canonical re-dispatch path (root workflow
.github/workflows/agents-keepalive-loop.yml, "Agents Keepalive Loop"; consumer
templates/consumer-repo/.github/workflows/agents-81-gate-followups.yml). Its
decision is made by evaluateKeepaliveLoop in
.github/scripts/keepalive_loop.js:2253. docs/keepalive/GoalsAndPlumbing.md
documents three operator stop/throttle controls that this function is supposed to
honor:

  • §4 line 82: "Respect the agents:paused label, which blocks all keepalive
    activity."
  • §4 line 83 / §4 lines 85-88: "After repeated failures (default: 3), the loop
    pauses and adds needs-human label" and resumption requires removing
    needs-human.
  • §3 line 74: "Respect agents:max-parallel:<K> when present (integer 1-5)" — a
    per-PR run cap.

Verified: evaluateKeepaliveLoop's decision tree enforces none of these.
The action is chosen at keepalive_loop.js:2624-2635:

if (hasDefinitiveConflict && hasAgentLabel && keepaliveEnabled) { action = 'conflict'; ... }
else if (!hasAgentLabel) { ... reason = 'missing-agent-label'; }
else if (!keepaliveEnabled) { action = 'skip'; reason = 'no-checklists'; }

with keepaliveEnabled = config.keepalive_enabled && hasAgentLabel
(keepalive_loop.js:2399). There is no read of agents:paused, no read of
needs-human, and no run-cap parse anywhere in the function.

Specific wrong/missing behavior:

  1. agents:paused is ignored. grep -n "agents:paused" .github/scripts/keepalive_loop.js
    returns nothing; the literal is defined only in the orchestrator-path file
    .github/scripts/keepalive_gate.js:12 (const PAUSE_LABEL = 'agents:paused').
    In agents-keepalive-loop.yml the label appears only inside the
    state-fingerprint hash — where adding it changes the hash and therefore
    guarantees the run is not deduped away. A PR with agent:codex +
    agents:paused + green Gate + unchecked tasks dispatches normally.
  2. needs-human does not pause at dispatch. The label is added at
    keepalive_loop.js:3998-4002 (inside the if (stop) block of
    updateKeepaliveLoopSummary), but evaluateKeepaliveLoop never consults
    needs-human or state.failure.count when choosing the action. On the next
    green Gate (or any label event — adding needs-human itself flips the
    fingerprint), gate-green + tasks-remaining → run again. The "pause" only
    holds while the Gate stays red.
  3. No run cap. No agents:max-parallel:<K> (or any K override) is parsed in
    the loop path; the only throttle is the per-PR concurrency group plus the
    runner-dispatch debounce.
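
The fingerprint effect described in items 1 and 2 can be illustrated with a small sketch. The fingerprint inputs used here (a head SHA plus the sorted label list) and the names stateFingerprint and seen are assumptions for illustration only; the actual hash inputs live in agents-keepalive-loop.yml and may differ.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical fingerprint: SHA-256 over the head SHA and the sorted labels.
// The real workflow's hash inputs are not reproduced here.
function stateFingerprint({ headSha, labels }) {
  const payload = JSON.stringify({ headSha, labels: [...labels].sort() });
  return createHash('sha256').update(payload).digest('hex');
}

const before = stateFingerprint({ headSha: 'abc123', labels: ['agent:codex'] });
const after = stateFingerprint({
  headSha: 'abc123',
  labels: ['agent:codex', 'agents:paused'],
});

// A dedupe layer that only remembers fingerprints it has already seen
// treats the post-label state as new work.
const seen = new Set([before]);
console.log(seen.has(after)); // false
```

Under these assumptions, adding the pause label makes the state look new to the dedupe layer, so the label can only stop the loop if the decision function itself checks for it.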

This is a current break (verified by reading the live decision tree, not a
latent edge case): the primary operator stop control (agents:paused) and the
documented failure-pause (needs-human) do not stop the event-driven loop.
agents:paused and needs-human only take effect on the 30-minute orchestrator
path (keepalive_gate.js:1002), not on the event-driven loop that fires on every
Gate completion.

Scope

Enforce the documented stop/throttle controls at the top of the loop's decision
function, evaluateKeepaliveLoop in .github/scripts/keepalive_loop.js:

  • Return a non-dispatching action (action:'skip') when agents:paused is
    present on the PR.
  • Return action:'skip' when needs-human is present, or when
    state.failure.count >= failureThreshold (the failure_threshold config,
    default 3, already parsed in this function).
  • Parse and honor a per-PR run cap label and skip when at/over cap.

The labels array (lowercased label names) is already computed at
keepalive_loop.js:2318-2320, and the keepalive state (with failure.count) is
already loaded in this function — both are available at the top of the tree.
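
As a rough sketch of the first two guards, assuming the lowercased labels array and the loaded failure count are passed in directly: checkStopControls and its parameter names are hypothetical, and the real return shape should match the existing skip path rather than the minimal objects shown here.

```javascript
// Label literals as documented; the constant names are illustrative.
const PAUSE_LABEL = 'agents:paused';
const NEEDS_HUMAN_LABEL = 'needs-human';

/**
 * Returns a skip decision when an operator stop control applies, else null.
 * `labels` is expected to be the already-lowercased label-name array.
 * `failureThreshold` defaults to 3, the documented failure_threshold default.
 */
function checkStopControls({ labels, failureCount, failureThreshold = 3 }) {
  if (labels.includes(PAUSE_LABEL)) {
    return { action: 'skip', reason: 'paused' };
  }
  if (labels.includes(NEEDS_HUMAN_LABEL)) {
    return { action: 'skip', reason: 'needs-human' };
  }
  // A missing or non-numeric count is treated as "no failures recorded".
  if (Number.isFinite(failureCount) && failureCount >= failureThreshold) {
    return { action: 'skip', reason: 'failure-threshold' };
  }
  return null; // no stop control applies; continue to normal action selection
}

console.log(
  checkStopControls({ labels: ['agent:codex', 'agents:paused'], failureCount: 0 }),
);
```

Checking the pause label first means an explicit operator pause is reported as 'paused' even when the failure threshold has also been reached; that ordering is a design choice of this sketch, not something the docs prescribe.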

Non-Goals

  • Do NOT modify the orchestrator-path enforcement in
    .github/scripts/keepalive_gate.js or
    .github/scripts/keepalive_orchestrator_gate_runner.js; those already honor
    agents:paused/run-cap and stay as-is.
  • Do NOT change the state-fingerprint hashing in agents-keepalive-loop.yml
    (lines around the agents:paused/needs-human hash entries); this fix is in
    the JS decision function, not the dedupe layer.
  • Do NOT remove the needs-human / agent:needs-attention add behavior at
    keepalive_loop.js:3985-4002; this issue adds the consume side, it does not
    touch escalation.
  • Do NOT reconcile the separate label-name drift (agents:max-parallel vs the
    orchestrator's agents:max-runs) or the agents:keepalive activation
    question in LABELS.md here — pick ONE run-cap label name, wire it, and note the
    choice; the doc/label reconciliation is a separate issue.
  • Scaffold-only completion does NOT count: adding a paused/needs-human
    variable that is read but does not change the returned action, or a test
    that asserts the label is present rather than that it forces skip, is a
    failure of this issue. The deliberate-break acceptance criterion below must be
    demonstrated, and the new test must drive evaluateKeepaliveLoop end-to-end
    (not a helper in isolation).

Tasks

  • In evaluateKeepaliveLoop (.github/scripts/keepalive_loop.js:2253),
    before the action-selection block at lines 2624-2635, add an early
    agents:paused guard: when the lowercased labels array (built at
    keepalive_loop.js:2318-2320) includes agents:paused, return
    { action: 'skip', reason: 'paused', ... } with the same return shape used by
    the existing skip path (keepalive_loop.js:2823-2872).
  • In the same function, add a needs-human / failure-threshold guard:
    return { action: 'skip', reason: 'needs-human' } (or 'failure-threshold')
    when labels includes needs-human OR when the loaded keepalive state's
    failure.count >= failureThreshold (the failure_threshold value parsed by
    parseConfig, default 3 per keepalive_loop.js:1604-1607). This guard must
    short-circuit action selection regardless of the Gate conclusion.
  • Parse a per-PR run-cap label in evaluateKeepaliveLoop (define the prefix
    constant near the other label constants, e.g. const RUN_CAP_PREFIX = ...),
    clamp to 1-5, and skip with reason:'run-cap-reached' when in-progress run
    count is at/over the cap. Name the chosen label in ## Implementation Notes;
    if reusing the orchestrator's existing agents:max-runs: prefix
    (keepalive_gate.js:9), import or re-declare it consistently.
  • Ensure the new skip reasons are treated as neutral (no failure-count
    increment, no PR-comment noise) consistent with the existing neutral-stop
    handling at keepalive_loop.js:3135 and the §5 No-Noise policy
    (docs/keepalive/GoalsAndPlumbing.md:97).
  • Extend .github/scripts/__tests__/keepalive-loop.test.js with the
    deliberate-break test described in Acceptance Criteria, modeled on the
    existing "evaluateKeepaliveLoop waits when agent label is missing"
    (keepalive-loop.test.js:180) and "... skips when keepalive is disabled"
    (keepalive-loop.test.js:200) cases, using the buildGithubStub helper
    (keepalive-loop.test.js:23).
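
One possible shape for the run-cap task, as a sketch only: it assumes the orchestrator's agents:max-runs: prefix is the one chosen, and parseRunCap, checkRunCap, and inProgressRuns are hypothetical names. How the in-progress run count is obtained is left out.

```javascript
// Assumption: reuse the orchestrator's prefix; the issue leaves the choice open.
const RUN_CAP_PREFIX = 'agents:max-runs:';

/**
 * Returns the cap from the first well-formed run-cap label, clamped to
 * [min, max], or null when no such label is present.
 */
function parseRunCap(labels, { min = 1, max = 5 } = {}) {
  for (const label of labels) {
    if (!label.startsWith(RUN_CAP_PREFIX)) continue;
    const raw = label.slice(RUN_CAP_PREFIX.length).trim();
    if (!/^\d+$/.test(raw)) continue; // ignore malformed values such as "abc"
    const value = Number.parseInt(raw, 10);
    return Math.min(max, Math.max(min, value));
  }
  return null;
}

/** Returns a skip decision when the PR is at or over its cap, else null. */
function checkRunCap(labels, inProgressRuns) {
  const cap = parseRunCap(labels);
  if (cap !== null && inProgressRuns >= cap) {
    return { action: 'skip', reason: 'run-cap-reached', cap };
  }
  return null;
}

console.log(checkRunCap(['agent:codex', 'agents:max-runs:2'], 2));
```

Out-of-range values are clamped rather than rejected here, so a label requesting 9 behaves like a cap of 5; rejecting such labels outright would be an equally defensible reading of "integer 1-5".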

Acceptance Criteria

  • New named test in .github/scripts/__tests__/keepalive-loop.test.js, e.g.
    "evaluateKeepaliveLoop skips when agents:paused is present": it builds a PR
    via buildGithubStub with labels ['agent:codex','agents:paused'], a green
    Gate run, and at least one unchecked task, calls evaluateKeepaliveLoop,
    and asserts result.action === 'skip' and result.reason === 'paused'. A
    parallel case asserts result.action === 'skip' for a PR carrying
    needs-human (green Gate + unchecked tasks). Run via
    node --test .github/scripts/__tests__/keepalive-loop.test.js → both pass.
  • Deliberate-break gate: after implementing the guard, temporarily
    comment out the new agents:paused early-return at the top of
    evaluateKeepaliveLoop (keepalive_loop.js:2253). With the guard removed, the
    new agents:paused test must FAIL — concretely, result.action comes back
    as 'run'/'fix' (a dispatch) instead of 'skip', proving the test catches a
    loop that ignores the pause label. Capture the FAIL output, then restore the
    guard so the test passes again.
  • The existing 102 passing cases in
    .github/scripts/__tests__/keepalive-loop.test.js still pass after the change
    (node --test .github/scripts/__tests__/keepalive-loop.test.js shows fail 0),
    confirming the new guards do not regress the existing wait/skip/run/fix/verify
    decisions.

Implementation Notes

  • Confirmed-green local baseline (node v24.3.0) from the Workflows repo root:
    node --test .github/scripts/__tests__/keepalive-loop.test.js → pass 102 fail 0.
  • evaluateKeepaliveLoop is exported at keepalive_loop.js:4679 (module.exports
    begins there); the test harness imports it at keepalive-loop.test.js:12.
  • The lowercased label list is already available inside the function at
    keepalive_loop.js:2318-2320; reuse it rather than re-fetching labels.
  • Grounding docs: docs/keepalive/GoalsAndPlumbing.md:74 (run cap), :82
    (agents:paused), :83 and :85-88 (needs-human pause/resume), :97 (no
    PR-comment noise on skip). docs/LABELS.md:25 (agents:paused "Pauses
    keepalive loop on PR"), :453-464, :546.
  • Orchestrator-path reference for the same semantics (do not edit, use as the
    contract): keepalive_gate.js:12 (PAUSE_LABEL), keepalive_gate.js:1002
    (hasPauseLabel), keepalive_gate.js:9 (MAX_RUNS_PREFIX = 'agents:max-runs:').

Metadata

Labels: needs-human (Requires human intervention or review)