Surface a structured react run report to surface non-optimal states #7

@pa911-eric

Summary

deploybot react is the privileged orchestrator that the GitHub Action runs on every delivery event. It already computes a rich result object (promoted, drain, integrations, release, top-level state) and prints it as JSON, but only to stdout inside one of many interleaved Actions runs, and it always exits 0. As a result, the worst operational failure mode is invisible: when the pipeline is paused, every triggered react pass no-ops, exits 0, and leaves CI green, so a stuck pipeline can go unnoticed indefinitely.

This issue proposes rendering the result react already produces into the surfaces humans and agents actually look at — a GitHub Actions step summary, a sticky PR/commit status, and a non-green CI signal when paused or timed-out — plus a run_id to correlate the event fan-in. This is additive, needs no schema or architecture change, and strengthens the existing --json / notify / comment-marker patterns.

Problem statement

  • command_react returns/prints a structured object but only to stdout, and the top-level main() returns 0 for every non-exception path. (cli.py command_react, main)
  • When pipeline_control() is paused, react returns {"state": "paused", "reason": ...} and exits 0; the reason lives only in a deploybot-control:v1 comment marker, and nothing surfaces it proactively. (records.py control_body)
  • The workflow uses concurrency: group=deploybot-${{ github.repository }}, cancel-in-progress: false and fans in 6 event families, so debugging "why didn't PR #42 merge?" means diffing multiple interleaved runs with no correlation id. (examples/github-workflow.yml)
  • follow_release can return timed-out, but only to stdout with exit 0 — a stuck deploy looks successful. (pipeline.py follow_release)
  • The Action invokes react and currently writes no $GITHUB_STEP_SUMMARY. (action.yml)

Operators have no single place to check what the latest react pass did or why the pipeline is stuck.

Proposed behavior

Render the existing command_react result into operator-visible surfaces:

  1. Actions step summary — when GITHUB_STEP_SUMMARY is set, append a Markdown table each pass: state, promoted, merged, waiting [{number, reason}], integration PRs, release/timeout. Built from the result object react already returns.
  2. Visible paused / timed-out signal — when state == "paused" or release.state == "timed-out", make the run non-green (documented non-zero exit and/or a failing check-run). A normal empty pass stays exit 0. The JSON state field remains authoritative for agents.
  3. Sticky status surface — upsert a single "DeployBot status" comment (or commit status) summarizing the latest pass, including, when paused, the reason and the deploybot control unpause remedy. Reuse the marker-upsert machinery in records.py.
  4. run_id correlation — compute once per pass (sha256(repo:utc_now)[:12], mirroring the intent_id pattern in command_request) and include it in the result JSON and in every notify() payload emitted during that pass.
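
A minimal sketch of item 1, rendering a react result as a Markdown table and appending it to the step summary file. The result shape used here is an assumption for illustration (promoted and integrations as lists of PR numbers, drain holding merged and waiting, release holding a state); the real command_react dict may differ, and render_summary and append_step_summary are hypothetical helper names.

```python
import os
from typing import Any, List, Mapping, Optional


def _cell(value: Any) -> str:
    """Escape pipes and newlines so a value cannot break the table row."""
    return str(value).replace("|", "\\|").replace("\n", " ")


def _numbers(items: List[int]) -> str:
    """Format PR numbers as '#1, #2', or 'none' for an empty list."""
    return ", ".join(f"#{n}" for n in items) or "none"


def render_summary(result: Mapping[str, Any]) -> str:
    """Render one react pass as a Markdown table (assumed result shape)."""
    drain = result.get("drain") or {}
    release = result.get("release") or {}
    waiting = "; ".join(
        f"#{w['number']}: {w['reason']}" for w in drain.get("waiting", [])
    ) or "none"
    rows = [
        ("state", result.get("state", "unknown")),
        ("promoted", _numbers(result.get("promoted", []))),
        ("merged", _numbers(drain.get("merged", []))),
        ("waiting", waiting),
        ("integration PRs", _numbers(result.get("integrations", []))),
        ("release", release.get("state", "n/a")),
    ]
    if result.get("reason"):
        # Paused passes carry a reason; show it directly under the state row.
        rows.insert(1, ("reason", result["reason"]))
    lines = ["| Field | Value |", "| --- | --- |"]
    lines += [f"| {name} | {_cell(value)} |" for name, value in rows]
    return "\n".join(lines) + "\n"


def append_step_summary(
    result: Mapping[str, Any], env: Optional[Mapping[str, str]] = None
) -> bool:
    """Append the table to $GITHUB_STEP_SUMMARY; no-op (False) when unset."""
    env = os.environ if env is None else env
    path = env.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(render_summary(result) + "\n")
    return True
```

Appending rather than overwriting keeps earlier steps' summaries intact, and returning a boolean makes the "unset means no-op" behavior easy to assert in tests.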

CLI / API / config changes

  • No new subcommand required; behavior attaches to the existing react flow.
  • Add run_id to command_react's result dict and thread it into notify() payloads.
  • Step-summary writing handled in the composite Action's final step (or behind a GITHUB_STEP_SUMMARY check in the CLI).
  • Optional: a non-zero exit policy for react when paused/timed-out.
  • No .mergequeue.toml schema changes. No marker schema changes.
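
One way the optional exit policy could look, as a sketch only. The specific exit codes are hypothetical placeholders (the issue does not pick values), and exit_code_for is an illustrative name rather than an existing function in cli.py.

```python
from typing import Any, Mapping

EXIT_OK = 0
EXIT_PAUSED = 3      # hypothetical value; the issue does not specify one
EXIT_TIMED_OUT = 4   # hypothetical value; the issue does not specify one


def exit_code_for(result: Mapping[str, Any]) -> int:
    """Map a react result to a process exit code.

    Only paused and timed-out passes are non-zero; every other pass,
    including an empty one, stays 0.
    """
    if result.get("state") == "paused":
        return EXIT_PAUSED
    release = result.get("release") or {}
    if release.get("state") == "timed-out":
        return EXIT_TIMED_OUT
    return EXIT_OK
```

Distinct codes for the two states would let a workflow step branch on the cause without parsing JSON, while the JSON state field stays authoritative for agents.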

Backward compatibility

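Additive by design. Default text/JSON stdout is unchanged aside from the new run_id field. The step summary and sticky status are auto-detected (e.g., the summary is written only when GITHUB_STEP_SUMMARY is present). The exit-code policy is the one visible behavior change: it applies only to paused and timed-out passes, so existing automation sees a non-zero exit only in those two states. Commit-pinned workflows are unaffected. Safe to ship in a minor release.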

Telemetry / logging needs

  • No external telemetry.
  • Reuse the existing notify() webhook for run_id-stamped events.
  • Reuse the existing comment-marker upsert for the sticky status surface.
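
The run_id stamping could be sketched roughly as below. The hash input follows the sha256(repo:utc_now)[:12] formula from the proposal; the notify() signature (a single dict payload) is an assumption, and compute_run_id and stamped are hypothetical names.

```python
import hashlib
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional

Payload = Dict[str, Any]


def compute_run_id(repo: str, now: Optional[datetime] = None) -> str:
    """Return the first 12 hex chars of sha256("<repo>:<utc timestamp>")."""
    now = now or datetime.now(timezone.utc)
    seed = f"{repo}:{now.isoformat()}"
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12]


def stamped(notify: Callable[[Payload], None], run_id: str) -> Callable[[Payload], None]:
    """Wrap a notify callable so every payload it sends carries run_id.

    Assumes notify takes one dict payload; the real signature may differ.
    """
    def _notify(payload: Payload) -> None:
        # Copy instead of mutating, so callers can reuse their dicts.
        notify({**payload, "run_id": run_id})

    return _notify
```

Computing the id once at the top of the pass and wrapping notify keeps the individual call sites unchanged, which fits the "no architecture change" goal.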

Acceptance criteria

  • Every react pass writes a Markdown table to $GITHUB_STEP_SUMMARY when set; behavior is unchanged when unset.
  • A paused pass renders "⏸ paused: <reason> — run deploybot control unpause" in the summary/status surface and makes the Actions run non-green; a normal empty pass stays green / exit 0.
  • A follow timeout renders as a visible non-green signal and appears in the summary, distinguishable from verified.
  • react's JSON includes a run_id, and the same id appears in any notify() payloads emitted during that pass.
  • waiting[] entries carry the existing classify() reason strings (e.g., "CI is not complete", "head changed after it was queued").
  • A single sticky "DeployBot status" comment/status is upserted (not duplicated) per pass.
  • Default stdout (text and --json) is byte-for-byte unchanged aside from the additive run_id field.
  • Unit tests cover: summary file writing, paused→non-zero exit, timed-out→non-zero exit, run_id propagation into notify, and sticky-comment upsert — following the existing unittest / patch("agent_merge_queue...") style in tests/test_cli.py.
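
As an illustration of the test style for the summary-writing criterion, here is a self-contained unittest sketch. write_step_summary is a stand-in defined inline so the example runs on its own; in the repository the tests would instead exercise the real CLI code and patch the project's own module path.

```python
import os
import tempfile
import unittest
from unittest.mock import patch


def write_step_summary(markdown: str) -> bool:
    """Stand-in for the real writer: append to $GITHUB_STEP_SUMMARY if set."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(markdown)
    return True


class StepSummaryTests(unittest.TestCase):
    def test_noop_when_unset(self):
        # clear=True removes GITHUB_STEP_SUMMARY even when run inside Actions.
        with patch.dict(os.environ, {}, clear=True):
            self.assertFalse(write_step_summary("| state | paused |\n"))

    def test_appends_when_set(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "summary.md")
            with patch.dict(os.environ, {"GITHUB_STEP_SUMMARY": path}):
                self.assertTrue(write_step_summary("first pass\n"))
                self.assertTrue(write_step_summary("second pass\n"))
            with open(path, encoding="utf-8") as handle:
                self.assertEqual(handle.read(), "first pass\nsecond pass\n")
```

Run it with python -m unittest. patch.dict restores the environment on exit, so the tests do not leak state into each other.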

Risks & mitigations

  • Noisy sticky comment → upsert a single comment in place rather than posting per pass; only update when content changes.
  • Unexpected non-zero exits breaking existing automation → scope non-zero strictly to paused and timed-out; document it; keep empty/normal passes exit 0.
  • Step-summary unavailable outside Actions → guard on GITHUB_STEP_SUMMARY presence; no-op locally.
  • Scope creep into concurrency/idempotency fixes → explicitly out of scope here; this issue only surfaces existing state (duplicates become visible first, then fixable).
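
The first mitigation (upsert in place, update only on change) might look like the sketch below. It works on an in-memory list of comment dicts instead of the GitHub API, and the marker string is hypothetical, modeled loosely on the deploybot-control:v1 marker mentioned above; the actual marker format and the records.py helpers may differ.

```python
from typing import Any, Dict, List, Mapping

STATUS_MARKER = "<!-- deploybot-status:v1 -->"  # hypothetical marker string


def status_body(result: Mapping[str, Any]) -> str:
    """Build the sticky comment body for one pass."""
    state = result.get("state", "unknown")
    lines = [STATUS_MARKER, "DeployBot status", f"state: {state}"]
    if state == "paused":
        lines.append(f"reason: {result.get('reason', 'not given')}")
        lines.append("remedy: run `deploybot control unpause`")
    return "\n".join(lines)


def upsert_status(comments: List[Dict[str, str]], body: str) -> str:
    """Create or update the single marked comment.

    Returns "created", "updated", or "unchanged"; the caller would skip
    the API write in the "unchanged" case.
    """
    for comment in comments:
        if comment["body"].startswith(STATUS_MARKER):
            if comment["body"] == body:
                return "unchanged"
            comment["body"] = body
            return "updated"
    comments.append({"body": body})
    return "created"
```

Keeping volatile fields such as timestamps or the run_id out of the body is what makes the "unchanged" branch reachable across consecutive identical passes.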

Out of scope (possible follow-ons)

  • Idempotency keys (per run_id + batch fingerprint) to prevent duplicate integration PRs / CI dispatches under event bursts.
  • Emitting paused and follow timeouts as notify() events for alerting.
  • Expanded react orchestration tests (promote→drain→overlap-integrate→follow, timed-out branch).
