Add deploybot doctor preflight/health-check command (with --json + MCP diagnose) #2

@pa911-eric

Summary

Add a read-only deploybot doctor command (plus a matching MCP diagnose tool) that runs an ordered set of environment and policy checks and reports ✓ / ⚠ / ✗ with a remediation hint per item.

DeployBot's value depends on GitHub state it does not own — gh presence, auth, token scopes, queue labels, exact required-check display names, trusted-actor logins, and a protective ruleset — yet these are validated lazily. The dominant failure mode for an alpha tool is therefore silent or cryptic misconfiguration (a raw gh stderr mid-operation, or a PR that waits forever), not a bug in the well-tested queue engine. doctor turns "it silently doesn't merge" into "here is exactly what to fix" in one command, with zero new dependencies or infrastructure.

Problem statement

Setup and ongoing reliability hinge on external GitHub state that DeployBot checks only at the point of use:

  • action.yml simply pip installs and runs drain --json — no preflight. (action.yml#L14-L20)
  • GitHub._run shells out to gh for every call; a non-zero exit surfaces as raw stderr. (cli.py#L586-L597)
  • trusted_actors / coordinator_actors / allowed_reviewers are free-text logins, only shape-checked — never verified against GitHub, so a typo silently breaks all marker trust. (config.py#L189-L209)
  • required_checks names must exactly match GitHub check display names; a mismatch makes a PR wait forever with a generic "is not complete." (cli.py#L530-L535)
  • The security model (independent ruleset, no-bypass merge credential) is prose-only with nothing to detect drift. (README.md#L49-L60)
  • Labels must be created via a separate ensure-labels step or every PR reports "queue authorization label is missing." (cli.py#L516-L517)

There is no single, fast, read-only way to confirm "this repo is correctly wired for DeployBot."

Proposed behavior

A new read-only deploybot doctor command runs the checks below, each independently degradable so one failure doesn't abort the rest. It exits 0 when there are no hard ✗ failures (⚠ warnings allowed) and non-zero when any ✗ is present.

  1. Tooling: gh --version resolvable; else ✗ "install GitHub CLI."
  2. Auth: gh auth status succeeds; report the active account; ✗ with a gh auth login hint on failure.
  3. Config: load_config() parses; surface ConfigError as a clean ✗ (no traceback).
  4. Repository: resolve owner/name (repo view) and confirm read access.
  5. Labels: queue/blocked labels exist; "run deploybot ensure-labels" if missing.
  6. Trusted/coordinator/reviewer logins: resolve @repository-owner; verify each login exists and (best-effort) has access; flag any unknown login (the silent killer).
  7. Required checks (best-effort): sample recent runs on the base branch (or an open queued PR) and ⚠ if a configured required_checks name is never observed → likely display-name mismatch.
  8. Branch protection / ruleset (best-effort): fetch protection/rulesets for base_branch; ⚠ if the configured required checks aren't independently enforced, or if DeployBot's identity could bypass them.

Checks 7–8 are advisory (⚠ only) and must never hard-fail.
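The ordering and degradation rules above can be sketched as a small runner. This is a minimal sketch, not DeployBot's code: Row, Check, run_checks, and exit_code are hypothetical names, and reporting a skipped check as a warn row is an assumption (the issue only fixes the status set to ok, warn, fail).

```python
from dataclasses import dataclass
from typing import Callable, List, NamedTuple

OK, WARN, FAIL = "ok", "warn", "fail"


@dataclass
class Row:
    """One line of the doctor report (hypothetical shape)."""
    check: str
    status: str
    detail: str
    hint: str = ""


class Check(NamedTuple):
    name: str
    run: Callable[[], Row]
    needs_auth: bool = False  # skipped when the auth check failed
    advisory: bool = False    # checks 7-8: never worse than a warning


def run_checks(checks: List[Check]) -> List[Row]:
    """Run every check in order; a failing check never aborts the rest."""
    rows: List[Row] = []
    auth_ok = True
    for check in checks:
        if check.needs_auth and not auth_ok:
            # Assumption: a skipped check is reported as a warning row.
            rows.append(Row(check.name, WARN, "skipped (no auth)", "run gh auth login"))
            continue
        try:
            row = check.run()
        except Exception as exc:  # convert any probe error into a row
            row = Row(check.name, FAIL, str(exc))
        if check.advisory and row.status == FAIL:
            row.status = WARN  # advisory checks are downgraded, never hard-fail
        if check.name == "auth" and row.status == FAIL:
            auth_ok = False
        rows.append(row)
    return rows


def exit_code(rows: List[Row]) -> int:
    """Non-zero only when at least one row is a hard failure."""
    return 1 if any(row.status == FAIL for row in rows) else 0
```

Keeping the skip and downgrade logic in the runner, rather than in each check, means an individual check only has to return a Row or raise.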

CLI / API / config changes

  • New subparser doctor with --json (mirrors plan/status conventions); no positional args.
  • New command_doctor(client, *, json_output) returning a list of {check, status, detail, hint} dicts; text mode renders the same data.
  • New MCP tool diagnose(repository=None, config=None) calling doctor --json, keeping human/agent parity with the rest of mcp_server.py.
  • No .mergequeue.toml schema changes.
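As a rough illustration of the CLI surface listed above, the sketch below wires a doctor subparser with --json and renders one list of rows in both modes. It assumes an argparse-based CLI; build_parser, render, and the SYMBOLS mapping are hypothetical names, and the exact text layout is an assumption.

```python
import argparse
import json

# Assumed mapping from JSON status values to the glyphs used in text mode.
SYMBOLS = {"ok": "✓", "warn": "⚠", "fail": "✗"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="deploybot")
    sub = parser.add_subparsers(dest="command", required=True)
    doctor = sub.add_parser("doctor", help="read-only preflight checks")
    # No positional arguments; --json switches to machine-readable output.
    doctor.add_argument("--json", action="store_true", dest="json_output")
    return parser


def render(rows: list, json_output: bool) -> str:
    """Render the same {check, status, detail, hint} rows as JSON or text."""
    if json_output:
        return json.dumps(rows, indent=2, ensure_ascii=False)
    lines = []
    for row in rows:
        line = f"{SYMBOLS[row['status']]} {row['check']}: {row['detail']}"
        if row["hint"]:
            line += f" (hint: {row['hint']})"
        lines.append(line)
    return "\n".join(lines)
```

Because both modes read from the same rows, an MCP tool that returns the JSON form stays in step with what a human sees in the terminal.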

Backward compatibility

Purely additive: a new command + new MCP tool. No change to existing markers, batch format, other commands' exit codes, or config shape. Safe to ship in a minor release (e.g., v0.2.0); commit-pinned workflows are unaffected.

Telemetry / logging needs

  • No external telemetry.
  • Reuse the existing --json output pattern for a severity-tagged report.
  • Each gh probe doctor makes must catch QueueError / non-zero exit and convert it into a row instead of crashing — this also seeds a reusable "non-fatal probe" helper for future commands.
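One possible shape for the "non-fatal probe" helper mentioned in the last bullet is sketched below. The QueueError class here is a stand-in for DeployBot's own exception, and run_cmd and probe are hypothetical helpers rather than existing functions in cli.py.

```python
import subprocess
import sys


class QueueError(Exception):
    """Stand-in for DeployBot's own error type (assumption)."""


def run_cmd(argv, timeout=15):
    """Run a command and raise QueueError on a non-zero exit."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise QueueError(proc.stderr.strip() or f"exit status {proc.returncode}")
    return proc.stdout.strip()


def probe(check, fn, hint, *, advisory=False):
    """Call fn() and always return a report row instead of raising."""
    try:
        detail = fn()
    except (QueueError, OSError, subprocess.SubprocessError) as exc:
        # OSError covers a missing binary; SubprocessError covers timeouts.
        status = "warn" if advisory else "fail"
        return {"check": check, "status": status, "detail": str(exc), "hint": hint}
    return {"check": check, "status": "ok", "detail": detail, "hint": ""}


if __name__ == "__main__":
    # Demo with the current Python interpreter standing in for gh.
    failing = [sys.executable, "-c", "import sys; sys.exit('boom')"]
    print(probe("demo", lambda: run_cmd(failing), "fix the demo command"))
```

A doctor check would then be a one-liner such as probe("auth", lambda: run_cmd(["gh", "auth", "status"]), "run gh auth login"), with the row carrying the stderr text that would otherwise surface raw.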

Acceptance criteria

  • deploybot doctor exits 0 on a correctly configured repo and prints one line per check.
  • With no gh on PATH, doctor prints a single ✗ "GitHub CLI not found" with an install hint and exits non-zero, without a Python traceback.
  • With gh present but unauthenticated, the auth check is ✗ with a gh auth login hint; later network-dependent checks degrade to "skipped (no auth)" rather than crashing.
  • A malformed .mergequeue.toml yields a single ✗ carrying the ConfigError message (no traceback), and remaining checks still run where possible.
  • Missing queue/blocked labels produce a finding recommending deploybot ensure-labels; after running it, doctor reports ✓.
  • A trusted_actors entry that is not a real GitHub login produces a finding naming the offending login; @repository-owner resolves correctly against owner/name and passes.
  • A required_checks name not present in observed base-branch/PR check runs produces a ⚠ flagging a possible display-name mismatch (advisory only, never exit-failing).
  • deploybot doctor --json emits a stable array of {check, status, detail, hint} objects with status ∈ {ok, warn, fail}; exit code is non-zero iff any status == "fail".
  • The MCP diagnose tool returns the same JSON payload as doctor --json.
  • Unit tests cover: missing-gh, unauth, bad-config, missing-labels, unknown trusted actor, check-name mismatch, and exit-code semantics — following the existing unittest / patch("agent_merge_queue.cli...") style in tests/test_cli.py.
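A test for the missing-gh and exit-code criteria might look like the sketch below. It is self-contained for illustration: check_tooling and exit_code are hypothetical stand-ins defined inline, so it patches shutil.which directly, whereas real tests would patch names under agent_merge_queue.cli as in tests/test_cli.py.

```python
import shutil
import unittest
from unittest.mock import patch


def check_tooling():
    """Hypothetical tooling check: is gh resolvable on PATH?"""
    if shutil.which("gh") is None:
        return {"check": "tooling", "status": "fail",
                "detail": "GitHub CLI not found", "hint": "install GitHub CLI"}
    return {"check": "tooling", "status": "ok",
            "detail": "gh found on PATH", "hint": ""}


def exit_code(rows):
    """Non-zero iff any row has status 'fail'."""
    return 1 if any(row["status"] == "fail" for row in rows) else 0


class DoctorToolingTests(unittest.TestCase):
    def test_missing_gh_is_a_fail_row_not_an_exception(self):
        with patch("shutil.which", return_value=None):
            row = check_tooling()
        self.assertEqual(row["status"], "fail")
        self.assertEqual(row["detail"], "GitHub CLI not found")
        self.assertEqual(exit_code([row]), 1)

    def test_gh_present_is_ok(self):
        with patch("shutil.which", return_value="/usr/bin/gh"):
            row = check_tooling()
        self.assertEqual(row["status"], "ok")

    def test_warnings_alone_exit_zero(self):
        rows = [{"check": "required_checks", "status": "warn",
                 "detail": "name never observed", "hint": ""}]
        self.assertEqual(exit_code(rows), 0)
```

Saved as a test module, this runs under python -m unittest without needing gh installed, since the PATH lookup is patched in each case.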

Risks & mitigations

  • False alarms on rulesets/check names → mark checks 7–8 best-effort/⚠ only; document as advisory.
  • Token lacks admin scope to read protection → treat "cannot read" as "insufficient scope to verify," not ✗.
  • Extra gh calls / rate limits → each probe is a single lightweight call; gate the heavier check-name sampling behind an available open queued PR.
  • Scope creep into auto-fixing → MVP is strictly read-only; only reference ensure-labels as a hint, never invoke it.

Out of scope (possible follow-ons)

  • deploybot doctor --fix for safe, confirmed remediations.
  • --log-level / structured logging for diagnosing drain failures in CI.
  • A doctor summary line in status/plan output (e.g., "⚠ 1 setup issue — run deploybot doctor").
