Skip to content

Harden ensure restart paths: lsof fallback, SSE broken-handshake, ROADMAP docs #16

Description

@Alajmah

Summary

Follow-up hardening for the ensure subcommand (merged in #15 / 692cfd6). Three items found during post-merge review of the actual merged files. The strongest functional risk is #1 (lsof dependency).

1. Unix restart silently fails when lsof is unavailable

The risk: _find_listener_pid() (ensure.py:226-249) uses lsof -ti :<port> on non-Windows. If lsof is missing (common on minimal containers, Alpine/slim images, or environments with only ss/fuser), the except Exception: return None swallows the error. _stop_rest_listener() then returns without stopping anything, and _restart_rest() launches a new REST process that cannot bind the occupied port → ensure waits to timeout. The restart guarantee silently degrades with no warning.

Scope:

  • Add a fallback chain for Unix listener discovery: lsofss -tlnpfuser, or use psutil.net_connections if psutil is acceptable as a dependency.
  • Emit a clear warning (not silent swallow) when listener termination cannot be confirmed, so the operator knows the restart may not have taken effect.
  • Test: mock the absence of lsof and assert the fallback path still finds the listener (or warns clearly).

Windows path uses netstat -ano (ubiquitous on Windows) and is not affected.

2. SSE port-up-but-handshake-broken doesn't stop the existing listener

The gap: _reconcile_sse() treats TCP-up + failed MCP handshake as "not ready," then launches another SSE without stopping the existing listener. The broken listener still owns the port → the new process likely can't bind → ensure waits to timeout.

Accepted as out-of-scope for v1 (the pinned Phase 3 scope says SSE crash is rare and the hook re-runs). Track here as the first SSE lifecycle hardening item. Fix shape: mirror the REST restart path — _stop_sse_listener(sse_port) before relaunch when the port is up but the handshake fails.

3. ROADMAP docs stale after review fixes

  • docs/ROADMAP.md:141 says "15 unit tests" — the merged tests/test_ensure.py has 21 (6 added for the review fixes: lock-handle retention, starting+connected, restart ordering).
  • docs/ROADMAP.md:156 says "Wait until REST is healthy" — the implementation now accepts starting + chrome_running + cdp_connected + driver_connected as ready (via _rest_ready_for_ensure), not healthy only.

Acceptance criteria

  • Unix restart works without lsof (fallback chain or psutil), or fails with a clear diagnostic
  • SSE port-up-but-broken stops the existing listener before relaunch
  • ROADMAP test count and readiness wording corrected
  • Tests cover the lsof-absent fallback and the SSE broken-handshake restart

Priority

Medium. Not user-facing-breaking in the common case (both lsof-present and SSE-healthy), but the lsof gap makes the restart guarantee platform-conditional with no signal to the operator.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions