Summary
Follow-up hardening for the ensure subcommand (merged in #15 / 692cfd6). Three items found during post-merge review of the actual merged files. The strongest functional risk is #1 (lsof dependency).
1. Unix restart silently fails when lsof is unavailable
The risk: _find_listener_pid() (ensure.py:226-249) uses lsof -ti :<port> on non-Windows. If lsof is missing (common on minimal containers, Alpine/slim images, or environments with only ss/fuser), the except Exception: return None swallows the error. _stop_rest_listener() then returns without stopping anything, and _restart_rest() launches a new REST process that cannot bind the occupied port → ensure waits to timeout. The restart guarantee silently degrades with no warning.
Scope:
- Add a fallback chain for Unix listener discovery:
lsof → ss -tlnp → fuser, or use psutil.net_connections if psutil is acceptable as a dependency.
- Emit a clear warning (not silent swallow) when listener termination cannot be confirmed, so the operator knows the restart may not have taken effect.
- Test: mock the absence of
lsof and assert the fallback path still finds the listener (or warns clearly).
Windows path uses netstat -ano (ubiquitous on Windows) and is not affected.
2. SSE port-up-but-handshake-broken doesn't stop the existing listener
The gap: _reconcile_sse() treats TCP-up + failed MCP handshake as "not ready," then launches another SSE without stopping the existing listener. The broken listener still owns the port → the new process likely can't bind → ensure waits to timeout.
Accepted as out-of-scope for v1 (the pinned Phase 3 scope says SSE crash is rare and the hook re-runs). Track here as the first SSE lifecycle hardening item. Fix shape: mirror the REST restart path — _stop_sse_listener(sse_port) before relaunch when the port is up but the handshake fails.
3. ROADMAP docs stale after review fixes
docs/ROADMAP.md:141 says "15 unit tests" — the merged tests/test_ensure.py has 21 (6 added for the review fixes: lock-handle retention, starting+connected, restart ordering).
docs/ROADMAP.md:156 says "Wait until REST is healthy" — the implementation now accepts starting + chrome_running + cdp_connected + driver_connected as ready (via _rest_ready_for_ensure), not healthy only.
Acceptance criteria
Priority
Medium. Not user-facing-breaking in the common case (both lsof-present and SSE-healthy), but the lsof gap makes the restart guarantee platform-conditional with no signal to the operator.
Summary
Follow-up hardening for the
ensuresubcommand (merged in #15 /692cfd6). Three items found during post-merge review of the actual merged files. The strongest functional risk is #1 (lsofdependency).1. Unix restart silently fails when
lsofis unavailableThe risk:
_find_listener_pid()(ensure.py:226-249) useslsof -ti :<port>on non-Windows. Iflsofis missing (common on minimal containers, Alpine/slim images, or environments with onlyss/fuser), theexcept Exception: return Noneswallows the error._stop_rest_listener()then returns without stopping anything, and_restart_rest()launches a new REST process that cannot bind the occupied port →ensurewaits to timeout. The restart guarantee silently degrades with no warning.Scope:
lsof→ss -tlnp→fuser, or usepsutil.net_connectionsifpsutilis acceptable as a dependency.lsofand assert the fallback path still finds the listener (or warns clearly).Windows path uses
netstat -ano(ubiquitous on Windows) and is not affected.2. SSE port-up-but-handshake-broken doesn't stop the existing listener
The gap:
_reconcile_sse()treats TCP-up + failed MCP handshake as "not ready," then launches another SSE without stopping the existing listener. The broken listener still owns the port → the new process likely can't bind →ensurewaits to timeout.Accepted as out-of-scope for v1 (the pinned Phase 3 scope says SSE crash is rare and the hook re-runs). Track here as the first SSE lifecycle hardening item. Fix shape: mirror the REST restart path —
_stop_sse_listener(sse_port)before relaunch when the port is up but the handshake fails.3. ROADMAP docs stale after review fixes
docs/ROADMAP.md:141says "15 unit tests" — the mergedtests/test_ensure.pyhas 21 (6 added for the review fixes: lock-handle retention,starting+connected, restart ordering).docs/ROADMAP.md:156says "Wait until REST is healthy" — the implementation now acceptsstarting+chrome_running+cdp_connected+driver_connectedas ready (via_rest_ready_for_ensure), nothealthyonly.Acceptance criteria
lsof(fallback chain orpsutil), or fails with a clear diagnosticlsof-absent fallback and the SSE broken-handshake restartPriority
Medium. Not user-facing-breaking in the common case (both
lsof-present and SSE-healthy), but thelsofgap makes the restart guarantee platform-conditional with no signal to the operator.