Runner stuck in_progress after ssh-action command_timeout — cancel API has no effect #192009
Replies: 2 comments
-
|
💬 Your Product Feedback Has Been Submitted 🎉 Thank you for taking the time to share your insights with us! Your feedback is invaluable as we build a better GitHub experience for all our users. Here's what you can expect moving forward ⏩
Where to look to see what's shipping 👀
What you can do in the meantime 💻
As a member of the GitHub community, your participation is essential. While we can't promise that every suggestion will be implemented, we want to emphasize that your feedback is instrumental in guiding our decisions and priorities. Thank you once again for your contribution to making GitHub even better! We're grateful for your ongoing support and collaboration in shaping the future of our platform. ⭐ |
Beta Was this translation helpful? Give feedback.
-
|
From my perspective, this looks like a race condition between job-level cancellation handling and the underlying process execution on the runner. The key issue is that the SSH step ( What is more concerning is the runner state remaining In practice, this creates a situation where:
From an operational standpoint, this defeats the purpose of both A possible area to investigate is whether the SSH action or ARC runner is correctly handling SIGTERM/SIGINT signals during blocked network I/O, and whether the runner agent continues heartbeat reporting while the step is hung. At minimum, I would expect cancellation to forcibly terminate the job container or runner process within a bounded time window, even if the remote SSH command cannot be cleanly interrupted. Overall, this appears to be a gap in cancellation propagation and runner-level enforcement under blocked external execution. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
🏷️ Discussion Type
Bug
💬 Feature/Topic Area
ARC (Actions Runner Controller)
Discussion Details
Run: https://github.com/Wyvern-Corp-PH/typedbyhand-classroom/actions/runs/24193704696/job/70618815646
What happened:
The
cleanup-on-failurejob (job ID70618815646) usedappleboy/ssh-action@v1.0.3withcommand_timeout: 1m. The SSH command hung because the EC2 host was still under load from a prior timed-out deploy. The job never self-terminated after the 1-minute timeout.Two
gh run cancelAPI calls were submitted via CLI — both returned HTTP 200 and "Request to cancel workflow submitted" — but the job stayedin_progressin the UI for 10+ minutes with no effect.The workflow has a
concurrencygroup (deploy-ec2-refs/heads/main), so the stuck run blocked all subsequent deploys from starting.Expected: Runner terminates when
command_timeoutis exceeded. Run-level cancel kills in-progress jobs within ~2 minutes.Actual: Job stayed
in_progressindefinitely. Cancel requests were acknowledged but not acted upon.Runner:
ubuntu-latest(GitHub-hosted)Action:
appleboy/ssh-action@v1.0.3Date: 2026-04-09 ~13:54 UTC
Beta Was this translation helpful? Give feedback.
All reactions