Runner stuck in_progress after ssh-action command_timeout — cancel API has no effect #192009

bakajieyan · 2026-04-09T14:13:18Z

🏷️ Discussion Type

Bug

💬 Feature/Topic Area

ARC (Actions Runner Controller)

Discussion Details

Run: https://github.com/Wyvern-Corp-PH/typedbyhand-classroom/actions/runs/24193704696/job/70618815646

What happened:
The cleanup-on-failure job (job ID 70618815646) used appleboy/ssh-action@v1.0.3 with command_timeout: 1m. The SSH command hung because the EC2 host was still under load from a prior timed-out deploy. The job never self-terminated after the 1-minute timeout.

Two gh run cancel API calls were submitted via CLI — both returned HTTP 200 and "Request to cancel workflow submitted" — but the job stayed in_progress in the UI for 10+ minutes with no effect.

The workflow has a concurrency group (deploy-ec2-refs/heads/main), so the stuck run blocked all subsequent deploys from starting.

Expected: Runner terminates when command_timeout is exceeded. Run-level cancel kills in-progress jobs within ~2 minutes.

Actual: Job stayed in_progress indefinitely. Cancel requests were acknowledged but not acted upon.
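
For a run that acknowledges gh run cancel but never stops, the GitHub REST API also documents a separate force-cancel endpoint (POST /repos/{owner}/{repo}/actions/runs/{run_id}/force-cancel), intended for runs that do not respond to the normal cancel request. The following is a minimal Python sketch of calling it, not a confirmed fix for this case: the helper names are hypothetical, the token is assumed to be permitted to cancel runs in the repository, and whether force-cancel would have released this particular job is untested.

```python
import urllib.request

API_ROOT = "https://api.github.com"


def build_force_cancel_request(owner, repo, run_id, token):
    """Build (but do not send) a force-cancel request for a workflow run.

    The helper name and argument layout are illustrative assumptions; only the
    endpoint path comes from the GitHub REST API documentation.
    """
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/runs/{run_id}/force-cancel"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def force_cancel(owner, repo, run_id, token):
    """Send the force-cancel request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a normal
    return means the API accepted the request.
    """
    request = build_force_cancel_request(owner, repo, run_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

For the run linked above, the call would look like force_cancel("Wyvern-Corp-PH", "typedbyhand-classroom", 24193704696, token), with token read from wherever the credential is stored.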

Runner: ubuntu-latest (GitHub-hosted)
Action: appleboy/ssh-action@v1.0.3
Date: 2026-04-09 ~13:54 UTC

2026-04-09T14:14:00Z

github-actions[bot]
bot Apr 9, 2026

💬 Your Product Feedback Has Been Submitted 🎉

Thank you for taking the time to share your insights with us! Your feedback is invaluable as we build a better GitHub experience for all our users.

Here's what you can expect moving forward ⏩

- Your input will be carefully reviewed and cataloged by members of our product teams.
  - Due to the high volume of submissions, we may not always be able to provide individual responses.
  - Rest assured, your feedback will help chart our course for product improvements.
- Other users may engage with your post, sharing their own perspectives or experiences.
- GitHub staff may reach out for further clarification or insight.
  - We may 'Answer' your discussion if there is a current solution, workaround, or roadmap/changelog post related to the feedback.

Where to look to see what's shipping 👀

- Read the Changelog for real-time updates on the latest GitHub features, enhancements, and calls for feedback.
- Explore our Product Roadmap, which details upcoming major releases and initiatives.

What you can do in the meantime 💻

- Upvote and comment on other user feedback Discussions that resonate with you.
- Add more information at any point! Useful details include: use cases, relevant labels, desired outcomes, and any accompanying screenshots.

As a member of the GitHub community, your participation is essential. While we can't promise that every suggestion will be implemented, we want to emphasize that your feedback is instrumental in guiding our decisions and priorities.

Thank you once again for your contribution to making GitHub even better! We're grateful for your ongoing support and collaboration in shaping the future of our platform. ⭐

abinaze · 2026-04-10T17:05:46Z

From my perspective, this looks like a race condition between job-level cancellation handling and the underlying process execution on the runner.

The key issue is that the SSH step (appleboy/ssh-action@v1.0.3) is likely blocked at the process level because the remote command is still alive or stuck under high load, even after command_timeout has elapsed. In that case the timeout may stop only the action wrapper, without guaranteeing that the remote process is terminated or that the SSH session is cleaned up.

More concerning is that the job remained in_progress even after explicit gh run cancel calls. If the cancel request is acknowledged but the job state never changes, it suggests that the runner is not polling for, or not reacting to, cancellation signals while it is blocked in a long-running SSH operation.

In practice, this creates a situation where:

- The job is effectively stuck in a non-interruptible step
- Cancellation requests do not propagate to the runner process
- Concurrency groups remain locked indefinitely, blocking deployment pipelines

From an operational standpoint, this defeats the purpose of both command_timeout and run-level cancellation as recovery mechanisms.
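
A watchdog outside the workflow could at least detect this state before the concurrency group stays locked for long. The sketch below assumes a job object shaped like the response of the REST endpoint GET /repos/{owner}/{repo}/actions/jobs/{job_id}, which includes status and started_at fields. The function name is_job_stuck and the runtime threshold are hypothetical, and fetching the job is left out to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone


def is_job_stuck(job, now, max_runtime):
    """Return True if the job is still in_progress past max_runtime.

    `job` is assumed to be a dict parsed from the jobs API response;
    `now` must be a timezone-aware datetime.
    """
    if job.get("status") != "in_progress":
        return False
    started_at = job.get("started_at")
    if not started_at:
        return False
    # The API reports UTC timestamps with a trailing "Z"; normalise it so
    # datetime.fromisoformat also accepts it on older Python versions.
    started = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    return now - started > max_runtime


# Illustrative payload only; the timestamp approximates the time in the report.
example_job = {"status": "in_progress", "started_at": "2026-04-09T13:54:00Z"}
example_now = datetime(2026, 4, 9, 14, 5, tzinfo=timezone.utc)
stuck = is_job_stuck(example_job, example_now, timedelta(minutes=2))
```

A scheduled script could combine such a check with a cancel or force-cancel request, so that a hung cleanup job does not block later deploys indefinitely.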

A possible area to investigate is whether the SSH action or the runner correctly handles SIGTERM/SIGINT during blocked network I/O, and whether the runner agent keeps sending heartbeats while the step is hung. Note that the discussion is filed under ARC, while the report lists a GitHub-hosted ubuntu-latest runner.

At minimum, I would expect cancellation to forcibly terminate the job container or runner process within a bounded time window, even if the remote SSH command cannot be cleanly interrupted.
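
The bounded-time behaviour expected here can be illustrated locally. The following Python sketch is not how the runner or ssh-action is implemented. It only shows the pattern of a hard deadline followed by escalation from SIGTERM to SIGKILL on POSIX systems, so the wrapper returns within roughly timeout_s + grace_s even when the child ignores the polite signal. The function name and parameters are hypothetical.

```python
import subprocess


def run_with_hard_timeout(cmd, timeout_s, grace_s=5.0):
    """Run cmd and enforce a bounded runtime.

    Returns (returncode, timed_out). After timeout_s the child receives
    SIGTERM; if it is still alive grace_s seconds later it receives SIGKILL.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        proc.terminate()  # polite request: SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # forced: SIGKILL on POSIX
            proc.wait()  # reap the child so no zombie is left behind
        return proc.returncode, True
```

Wrapping a local ssh invocation this way would bound only the local client process. A command that keeps running on the remote host still has to be handled there, for example by a timeout on the remote side.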

Overall, this appears to be a gap in cancellation propagation and runner-level enforcement under blocked external execution.
