actions/runner-controller listener behavior #164628
Replies: 2 comments
Why I’m opening this discussion

While operating ephemeral self-hosted runners with Actions Runner Controller (ARC) on Kubernetes, I routinely see runners die with the error “Failed to create a session. The runner registration has been deleted from the server, please re-configure.” This seems to coincide with node drains initiated by our autoscaler (Karpenter). Because the failure blocks new jobs until ARC recreates fresh pods, I’d like to confirm whether this behaviour is expected or signals a mis-configuration.

What I hope to learn from the community:

1. Expected life-cycle: GitHub’s backend will delete a runner registration that hasn’t connected “recently,” and the agent then exits with the exact error above. Is that the normal outcome for an evicted or slow-starting pod?
2. Listener resiliency: If the listener pod is rescheduled, will in-flight runners automatically receive a new session, or is it safer to prevent disruptions to listener pods altogether?
3. Silent failures: Issue #3748 describes registrations being deleted mid-job, causing unexpected cancellations. Can a pod keep running yet silently stop picking up work after such a loss?
4. Best-practice handling: Should we rely on ARC’s reconciliation loop to recycle the crashed pods, or add our own health checks / PodDisruptionBudgets and restart logic?

Topic area: ARC (Actions Runner Controller)
Why are you starting this discussion?
Question
What GitHub Actions topic or product is this about?
ARC (Actions Runner Controller)
Discussion Details
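Hi Team, I’m running ephemeral self-hosted runners on Kubernetes using Actions Runner Controller (actions/actions-runner-controller). Occasionally, I see the following error in the logs, which I suspect is related to node disruptions (e.g., node drains triggered by Karpenter): “Failed to create a session. The runner registration has been deleted from the server, please re-configure.”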
My questions:
1. Is it expected that runners exit like this if their registration is deleted before they connect?
2. Can the runner recover if the listener pod is rescheduled (e.g., after a node drain), or should we avoid disrupting listener pods?
3. Is it possible for a pod to keep running but silently fail to pick up jobs because it lost registration?
4. What’s the best practice to handle this—should we rely on the controller, or monitor and restart these pods ourselves?
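To make question 4 concrete, here is a minimal sketch of the kind of self-managed check I have in mind. It is only an illustration under stated assumptions, not something ARC provides: `pods_to_recycle` and the `logs_by_pod` mapping are hypothetical names, the sample log text is made up apart from the error quoted above, and fetching the logs and deleting the flagged pods are left out entirely.

```python
from typing import Dict, List

# Substring of the error quoted in this discussion. Matching on it is an
# assumption about how one might detect a lost registration from runner logs.
DEREGISTERED_MARKER = "The runner registration has been deleted from the server"


def pods_to_recycle(logs_by_pod: Dict[str, str]) -> List[str]:
    """Return the names of pods whose logs contain the marker, sorted by name."""
    return sorted(
        pod
        for pod, log_text in logs_by_pod.items()
        if DEREGISTERED_MARKER in log_text
    )


if __name__ == "__main__":
    # Hypothetical pod names and log contents, for illustration only.
    sample = {
        "runner-a": "Failed to create a session. The runner registration has been "
                    "deleted from the server, please re-configure.",
        "runner-b": "placeholder log line with no registration error",
    }
    print(pods_to_recycle(sample))
```

Whether a check like this is needed at all, or whether the controller's own reconciliation already covers this case, is exactly what I am unsure about.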