ci: nightly-dakota was stuck Running for 8+ days — add stuck-workflow alerting #216

@castrojo

What

nightly-dakota-1780628400 started on 2026-06-04 and was still in the Running phase on 2026-06-13 (8+ days). For that entire time it held the ghost-heavy-compute mutex, which blocked all new build workflows from running. It was discovered only while investigating dakota#841.

Why it matters

The stuck workflow masked the #841 boot failure: no new dakota builds ran for 8 days, so the broken 6/13 :testing image shipped and users hit it on real hardware before the nightly QA pipeline could catch it.

Fix

Add a maximum TTL / alert for stuck workflows. Options:

  1. Set activeDeadlineSeconds in the nightly CronWorkflow's workflowSpec (e.g. 14400, i.e. 4h) so it auto-terminates
  2. Add an Argo Events alert that fires if any workflow has been Running > 6h
  3. Extend the orphan-vm-cleanup CronWorkflow to also terminate workflows that have been Running > 6h
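
The check behind options 2 and 3 can be sketched in Python. This is a minimal sketch, not the cluster's actual tooling: it assumes the workflows have already been fetched as JSON objects shaped like Argo Workflow resources (metadata.name, status.phase, status.startedAt), and the function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from options 2 and 3 (Running > 6h).
MAX_RUNNING = timedelta(hours=6)


def parse_started_at(value):
    """Parse a UTC timestamp such as "2026-06-04T12:00:00Z"."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_stuck_workflows(workflows, now, max_running=MAX_RUNNING):
    """Return (name, age) for every workflow Running longer than max_running."""
    stuck = []
    for wf in workflows:
        status = wf.get("status", {})
        if status.get("phase") != "Running":
            continue  # finished or pending workflows are not "stuck"
        started = status.get("startedAt")
        if not started:
            continue  # no start time recorded yet
        age = now - parse_started_at(started)
        if age > max_running:
            stuck.append((wf["metadata"]["name"], age))
    return stuck


# Hypothetical sample data, for illustration only.
sample = [
    {"metadata": {"name": "example-old"},
     "status": {"phase": "Running", "startedAt": "2026-06-04T12:00:00Z"}},
    {"metadata": {"name": "example-fresh"},
     "status": {"phase": "Running", "startedAt": "2026-06-13T11:00:00Z"}},
    {"metadata": {"name": "example-done"},
     "status": {"phase": "Succeeded", "startedAt": "2026-06-01T00:00:00Z"}},
]
now = datetime(2026, 6, 13, 12, 0, tzinfo=timezone.utc)
stuck = find_stuck_workflows(sample, now)
for name, age in stuck:
    print(f"{name} has been Running for {age}")
```

The same list could feed either an alert (option 2) or a terminate step in the cleanup job (option 3). Terminating the workflow is also what would release the mutex.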

The ghost-heavy-compute mutex should never be held by a completed/stuck workflow — add a mutex watchdog.

Automatable

Yes — activeDeadlineSeconds on the CronWorkflow spec is a one-line fix. The alerting is a separate improvement.
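
For reference, the deadline is expressed in seconds and sits under spec.workflowSpec on a CronWorkflow. A minimal sketch of building that change as a JSON merge patch follows. The helper name deadline_patch is hypothetical, and the 4h value is the example from the Fix section, not a tested limit for this pipeline.

```python
import json


def deadline_patch(hours):
    """Build a merge patch that sets activeDeadlineSeconds on a CronWorkflow."""
    return {"spec": {"workflowSpec": {"activeDeadlineSeconds": int(hours * 3600)}}}


patch = deadline_patch(4)
print(json.dumps(patch))
# prints {"spec": {"workflowSpec": {"activeDeadlineSeconds": 14400}}}
```

Whatever value is chosen should comfortably exceed the longest healthy nightly run, otherwise legitimate builds would be terminated too.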
