What
nightly-dakota-1780628400 started on 2026-06-04 and was still in Running phase on 2026-06-13 (8+ days). It held the ghost-heavy-compute mutex, blocking all new build workflows from running. It was only discovered when investigating dakota#841.
Why it matters
The stuck workflow masked the #841 boot failure: no new dakota builds ran for 8 days, so the broken 6/13 :testing image shipped and users hit it on real hardware before the nightly QA pipeline could catch it.
Fix
Add a maximum TTL / alert for stuck workflows. Options:
- Set activeDeadlineSeconds on the nightly CronWorkflow (e.g. 4h) so it auto-terminates
- Add an Argo Events alert that fires if any workflow has been Running > 6h
- Extend orphan-vm-cleanup CronWorkflow to also terminate stuck workflows > 6h old
The ghost-heavy-compute mutex should never be held by a completed/stuck workflow — add a mutex watchdog.
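A minimal sketch of the "Running > 6h" check that the alert or the cleanup job could run, assuming workflows are available as JSON objects shaped like the items from kubectl get workflows -o json (metadata.name, status.phase, status.startedAt). The function names, the sample timestamps, and the second workflow name are hypothetical, and the 6h threshold is taken from the options above. This is not the existing orphan-vm-cleanup logic.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2026-06-04T12:00:00Z."""
    # fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def find_stuck_workflows(workflows, now, max_age=timedelta(hours=6)):
    """Return (name, age) for every workflow still Running after max_age."""
    stuck = []
    for wf in workflows:
        status = wf.get("status", {})
        started = status.get("startedAt")
        # Only workflows that are still Running and have a start time count.
        if status.get("phase") != "Running" or not started:
            continue
        age = now - parse_rfc3339(started)
        if age > max_age:
            stuck.append((wf["metadata"]["name"], age))
    return stuck


if __name__ == "__main__":
    # Illustrative data only; the timestamps are made up for the example.
    sample = [
        {"metadata": {"name": "nightly-dakota-1780628400"},
         "status": {"phase": "Running", "startedAt": "2026-06-04T12:00:00Z"}},
        {"metadata": {"name": "example-fresh-build"},
         "status": {"phase": "Running", "startedAt": "2026-06-13T10:00:00Z"}},
    ]
    now = datetime(2026, 6, 13, 12, 0, tzinfo=timezone.utc)
    for name, age in find_stuck_workflows(sample, now):
        print(f"stuck: {name} has been Running for {age}")
```

The check is kept as a pure function over already-fetched objects so the same logic can back either an alert or a terminate step. How the list is fetched and how a workflow is terminated are left out on purpose.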
Automatable
Yes. Setting activeDeadlineSeconds under the CronWorkflow's workflowSpec is a one-line fix (the value is in seconds, so 4h is 14400). The alerting is a separate improvement.