[CAPABILITY] Cron agent sessions do not survive gateway restarts — interrupted jobs vanish without record or retry #105

@Lexus2016

Problem (observed 2026-06-10, cost a full daily cycle)

The evolution-implementation job fired on schedule at 22:00:33 and worked for ~30 minutes (46 API calls). At 22:30:47 the gateway was restarted: the operator was applying a provider-failover config via Telegram, and the agent itself initiated the planned restart. The cron session was killed mid-flight:

  • no result was recorded — hermes cron list shows NO "Last run" for the job at all (as if it never ran);
  • no retry / no re-fire — the daily slot was simply lost;
  • downstream evolution-integration ran at 23:01, found nothing, reported "ok" — the whole night looked healthy while producing nothing.

Session log ends abruptly: agent.log last entry for cron_ac158635a029_20260610_220033 at 22:30:32, then the new gateway process initializes MCP at 22:30:51.

Why it matters

Gateway restarts are ROUTINE (config changes, the nightly hermes update auto-update at 04:17, /restart). Any cron job unlucky enough to overlap one silently dies. Because no failure is recorded, the loss is invisible: #83 (watchdog) would detect the gap the next day, but the work is still lost.

Proposed direction

Any of (in increasing ambition):

  1. On gateway startup, detect cron sessions that were in-flight at shutdown (marker file written at job start, cleared at completion) and record them as interrupted, making the failure visible to cron list and the future watchdog (#83).
  2. Re-fire interrupted jobs once on startup if still within N hours of their scheduled slot.
  3. Graceful drain: planned restarts already drain user sessions — extend the drain to wait (bounded) for running cron sessions, or delay the restart until the job completes.
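
Options 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: the marker directory, the function names (mark_started, mark_finished, scan_interrupted), the JSON fields, and the refire_window_hours parameter are all hypothetical and stand in for whatever the real scheduler exposes.

```python
import json
from pathlib import Path


def mark_started(marker_dir: Path, job: str, session_id: str, scheduled_at: float) -> Path:
    """Write a marker file when a cron session starts (hypothetical layout)."""
    marker_dir.mkdir(parents=True, exist_ok=True)
    path = marker_dir / f"{session_id}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(
        {"job": job, "session_id": session_id, "scheduled_at": scheduled_at}
    ))
    # Rename after writing so a crash mid-write never leaves a half-written marker.
    tmp.replace(path)
    return path


def mark_finished(marker_dir: Path, session_id: str) -> None:
    """Clear the marker once the session has completed and its result is recorded."""
    (marker_dir / f"{session_id}.json").unlink(missing_ok=True)


def scan_interrupted(marker_dir: Path, now: float, refire_window_hours: float) -> list[dict]:
    """Run once at gateway startup.

    Any marker still on disk belongs to a session that never reached
    mark_finished, so it is reported as interrupted. The "refire" flag is
    True only if the job is still within the window of its scheduled slot.
    """
    if not marker_dir.is_dir():
        return []
    interrupted = []
    for path in sorted(marker_dir.glob("*.json")):
        info = json.loads(path.read_text())
        age_hours = (now - info["scheduled_at"]) / 3600
        info["status"] = "interrupted"
        info["refire"] = age_hours <= refire_window_hours
        interrupted.append(info)
        # Remove the marker so the same interruption is reported only once.
        path.unlink()
    return interrupted
```

In this sketch the caller is expected to persist each returned record as the job's last run (so that cron list shows "interrupted" rather than nothing) and to re-fire at most once when "refire" is True. Deleting the marker during the scan is what keeps a second restart from re-firing the same slot again.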

Value

  • Impact: 0.8 (a routine operation silently destroys daily cycles; happened on day one of observation)
  • Effort: 0.4 (option 1 is a marker file + startup scan; options 2 and 3 are incremental)
  • Priority Score: 4.0

Metadata

Labels: enhancement (New feature or request)
