You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Problem (observed 2026-06-10, cost a full daily cycle)
evolution-implementation fired on schedule at 22:00:33 and worked for ~30 minutes (46 API calls). At 22:30:47 the gateway was restarted (operator was applying provider-failover config via Telegram; the agent itself initiated the planned restart). The cron session was killed mid-flight:
no result was recorded — hermes cron list shows NO "Last run" for the job at all (as if it never ran);
no retry / no re-fire — the daily slot was simply lost;
downstream evolution-integration ran at 23:01, found nothing, reported "ok" — the whole night looked healthy while producing nothing.
Session log ends abruptly: agent.log last entry for cron_ac158635a029_20260610_220033 at 22:30:32, then the new gateway process initializes MCP at 22:30:51.
Why it matters
Gateway restarts are ROUTINE (config changes, hermes update auto-update at 04:17 nightly, /restart). Any cron job unlucky enough to overlap one silently dies. Combined with the lack of failure records this is invisible — #83 (watchdog) would detect the gap next day, but the work is still lost.
Proposed direction
Any of (in increasing ambition):
On gateway startup, detect cron sessions that were in-flight at shutdown (marker file written at job start, cleared at completion) and record them as interrupted — making the failure visible to cron list and the future watchdog ([CAPABILITY] Evolution watchdog: detect silent pipeline failure and alert #83).
Re-fire interrupted jobs once on startup if still within N hours of their scheduled slot.
Graceful drain: planned restarts already drain user sessions — extend the drain to wait (bounded) for running cron sessions, or delay the restart until the job completes.
Value
Impact: 0.8 (a routine operation silently destroys daily cycles; happened on day one of observation)
Effort: 0.4 (option 1 is a marker file + startup scan; 2-3 incremental)
Problem (observed 2026-06-10, cost a full daily cycle)
evolution-implementationfired on schedule at 22:00:33 and worked for ~30 minutes (46 API calls). At 22:30:47 the gateway was restarted (operator was applying provider-failover config via Telegram; the agent itself initiated the planned restart). The cron session was killed mid-flight:hermes cron listshows NO "Last run" for the job at all (as if it never ran);evolution-integrationran at 23:01, found nothing, reported "ok" — the whole night looked healthy while producing nothing.Session log ends abruptly:
agent.loglast entry forcron_ac158635a029_20260610_220033at 22:30:32, then the new gateway process initializes MCP at 22:30:51.Why it matters
Gateway restarts are ROUTINE (config changes,
hermes updateauto-update at 04:17 nightly, /restart). Any cron job unlucky enough to overlap one silently dies. Combined with the lack of failure records this is invisible — #83 (watchdog) would detect the gap next day, but the work is still lost.Proposed direction
Any of (in increasing ambition):
interrupted— making the failure visible tocron listand the future watchdog ([CAPABILITY] Evolution watchdog: detect silent pipeline failure and alert #83).Value