Proposal: atomic last_stdout persistence + prefix_drifted flag (near-term); adopt A2A Handshake v2 (#71) as long-term solution for extractor drift (#108) #109

Description

@Interstellar-code

Proposal: fix agy receiver prefix-drift with atomic persistence + drift-flag (near-term); adopt A2A Handshake v2 (#71) as long-term solution

Summary

Issue #108 documents a concrete bug in the agy receiver's last_stdout prefix-strip extractor: after a receiver restart, the persisted last_stdout drifts from what the agy CLI's server-side conversation actually contains. This produces [no reply produced by agy] on individual turns and, on subsequent turns, replies that reference context the receiver never displayed.

This issue proposes a two-horizon fix:

  1. Near-term: Atomic last_stdout persistence + prefix_drifted flag + warning transcript record → makes the receiver honest about when it lost track.
  2. Long-term: Adopt the A2A Bidirectional Session Handshake v2 protocol (#71) so Hermes can detect session freshness at dispatch time and reconcile state before a turn lands in empty air.

The two are complementary: the near-term fix is a tactical patch that works within the current protocol; the long-term fix eliminates the entire class of bugs by making the orchestrator and executor share a common session model.


Near-term fix: atomic persistence + drift-flag

Root cause (from #108's transcript)

For v2611-agy-session:

  1. 14:07:20 — Turn 1 (PONG): agy returns handshake text. Receiver persists last_stdout = "PONG — Handshake Acknowledged".
  2. 14:09:24 — receiver SIGTERM-restarted (idle teardown). On-disk last_stdout is frozen at Turn 1.
  3. 14:09:52 — Turn 2 arrives at the fresh receiver, which resumes the agy conversation via --conversation <uuid>. agy produces cumulative stdout: the PONG handshake plus <reply-to-sky>. The new receiver tries to prefix-strip using the persisted last_stdout (Turn 1 only). That text should be a prefix of the new output, but the extractor slices with new_stdout[len(prior_stdout):], so if the persisted text differs from agy's output by even a newline, the prefix check fails and the "last non-empty line" fallback kicks in. In this case the fallback returned the empty string "", producing [no reply produced by agy].
  4. 15:32:39 — Turn 3 arrives. By now last_stdout has been updated as a side effect of Turn 2's empty result and holds nothing useful. The prefix-strip succeeds, but only because the prior last_stdout and the new stdout happen to overlap. The extractor outputs "42". The "sky" reply is permanently invisible to Hermes; it lives only in agy's server-side conversation.
  5. 15:33:44 — Turn 4. agy's server-side conversation has the full history (PONG, sky, 42, BLUE question). It replies referencing the sky question. Hermes sees this reply but never saw the sky answer — the user reads it as a cross-session bleed, even though it's a correct within-session recall whose display gap is the extractor's fault.
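The Turn 2 failure mode is easy to reproduce in isolation. The sketch below is a minimal stand-in for the extractor, assuming it checks startswith, slices by len(prior_stdout), and otherwise falls back to the last non-empty line. The name extract_reply and the sample strings are hypothetical and are not taken from templates/agy_receiver.py.

```python
def extract_reply(prior_stdout: str, new_stdout: str) -> tuple[str, bool]:
    """Return (reply, drifted). Hypothetical stand-in for the receiver's extractor."""
    if new_stdout.startswith(prior_stdout):
        # Happy path: everything after the persisted prefix is the new reply.
        return new_stdout[len(prior_stdout):].strip(), False
    # Prefix mismatch: fall back to the last non-empty line of the output.
    lines = [line for line in new_stdout.splitlines() if line.strip()]
    return (lines[-1].strip() if lines else ""), True


# Cumulative stdout after a resume: handshake followed by a two-line reply.
cumulative = "PONG\nThe sky is blue\nbecause of Rayleigh scattering.\n"

# Persisted prefix matches exactly: the whole reply is recovered.
exact = extract_reply("PONG\n", cumulative)

# Persisted prefix is off by one newline: the fallback keeps only the last line.
off_by_newline = extract_reply("PONG\n\n", cumulative)
```

With a one-character difference in the persisted prefix, the multi-line reply is cut down to its final line and the caller has no way to tell unless the drifted flag is surfaced.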

Proposed changes to templates/agy_receiver.py

1. Atomic last_stdout persistence

Problem: last_stdout is updated after run_agy_turn returns the extracted reply. If the receiver crashes between updating the in-memory store and flushing to disk (a2a-agy-sessions.json), the persisted last_stdout is stale and the next receiver starts behind.

The key window is in poll_inbox (pseudocode):

async def _process_one(self, msg, ctx_lock):
    # ctx_lock is already held — per-contextId serialization
    reply = await self._run_agy_turn(msg)  # agy runs, extractor parses
    # CRASH HERE → on-disk last_stdout = n-1 turns behind
    self._append_turn(msg["contextId"], "assistant", reply)  # in-memory only
    self._flush_store()  # writes a2a-agy-sessions.json
    await self._post_reply(reply)

Fix: Flush last_stdout to disk before agy runs, using the expected cumulative output (i.e., take the lock, read the current last_stdout, flush the predicted-resume state). Alternatively, and as a non-negotiable minimum, flush atomically after every successful agy return, inside the per-contextId lock, using atomic_write (write to a .tmp file, then os.rename it over the target; os.replace gives the same overwrite behavior on Windows, where os.rename fails if the target exists).

async def _process_one(self, msg, ctx_lock):
    async with ctx_lock:
        # Phase 1: lock, read prior stdout, dispatch agy
        prior = self._load_last_stdout(msg["contextId"])
        new_stdout = await self._run_agy(msg, prior)
        
        # Phase 2: extract reply
        reply = self._extract_reply(prior, new_stdout, msg["contextId"])
        
        # Phase 3: ATOMIC persist — flush before anything reads again
        self._save_last_stdout(msg["contextId"], new_stdout)  # uses atomic_write
        
        # Phase 4: post result
        await self._post_reply(reply, msg["contextId"])
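The atomic_write helper referenced above is not shown in this issue. A minimal sketch of what it might look like follows, assuming the store is a single JSON file and that a temp file in the same directory is acceptable. The name atomic_write_json is hypothetical.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Write data so readers see either the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        # Best-effort cleanup of the temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

In the Phase 3 step, _save_last_stdout would update the in-memory record and then call a helper like this on a2a-agy-sessions.json while still holding the per-contextId lock.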

2. prefix_drifted flag in a2a-agy-sessions.json

Problem: When the persisted last_stdout does not match the output agy produces on a resume turn (first turn after restart, or any turn after a crash), the extractor enters the "prefix does not match" branch (docstring line 62) which silently falls back to "last non-empty line". This fallback is fragile (multi-line markdown can be misread, and empty results mask the failure).

Fix: Add a prefix_drifted: bool field to the session record in a2a-agy-sessions.json:

{
  "v2611-agy-session": {
    "conversation_id": "ae6ce7ce-...",
    "last_stdout": "...",
    "prefix_drifted": true,
    "drifted_at": 1780493888,
    "updated_at": 1780493888
  }
}

When the prefix-strip detects a mismatch (not new_stdout.startswith(prior_stdout)):

  1. Set prefix_drifted: true with a timestamp.
  2. Emit a warning transcript record as a synthetic sys_warn message in the reply — a hermes.drifted_state entry in the JSON-RPC response, not part of the conversation text. Hermes (or the dashboard) can render this as an orange warning badge.
  3. Attempt the "last non-empty line" fallback as before — but now Hermes knows the result is unreliable.

When the prefix-strip succeeds on a subsequent turn (no mismatch → the drift self-healed or the extractor caught up):

  1. Set prefix_drifted: false.
  2. Emit a sys_info as a hermes.drift_recovered entry.

This visibility is the single most important improvement: it converts a silent corruption into a first-class observable event.
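The flag transitions above can be sketched as a small pure function. This is an illustration under assumptions: the event dictionary shape (type, level, at) is hypothetical, and emitting hermes.drift_recovered only on the true-to-false transition is one reading of the proposal rather than a settled design.

```python
import time
from typing import Optional


def update_drift_state(record: dict, prior_stdout: str, new_stdout: str,
                       now: Optional[int] = None) -> Optional[dict]:
    """Update the session record in place; return an event for the response, or None."""
    now = int(time.time()) if now is None else now
    was_drifted = record.get("prefix_drifted", False)
    drifted = not new_stdout.startswith(prior_stdout)
    record["prefix_drifted"] = drifted
    record["updated_at"] = now
    if drifted:
        # Mismatch: stamp the record and surface a warning (event shape is assumed).
        record["drifted_at"] = now
        return {"type": "hermes.drifted_state", "level": "sys_warn", "at": now}
    if was_drifted:
        # Prefix matches again after an earlier drift: report the recovery once.
        return {"type": "hermes.drift_recovered", "level": "sys_info", "at": now}
    return None
```

Keeping this logic free of I/O makes it easy to unit-test separately from the atomic flush that persists the record.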

3. Return [incomplete reply — drift detected] instead of [no reply produced by agy]

When prefix_drifted is true AND the fallback-produced-reply is empty:

  • Return [drift detected — persisted last_stdout does not match agy's cumulative output] instead of [no reply produced by agy].
  • This tells the user/reviewer immediately that the receiver lost track, rather than suggesting agy failed to reply.
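A sketch of the sentinel selection, assuming the two bracketed strings quoted in this section are the exact placeholders. The function name render_reply is hypothetical.

```python
NO_REPLY = "[no reply produced by agy]"
DRIFT_DETECTED = (
    "[drift detected — persisted last_stdout does not match "
    "agy's cumulative output]"
)


def render_reply(reply: str, prefix_drifted: bool) -> str:
    """Pick what the receiver posts back when the extracted reply may be empty."""
    if reply.strip():
        return reply
    # Empty extraction: say why, so a receiver-side drift is not blamed on agy.
    return DRIFT_DETECTED if prefix_drifted else NO_REPLY
```

A non-empty reply produced under drift is passed through unchanged here; the warning record carries the reliability signal in that case.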

Long-term fix: adopt #71's A2A Handshake v2 protocol

The prefix-drift bug is fundamentally a session state reconciliation failure. The receiver thinks it knows what agy said last; agy knows what it said last; there is no protocol path to reconcile the two. #71's protocol closes this gap completely:

What #71 offers that prevents this bug class

| Capability | How it fixes the drift |
| --- | --- |
| SESSION ANNOUNCE session_fresh | A freshly-started receiver announces session_fresh: true. Hermes knows: "this receiver just booted; it may not have the prior turn's state." Hermes can re-send the last turn as context, or route the task to a receiver with a warm session. |
| SESSION ANNOUNCE mcp_health | Would detect whether the agy CLI's conversation store is accessible (via Keychain, last_conversations.json). If not, Hermes knows the receiver can't do multi-turn. |
| TASK DISPATCH reply_schema: structured | Hermes instructs agy to emit JSON-formatted replies with explicit context_fresh: true/false, structured enough that Hermes can detect whether the reply references prior context or is a fresh answer. |
| TASK RESULT status: error partial | |
| TASK DISPATCH continuation_from | Hermes explicitly tells agy which prior task this continues from. If the receiver was restarted and lost context, agy's server-side conversation can still resolve it: the dispatch tells the receiver which context ID to resume, and the receiver can confirm it has the right conversation UUID. |

Adoption path

  1. Phase 1 (near-term patch): Implement the atomic persistence + drift-flag fix above. This works today with no protocol changes.
  2. Phase 2 (inline hints, per [RFC] A2A Bidirectional Session Handshake: Orchestrator-Worker Protocol v2 #71 Phase 1): Embed session_fresh awareness in Hermes gateway's existing fleet dispatch logic: check the receiver's drift-flag record before dispatching high-stakes tasks; flag prefix_drifted sessions for human review.
  3. Phase 3 (structured announce, per #71 Phase 2): Modify the agy receiver shim to emit SESSION ANNOUNCE on startup, including session_fresh. Hermes parses it, updates its peer profile cache, and uses it for routing decisions.
  4. Phase 4 (structured result, per #71 Phase 3): Modify the receiver to wrap replies in structured TASK RESULT envelopes. The extractor's warning flags become first-class fields in the structured result.
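The Phase 2 gating could look roughly like this on the Hermes side. This is only a sketch: the Decision enum, the decide_dispatch name, and the idea that Hermes sees a per-session dict carrying prefix_drifted and session_fresh are all assumptions for illustration, not part of #71 or the current gateway.

```python
from enum import Enum


class Decision(Enum):
    DISPATCH = "dispatch"
    DISPATCH_WITH_CONTEXT = "dispatch_with_context"  # re-send the last turn as context
    HOLD_FOR_REVIEW = "hold_for_review"              # flag for a human before sending


def decide_dispatch(session: dict, high_stakes: bool) -> Decision:
    """Choose how to route a task given the receiver's reported session state."""
    if session.get("prefix_drifted", False):
        # The receiver lost track of agy's output: never send high-stakes work blind.
        return Decision.HOLD_FOR_REVIEW if high_stakes else Decision.DISPATCH_WITH_CONTEXT
    if session.get("session_fresh", False):
        # Freshly booted receiver: it may be missing the prior turn's state.
        return Decision.DISPATCH_WITH_CONTEXT
    return Decision.DISPATCH
```

Whether a drifted session should block all dispatches rather than only high-stakes ones is the first open question below; this sketch takes the softer option.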

Open questions

  1. Should prefix_drifted: true block further dispatches to that contextId until a reconciliation turn (re-send the last expected prompt) succeeds? Or is the warning sufficient?
  2. The atomic-write flush adds a disk write per agy turn (agy can take 10-30s per turn — one write is negligible). Confirm the overhead is acceptable.
  3. For the prefix_drifted warning record: should it be emitted as a separate JSON-RPC message to Hermes (so the dashboard can render an inline badge), or just logged in the receiver's own log? I lean toward separate message — visibility is the whole point.
