Proposal: atomic last_stdout persistence + prefix_drifted flag (near-term); adopt A2A Handshake v2 (#71) as long-term solution for extractor drift (#108)

# Proposal: fix agy receiver prefix-drift with atomic persistence + drift-flag (near-term); adopt A2A Handshake v2 (#71) as long-term solution

## Summary

Issue #108 documents a concrete bug in the agy receiver's `last_stdout` prefix-strip extractor: after a receiver restart, the persisted `last_stdout` drifts from what the agy CLI's server-side conversation actually contains. This produces `[no reply produced by agy]` on individual turns and, on subsequent turns, replies that reference context the receiver never displayed.

This issue proposes a **two-horizon fix**:
1. **Near-term**: Atomic `last_stdout` persistence + `prefix_drifted` flag + warning transcript record → makes the receiver honest about when it lost track.
2. **Long-term**: Adopt the A2A Bidirectional Session Handshake v2 protocol ([#71](https://github.com/Interstellar-code/hermes-agent/issues/71)) so Hermes can detect session freshness at dispatch time and reconcile state *before* a turn lands in empty air.

The two are complementary: the near-term fix is a tactical patch that works within the current protocol; the long-term fix eliminates the entire class of bugs by making the orchestrator and executor share a common session model.

---

## Near-term fix: atomic persistence + drift-flag

### Root cause (from #108's transcript)

For `v2611-agy-session`:
1. **14:07:20** — Turn 1 (PONG): agy returns handshake text. Receiver persists `last_stdout` = `"PONG — Handshake Acknowledged"`.
2. **14:09:24** — receiver SIGTERM-restarted (idle teardown). On-disk `last_stdout` is frozen at Turn 1.
3. **14:09:52** — Turn 2 arrives at the fresh receiver, which resumes the agy conversation via `--conversation <uuid>`. agy produces cumulative stdout: `PONG handshake + <reply-to-sky>`. The receiver prefix-strips with the persisted `last_stdout` (Turn 1 only), which should be a *prefix* of the new output. However, the extractor slices with `new_stdout[len(prior_stdout):]`, and if the persisted text differs from the new output by even a newline, the prefix check fails and the "last non-empty line" fallback kicks in. Here the fallback returned an empty string, producing `[no reply produced by agy]`.
4. **15:32:39** — Turn 3 arrives. By now `last_stdout` has been overwritten as a result of Turn 2's empty extraction and no longer holds anything useful. The prefix-strip succeeds, but only because the prior `last_stdout` and the new stdout happen to overlap. The extractor outputs "42". The "sky" reply is permanently invisible to Hermes; it lives only in agy's server-side conversation.
5. **15:33:44** — Turn 4. agy's server-side conversation has the full history (PONG, sky, 42, BLUE question), so it replies referencing the sky question. Hermes sees this reply but never saw the sky answer, so the user reads it as a cross-session bleed, even though it is a correct within-session recall whose display gap is the extractor's fault.
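
The failure mode in steps 3 and 4 can be reproduced in a few lines. The sketch below is a hypothetical stand-in for the receiver's extractor, assuming only what the walkthrough states: a `startswith` prefix check, a slice at `len(prior_stdout)`, and a "last non-empty line" fallback. The function name, the return shape, and the sample strings are illustrative and are not taken from `agy_receiver.py`.

```python
def extract_reply(prior_stdout: str, new_stdout: str) -> tuple[str, bool]:
    """Return (reply, drifted). Hypothetical stand-in for the real extractor."""
    if new_stdout.startswith(prior_stdout):
        # Happy path: agy's cumulative output extends what was persisted,
        # so everything after the persisted prefix is the new reply.
        return new_stdout[len(prior_stdout):].strip(), False
    # Mismatch: fall back to the last non-empty line, which may be incomplete.
    lines = [ln.strip() for ln in new_stdout.splitlines() if ln.strip()]
    return (lines[-1] if lines else ""), True


# Made-up sample data: a persisted Turn 1 and agy's cumulative Turn 2 output.
turn1 = "PONG handshake\n"
cumulative = turn1 + "The sky is blue\nbecause of Rayleigh scattering.\n"
stale = "PONG handshake \n"  # differs from turn1 by one trailing space

clean_reply, clean_drifted = extract_reply(turn1, cumulative)
stale_reply, stale_drifted = extract_reply(stale, cumulative)
```

With the exact prefix, the full two-line reply comes back. With a persisted prefix that differs by a single character, the fallback keeps only the final line, which is the kind of silent truncation the changes below are meant to surface.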

### Proposed changes to `templates/agy_receiver.py`

#### 1. Atomic `last_stdout` persistence

**Problem:** `last_stdout` is updated only *after* `run_agy_turn` returns the extracted reply. If the receiver crashes at any point between agy returning and the store being flushed to disk (`a2a-agy-sessions.json`), the persisted `last_stdout` is stale and the next receiver starts one turn behind.

The key window is in `poll_inbox` (pseudocode):

```python
async def _process_one(self, msg, ctx_lock):
    # ctx_lock is already held — per-contextId serialization
    reply = await self._run_agy_turn(msg)  # agy runs, extractor parses
    # CRASH HERE → on-disk last_stdout = n-1 turns behind
    self._append_turn(msg["contextId"], "assistant", reply)  # in-memory only
    self._flush_store()  # writes a2a-agy-sessions.json
    await self._post_reply(reply)
```

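**Fix:** Either flush `last_stdout` to disk *before* agy runs, using the *expected* cumulative output (take the lock, read the current `last_stdout`, flush the predicted-resume state), or flush *atomically* after every successful agy return. The second option is the one sketched below: the flush is non-negotiable, happens inside the per-contextId lock, and uses `atomic_write` (write to a `.tmp` file, then `os.replace`, which overwrites an existing target on every platform, whereas `os.rename` fails on Windows when the target already exists).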

```python
async def _process_one(self, msg, ctx_lock):
    async with ctx_lock:
        # Phase 1: lock, read prior stdout, dispatch agy
        prior = self._load_last_stdout(msg["contextId"])
        new_stdout = await self._run_agy(msg, prior)
        
        # Phase 2: extract reply
        reply = self._extract_reply(prior, new_stdout, msg["contextId"])
        
        # Phase 3: ATOMIC persist — flush before anything reads again
        self._save_last_stdout(msg["contextId"], new_stdout)  # uses atomic_write
        
        # Phase 4: post result
        await self._post_reply(reply, msg["contextId"])
```
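
The `atomic_write` helper itself is not shown in this proposal. A minimal sketch of the write-to-temp-then-rename pattern, using only the Python standard library, might look like the following. The name `atomic_write_json` and the choice to serialize the whole store as one JSON document are assumptions for illustration.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload: dict) -> None:
    """Write payload so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # swaps in the new file, overwriting the old
    except BaseException:
        # Leave no stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A crash before `os.replace` leaves the previous `a2a-agy-sessions.json` untouched; a crash after it leaves the new one. At no point does a reader observe a half-written file.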

#### 2. `prefix_drifted` flag in `a2a-agy-sessions.json`

**Problem:** When the persisted `last_stdout` does not match the output agy produces on a resume turn (first turn after restart, or any turn after a crash), the extractor enters the "prefix does not match" branch (docstring line 62) which silently falls back to "last non-empty line". This fallback is fragile (multi-line markdown can be misread, and empty results mask the failure).

**Fix:** Add a `prefix_drifted: bool` field to the session record in `a2a-agy-sessions.json`:

```json
{
  "v2611-agy-session": {
    "conversation_id": "ae6ce7ce-...",
    "last_stdout": "...",
    "prefix_drifted": true,
    "drifted_at": 1780493888,
    "updated_at": 1780493888
  }
}
```

When the prefix-strip detects a mismatch (`not new_stdout.startswith(prior_stdout)`):
1. Set `prefix_drifted: true` with a timestamp.
2. Emit a **warning transcript record** as a synthetic `sys_warn` message in the reply — a `hermes.drifted_state` entry in the JSON-RPC response, not part of the conversation text. Hermes (or the dashboard) can render this as an orange warning badge.
3. Attempt the "last non-empty line" fallback as before — but now Hermes knows the result is unreliable.

When the prefix-strip succeeds on a subsequent turn (no mismatch → the drift self-healed or the extractor caught up):
1. Set `prefix_drifted: false`.
2. Emit a `sys_info` as a `hermes.drift_recovered` entry.

This visibility is the single most important improvement: it converts a silent corruption into a first-class observable event.
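
The flag transitions described above can be sketched as a small state update. The field names `prefix_drifted`, `drifted_at`, and `updated_at` and the entry names `hermes.drifted_state` and `hermes.drift_recovered` come from this proposal. The function name and signature are hypothetical, and emitting the recovery entry only when the previous turn was flagged is an assumption about the intended behavior.

```python
import time


def update_drift_state(record: dict, prior_stdout: str, new_stdout: str, now=None):
    """Update the session record in place; return the entry to emit, or None."""
    now = int(time.time()) if now is None else now
    drifted = not new_stdout.startswith(prior_stdout)
    was_drifted = record.get("prefix_drifted", False)

    record["prefix_drifted"] = drifted
    record["updated_at"] = now
    if drifted:
        record["drifted_at"] = now  # timestamp of the most recent mismatch
        return "hermes.drifted_state"
    if was_drifted:
        # Assumption: report recovery only on the first clean turn after a drift.
        return "hermes.drift_recovered"
    return None
```

Keeping `drifted_at` after recovery is a design choice in this sketch: it leaves a trace of when the session last lost track, even once the flag has cleared.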

#### 3. Return `[incomplete reply — drift detected]` instead of `[no reply produced by agy]`

When `prefix_drifted` is true AND the fallback-produced-reply is empty:
- Return `[drift detected — persisted last_stdout does not match agy's cumulative output]` instead of `[no reply produced by agy]`.
- This tells the user/reviewer immediately that the receiver lost track, rather than suggesting agy failed to reply.
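
As a rough sketch, the selection logic reduces to one branch on the flag. The helper name is hypothetical, and the drift message is passed in as `drift_placeholder` rather than hard-coded, since its final wording is a decision for the implementation.

```python
NO_REPLY = "[no reply produced by agy]"


def placeholder_for(reply: str, prefix_drifted: bool, drift_placeholder: str) -> str:
    """Pick what to show the user when the extractor may have come up empty."""
    if reply.strip():
        return reply  # a real reply always wins
    # Empty reply: blame the receiver if it drifted, agy otherwise.
    return drift_placeholder if prefix_drifted else NO_REPLY
```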

---

## Long-term fix: adopt #71's A2A Handshake v2 protocol

The prefix-drift bug is fundamentally a *session state reconciliation failure*. The receiver thinks it knows what agy said last; agy knows what it said last; there is no protocol path to reconcile the two. #71's protocol closes this gap completely:

### What #71 offers that prevents this bug class

| Capability | How it fixes the drift |
|---|---|
| **SESSION ANNOUNCE `session_fresh`** | A freshly-started receiver announces `session_fresh: true`. Hermes knows: "this receiver just booted — it may not have the prior turn's state." Hermes can re-send the last turn as context, or route the task to a receiver with a warm session. |
| **SESSION ANNOUNCE `mcp_health`** | Would detect whether the agy CLI's conversation store is accessible (via Keychain, `last_conversations.json`). If not, Hermes knows the receiver can't do multi-turn. |
| **TASK DISPATCH `reply_schema: structured`** | Hermes instructs agy to emit JSON-formatted replies with explicit `context_fresh: true/false` — structured enough that Hermes can detect whether the reply references prior context or is a fresh answer. |
| **TASK RESULT `status: error \| partial \| blocked`** | An extractor failure produces `status: error` or `status: partial` with a `blocked_reason: "extractor prefix drift detected"`, rather than silently emitting free-text `[no reply produced by agy]`. |
| **TASK DISPATCH `continuation_from`** | Hermes explicitly tells agy which prior task this continues from. If the receiver was restarted and lost context, agy's server-side conversation can still resolve it — the dispatch tells the receiver which context ID to resume, and the receiver can confirm it has the right conversation UUID. |
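
To make the TASK RESULT row concrete, the sketch below maps an extractor outcome onto a result dict. Only `status`, its values `error` and `partial`, and `blocked_reason` come from the table above. The `reply` key, the `"ok"` success value, and the function itself are placeholders, since the envelope layout is defined in #71 and not reproduced here.

```python
def build_task_result(reply: str, prefix_drifted: bool) -> dict:
    """Map an extractor outcome onto a TASK RESULT-like dict (layout assumed)."""
    if not prefix_drifted:
        # "ok" is a placeholder; the real success value is whatever #71 defines.
        return {"status": "ok", "reply": reply}
    has_text = bool(reply.strip())
    result = {
        # A drifted turn with some text is partial; with none, it is an error.
        "status": "partial" if has_text else "error",
        "blocked_reason": "extractor prefix drift detected",
    }
    if has_text:
        result["reply"] = reply
    return result
```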

### Adoption path

1. **Phase 1 (near-term patch)** — Implement the atomic persistence + drift-flag fix above. This works today with no protocol changes.
2. **Phase 2 (inline hints, per #71 Phase 1)** — Embed `session_fresh` awareness in Hermes gateway's existing fleet dispatch logic: check the receiver's drift-flag record before dispatching high-stakes tasks; flag `prefix_drifted` sessions for human review.
3. **Phase 3 (structured announce, per #71 Phase 2)** — Modify the agy receiver shim to emit `SESSION ANNOUNCE` on startup, including `session_fresh`. Hermes parses it, updates its peer profile cache, and uses it for routing decisions.
4. **Phase 4 (structured result, per #71 Phase 3)** — Modify the receiver to wrap replies in structured `TASK RESULT` envelopes. The extractor's warning flags become first-class fields in the structured result.
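
The Phase 2 check could be as small as a lookup on the session record before dispatch. This is a sketch under stated assumptions: the function name, the `high_stakes` flag, and the three return labels are hypothetical, and it presumes the gateway can read the same record shape shown for `a2a-agy-sessions.json`.

```python
def dispatch_decision(sessions: dict, context_id: str, high_stakes: bool) -> str:
    """Decide how to treat a dispatch based on the receiver's drift flag."""
    record = sessions.get(context_id)
    if record is None or not record.get("prefix_drifted", False):
        return "dispatch"  # unknown or healthy session: proceed as today
    # Drifted session: hold high-stakes work for a human, warn on the rest.
    return "hold_for_review" if high_stakes else "dispatch_with_warning"
```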

---

## Open questions

1. Should `prefix_drifted: true` block further dispatches to that contextId until a reconciliation turn (re-send the last expected prompt) succeeds? Or is the warning sufficient?
2. The atomic-write flush adds a disk write per agy turn (agy can take 10-30s per turn — one write is negligible). Confirm the overhead is acceptable.
3. For the `prefix_drifted` warning record: should it be emitted as a separate JSON-RPC message to Hermes (so the dashboard can render an inline badge), or just logged in the receiver's own log? I lean toward separate message — visibility is the whole point.

---

## References

- **Bug report:** #108 — agy receiver prefix-strip drift ([link](https://github.com/Interstellar-code/hermes-agent/issues/108))
- **RFC:** #71 — A2A Bidirectional Session Handshake v2 ([link](https://github.com/Interstellar-code/hermes-agent/issues/71))
- **Receiver source:** `plugins/a2a_fleet/templates/agy_receiver.py` (prefix-strip extractor in docstring lines 44-64; `run_agy_turn` and `_extract_reply` downstream)
- **Session store:** `plugins/a2a_fleet/context_store.py` (in-memory LRU store; disk persistence via `a2a-agy-sessions.json`)
- **Dashboard:** `plugins/a2a_fleet/dashboard/plugin_api.py` (correctly scopes buckets by `(repo_path, contextId)` — UI is not the source of confusion)
- **Dashboard manifest:** `plugins/a2a_fleet/dashboard/manifest.json`
- **Cluster analysis:** Hermes Switch UI image showing `v2611-agy-session` thread in sidebar with Turn 4 reply visible but Turn 2's sky-answer missing — visually confirms the drift gap
