
Enable Hermes↔Hermes A2A peering (response_handler: agent across profiles) + fix A2A bind-race #120

Description

@Interstellar-code

Enable Hermes↔Hermes A2A peering (response_handler: agent across profiles) + fix the A2A server bind-race

Architecture corrected after a code-level review (Codex). An earlier draft of this issue wrongly described A2A ingress as the dashboard/web tier reached over a :8642 IPC channel — that is FALSE. :8642 is the unrelated OpenAI-compatible API server (gateway/platforms/api_server.py, /v1). The real model is below. The stale plugins/a2a_fleet/references/hermes-gateway-plugin-guide.md (claims /api/plugins/a2a_fleet/jsonrpc, /sse/{task_id}, /tasks routes that do not exist) caused the error and must be fixed/removed.

Goal

One Hermes profile (Switch) dispatches a task to another profile's agent (Neo/Morpheus/Trinity) over A2A and gets a reply — fleet_send("neo", "...") → Neo's agent loop answers. No new server software: reuse the agent protocol that already ships, exposed per profile.

Actual architecture (verified in code — build on this)

  • A2A ingress = a standalone uvicorn listener in plugins/a2a_fleet/server.py, bound to fleet.server.bind_host:bind_port (server.py:314). Serves /jsonrpc (SendMessage, server.py:181), /.well-known/agent-card.json, /health. It is NOT the dashboard server and NOT /api/plugins/... (those are read-only conversation/peer routes, dashboard/plugin_api.py:14).
  • Route B agent = in-process bridge. Inbound SendMessage with response_handler: agent calls get_agent_bridge() (server.py:206) and runs bridge.bridge_sync in an executor (server.py:224); the bridge does asyncio.run_coroutine_threadsafe(self._message_handler(event), self._gateway_loop) (adapter.py:190) — i.e. the uvicorn daemon thread hands the message to the gateway agent loop in the SAME process. The bridge only exists once the platform adapter connects (adapter.py:104, wired at gateway/run.py:4157).
  • Implication: the A2A listener MUST run in the process that hosts the gateway agent loop. If any other process wins the port, its listener has no bridge → inbound agent requests return "bridge not ready" (server.py:210).
  • Outbound fleet_send (fleet_tools.py:47 → client.py:85) is independent of all the above — it just POSTs SendMessage to a peer url with the peer bearer. A profile can send even if it runs no listener.
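
For readers unfamiliar with the Route B hand-off, here is a minimal, self-contained sketch of the pattern, not the plugin's actual code. AgentBridge, agent_handler and run_demo are hypothetical names; only asyncio.run_coroutine_threadsafe and the bridge_sync / _message_handler / _gateway_loop names come from the description above. In this demo the loop lives in a background thread and the caller is the main thread, which is the reverse of the gateway layout, but the cross-thread mechanics are the same.

```python
import asyncio
import threading


class AgentBridge:
    """Hypothetical stand-in for the in-process bridge described above."""

    def __init__(self, message_handler, gateway_loop):
        self._message_handler = message_handler
        self._gateway_loop = gateway_loop

    def bridge_sync(self, event, timeout=5.0):
        # Called from a thread that does NOT own the loop: schedule the
        # coroutine on the gateway loop, then block until it finishes.
        future = asyncio.run_coroutine_threadsafe(
            self._message_handler(event), self._gateway_loop
        )
        return future.result(timeout=timeout)


async def agent_handler(event):
    # Placeholder for the gateway agent loop's message handler.
    return f"agent reply to: {event}"


def run_demo():
    loop = asyncio.new_event_loop()
    # Stand-in for the gateway loop, run in its own thread for the demo.
    loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
    loop_thread.start()
    try:
        bridge = AgentBridge(agent_handler, loop)
        return bridge.bridge_sync("ping")
    finally:
        loop.call_soon_threadsafe(loop.stop)
        loop_thread.join(timeout=5)
        loop.close()


if __name__ == "__main__":
    print(run_demo())
```

The sketch also shows why a listener in the wrong process cannot work: without a reference to a running loop and its handler in the same process, there is nothing for bridge_sync to schedule onto.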

CRITICAL defect — fix before anything else (blocker)

register() calls _start_server_in_thread() unconditionally (__init__.py:546), guarded only by a module-local _server_thread (__init__.py:95). But register(ctx) runs in every process that loads the plugin — generic tool startup (model_tools.py:197), gateway startup (gateway/run.py:4056), CLI deferred startup (cli.py:880). So multiple processes on one profile race to bind the same bind_port; bind failure only surfaces after uvicorn exits (server.py:327/:356) and is swallowed. If a non-gateway process wins, Route B is dead (no bridge) — nondeterministic.

Fix: add an explicit process-role gate so only the gateway/agent process (the one where register_platform/the bridge is available) starts the A2A listener. Other plugin-load contexts must skip _start_server_in_thread(). Add a test asserting the server start path fires only in the gateway context.
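
A rough sketch of what such a gate could look like, under stated assumptions. ListenerGate, is_gateway_context and demo are hypothetical names, and detecting the gateway by the presence of a callable register_platform on ctx is an assumption drawn from the wording above, not a verified property of the plugin context. The real signal may need to be something else.

```python
import threading
from types import SimpleNamespace


def is_gateway_context(ctx):
    # ASSUMPTION: only the gateway process exposes register_platform on ctx.
    return callable(getattr(ctx, "register_platform", None))


class ListenerGate:
    """Starts the listener at most once, and only in a gateway context."""

    def __init__(self):
        self._lock = threading.Lock()
        self._thread = None

    def maybe_start(self, ctx, serve):
        if not is_gateway_context(ctx):
            return False  # tool/CLI load: never bind the port
        with self._lock:
            if self._thread is not None:
                return False  # already started in this process
            self._thread = threading.Thread(target=serve, daemon=True)
            self._thread.start()
            return True


def demo():
    started = threading.Event()
    gate = ListenerGate()
    cli_ctx = SimpleNamespace()  # no register_platform attribute
    gateway_ctx = SimpleNamespace(register_platform=lambda *a, **k: None)
    results = [
        gate.maybe_start(cli_ctx, started.set),
        gate.maybe_start(gateway_ctx, started.set),
        gate.maybe_start(gateway_ctx, started.set),
    ]
    started.wait(timeout=2)
    return results, started.is_set()


if __name__ == "__main__":
    print(demo())
```

The per-process lock only prevents double starts inside one process. The role check is what removes the cross-process race, on the assumption that each profile runs a single gateway process. The demo doubles as the shape of the requested test: a non-gateway context must never trigger the start path.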

Implementation (one pass — follows existing patterns, no phases)

  1. Process-role gate for _start_server_in_thread() (the blocker above). Co-locate listener + bridge in the gateway process only.
  2. Per-profile enablement — each profile that should RECEIVE: $HERMES_HOME/profiles/<p>/fleet.yaml → fleet.enabled: true, fleet.response_handler: agent, fleet.server.bind_port: <unique> (mandatory, fleet_config.py:170). Map: switch 9219, neo 9220, morpheus 9221, trinity 9222. The agent-card already advertises the profile name.
    • Prereq: that profile must run hermes_cli.main --profile <p> gateway run with platforms.a2a_fleet connected (the bridge readies on connect, gateway/run.py:4157). A CLI-only process is NOT enough for inbound agent mode.
  3. Peer wiring — each SENDER lists the others as plain agent peers (NO managed/mode/repo_path — those are for deployed CLI receivers; plain peers validate fine, fleet_config.py:237 + test test_fleet_config.py:67):
    agents:
      neo:
        url: http://127.0.0.1:9220/jsonrpc
        agent_card_url: http://127.0.0.1:9220/.well-known/agent-card.json
        token_env: A2A_HERMES_TOKEN_NEO   # profile-scoped name (see step 5)
    Bidirectional = both list each other + both run a listener.
  4. Handshake convention — reuse the executor handshake pattern: first message on reserved contextId handshake:hermes-<peer> where initiator declares role/purpose and receiver confirms role=agent + profile name + ready. The agent handler already processes it; this is a documented convention, not new code.
  5. Profile-scoped token env names — managed-peer token envs are mode+repo-derived and collision-safe (managed_peers.py:32/176), but fleet.server.token_env and plain-peer token_env are raw os.environ.get (fleet_config.py:104/166/242). With multiple profiles in ONE host env, names MUST be profile-scoped (e.g. A2A_HERMES_TOKEN_NEO, not a generic SWITCH_A2A_TOKEN). Loopback dev may use auth_required: false.
  6. Docs cleanup — fix/remove the stale references/hermes-gateway-plugin-guide.md (it documents non-existent /api/plugins/a2a_fleet/jsonrpc|sse|tasks routes). Add a "Hermes↔Hermes peering" section to the deploy-fleet skill (port map, plain-agent-peer shape, handshake, the gateway-run prereq).
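
To make steps 2 to 5 concrete, the following sketch derives the plain-peer entries and the handshake contextId from the port map. The ports, URL shapes, the A2A_HERMES_TOKEN_<PROFILE> naming and the handshake:hermes-<peer> format are taken from the list above. The helper names (token_env_for, peers_for, handshake_context_id) are hypothetical, and extending the NEO token name to the other profiles is an assumption.

```python
PORTS = {"switch": 9219, "neo": 9220, "morpheus": 9221, "trinity": 9222}


def token_env_for(profile):
    # Profile-scoped name, e.g. A2A_HERMES_TOKEN_NEO (pattern assumed
    # to extend to every profile).
    return f"A2A_HERMES_TOKEN_{profile.upper()}"


def peers_for(sender, host="127.0.0.1"):
    """Build the `agents:` mapping a sender would list (plain peers only)."""
    peers = {}
    for name, port in PORTS.items():
        if name == sender:
            continue  # a profile does not list itself as a peer
        base = f"http://{host}:{port}"
        peers[name] = {
            "url": f"{base}/jsonrpc",
            "agent_card_url": f"{base}/.well-known/agent-card.json",
            "token_env": token_env_for(name),
        }
    return peers


def handshake_context_id(peer):
    # Reserved contextId for the first message to a peer.
    return f"handshake:hermes-{peer}"


if __name__ == "__main__":
    for name, entry in peers_for("switch").items():
        print(name, entry)
    print(handshake_context_id("neo"))
```

Generating the entries from one port map keeps the four profiles consistent, and it deliberately emits no managed/mode/repo_path keys.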

Required before an implementer starts

  • A valid fleet.yaml with fleet.server.bind_port per receiving profile.
  • Each receiving profile running gateway run with platforms.a2a_fleet connected.
  • Profile-scoped token envs when auth_required: true.
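
A small preflight check along these lines could catch the first and third prerequisites before anyone starts. This is a sketch only: preflight is a hypothetical helper that takes already-parsed config dicts, and placing auth_required under fleet.server with a default of true is an assumption, since the issue does not say where that key lives. Whether a gateway is actually running with platforms.a2a_fleet connected cannot be checked from config and is out of scope here.

```python
def preflight(profiles, environ):
    """Return a list of problems; an empty list means the checks passed.

    profiles: {profile_name: parsed fleet.yaml as a dict}
    environ:  mapping of environment variables (e.g. os.environ)
    """
    problems = []
    seen_ports = {}
    for name, cfg in profiles.items():
        server = cfg.get("fleet", {}).get("server", {})
        port = server.get("bind_port")
        if not isinstance(port, int):
            problems.append(f"{name}: fleet.server.bind_port is missing")
        elif port in seen_ports:
            problems.append(
                f"{name}: bind_port {port} already used by {seen_ports[port]}"
            )
        else:
            seen_ports[port] = name
        # ASSUMPTION: auth_required sits under fleet.server, default true.
        if server.get("auth_required", True):
            env_name = server.get("token_env")
            if not env_name or not environ.get(env_name):
                problems.append(f"{name}: token env {env_name!r} is unset")
    return problems


if __name__ == "__main__":
    sample = {
        "neo": {"fleet": {"server": {"bind_port": 9220,
                                     "token_env": "A2A_HERMES_TOKEN_NEO"}}},
        "trinity": {"fleet": {"server": {"bind_port": 9220,
                                         "auth_required": False}}},
    }
    print(preflight(sample, {"A2A_HERMES_TOKEN_NEO": "secret"}))
```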

Out of scope (separate issues — NOT needed for a minimal round-trip)

Async Task lifecycle (tasks/*) and streaming (message/stream), both currently stubbed with JSON-RPC error -32601 (method not found); structured TASK_RESULT + SESSION_ANNOUNCE (#71); P0-2 deploy-tool runtime repo_path-empty.

Why this is small

The listener, in-process bridge, outbound client, agent-card, token resolution, and session model already ship (the agent protocol is one of 5 live protocols). Net new work = the process-role gate (blocker) + per-profile config + a documented handshake + doc cleanup. Do NOT build a new HTTP surface or a new IPC bridge — none is needed.
