Enable Hermes↔Hermes A2A peering (response_handler: agent across profiles) + fix the A2A server bind-race
Architecture corrected after a code-level review (Codex). An earlier draft of this issue wrongly described A2A ingress as the dashboard/web tier reached over a :8642 IPC channel — that is FALSE. :8642 is the unrelated OpenAI-compatible API server (gateway/platforms/api_server.py, /v1). The real model is below. The stale plugins/a2a_fleet/references/hermes-gateway-plugin-guide.md (claims /api/plugins/a2a_fleet/jsonrpc, /sse/{task_id}, /tasks routes that do not exist) caused the error and must be fixed/removed.
Goal
One Hermes profile (Switch) dispatches a task to another profile's agent (Neo/Morpheus/Trinity) over A2A and gets a reply — fleet_send("neo", "...") → Neo's agent loop answers. No new server software: reuse the agent protocol that already ships, exposed per profile.
Actual architecture (verified in code — build on this)
- A2A ingress = a standalone uvicorn listener in plugins/a2a_fleet/server.py, bound to fleet.server.bind_host:bind_port (server.py:314). It serves /jsonrpc (SendMessage, server.py:181), /.well-known/agent-card.json, and /health. It is NOT the dashboard server and NOT /api/plugins/... (those are read-only conversation/peer routes, dashboard/plugin_api.py:14).
- Route B agent = in-process bridge. Inbound SendMessage with response_handler: agent calls get_agent_bridge() (server.py:206) and runs bridge.bridge_sync in an executor (server.py:224); the bridge does asyncio.run_coroutine_threadsafe(self._message_handler(event), self._gateway_loop) (adapter.py:190). That is, the uvicorn daemon thread hands the message to the gateway agent loop in the SAME process. The bridge only exists once the platform adapter connects (adapter.py:104, wired at gateway/run.py:4157).
- Implication: the A2A listener MUST run in the process that hosts the gateway agent loop. If any other process wins the port, its listener has no bridge, so inbound agent requests return "bridge not ready" (server.py:210).
- Outbound fleet_send (fleet_tools.py:47 → client.py:85) is independent of all the above: it just POSTs SendMessage to a peer url with the peer bearer token. A profile can send even if it runs no listener.
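To make the Route B handoff concrete, here is a minimal, self-contained sketch of the same threading pattern. It is not the plugin's code: AgentBridge, agent_handler, handle_inbound, and demo are hypothetical stand-ins, and the event shape is an assumption. The only call taken from the description above is asyncio.run_coroutine_threadsafe. In this demo the loop runs on a helper thread and the caller is the main thread; in the plugin the roles are reversed, but the handoff mechanism is the same.

```python
import asyncio
import threading


class AgentBridge:
    """Hypothetical stand-in for the object returned by get_agent_bridge()."""

    def __init__(self, message_handler, gateway_loop):
        self._message_handler = message_handler
        self._gateway_loop = gateway_loop

    def bridge_sync(self, event, timeout=5.0):
        # Called from a thread that does NOT own the loop: schedule the
        # coroutine on the gateway loop, then block until it completes.
        future = asyncio.run_coroutine_threadsafe(
            self._message_handler(event), self._gateway_loop
        )
        return future.result(timeout=timeout)


async def agent_handler(event):
    # Placeholder for the gateway agent loop's message handler.
    return f"agent reply to: {event['text']}"


def handle_inbound(bridge, event):
    # A listener in a process with no connected adapter has no bridge.
    if bridge is None:
        return {"error": "bridge not ready"}
    return {"result": bridge.bridge_sync(event)}


def demo():
    loop = asyncio.new_event_loop()
    loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
    loop_thread.start()
    try:
        bridge = AgentBridge(agent_handler, loop)
        return {
            "with_bridge": handle_inbound(bridge, {"text": "ping"}),
            "without_bridge": handle_inbound(None, {"text": "ping"}),
        }
    finally:
        # Stop the loop from outside its thread, then release its resources.
        loop.call_soon_threadsafe(loop.stop)
        loop_thread.join(timeout=2)
        loop.close()


if __name__ == "__main__":
    print(demo())
```

The second call in demo() mirrors the failure mode in the Implication bullet: the listener accepts the request, but with no bridge in its process it can only report that the bridge is not ready.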
CRITICAL defect — fix before anything else (blocker)
register() calls _start_server_in_thread() unconditionally (__init__.py:546), guarded only by a module-local _server_thread (__init__.py:95). But register(ctx) runs in every process that loads the plugin — generic tool startup (model_tools.py:197), gateway startup (gateway/run.py:4056), CLI deferred startup (cli.py:880). So multiple processes on one profile race to bind the same bind_port; bind failure only surfaces after uvicorn exits (server.py:327/:356) and is swallowed. If a non-gateway process wins, Route B is dead (no bridge) — nondeterministic.
Fix: add an explicit process-role gate so only the gateway/agent process (the one where register_platform/the bridge is available) starts the A2A listener. Other plugin-load contexts must skip _start_server_in_thread(). Add a test asserting the server start path fires only in the gateway context.
Implementation (one pass — follows existing patterns, no phases)
- Process-role gate for _start_server_in_thread() (the blocker above). Co-locate listener + bridge in the gateway process only.
- Per-profile enablement: each profile that should RECEIVE sets, in $HERMES_HOME/profiles/<p>/fleet.yaml, fleet.enabled: true, fleet.response_handler: agent, and fleet.server.bind_port: <unique> (mandatory, fleet_config.py:170). Port map: switch 9219, neo 9220, morpheus 9221, trinity 9222. The agent-card already advertises the profile name.
  - Prereq: that profile must run hermes_cli.main --profile <p> gateway run with platforms.a2a_fleet connected (the bridge readies on connect, gateway/run.py:4157). A CLI-only process is NOT enough for inbound agent mode.
- Peer wiring: each SENDER lists the others as plain agent peers (NO managed/mode/repo_path; those are for deployed CLI receivers. Plain peers validate fine, fleet_config.py:237 + test test_fleet_config.py:67):

      agents:
        neo:
          url: http://127.0.0.1:9220/jsonrpc
          agent_card_url: http://127.0.0.1:9220/.well-known/agent-card.json
          token_env: A2A_HERMES_TOKEN_NEO  # profile-scoped name (see "Profile-scoped token env names" below)

Bidirectional = both list each other + both run a listener.
- Handshake convention: reuse the executor handshake pattern. The first message goes on the reserved contextId handshake:hermes-<peer>, where the initiator declares role/purpose and the receiver confirms role=agent + profile name + ready. The agent handler already processes it; this is a documented convention, not new code.
- Profile-scoped token env names: managed-peer token envs are mode+repo-derived and collision-safe (managed_peers.py:32/176), but fleet.server.token_env and plain-peer token_env are raw os.environ.get lookups (fleet_config.py:104/166/242). With multiple profiles in ONE host env, names MUST be profile-scoped (e.g. A2A_HERMES_TOKEN_NEO, not a generic SWITCH_A2A_TOKEN). Loopback dev may use auth_required: false.
- Docs cleanup: fix/remove the stale references/hermes-gateway-plugin-guide.md (it documents non-existent /api/plugins/a2a_fleet/jsonrpc|sse|tasks routes). Add a "Hermes↔Hermes peering" section to the deploy-fleet skill (port map, plain-agent-peer shape, handshake, the gateway-run prereq).
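The port map, plain-peer shape, token naming, and handshake contextId above can be generated mechanically. The following is an illustrative sketch, not plugin code: PORT_MAP restates the map from this issue, while token_env_name, plain_peers_for, and handshake_context_id are hypothetical helpers. It assumes all profiles listen on loopback and that the token env suffix is the upper-cased profile name, as in A2A_HERMES_TOKEN_NEO.

```python
from typing import Dict

# Port map from this issue; one unique bind_port per receiving profile.
PORT_MAP: Dict[str, int] = {
    "switch": 9219,
    "neo": 9220,
    "morpheus": 9221,
    "trinity": 9222,
}


def token_env_name(profile: str) -> str:
    # Profile-scoped name so several profiles can share one host environment.
    return f"A2A_HERMES_TOKEN_{profile.upper()}"


def plain_peers_for(
    sender: str, port_map: Dict[str, int], host: str = "127.0.0.1"
) -> Dict[str, Dict[str, str]]:
    """Build the plain agent-peer entries a sender needs for every other profile."""
    if len(set(port_map.values())) != len(port_map):
        raise ValueError("bind_port values must be unique per profile")
    peers: Dict[str, Dict[str, str]] = {}
    for profile, port in port_map.items():
        if profile == sender:
            continue  # a profile does not list itself as a peer
        base = f"http://{host}:{port}"
        peers[profile] = {
            "url": f"{base}/jsonrpc",
            "agent_card_url": f"{base}/.well-known/agent-card.json",
            "token_env": token_env_name(profile),
        }
    return peers


def handshake_context_id(peer: str) -> str:
    # Reserved contextId for the first message to a peer.
    return f"handshake:hermes-{peer}"


if __name__ == "__main__":
    for name, entry in plain_peers_for("switch", PORT_MAP).items():
        print(name, entry)
    print(handshake_context_id("neo"))
```

Generating the entries from a single map keeps the port assignments and token names consistent across all four profiles, which is the main way a hand-written fleet.yaml could drift.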
Required before an implementer starts
- A valid fleet.yaml with fleet.server.bind_port per receiving profile.
- Each receiving profile running gateway run with platforms.a2a_fleet connected.
- Profile-scoped token envs when auth_required: true.
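The checklist above could be turned into a small preflight helper. This is a hedged sketch: missing_prerequisites is a hypothetical function, the config is passed in as an already-parsed dict for the fleet section, the gateway state is supplied by the caller as a boolean, and two details are assumptions not confirmed by this issue: that auth_required lives under fleet.server, and that it defaults to true when absent.

```python
import os
from typing import Any, List, Mapping, Optional


def missing_prerequisites(
    fleet_cfg: Mapping[str, Any],
    gateway_connected: bool,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return human-readable problems; an empty list means ready to receive."""
    env = os.environ if env is None else env
    server = fleet_cfg.get("server", {})
    problems: List[str] = []

    if fleet_cfg.get("enabled") is not True:
        problems.append("fleet.enabled is not true")
    if fleet_cfg.get("response_handler") != "agent":
        problems.append("fleet.response_handler is not 'agent'")
    if not isinstance(server.get("bind_port"), int):
        problems.append("fleet.server.bind_port is missing")
    if not gateway_connected:
        problems.append("gateway run with platforms.a2a_fleet is not connected")

    # Assumption: auth_required sits under fleet.server and defaults to true.
    if server.get("auth_required", True):
        token_env = server.get("token_env")
        if not token_env or not env.get(token_env):
            problems.append("token env is unset while auth_required is true")
    return problems
```

For example, a neo profile with bind_port 9220 and token_env A2A_HERMES_TOKEN_NEO returns an empty list only when the gateway is connected and that variable is present in the environment.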
Out of scope (separate issues — NOT needed for a minimal round-trip)
Async Task lifecycle (tasks/*) and streaming (message/stream), both currently stubbed with JSON-RPC error -32601 (method not found); structured TASK_RESULT + SESSION_ANNOUNCE (#71); P0-2 deploy-tool runtime repo_path-empty.
Why this is small
The listener, in-process bridge, outbound client, agent-card, token resolution, and session model already ship (the agent protocol is one of 5 live protocols). Net new work = the process-role gate (blocker) + per-profile config + a documented handshake + doc cleanup. Do NOT build a new HTTP surface or a new IPC bridge — none is needed.