agent_card has no recovery path if the initial /registry/register fails (heartbeat + watcher gaps) #2421

Closed
opened 2026-06-08 02:36:57 +00:00 by devops-engineer · 0 comments
Member

Resilience gap found while root-causing #36 (NOT the #36 cause — that was CP Hetzner location capacity, fixed in cp#619 — but a real latent gap).

A workspace's agent_card reaches the platform ONLY via POST /registry/register at startup (runtime main.py). If register fails its retry budget, nothing backfills it: the heartbeat payload carries no agent_card; the ws-server Heartbeat handler never writes agent_card (only Register does); the config watcher posts /registry/update-card only on a file CHANGE (caches the baseline at startup) and omits the auth header. So the runtime's 'heartbeat backfill is the recovery path' log is false — a box that fails its first register stays agent_card=NULL (and its auth token, minted on first register, never lands -> later heartbeats 401).

Fix options: (a) heartbeat carries + handler backfills agent_card when NULL; (b) watcher delivers the initial card at startup + add the auth header; (c) runtime retries register until success. Likely (a)+(c). Verified on a faithful Hetzner repro.

Resilience gap found while root-causing #36 (NOT the #36 cause — that was CP Hetzner location capacity, fixed in cp#619 — but a real latent gap). A workspace's agent_card reaches the platform ONLY via POST /registry/register at startup (runtime main.py). If register fails its retry budget, nothing backfills it: the heartbeat payload carries no agent_card; the ws-server Heartbeat handler never writes agent_card (only Register does); the config watcher posts /registry/update-card only on a file CHANGE (caches the baseline at startup) and omits the auth header. So the runtime's 'heartbeat backfill is the recovery path' log is false — a box that fails its first register stays agent_card=NULL (and its auth token, minted on first register, never lands -> later heartbeats 401). Fix options: (a) heartbeat carries + handler backfills agent_card when NULL; (b) watcher delivers the initial card at startup + add the auth header; (c) runtime retries register until success. Likely (a)+(c). Verified on a faithful Hetzner repro.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2421