Watchdog double-provision race wedges fresh workspaces: two boxes fight one bootstrap-register mint → permanent 401 + online-but-braindead agents #2611

Open
opened 2026-06-11 22:44:26 +00:00 by core-devops · 0 comments
Member

Live incident on enter-os (2026-06-11 ~22:19Z), third first-run blocker of the day (after #2608/#2609):

Observed chain (all evidence in CP logs + on-box docker logs):

  1. Fresh workspaces never heartbeated (separate trigger), so the stall watchdog fired a recreate — and CP provisioned TWO instances for the same workspace within 1s (i-03097c05832d392b8 + i-090da08abea9119ef, two bootstrap-watchers, DNS A-record flapping between them at 22:20:02/03). Every deprovision/provision log line in that window is doubled → two concurrent restart cycles raced.
  2. SaaS provision revokes all workspace tokens and relies on the register handler's bootstrap path ("first successful register mints"). With two boxes racing, one consumes the bootstrap mint; the other then presents nothing/stale and gets HTTP 401 — and the runtime treats register 401 as terminal ("client error — not retrying"). On-box log: Register: HTTP 401 ... proceeding (heartbeat backfill is the recovery path) — but heartbeats need the token too, so there is no recovery.
  3. Result: workspace shows online (canary), ingests messages (a2a_receive logged), agent never processes anything — user chats into a void. Watchdog keeps cycling, sometimes doubling again. Also hit SnapshotCreationPerVolumeRateExceeded from the churn.
  4. Recovery that worked: ONE controlled restart per workspace → single box → bootstrap register 200 → .auth_token + .platform_inbound_secret persisted → agent_log activity resumed.

Fixes needed (durable, not operator ritual):

  • Mutex the recreate path per workspace (CP): a provision/recreate for a workspace must be single-flight; a second concurrent request joins or no-ops, never launches a second instance. (The watchdog + auto-recovery + admin restart can currently overlap.)
  • Register 401 must be retryable when the workspace has no live token: either the bootstrap gate re-opens (presented-but-invalid token while zero live tokens exist → treat as bootstrap), or the runtime retries register with backoff instead of giving up forever.
  • delegation status=completed means DELIVERED, not processed — during the outage my diagnostic delegations reported completed while the target agent was dead. Either rename or add a processed/acked state; operators and agents both misread this.
  • Watchdog recreating a workspace whose register has been failing should surface that on the workspace record (ties into the last_error surfacing gap from #2608/cp#729).

Refs core#2530 (the 401-starves-chat class — its "restart to recover" hint is defeated by the double-provision race), #2608, #2609, cp#729.

Live incident on `enter-os` (2026-06-11 ~22:19Z), third first-run blocker of the day (after #2608/#2609): **Observed chain (all evidence in CP logs + on-box docker logs):** 1. Fresh workspaces never heartbeated (separate trigger), so the stall watchdog fired a recreate — and CP provisioned **TWO instances for the same workspace within 1s** (`i-03097c05832d392b8` + `i-090da08abea9119ef`, two bootstrap-watchers, DNS A-record flapping between them at 22:20:02/03). Every deprovision/provision log line in that window is doubled → two concurrent restart cycles raced. 2. SaaS provision **revokes all workspace tokens** and relies on the register handler's bootstrap path ("first successful register mints"). With two boxes racing, one consumes the bootstrap mint; the other then presents nothing/stale and gets **HTTP 401** — and the runtime treats register 401 as terminal ("client error — not retrying"). On-box log: `Register: HTTP 401 ... proceeding (heartbeat backfill is the recovery path)` — but heartbeats need the token too, so there is no recovery. 3. Result: workspace shows `online` (canary), ingests messages (`a2a_receive` logged), **agent never processes anything** — user chats into a void. Watchdog keeps cycling, sometimes doubling again. Also hit `SnapshotCreationPerVolumeRateExceeded` from the churn. 4. Recovery that worked: ONE controlled restart per workspace → single box → bootstrap register 200 → `.auth_token` + `.platform_inbound_secret` persisted → agent_log activity resumed. **Fixes needed (durable, not operator ritual):** - **Mutex the recreate path per workspace** (CP): a provision/recreate for a workspace must be single-flight; a second concurrent request joins or no-ops, never launches a second instance. (The watchdog + auto-recovery + admin restart can currently overlap.) - **Register 401 must be retryable when the workspace has no live token**: either the bootstrap gate re-opens (presented-but-invalid token while zero live tokens exist → treat as bootstrap), or the runtime retries register with backoff instead of giving up forever. - **`delegation status=completed` means DELIVERED, not processed** — during the outage my diagnostic delegations reported `completed` while the target agent was dead. Either rename or add a processed/acked state; operators and agents both misread this. - Watchdog recreating a workspace whose register has been failing should surface that on the workspace record (ties into the `last_error` surfacing gap from #2608/cp#729). Refs core#2530 (the 401-starves-chat class — its "restart to recover" hint is defeated by the double-provision race), #2608, #2609, cp#729.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2611