secret-write auto-restart kills the WRITING agent mid-turn when target == self (org concierge self-restart loop) #2573

Closed
opened 2026-06-11 04:17:54 +00:00 by core-devops · 5 comments
Member

Live (agents-team concierge, 2026-06-11 ~01:53Z, twice): the org platform agent called set_org_secret / set_workspace_secret (exercising the CTO's test-approval request). The platform's secret-change auto-restart then restarted the agent's own workspace — killing its in-flight turn (Post http://...:8000: EOF), dropping the canvas reply, and (until runtime#118 ships) re-breaking its RBAC on the fresh boot. One occurrence also left the workspace stuck provisioning ~2h until provision-reconcile recovered it.

An org-root agent managing its own org will write secrets routinely — self-restart-on-write makes that operationally explosive.

Asks:

  1. When the secret-write CALLER is the target workspace (or the target is the org root processing a live turn), DEFER the restart to turn-end (or make it explicit: return restart_pending: true and let the agent/user trigger it).
  2. At minimum, the tool response must WARN that a restart is imminent so the agent can flush its reply first.

🤖 Generated with Claude Code

**Live (agents-team concierge, 2026-06-11 ~01:53Z, twice):** the org platform agent called `set_org_secret` / `set_workspace_secret` (exercising the CTO's test-approval request). The platform's secret-change **auto-restart** then restarted the agent's own workspace — killing its in-flight turn (`Post http://...:8000: EOF`), dropping the canvas reply, and (until runtime#118 ships) re-breaking its RBAC on the fresh boot. One occurrence also left the workspace stuck `provisioning` ~2h until provision-reconcile recovered it. An org-root agent managing its own org will write secrets routinely — self-restart-on-write makes that operationally explosive. **Asks:** 1. When the secret-write CALLER is the target workspace (or the target is the org root processing a live turn), DEFER the restart to turn-end (or make it explicit: return `restart_pending: true` and let the agent/user trigger it). 2. At minimum, the tool response must WARN that a restart is imminent so the agent can flush its reply first. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Author
Member

Live escalation of this bug class (2026-06-11): the agents-team concierge box i-0ed22d442252018b0 was TERMINATED by molecule-cp at 04:39:15Z — the exact minute an operator DELETE of a junk workspace secret (TEST_APPROVAL_SECRET, cleanup from the approval-gate incident) returned 200. So:

  1. The secret-change auto-restart fires on deletes too, not just writes.
  2. The restart path is a full terminate + re-provision, and the re-provision failed silently — no replacement instance was ever created (CloudTrail shows the terminate; EC2 shows no successor). The workspace sat failed with last_sample_error=NULL and zero alarms for 14 hours until the CTO noticed in the UI.
  3. Recovery: manual POST /workspaces/:id/restart re-provisioned fine — so the failure was in the auto-restart's provision leg (cp#691 SecretsManager deletion-window race is the likely culprit: delete secret → restart → provision reads the half-deleted secret bundle).

Additional asks beyond the original scope:

  • Secret DELETES of non-runtime-consumed keys should not restart at all (or batch/debounce).
  • The auto-restart MUST surface provision failure (set last_sample_error + activity row + canvas alert) — a silent 14h-dead org root is the worst failure shape for the platform agent.
  • The reconcile sweeper should treat a failed kind='platform' workspace as auto-recover (it re-provisioned a stuck one on 2026-06-10 but did not engage here).

🤖 Generated with Claude Code

**Live escalation of this bug class (2026-06-11):** the agents-team concierge box `i-0ed22d442252018b0` was TERMINATED by `molecule-cp` at **04:39:15Z** — the exact minute an operator DELETE of a junk workspace secret (`TEST_APPROVAL_SECRET`, cleanup from the approval-gate incident) returned 200. So: 1. The secret-change auto-restart fires on **deletes** too, not just writes. 2. The restart path is a full **terminate + re-provision**, and the re-provision **failed silently** — no replacement instance was ever created (CloudTrail shows the terminate; EC2 shows no successor). The workspace sat `failed` with `last_sample_error=NULL` and zero alarms for **14 hours** until the CTO noticed in the UI. 3. Recovery: manual `POST /workspaces/:id/restart` re-provisioned fine — so the failure was in the auto-restart's provision leg (cp#691 SecretsManager deletion-window race is the likely culprit: delete secret → restart → provision reads the half-deleted secret bundle). Additional asks beyond the original scope: - Secret DELETES of non-runtime-consumed keys should not restart at all (or batch/debounce). - The auto-restart MUST surface provision failure (set last_sample_error + activity row + canvas alert) — a silent 14h-dead org root is the worst failure shape for the platform agent. - The reconcile sweeper should treat a `failed` kind='platform' workspace as auto-recover (it re-provisioned a stuck one on 2026-06-10 but did not engage here). 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Author
Member

Second occurrence, same day (2026-06-11 ~19:06Z), this time user-triggered: the CTO asked the concierge for "a test task and test approval" in canvas chat; the agent (still lacking create_approval — mcp-server#61) improvised with mcp__platform__set_workspace_secret(TEST_SECRET) on its own workspace → secret-change auto-restart → its box terminated MID-TURN → workspace failed in the UI while the user watched. Identical mechanism to the 04:39Z outage above.

This is now a reproducible loop: any approval-demo request to the org concierge self-DoSes the org root. Cleanup is also booby-trapped — deleting the junk secrets (TEST_SECRET, TEST_APPROVAL_DUMMY_KEY) would fire ANOTHER restart, so they are deliberately left in place until this is fixed.

Priority ask: treat as HIGH. Minimum viable fix = reject (400 + guidance) agent-initiated secret writes/deletes targeting the agent's OWN workspace; the restart-debounce and delete-exemption can follow. Pairs with mcp-server#61 (give the concierge create_approval so it stops improvising).

🤖 Generated with Claude Code

**Second occurrence, same day (2026-06-11 ~19:06Z), this time user-triggered:** the CTO asked the concierge for "a test task and test approval" in canvas chat; the agent (still lacking `create_approval` — mcp-server#61) improvised with `mcp__platform__set_workspace_secret(TEST_SECRET)` **on its own workspace** → secret-change auto-restart → its box terminated MID-TURN → workspace `failed` in the UI while the user watched. Identical mechanism to the 04:39Z outage above. This is now a reproducible loop: any approval-demo request to the org concierge self-DoSes the org root. Cleanup is also booby-trapped — deleting the junk secrets (TEST_SECRET, TEST_APPROVAL_DUMMY_KEY) would fire ANOTHER restart, so they are deliberately left in place until this is fixed. Priority ask: treat as HIGH. Minimum viable fix = reject (400 + guidance) agent-initiated secret writes/deletes targeting the agent's OWN workspace; the restart-debounce and delete-exemption can follow. Pairs with mcp-server#61 (give the concierge create_approval so it stops improvising). 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Author
Member

Fix chain status (2026-06-11):

  1. mcp-server#62 — MERGED + PUBLISHED. create_approval + create_request (kind task|approval) in management mode; @molecule-ai/mcp-server@1.6.0 is live on the registry.
  2. core#2603 — OPEN, review dispatched to CR2. autoRestartAllowed(): self-write skip + kind='platform' skip with FAIL-CLOSED kind lookup; same platform exclusion in the global-secret restart fan-out; concierge identity prompt now bakes the no-self-secret-ops / safe-approval-demo rule (its RBAC denies memory.write, so chat-taught rules don't survive restarts — the prompt is re-seeded every provision).
  3. workspace-template#111 — OPEN. Dockerfile.platform-agent MCP_VERSION 1.5.0→1.6.0. After merge: image → smoke gate → repin via promote API.

Concierge (cd62fe70…, agents-team) recovered ONLINE after an explicit restart at ~19:55Z (the post-self-kill auto-recovery had stalled at offline). Junk secrets TEST_SECRET / TEST_APPROVAL_DUMMY_KEY stay until #2603 deploys; deleting them now would fire the same restart.

**Fix chain status (2026-06-11):** 1. **mcp-server#62 — MERGED + PUBLISHED.** `create_approval` + `create_request` (kind task|approval) in management mode; `@molecule-ai/mcp-server@1.6.0` is live on the registry. 2. **core#2603 — OPEN, review dispatched to CR2.** `autoRestartAllowed()`: self-write skip + `kind='platform'` skip with FAIL-CLOSED kind lookup; same platform exclusion in the global-secret restart fan-out; concierge identity prompt now bakes the no-self-secret-ops / safe-approval-demo rule (its RBAC denies memory.write, so chat-taught rules don't survive restarts — the prompt is re-seeded every provision). 3. **workspace-template#111 — OPEN.** `Dockerfile.platform-agent` MCP_VERSION 1.5.0→1.6.0. After merge: image → smoke gate → repin via promote API. Concierge (`cd62fe70…`, agents-team) recovered ONLINE after an explicit restart at ~19:55Z (the post-self-kill auto-recovery had stalled at `offline`). Junk secrets `TEST_SECRET` / `TEST_APPROVAL_DUMMY_KEY` stay until #2603 deploys; deleting them now would fire the same restart.
Author
Member

RESOLVED end-to-end (2026-06-11 ~21:1xZ) — full chain verified in production:

  1. Server guard (core#2603) merged + deployed fleet-wide (agents-team: cd4281f in the publish run's fleet verification). Live-tested: deleted the workspace-scoped TEST_SECRET on the concierge — no auto-restart fired, workspace stayed online. The exact operation that caused the 04:39Z 14h outage is now a no-op.
  2. Tools (mcp-server 1.6.1): #62's 1.6.0 turned out to die at management-mode startup (duplicate create_request registration — #63), caught by the platform-agent image smoke gate on its first real run (template#112 fixed the bash quoting bug that had silently broken the smoke step on every prior publish run — no platform-agent image had been published since the gate landed). 1.6.1 published, template#113 pinned it, image built+pushed, platform-agent pin promoted on prod+staging CP.
  3. Concierge re-provisioned on the new image — verified live: create_approval + create_request present (37 platform tools). First re-provision attempt failed on an SM secret-deletion propagation race (deprovision force-deletes molecule/workspace/<id>/config ~2s before provision re-creates it → InvalidRequestException: scheduled for deletion); retry succeeded. Worth a CP-side retry/backoff — filing separately.
  4. Prompt-bake (core#2605, open): the merge queue took #2603 at its first commit, so the identity-prompt guardrail commit was re-raised as #2605 (concierge RBAC denies memory.write — chat-taught rules don't survive restarts; the prompt is the durable surface).
  5. Leftovers: TEST_APPROVAL_DUMMY_KEY is org-GLOBAL — deleting it fans out restarts to all sibling workspaces, so cleanup is deferred to a quiet window. Filed core#2606: regular workspaces don't actually load the unified request tools (npm mcp-server isn't wired into ordinary boxes) — the architectural gap behind "the concierge improvised.
**RESOLVED end-to-end (2026-06-11 ~21:1xZ) — full chain verified in production:** 1. **Server guard (core#2603)** merged + deployed fleet-wide (`agents-team: cd4281f` in the publish run's fleet verification). **Live-tested:** deleted the workspace-scoped `TEST_SECRET` on the concierge — no auto-restart fired, workspace stayed `online`. The exact operation that caused the 04:39Z 14h outage is now a no-op. 2. **Tools (mcp-server 1.6.1)**: #62's 1.6.0 turned out to die at management-mode startup (duplicate `create_request` registration — #63), caught by the platform-agent **image smoke gate on its first real run** (template#112 fixed the bash quoting bug that had silently broken the smoke step on every prior publish run — no platform-agent image had been published since the gate landed). 1.6.1 published, template#113 pinned it, image built+pushed, `platform-agent` pin promoted on prod+staging CP. 3. **Concierge re-provisioned on the new image** — verified live: `create_approval` + `create_request` present (37 platform tools). First re-provision attempt failed on an SM secret-deletion propagation race (deprovision force-deletes `molecule/workspace/<id>/config` ~2s before provision re-creates it → `InvalidRequestException: scheduled for deletion`); retry succeeded. Worth a CP-side retry/backoff — filing separately. 4. **Prompt-bake (core#2605, open):** the merge queue took #2603 at its first commit, so the identity-prompt guardrail commit was re-raised as #2605 (concierge RBAC denies memory.write — chat-taught rules don't survive restarts; the prompt is the durable surface). 5. **Leftovers:** `TEST_APPROVAL_DUMMY_KEY` is org-GLOBAL — deleting it fans out restarts to all sibling workspaces, so cleanup is deferred to a quiet window. Filed core#2606: regular workspaces don't actually load the unified request tools (npm mcp-server isn't wired into ordinary boxes) — the architectural gap behind "the concierge improvised.
Author
Member

Closing note — the "created but not visible" half is fixed in prod (2026-06-11 ~21:40Z): the concierge's create_approval/create_request rows existed but the canvas Tasks/Approvals tabs were empty because the tenant edge served the Next.js app on /requests/* — tunnel ingress predated CP#706's platformPaths entry and nothing ever re-applied it. CP#730 (merged + deployed): tenant redeploys now re-apply tunnel ingress from the platformPaths SSOT every time (tunnel_ingress_synced in redeploy results). agents-team redeployed + verified: /requests/pending returns the pending approval + task through the edge. Remaining pre-#706 tenants converge on their next fleet redeploy automatically.

**Closing note — the "created but not visible" half is fixed in prod (2026-06-11 ~21:40Z):** the concierge's create_approval/create_request rows existed but the canvas Tasks/Approvals tabs were empty because the tenant edge served the Next.js app on `/requests/*` — tunnel ingress predated CP#706's platformPaths entry and nothing ever re-applied it. **CP#730 (merged + deployed):** tenant redeploys now re-apply tunnel ingress from the platformPaths SSOT every time (`tunnel_ingress_synced` in redeploy results). agents-team redeployed + verified: `/requests/pending` returns the pending approval + task through the edge. Remaining pre-#706 tenants converge on their next fleet redeploy automatically.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2573