fix(secrets): never auto-restart the org's platform root on secret changes (core#2573) #2603

Merged
devops-engineer merged 1 commit from fix/2573-no-autorestart-platform-root into main 2026-06-11 20:01:24 +00:00
Closes the remaining #2573 gap after the self-write skip.

Why the self-write skip was not enough: the org concierge's management MCP authenticates with the tenant ADMIN token, so callerWorkspaceID() returns "" and the self-write skip never fires. A secret write/delete targeting the kind='platform' workspace therefore auto-restarted (= terminated + re-provisioned) the org root's box mid-turn — twice on 2026-06-11, the first costing a 14h org-root outage when the provision leg failed silently (cp#691).
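
To make the gap concrete, here is a minimal sketch of a self-write check of the kind described above. The function name `selfWriteSkip` and the workspace IDs are hypothetical; only the behaviour (an empty caller ID under the ADMIN token never matches the target) comes from this description.

```go
package main

import "fmt"

// selfWriteSkip is a hypothetical stand-in for the existing self-write check:
// the restart is skipped only when the caller is the workspace being modified.
func selfWriteSkip(callerID, targetID string) bool {
	return callerID != "" && callerID == targetID
}

func main() {
	// Workspace-token caller writing its own secret: the skip fires.
	fmt.Println(selfWriteSkip("ws-root", "ws-root")) // true
	// ADMIN-token caller: the caller ID is "", so the skip cannot fire,
	// even when the target is the org root.
	fmt.Println(selfWriteSkip("", "ws-root")) // false
}
```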

Change:

  • autoRestartAllowed() — self-write skip (existing) + new kind='platform' skip. Kind lookup fails closed: if the kind can't be proven, no restart (a wrong restart on the org root is the exact outage this guards against; a skipped restart only delays env propagation until the next explicit restart — the canvas Restart button covers that).
  • restartAllAffectedByGlobalKey(): COALESCE(kind,'workspace') <> 'platform' added to the fan-out query, so global-secret rotation can't tear down the org root either.
  • Tests: platform-root skip on Set/Delete, fan-out SQL exclusion pinned by regex, kind-lookup expectations added to existing restart-fire tests (incl. the #2584 spoof pair).
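
The decision in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the `kindLookup` type, the `demoLookup` helper, and the workspace IDs are hypothetical, and the sketch treats only a lookup error as "kind can't be proven" (how the real helper handles an empty or NULL kind is not stated here).

```go
package main

import (
	"errors"
	"fmt"
)

// kindLookup is a hypothetical abstraction over the DB query that returns
// a workspace's kind.
type kindLookup func(workspaceID string) (string, error)

// autoRestartAllowed sketches the two skips described in the PR.
func autoRestartAllowed(callerID, targetID string, lookup kindLookup) bool {
	// Existing rule: a workspace writing its own secret is not restarted.
	if callerID != "" && callerID == targetID {
		return false
	}
	kind, err := lookup(targetID)
	if err != nil {
		// Fail closed: the kind could not be proven, so do not restart.
		return false
	}
	// New rule: never auto-restart the platform root.
	return kind != "platform"
}

// demoLookup is an in-memory stand-in for the real kind query (assumption).
func demoLookup(id string) (string, error) {
	switch id {
	case "ws-root":
		return "platform", nil
	case "ws-app":
		return "workspace", nil
	}
	return "", errors.New("kind lookup failed")
}

func main() {
	fmt.Println(autoRestartAllowed("", "ws-root", demoLookup))       // false: platform root
	fmt.Println(autoRestartAllowed("", "ws-app", demoLookup))        // true: ordinary workspace
	fmt.Println(autoRestartAllowed("", "ws-gone", demoLookup))       // false: lookup error
	fmt.Println(autoRestartAllowed("ws-app", "ws-app", demoLookup))  // false: self-write
}
```

The asymmetry is deliberate in this sketch, as in the description: a false "no" only delays env propagation until an explicit restart, while a false "yes" on the org root reproduces the outage.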

Pairs with mcp-server#62 (merged): create_approval/create_request in management mode, so the concierge stops improvising approval demos with gated ops.

Refs core#2573.

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-11 19:57:36 +00:00
fix(secrets): never auto-restart the org's platform root on secret changes (core#2573)
CI / Python Lint & Test (pull_request) Successful in 3s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
Harness Replays / detect-changes (pull_request) Successful in 3s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
E2E Chat / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
sop-checklist / review-refire (pull_request_target) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
gate-check-v3 / gate-check (pull_request_target) Successful in 5s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
reserved-path-review / reserved-path-review (pull_request_target) Successful in 5s
Harness Replays / Harness Replays (pull_request) Successful in 2s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s
sop-checklist / all-items-acked (pull_request_target) Successful in 4s
E2E Chat / E2E Chat (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 19s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 35s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 35s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 1s
CI / Canvas Deploy Status (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 38s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m1s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 25s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
reserved-path-review / reserved-path-review (pull_request_review) Successful in 3s
security-review / approved (pull_request_review) Successful in 3s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 11s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m24s
CI / Platform (Go) (pull_request) Successful in 2m50s
CI / all-required (pull_request) Successful in 1s
audit-force-merge / audit (pull_request_target) Successful in 3s
977502befd
The #2573 self-write skip only covers callers presenting a workspace
token / X-Workspace-ID. The org concierge's management MCP authenticates
with the tenant ADMIN token, so callerID resolves to "" and the skip
never fired: a secret write/delete targeting the concierge terminated
the org root's box mid-turn — twice on 2026-06-11, once costing a 14h
outage (silent provision failure, cp#691).

- autoRestartAllowed(): self-write skip + NEW kind='platform' skip; the
  kind lookup FAILS CLOSED (unknown kind -> no restart) since a wrong
  restart on the org root is the exact outage this prevents, while a
  skipped restart only delays env propagation until the next explicit
  restart.
- restartAllAffectedByGlobalKey(): same exclusion in the global-secret
  fan-out query (COALESCE(kind,'workspace') <> 'platform') — a global
  rotation must not tear down the org root either.
- Tests: platform-root skip for Set/Delete, fan-out SQL exclusion
  pinned by regex, kind-lookup expectations added to the existing
  restart-fire tests (#2584 spoof tests included).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-11 19:59:41 +00:00
agent-reviewer-cr2 left a comment
APPROVED after full 5-axis review of molecule-core#2603 at head 977502be.

Correctness: the new autoRestartAllowed() covers both restart-kill cases: self-write and target kind='platform'. It fails closed on kind lookup errors, which is appropriate for this outage class. Set and Delete now route through the helper, and restartAllAffectedByGlobalKey filters on COALESCE(kind,'workspace') <> 'platform', so global secret rotations cannot fan out a restart to the org root.
Robustness: fail-closed lookup means an uncertain target skips auto-restart rather than risking org-root termination; explicit restart remains available for env propagation. Existing restart-positive tests were updated with kind expectations, and new tests cover platform Set/Delete skip plus global fan-out SQL exclusion.
Security: this reduces availability blast radius for secret changes without weakening auth or allowing spoofed caller headers; the #2584 spoof regression tests still expect restart for a valid non-target caller and now include the kind lookup.
Performance: one small kind lookup only on paths that would otherwise auto-restart; fan-out adds an indexed-style predicate on workspace kind. No new loops or blocking calls beyond existing DB work.
Readability: helper name/comment clearly document why platform roots are special and why fail-closed is intentional.

Live status at review time had product checks green or running, with qa-review/security-review expected to clear from this approval and SOP ceremony still separately visible.
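
For readers checking the fan-out exclusion discussed in this review, the following sketch shows one way a predicate like this can be pinned by regex, plus a Go mirror of its semantics. The query text, the table name `workspaces`, the `status` column, and the names `fanOutQuery`, `platformExclusion`, and `includedInFanOut` are all assumptions for illustration; only the `COALESCE(kind,'workspace') <> 'platform'` predicate comes from the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// fanOutQuery is an illustrative query; everything except the COALESCE
// predicate is an assumption, not the real schema.
const fanOutQuery = `
SELECT id FROM workspaces
WHERE status = 'running'
  AND COALESCE(kind,'workspace') <> 'platform'`

// platformExclusion matches the predicate while tolerating whitespace changes.
var platformExclusion = regexp.MustCompile(
	`COALESCE\(\s*kind\s*,\s*'workspace'\s*\)\s*<>\s*'platform'`)

// includedInFanOut mirrors the SQL predicate in Go: a nil kind (SQL NULL)
// is treated as 'workspace', so it stays in the fan-out.
func includedInFanOut(kind *string) bool {
	k := "workspace"
	if kind != nil {
		k = *kind
	}
	return k != "platform"
}

func main() {
	fmt.Println(platformExclusion.MatchString(fanOutQuery)) // true
	platform, plain := "platform", "workspace"
	fmt.Println(includedInFanOut(&platform)) // false
	fmt.Println(includedInFanOut(&plain))    // true
	fmt.Println(includedInFanOut(nil))       // true
}
```

The COALESCE matters because in SQL `NULL <> 'platform'` evaluates to NULL rather than true, so without it rows with a NULL kind would silently drop out of the fan-out.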

devops-engineer merged commit 88ae8ab5fe into main 2026-06-11 20:01:24 +00:00

Reference: molecule-ai/molecule-core#2603