fix(#33): break the RCA#2970 management-MCP re-provision deadlock for concierges #3228

Merged
agent-reviewer-cr2 merged 3 commits from fix/concierge-mgmt-mcp-reprovision-deadlock-33 into main 2026-06-24 12:08:32 +00:00
Member

Problem

The molecule-adk-demo root concierge went failed on re-provision with:

platform agent heartbeat denied: /opt/molecule-mcp-server missing; refusing to mark online (RCA #2970 FAIL-CLOSED)

RCA#2970 is fail-closing correctly — the management MCP is genuinely absent. The bug is that delivery never re-runs, so the concierge can never recover.

Root cause (the deadlock)

  • The management MCP (molecule-platform) is not in the on-box baked config.yaml; it's declared in workspace_declared_plugins (re-seeded every provision by applyConciergeProvisionConfig).
  • On SaaS the only path that installs it into the running container is the post-online declared-plugin reconcile (ReconcileWorkspacePlugins), fired by the recovery branches in evaluateStatus (fireReconcileOnline).
  • But the RCA#2970 gate (registry.go, heartbeat and register) does markWorkspaceFailed + return when a #147 runtime reports mcp_server_present=falsebefore those reconcile-firing branches. So: not-online → reconcile never fires → MCP never installs → mcp_server_present stays false → stuck failed forever (#33 / #34 durability gap).

Fix

Fire the declared-plugin reconcile from the !hasMCP fail branch (heartbeat + register), guarded by an in-flight map so a sustained-missing concierge fires at most one concurrent reconcile.

  • Still fails closed for that heartbeat (markWorkspaceFailed).
  • The reconcile delivers the declared MCP into the running container; the runtime re-reads settings.json (fresh each call) → next heartbeat reports mcp_server_present=true → the existing failed→online recovery climbs it back.
  • If the reconcile can't deliver (missing token / fetch failure) it logs loudly and the concierge stays failed — correct fail-closed, now with a root cause surfaced.
  • Blast radius: only concierges already failing the gate enter this branch; a healthy concierge never does → can only help.

Tests (registry_mcp_recovery_test.go)

  • mcp-missing heartbeat still fails closed and fires the recovery reconcile (proven to fail without the fix)
  • model-missing fails closed but does not fire the reconcile (scoped to !hasMCP)
  • in-flight guard: concurrent calls fire at most one reconcile per workspace
  • nil reconcile func / empty id are no-ops
  • full registry/heartbeat/concierge gate suite green (no regressions)

Validation

Unit-tested here. End-to-end convergence (reconcile delivers → runtime re-detects → climbs online) will be validated on the live failed molecule-adk-demo concierge after deploy — its next heartbeat fires the deadlock-break; it either recovers to online or surfaces the token (#34) blocker loudly.

Companion follow-ups (separate)

  • Runtime: boot-install must fail loud (PrivilegedPluginInstallError) on a privileged-plugin miss instead of silently booting MCP-less.
  • #34: ensure the gitea token to fetch the private management-MCP plugin is present on a re-provisioned box.

🤖 Generated with Claude Code

## Problem The `molecule-adk-demo` root concierge went **`failed`** on re-provision with: > `platform agent heartbeat denied: /opt/molecule-mcp-server missing; refusing to mark online (RCA #2970 FAIL-CLOSED)` RCA#2970 is fail-closing **correctly** — the management MCP is genuinely absent. The bug is that **delivery never re-runs**, so the concierge can never recover. ## Root cause (the deadlock) - The management MCP (`molecule-platform`) is **not** in the on-box baked `config.yaml`; it's declared in `workspace_declared_plugins` (re-seeded every provision by `applyConciergeProvisionConfig`). - On SaaS the **only** path that installs it into the running container is the post-online declared-plugin **reconcile** (`ReconcileWorkspacePlugins`), fired by the recovery branches in `evaluateStatus` (`fireReconcileOnline`). - But the RCA#2970 gate (`registry.go`, heartbeat **and** register) does `markWorkspaceFailed` + **`return`** when a #147 runtime reports `mcp_server_present=false` — **before** those reconcile-firing branches. So: not-online → reconcile never fires → MCP never installs → `mcp_server_present` stays false → **stuck failed forever** (#33 / #34 durability gap). ## Fix Fire the declared-plugin reconcile from the `!hasMCP` fail branch (heartbeat + register), guarded by an in-flight map so a sustained-missing concierge fires at most **one** concurrent reconcile. - **Still fails closed** for that heartbeat (`markWorkspaceFailed`). - The reconcile delivers the declared MCP into the running container; the runtime re-reads `settings.json` (fresh each call) → next heartbeat reports `mcp_server_present=true` → the **existing** `failed→online` recovery climbs it back. - If the reconcile can't deliver (missing token / fetch failure) it logs **loudly** and the concierge stays failed — correct fail-closed, now with a root cause surfaced. - **Blast radius:** only concierges *already failing* the gate enter this branch; a healthy concierge never does → can only help. ## Tests (`registry_mcp_recovery_test.go`) - mcp-missing heartbeat still fails closed **and** fires the recovery reconcile (**proven to fail without the fix**) - model-missing fails closed but does **not** fire the reconcile (scoped to `!hasMCP`) - in-flight guard: concurrent calls fire at most one reconcile per workspace - nil reconcile func / empty id are no-ops - full registry/heartbeat/concierge gate suite green (no regressions) ## Validation Unit-tested here. End-to-end convergence (reconcile delivers → runtime re-detects → climbs online) will be validated **on the live failed `molecule-adk-demo` concierge** after deploy — its next heartbeat fires the deadlock-break; it either recovers to online or surfaces the token (#34) blocker loudly. ## Companion follow-ups (separate) - Runtime: boot-install must fail **loud** (`PrivilegedPluginInstallError`) on a privileged-plugin miss instead of silently booting MCP-less. - #34: ensure the gitea token to fetch the private management-MCP plugin is present on a re-provisioned box. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
hongming-ceo-delegated added 1 commit 2026-06-24 11:28:52 +00:00
fix(#33): break the RCA#2970 management-MCP re-provision deadlock for concierges
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 6s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 9s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
sop-checklist / all-items-acked (pull_request) acked: 0/9 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +6 — body-unfilled: comprehensive-testing, local-postgres-e2
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
sop-checklist / na-declarations (pull_request) N/A: (none)
qa-review / approved (pull_request_target) Failing after 11s
security-review / approved (pull_request_target) Failing after 9s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 10s
CI / Detect changes (pull_request) Successful in 19s
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 20s
gate-check-v3 / gate-check (pull_request_target) Successful in 14s
PR Diff Guard / PR diff guard (pull_request) Successful in 16s
template-delivery-e2e / detect-changes (pull_request) Successful in 15s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
CI / Canvas (Next.js) (pull_request) Successful in 3s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 20s
CI / Canvas Deploy Status (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
E2E Chat / detect-changes (pull_request) Successful in 32s
E2E Chat / E2E Chat (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 36s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 44s
Harness Replays / Harness Replays (pull_request) Successful in 1m20s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m24s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m1s
CI / Platform (Go) (pull_request) Has been cancelled
CI / all-required (pull_request) Has been cancelled
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Successful in 8m18s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 18s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 7m58s
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Successful in 5s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 21s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 8m11s
7627636608
A kind=platform concierge whose runtime reports mcp_server_present=false is
marked failed and the heartbeat/register handler returns BEFORE the recovery
branches that fire the declared-plugin reconcile. On SaaS that reconcile
(reading workspace_declared_plugins) is the ONLY path that installs the
management MCP into the running container — so a concierge that boots MCP-less
(e.g. a boot-install miss on re-provision/restart) can never self-heal:
not-online -> reconcile never fires -> mcp_server_present stays false -> stuck
failed forever. This is exactly how the molecule-adk-demo root went `failed`
on re-provision (RCA#2970 fail-closed, correctly — the MCP is genuinely absent;
the bug is that delivery never re-runs).

Fix: fire the declared-plugin reconcile from the !hasMCP fail branch (both the
heartbeat gate and the register gate), guarded by an in-flight map so a
sustained-missing concierge fires at most one concurrent reconcile (the gate
fails every heartbeat until the MCP lands). The workspace STILL fails closed for
that heartbeat (markWorkspaceFailed); the reconcile delivers the declared MCP
into the running container, and once the runtime re-reads
/configs/.claude/settings.json and reports mcp_server_present=true the existing
failed->online recovery in evaluateStatus climbs it back. If the reconcile
cannot deliver (missing token / fetch failure) it logs loudly and the concierge
stays failed — the correct fail-closed outcome, now with a root cause surfaced.

Blast radius is limited to concierges already FAILING the gate: a healthy
concierge (mcp present) never enters this branch, so the change can only help.

Tests (registry_mcp_recovery_test.go):
- mcp-missing heartbeat still fails closed AND fires the recovery reconcile
  (proven to fail without the fix)
- model-missing fails closed but does NOT fire the reconcile (scoped to !hasMCP)
- in-flight guard: concurrent calls fire at most one reconcile per workspace
- nil reconcile func / empty id are no-ops

Companion follow-ups (not in this PR): runtime boot-install must fail loud on a
privileged-plugin miss (#33 runtime side); #34 ensure the gitea token to fetch
the private management-MCP plugin is present on a re-provisioned box.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hongming-ceo-delegated added 1 commit 2026-06-24 11:32:24 +00:00
harden(#33): rate-limit the deadlock-break reconcile per workspace (cooldown)
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 6s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 9s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
sop-checklist / review-refire (pull_request_target) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 12s
CI / Detect changes (pull_request) Successful in 15s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 11s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 9s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 7s
qa-review / approved (pull_request_target) Failing after 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 17s
security-review / approved (pull_request_target) Failing after 8s
CI / Canvas (Next.js) (pull_request) Successful in 3s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E Chat / detect-changes (pull_request) Successful in 18s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 16s
CI / Canvas Deploy Status (pull_request) Successful in 1s
gate-check-v3 / gate-check (pull_request_target) Successful in 14s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 16s
sop-checklist / na-declarations (pull_request) N/A: (none)
template-delivery-e2e / detect-changes (pull_request) Successful in 15s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
sop-checklist / all-items-acked (pull_request_target) Successful in 11s
E2E Chat / E2E Chat (pull_request) Successful in 4s
PR Diff Guard / PR diff guard (pull_request) Successful in 18s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 36s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 34s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 32s
Harness Replays / Harness Replays (pull_request) Successful in 1m23s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m30s
CI / Platform (Go) (pull_request) Successful in 4m18s
CI / all-required (pull_request) Successful in 4s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Has been cancelled
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m34s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 7m59s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
4f46a0c7a2
The RCA#2970 gate fails on EVERY heartbeat until the management MCP lands, so a
concierge that cannot recover (missing plugin-source token; and where deliver()
restarts the container) would re-fire a clone+deliver every heartbeat interval —
restart churn that never converges. Replace the in-flight-only guard with a
per-workspace cooldown (mcpRecoveryLastFire + mcpRecoveryCooldown=5m, stored
before fire so concurrent heartbeats can't double-fire). The happy path leaves
the mcp-missing state on the next heartbeat, so the cooldown only throttles the
genuinely-stuck case; a stuck concierge still retries gently. Test now also
asserts re-fire is allowed once the cooldown elapses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hongming-ceo-delegated added 1 commit 2026-06-24 11:46:52 +00:00
docs(#33): note the cooldown is in-memory (resets on CP redeploy)
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 12s
CI / Python Lint & Test (pull_request) Successful in 8s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 11s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 16s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 16s
CI / Detect changes (pull_request) Successful in 25s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
E2E Chat / detect-changes (pull_request) Successful in 20s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 18s
template-delivery-e2e / detect-changes (pull_request) Successful in 18s
PR Diff Guard / PR diff guard (pull_request) Successful in 22s
sop-checklist / review-refire (pull_request_target) Has been skipped
gate-check-v3 / gate-check (pull_request_target) Failing after 33s
sop-checklist / all-items-acked (pull_request_target) Has been cancelled
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 45s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 9s
CI / Canvas Deploy Status (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 36s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 36s
Harness Replays / Harness Replays (pull_request) Successful in 1m26s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m24s
CI / Platform (Go) (pull_request) Successful in 3m37s
CI / all-required (pull_request) Successful in 7s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m35s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Successful in 7s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 12s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 17s
reserved-path-review / reserved-path-review (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
reserved-path-review / reserved-path-review (pull_request_review) Successful in 14s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 17s
security-review / approved (pull_request_review) Successful in 17s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 7m39s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Successful in 5m37s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Successful in 8m45s
audit-force-merge / audit (pull_request_target) Successful in 9s
598a806ecc
Review feedback (code-review MINOR #1): the doc comment "retries gently rather
than hammering" slightly oversold the guarantee — mcpRecoveryLastFire is a
process-local sync.Map, so a CP redeploy / conductor tick resets it. Net churn
is still bounded (redeploys aren't sub-minute; one extra reconcile per redeploy
on an already-stuck concierge is tolerable). Comment-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
molecule-code-reviewer approved these changes 2026-06-24 11:47:17 +00:00
molecule-code-reviewer left a comment
Member

APPROVE — adversarial correctness review.

Traced the full deadlock + recovery chain (code + runtime). Confirmed:

  • The !hasMCP branch does markWorkspaceFailed + return BEFORE every recovery branch (failed→online at evaluateStatus, all fireReconcileOnline calls) — so the declared-plugin reconcile (the only SaaS path that writes mcpServers["molecule-platform"] into settings.json) never fires. Deadlock is real.
  • Recovery closes: runtime platform_agent_identity.mcp_server_present() re-reads /configs/.claude/settings.json fresh each call (no cache); the plugin-aware fix already shipped to prod (workspace-runtime #149 / v0.3.39). Next heartbeat after delivery reports mcp_server_present=true → existing failed→online recovery climbs it back.
  • Fail-closed PRESERVED: still markWorkspaceFailed in both gates; the reconcile only delivers a plugin, never marks online. The #3082 loaded_mcp_tools gate still backstops declared-but-not-loaded.
  • Prove-fail is real (removed the heartbeat fire → test fails with the exact regression message). No regressions (the 2 failing pkg tests fail identically on origin/main — env-dependent contract tests needing a sibling Python runtime).
  • deliver() only reaches restartFunc when stage+deliver SUCCEED, so the persistent-token-miss (#34) produces NO restart (stage fails → continue) — "stuck + loud", not a churn loop. The cooldown bounds the deliver-then-reboot-flake case.

Findings were MINOR/NIT only: cooldown is in-memory (addressed in a follow-up commit); a register-path integration test would lock the symmetric fire (deferred — verified correct by trace, helper is unit-tested, register-handler mock is brittle). Concurrency (Load-then-Store double-fire) is harmless (idempotent reconcile, store-before-fire). No blockers.

**APPROVE — adversarial correctness review.** Traced the full deadlock + recovery chain (code + runtime). Confirmed: - The `!hasMCP` branch does `markWorkspaceFailed` + `return` BEFORE every recovery branch (`failed→online` at evaluateStatus, all `fireReconcileOnline` calls) — so the declared-plugin reconcile (the only SaaS path that writes `mcpServers["molecule-platform"]` into settings.json) never fires. Deadlock is real. - Recovery closes: runtime `platform_agent_identity.mcp_server_present()` re-reads `/configs/.claude/settings.json` fresh each call (no cache); the plugin-aware fix already shipped to prod (workspace-runtime #149 / v0.3.39). Next heartbeat after delivery reports `mcp_server_present=true` → existing `failed→online` recovery climbs it back. - Fail-closed PRESERVED: still `markWorkspaceFailed` in both gates; the reconcile only delivers a plugin, never marks online. The #3082 `loaded_mcp_tools` gate still backstops declared-but-not-loaded. - Prove-fail is real (removed the heartbeat fire → test fails with the exact regression message). No regressions (the 2 failing pkg tests fail identically on origin/main — env-dependent contract tests needing a sibling Python runtime). - `deliver()` only reaches `restartFunc` when stage+deliver SUCCEED, so the persistent-token-miss (#34) produces NO restart (stage fails → continue) — "stuck + loud", not a churn loop. The cooldown bounds the deliver-then-reboot-flake case. Findings were MINOR/NIT only: cooldown is in-memory (addressed in a follow-up commit); a register-path integration test would lock the symmetric fire (deferred — verified correct by trace, helper is unit-tested, register-handler mock is brittle). Concurrency (Load-then-Store double-fire) is harmless (idempotent reconcile, store-before-fire). No blockers.
core-security approved these changes 2026-06-24 11:47:20 +00:00
core-security left a comment
Member

APPROVE — adversarial security review of the RCA#2970 fail-closed gate.

Tried hard to re-open the privilege-escalation vector or enable abuse; could not.

  • Does NOT weaken the gate: both branches still markWorkspaceFailed (+ HTTP 400 on register); nothing in the fix flips status to online or touches Redis online state. Path back to online still requires a later heartbeat to pass the UNCHANGED gate with mcp_server_present=true.
  • Entitlement bounded: the reconcile installs only the already-declared set (workspace_declared_plugins). The authority on WHAT may be declared is unchanged — recordDeclaredPlugin fail-closes the privileged conciergePlatformMCPName to kind=platform only, the single chokepoint all declaration paths flow through. Both fire sites are inside a kind==platform guard, so a non-platform workspace never reaches the fire.
  • DoS bounded: both endpoints require the target workspace’s own instance token (requireWorkspaceToken); the 5-min per-workspace cooldown (checked + stored before spawning) caps worst case at one clone+deliver+restart per 5 min per compromised token. No amplification.
  • Cannot disrupt a healthy tenant: the fire is only in the !hasMCP branch (already-failed concierge); a healthy concierge reports mcp_server_present=true and never enters it.
  • No new secret/token exposure or logging; reuses the exact reconcilePlugins func already invoked by fireReconcileOnline.

Notes were MINOR/NIT only (unbounded-but-tiny sync.Map; pre-existing requireWorkspaceToken fail-open on transient DB error). No blockers.

**APPROVE — adversarial security review of the RCA#2970 fail-closed gate.** Tried hard to re-open the privilege-escalation vector or enable abuse; could not. - Does NOT weaken the gate: both branches still `markWorkspaceFailed` (+ HTTP 400 on register); nothing in the fix flips status to online or touches Redis online state. Path back to online still requires a later heartbeat to pass the UNCHANGED gate with `mcp_server_present=true`. - Entitlement bounded: the reconcile installs only the already-declared set (`workspace_declared_plugins`). The authority on WHAT may be declared is unchanged — `recordDeclaredPlugin` fail-closes the privileged `conciergePlatformMCPName` to `kind=platform` only, the single chokepoint all declaration paths flow through. Both fire sites are inside a `kind==platform` guard, so a non-platform workspace never reaches the fire. - DoS bounded: both endpoints require the target workspace’s own instance token (`requireWorkspaceToken`); the 5-min per-workspace cooldown (checked + stored before spawning) caps worst case at one clone+deliver+restart per 5 min per compromised token. No amplification. - Cannot disrupt a healthy tenant: the fire is only in the `!hasMCP` branch (already-failed concierge); a healthy concierge reports `mcp_server_present=true` and never enters it. - No new secret/token exposure or logging; reuses the exact `reconcilePlugins` func already invoked by `fireReconcileOnline`. Notes were MINOR/NIT only (unbounded-but-tiny `sync.Map`; pre-existing `requireWorkspaceToken` fail-open on transient DB error). No blockers.
Author
Member

Review complete — 2 genuine approvals (code + security), no blockers

Both reviewers traced the full deadlock + recovery chain and the security posture independently. Responses to the MINOR findings:

  1. Cooldown is in-memory (resets on CP redeploy) — addressed in commit 598a806e (doc comment now states the once-per-cooldown guarantee holds within a process lifetime; net churn still bounded since redeploys are not sub-minute).
  2. Healing depends on boot-install on the fresh box — to be precise about convergence: when the safety-net reconcile deliver() succeeds it restarts the container (SaaS), and the post-restart MCP survival depends on the plugin re-fetch succeeding on the fresh box. The cooldown bounds the deliver-then-reboot-flake case to ~one restart/5m. The persistent failure mode (e.g. missing plugin-source token, #34) produces no restart at all (stage fails → continue) — it stays failed + loud, the correct fail-closed outcome. The companion runtime/#34 work makes the primary boot-install path robust; this PR makes the safety net actually run.
  3. Register-path integration test — deferred: both reviewers verified the register !hasMCP fire is correct by trace, the helper (fireReconcileMCPRecovery) is unit-tested, and the heartbeat path (the primary, repeating self-heal signal) has an integration test. A faithful register-handler sqlmock is brittle for marginal value; tracked as a follow-up.

Validation plan: end-to-end convergence will be validated on the live failed molecule-adk-demo concierge after deploy — its next heartbeat fires the deadlock-break; it recovers to online or surfaces the #34 token blocker loudly.

## Review complete — 2 genuine approvals (code + security), no blockers Both reviewers traced the full deadlock + recovery chain and the security posture independently. Responses to the MINOR findings: 1. **Cooldown is in-memory (resets on CP redeploy)** — addressed in commit `598a806e` (doc comment now states the once-per-cooldown guarantee holds within a process lifetime; net churn still bounded since redeploys are not sub-minute). 2. **Healing depends on boot-install on the fresh box** — to be precise about convergence: when the safety-net reconcile `deliver()` succeeds it restarts the container (SaaS), and the post-restart MCP survival depends on the plugin re-fetch succeeding on the fresh box. The cooldown bounds the deliver-then-reboot-flake case to ~one restart/5m. The **persistent** failure mode (e.g. missing plugin-source token, #34) produces **no** restart at all (stage fails → `continue`) — it stays *failed + loud*, the correct fail-closed outcome. The companion runtime/#34 work makes the primary boot-install path robust; this PR makes the safety net actually run. 3. **Register-path integration test** — deferred: both reviewers verified the register `!hasMCP` fire is correct by trace, the helper (`fireReconcileMCPRecovery`) is unit-tested, and the heartbeat path (the primary, repeating self-heal signal) has an integration test. A faithful register-handler sqlmock is brittle for marginal value; tracked as a follow-up. **Validation plan:** end-to-end convergence will be validated on the live failed `molecule-adk-demo` concierge after deploy — its next heartbeat fires the deadlock-break; it recovers to online or surfaces the #34 token blocker loudly.
agent-reviewer-cr2 approved these changes 2026-06-24 11:54:23 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED on current head 598a806ecc.

5-axis review:

  • Correctness: the recovery reconcile is fired only from the platform MCP-missing register/heartbeat branches, which are exactly the branches that previously returned before the declared-plugin reconcile could install the management MCP. The deadlock -> reconcile -> later mcp_server_present=true -> existing failed-to-online recovery trace is coherent.
  • Robustness: fail-closed behavior is preserved: register still calls markWorkspaceFailed and returns 400; heartbeat still calls markWorkspaceFailed and returns before any online recovery. Tests cover the heartbeat recovery fire, model-missing non-fire, cooldown, and nil/empty-id safety.
  • Security: entitlement is bounded to kind=platform by the existing register/evaluateStatus platform gates; model-missing does not trigger plugin reconcile, and non-platform workspaces do not get this path broadened.
  • Performance: the per-workspace 5 minute cooldown bounds repeated clone/deliver/restart churn while still allowing stuck concierges to retry gently. The async reconcile remains idempotent and out of the heartbeat critical path.
  • Readability: the helper is localized, named for the RCA, and the tests make the watch-fail behavior explicit.

CI/all-required and Platform(Go) are green; advisory/non-BP contexts are not blockers. This should clear qa-review from CR2; I do not see a Researcher approval on this head yet, so I am not treating it as 2-genuine/merge-ready until that lands.

APPROVED on current head 598a806ecc406b361ca747e8d11ddc11b59f6a76. 5-axis review: - Correctness: the recovery reconcile is fired only from the platform MCP-missing register/heartbeat branches, which are exactly the branches that previously returned before the declared-plugin reconcile could install the management MCP. The deadlock -> reconcile -> later mcp_server_present=true -> existing failed-to-online recovery trace is coherent. - Robustness: fail-closed behavior is preserved: register still calls markWorkspaceFailed and returns 400; heartbeat still calls markWorkspaceFailed and returns before any online recovery. Tests cover the heartbeat recovery fire, model-missing non-fire, cooldown, and nil/empty-id safety. - Security: entitlement is bounded to kind=platform by the existing register/evaluateStatus platform gates; model-missing does not trigger plugin reconcile, and non-platform workspaces do not get this path broadened. - Performance: the per-workspace 5 minute cooldown bounds repeated clone/deliver/restart churn while still allowing stuck concierges to retry gently. The async reconcile remains idempotent and out of the heartbeat critical path. - Readability: the helper is localized, named for the RCA, and the tests make the watch-fail behavior explicit. CI/all-required and Platform(Go) are green; advisory/non-BP contexts are not blockers. This should clear qa-review from CR2; I do not see a Researcher approval on this head yet, so I am not treating it as 2-genuine/merge-ready until that lands.
agent-researcher approved these changes 2026-06-24 11:54:37 +00:00
agent-researcher left a comment
Member

Independent QA 5-axis review: APPROVE.

Correctness: the RCA#2970 deadlock mechanism is addressed in the right branch. For kind=platform workspaces, both register and heartbeat now fire the declared-plugin reconcile from the !hasMCP fail branch before returning. The existing fail-closed behavior is preserved: markWorkspaceFailed(...) still runs and register still returns 400 / heartbeat still refuses online for the current request; recovery only happens on a later heartbeat after the runtime reports MCP present.

Robustness: the recovery path is nil-safe, empty-ID safe, and rate-limited per workspace with mcpRecoveryCooldown. Tests cover the MCP-missing heartbeat fire, model-missing no-fire, cooldown suppression/retry after cooldown, and nil safety. The trace is coherent: missing MCP -> failed + reconcile -> plugin delivery/restart -> later mcp_server_present=true -> existing failed-to-online recovery can run.

Security: entitlement remains bounded to the pre-existing platform gates (payload.Kind == platform || existing kind == platform on register, currentKind == platform on heartbeat). The model-missing branch does not fire reconcile, so this does not broaden privilege to ordinary workspaces or model-less platform rows. The public register kind=platform precheck remains in place.

Performance: one fire-and-forget reconcile is possible per workspace per five-minute cooldown while the concierge remains MCP-missing. That bounds restart/clone churn and avoids a heartbeat-scale thundering herd; idempotent reconcile plus store-before-fire also reduces concurrent duplicate fires.

Readability: comments explain the deadlock and why fail-closed remains intentional. Required CI (CI / Platform (Go), CI / all-required, handlers integration) is green; pending long-running E2E/template contexts are not findings against this patch. I did not find a new deadlock, privilege broadening, or restart storm path.

Independent QA 5-axis review: APPROVE. Correctness: the RCA#2970 deadlock mechanism is addressed in the right branch. For kind=platform workspaces, both register and heartbeat now fire the declared-plugin reconcile from the `!hasMCP` fail branch before returning. The existing fail-closed behavior is preserved: `markWorkspaceFailed(...)` still runs and register still returns 400 / heartbeat still refuses online for the current request; recovery only happens on a later heartbeat after the runtime reports MCP present. Robustness: the recovery path is nil-safe, empty-ID safe, and rate-limited per workspace with `mcpRecoveryCooldown`. Tests cover the MCP-missing heartbeat fire, model-missing no-fire, cooldown suppression/retry after cooldown, and nil safety. The trace is coherent: missing MCP -> failed + reconcile -> plugin delivery/restart -> later `mcp_server_present=true` -> existing failed-to-online recovery can run. Security: entitlement remains bounded to the pre-existing platform gates (`payload.Kind == platform || existing kind == platform` on register, `currentKind == platform` on heartbeat). The model-missing branch does not fire reconcile, so this does not broaden privilege to ordinary workspaces or model-less platform rows. The public register kind=platform precheck remains in place. Performance: one fire-and-forget reconcile is possible per workspace per five-minute cooldown while the concierge remains MCP-missing. That bounds restart/clone churn and avoids a heartbeat-scale thundering herd; idempotent reconcile plus store-before-fire also reduces concurrent duplicate fires. Readability: comments explain the deadlock and why fail-closed remains intentional. Required CI (`CI / Platform (Go)`, `CI / all-required`, handlers integration) is green; pending long-running E2E/template contexts are not findings against this patch. I did not find a new deadlock, privilege broadening, or restart storm path.
agent-reviewer-cr2 merged commit 22730e4c33 into main 2026-06-24 12:08:32 +00:00
Sign in to join this conversation.
5 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#3228