fix(#33): break the RCA#2970 management-MCP re-provision deadlock for concierges #3228
Reference in New Issue
Block a user
Delete Branch "fix/concierge-mgmt-mcp-reprovision-deadlock-33"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
The
molecule-adk-demoroot concierge wentfailedon re-provision with:RCA#2970 is fail-closing correctly — the management MCP is genuinely absent. The bug is that delivery never re-runs, so the concierge can never recover.
Root cause (the deadlock)
molecule-platform) is not in the on-box bakedconfig.yaml; it's declared inworkspace_declared_plugins(re-seeded every provision byapplyConciergeProvisionConfig).ReconcileWorkspacePlugins), fired by the recovery branches inevaluateStatus(fireReconcileOnline).registry.go, heartbeat and register) doesmarkWorkspaceFailed+returnwhen a #147 runtime reportsmcp_server_present=false— before those reconcile-firing branches. So: not-online → reconcile never fires → MCP never installs →mcp_server_presentstays false → stuck failed forever (#33 / #34 durability gap).Fix
Fire the declared-plugin reconcile from the
!hasMCPfail branch (heartbeat + register), guarded by an in-flight map so a sustained-missing concierge fires at most one concurrent reconcile.markWorkspaceFailed).settings.json(fresh each call) → next heartbeat reportsmcp_server_present=true→ the existingfailed→onlinerecovery climbs it back.Tests (
registry_mcp_recovery_test.go)!hasMCP)Validation
Unit-tested here. End-to-end convergence (reconcile delivers → runtime re-detects → climbs online) will be validated on the live failed
molecule-adk-democoncierge after deploy — its next heartbeat fires the deadlock-break; it either recovers to online or surfaces the token (#34) blocker loudly.Companion follow-ups (separate)
PrivilegedPluginInstallError) on a privileged-plugin miss instead of silently booting MCP-less.🤖 Generated with Claude Code
APPROVE — adversarial correctness review.
Traced the full deadlock + recovery chain (code + runtime). Confirmed:
!hasMCPbranch doesmarkWorkspaceFailed+returnBEFORE every recovery branch (failed→onlineat evaluateStatus, allfireReconcileOnlinecalls) — so the declared-plugin reconcile (the only SaaS path that writesmcpServers["molecule-platform"]into settings.json) never fires. Deadlock is real.platform_agent_identity.mcp_server_present()re-reads/configs/.claude/settings.jsonfresh each call (no cache); the plugin-aware fix already shipped to prod (workspace-runtime #149 / v0.3.39). Next heartbeat after delivery reportsmcp_server_present=true→ existingfailed→onlinerecovery climbs it back.markWorkspaceFailedin both gates; the reconcile only delivers a plugin, never marks online. The #3082loaded_mcp_toolsgate still backstops declared-but-not-loaded.deliver()only reachesrestartFuncwhen stage+deliver SUCCEED, so the persistent-token-miss (#34) produces NO restart (stage fails → continue) — "stuck + loud", not a churn loop. The cooldown bounds the deliver-then-reboot-flake case.Findings were MINOR/NIT only: cooldown is in-memory (addressed in a follow-up commit); a register-path integration test would lock the symmetric fire (deferred — verified correct by trace, helper is unit-tested, register-handler mock is brittle). Concurrency (Load-then-Store double-fire) is harmless (idempotent reconcile, store-before-fire). No blockers.
APPROVE — adversarial security review of the RCA#2970 fail-closed gate.
Tried hard to re-open the privilege-escalation vector or enable abuse; could not.
markWorkspaceFailed(+ HTTP 400 on register); nothing in the fix flips status to online or touches Redis online state. Path back to online still requires a later heartbeat to pass the UNCHANGED gate withmcp_server_present=true.workspace_declared_plugins). The authority on WHAT may be declared is unchanged —recordDeclaredPluginfail-closes the privilegedconciergePlatformMCPNametokind=platformonly, the single chokepoint all declaration paths flow through. Both fire sites are inside akind==platformguard, so a non-platform workspace never reaches the fire.requireWorkspaceToken); the 5-min per-workspace cooldown (checked + stored before spawning) caps worst case at one clone+deliver+restart per 5 min per compromised token. No amplification.!hasMCPbranch (already-failed concierge); a healthy concierge reportsmcp_server_present=trueand never enters it.reconcilePluginsfunc already invoked byfireReconcileOnline.Notes were MINOR/NIT only (unbounded-but-tiny
sync.Map; pre-existingrequireWorkspaceTokenfail-open on transient DB error). No blockers.Review complete — 2 genuine approvals (code + security), no blockers
Both reviewers traced the full deadlock + recovery chain and the security posture independently. Responses to the MINOR findings:
598a806e(doc comment now states the once-per-cooldown guarantee holds within a process lifetime; net churn still bounded since redeploys are not sub-minute).deliver()succeeds it restarts the container (SaaS), and the post-restart MCP survival depends on the plugin re-fetch succeeding on the fresh box. The cooldown bounds the deliver-then-reboot-flake case to ~one restart/5m. The persistent failure mode (e.g. missing plugin-source token, #34) produces no restart at all (stage fails →continue) — it stays failed + loud, the correct fail-closed outcome. The companion runtime/#34 work makes the primary boot-install path robust; this PR makes the safety net actually run.!hasMCPfire is correct by trace, the helper (fireReconcileMCPRecovery) is unit-tested, and the heartbeat path (the primary, repeating self-heal signal) has an integration test. A faithful register-handler sqlmock is brittle for marginal value; tracked as a follow-up.Validation plan: end-to-end convergence will be validated on the live failed
molecule-adk-democoncierge after deploy — its next heartbeat fires the deadlock-break; it recovers to online or surfaces the #34 token blocker loudly.APPROVED on current head
598a806ecc.5-axis review:
CI/all-required and Platform(Go) are green; advisory/non-BP contexts are not blockers. This should clear qa-review from CR2; I do not see a Researcher approval on this head yet, so I am not treating it as 2-genuine/merge-ready until that lands.
Independent QA 5-axis review: APPROVE.
Correctness: the RCA#2970 deadlock mechanism is addressed in the right branch. For kind=platform workspaces, both register and heartbeat now fire the declared-plugin reconcile from the
!hasMCPfail branch before returning. The existing fail-closed behavior is preserved:markWorkspaceFailed(...)still runs and register still returns 400 / heartbeat still refuses online for the current request; recovery only happens on a later heartbeat after the runtime reports MCP present.Robustness: the recovery path is nil-safe, empty-ID safe, and rate-limited per workspace with
mcpRecoveryCooldown. Tests cover the MCP-missing heartbeat fire, model-missing no-fire, cooldown suppression/retry after cooldown, and nil safety. The trace is coherent: missing MCP -> failed + reconcile -> plugin delivery/restart -> latermcp_server_present=true-> existing failed-to-online recovery can run.Security: entitlement remains bounded to the pre-existing platform gates (
payload.Kind == platform || existing kind == platformon register,currentKind == platformon heartbeat). The model-missing branch does not fire reconcile, so this does not broaden privilege to ordinary workspaces or model-less platform rows. The public register kind=platform precheck remains in place.Performance: one fire-and-forget reconcile is possible per workspace per five-minute cooldown while the concierge remains MCP-missing. That bounds restart/clone churn and avoids a heartbeat-scale thundering herd; idempotent reconcile plus store-before-fire also reduces concurrent duplicate fires.
Readability: comments explain the deadlock and why fail-closed remains intentional. Required CI (
CI / Platform (Go),CI / all-required, handlers integration) is green; pending long-running E2E/template contexts are not findings against this patch. I did not find a new deadlock, privilege broadening, or restart storm path.