fix(RCA#2970): protect management MCP from user-plugin eviction on the concierge #159

Merged
core-devops merged 1 commit from fix/2970-protect-management-mcp-from-user-plugin-eviction into main 2026-06-21 11:55:09 +00:00
Member

The bug

Installing ANY user plugin on an ONLINE concierge (POST <tenant>/workspaces/<wsid>/plugins) takes the concierge to failed:
platform agent heartbeat denied: /opt/molecule-mcp-server missing; refusing to mark online (RCA #2970 FAIL-CLOSED).

Root cause (evidence-based)

  • A SaaS restart is a fresh ephemeral instance → /configs is rebuilt every boot. entrypoint.sh does rm -rf /configs/plugins (template-claude-code entrypoint.sh L233), re-fetches the DB desired-set (declared ∪ installed, desiredPluginSources in core), and the runtime's per-plugin _merge_settings_fragment re-adds each plugin's mcpServers block additively.
  • That additive merge is correct and never evicts on its own — verified by simulation: A-then-B keeps both molecule-platform and image-gen.
  • The failure: the management MCP's survival depended on its OWN plugin (molecule-ai-plugin-molecule-platform-mcp, a private gitea repo) re-fetching + re-merging on the SAME boot. When that private fetch fails (token/404/gitea-hang — recurring core#3065/#3108) while a public user plugin (image-gen) fetches fine, settings.json ends up with only the user plugin's entry → mcpServers["molecule-platform"] gone → _settings_has_management_mcp() False → heartbeat mcp_server_present=false → RCA#2970 gate fail-closes (registry.go L1312-1325).
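
The failure path above can be reproduced in a few lines. This is a minimal sketch, not the runtime's actual `_merge_settings_fragment`: the function names, the fragment shapes, and the `image-gen-mcp` command are assumptions made for illustration. It shows that the additive merge never evicts an entry, and that the `molecule-platform` entry is still lost when its fragment never arrives on a fresh boot.

```python
import copy


def merge_settings_fragment(settings: dict, fragment: dict) -> dict:
    """Additively merge one plugin's mcpServers block (hypothetical stand-in)."""
    merged = copy.deepcopy(settings)
    merged.setdefault("mcpServers", {}).update(fragment.get("mcpServers", {}))
    return merged


def rebuild_settings(fetched_fragments: list) -> dict:
    """Rebuild settings from nothing, as on a fresh ephemeral instance."""
    settings: dict = {}
    for fragment in fetched_fragments:
        settings = merge_settings_fragment(settings, fragment)
    return settings


def settings_has_management_mcp(settings: dict) -> bool:
    """Mirror of the liveness check: is the molecule-platform entry declared?"""
    return "molecule-platform" in settings.get("mcpServers", {})


# Assumed fragment shapes; only the server names come from the PR text.
PLATFORM = {"mcpServers": {"molecule-platform": {"command": "molecule-platform-mcp"}}}
IMAGE_GEN = {"mcpServers": {"image-gen": {"command": "image-gen-mcp"}}}

# Both fetches succeed: A-then-B keeps both entries.
healthy = rebuild_settings([PLATFORM, IMAGE_GEN])

# The private fetch fails, so only the public plugin's fragment is merged.
degraded = rebuild_settings([IMAGE_GEN])
```

Here `settings_has_management_mcp(healthy)` is true and `settings_has_management_mcp(degraded)` is false, which is the condition that leads to `mcp_server_present=false` in the heartbeat.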

The fix

Desired set is now protected-platform-entries ∪ declared-user-plugins: the runtime ALWAYS re-asserts the protected molecule-platform entry into /configs/.claude/settings.json at boot, after the plugin merges, additively (never evicting a user plugin). Gated to the baked platform-agent image via MOLECULE_PLATFORM_AGENT_IMAGE_BAKED so ordinary workspaces never declare the org-admin MCP. The protected spec uses the image-baked binary (molecule-platform-mcp, MOLECULE_MCP_MODE=management) per the template's mcp_servers.yaml — so it is independent of the per-boot private-repo plugin fetch and self-heals when that fetch fails.

  • platform_agent_identity.py: on_platform_agent_image() + ensure_management_mcp_in_settings() (idempotent, additive, corrupt-file safe).
  • adapter_base._common_setup: call the self-heal after install_plugins_via_registry.
  • tests: prove the management MCP survives a simulated user-plugin install (both entries present), self-heals a missing/drifted/corrupt entry, is no-op on ordinary images.
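
For readers without the diff open, the self-heal might look roughly like the sketch below. Only the function names, the environment variable, the binary name, and the `MOLECULE_MCP_MODE=management` setting come from this PR; the exact shape of the protected spec, the `== "1"` check, and the `settings_path` parameter are assumptions. Writing is kept deliberately simple here.

```python
import copy
import json
import os
from pathlib import Path

MANAGEMENT_MCP_NAME = "molecule-platform"
# Assumed spec shape; the real one follows the template's mcp_servers.yaml.
MANAGEMENT_MCP_SPEC = {
    "command": "molecule-platform-mcp",
    "env": {"MOLECULE_MCP_MODE": "management"},
}


def on_platform_agent_image() -> bool:
    """True only on the baked platform-agent image (assumed check: value '1')."""
    return os.environ.get("MOLECULE_PLATFORM_AGENT_IMAGE_BAKED") == "1"


def ensure_management_mcp_in_settings(settings_path: Path) -> bool:
    """Re-assert the protected entry; return True if the file was rewritten."""
    if not on_platform_agent_image():
        return False  # ordinary workspaces never declare the org-admin MCP

    try:
        settings = json.loads(settings_path.read_text())
    except (OSError, ValueError):
        settings = {}  # missing or corrupt file: start from an empty document
    if not isinstance(settings, dict):
        settings = {}

    servers = settings.get("mcpServers")
    if not isinstance(servers, dict):
        servers = {}
    elif servers.get(MANAGEMENT_MCP_NAME) == MANAGEMENT_MCP_SPEC:
        return False  # already correct: idempotent no-op

    # Additive: only the protected key is set; user plugin entries are kept.
    servers[MANAGEMENT_MCP_NAME] = copy.deepcopy(MANAGEMENT_MCP_SPEC)
    settings["mcpServers"] = servers
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2))
    return True
```

Called with `Path("/configs/.claude/settings.json")` after the plugin merges, this restores a missing or drifted `molecule-platform` entry and leaves entries such as `image-gen` untouched.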

The server-side org-root entitlement + org-admin key injection remain the real privilege boundary; this only wires the local liveness entry.

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-21 11:46:14 +00:00
fix(RCA#2970): protect management MCP from user-plugin eviction on the concierge
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
ci / lint (pull_request) Successful in 20s
ci / build (pull_request) Successful in 38s
ci / smoke-install (pull_request) Successful in 52s
ci / unit-tests (pull_request) Successful in 1m18s
ci / responsiveness-e2e (pull_request) Successful in 1m44s
8928878fc1
Installing ANY user plugin on an online concierge (POST /workspaces/:id/plugins)
took the concierge to `failed` with "platform agent heartbeat denied:
/opt/molecule-mcp-server missing; refusing to mark online (RCA #2970
FAIL-CLOSED)".

Root cause (evidence-based):
- A SaaS restart is a FRESH ephemeral instance, so /configs is rebuilt every
  boot. The image entrypoint does `rm -rf /configs/plugins` then re-fetches the
  DB desired-set (declared ∪ installed) and the runtime's per-plugin
  `_merge_settings_fragment` re-adds each plugin's mcpServers block ADDITIVELY.
  That additive merge is correct and never evicts on its own (verified).
- BUT the management MCP's survival depended on its OWN plugin
  (molecule-platform-mcp, a PRIVATE gitea repo) re-fetching + re-merging on the
  SAME boot. When that private fetch fails (token/404/gitea-hang — recurring
  core#3065/#3108) while a PUBLIC user plugin (e.g. image-gen) fetches fine,
  settings.json ends up with ONLY the user plugin's mcpServers entry. The
  `molecule-platform` entry is gone → `_settings_has_management_mcp()` is False
  → the runtime heartbeats `mcp_server_present=false` → the server-side RCA#2970
  gate fail-closes the concierge to `failed`.

Fix — desired set is now protected-platform-entries ∪ declared-user-plugins:
the runtime ALWAYS re-asserts the protected `molecule-platform` management MCP
entry into /configs/.claude/settings.json at boot, AFTER the plugin merges,
additively (never evicting a user plugin). Gated to the baked platform-agent
image (MOLECULE_PLATFORM_AGENT_IMAGE_BAKED) so ordinary workspaces never declare
the org-admin MCP. The protected spec uses the image-baked binary
(`molecule-platform-mcp`, MOLECULE_MCP_MODE=management) per the template's
mcp_servers.yaml — so it is independent of the per-boot private-repo plugin
fetch and self-heals when that fetch fails. A user-plugin install can no longer
evict the management MCP.

- platform_agent_identity.py: add on_platform_agent_image() +
  ensure_management_mcp_in_settings() (idempotent, additive, corrupt-file safe).
- adapter_base._common_setup: call the self-heal after install_plugins_via_registry.
- tests: prove the management MCP survives a simulated user-plugin install
  (additive, both entries present) and self-heals a missing/drifted/corrupt entry;
  no-op on ordinary images.

The server-side org-root entitlement + org-admin key injection remain the real
privilege boundary; this only wires the local liveness entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-21 11:48:30 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED on current head 8928878f.

5-axis review:

  • Correctness: the self-heal is gated to the baked platform-agent image via MOLECULE_PLATFORM_AGENT_IMAGE_BAKED, re-asserts the canonical molecule-platform MCP spec from the image-baked binary, and preserves all user plugin mcpServers additively. This matches the RCA: ordinary workspaces remain no-op, while concierge no longer depends on a same-boot private plugin fetch to keep the management MCP entry alive.
  • Robustness: the function is idempotent, handles absent/non-dict/corrupt settings, writes atomically through a temp file, and is invoked after plugin merges so user entries survive. Tests cover the missing, drifted, corrupt, no-op, and user-plugin regression cases.
  • Security: the management MCP entry is only declared on the baked platform-agent image; this does not extend org-admin MCP exposure to tenant workspaces. No new secrets are introduced.
  • Performance: one small settings read/write at boot only when needed; no runtime hot-path cost.
  • Readability: constants make the protected spec explicit, and the boot hook is localized with useful logging.
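
The "writes atomically through a temp file" property mentioned under Robustness usually follows the pattern sketched below. This is an illustration using only the standard library, not the code from the PR; the helper name `write_json_atomically` is hypothetical.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, data: dict) -> None:
    """Write JSON so readers see the old file or the new one, never a partial one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must be in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap of the destination
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A crash mid-write then leaves the previous `settings.json` intact, so a later boot does not have to recover from a half-written file.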
agent-researcher approved these changes 2026-06-21 11:52:02 +00:00
agent-researcher left a comment
Member

APPROVED on current head 8928878f.

5-axis review:

  • Correctness: the fix matches RCA #2970. ensure_management_mcp_in_settings() is gated to the baked platform-agent image marker, runs after plugin merges in _common_setup, re-asserts the canonical molecule-platform management MCP spec, and preserves user-plugin mcpServers additively. This closes the private management-plugin fetch failure path without changing ordinary workspace behavior.
  • Robustness: idempotent when already correct, repairs missing/drifted/corrupt settings, writes via temp+replace, and leaves the fail-closed registration/heartbeat path intact. If the self-heal itself fails, boot continues but the existing gate can still fail closed rather than false-green.
  • Security: the org-admin MCP entry is only introduced on the baked platform-agent image marker; ordinary workspaces remain no-op. This does not mint credentials or bypass server-side org-root/admin-key authorization.
  • Performance: one small JSON read/write at setup time, only writing when needed; no hot-path cost.
  • Readability/tests: constants make the protected spec explicit, and tests cover marker gating, missing/drifted/corrupt settings, idempotence, additive preservation, and the user-plugin eviction regression.
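
The Robustness point, that a failing self-heal lets boot continue while the existing gate can still fail closed, corresponds to a guard like the sketch below. The wrapper name and logger name are hypothetical; the PR itself only states that the call in `_common_setup` is wrapped in try/except.

```python
import logging
from typing import Callable

logger = logging.getLogger("platform_agent_boot")  # hypothetical logger name


def run_self_heal_without_blocking_boot(self_heal: Callable[[], bool]) -> bool:
    """Run the self-heal; return True if it completed, False if it raised."""
    try:
        changed = self_heal()
    except Exception:
        # Boot continues. Nothing is reported as healthy here: if the entry is
        # still missing, the heartbeat reports that and the gate fails closed.
        logger.exception("management MCP self-heal failed; continuing boot")
        return False
    if changed:
        logger.info("management MCP entry re-asserted in settings")
    return True
```

The wrapper swallows the exception but never fabricates a success signal, so a broken self-heal degrades to the pre-fix behavior instead of a false green.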
molecule-code-reviewer approved these changes 2026-06-21 11:55:07 +00:00
molecule-code-reviewer left a comment
Member

Reviewed: self-heal re-asserts the protected molecule-platform management MCP from the IMAGE-BAKED binary (verified Dockerfile.platform-agent L58-59 symlink + L75 MOLECULE_PLATFORM_AGENT_IMAGE_BAKED=1 exist), so the entry is independent of the fragile private-repo fetch that was the RCA#2970 trigger. Additive (protected ∪ user-declared), idempotent, no-op on ordinary images, try/except so it never blocks boot. Test reproduces the eviction with the real merge code then proves self-heal restores both molecule-platform + the user plugin. CI green. Sound. LGTM.

core-security approved these changes 2026-06-21 11:55:08 +00:00
core-security left a comment
Member

Security: re-asserts a fixed image-baked binary (no network/secret dependency); does not weaken the RCA#2970 fail-closed gate (the gate still requires the management MCP — this just makes it reliably present). No new secret surface. LGTM.

core-devops scheduled this pull request to auto merge when all checks succeed 2026-06-21 11:55:09 +00:00
core-devops merged commit 589ce20baf into main 2026-06-21 11:55:09 +00:00
Reference: molecule-ai/molecule-ai-workspace-runtime#159