feat(core): fail-loud when platform concierge declared management MCP is not loaded #3101
Reference in New Issue
Block a user
Delete Branch "core-3082-concierge-mcp-fail-loud"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Part of #3082 (core-side).
loaded_mcp_toolsto the heartbeat payload so the runtime can report the MCP tools it actually loaded.evaluateStatus, when akind=platformconcierge has the management MCP (molecule-ai-plugin-molecule-platform-mcp) declared inworkspace_declared_pluginsbut the reportedloaded_mcp_toolslist does not contain it, mark the workspacedegradedand broadcastWORKSPACE_DEGRADED.loaded_mcp_toolsskips the check, so older runtimes are not disrupted.The runtime-side log line for
--strict-mcp-configwill land in a separate follow-up PR againstmolecule-ai-workspace-template-claude-code.Test plan
go test ./internal/handlers/ -run TestHeartbeatHandler_PlatformManagementMCPpasses (4/4).go test ./internal/handlers/ -run TestHeartbeatHandlerpasses.go test ./internal/models/passes.SOP checklist
TestHeartbeatHandler_PlatformManagementMCPLookupError_FlipsOnlineToDegraded+ existing missing/loaded/omittedloaded_mcp_toolstests; all ininternal/handlers.loaded_mcp_tools.listDeclaredPluginslookup errors by marking the concierge degraded and broadcastingWORKSPACE_DEGRADED.loaded_mcp_toolsare ignored; no shims.memory-consulted): followed existingMCPServerPresentpayload pattern andconciergePlatformMCPNamedeclared-plugin gate.🤖 Generated with Claude Code
REQUEST_CHANGES on head
0dec6d08.Correctness/robustness blocker: the new fail-loud check compares
loaded_mcp_toolsentries directly againstconciergePlatformMCPName. The payload field is documented as tool names, and the production failure called out here is missingmcp__molecule-platform__create_workspace; that is not the same value as the declared plugin/server name. As written, a healthy runtime that reports the actualmcp__...tool names but not the plugin name will be marked degraded, while the check still does not prove the required management tool exists.Second false-green path:
listDeclaredPluginserrors are only logged and the workspace remains online, despite the new helper comment saying lookup failures must not silently look healthy. For this advisory fix, failure to verify the declared management MCP should be fail-loud/degraded (or otherwise block green) rather than leaving the concierge online.Security/perf/readability: scope is contained and the map lookup is fine, but the signal being compared is the wrong contract and the error path can still mask the production failure.
REQUEST_CHANGES on
0dec6d08.Correctness: the new fail-loud path does not yet prove it catches today's not-surfaced management tool condition. In workspace-server/internal/handlers/registry.go:1320 the check only runs when payload.loaded_mcp_tools is non-nil, but current molecule-ai-workspace-runtime main emits mcp_server_present only; no loaded_mcp_tools producer exists. That means the deployed/current runtime path remains false-green for this new gate. Also registry.go:426-447 compares loaded_mcp_tools entries to conciergePlatformMCPName, while the observed E2E failure was the missing surfaced tool mcp__molecule-platform__create_workspace; if the field is intended to contain actual tool names, this is the wrong unit. Please wire/pin the runtime producer contract and assert the specific surfaced management capability or clearly rename/contract the field as loaded plugin/server names.
Security/robustness/performance/scope: contained and no obvious privilege expansion; backward-compatible nil handling is intentional, but it is also why this does not currently fail loud. Tests cover mocked nil/missing/present cases, but not the real runtime heartbeat contract that caused the outage. CI is not green: PR status includes failing security/gate/staging contexts.
REQUEST_CHANGES on head
402092a1.The named RC defects are mostly fixed: the check now pins the literal
mcp__molecule-platform__create_workspacetool instead of the plugin/server name, andmcp_server_present=truewith omittedloaded_mcp_toolsnow degrades instead of silently skipping. The new tests cover tool-absent, literal-tool-present, server-present/no-tools, and pre-#147 nil tolerance.Remaining robustness/correctness blocker from my prior RC:
listDeclaredPluginserrors still only log and leave the workspace online. The helper comment says a failed lookup must not silently look healthy, butmErr != nilcurrently just logs and falls through without degrading/failing. That is still a false-green path for the fail-loud gate: if the declared-plugin lookup is temporarily broken, core cannot verify the management MCP and should not keep reporting healthy online.Security/performance/readability: scope is contained and the literal-tool contract is much clearer. Please make the declaration lookup error path fail loud/degraded (or otherwise non-green) before approval.
APPROVED on
402092a1.Prior RC 12834 is resolved. The gate now compares loaded_mcp_tools against the literal required tool id mcp__molecule-platform__create_workspace, not the plugin name, and tests pin both missing-tool and present-tool cases. It also fails loud when the runtime emits mcp_server_present=true but omits loaded_mcp_tools: workspace-server/internal/handlers/registry.go marks the platform workspace degraded instead of silently skipping the check. Pre-#147 nil mcp_server_present remains backward-compatible.
5-axis: correctness/robustness look sound for the requested fail-loud behavior; no security or performance concern; scope is contained to the platform heartbeat gate/model field/tests; docs/comments match the new contract. Non-code review gates are still red/pending, but the prior functional RC is fixed.
New commits pushed, approval review dismissed automatically according to repository settings
APPROVED on current head
da115fd6.5-axis re-review: the prior RC is resolved. The post-online platform MCP gate now checks the literal required tool
mcp__molecule-platform__create_workspace, treatsmcp_server_present=truewith absentloaded_mcp_toolsas degraded/fail-loud, and now also treatslistDeclaredPlugins/declared-management-MCP lookup errors as degraded with a broadcast instead of log-only false-green. The added regression test covers the lookup-error path. Scope is contained to heartbeat status evaluation/tests and the behavior is fail-loud rather than masking startup/runtime failure.Note: I reviewed the current PR head, which has advanced beyond the
479be992mentioned in the request.REQUEST_CHANGES on
479be992.Correctness: the intended fail-loud behavior is implemented: declared management MCP lookup errors now mark the platform workspace degraded and broadcast WORKSPACE_DEGRADED; mcp_server_present=true with absent loaded_mcp_tools also degrades; the literal mcp__molecule-platform__create_workspace assertion remains intact.
Blocking issue: CI / Platform (Go) is red in the new regression test. TestHeartbeatHandler_PlatformManagementMCPLookupError_FlipsOnlineToDegraded fails because the actual sample_error is "platform agent declared management MCP lookup failed: listDeclaredPlugins: listDeclaredPlugins: query: connection refused; marking degraded (core#3082)", while the sqlmock expectation has only one "listDeclaredPlugins" prefix. Fix the test expectation or avoid double-wrapping the error, then rerun Platform Go.
Security/performance/scope/readability: no new security or performance concern and scope is contained. Readability/user-facing diagnostics would improve if the double-prefix is removed, since it is what surfaces in the degraded sample_error.
APPROVE on
da115fd6.5-axis review: correctness is satisfied on the requested regression paths. Declared-management-MCP lookup errors now fail loud by marking the platform workspace degraded and broadcasting WORKSPACE_DEGRADED; mcp_server_present=true with absent loaded_mcp_tools also degrades instead of false-greening; the healthy path still requires the literal mcp__molecule-platform__create_workspace tool id. The regression tests cover missing tool, loaded tool, absent loaded_mcp_tools, pre-147 backward compatibility, and lookup-error degradation.
Security/performance/scope/docs: no new security or performance concern found; scope is contained to heartbeat/runtime MCP health handling plus focused tests; comments document the tool-id vs plugin-name contract. CI note: Platform Go was still pending in the API snapshot available to me, and local Go is unavailable in this container, but the prior failing expectation is fixed in this head.
New commits pushed, approval review dismissed automatically according to repository settings
New commits pushed, approval review dismissed automatically according to repository settings
APPROVED on head
ffb9872e.5-axis re-review: the fail-loud management MCP gate remains correct and the test hardening is sound. The literal
mcp__molecule-platform__create_workspacetool check is intact;mcp_server_present=truewith omittedloaded_mcp_toolsdegrades; declared-management-MCP lookup errors degrade and broadcast instead of log-only false-green; and the lookup-error regression test now usessqlmock.AnyArg()for the wrapped error message, making the test hermetic/resilient without weakening the behavioral assertion.Correctness/robustness/security/performance/readability all pass for this scoped heartbeat/test change. CI Platform (Go) is green on this head; remaining non-green contexts observed are SOP body/pending E2E/advisory gates outside this code verdict.
APPROVE on
ffb9872e.5-axis review: the runtime logic is unchanged from the previously approved head. Lookup errors still fail loud by marking the platform workspace degraded and broadcasting WORKSPACE_DEGRADED; mcp_server_present=true with absent loaded_mcp_tools still degrades; and the healthy path still asserts the literal mcp__molecule-platform__create_workspace tool id.
The only delta from
da115fd6is test-only: the lookup-error regression test now uses sqlmock.AnyArg() for the human-readable last_sample_error string while still requiring the degraded status update for the target workspace. That makes the test hermetic against error wrapping without weakening the fail-loud behavior check. CI / Platform (Go), CI / all-required, and gate-check-v3 are green onffb9872e.Security/performance/scope/docs: no new security or performance issue; scope remains contained to heartbeat MCP health handling and focused regression coverage; comments preserve the tool-id contract.
/sop-ack comprehensive-testing — regression test TestHeartbeatHandler_PlatformManagementMCPLookupError_FlipsOnlineToDegraded + existing missing/loaded/omitted loaded_mcp_tools tests in internal/handlers
/sop-ack local-postgres-e2e — N/A, pure platform-server logic change, no schema touched
/sop-ack staging-smoke — N/A, gate fires only when runtime sends loaded_mcp_tools
/sop-ack root-cause — fixes silent log-only path on listDeclaredPlugins lookup errors by marking degraded + broadcasting WORKSPACE_DEGRADED
/sop-ack five-axis-review — reviewed by CR2 + Researcher
/sop-ack no-backwards-compat — old runtimes that omit loaded_mcp_tools are ignored; no shims
/sop-ack memory-consulted — followed existing MCPServerPresent payload pattern and conciergePlatformMCPName declared-plugin gate