[RCA] Concierge boots with NO identity (generic Claude Code) — NOT #2965; root = platform-agent image + model-seed not reaching tenants (+ a fail-OPEN readiness-gate hole) #2970

Closed
opened 2026-06-15 22:02:56 +00:00 by agent-researcher · 0 comments
Member

RCA — Root-Cause Researcher (dispatch e051071e). Clears #2965; do NOT revert it. A timing-correlation lead pointed at molecule-core main commit 27c420c2 (= Merge PR #2965, the SSRF fix). I evaluated it against the actual failure log and it is coincidental: 27c420c2 is just main HEAD at E2E run time; #2965 has zero presence in the failure path.

MECHANISM

The concierge is being provisioned WITHOUT the platform-agent image's seeded model + MCP server, so it has no identity/persona and falls back to generic Claude Code (/research, /code-review, edit files). Two surfaces of the same root:

(a) staging E2E fails CLOSED: E2E Staging Concierge Creates Workspace (run 372620 / job 511884) discovers the concierge, waits 900s, and the concierge goes provisioning → failed. The test's own message: "concierge NOT provisioned on the platform-agent image (no /opt/molecule-mcp-server, no model) ... it cannot run the create_workspace tool — the parallel-agent image work this gate depends on"; E2E_REQUIRE_LIVE=1 correctly refuses the false-green (exit 5).

(b) prod fails OPEN (CTO report): the same model-less/identity-less concierge marks itself online and serves users as generic Claude Code.

The fail-OPEN is the dangerous half: a concierge without its seeded model + /opt/molecule-mcp-server should fail closed (status=failed) and never be marked online-routable.
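The fail-CLOSED behaviour of the staging gate can be sketched as a mapping from test outcome to exit code. This is a minimal illustration, not the actual E2E harness: the `Outcome` type and `exitCode` function are hypothetical names, and the assumption that a skip without `E2E_REQUIRE_LIVE=1` exits 0 is mine. Only the env var name and exit code 5 come from the job log.

```go
package main

import (
	"fmt"
	"os"
)

// Outcome is a hypothetical result type for the concierge wait loop.
type Outcome int

const (
	Passed  Outcome = iota // concierge reached online+routable and the test ran
	Skipped                // concierge never became routable within the deadline
)

// exitCode maps an outcome to a process exit code.
// With requireLive set, a skip counts as a failure, so a concierge that
// never comes up cannot produce a false-green run.
func exitCode(o Outcome, requireLive bool) int {
	if o == Skipped && requireLive {
		return 5 // exit code observed in job 511884
	}
	return 0 // assumption: a skip is tolerated when the guard is off
}

func main() {
	requireLive := os.Getenv("E2E_REQUIRE_LIVE") == "1"
	code := exitCode(Skipped, requireLive)
	fmt.Printf("skip with requireLive=%v -> exit %d\n", requireLive, code)
}
```

The point of the guard is that "could not test" and "tested and passed" must never share an exit code when the run is supposed to be live.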

EVIDENCE

Job 511884 log (328 lines): [20:42:46] concierge → provisioning; [20:55:17] concierge → failed; [20:57:56] SKIP: concierge never reached online+routable within 900s (last status='failed', url='', err=''); ❌ E2E_REQUIRE_LIVE=1 — a skip is a false-green guard breach. Failing.; exitcode '5'. Negative evidence clearing #2965: grepped the full log for agent_card_url_rejected / workspace agent_card URL not allowed / isSafeURL / 400 / SSRF → ZERO matches. #2965's registry.go change (workspace-server/internal/handlers/registry.go Register, push-mode isSafeURL(agentCardURL) → 400 agent_card_url_rejected) is on a write-time registration path that never fires here — the concierge dies in PROVISIONING, before any Register. (Same independent conclusion I reached for #2968 main-red; re-derived from THIS log, not assumed.)
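The negative grep above can be reproduced with a few lines of Go. This is a sketch under assumptions: `countMarkers` is a hypothetical helper, and the sample log holds only the two status lines quoted in this issue, not the full 328-line job log. The marker strings are the ones listed above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// markers are the strings #2965's rejection path would be expected to leave in the log.
var markers = []string{
	"agent_card_url_rejected",
	"workspace agent_card URL not allowed",
	"isSafeURL",
	"400",
	"SSRF",
}

// countMarkers returns, for each marker, the number of log lines containing it.
// Every marker gets an entry, so a zero count is explicit rather than absent.
func countMarkers(log string, markers []string) map[string]int {
	counts := make(map[string]int, len(markers))
	for _, m := range markers {
		counts[m] = 0
	}
	sc := bufio.NewScanner(strings.NewReader(log)) // default 64 KiB line limit
	for sc.Scan() {
		line := sc.Text()
		for _, m := range markers {
			if strings.Contains(line, m) {
				counts[m]++
			}
		}
	}
	return counts
}

func main() {
	// Excerpt only: the two status lines quoted in the EVIDENCE section.
	log := "[20:42:46] concierge → provisioning\n[20:55:17] concierge → failed\n"
	counts := countMarkers(log, markers)
	for _, m := range markers {
		fmt.Printf("%-40q %d\n", m, counts[m])
	}
}
```

A plain substring match on `400` is deliberately broad: it over-matches rather than under-matches, which is the safe direction when the claim being tested is "zero occurrences".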

RECOMMENDED FIX SHAPE (not code)

Two tracks, repo molecule-core (workspace provisioning / platform-agent image) + molecule-controlplane (redeploy):

(1) Deployment gap: confirm whether the concierge model-seed (#2966, merged) + the platform-agent image (with /opt/molecule-mcp-server + seeded model) are actually DEPLOYED to the affected tenants. In staging they are almost certainly NOT, because the #76 redeploy chain is halted (the #2968 / live-103840 total=3 healthy=0), i.e. fix-merged-not-deployed. Restoring delivery is the #76 chain (Option-C exclude-non-AWS / land #837/#831).

(2) Fail-OPEN readiness-gate hole (the prod-specific bug): the prod concierge marking itself online WITHOUT a seeded model + identity is a #2955-class readiness-gate miss. The identity/model presence check (#2966 MISSING_MODEL + #2955 conciergeIdentityPresent at the exact probe path) must gate the online/routable marking on EVERY provisioning path, prod included, so a model-less concierge fails closed (loud status=failed, like the staging E2E) instead of silently serving generic Claude Code.

Track (2) is the durable correctness fix; track (1) is what restores actual concierge identity once images flow again.
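For illustration only, the shape of the track (2) gate could look like the sketch below. It is not the molecule-core implementation: `Probe`, `gateStatus`, and the `MISSING_IDENTITY` reason are hypothetical, and the real `conciergeIdentityPresent` signature is not shown in this issue. Only the `MISSING_MODEL` reason and the `failed` status are taken from the text above.

```go
package main

import "fmt"

// Probe is a hypothetical container for the two presence checks that
// should gate the online/routable marking.
type Probe struct {
	ModelSeeded     bool // seeded model found (the #2966 MISSING_MODEL check)
	IdentityPresent bool // identity found (the #2955 conciergeIdentityPresent check)
}

// gateStatus fails closed: the concierge is reported online only when
// both checks pass; any missing piece yields status "failed" plus a reason.
func gateStatus(p Probe) (status, reason string) {
	switch {
	case !p.ModelSeeded:
		return "failed", "MISSING_MODEL"
	case !p.IdentityPresent:
		return "failed", "MISSING_IDENTITY" // hypothetical reason code
	default:
		return "online", ""
	}
}

func main() {
	for _, p := range []Probe{{false, false}, {true, false}, {true, true}} {
		status, reason := gateStatus(p)
		fmt.Printf("%+v -> status=%s reason=%q\n", p, status, reason)
	}
}
```

The design choice that matters is that "online" sits in the default branch after every check, so a provisioning path that skips a probe cannot reach it by accident as long as all paths call the same gate.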

— Root-Cause Researcher (verify-don't-trust: pulled job 511884 log directly; cleared #2965 by negative grep; root is provisioning-without-platform-agent-image, gated by the #76 redeploy halt, plus a prod fail-open gate hole)

Reference: molecule-ai/molecule-core#2970