Platform agent broken = SSOT violation (bespoke provisioning block, not the normal workspace path) — provision it as a normal workspace + platform MCP #2495

Closed
opened 2026-06-09 20:29:33 +00:00 by core-devops · 3 comments
Member

Summary

Dogfood-piloting the dedicated platform-agent on the agents-team org (2026-06-09) surfaced a blocker: the published molecule-platform-agent image is NOT a working concierge. The Phase-1 re-parenting mechanism works perfectly, but the agent the image boots is a bare langgraph adapter with no management MCP — so it cannot be the org concierge. This is almost certainly why the pin was never auto-promoted.

What worked (Phase 1 install/re-parent)

POST /admin/org/platform-agent (with the deterministic id cd62fe70… for org 2355b568…) returned {"kind":"platform","status":"installed"} and correctly:

  • upserted the platform row to kind='platform', parent_id=NULL,
  • re-parented the org's existing roots (CEO Assistant + SEO Agent) under it,
  • migrated org_api_tokens / org_plugin_allowlist (0 rows here, but the code path ran).

So topology-wise the dedicated agent became the sole root exactly as designed.
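
The re-parenting outcome described above can be modelled in a few lines. This is only an illustrative sketch, not the platform's implementation: `Workspace`, `install_platform_agent`, and `roots` are hypothetical names, the short ids are placeholders, and the in-memory dict stands in for the workspaces table. Only the `kind='platform'`, `parent_id=NULL` result and the response shape come from the report.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Workspace:
    # Hypothetical in-memory stand-in for one workspaces row.
    id: str
    name: str
    kind: str = "workspace"
    parent_id: Optional[str] = None


def install_platform_agent(rows: Dict[str, Workspace], platform_id: str,
                           name: str = "platform-agent") -> dict:
    """Upsert the platform row as a root, then re-parent every other root under it."""
    platform = rows.get(platform_id)
    if platform is None:
        platform = Workspace(id=platform_id, name=name)
        rows[platform_id] = platform
    platform.kind = "platform"
    platform.parent_id = None
    for ws in rows.values():
        # Any other row without a parent is an existing root and moves under the platform agent.
        if ws.id != platform_id and ws.parent_id is None:
            ws.parent_id = platform_id
    return {"kind": platform.kind, "status": "installed"}


def roots(rows: Dict[str, Workspace]) -> List[str]:
    """Names of all rows that have no parent."""
    return sorted(ws.name for ws in rows.values() if ws.parent_id is None)
```

Because the function only touches rows whose `parent_id` is empty, calling it a second time leaves the topology unchanged, which is the property one would want from an upsert-style install.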

The blocker — image boots as langgraph, no MCP

Container molecule-platform-agent from molecule-platform-agent@sha256:0e35ac83… (sha-4094242, the current latest):

  • registers with runtime=langgraph, NOT claude-code — and it ignores RUNTIME=claude-code passed as env.
  • ADAPTER_MODULE=adapter is baked into the image → that, not RUNTIME/config, drives the runtime selection.
  • the seeded /configs/config.yaml (name: Org Concierge, runtime: claude-code, mcp_servers: [{name: platform, command: molecule-mcp}]) is not applied: registered name stays the raw UUID, runtime stays langgraph.
  • no molecule-mcp / claude process is running in the container (ps shows none) — the 87-tool platform MCP is NOT loaded, even though the binary exists at /usr/local/bin/molecule-mcp.

Net: the homepage concierge would be an MCP-less langgraph agent named with a UUID — worse than the existing root. I rolled the pilot back (restored CEO Assistant + SEO Agent as roots, removed the container) so the live org isn't left degraded.

Secondary finding (CP provisioning block)

buildTenantUserDataSM's start_platform_agent block (controlplane internal/provisioner/ec2.go) does not set RUNTIME/MODEL env and relies on config.yaml's runtime: field — which the image ignores. Even once the image is fixed, the block should set RUNTIME=claude-code explicitly (matching the workspace boot's molecule.env).

Required fix (blocks activation)

  1. Build molecule-platform-agent so it runs claude-code with the concierge config.yaml honored (name + mcp_servers), i.e. NOT ADAPTER_MODULE=adapter/langgraph. The molecule-mcp must spawn as an MCP server.
  2. Have the CP start_platform_agent block set RUNTIME=claude-code (and MODEL) explicitly.
  3. Re-pilot on agents-team, verify the concierge responds + can list_workspaces via the platform MCP, THEN promote the pin.

Repro + rollback evidence available; ping me. Relates to RFC docs/design/rfc-platform-agent.md and CP #658 (the fail-closed gate, which correctly stays a no-op until a pin is promoted).

🤖 Generated with Claude Code

Author
Member

CORRECTION + full RCA (deeper dogfood on agents-team, 2026-06-09)

My initial report ("boots as langgraph, no MCP") was wrong — apologies. Drilling in (got molecule-mcp returning its tool list live), the real picture:

The image is fine — it runs claude-code, and the MCP works

  • Runtime: claude-code (Claude Code) in the boot logs. The langgraph I saw is just a stale workspaces.runtime label from registration; get_adapter runs the executor via ADAPTER_MODULE regardless, so the executor is claude-code. The a1e0c7e readiness-gate works (it raised _McpNotReadyError exactly as designed).
  • molecule-mcp (Python, node20 fine) connects and returns its tools once given the right env — {"jsonrpc":"2.0","result":{"tools":[{"name":"delegate_task",…}]}}. The 88-tool MCP is real and functional.

The actual blockers (provisioning-env gaps in the CP start_platform_agent block)

  1. Missing MOLECULE_WORKSPACE_TOKEN. The block passes MOLECULE_API_KEY=$ADMIN_TOKEN, but molecule-mcp (and registration) authenticate as the workspace — they need a per-workspace token (POST /admin/workspaces/:id/tokens), not the org admin token. With the admin token: molecule-mcp: register rejected HTTP 401 + agent Register: HTTP 401. With a minted workspace token: both succeed (Registered with platform: 200).
  2. /configs not writable by uid 1000. The block mkdirs it as root (0755); the container runs as agent (1000), so molecule-mcp fails to save /configs/.platform_inbound_secret (EPERM). Needs chown 1000:1000 (and CONFIGS_DIR=/configs).
  3. Wrong LLM token. The block sets ANTHROPIC_AUTH_TOKEN=$ADMIN_TOKEN — not a valid LLM token (llm-auth: unrecognised prefix). It needs a real LLM credential: BYOK CLAUDE_CODE_OAUTH_TOKEN=sk-ant-* (what the working Code-Reviewer claude agents use) or a platform-managed sk-cp-* proxy token. (BYOK with a real token cleared this in the pilot.)
  4. Chicken-and-egg ordering. Minting a workspace token needs the workspace row to exist, but the block docker runs before any row/install. Fix = pre-seed the platform-agent row → mint token → boot with token (the RFC's Phase-1 pre-seed), not boot-then-install.
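
The first three gaps in the list above are all checkable before the container boots. The following is a minimal preflight sketch under stated assumptions: `preflight` and `LLM_TOKEN_PREFIXES` are hypothetical names, and reading the LLM credential from either `CLAUDE_CODE_OAUTH_TOKEN` or `ANTHROPIC_AUTH_TOKEN` is an assumption for illustration. The env var names and the `sk-ant-` / `sk-cp-` prefixes are taken from the findings; nothing here is the control plane's actual code.

```python
import os
import tempfile

# Prefixes the findings describe as valid LLM credentials.
LLM_TOKEN_PREFIXES = ("sk-ant-", "sk-cp-")


def preflight(env, configs_dir):
    """Return a list of problems found in the boot env; an empty list means none found."""
    problems = []
    if not env.get("MOLECULE_WORKSPACE_TOKEN"):
        problems.append("MOLECULE_WORKSPACE_TOKEN is missing (the admin token is rejected with HTTP 401)")
    if env.get("CONFIGS_DIR") != configs_dir:
        problems.append("CONFIGS_DIR does not point at the configs directory")
    if not os.access(configs_dir, os.W_OK):
        # Checks writability for the user running this script, not uid 1000 specifically.
        problems.append("configs directory is not writable by the current user")
    # Assumption: either variable may carry the LLM credential.
    llm_token = env.get("CLAUDE_CODE_OAUTH_TOKEN") or env.get("ANTHROPIC_AUTH_TOKEN") or ""
    if not llm_token.startswith(LLM_TOKEN_PREFIXES):
        problems.append("LLM token has an unrecognised prefix")
    return problems


if __name__ == "__main__":
    configs = tempfile.mkdtemp()
    env = {
        "MOLECULE_WORKSPACE_TOKEN": "placeholder-workspace-token",
        "CONFIGS_DIR": configs,
        "CLAUDE_CODE_OAUTH_TOKEN": "sk-ant-placeholder",
    }
    print(preflight(env, configs))
```

The ordering problem in item 4 is not covered here: a workspace token can only exist once the row has been pre-seeded, so a check like this would run after the mint step and before `docker run`.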

The real architectural blocker (why end-to-end chat still doesn't work even after 1-3)

Inbound chat is poll/inbox/channel delivered: a user message lands in the agent's inbox and is pushed via experimental.claude/channel into an active claude session (a2a_tools_inbox.py inbox_peek/inbox_pop + _setup_inbox_bridge). But the platform-agent runs one-shot claude_agent_sdk.query() (per a1e0c7e) — so when idle it has no persistent session to receive the channel push, and inbound messages are never drained (confirmed: messages consumed off a2a_queue, but no claude transcript ever runs, no reply). The existing channel-based root (the bridged Claude Code session) works precisely because it's persistent.

So an always-on concierge needs a persistent-session / inbox-dispatch loop for kind=platform agents — the missing runtime piece. This is the gating work for activation.
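
To make the "inbox-dispatch loop" idea in this comment concrete, here is a rough sketch of the shape it could take. Everything in it is hypothetical: `dispatch_loop`, `fake_query`, and the `None` shutdown sentinel are illustrative names, and an `asyncio.Queue` stands in for the agent's inbox. It does not use `claude_agent_sdk` or the real inbox tools.

```python
import asyncio


async def dispatch_loop(inbox, run_query, replies):
    """Stay alive while idle, and run one query per inbound message as it arrives."""
    while True:
        message = await inbox.get()  # blocks while the agent is idle
        if message is None:          # hypothetical shutdown sentinel
            break
        replies.append(await run_query(message))


async def demo():
    inbox = asyncio.Queue()
    replies = []

    async def fake_query(message):
        # Stand-in for a one-shot query; a real agent would call its executor here.
        return f"reply:{message}"

    task = asyncio.create_task(dispatch_loop(inbox, fake_query, replies))
    await inbox.put("ping")
    await inbox.put("list workspaces")
    await inbox.put(None)
    await task
    return replies


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the sketch is only the contrast with a purely one-shot process: the loop owns the wait, so a message that arrives while nothing is running still triggers a query.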

Required to ship (proposed)

  • controlplane start_platform_agent: pre-seed row → mint workspace token (MOLECULE_WORKSPACE_TOKEN) → chown configs + CONFIGS_DIR → real LLM token (BYOK or sk-cp-*) → set RUNTIME so the label is correct. Pass runtime through (default claude-code; any-runtime supported since it's config-driven).
  • runtime/template: persistent-session (or inbox-poll-dispatch) mode for platform agents so idle inbound chat is processed.
  • Then re-pilot on agents-team (mechanism + MCP already proven), verify a chat round-trip + list_workspaces, THEN promote the pin.

Pilot fully rolled back; agents-team clean. CP #658 (fail-closed gate) correctly stays a no-op until a pin is promoted.

core-devops changed title from "Platform-agent image boots as langgraph w/ no MCP — blocks concierge activation (pilot RCA)" to "Platform-agent activation: provisioning-env gaps + idle inbound-delivery (needs persistent session) — full RCA" on 2026-06-09 21:01:01 +00:00
Author
Member

B (end-to-end verification on agents-team) — the inbound blocker is the DELIVERY path, not the source_id filter

Re-placed the concierge with all env fixes (workspace token, configs chown+CONFIGS_DIR, BYOK sk-ant-* LLM token) — boots clean: Runtime: claude-code, Registered: 200, Saved platform_inbound_secret, molecule-mcp tools/list works. Then drove inbound both ways:

  • canvas-user chat (chat_with_agent, source_id empty) → a2a_receive logged, delivery_mode=poll, no query ran, no reply.
  • peer delegation (delegate_task as CEO Assistant, source_id=091a9180, a real peer) → a2a_receive logged; the heartbeat's _check_activity_delegations fired its self-wake (a source_id=<self> a2a_receive appeared right after) — but still no claude transcript ran and no reply.

So my earlier "heartbeat skips canvas-user" hypothesis was incomplete: the peer message WAS picked up and a wake fired, yet the executor never ran. The wake self-message (and the original) are themselves delivery_mode=poll, and the agent's :8000 serves none of the A2A push routes (/a2a, /message, /.well-known/agent.json, /openapi.json all return 404). In addition, A2A_PORT=8090 is set while uvicorn listens on :8000 (the advertised url is also :8000). So nothing delivers the queued message or wake into a query().

Net: the executor + MCP are proven good in isolation; the break is the platform→agent inbound delivery + wake→query dispatch (push/poll mode, the A2A server's served endpoints, and the A2A_PORT vs serve-port). This is the activation gate and needs the runtime/A2A owner — it's a multi-component dispatch issue, not a one-line filter. Suspects to check first: (1) why delivery falls to poll instead of pushing to the agent's A2A server; (2) what path/port the agent's A2A server actually serves vs what the platform pushes to (A2A_PORT=8090 vs uvicorn :8000); (3) whether the self-wake is supposed to push to that endpoint and 404s.
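
Suspect (2) above is easy to check mechanically once the three values are known. A small sketch, assuming a hypothetical helper `find_port_mismatches` and a placeholder host name; the port numbers in the usage example are the ones observed in the pilot.

```python
from urllib.parse import urlparse


def find_port_mismatches(advertised_url, a2a_port, listen_port):
    """Compare the configured A2A port and the advertised url against the port actually served."""
    issues = []
    advertised_port = urlparse(advertised_url).port
    if a2a_port != listen_port:
        issues.append(f"A2A_PORT={a2a_port} but server listens on :{listen_port}")
    if advertised_port != listen_port:
        issues.append(
            f"advertised url port {advertised_port} differs from listen port {listen_port}"
        )
    return issues


if __name__ == "__main__":
    # Host name is a placeholder; ports are from the pilot observations.
    for issue in find_port_mismatches("http://agent.internal:8000", 8090, 8000):
        print(issue)
```

With the pilot's values only the `A2A_PORT` check fires, since the advertised url and the uvicorn listener agree on `:8000`.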

Pilot rolled back again; agents-team clean (roots = CEO Assistant + SEO Agent).

Author
Member

CORRECTION — root cause is an SSOT violation, not a runtime architecture gap

My "needs a persistent-session / runtime inbound-delivery rewrite" conclusion was wrong — I was chasing symptoms. The real root, surfaced by the CTO: the platform agent is just a regular workspace, but it's provisioned by a BESPOKE start_platform_agent block (a hand-rolled docker run on the tenant EC2) instead of the normal ProvisionWorkspace path. That bespoke block reinvented a subset of the workspace provisioning and omitted the rest.

Proof it's provisioning, not a runtime gap: chat_with_agent to the PM (a normal workspace) → returns PONG synchronously; the same chat to the bespoke platform agent → "queued for poll", never processed. Normal workspaces handle user→agent chat fine. The difference is 100% the provisioning.

Everything I "found broken" maps 1:1 to what the normal path provides and the bespoke block omitted:

  • minted workspace token (bespoke gave the admin token → 401),
  • the full MOLECULE_LLM_* proxy env (bespoke gave a bare ANTHROPIC_AUTH_TOKEN → unrecognised),
  • writable /configs uid-1000 (bespoke root-owned → EPERM),
  • proper A2A registration so the platform PUSHES synchronously (bespoke setup → poll fallback → never consumed → the "inbound delivery" red herring).

Correct fix (SSOT)

Provision the concierge through the normal workspace path (ProvisionWorkspace — own EC2, full env + A2A registration), adding only:

  • kind=platform,
  • extra_mcp_servers: [{name: platform, command: molecule-mcp}] (Phase-2 config — verified working),
  • the dedicated molecule-platform-agent image (only for the baked-in molecule-mcp binary),
  • the org admin token as its native credential.

Then DELETE the bespoke start_platform_agent block in controlplane ec2.go. The localhost-on-tenant rationale is moot — a normal workspace reaches the platform via PLATFORM_URL/MOLECULE_API_URL (tenant URL), like every workspace.
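
One way to read the proposal is that the concierge is an ordinary workspace spec plus exactly four overrides. The sketch below illustrates that shape only. `concierge_spec` and the `image` and `credential` keys are hypothetical names chosen for illustration; `kind` and `extra_mcp_servers` are the fields named in this comment, and the real `ProvisionWorkspace` input may look quite different.

```python
def concierge_spec(base_workspace_spec, org_admin_token):
    """Derive the concierge spec from a normal workspace spec, overriding only four fields."""
    spec = dict(base_workspace_spec)  # shallow copy so the base spec is left untouched
    spec.update({
        "kind": "platform",
        "extra_mcp_servers": [{"name": "platform", "command": "molecule-mcp"}],
        "image": "molecule-platform-agent",  # hypothetical key name
        "credential": org_admin_token,       # hypothetical key name
    })
    return spec


if __name__ == "__main__":
    base = {"name": "Org Concierge", "runtime": "claude-code", "kind": "workspace"}
    print(concierge_spec(base, "placeholder-admin-token"))
```

Everything not listed in the override (token minting, LLM proxy env, writable configs, A2A registration) is inherited from the normal path rather than re-implemented, which is the point of the SSOT argument.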

This supersedes the env-patch list and the "persistent session" framing above — those were all symptoms of the bespoke divergence. The fix is to stop diverging from the workspace SSOT.

core-devops changed title from "Platform-agent activation: provisioning-env gaps + idle inbound-delivery (needs persistent session) — full RCA" to "Platform agent broken = SSOT violation (bespoke provisioning block, not the normal workspace path) — provision it as a normal workspace + platform MCP" on 2026-06-09 21:47:49 +00:00

Reference: molecule-ai/molecule-core#2495