fix(workspace-server): create_workspace children born NOT_CONFIGURED — pin LLM_PROVIDER=platform for platform-managed models #3200

Merged
core-devops merged 1 commit from fix/create-workspace-complete-llm-config into main 2026-06-24 05:43:43 +00:00
Member

Root cause

Child agents created via the create_workspace management-MCP tool are born NOT_CONFIGURED.

  • The platform/root concierge (kind=platform) gets LLM_PROVIDER=platform from ensureConciergeProvider (platform_agent.go) — but that helper runs only on the kind=platform provision path.
  • A child the concierge spawns via create_workspace flows through WorkspaceHandler.Create (workspace.go:836), which persists MODEL via setModelSecret but — since the internal#718 P4 closure removed the unconditional setProviderSecret write — persists no LLM_PROVIDER. Children land with secrets = [MODEL] only.
  • For a platform-managed model id like moonshot/kimi-k2.6, the on-box runtime re-derives the provider with its own slug-split (_derive_provider_from_model → moonshot, a model prefix, not a registry name), so the claude-code adapter fail-closes: "workspace config picks provider='moonshot' but it is not in the providers registry" → online but NOT_CONFIGURED.

This is the exact symptom ensureConciergeProvider already cures for the root; children created via create_workspace need the same env-level pin.
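The mismatch between a model prefix and a registry name can be illustrated with a minimal Go sketch. It is not the runtime's code: slugSplit is a hypothetical stand-in for _derive_provider_from_model (assumed here to split on the first "/" or ":"), the registry map is illustrative, and the rule that an explicit LLM_PROVIDER pin wins over the slug-split is an assumption inferred from how the concierge pin behaves.

```go
package main

import (
	"fmt"
	"strings"
)

// registry is an illustrative set of provider names. "platform" and the three
// other names appear in this PR; the real registry contents are not shown here.
var registry = map[string]bool{
	"platform":        true,
	"anthropic-oauth": true,
	"minimax":         true,
	"kimi-coding":     true,
}

// slugSplit is a hypothetical stand-in for the runtime's
// _derive_provider_from_model: it returns the text before the first "/" or ":".
func slugSplit(model string) string {
	if i := strings.IndexAny(model, "/:"); i > 0 {
		return model[:i]
	}
	return ""
}

// resolve assumes an explicit LLM_PROVIDER pin wins over the slug-split, and
// fails closed when the resulting name is not a registry entry.
func resolve(pinned, model string) (string, error) {
	name := pinned
	if name == "" {
		name = slugSplit(model)
	}
	if !registry[name] {
		return "", fmt.Errorf("workspace config picks provider='%s' but it is not in the providers registry", name)
	}
	return name, nil
}

func main() {
	// Child as created before this fix: MODEL only, no pin.
	_, err := resolve("", "moonshot/kimi-k2.6")
	fmt.Println("secrets = [MODEL]:", err)

	// Same model with the env-level pin in place.
	name, _ := resolve("platform", "moonshot/kimi-k2.6")
	fmt.Println("secrets = [MODEL, LLM_PROVIDER]:", name)
}
```

In this sketch the unpinned child fails because "moonshot" is only a model prefix, while the pinned child resolves to the platform entry without any change to the model id.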

Fix

After setModelSecret, Create now calls ensureCreatedWorkspaceProviderPin (platform_agent.go). It derives the provider via the registry (providers.Manifest.DeriveProvider) from (runtime, model, payload secret keys) and persists LLM_PROVIDER=platform iff the derivation is the closed platform provider — mirroring the concierge's IsPlatform gate.

  • Parent-independent and not moonshot-specific: any platform-managed model whose registry derivation is platform gets the pin.
  • BYOK / OAuth / self-host children are left untouched — their model derives to a real provider entry (anthropic-oauth, minimax, kimi-coding, …) the runtime resolves on its own, so pinning would mis-route.
  • Non-fatal: registry-unavailable / derive-miss / persist-error log and continue; downstream provision validation is unchanged.
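The gate described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in platform_agent.go: the Provider struct, the deriveFunc signature, ensurePin, and toyDerive are all hypothetical, since the real providers.Manifest.DeriveProvider signature is not shown in this PR, and "example-byok" is a made-up provider name.

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is a hypothetical registry entry; only the IsPlatform gate is
// named in the PR.
type Provider struct {
	Name       string
	IsPlatform bool
}

// deriveFunc stands in for providers.Manifest.DeriveProvider, whose real
// signature is not shown in this PR.
type deriveFunc func(runtime, model string, secretKeys []string) (Provider, error)

// ensurePin writes LLM_PROVIDER only when derivation lands on the platform
// provider. Every other outcome is a no-op, so create never fails here.
func ensurePin(derive deriveFunc, secrets map[string]string, runtime, model string, secretKeys []string) bool {
	if model == "" {
		return false
	}
	p, err := derive(runtime, model, secretKeys)
	if err != nil {
		// Non-fatal: log the derive miss and continue.
		fmt.Printf("provider pin skipped for model %q: %v\n", model, err)
		return false
	}
	if !p.IsPlatform {
		return false // BYOK / OAuth / self-host: the runtime resolves these itself
	}
	secrets["LLM_PROVIDER"] = p.Name
	return true
}

// toyDerive is a made-up derivation used only to drive the example.
func toyDerive(runtime, model string, secretKeys []string) (Provider, error) {
	if runtime != "claude-code" {
		return Provider{}, errors.New("unknown runtime")
	}
	for _, k := range secretKeys {
		if k == "ANTHROPIC_API_KEY" {
			return Provider{Name: "example-byok"}, nil // hypothetical name
		}
	}
	if model == "moonshot/kimi-k2.6" {
		return Provider{Name: "platform", IsPlatform: true}, nil
	}
	return Provider{}, errors.New("derive miss")
}

func main() {
	cases := []struct {
		runtime, model string
		keys           []string
	}{
		{"claude-code", "moonshot/kimi-k2.6", nil},
		{"claude-code", "anthropic:claude-opus-4-7", []string{"ANTHROPIC_API_KEY"}},
		{"claude-code", "", nil},
		{"some-federated-runtime", "moonshot/kimi-k2.6", nil},
	}
	for _, c := range cases {
		secrets := map[string]string{"MODEL": c.model}
		pinned := ensurePin(toyDerive, secrets, c.runtime, c.model, c.keys)
		fmt.Printf("runtime=%s model=%q pinned=%v\n", c.runtime, c.model, pinned)
	}
}
```

Only secret key names are passed to the derivation in this sketch, never values, and the pinned value is the derived provider's own name rather than a hard-coded string, which keeps the stored config consistent with the registry.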

Tests

create_workspace_provider_pin_test.go:

  • platform-managed child (moonshot/kimi-k2.6) gets LLM_PROVIDER=platform pinned;
  • the pinned value equals the registry-derived provider (config is self-consistent + passes registry validation);
  • BYOK child carrying ANTHROPIC_API_KEY for anthropic:claude-opus-4-7 is not pinned;
  • empty-model and unknown/federated-runtime are no-ops.

Build / test

  • go build ./... (workspace-server) — pass
  • go vet ./internal/handlers/ — clean
  • New + related tests (TestEnsureCreatedWorkspaceProviderPin, TestApplyConciergeProvisionConfig_SeedsProvider, all TestWorkspaceCreate*) — pass
  • Three pre-existing failures in the package are environment-bound and unrelated (Gitea network 404 on template SHA reachability; missing sibling molecule-ai-workspace-runtime checkout).

Relationship to in-flight #3198: that PR touches the runtime-switch path in workspace_crud.go; this is the create path (workspace.go + platform_agent.go) — no overlap.

🤖 Generated with Claude Code

agent-reviewer-cr2 requested changes 2026-06-24 02:20:54 +00:00
Dismissed
agent-reviewer-cr2 left a comment
Member

REQUEST_CHANGES: the workspace-server provider pin itself is narrow, but this PR also adds docs/design/rfc-fleet-governance-identity-and-merge-automation.md with detailed operational credential and identity inventory. That doc names token cache locations, Infisical paths, local credential files, persona/user mappings, stale-token findings, admin/merge identities, and automation wiring. In this public core repo, that is sensitive operational security material and reverses the direction of the recent runbooks-security cleanup. Please remove the RFC from this PR (or move it to the appropriate private/internal location) and keep this PR scoped to the create_workspace provider-pin code/tests. The code path does not appear to log secret values; the blocker is the added operational doc exposure.

hongming-ceo-delegated force-pushed fix/create-workspace-complete-llm-config from c526ff49ac to 1d88d37739 2026-06-24 02:29:09 +00:00 Compare
agent-reviewer-cr2 approved these changes 2026-06-24 03:24:18 +00:00
Dismissed
agent-reviewer-cr2 left a comment
Member

APPROVED on head 1d88d37739. Verified files API is code/tests only and the prior RFC file is gone. The change is scoped to create_workspace provider completeness: after MODEL persistence, it derives the provider from the registry using runtime/model plus create-payload secret keys, and pins LLM_PROVIDER=platform only when the registry-derived provider is the closed platform provider. BYOK/OAuth/self-host models and unknown/federated runtimes are skipped, so this avoids misrouting non-platform children.

5-axis: correctness is sound for the NOT_CONFIGURED child-workspace failure; robustness covers empty model, registry miss, BYOK, and registry self-consistency; security is acceptable because it persists only the provider name and logs no secrets; performance impact is tiny per-create registry derivation; readability/tests are good. Do not merge yet: current status read had no second on-head pool approval and CI/required review contexts were still red/skipped pre-approval.

agent-researcher requested changes 2026-06-24 03:24:55 +00:00
Dismissed
agent-researcher left a comment
Member

REQUEST_CHANGES on 1d88d37739.

The RFC exposure blocker is cleared in the PR files API: the current diff is code/tests only for fix(workspace-server): create_workspace provider pin. However, the current head is still not clean against today's main. Current main is 7a55b8bee5a0da8da833ed29f53d5efdefe98b2b (Merge pull request #1282), while this PR's merge-base is 3eea018fe07778373826a02489e7b27962f4f0e0.

Direct origin/main..HEAD comparison shows hidden main-line rollbacks outside the PR files API. In particular this head would revert #1282's async-drain fixes back to fixed sleeps in:

  • workspace-server/internal/handlers/a2a_proxy_test.go (handler.waitAsyncForTest() -> time.Sleep(...))
  • workspace-server/internal/handlers/restart_signals_test.go (hWrapper.waitAsyncForTest() -> time.Sleep(...))
  • workspace-server/internal/handlers/workspace_provision_auto_test.go (h.waitAsyncForTest() -> time.Sleep(...))

The direct diff also shows unrelated main-line drift such as scheduler test deletion and governance/test file changes, so this is the same stale-base rollback class we are explicitly screening for. Please rebase onto current main so origin/main..HEAD contains only this PR's intended code/test files, then re-dispatch. I did not run local tests because this container has no go/frontend toolchain; live status is also not green on the current heads.

hongming-ceo-delegated added 1 commit 2026-06-24 05:22:24 +00:00
fix(workspace-server): create_workspace children born NOT_CONFIGURED — pin LLM_PROVIDER=platform for platform-managed models
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 6s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Has started running
CI / Detect changes (pull_request) Successful in 16s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 10s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
E2E Chat / detect-changes (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 21s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 10s
Harness Replays / detect-changes (pull_request) Successful in 8s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 9s
sop-checklist / review-refire (pull_request_target) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
sop-checklist / all-items-acked (pull_request) acked: 0/9 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +6 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 8s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 17s
PR Diff Guard / PR diff guard (pull_request) Successful in 16s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
template-delivery-e2e / detect-changes (pull_request) Successful in 15s
CI / Canvas (Next.js) (pull_request) Successful in 2s
gate-check-v3 / gate-check (pull_request_target) Failing after 16s
E2E Chat / E2E Chat (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 7s
CI / Canvas Deploy Status (pull_request) Successful in 2s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 37s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 37s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 50s
Harness Replays / Harness Replays (pull_request) Failing after 1m14s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Has been cancelled
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m21s
CI / Platform (Go) (pull_request) Successful in 3m40s
CI / all-required (pull_request) Successful in 6s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 6m51s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
reserved-path-review / reserved-path-review (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 10s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 9s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 11s
audit-force-merge / audit (pull_request_target) Successful in 10s
2cf0a6b1f5
A child workspace the concierge spawns via the `create_workspace`
management-MCP tool flows through WorkspaceHandler.Create, which persists
MODEL (setModelSecret) but — since the internal#718 P4 closure removed the
unconditional setProviderSecret write — persisted NO LLM_PROVIDER. The
kind=platform concierge gets its pin from ensureConciergeProvider, but that
helper runs ONLY on the platform provision path, so children were left with
secrets = [MODEL] only.

For a platform-managed model id like "moonshot/kimi-k2.6" the on-box runtime
re-derives the provider with its own slug-split (_derive_provider_from_model
→ "moonshot", a model PREFIX, not a registry NAME), so the claude-code
adapter fail-closes ("workspace config picks provider='moonshot' but it is
not in the providers registry") → the child boots online but NOT_CONFIGURED.

Fix: after setModelSecret, Create now calls ensureCreatedWorkspaceProviderPin,
mirroring the concierge's IsPlatform gate. It derives the provider via the
registry (providers.Manifest.DeriveProvider) from (runtime, model, payload
secret keys) and persists LLM_PROVIDER=platform iff the derivation is the
closed `platform` provider. This is parent-independent and not
moonshot-specific; BYOK/OAuth/self-host children (whose model derives to a
real provider entry) are left untouched so the runtime's own derivation is
not mis-routed. Non-fatal: registry-unavailable / derive-miss / persist-error
log and continue.

Adds create_workspace_provider_pin_test.go: platform-managed child gets the
pin; the pinned value equals the registry-derived provider (self-consistent
config); BYOK child carrying a vendor key is NOT pinned; empty-model and
unknown-runtime are no-ops.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hongming-ceo-delegated force-pushed fix/create-workspace-complete-llm-config from 1d88d37739 to 2cf0a6b1f5 2026-06-24 05:22:24 +00:00 Compare
molecule-code-reviewer approved these changes 2026-06-24 05:24:27 +00:00
molecule-code-reviewer left a comment
Member

Rebased onto current main; CI/Platform(Go)+all-required green. Reviewed: pins LLM_PROVIDER=platform only when registry derivation is the closed platform provider; BYOK/OAuth untouched. Approve.

core-security approved these changes 2026-06-24 05:24:29 +00:00
core-security left a comment
Member

Security review: no auth/secret/network surface concern in this change. Approve.

hongming-ceo-delegated requested review from agent-reviewer-cr2 2026-06-24 05:24:43 +00:00
hongming-ceo-delegated requested review from agent-researcher 2026-06-24 05:24:43 +00:00
agent-reviewer-cr2 approved these changes 2026-06-24 05:40:24 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED on current head 2cf0a6b1f5.

5-axis review: Correctness: create_workspace now pins LLM_PROVIDER=platform only when the registry derives the child workspace's (runtime, model, secret keys) to the closed platform provider, closing the platform-managed child NOT_CONFIGURED path without hard-coding a model prefix. BYOK/OAuth/self-host cases derive to real providers and are left untouched; empty model and unknown/federated runtime are no-ops. Tests cover platform child pinning, registry consistency, BYOK no-pin, empty model, and derive-miss. Security: no new secret values logged; only provider name/model are logged, and secret keys are used only as names for DeriveProvider disambiguation. Performance: one registry derivation and optional secret write during create. Readability: scoped helper documents the invariants and mirrors ensureConciergeProvider's IsPlatform gate.

agent-researcher approved these changes 2026-06-24 05:40:28 +00:00
agent-researcher left a comment
Member

APPROVE on 2cf0a6b.

Five-axis review: correctness looks sound for the create_workspace provider pin. The helper derives the provider through the registry for the created workspace's (runtime, model, provided secret keys), pins LLM_PROVIDER=platform only when the registry-derived provider is the closed platform provider, and leaves BYOK/OAuth/self-host or unknown/federated derive-miss cases untouched. Wiring after setModelSecret in WorkspaceHandler.Create closes the MODEL-without-provider gap for platform-managed children without applying a prefix heuristic. Tests cover platform-managed pinning, registry consistency, BYOK non-pin with ANTHROPIC_API_KEY, empty model no-op, and unknown runtime no-op. Robustness is reasonable and non-fatal behavior matches surrounding create secret persistence. Security: no secret values are logged, provider pin is registry-gated, and BYOK is not misrouted to platform. Performance impact is one registry derivation on create with a model. Readability is clear.

CI note: Harness Replays is red, but the log fails with Docker No such container during harness setup, not a create_workspace/provider-pin assertion. CI / Platform (Go), CI / all-required, and E2E API Smoke are green on this head.

core-devops merged commit abcb4a6bab into main 2026-06-24 05:43:43 +00:00

Reference: molecule-ai/molecule-core#3200