fix(workspace-server): create_workspace children born NOT_CONFIGURED — pin LLM_PROVIDER=platform for platform-managed models #3200
Root cause
Child agents created via the `create_workspace` management-MCP tool are born NOT_CONFIGURED.
- The kind=platform concierge gets `LLM_PROVIDER=platform` from `ensureConciergeProvider` (`platform_agent.go`), but that helper runs only on the kind=platform provision path.
- `create_workspace` flows through `WorkspaceHandler.Create` (`workspace.go:836`), which persists `MODEL` via `setModelSecret` but persists no `LLM_PROVIDER`, because the internal#718 P4 closure removed the unconditional `setProviderSecret` write. Children land with `secrets = [MODEL]` only.
- For a platform-managed model id like `moonshot/kimi-k2.6`, the on-box runtime re-derives the provider with its own slug-split (`_derive_provider_from_model` → `moonshot`, a model prefix, not a registry name), so the claude-code adapter fails closed: `workspace config picks provider='moonshot' but it is not in the providers registry` → online but NOT_CONFIGURED.
This is the exact symptom `ensureConciergeProvider` already cures for the root; children created via `create_workspace` need the same env-level pin.
Fix
After `setModelSecret`, `Create` now calls `ensureCreatedWorkspaceProviderPin` (`platform_agent.go`). It derives the provider via the registry (`providers.Manifest.DeriveProvider`) from `(runtime, model, payload secret keys)` and persists `LLM_PROVIDER=platform` iff the derivation is the closed `platform` provider, mirroring the concierge's `IsPlatform` gate.
- Only a derivation that resolves to `platform` gets the pin.
Tests
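`create_workspace_provider_pin_test.go`:
- a platform-managed child (`moonshot/kimi-k2.6`) gets `LLM_PROVIDER=platform` pinned;
- a BYOK child carrying `ANTHROPIC_API_KEY` for `anthropic:claude-opus-4-7` is not pinned;
- the pinned value equals the registry-derived provider (self-consistent config);
- empty-model and unknown-runtime are no-ops.
Build / test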
- `go build ./...` (workspace-server): pass
- `go vet ./internal/handlers/`: clean
- Tests `TestEnsureCreatedWorkspaceProviderPin`, `TestApplyConciergeProvisionConfig_SeedsProvider`, all `TestWorkspaceCreate*`: pass
- … `molecule-ai-workspace-runtime` checkout).
Relationship to in-flight #3198: that PR touches the runtime-switch path in `workspace_crud.go`; this is the create path (`workspace.go` + `platform_agent.go`), so there is no overlap.
🤖 Generated with Claude Code
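The root cause described above can be illustrated with a minimal Go sketch of the runtime side. Everything here is an assumption for illustration: `registryNames`, `slugSplit`, `resolveProvider`, and `status` are hypothetical names, the `CONFIGURED` status string is invented, and the rule that an explicit `LLM_PROVIDER` takes precedence over the slug-split is inferred from the description rather than taken from the runtime's code.

```go
package main

import (
	"fmt"
	"strings"
)

// registryNames is a hypothetical providers registry, reduced to a set of names.
var registryNames = map[string]bool{"platform": true, "anthropic": true}

// slugSplit imitates the runtime-side derivation: it keeps the text before
// the first "/", which is a model PREFIX and need not be a registry NAME.
func slugSplit(model string) string {
	prefix, _, _ := strings.Cut(model, "/")
	return prefix
}

// resolveProvider assumes an explicit LLM_PROVIDER wins over the slug-split.
func resolveProvider(secrets map[string]string) string {
	if p := secrets["LLM_PROVIDER"]; p != "" {
		return p
	}
	return slugSplit(secrets["MODEL"])
}

// status fails closed when the resolved provider is not a registry name.
func status(secrets map[string]string) string {
	if !registryNames[resolveProvider(secrets)] {
		return "NOT_CONFIGURED"
	}
	return "CONFIGURED"
}

func main() {
	// Before the fix: the child is created with MODEL only.
	before := map[string]string{"MODEL": "moonshot/kimi-k2.6"}
	// After the fix: the create path also pins LLM_PROVIDER=platform.
	after := map[string]string{"MODEL": "moonshot/kimi-k2.6", "LLM_PROVIDER": "platform"}

	fmt.Println(resolveProvider(before), status(before))
	fmt.Println(resolveProvider(after), status(after))
}
```

In this toy model the unpinned child resolves to `moonshot`, which is not a registry name, while the pinned child resolves to `platform`; that is the gap the env-level pin closes.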
REQUEST_CHANGES: the workspace-server provider pin itself is narrow, but this PR also adds docs/design/rfc-fleet-governance-identity-and-merge-automation.md with detailed operational credential and identity inventory. That doc names token cache locations, Infisical paths, local credential files, persona/user mappings, stale-token findings, admin/merge identities, and automation wiring. In this public core repo, that is sensitive operational security material and reverses the direction of the recent runbooks-security cleanup. Please remove the RFC from this PR (or move it to the appropriate private/internal location) and keep this PR scoped to the create_workspace provider-pin code/tests. The code path does not appear to log secret values; the blocker is the added operational doc exposure.
c526ff49ac to 1d88d37739
APPROVED on head 1d88d37739. Verified files API is code/tests only and the prior RFC file is gone. The change is scoped to create_workspace provider completeness: after MODEL persistence, it derives the provider from the registry using runtime/model plus create-payload secret keys, and pins LLM_PROVIDER=platform only when the registry-derived provider is the closed platform provider. BYOK/OAuth/self-host models and unknown/federated runtimes are skipped, so this avoids misrouting non-platform children.
5-axis: correctness is sound for the NOT_CONFIGURED child-workspace failure; robustness covers empty model, registry miss, BYOK, and registry self-consistency; security is acceptable because it persists only the provider name and logs no secrets; performance impact is a tiny per-create registry derivation; readability/tests are good. Do not merge yet: current status read had no second on-head pool approval and CI/required review contexts were still red/skipped pre-approval.
REQUEST_CHANGES on 1d88d37739.
The RFC exposure blocker is cleared in the PR files API: the current diff is code/tests only for fix(workspace-server): create_workspace provider pin. However, the current head is still not clean against today's main. Current main is 7a55b8bee5a0da8da833ed29f53d5efdefe98b2b (Merge pull request #1282), while this PR's merge-base is 3eea018fe07778373826a02489e7b27962f4f0e0.
Direct `origin/main..HEAD` comparison shows hidden main-line rollbacks outside the PR files API. In particular this head would revert #1282's async-drain fixes back to fixed sleeps in:
- workspace-server/internal/handlers/a2a_proxy_test.go (handler.waitAsyncForTest() -> time.Sleep(...))
- workspace-server/internal/handlers/restart_signals_test.go (hWrapper.waitAsyncForTest() -> time.Sleep(...))
- workspace-server/internal/handlers/workspace_provision_auto_test.go (h.waitAsyncForTest() -> time.Sleep(...))
The direct diff also shows unrelated main-line drift such as scheduler test deletion and governance/test file changes, so this is the same stale-base rollback class we are explicitly screening for. Please rebase onto current main so `origin/main..HEAD` contains only this PR's intended code/test files, then re-dispatch. I did not run local tests because this container has no `go`/frontend toolchain; live status is also not green on the current heads.
A child workspace the concierge spawns via the `create_workspace` management-MCP tool flows through WorkspaceHandler.Create, which persists MODEL (setModelSecret) but, since the internal#718 P4 closure removed the unconditional setProviderSecret write, persisted NO LLM_PROVIDER. The kind=platform concierge gets its pin from ensureConciergeProvider, but that helper runs ONLY on the platform provision path, so children were left with secrets = [MODEL] only. For a platform-managed model id like "moonshot/kimi-k2.6" the on-box runtime re-derives the provider with its own slug-split (_derive_provider_from_model → "moonshot", a model PREFIX, not a registry NAME), so the claude-code adapter fail-closes ("workspace config picks provider='moonshot' but it is not in the providers registry") → the child boots online but NOT_CONFIGURED.
Fix: after setModelSecret, Create now calls ensureCreatedWorkspaceProviderPin, mirroring the concierge's IsPlatform gate. It derives the provider via the registry (providers.Manifest.DeriveProvider) from (runtime, model, payload secret keys) and persists LLM_PROVIDER=platform iff the derivation is the closed `platform` provider. This is parent-independent and not moonshot-specific; BYOK/OAuth/self-host children (whose model derives to a real provider entry) are left untouched so the runtime's own derivation is not mis-routed. Non-fatal: registry-unavailable / derive-miss / persist-error log and continue.
Adds create_workspace_provider_pin_test.go: platform-managed child gets the pin; the pinned value equals the registry-derived provider (self-consistent config); BYOK child carrying a vendor key is NOT pinned; empty-model and unknown-runtime are no-ops.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1d88d37739 to 2cf0a6b1f5
Rebased onto current main; CI / Platform (Go) and all-required are green. Reviewed: pins LLM_PROVIDER=platform only when registry derivation is the closed platform provider; BYOK/OAuth untouched. Approve.
Security review: no auth/secret/network surface concern in this change. Approve.
APPROVED on current head 2cf0a6b1f5.
5-axis review:
Correctness: create_workspace now pins LLM_PROVIDER=platform only when the registry derives the child workspace's (runtime, model, secret keys) to the closed platform provider, closing the platform-managed child NOT_CONFIGURED path without hard-coding a model prefix. BYOK/OAuth/self-host cases derive to real providers and are left untouched; empty model and unknown/federated runtime are no-ops. Tests cover platform child pinning, registry consistency, BYOK no-pin, empty model, and derive-miss.
Security: no new secret values logged; only provider name/model are logged, and secret keys are used only as names for DeriveProvider disambiguation.
Performance: one registry derivation and optional secret write during create.
Readability: scoped helper documents the invariants and mirrors ensureConciergeProvider's IsPlatform gate.
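The cases the reviews enumerate (platform-managed pin, BYOK no-pin, empty model, unknown runtime) can be sketched as a small table-driven check in Go. This is not the PR's test file: `child`, `deriveProvider`, and `createSecrets` are hypothetical stand-ins, the runtime name `claude-code` and the lookup rules inside `deriveProvider` are assumptions, and the real `providers.Manifest.DeriveProvider` signature is not shown in this thread.

```go
package main

import "fmt"

// child carries the inputs the reviews say the derivation uses.
type child struct {
	runtime    string
	model      string
	secretKeys []string
}

// deriveProvider is a toy stand-in for the registry derivation. Its rules
// are assumptions for illustration, not the real manifest logic.
func deriveProvider(c child) (string, bool) {
	if c.runtime != "claude-code" || c.model == "" {
		return "", false // unknown runtime or empty model: derive-miss
	}
	switch c.model {
	case "moonshot/kimi-k2.6":
		return "platform", true // platform-managed model
	case "anthropic:claude-opus-4-7":
		for _, k := range c.secretKeys {
			if k == "ANTHROPIC_API_KEY" {
				return "anthropic", true // BYOK: a real provider entry
			}
		}
	}
	return "", false
}

// createSecrets returns what a create would persist in this sketch:
// MODEL when set, and LLM_PROVIDER only for the closed platform provider.
func createSecrets(c child) map[string]string {
	secrets := map[string]string{}
	if c.model != "" {
		secrets["MODEL"] = c.model
	}
	if provider, ok := deriveProvider(c); ok && provider == "platform" {
		secrets["LLM_PROVIDER"] = provider
	}
	return secrets
}

func main() {
	cases := []struct {
		name string
		c    child
	}{
		{"platform-managed", child{runtime: "claude-code", model: "moonshot/kimi-k2.6"}},
		{"byok", child{runtime: "claude-code", model: "anthropic:claude-opus-4-7", secretKeys: []string{"ANTHROPIC_API_KEY"}}},
		{"empty-model", child{runtime: "claude-code"}},
		{"unknown-runtime", child{runtime: "other", model: "moonshot/kimi-k2.6"}},
	}
	for _, tc := range cases {
		_, pinned := createSecrets(tc.c)["LLM_PROVIDER"]
		fmt.Printf("%-18s pinned=%v\n", tc.name, pinned)
	}
}
```

Only the platform-managed row ends up with `LLM_PROVIDER` set in this sketch; the other three rows leave the secrets untouched, which matches the no-op behaviour the reviews describe.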
APPROVE on 2cf0a6b.
Five-axis review: correctness looks sound for the create_workspace provider pin. The helper derives the provider through the registry for the created workspace's (runtime, model, provided secret keys), pins LLM_PROVIDER=platform only when the registry-derived provider is the closed platform provider, and leaves BYOK/OAuth/self-host or unknown/federated derive-miss cases untouched. Wiring after setModelSecret in WorkspaceHandler.Create closes the MODEL-without-provider gap for platform-managed children without applying a prefix heuristic. Tests cover platform-managed pinning, registry consistency, BYOK non-pin with ANTHROPIC_API_KEY, empty model no-op, and unknown runtime no-op. Robustness is reasonable and non-fatal behavior matches surrounding create secret persistence. Security: no secret values are logged, provider pin is registry-gated, and BYOK is not misrouted to platform. Performance impact is one registry derivation on create with a model. Readability is clear.
CI note: Harness Replays is red, but the log fails with Docker `No such container` during harness setup, not a create_workspace/provider-pin assertion. CI / Platform (Go), CI / all-required, and E2E API Smoke are green on this head.