fix(platform-agent): pin LLM_PROVIDER=platform when concierge MODEL is empty #3160
Root cause
Newly-created platform-agent orgs (e.g. test9) 401 on every LLM call ("Incorrect API key"). The concierge's claude-code sends an inherited tenant CLAUDE_CODE_OAUTH_TOKEN (sk-ant-oat01) as the bearer to the CP LLM proxy, which authenticates callers against the per-workspace org_instances.admin_token. An OAuth bearer can never match it -> 401.
That OAuth token should have been stripped because the concierge pins LLM_PROVIDER=platform. But ensureConciergeProvider gated the pin on MODEL having the platform-managed prefix; on a rebuilt-from-DB provision payload MODEL is empty, so HasPrefix("", ...) is false and the pin was skipped -> no LLM_PROVIDER -> the runtime could not drop the inherited OAuth token.
Fix
Treat an empty MODEL as platform-managed (pin platform). Empty is never a BYOK signal: BYOK concierges carry a stored LLM_PROVIDER (early-return) or an explicit non-platform MODEL (still skipped). Only the unresolved/empty case now pins.
Tests
TestEnsureConciergeProvider_EmptyModelPins: empty MODEL pins platform; explicit BYOK model still skips. go test green, go vet clean.
Paired with the runtime defense-in-depth (un-gated OAuth drop when base URL is the CP proxy): branch fix/llm-auth-drop-oauth-on-cp-proxy in molecule-ai-workspace-runtime.
🤖 Generated with Claude Code
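For readers without the diff at hand, the gate change can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the actual ensureConciergeProvider code: the function names shouldPinOld and shouldPinNew, their signatures, and the prefix value "platform/" are all hypothetical, since the real prefix and function body are not shown on this page. Only the shape of the condition (a stored LLM_PROVIDER returns early, then the prefix check gains an empty-MODEL guard) follows the description.

```go
package main

import (
	"fmt"
	"strings"
)

// platformPrefix is a hypothetical placeholder; the real platform-managed
// prefix is not shown in this PR text.
const platformPrefix = "platform/"

// shouldPinOld sketches the gate before the fix: the pin is skipped
// whenever MODEL lacks the platform prefix, which includes an empty MODEL.
func shouldPinOld(storedProvider, model string) bool {
	if storedProvider != "" {
		return false // BYOK: a stored LLM_PROVIDER returns early
	}
	return strings.HasPrefix(model, platformPrefix)
}

// shouldPinNew sketches the gate after the fix: only a non-empty,
// non-platform MODEL skips the pin; an empty MODEL now pins platform.
func shouldPinNew(storedProvider, model string) bool {
	if storedProvider != "" {
		return false // BYOK: a stored LLM_PROVIDER returns early
	}
	if model != "" && !strings.HasPrefix(model, platformPrefix) {
		return false // explicit non-platform MODEL is still skipped
	}
	return true
}

func main() {
	// Each case is {stored LLM_PROVIDER, MODEL}; all values are made up.
	cases := [][2]string{
		{"", ""},                 // rebuilt-from-DB payload: MODEL empty
		{"", "platform/example"}, // platform-managed MODEL
		{"", "byok-model"},       // explicit non-platform MODEL
		{"anthropic", ""},        // stored provider, empty MODEL
	}
	for _, c := range cases {
		fmt.Printf("provider=%q model=%q old=%v new=%v\n",
			c[0], c[1], shouldPinOld(c[0], c[1]), shouldPinNew(c[0], c[1]))
	}
}
```

The two functions differ only on the first case (no stored provider, empty MODEL), which is the rebuilt-from-DB path described above; the stored-provider and explicit non-platform cases behave the same before and after.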
ensureConciergeProvider gated the LLM_PROVIDER=platform pin on the effective MODEL having the platform-managed prefix. On a rebuilt-from-DB provision payload the MODEL is empty, so HasPrefix("", ...) was false and the pin was skipped: the concierge booted without LLM_PROVIDER. The runtime then could not drop the inherited tenant CLAUDE_CODE_OAUTH_TOKEN, so claude-code sent the OAuth bearer to the CP LLM proxy (which authenticates via the per-workspace admin_token) and 401'd every call. Root cause of the test9 / platform-agent "Incorrect API key" outage.
Treat an empty MODEL as platform-managed (pin platform): empty is never a BYOK signal. BYOK concierges carry a stored LLM_PROVIDER (early-return) or an explicit non-platform MODEL (still skipped). Only the unresolved/empty case now pins.
Regression test: empty MODEL pins platform; explicit BYOK model still skips.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reviewed the gate change directly: !HasPrefix(model, prefix) -> model != "" && !HasPrefix(...) correctly pins platform for an empty/unresolved MODEL (the rebuilt-from-DB concierge case that 401'd). BYOK safe: stored LLM_PROVIDER early-returns above the gate (verified in source), non-empty non-platform MODEL still skips; both covered by the new regression test. APPROVE (code).
Note: PR is not green yet. qa-review/security-review/gate-check-v3/sop-checklist + E2E Staging Boot need resolving before merge.
Independent adversarial pass (security focus): central claims verified in SOURCE, not just tests. The stored-LLM_PROVIDER early-return precedes the changed gate, so no BYOK regression from the empty-MODEL pin. Two pre-existing, non-blocking follow-ups noted for tracking, neither introduced by this PR:
(1) readStoredProviderSecret fail-open could clobber a BYOK provider on decrypt-failure+empty-MODEL;
(2) an inherited OAuth token arriving via ANTHROPIC_AUTH_TOKEN (not CLAUDE_CODE_OAUTH_TOKEN) under CP-proxy routing would re-route to native Anthropic (silent-billing shape).
APPROVE.
APPROVED: genuine 5-axis review on current head 4b7fb6ecfc.
Correctness: the platform pin now fires for the failing rebuilt-from-DB/empty MODEL path while preserving the existing stored LLM_PROVIDER early-return and the explicit non-platform MODEL skip, so legitimate BYOK/self-host model overrides are not forced to platform.
Security/token precedence: this restores the provider pin needed for runtime auth stripping and avoids the inherited Claude OAuth token winning against the CP LLM proxy.
Regression risk: narrow one-condition change in ensureConciergeProvider; no production logic outside concierge provider seeding.
Tests: direct unit coverage proves empty MODEL pins and explicit BYOK model does not.
Blast radius: platform-agent concierge provisioning only; appropriate for the live test9/cd62fe70 401 incident.
APPROVED current-head incident review.
5-axis check: