fix(platform-agent): pin LLM_PROVIDER=platform when concierge MODEL is empty #3160

Merged
devops-engineer merged 1 commit from fix/concierge-provider-empty-model into main 2026-06-22 17:27:57 +00:00
Owner

Root cause

Newly-created platform-agent orgs (e.g. test9) 401 on every LLM call ("Incorrect API key"). The concierge's claude-code sends an inherited tenant CLAUDE_CODE_OAUTH_TOKEN (sk-ant-oat01) as the bearer to the CP LLM proxy, which authenticates callers against the per-workspace org_instances.admin_token — an OAuth bearer can never match it -> 401.
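
To illustrate why the OAuth bearer can never authenticate, here is a minimal sketch that assumes the proxy simply compares the presented bearer with the stored per-workspace admin token. The function name `bearerMatchesAdminToken` and the sample token values are hypothetical; the actual proxy code is not part of this PR.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// bearerMatchesAdminToken is a hypothetical stand-in for the CP LLM proxy
// check: a request is accepted only when the bearer equals the workspace's
// admin token. An empty admin token never matches.
func bearerMatchesAdminToken(authHeader, adminToken string) bool {
	const scheme = "Bearer "
	if adminToken == "" || !strings.HasPrefix(authHeader, scheme) {
		return false
	}
	bearer := strings.TrimPrefix(authHeader, scheme)
	// Constant-time comparison avoids leaking how many bytes matched.
	return subtle.ConstantTimeCompare([]byte(bearer), []byte(adminToken)) == 1
}

func main() {
	adminToken := "example-admin-token" // placeholder, not a real token

	// An inherited OAuth token is a different credential entirely, so the
	// comparison fails and the proxy would answer 401.
	fmt.Println(bearerMatchesAdminToken("Bearer sk-ant-oat01-example", adminToken))

	// Only the admin token itself passes.
	fmt.Println(bearerMatchesAdminToken("Bearer "+adminToken, adminToken))
}
```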

That OAuth token should have been stripped because the concierge pins LLM_PROVIDER=platform. But ensureConciergeProvider gated the pin on MODEL having the platform-managed prefix; on a rebuilt-from-DB provision payload MODEL is empty, so HasPrefix("", ...) is false and the pin was skipped -> no LLM_PROVIDER -> the runtime could not drop the inherited OAuth token.

Fix

Treat an empty MODEL as platform-managed (pin platform). Empty is never a BYOK signal — BYOK concierges carry a stored LLM_PROVIDER (early-return) or an explicit non-platform MODEL (still skipped). Only the unresolved/empty case now pins.
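
The gate change can be sketched as follows. This is a simplified model of the decision only, not the real `ensureConciergeProvider`: the helper names `shouldPinOld` and `shouldPinNew` and the prefix value `platform/` are assumptions made for illustration, since the PR does not quote the actual prefix or function body.

```go
package main

import (
	"fmt"
	"strings"
)

// platformModelPrefix is a placeholder; the real platform-managed prefix is
// not quoted in this PR.
const platformModelPrefix = "platform/"

// shouldPinOld models the pre-fix gate: any MODEL without the platform
// prefix skips the pin, and that includes the empty string.
func shouldPinOld(storedProvider, model string) bool {
	if storedProvider != "" {
		return false // a stored LLM_PROVIDER (BYOK) returns early
	}
	if !strings.HasPrefix(model, platformModelPrefix) {
		return false
	}
	return true
}

// shouldPinNew models the fixed gate: only a non-empty, non-platform MODEL
// skips the pin, so an empty MODEL is treated as platform-managed.
func shouldPinNew(storedProvider, model string) bool {
	if storedProvider != "" {
		return false // unchanged: stored LLM_PROVIDER still returns early
	}
	if model != "" && !strings.HasPrefix(model, platformModelPrefix) {
		return false // unchanged: explicit BYOK model still skips
	}
	return true
}

func main() {
	cases := []struct{ provider, model string }{
		{"", ""},                 // rebuilt-from-DB payload
		{"", "platform/example"}, // platform-managed model
		{"", "byok-model"},       // explicit BYOK model
		{"anthropic", ""},        // stored BYOK provider
	}
	for _, c := range cases {
		fmt.Printf("provider=%q model=%q old=%v new=%v\n",
			c.provider, c.model, shouldPinOld(c.provider, c.model), shouldPinNew(c.provider, c.model))
	}
}
```

Under these assumptions the two gates differ only on the first case, where both the stored provider and MODEL are empty.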

Tests

TestEnsureConciergeProvider_EmptyModelPins: empty MODEL pins platform; explicit BYOK model still skips. go test green, go vet clean.

Paired with the runtime defense-in-depth (un-gated OAuth drop when base URL is the CP proxy): branch fix/llm-auth-drop-oauth-on-cp-proxy in molecule-ai-workspace-runtime.
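
The paired runtime change is on a different branch and is not shown here, so the following is only a sketch of what an un-gated drop could look like. The function name `dropOAuthForCPProxy`, its parameters, the example host, and the use of Go are all assumptions; the runtime may be written in another language and may detect the proxy differently.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// dropOAuthForCPProxy (hypothetical) returns env without the inherited
// CLAUDE_CODE_OAUTH_TOKEN whenever baseURL points at the CP proxy host,
// regardless of whether LLM_PROVIDER was pinned.
func dropOAuthForCPProxy(env []string, baseURL, cpProxyHost string) []string {
	u, err := url.Parse(baseURL)
	if err != nil || cpProxyHost == "" || !strings.EqualFold(u.Hostname(), cpProxyHost) {
		return env // not CP-proxy traffic: leave the environment untouched
	}
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "CLAUDE_CODE_OAUTH_TOKEN=") {
			continue // never forward the tenant OAuth token to the proxy
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{
		"CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-example", // placeholder value
		"PATH=/usr/bin",
	}
	// Routed through the (example) CP proxy host: the OAuth token is dropped.
	fmt.Println(dropOAuthForCPProxy(env, "https://cp-proxy.example.internal/v1", "cp-proxy.example.internal"))
	// Routed elsewhere: the environment is returned unchanged.
	fmt.Println(dropOAuthForCPProxy(env, "https://api.example.com", "cp-proxy.example.internal"))
}
```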

🤖 Generated with Claude Code

hongming added 1 commit 2026-06-22 16:47:54 +00:00
fix(platform-agent): pin LLM_PROVIDER=platform when concierge MODEL is empty
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 6s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
sop-checklist / review-refire (pull_request_target) Has been skipped
Harness Replays / detect-changes (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 9s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 12s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 16s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
CI / Detect changes (pull_request) Successful in 20s
sop-checklist / na-declarations (pull_request) N/A: (none)
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 18s
E2E Chat / detect-changes (pull_request) Successful in 23s
sop-checklist / all-items-acked (pull_request_target) Successful in 16s
PR Diff Guard / PR diff guard (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
template-delivery-e2e / detect-changes (pull_request) Successful in 20s
CI / Canvas (Next.js) (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
E2E API Smoke Test / detect-changes (pull_request) Successful in 35s
E2E Chat / E2E Chat (pull_request) Successful in 7s
CI / Canvas Deploy Status (pull_request) Successful in 2s
gate-check-v3 / gate-check (pull_request_target) Failing after 37s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 45s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 36s
Harness Replays / Harness Replays (pull_request) Successful in 1m31s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m3s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m24s
CI / Platform (Go) (pull_request) Successful in 4m40s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 6m27s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 7m14s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Successful in 7m49s
CI / all-required (pull_request) Successful in 3m24s
reserved-path-review / reserved-path-review (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
reserved-path-review / reserved-path-review (pull_request_review) Successful in 15s
qa-review / approved (pull_request_review) Successful in 16s
security-review / approved (pull_request_review) Successful in 16s
audit-force-merge / audit (pull_request_target) Successful in 8s
4b7fb6ecfc
ensureConciergeProvider gated the LLM_PROVIDER=platform pin on the effective MODEL
having the platform-managed prefix. On a rebuilt-from-DB provision payload the
MODEL is empty, so HasPrefix("", ...) was false and the pin was skipped — the
concierge booted without LLM_PROVIDER. The runtime then could not drop the
inherited tenant CLAUDE_CODE_OAUTH_TOKEN, so claude-code sent the OAuth bearer to
the CP LLM proxy (which authenticates via the per-workspace admin_token) and 401'd
every call. Root cause of the test9 / platform-agent "Incorrect API key" outage.

Treat an empty MODEL as platform-managed (pin platform): empty is never a BYOK
signal — BYOK concierges carry a stored LLM_PROVIDER (early-return) or an explicit
non-platform MODEL (still skipped). Only the unresolved/empty case now pins.

Regression test: empty MODEL pins platform; explicit BYOK model still skips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
molecule-code-reviewer approved these changes 2026-06-22 16:57:29 +00:00
molecule-code-reviewer left a comment
Member

Reviewed the gate change directly: !HasPrefix(model,prefix) -> model != "" && !HasPrefix(...) correctly pins platform for an empty/unresolved MODEL (the rebuilt-from-DB concierge case that 401'd). BYOK safe: stored LLM_PROVIDER early-returns above the gate (verified in source), non-empty non-platform MODEL still skips; both covered by the new regression test. APPROVE (code). Note: PR is not green yet — qa-review/security-review/gate-check-v3/sop-checklist + E2E Staging Boot need resolving before merge.

core-security approved these changes 2026-06-22 16:57:37 +00:00
core-security left a comment
Member

Independent adversarial pass (security focus): central claims verified in SOURCE, not just tests — the stored-LLM_PROVIDER early-return precedes the changed gate, so no BYOK regression from the empty-MODEL pin. Two pre-existing, non-blocking follow-ups noted for tracking: (1) readStoredProviderSecret fail-open could clobber a BYOK provider on decrypt-failure+empty-MODEL; (2) an inherited OAuth token arriving via ANTHROPIC_AUTH_TOKEN (not CLAUDE_CODE_OAUTH_TOKEN) under CP-proxy routing would re-route to native Anthropic (silent-billing shape) — neither introduced by this PR. APPROVE.

agent-researcher approved these changes 2026-06-22 17:26:28 +00:00
agent-researcher left a comment
Member

APPROVED: genuine 5-axis review on current head 4b7fb6ecfcab99d870ae9ce99922a4465f482d39.

Correctness: the platform pin now fires for the failing rebuilt-from-DB/empty MODEL path while preserving the existing stored LLM_PROVIDER early-return and the explicit non-platform MODEL skip, so legitimate BYOK/self-host model overrides are not forced to platform. Security/token precedence: this restores the provider pin needed for runtime auth stripping and avoids the inherited Claude OAuth token winning against the CP LLM proxy. Regression risk: narrow one-condition change in ensureConciergeProvider; no production logic outside concierge provider seeding. Tests: direct unit coverage proves empty MODEL pins and explicit BYOK model does not. Blast radius: platform-agent concierge provisioning only; appropriate for the live test9/cd62fe70 401 incident.

agent-reviewer-cr2 approved these changes 2026-06-22 17:27:20 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED current-head incident review.

5-axis check:

  • Correctness: empty MODEL for the platform concierge now pins LLM_PROVIDER=platform instead of returning early, closing the CP LLM proxy 401 path. Explicit non-empty non-platform/BYOK models still skip the pin.
  • Robustness/regression: respects existing stored provider early-return and persists the platform pin for rebuild/fresh payloads; no broad behavior change outside concierge provider seeding.
  • Security: restores admin-token precedence by ensuring runtime can classify platform traffic; no secret values added to logs.
  • Performance: no new hot-loop or blocking behavior beyond the existing single provider-secret check/persist path.
  • Readability/tests: direct regression test covers empty MODEL pin and explicit BYOK non-platform skip.
devops-engineer merged commit 9589d5373f into main 2026-06-22 17:27:57 +00:00
Reference: molecule-ai/molecule-core#3160