Fresh SaaS tenant: concierge team spin-up fails MISSING_BYOK_CREDENTIAL — provider derivation overrides org default platform_managed #2608

Closed
opened 2026-06-11 21:57:25 +00:00 by core-devops · 2 comments
Member

Live repro 2026-06-11 ~21:51Z, brand-new tenant enter-os (first real user flow): user asked the concierge to spin up a team; both child workspaces (enter-product, enter-growth, runtime claude-code tier 4) failed WORKSPACE_PROVISION_FAILED / MISSING_BYOK_CREDENTIAL (payload cites #1994), while the org's billing default was platform_managed.

Mechanism (llm_billing_mode.go resolution): with no workspace override, precedence DERIVES the provider from (runtime, model) BEFORE the org default — a specific non-platform vendor derivation forces byok, which a fresh org can never satisfy (no workspace credential exists yet). Net effect: the canonical first-run journey — new tenant → concierge → "build me a team" — fails out of the box, with failed nodes in the org tree.

Unblock applied (enter-os): explicit workspace overrides to platform_managed via PUT /admin/workspaces/:id/llm-billing-mode + restart → both online.

Decision needed (CTO — this is the standing "billing default direction" question, now with live first-run evidence): on SaaS tenants with a working LLM proxy, resolution should prefer the org default (platform_managed) over vendor derivation; byok only on explicit opt-in (workspace override or org default set to byok). Self-host keeps fail-closed byok.

Secondary (either here or with cp#729): the workspace row's last_error stays EMPTY on provision failure — the real error only lives in the events stream; the canvas shows a bare red failed. Surface the provision error onto the workspace record.

Refs #1994, #1922, #1963.

Live repro 2026-06-11 ~21:51Z, brand-new tenant `enter-os` (first real user flow): user asked the concierge to spin up a team; both child workspaces (`enter-product`, `enter-growth`, runtime claude-code tier 4) failed `WORKSPACE_PROVISION_FAILED` / `MISSING_BYOK_CREDENTIAL` (payload cites #1994), while the org's billing default was **platform_managed**. **Mechanism** (`llm_billing_mode.go` resolution): with no workspace override, precedence DERIVES the provider from (runtime, model) BEFORE the org default — a specific non-platform vendor derivation forces `byok`, which a fresh org can never satisfy (no workspace credential exists yet). Net effect: the canonical first-run journey — new tenant → concierge → "build me a team" — fails out of the box, with `failed` nodes in the org tree. **Unblock applied (enter-os):** explicit workspace overrides to `platform_managed` via PUT /admin/workspaces/:id/llm-billing-mode + restart → both online. **Decision needed (CTO — this is the standing "billing default direction" question, now with live first-run evidence):** on SaaS tenants with a working LLM proxy, resolution should prefer the org default (platform_managed) over vendor derivation; `byok` only on explicit opt-in (workspace override or org default set to byok). Self-host keeps fail-closed byok. **Secondary (either here or with cp#729):** the workspace row's `last_error` stays EMPTY on provision failure — the real error only lives in the events stream; the canvas shows a bare red `failed`. Surface the provision error onto the workspace record. Refs #1994, #1922, #1963.
Author
Member

Sharpened root cause (supersedes the org-default framing above): org_default in the admin responses is a wire-compat CONSTANT — org-level defaults were deliberately retired as a billing source (internal#718 P2-B, CTO 2026-05-27), so "derivation overrode the org default" was imprecise. The actual defect:

DeriveProvider(runtime, model, availableAuthEnv) saw the org's shared/global credential env names (e.g. CLAUDE_CODE_OAUTH_TOKEN seeded org-wide), classified the workspace as a non-platform vendor → byok — and then the byok gate refused to use those very shared credentials ("the platform's shared LLM credentials are not used for non-platform workspaces"). The resolver and the gate disagree about what shared creds mean, and a fresh SaaS org walks straight into the contradiction.

Decision options (CTO):

  1. Derive must not count GLOBAL-scope credential env names as BYOK evidence — only workspace-scoped creds flip a workspace to byok. SaaS first-run then resolves platform_managed and works. (My recommendation: it makes the resolver consistent with the gate's own rule.)
  2. Keep derivation as-is but let byok consume org-shared creds (weakens the byok isolation the gate was built for).
  3. Reintroduce an org-level default with precedence over derive (reverses internal#718 P2-B — not recommended without revisiting that decision).

No code change shipped for this one pending your call — it sits directly on the internal#718 P2-B ruling.

**Sharpened root cause (supersedes the org-default framing above):** `org_default` in the admin responses is a wire-compat CONSTANT — org-level defaults were deliberately retired as a billing source (internal#718 P2-B, CTO 2026-05-27), so "derivation overrode the org default" was imprecise. The actual defect: `DeriveProvider(runtime, model, availableAuthEnv)` saw the org's **shared/global** credential env names (e.g. CLAUDE_CODE_OAUTH_TOKEN seeded org-wide), classified the workspace as a non-platform vendor → `byok` — and then the byok gate **refused to use those very shared credentials** ("the platform's shared LLM credentials are not used for non-platform workspaces"). The resolver and the gate disagree about what shared creds mean, and a fresh SaaS org walks straight into the contradiction. **Decision options (CTO):** 1. Derive must not count GLOBAL-scope credential env names as BYOK evidence — only workspace-scoped creds flip a workspace to byok. SaaS first-run then resolves platform_managed and works. (My recommendation: it makes the resolver consistent with the gate's own rule.) 2. Keep derivation as-is but let byok consume org-shared creds (weakens the byok isolation the gate was built for). 3. Reintroduce an org-level default with precedence over derive (reverses internal#718 P2-B — not recommended without revisiting that decision). No code change shipped for this one pending your call — it sits directly on the internal#718 P2-B ruling.
Author
Member

FINAL analysis after a full SSOT check (CTO asked for a comprehensive verification — this supersedes both earlier comments):

The design works exactly as intended, and the CTO's description of it is correct:

  • Template SSOT (config.yaml): runtime_config.model: moonshot/kimi-k2.6, provider: platform, required_env: [] — the default for every claude-code workspace is Kimi on the platform proxy, no key needed.
  • Registry SSOT (providers.yaml runtimes block): the model id namespace encodes the arm deterministically — slash-form vendor/model ids (anthropic/claude-…, moonshot/kimi-…) live on the platform arm; bare/colon-form ids (sonnet, claude-opus-4-7, anthropic:…) live on the BYOK arms. Exact-id match is authoritative; for these ids no env-var tie-break is even consulted (each bare id is owned by exactly one arm).
  • Keys are only required when a non-platform entry is selected in settings (required_env per entry). All as designed.

So the earlier theories (org-default precedence; shared-creds polluting the tie-break) are both WRONG — retracted. The actual failure: the org concierge passed a BYOK-form model id when provisioning the team (the mcp-server management provision_workspace tool's model param is optional and described only as "LLM model id" — an agent naturally types a Claude id like claude-sonnet-4-6, which deterministically resolves to the anthropic-api BYOK arm → byok → MISSING_BYOK_CREDENTIAL on a fresh org). The platform behaved correctly given that input; the input had no guardrail. (The enter-os org was deleted before the stored model could be read back, but the resolver is deterministic — a byok outcome implies a BYOK-form id.)

Design-conformant durable fix (no resolver/billing changes needed):

  1. Server-side default-from-SSOT: when an agent/management provision omits model, default to the template's runtime_config.model (the platform Kimi default) instead of erroring — the create boundary keeps requiring a model, it's just sourced from the SSOT the CTO described.
  2. mcp-server tool guidance: provision_workspace.model description must state the namespace rule — "omit for the org's platform default; vendor/model slash-ids = platform-billed, no key; bare/vendor: ids = BYOK and the workspace must hold its own key" — so orchestrating agents stop guessing.
  3. (Already-good) the provision-time MISSING_BYOK_CREDENTIAL rejection stays as the backstop.

This drops the priority of any resolver change to zero. Owners: (1) core workspace create/provision path, (2) molecule-mcp-server.

**FINAL analysis after a full SSOT check (CTO asked for a comprehensive verification — this supersedes both earlier comments):** The design works exactly as intended, and the CTO's description of it is correct: - Template SSOT (`config.yaml`): `runtime_config.model: moonshot/kimi-k2.6`, `provider: platform`, `required_env: []` — the default for every claude-code workspace is Kimi on the platform proxy, **no key needed**. - Registry SSOT (`providers.yaml` runtimes block): the **model id namespace encodes the arm deterministically** — slash-form `vendor/model` ids (anthropic/claude-…, moonshot/kimi-…) live on the `platform` arm; bare/colon-form ids (`sonnet`, `claude-opus-4-7`, `anthropic:…`) live on the BYOK arms. Exact-id match is authoritative; for these ids **no env-var tie-break is even consulted** (each bare id is owned by exactly one arm). - Keys are only required when a non-platform entry is selected in settings (`required_env` per entry). All as designed. **So the earlier theories (org-default precedence; shared-creds polluting the tie-break) are both WRONG — retracted.** The actual failure: the **org concierge passed a BYOK-form model id** when provisioning the team (the mcp-server management `provision_workspace` tool's `model` param is optional and described only as *"LLM model id"* — an agent naturally types a Claude id like `claude-sonnet-4-6`, which deterministically resolves to the anthropic-api BYOK arm → byok → MISSING_BYOK_CREDENTIAL on a fresh org). The platform behaved correctly given that input; the input had no guardrail. (The enter-os org was deleted before the stored model could be read back, but the resolver is deterministic — a byok outcome implies a BYOK-form id.) **Design-conformant durable fix (no resolver/billing changes needed):** 1. **Server-side default-from-SSOT**: when an agent/management provision omits `model`, default to the template's `runtime_config.model` (the platform Kimi default) instead of erroring — the create boundary keeps requiring a model, it's just sourced from the SSOT the CTO described. 2. **mcp-server tool guidance**: `provision_workspace.model` description must state the namespace rule — "omit for the org's platform default; `vendor/model` slash-ids = platform-billed, no key; bare/`vendor:` ids = BYOK and the workspace must hold its own key" — so orchestrating agents stop guessing. 3. (Already-good) the provision-time MISSING_BYOK_CREDENTIAL rejection stays as the backstop. This drops the priority of any resolver change to zero. Owners: (1) core workspace create/provision path, (2) molecule-mcp-server.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2608