Fresh SaaS tenant: concierge team spin-up fails MISSING_BYOK_CREDENTIAL — provider derivation overrides org default platform_managed #2608



core-devops commented

2026-06-11 21:57:25 +00:00

Live repro 2026-06-11 ~21:51Z, brand-new tenant enter-os (first real user flow): user asked the concierge to spin up a team; both child workspaces (enter-product, enter-growth, runtime claude-code tier 4) failed WORKSPACE_PROVISION_FAILED / MISSING_BYOK_CREDENTIAL (payload cites #1994), while the org's billing default was platform_managed.

Mechanism (llm_billing_mode.go resolution): with no workspace override, precedence DERIVES the provider from (runtime, model) BEFORE the org default — a specific non-platform vendor derivation forces byok, which a fresh org can never satisfy (no workspace credential exists yet). Net effect: the canonical first-run journey — new tenant → concierge → "build me a team" — fails out of the box, with failed nodes in the org tree.

Unblock applied (enter-os): explicit workspace overrides to platform_managed via PUT /admin/workspaces/:id/llm-billing-mode + restart → both online.
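For anyone repeating the unblock, a minimal sketch of building that PUT request follows. Only the method and path come from this issue. The base URL, the workspace ID, and especially the JSON body shape (`{"mode": ...}`) are assumptions made for illustration, and authentication is left out entirely; check the admin API before relying on any of it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// billingModeOverride is a HYPOTHETICAL request body. The real payload
// shape of the admin endpoint is not shown in this issue.
type billingModeOverride struct {
	Mode string `json:"mode"`
}

// newOverrideRequest builds (but does not send) the PUT request that sets
// an explicit billing-mode override on one workspace.
func newOverrideRequest(baseURL, workspaceID, mode string) (*http.Request, error) {
	body, err := json.Marshal(billingModeOverride{Mode: mode})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/admin/workspaces/%s/llm-billing-mode", baseURL, workspaceID)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Base URL and workspace ID are placeholders, not real values.
	req, err := newOverrideRequest("https://cp.example.invalid", "ws-123", "platform_managed")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

The request would still need whatever admin credentials the control plane expects, and the workspace needs a restart afterwards, as noted above.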

Decision needed (CTO — this is the standing "billing default direction" question, now with live first-run evidence): on SaaS tenants with a working LLM proxy, resolution should prefer the org default (platform_managed) over vendor derivation; byok only on explicit opt-in (workspace override or org default set to byok). Self-host keeps fail-closed byok.

Secondary (either here or with cp#729): the workspace row's last_error stays EMPTY on provision failure — the real error only lives in the events stream; the canvas shows a bare red failed. Surface the provision error onto the workspace record.
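A rough sketch of what "surface the provision error onto the workspace record" could look like. The `Workspace` and `ProvisionEvent` types, their fields, and `recordProvisionFailure` are all hypothetical names for illustration; only `last_error`, the `failed` status, and the two error codes come from this issue.

```go
package main

import "fmt"

// Workspace is a HYPOTHETICAL slice of the workspace row.
type Workspace struct {
	ID        string
	Status    string
	LastError string // maps to the last_error column mentioned above
}

// ProvisionEvent is a HYPOTHETICAL entry from the events stream.
type ProvisionEvent struct {
	WorkspaceID string
	Code        string // e.g. WORKSPACE_PROVISION_FAILED
	Reason      string // e.g. MISSING_BYOK_CREDENTIAL
}

// recordProvisionFailure copies the failure detail from the event onto the
// workspace, so anything reading the row (such as the canvas) can show why
// the workspace failed instead of a bare status.
func recordProvisionFailure(ws *Workspace, ev ProvisionEvent) {
	if ws.ID != ev.WorkspaceID {
		return // event belongs to a different workspace
	}
	ws.Status = "failed"
	ws.LastError = fmt.Sprintf("%s: %s", ev.Code, ev.Reason)
}

func main() {
	ws := Workspace{ID: "enter-product", Status: "provisioning"}
	recordProvisionFailure(&ws, ProvisionEvent{
		WorkspaceID: "enter-product",
		Code:        "WORKSPACE_PROVISION_FAILED",
		Reason:      "MISSING_BYOK_CREDENTIAL",
	})
	fmt.Printf("%s %s %q\n", ws.ID, ws.Status, ws.LastError)
}
```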

Refs #1994, #1922, #1963.


core-devops referenced this issue

2026-06-11 22:12:27 +00:00

Concierge-provisioned workspaces are created as parent_id NULL orphans — breaks A2A ("cannot communicate per hierarchy rules"), canvas depth-1 (#2601), and the single-root org model #2609

core-devops referenced this issue

2026-06-11 22:44:26 +00:00

Watchdog double-provision race wedges fresh workspaces: two boxes fight one bootstrap-register mint → permanent 401 + online-but-braindead agents #2611

core-devops commented

2026-06-11 22:46:04 +00:00

Sharpened root cause (supersedes the org-default framing above): org_default in the admin responses is a wire-compat CONSTANT — org-level defaults were deliberately retired as a billing source (internal#718 P2-B, CTO 2026-05-27), so "derivation overrode the org default" was imprecise. The actual defect:

DeriveProvider(runtime, model, availableAuthEnv) saw the org's shared/global credential env names (e.g. CLAUDE_CODE_OAUTH_TOKEN seeded org-wide), classified the workspace as a non-platform vendor → byok — and then the byok gate refused to use those very shared credentials ("the platform's shared LLM credentials are not used for non-platform workspaces"). The resolver and the gate disagree about what shared creds mean, and a fresh SaaS org walks straight into the contradiction.
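The contradiction can be reproduced in a toy model. This is not the real `DeriveProvider` or the real gate: `Credential`, `Scope`, `deriveMode`, and `gate` are hypothetical stand-ins, and vendor detection is reduced to a single env-name check. It only illustrates how a resolver that ignores scope and a gate that enforces scope cannot both be satisfied by a fresh org.

```go
package main

import (
	"errors"
	"fmt"
)

// Scope and Credential are HYPOTHETICAL types for this sketch.
type Scope string

const (
	ScopeGlobal    Scope = "global"
	ScopeWorkspace Scope = "workspace"
)

type Credential struct {
	EnvName string
	Scope   Scope
}

// deriveMode mimics the resolver as described above: any visible vendor
// credential env name, regardless of scope, classifies the workspace as byok.
func deriveMode(creds []Credential) string {
	for _, c := range creds {
		if c.EnvName == "CLAUDE_CODE_OAUTH_TOKEN" {
			return "byok"
		}
	}
	return "platform_managed"
}

// gate mimics the byok gate as described above: shared (global) credentials
// are never accepted for a byok workspace, only workspace-scoped ones.
func gate(mode string, creds []Credential) error {
	if mode != "byok" {
		return nil
	}
	for _, c := range creds {
		if c.Scope == ScopeWorkspace {
			return nil
		}
	}
	return errors.New("MISSING_BYOK_CREDENTIAL")
}

func main() {
	// A fresh org: one credential seeded org-wide, nothing workspace-scoped.
	fresh := []Credential{{EnvName: "CLAUDE_CODE_OAUTH_TOKEN", Scope: ScopeGlobal}}
	mode := deriveMode(fresh)
	fmt.Println(mode, gate(mode, fresh))
}
```

In this model the same global credential that triggers `byok` is the one the gate then rejects, which matches the first-run failure reported here.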

Decision options (CTO):

1. Derive must not count GLOBAL-scope credential env names as BYOK evidence; only workspace-scoped creds flip a workspace to byok. SaaS first-run then resolves platform_managed and works. (My recommendation: it makes the resolver consistent with the gate's own rule.)
2. Keep derivation as-is but let byok consume org-shared creds (weakens the byok isolation the gate was built for).
3. Reintroduce an org-level default with precedence over derive (reverses internal#718 P2-B; not recommended without revisiting that decision).
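To make the first option concrete, here is a minimal sketch under the assumption that each credential carries a scope. `Credential`, `Scope`, `workspaceScoped`, and `deriveModeScoped` are hypothetical names; the real resolver takes `availableAuthEnv`, whose type is not shown in this thread. The idea is simply to drop global-scope entries before derivation sees them.

```go
package main

import "fmt"

// Scope and Credential are HYPOTHETICAL types for this sketch.
type Scope string

const (
	ScopeGlobal    Scope = "global"
	ScopeWorkspace Scope = "workspace"
)

type Credential struct {
	EnvName string
	Scope   Scope
}

// workspaceScoped keeps only credentials that belong to the workspace itself.
func workspaceScoped(creds []Credential) []Credential {
	var out []Credential
	for _, c := range creds {
		if c.Scope == ScopeWorkspace {
			out = append(out, c)
		}
	}
	return out
}

// deriveModeScoped treats only workspace-scoped credentials as BYOK evidence.
// Global (org-shared) credentials no longer flip a workspace to byok.
func deriveModeScoped(creds []Credential) string {
	if len(workspaceScoped(creds)) > 0 {
		return "byok"
	}
	return "platform_managed"
}

func main() {
	fresh := []Credential{{EnvName: "CLAUDE_CODE_OAUTH_TOKEN", Scope: ScopeGlobal}}
	optedIn := append(fresh, Credential{EnvName: "CLAUDE_CODE_OAUTH_TOKEN", Scope: ScopeWorkspace})
	fmt.Println(deriveModeScoped(fresh))   // fresh SaaS org
	fmt.Println(deriveModeScoped(optedIn)) // workspace brought its own key
}
```

With this filter a fresh org resolves `platform_managed`, while a workspace that has its own credential still resolves `byok`, so the gate's isolation rule is unchanged.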

No code change shipped for this one pending your call — it sits directly on the internal#718 P2-B ruling.
