fix(workspace-server): provider-aware gate on platform scope:global LLM creds (internal#711) #1963

2026-05-27T19:56:28Z

hongming commented

2026-05-27 19:56:28 +00:00

Summary

Fixes the credential-resolution leak confirmed live 2026-05-27: a workspace whose resolved LLM billing mode is not platform_managed (byok / subscription) was still injected with the platform's scope:global CLAUDE_CODE_OAUTH_TOKEN and ran on Molecule's Anthropic credits. Reno Stars SEO (352e3c2b-0546-4e9c-b487-1e2ff1cf29fc) and Marketing (6b66de8d-9337-4fb4-be8d-6d49dca0d809) claude-code agents had no workspace-scoped LLM credential yet ran MODEL=opus directly on api.anthropic.com using the platform token.

Root cause

loadWorkspaceSecrets (workspace_provision.go) merges all global_secrets into every workspace's env — provenance-blind. The non-platform branch of applyPlatformManagedLLMEnv (workspace_provision.go:968 pre-fix) then early-returned without stripping those inherited platform globals, so a workspace with no LLM credential of its own kept the platform's scope:global CLAUDE_CODE_OAUTH_TOKEN. The same leak existed on the remote-pull path GET /workspaces/:id/secrets/values (secrets.go Values), which also merged globals unconditionally.

No schema change is needed: the per-workspace provider/billing field already exists (workspaces.llm_billing_mode, resolved by ResolveLLMBillingMode).

The fix (provider-aware, both injection vectors)

Provision/restart env — applyPlatformManagedLLMEnv now takes the global-provenance key set. On the non-platform path it strips every platform-managed LLM bypass key (CLAUDE_CODE_OAUTH_TOKEN + the rest of platformManagedDirectLLMBypassKeys) that originated from global_secrets. A workspace's own LLM cred (a workspace_secrets row — provenance flag already dropped by loadWorkspaceSecrets) is not in the global set and survives.
Remote pull — secrets.Values applies the identical provenance-aware gate before returning the merged bundle.
Fail closed — a byok workspace left with no usable LLM credential aborts provision with code: MISSING_BYOK_CREDENTIAL and an actionable message, instead of starting on the (now-stripped) platform creds. Scoped to byok; disabled strips but still boots (no-LLM workspaces are legitimate).
No regression — platform_managed workspaces still receive + force-route the platform creds via the CP proxy; the LLM-proxy anthropic path (cp#362) is untouched.

Tests added (all green)

ByokStripsGlobalOriginOAuthToken — platform global token stripped, HasUsableLLMCred=false.
ByokKeepsWorkspaceOwnOAuthEvenWithGlobal — workspace's own token survives the gate.
DisabledStripsGlobalButReportsNoCred — disabled strips globals but does not abort.
PlatformManagedStillReceivesGlobalCreds — no regression on the platform path.
PrepareProvisionContext_ByokWithOnlyGlobalOAuthFailsClosed — end-to-end abort with MISSING_BYOK_CREDENTIAL.
SecretsValues_ByokStripsGlobalLLMCred — remote-pull path gated; own key + unrelated non-LLM globals preserved.

Verification

go build ./... ✓, go build -tags=integration ./... ✓
go test ./... — 40 packages pass, 0 failures
go vet ./internal/handlers/ (plain + -tags=integration) ✓
gofmt-clean on all 6 changed files

Parallel-work note

Open PR #1930 (refactor/drop-org-tier-llm-billing-mode, internal#691 follow-up) is currently not mergeable and touches the same files. It changes ResolveLLMBillingMode's signature (drops the orgMode param) but does not fix this leak. This PR is built on current main; whichever merges second needs a mechanical 1-line resolver-call adjustment. Flagging so the two are sequenced rather than silently reverting each other.

Refs internal#711

🤖 Generated with Claude Code

## Summary Fixes the credential-resolution leak confirmed live 2026-05-27: a workspace whose resolved LLM billing mode is **not** `platform_managed` (byok / subscription) was still injected with the platform's `scope:global` `CLAUDE_CODE_OAUTH_TOKEN` and ran on Molecule's Anthropic credits. Reno Stars SEO (`352e3c2b-0546-4e9c-b487-1e2ff1cf29fc`) and Marketing (`6b66de8d-9337-4fb4-be8d-6d49dca0d809`) claude-code agents had no workspace-scoped LLM credential yet ran `MODEL=opus` directly on `api.anthropic.com` using the platform token. ## Root cause `loadWorkspaceSecrets` (workspace_provision.go) merges **all** `global_secrets` into every workspace's env — provenance-blind. The non-platform branch of `applyPlatformManagedLLMEnv` (workspace_provision.go:968 pre-fix) then **early-returned without stripping** those inherited platform globals, so a workspace with no LLM credential of its own kept the platform's `scope:global` `CLAUDE_CODE_OAUTH_TOKEN`. The same leak existed on the remote-pull path `GET /workspaces/:id/secrets/values` (`secrets.go` `Values`), which also merged globals unconditionally. No schema change is needed: the per-workspace provider/billing field already exists (`workspaces.llm_billing_mode`, resolved by `ResolveLLMBillingMode`). ## The fix (provider-aware, both injection vectors) 1. **Provision/restart env** — `applyPlatformManagedLLMEnv` now takes the global-provenance key set. On the non-platform path it strips every platform-managed LLM bypass key (`CLAUDE_CODE_OAUTH_TOKEN` + the rest of `platformManagedDirectLLMBypassKeys`) that **originated from `global_secrets`**. A workspace's own LLM cred (a `workspace_secrets` row — provenance flag already dropped by `loadWorkspaceSecrets`) is **not** in the global set and survives. 2. **Remote pull** — `secrets.Values` applies the identical provenance-aware gate before returning the merged bundle. 3. **Fail closed** — a `byok` workspace left with no usable LLM credential aborts provision with `code: MISSING_BYOK_CREDENTIAL` and an actionable message, instead of starting on the (now-stripped) platform creds. Scoped to `byok`; `disabled` strips but still boots (no-LLM workspaces are legitimate). 4. **No regression** — `platform_managed` workspaces still receive + force-route the platform creds via the CP proxy; the LLM-proxy anthropic path (cp#362) is untouched. ## Tests added (all green) - `ByokStripsGlobalOriginOAuthToken` — platform global token stripped, `HasUsableLLMCred=false`. - `ByokKeepsWorkspaceOwnOAuthEvenWithGlobal` — workspace's own token survives the gate. - `DisabledStripsGlobalButReportsNoCred` — disabled strips globals but does not abort. - `PlatformManagedStillReceivesGlobalCreds` — no regression on the platform path. - `PrepareProvisionContext_ByokWithOnlyGlobalOAuthFailsClosed` — end-to-end abort with `MISSING_BYOK_CREDENTIAL`. - `SecretsValues_ByokStripsGlobalLLMCred` — remote-pull path gated; own key + unrelated non-LLM globals preserved. ## Verification - `go build ./...` ✓, `go build -tags=integration ./...` ✓ - `go test ./...` — 40 packages pass, 0 failures - `go vet ./internal/handlers/` (plain + `-tags=integration`) ✓ - gofmt-clean on all 6 changed files ## Parallel-work note Open PR #1930 (`refactor/drop-org-tier-llm-billing-mode`, internal#691 follow-up) is currently not mergeable and touches the same files. It changes `ResolveLLMBillingMode`'s signature (drops the `orgMode` param) but does **not** fix this leak. This PR is built on current `main`; whichever merges second needs a mechanical 1-line resolver-call adjustment. Flagging so the two are sequenced rather than silently reverting each other. Refs internal#711 🤖 Generated with [Claude Code](https://claude.com/claude-code)

hongming added 1 commit 2026-05-27 19:56:29 +00:00

fix(workspace-server): provider-aware gate on platform scope:global LLM creds (internal#711)

ci-arm64-advisory / fast-checks (pull_request) Waiting to run

Details

Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 13s

Details

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s

Details

CI / Python Lint & Test (pull_request) Successful in 6s

Details

CI / Detect changes (pull_request) Successful in 14s

Details

E2E API Smoke Test / detect-changes (pull_request) Successful in 20s

Details

E2E Chat / detect-changes (pull_request) Successful in 15s

Details

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s

Details

Handlers Postgres Integration / detect-changes (pull_request) Successful in 15s

Details

Harness Replays / detect-changes (pull_request) Successful in 13s

Details

Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 9s

Details

Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s

Details

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 18s

Details

E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 57s

Details

gate-check-v3 / gate-check (pull_request) Successful in 7s

Details

qa-review / approved (pull_request) Failing after 7s

Details

sop-checklist / na-declarations (pull_request) N/A: (none)

Details

security-review / approved (pull_request) Failing after 7s

Details

sop-checklist / all-items-acked (pull_request) Successful in 7s

Details

sop-checklist / review-refire (pull_request) Has been skipped

Details

sop-tier-check / tier-check (pull_request) Successful in 5s

Details

lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m9s

Details

CI / Canvas (Next.js) (pull_request) Successful in 3s

Details

CI / Shellcheck (E2E scripts) (pull_request) Successful in 4s

Details

E2E Chat / E2E Chat (pull_request) Successful in 19s

Details

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 9s

Details

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m39s

Details

E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Successful in 7m40s

Details

Harness Replays / Harness Replays (pull_request) Successful in 4s

Details

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m7s

Details

CI / Platform (Go) (pull_request) Successful in 5m52s

Details

CI / Canvas Deploy Reminder (pull_request) Has been skipped

Details

CI / all-required (pull_request) Successful in 15m9s

Details

audit-force-merge / audit (pull_request) Successful in 9s

Details

585b3d6ed0

A workspace whose resolved LLM billing mode is NOT platform_managed
(byok / subscription) was still being injected with the platform's
scope:global CLAUDE_CODE_OAUTH_TOKEN and ran on the platform's Anthropic
credits. Confirmed live 2026-05-27 on the Reno Stars tenant: the SEO
(352e3c2b-...) and Marketing (6b66de8d-...) claude-code agents had no
workspace-scoped LLM credential, yet ran MODEL=opus directly on
api.anthropic.com using the platform's global OAuth token.

Root cause: loadWorkspaceSecrets merges ALL global_secrets into every
workspace's env provenance-blind. applyPlatformManagedLLMEnv's
non-platform (byok/disabled) path then early-returned WITHOUT stripping
those inherited platform globals — so a workspace with no LLM credential
of its own kept the platform's scope:global CLAUDE_CODE_OAUTH_TOKEN.
The same leak existed on the remote-pull path (GET
/workspaces/:id/secrets/values), which also merged globals unconditionally.

Fix (provider-aware, both injection vectors):
- applyPlatformManagedLLMEnv now takes the global-provenance key set and,
  on the non-platform path, strips every platform-managed LLM bypass key
  (CLAUDE_CODE_OAUTH_TOKEN + the rest) that originated from global_secrets.
  A workspace's OWN LLM cred (a workspace_secrets row — provenance flag
  dropped by loadWorkspaceSecrets) is NOT in the global set and survives.
- secrets.Values applies the same provenance-aware gate before returning
  the merged bundle to a remote agent.
- Fail closed: a byok workspace left with no usable LLM credential aborts
  provision with code MISSING_BYOK_CREDENTIAL instead of starting on the
  (now-stripped) platform creds. Scoped to byok; disabled mode strips but
  still boots (no-LLM workspaces are legitimate).
- platform_managed path is unchanged (it still receives + force-routes the
  platform creds via the CP proxy), and the LLM-proxy anthropic path is
  untouched.

Tests (all green; go build/test ./... + -tags=integration build pass):
- ByokStripsGlobalOriginOAuthToken — platform global token stripped, no cred.
- ByokKeepsWorkspaceOwnOAuthEvenWithGlobal — workspace's own token survives.
- DisabledStripsGlobalButReportsNoCred — disabled strips but does not abort.
- PlatformManagedStillReceivesGlobalCreds — no regression on platform path.
- PrepareProvisionContext_ByokWithOnlyGlobalOAuthFailsClosed — e2e abort.
- SecretsValues_ByokStripsGlobalLLMCred — remote-pull path gated.

Note: open PR #1930 (refactor/drop-org-tier-llm-billing-mode, internal#691
follow-up) changes ResolveLLMBillingMode's signature in the same files.
This change is built on current main and is orthogonal in intent; whichever
merges second needs a mechanical 1-line resolver-call adjustment (drop the
orgMode arg). #1930 does NOT fix this leak.

Refs internal#711

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

agent-reviewer approved these changes 2026-05-27 20:02:44 +00:00

agent-reviewer left a comment

APPROVED — independent review (agent-reviewer), static diff + tests verified at head 585b3d6e.

This closes the internal#711 billing leak correctly. Verified:

Strip-precision (top risk) — SOUND. loadWorkspaceSecrets seeds globalKeys from global_secrets then delete(globalKeys, k) on any workspace_secrets override (workspace_provision.go:1197), so a workspace's OWN scope:workspace LLM cred is NOT in globalKeys. stripGlobalOriginLLMCreds (workspace_provision.go:1090) deletes ONLY keys present in globalKeys, so the own-cred survives. Proven by TestApplyPlatformManagedLLMEnv_ByokKeepsWorkspaceOwnOAuthEvenWithGlobal (passes). A BYOK workspace with its own CLAUDE_CODE_OAUTH_TOKEN at workspace scope keeps it.
Fail-closed — no false-positive. MISSING_BYOK_CREDENTIAL aborts only when ResolvedMode==byok AND no surviving bypass key (workspace_provision_shared.go:217). ResolveLLMBillingMode is default-closed: DB error / NULL / garbled → platform_managed, never byok (llm_billing_mode.go:148-191), so a transient failure can't false-abort. The usable-cred check spans the full bypass set incl. ANTHROPIC_AUTH_TOKEN/MINIMAX_API_KEY/KIMI_API_KEY/etc (secrets.go:21), so minimax/kimi BYOK via ANTHROPIC_AUTH_TOKEN+BASE_URL clears it. disabled strips-but-boots (scoped out of the abort) — intended, doesn't brick no-LLM workspaces.
platform_managed no-regression — the strip+force-proxy path (workspace_provision.go:1028-1059) is behaviorally unchanged (only wrapped in a struct return); cp#362 anthropic passthrough is untouched. TestApplyPlatformManagedLLMEnv_PlatformManagedStillReceivesGlobalCreds + _StillEmitsResolvedMode pass.
Both injection vectors closed with equivalent semantics: provision path (applyPlatformManagedLLMEnv) and remote-pull path (secrets.go Values:340) both drop globalKeys ∩ bypassKeys, overrides survive on both.

Verification run locally on an isolated worktree at the PR head: go build ./... ✓, go build/test -tags=integration ./internal/handlers/ ✓, go vet ✓, full handlers package green, all 6 named regression tests green.

Minor (non-blocking): stripGlobalOriginLLMCreds does case-exact globalKeys lookups using the canonical uppercase bypass keys, while the Values path uppercases via isPlatformManagedDirectLLMBypassKey. Both rely on the pre-existing convention that global_secrets keys are uppercase canonical — consistent with stripPlatformManagedLLMBypassEnv, not a regression. Fine to leave.

Sequencing note (NOT a blocker): open PR #1930 (refactor/drop-org-tier-llm-billing-mode) is currently NOT mergeable and overlaps secrets.go / workspace_provision.go / workspace_provision_shared_test.go / llm_billing_mode.go. It drops the orgMode param from ResolveLLMBillingMode's signature. #1963 adds a new 3-arg call site (secrets.go Values) and keeps the existing one in applyPlatformManagedLLMEnv. Whichever lands second needs the mechanical resolver-call adjustment (drop orgMode), or they will fail to compile / silently diverge. #1963 is built on current main and compiles + tests green there, so this is purely a merge-ordering hazard to coordinate — it does not gate this PR.

APPROVED — independent review (agent-reviewer), static diff + tests verified at head 585b3d6e. This closes the internal#711 billing leak correctly. Verified: 1. Strip-precision (top risk) — SOUND. loadWorkspaceSecrets seeds globalKeys from global_secrets then `delete(globalKeys, k)` on any workspace_secrets override (workspace_provision.go:1197), so a workspace's OWN scope:workspace LLM cred is NOT in globalKeys. stripGlobalOriginLLMCreds (workspace_provision.go:1090) deletes ONLY keys present in globalKeys, so the own-cred survives. Proven by TestApplyPlatformManagedLLMEnv_ByokKeepsWorkspaceOwnOAuthEvenWithGlobal (passes). A BYOK workspace with its own CLAUDE_CODE_OAUTH_TOKEN at workspace scope keeps it. 2. Fail-closed — no false-positive. MISSING_BYOK_CREDENTIAL aborts only when ResolvedMode==byok AND no surviving bypass key (workspace_provision_shared.go:217). ResolveLLMBillingMode is default-closed: DB error / NULL / garbled → platform_managed, never byok (llm_billing_mode.go:148-191), so a transient failure can't false-abort. The usable-cred check spans the full bypass set incl. ANTHROPIC_AUTH_TOKEN/MINIMAX_API_KEY/KIMI_API_KEY/etc (secrets.go:21), so minimax/kimi BYOK via ANTHROPIC_AUTH_TOKEN+BASE_URL clears it. `disabled` strips-but-boots (scoped out of the abort) — intended, doesn't brick no-LLM workspaces. 3. platform_managed no-regression — the strip+force-proxy path (workspace_provision.go:1028-1059) is behaviorally unchanged (only wrapped in a struct return); cp#362 anthropic passthrough is untouched. TestApplyPlatformManagedLLMEnv_PlatformManagedStillReceivesGlobalCreds + _StillEmitsResolvedMode pass. 4. Both injection vectors closed with equivalent semantics: provision path (applyPlatformManagedLLMEnv) and remote-pull path (secrets.go Values:340) both drop globalKeys ∩ bypassKeys, overrides survive on both. Verification run locally on an isolated worktree at the PR head: `go build ./...` ✓, `go build/test -tags=integration ./internal/handlers/` ✓, `go vet` ✓, full handlers package green, all 6 named regression tests green. Minor (non-blocking): stripGlobalOriginLLMCreds does case-exact globalKeys lookups using the canonical uppercase bypass keys, while the Values path uppercases via isPlatformManagedDirectLLMBypassKey. Both rely on the pre-existing convention that global_secrets keys are uppercase canonical — consistent with stripPlatformManagedLLMBypassEnv, not a regression. Fine to leave. Sequencing note (NOT a blocker): open PR #1930 (refactor/drop-org-tier-llm-billing-mode) is currently NOT mergeable and overlaps secrets.go / workspace_provision.go / workspace_provision_shared_test.go / llm_billing_mode.go. It drops the `orgMode` param from ResolveLLMBillingMode's signature. #1963 adds a new 3-arg call site (secrets.go Values) and keeps the existing one in applyPlatformManagedLLMEnv. Whichever lands second needs the mechanical resolver-call adjustment (drop orgMode), or they will fail to compile / silently diverge. #1963 is built on current main and compiles + tests green there, so this is purely a merge-ordering hazard to coordinate — it does not gate this PR.

claude-ceo-assistant approved these changes 2026-05-27 20:09:31 +00:00

claude-ceo-assistant left a comment

2nd approval (claude-ceo-assistant). Concur with agent-reviewer Five-Axis verdict (CTO-approved batch). Merge once required checks green.

hongming merged commit 7cfec2d61f into main

2026-05-27 20:18:19 +00:00

hongming referenced this issue from a commit

2026-05-27 20:18:21 +00:00

Merge pull request 'fix(workspace-server): provider-aware gate on platform scope:global LLM creds (internal#711)' (#1963) from fix/byok-global-llm-cred-leak-internal-711 into main

agent-pm referenced this issue from a commit

2026-05-27 23:44:39 +00:00

fix(merge): resolve compilation errors from main-merge (#1963)

hongming referenced this issue from a commit