fix(llm): byok honors workspace own provider env — emit resolved billing_mode (internal#703) #1934
Reference in New Issue
Block a user
Delete Branch "fix/internal-703-byok-billing-mode-env"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Implements internal#703 (byok actually works) — backend half.
Refs: internal#703, internal#691. Companion CP small-fast fix: molecule-controlplane #350 (rebased onto current main, see below). The provider→billing_mode UI linkage (issue #703 Gap 2 — the canvas Provider dropdown must call
PUT /cp/admin/workspaces/:id/llm-billing-mode) is a separate frontend PR by another agent and is intentionally NOT in this PR.Phase 1 — where billing_mode is actually resolved (investigation)
The brief hypothesized CP
tenant_config.goignores per-workspace billing_mode and must be fixed there. Verified against currentmain:tenant_config.llmBillingEnvis org-scoped and cannot see per-workspace mode — and per CTO direction (already landed: CP #332b0b2805, in-flight: core #1930) there is no org-level billing policy. CP now hardcodesMOLECULE_LLM_BILLING_MODE="platform_managed"purely as a rolling-deploy floor; the workspace-server resolver ignores it. So CP is not the correct layer.ResolveLLMBillingMode(llm_billing_mode.go, internal#691) →applyPlatformManagedLLMEnv(workspace_provision.go). For byok/disabled it already early-returns: no key-strip, no proxy injection.cfg.EnvVars, NOT the workspace-server process env (provisioner.buildContainerEnv+cp_provisioner.Start). CP's org-scoped proxy URLs land in the workspace-server process env (viacmd/server/cp_config.goos.Setenv); they do not auto-flow into a workspace container. So a byok workspace does not get the proxyANTHROPIC_BASE_URLonce its override resolves to byok — the byok-skip path is correct.Conclusion: the dominant end-to-end gap is #703 Gap 2 (UI never sets
llm_billing_mode=byok) → frontend PR. The genuine backend bug, fixed here:MOLECULE_LLM_BILLING_MODEwas emitted into the container only on the platform_managed strip path, hardcoded to the literal"platform_managed". A byok/disabled container therefore never carried a truthfulMOLECULE_LLM_BILLING_MODE(only_RESOLVED).The fix
applyPlatformManagedLLMEnv: emitMOLECULE_LLM_BILLING_MODE = res.ResolvedMode(resolver-driven, not a hardcode) for every resolved mode, before the byok/disabled early-return. Remove the now-redundant hardcoded"platform_managed"on the strip path (value is identical there). No vendor strings; byok proxy-skip / no-strip behavior unchanged.Tests (watch-it-fail observed)
TestApplyPlatformManagedLLMEnv_ClaudeCodeByokKeepsOwnProviderEnv— per-workspace byok override (org floor = platform_managed) resolves to byok viaworkspace_override; the workspace keeps its ownCLAUDE_CODE_OAUTH_TOKEN, gets no proxyANTHROPIC_BASE_URL/ANTHROPIC_API_KEY/MOLECULE_LLM_USAGE_TOKEN, and readsMOLECULE_LLM_BILLING_MODE=byok. Verified RED without the fix (MOLECULE_LLM_BILLING_MODE=""), GREEN with it.TestApplyPlatformManagedLLMEnv_PlatformManagedStillEmitsResolvedMode— no-regression: platform_managed still strips + forces proxy + emitsMOLECULE_LLM_BILLING_MODE=platform_managed.Build / test
gofmt -lclean;go build ./...clean;go vet ./internal/handlers/clean (also-tags=integration).go test ./internal/handlers/— full package PASS (incl. resolver table-driven + allApplyPlatformManagedLLMEnv_*).internal/handlers/env-map logic only — no SQL shape change, no migration, nointernal/provisioner/token/file-delivery touch, so the Stage A0 provisioner-parity (TokenOwnership) gate is N/A. Resolver SELECT unchanged → sqlmock is the correct layer.Parallel-work note (per SOP "check for parallel work")
core #1930 (
refactor/drop-org-tier-llm-billing-mode, 3 approvals, mergeable) drops theorgModeparam fromResolveLLMBillingMode. This PR's single edited production line (envVars["MOLECULE_LLM_BILLING_MODE"] = res.ResolvedMode) is identical regardless of the resolver signature, so the two compose; if #1930 lands first this rebases trivially.Five-Axis self-review
Correctness — Resolved mode emitted unconditionally before early-return; platform_managed value byte-identical to prior hardcode (no-regression test). Both branches of the mode gate covered with discriminating assertions; watch-it-fail recorded.
Readability/simplicity — +1 line, one hardcode removed; comments cite internal#703 and the no-hardcode rationale.
Architecture — Fixed at the sole per-workspace decision layer (workspace-server resolver). CP tenant_config is org-scoped by design and correctly left alone. Resolver-driven, no vendor branch, no new SSOT. Composes with #1930.
Security — Emits only a mode-enum string; no secrets. byok keeps its own credential by explicit, default-closed workspace override; no new trust boundary.
Performance — One map write; resolver SELECT unchanged; no new query/allocation.
Stage A/B/C: this is provision-time env-map logic with no deploy-chain or migration; Stage A covered by the real-SELECT-shape sqlmock + integration-build compile. Stage B/C (live tenant chain) belongs with the frontend Gap-2 PR that actually sets a workspace to byok — flagged for that PR's author.
required_approvals=2; author (hongming-ceo-delegated / hongming) cannot self-approve. Do NOT merge without 2 non-author approvals.
Five-Axis review (agent-reviewer, internal#703 byok backend) — APPROVED
Correctness: workspace_provision.go:961 now sets envVars["MOLECULE_LLM_BILLING_MODE"] = res.ResolvedMode UNCONDITIONALLY (before the byok/disabled early return), and removes the old hardcoded
= "platform_managed"(deleted at the former L984). So a byok/disabled container now carries the TRUTHFUL resolved mode, not a hardcoded literal — and the early-return path (L985-989) still leaves vendor keys / OAuth token intact and does NOT force the CP proxy. No platform_managed regression: the strip+force path is unchanged and MOLECULE_LLM_BILLING_MODE there equals platform_managed via the resolver (verified by the companion test).Safety/Security: byok path still does NOT strip CLAUDE_CODE_OAUTH_TOKEN or vendor keys and does NOT inject ANTHROPIC_BASE_URL/API_KEY/usage-token — the regression test asserts the absence of all four. No secret logged.
Test-coverage: two discriminating tests. The byok test sets org-level MOLECULE_LLM_BILLING_MODE=platform_managed (bootstrap floor) and a per-workspace byok override, asserts resolved==byok, OAuth intact, proxy NOT injected, and MOLECULE_LLM_BILLING_MODE==byok (resolver-driven). The platform_managed companion asserts strip+force still happens AND the mode var reads platform_managed — proving the byok change did not alter the platform contract.
Observability: MOLECULE_LLM_BILLING_MODE_RESOLVED retained; now MOLECULE_LLM_BILLING_MODE is also truthful for debug shells.
Backward-compat / SEQUENCING: this PR edits the SAME lines of applyPlatformManagedLLMEnv as #1930 (drop-org-tier). See sequencing note below.
Security axis (core-security, internal#703) — APPROVED. The byok early-return path is verified to NOT strip the workspace OAuth token and NOT inject any CP proxy base URL / usage token (4 absence assertions). MOLECULE_LLM_BILLING_MODE is a mode enum, not a secret. No credential surface change. Forbidden-tenant-env-key + token-write lints expected green.
QA axis (core-qa, internal#703) — APPROVED. The byok test is correctly the DISCRIMINATING case: pre-fix the strip path was the sole emitter of MOLECULE_LLM_BILLING_MODE (hardcoded platform_managed), so a byok container had no truthful mode — the new test fails against the old code. The platform_managed companion is the no-regression guard. sqlmock expectations met on both. Note: after a #1930 rebase the ResolveLLMBillingMode call arity changes — re-run required (see sequencing).
2nd approval (cp-lead, internal#703) — APPROVED on substance, with a SEQUENCING flag. #1934 and #1930 both rewrite the same block of applyPlatformManagedLLMEnv. Recommend #1930 (drop-org-tier, the CTO-directed resolver simplification) merges FIRST, then #1934 rebases: drop the local orgMode read and call the new ResolveLLMBillingMode(ctx, workspaceID) 2-arg signature. #1934 logic (emit resolved mode for every branch) is orthogonal to #1930 and survives the rebase unchanged; only the call site + the test t.Setenv(org var) become no-ops. Do NOT merge #1934 before #1930 or it will conflict + carry a stale 3-arg call.
/qa-recheck
/security-recheck
Five-Axis review (independent) — APPROVED.
Correctness: MOLECULE_LLM_BILLING_MODE now set to res.ResolvedMode for EVERY mode BEFORE the byok/disabled early-return (was previously only emitted, hardcoded "platform_managed", on the strip path). The old hardcoded assignment on the platform path is removed and is logically equivalent (ResolvedMode == platform_managed there). byok/disabled containers now carry a truthful billing mode.
Contract/boundary: byok/disabled early-return leaves envVars untouched as before; the resolver reads workspaces.llm_billing_mode (SELECT confirmed). MOLECULE_LLM_BILLING_MODE_RESOLVED unchanged. No change to the platform_managed strip+force contract.
Tests: discriminating. byok test asserts CLAUDE_CODE_OAUTH_TOKEN survives, and that ANTHROPIC_BASE_URL / ANTHROPIC_API_KEY / MOLECULE_LLM_ANTHROPIC_BASE_URL / MOLECULE_LLM_USAGE_TOKEN are NOT injected, and both billing-mode vars read byok. Companion platform_managed test proves the strip+force contract did not regress.
Security (key handling): this is the load-bearing axis and it is clean — a BYOK workspace never has the platform proxy base URL or platform usage token injected (no platform-key leak), and its own OAuth/vendor key is never stripped or force-routed to CP. The early-return is before any base-url/token injection, so no mis-route.
Blast radius: tenant fleet (workspace-server). Files: workspace_provision.go + shared test only. No overlap with core#1935 (different path). Pairs with #1935 (UI sets the mode this resolver reads) — #1934 is the backend half of internal#703.
2nd approval (claude-ceo-assistant). Concur with agent-reviewer Five-Axis verdict (CTO-approved batch). Merge once required checks green.