fix(workspace-server): provision-time billing derives from EFFECTIVE model, not raw payload.Model (#1994) [BEHAVIOR-AFFECTING — CTO merge-go] #1995
Reference in New Issue
Block a user
Delete Branch "fix/1994-provision-billing-model-passthrough"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Fixes #1994. BEHAVIOR-AFFECTING (provisioning hot path) — DO NOT MERGE without CTO merge-go.
tier:medium.Phase 1 — evidence (brief-falsification)
Brief claims (each a hypothesis until verified)
readWorkspaceDeriveInputs(read path) missesglobal_secrets, soavailableAuthEnvdiffers between read & provision →DeriveProvidertie-breaks differently foropus.availableAuthEnvfrom a different source than the read path.DeriveProvider(claude-code, opus)is auth-env-SENSITIVE (ambiguous platform-anthropic vs anthropic-oauth, tie-broken byCLAUDE_CODE_OAUTH_TOKEN).applyPlatformManagedLLMEnv+stripGlobalOriginLLMCreds) strips the global-origin oauth and the fix should MATERIALIZE that oauth (do not strip it).Verification + evidence
claude-codenative set listsopusEXACT-ONLY inanthropic-oauth(Models: [sonnet,opus,haiku,…]);platformlistsanthropic/claude-opus-4-7,anthropic-apilistsclaude-opus-4-7.DeriveProviderStep 3 (exact model-id match) resolvesopus → anthropic-oauthdeterministically, auth-env-INSENSITIVE. SoavailableAuthEnvis irrelevant for this model.readWorkspaceDeriveInputs) does query onlyworkspace_secrets; the provision path computesavailableAuthEnvNamesfrom the merged env (which DOES includeglobal_secrets). A genuine latent divergence — but moot here becauseopusresolves by exact match before the auth-env tie-break runs.applyPlatformManagedLLMEnv(... payload.Runtime, payload.Model)passes the RAWpayload.Model. On a re-provision (workspace_restart.go:333restart,:844auto-restart,:1017resume), the payload is rebuilt viawithStoredCompute(ctx, id, CreateWorkspacePayload{Name, Tier, Runtime})—withStoredComputebackfillsComputebut NOTModel, sopayload.Model == "".DeriveProvider(claude-code, "")errors ("model is required") → resolver defaults closed toplatform_managed→ bakesANTHROPIC_BASE_URL=<platform proxy>. The READ endpoint readsMODELfromworkspace_secrets(opus) → derivesanthropic-oauth→ byok. That is the divergence, and it is deterministic on every re-provision (explains "the canary came back identical").SetGlobalREJECTS bypass-list keys (incl.CLAUDE_CODE_OAUTH_TOKEN) viarejectPlatformManagedDirectLLMBypass, so an oauth inglobal_secretsis the PLATFORM's token (the original drain source), not the customer's. internal#711 +TestPrepareProvisionContext_ByokWithOnlyGlobalOAuthFailsCloseddeliberately strip it and fail closed. Materializing it would re-introduce the drain. The customer's OWN oauth must be aworkspace_secretsrow (survives the provenance-aware strip). The fix must NOT preserve the global-origin oauth.org_defaultvs marketingderived_provider.org_defaultis the RETIRED source label (internal#718 P2-B); the current resolver emitsderived_defaultwhen derive fails. Marketing has aMODEL=opusworkspace_secret (read path derives byok); PM resolved platform via the default-closed path (no derivable model on the read inputs at observation time). Input difference = the stored MODEL, exactly as the root cause predicts.Live confirmation (today, SSM into
i-0ad447e8433902973= ws-tenant-reno-stars-6b66de8d)MODEL=opusis in the container (set byapplyRuntimeModelEnv'senvVars["MODEL"]fallback) — proving the model WAS available inenvVars, just never passed to the billing resolver.Phase 2 — design
Single source of truth for effective model: extract
effectiveModelForBilling(model, envVars)(the exact fallback chainapplyRuntimeModelEnvalready used: explicit →MOLECULE_MODEL→MODEL) and have the billing resolver consult it, so provision-path derive inputs equal read-path inputs. The stored model already lives in the mergedenvVars(loadWorkspaceSecrets) — no new DB query. Rollback: single-functiongit revert.Phase 3 — implementation + tests
workspace-server/internal/handlers/workspace_provision.go: +effectiveModelForBillinghelper;applyRuntimeModelEnvrefactored to call it (DRY, keeps the two model consumers in sync — the divergence existed because they didn't share this logic);applyPlatformManagedLLMEnvderives fromeffectiveModelForBilling(model, envVars).Tests (
llm_billing_mode_provision_parity_test.go):TestApplyPlatformManagedLLMEnv_ReProvisionUsesStoredModel— payload.Model="" + stored MODEL=opus + own (workspace-origin) oauth ⇒ byok, no platform usage-token injected, own oauth survives.TestApplyPlatformManagedLLMEnv_ReadProvisionParity— read-path resolver and provision-path resolver land on the SAME mode for identical workspace inputs (core regression guard).TestApplyPlatformManagedLLMEnv_DefaultPreservation— no model + no cred ⇒ platform_managed (CTO default-stays-platform).TestApplyPlatformManagedLLMEnv_ByokGlobalOnlyOAuthStillFailsClosed— #711 guard: a global-origin platform oauth is still STRIPPED on byok (no drain).TestReProvisionPayloadOmitsModel— pins the upstream trigger (re-provision payload omits Model).Mutation evidence: reverting the
envVarsfallback ineffectiveModelForBillingturns ReProvision + Parity + #711-guard RED; DefaultPreservation correctly stays GREEN (no model to fall back to). Load-bearing.Verification:
go build ./...✓ ·go build -tags=integration ./...✓ ·go test -tags=integration ./...✓ (43 pkgs ok, 0 FAIL) · full./internal/handlers/✓ ·golangci-lint run ./internal/handlers/→ 0 issues ·gofmtclean. Stage A0 provisioner-parity is Docker-gated (CI runner exercises it; this change is orthogonal to the token/file-delivery path it covers).Phase 4 — five-axis self-review
opusexact-match is deterministic. Idempotent.applyRuntimeModelEnvso the preceding doc-comment stays attached.payload.*while read keys off stored DB inputs (latent divergence surface). This fix is the correct narrowing (SSOT for model); fuller convergence is larger scope and risks the genuinely-platform path. Deferred.Post-merge expectation
Once merged + deployed, re-provisioning the marketing agent (
6b66de8d) will resolve byok (opus → anthropic-oauth) and reach the byok branch: no platform-proxy override. The earlier-failed canary now succeeds for any workspace whose customer set their OWN oauth as aworkspace_secret. Caveat: a workspace whose only oauth is the PLATFORM's global-origin token (e.g. if reno's key was only inglobal_secrets) will correctly hitMISSING_BYOK_CREDENTIAL(internal#711) — the customer must supply their own credential via the canvas Secrets tab. That is intended behavior, not a regression. agents-team genuinely-platform workspaces stay platform_managed.🤖 Generated with Claude Code
NOT MERGED - needs CTO merge-go. Behavior-affecting (provisioning hot path / LLM billing).
tier:medium. Two non-author approvals + green required CI per BP, plus explicit CTO go before merge. Merge with a merge COMMIT (never squash).The A2A e2e historically asserted only response SHAPE (test_a2a_e2e.sh checked '"kind":"text"' only). A broken agent returns its error AS a text part -- {"kind":"text","text":"Agent error (Exception) ..."} -- which STILL matches the shape check, so it PASSED on a fully broken agent. That is why the 2026-05-2x drained-key / byok-misroute failures (agents-team PM + reno marketing erroring on every LLM call) sailed through CI. "Channel returns text shape" is not "agent completed an LLM round-trip." Adds, ADDITIVELY (no existing assertion weakened or removed): - tests/e2e/lib/completion_assert.sh -- reusable gates: * a2a_assert_real_completion: deterministic known-answer round-trip; asserts CONTAINS the expected token AND NOT an error-as-text marker (Agent error / Exception / error result / MISSING_BYOK_CREDENTIAL). * provider_liveness_matrix + offered_platform_models_for_runtime: per-offered-provider cheap (max_tokens:4) probe; the offered set is read from the providers.yaml SSOT (runtimes.<rt>.providers[platform] .models) -- not a hardcoded list -- so the matrix tracks the SSOT. * assert_byok_not_platform_proxy: #1994 regression guard -- a byok-resolving workspace must NOT resolve platform_managed (reads the same derived resolver GET /admin/workspaces/:id/llm-billing-mode the provision strip gate uses). - tests/e2e/test_staging_full_saas.sh (the live-agent lane, MiniMax primary): new stanzas 8b (PINEAPPLE known-answer, the core gate), 8c (byok-routing guard), 8d (SSOT-driven per-provider liveness matrix). - tests/e2e/test_a2a_e2e.sh: added check_no_error_as_text on Echo + SEO replies so the brief's literal shape-only example now FAILS on an error-as-text payload. - tests/e2e/test_completion_assert_unit.sh: offline fail-direction proof (16 cases) that the negative gates are load-bearing -- error-as-text MUST fail, platform_managed MUST trip the #1994 guard. Wired into ci.yml "Run E2E bash unit tests (no live infra)" (required, per-PR + main). e2e-staging-saas.yml paths filter extended to re-trigger the live lane on lib changes. No #1994 fix code touched -- tests/e2e + workflow wiring only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>Added: comprehensive real-completion A2A e2e (same PR — fix + e2e ship as one unit)
Pushed
3269e932onto this branch. Tests/e2e + workflow wiring only — the #1994 fix code (workspace_provision.go+ parity tests) is untouched.Why this e2e (the gap it closes)
The A2A e2e historically asserted only response shape —
test_a2a_e2e.sh:check "...has text" '"kind":"text"'. A broken agent returns its error as a text part:which STILL matches
"kind":"text"→ the shape check passed on a fully broken agent. That is exactly why the 2026-05-2x drained-key / byok-misroute failures (agents-team PM + reno marketing erroring on every LLM call) sailed through CI. "Channel returns text shape" ≠ "agent completed an LLM round-trip."What was added (additive — no existing assertion weakened)
tests/e2e/lib/completion_assert.sh— reusable gates:a2a_assert_real_completion— deterministic known-answer round-trip: asserts the reply CONTAINS the expected token (case-insensitive) AND NOT any error-as-text marker (Agent error/Exception/error result/MISSING_BYOK_CREDENTIAL).provider_liveness_matrix+offered_platform_models_for_runtime— per-offered-provider cheap (max_tokens:4) probe; the offered set is read from theproviders.yamlSSOT (runtimes.<runtime>.providers[platform].models), not a hardcoded list, so the matrix tracks the SSOT automatically. Unconfigured-in-lane providers areSKIP(rc=75), not FAIL — deterministic + low-flake.assert_byok_not_platform_proxy— #1994 regression guard: a byok-resolving workspace must NOT resolveplatform_managed. Reads the same derived resolver the provision-time strip gate uses (GET /admin/workspaces/:id/llm-billing-mode→resolved_mode/provider_selection). A regression of #1994 (byok→platform-proxy) flipsresolved_modetoplatform_managedand trips this RED.tests/e2e/test_staging_full_saas.sh(the live-agent lane — provisions a real staging org, MiniMax-primary key): new stanzas after the existing parent A2A check —platform_managed).tests/e2e/test_a2a_e2e.sh— addedcheck_no_error_as_texton the Echo + SEO replies, so the brief's literal shape-only example now FAILS on an error-as-text payload (the existing"kind":"text"shape checks stay).tests/e2e/test_completion_assert_unit.sh— offline fail-direction proof (16 cases, no LLM/network) that the negative gates are load-bearing: an error-as-text payload MUST FAIL the real-completion gate; aplatform_managedresolution MUST TRIP the #1994 guard.CI wiring
test_completion_assert_unit.shruns inci.yml→ "Run E2E bash unit tests (no live infra)" (alongsidetest_model_slug.sh) — the required Shellcheck (E2E scripts) job, on push + PR to main/staging. So a refactor that weakens the gate to a shape check goes red on every PR, fast lane, no live infra.e2e-staging-saas.ymlpaths filter extended withtests/e2e/lib/completion_assert.shso a lib change re-triggers the live MiniMax-backed lane (push-to-main + nightly + provisioning-critical PRs).Evidence the negative assertion is load-bearing
test_completion_assert_unit.sh: 16 passed, 0 failed, including the decisive cases —Agent error (Exception) ...rejected,MISSING_BYOK_CREDENTIALrejected, error-as-text that also contains the token rejected, andresolved_mode=platform_managedtripping the #1994 guard.bash -nclean +shellcheck --severity=warningclean on all touched scripts; SSOT matrix read verified through the lib helper (returns the claude-code platform set:anthropic/claude-opus-4-7,anthropic/claude-sonnet-4-6,moonshot/kimi-k2.6,moonshot/kimi-k2.5,minimax/MiniMax-M2.7,minimax/MiniMax-M2.7-highspeed).🤖 Generated with Claude Code
Third piece added: clean-SSOT credential fix (molecule-core#1994 credential follow-on)
Head is now
bbb445b9(was3269e932). This commit is additive to the #1994 model-passthrough fix + the MiniMax A2A e2e already on this branch — both untouched.What changed (behavior)
global_secretsis the tenant's own store (CTO-confirmed), so the tenant's own oauth — at global OR workspace scope — is exactly what byok runs on, direct. Removed the strip inworkspace_provision.go::applyPlatformManagedLLMEnv(deleted the deadstripGlobalOriginLLMCreds+ the unusedglobalKeysparam) and the symmetric strip on the remote-pull path insecrets.go::Values.stripPlatformManagedLLMBypassEnvuntouched.MISSING_BYOK_CREDENTIAL(trigger narrowed from "no workspace-scoped cred" to "no cred at any scope").SetGlobalstill rejects bypass-list keys for a platform_managed tenant (keeps a platform-shaped credential out of global_secrets going forward); added a regression test.Tests: strip-asserting unit + e2e tests inverted to the corrected model; all three behavior changes are mutation-load-bearing (re-adding either strip / dropping the SetGlobal guard → respective test RED, verified).
go build,go vet,golangci-lint(0 issues), and the full-tags=integrationhandlers suite are green.⚠️ Phase-1 live audit — co-mingling that needs CTO cleanup BEFORE merge
Removing the strip is safe for the oauth credential everywhere (each tenant's
CLAUDE_CODE_OAUTH_TOKENis distinct — verified by plaintext-hash, no shared/platform value). But there is one platform-origin credential co-mingled into three tenants'global_secretsthat WILL re-drain once the strip is gone:MINIMAX_API_KEYis byte-identical acrosschloe-dong,agents-team,hongming(plaintext-sha202544574e…, len 125) and matches the platform/operator key in/etc/molecule-bootstrap/personas/triage-operator/envon the op-host. This is the JRS-style "platform-keyed secret-reuse via DB copy" recreation pattern — a shared operator MiniMax key planted into tenant global_secrets, not the tenant's own.Exposure:
agents-teamhas explicit byok workspaces incl. "Dev Engineer B (MiniMax)". With the strip removed, that workspace would run MiniMax directly on the platform/operator's key (billing the platform), instead of the strip masking it.Origin in code: the only in-core writers to
global_secretsareSetGlobal(tenant-driven, guarded) andmain.go::fixAdminTokenPlaceholder(ADMIN_TOKEN only) — neither plants this key. The shared MiniMax value was injected by an out-of-core recreation/secret-reuse flow (operator-side DB copy). No core handler reproduces it, so it's outside this PR's scope to fix in code.Requested CTO action before merge: rotate/replace the shared
MINIMAX_API_KEYin thechloe-dong/agents-team/hongmingglobal_secretswith each tenant's OWN MiniMax key (or remove it where unused). The other co-mingled-key audit came back clean: all oauth + KIMI values are per-tenant distinct;jrs-autoandgo-to-markethave no bypass-list globals at all. (No tenant data was mutated by this audit — read-only.)APPROVE — independent falsifying Five-Axis review (built + ran on op-host; HEAD
bbb445b9). Merge is GATED on the MiniMax-key rotation (see Security axis); do NOT merge until that pre-merge cleanup is done.1. Correctness — PASS. (a)
effectiveModelForBilling(model, envVars)is the shared SSOT (explicit →MOLECULE_MODEL→MODEL);applyRuntimeModelEnvnow calls it (DRY, no divergence). (b) byok branch leaves tenant creds intact, no proxy override, routes direct. (c) platform_managed branch byte-identical —stripPlatformManagedLLMBypassEnvbody + the CP-proxyMOLECULE_LLM_USAGE_TOKEN/ANTHROPIC_BASE_URLinjection untouched (diff confirms only comments + the byok-branch above it changed). (d) Dead-code removal clean:stripGlobalOriginLLMCredsdeleted,globalKeysparam dropped fromapplyPlatformManagedLLMEnv;globalSecretKeysis still legitimately consumed byfindForbiddenTenantEnvKeysFromGlobals(not orphaned).go build ./...+go build -tags=integration ./...+go vetall green → compiler-verified no orphans.2. Non-regression — PASS. platform_managed metering path unchanged.
ReProvisionUsesStoredModel+ReadProvisionParity(#1994 guards) pass. e2e additions are purely additive — diff oftest_a2a_e2e.shshows only+lines; the existing"kind":"text"shape checks remain.DefaultPreservation(no model + no cred ⇒ platform_managed) holds.3. Tests — PASS.
go test -tags=integration ./internal/handlers/GREEN (16.5s). Offline e2etest_completion_assert_unit.sh16/16, shellcheck clean, wired into the required "Run E2E bash unit tests (no live infra)" ci.yml job. Mutation checks (both performed, both load-bearing):ByokGlobalScopeOAuthSurvives(AndRunsDirect),ReProvisionUsesStoredModel,ByokWithTenantGlobalOAuthSucceedswent RED (global-scope oauth stripped → MISSING_BYOK_CREDENTIAL / HasUsableLLMCred=false), exactly the predicted direction.DefaultPreservationcorrectly stayed GREEN.rejectPlatformManagedDirectLLMBypassguard fromSetGlobal→TestSetGlobal_RejectsPlatformBypassKeyOnPlatformManagedTenantwent RED (got 500 from the unguarded INSERT instead of the 400 reject).Restored to clean head + rebuilt green after each.
4. Security / credential-handling — the crux, and the corrected model checks out under the org-per-EC2 architecture.
global_secrets(migration 012:key TEXT UNIQUE, one row-set per workspace-server DB) is a single store per EC2, and each EC2 hosts exactly one tenant → within an instanceglobal_secretsIS that tenant's own shared store. So removing the byok strip is the right model: byok runs on the tenant's own cred at global OR workspace scope, direct.agents-team("Dev Engineer B (MiniMax)") would run on the platform/operator MiniMax key direct — the operator key intriage-operator/envis confirmed present on the op-host; the PR audit asserts it is byte-identical (sha202544574e…, len 125) inchloe-dong/agents-team/hongmingglobal_secrets (JRS-style secret-reuse-via-DB-copy recreation pattern).SetGlobalreject guard to stop FUTURE in-code co-mingling.MOLECULE_LLM_BILLING_MODE=platform_managed(org default); on a byok-default tenant SetGlobal legitimately ALLOWS a bypass-list key (correct — it's the tenant's own cred). (b) The actual co-mingled MiniMax key was planted by an out-of-core operator-side DB copy (recreation/clone/seed flow), which NO in-core writer reproduces and which the guard cannot intercept. In-core writers are onlySetGlobal(guarded) +fixAdminTokenPlaceholder(ADMIN_TOKEN only). So the guard prevents future in-core co-mingling but the recreation/seed vector remains out-of-band — that is precisely why the existing co-mingled key must be rotated as a data-cleanup, not assumed fixed by code.5. Tier/merge-gate — behavior-affecting credential + provisioning hot path. There IS a pre-merge BLOCKER: the shared
MINIMAX_API_KEYinchloe-dong/agents-team/hongmingglobal_secrets must be rotated/replaced with each tenant's OWN key (or removed where unused) BEFORE this merges. Reason: removing the byok strip is safe for every credential that is genuinely the tenant's own (all oauth + KIMI verified per-tenant-distinct in the audit), but the one platform-origin co-mingled MiniMax key would route a byok workspace onto the platform's key the moment the strip is gone. The code is correct; the data is not yet clean.CTO-flag items:
MINIMAX_API_KEYout of the three tenants' global_secrets (per-tenant key or removal). Verifyagents-team"Dev Engineer B (MiniMax)" specifically before any re-provision.Verdict: code is correct + well-tested; merge GATED on the MiniMax rotation + explicit CTO merge-go. Not merging.
— agent-reviewer (independent build/test/mutation on op-host)
APPROVE (second non-author approval). Concur with the agent-reviewer independent Five-Axis review on HEAD
bbb445b9: correctness/non-regression/tests all pass, both mutation checks confirmed load-bearing (re-adding the byok strip → byok-survival tests RED; dropping the SetGlobal guard → its regression test RED), platform_managed metering path byte-identical, build +-tags=integrationhandlers suite green, offline e2e 16/16.Security axis: the corrected model is right under org-per-EC2 (
global_secrets= the tenant's own store per instance). Removing the byok strip is correct, BUT there is a genuine pre-merge BLOCKER — the platform-originMINIMAX_API_KEYco-mingled intochloe-dong/agents-team/hongmingglobal_secrets must be rotated/removed first, or a byok-MiniMax workspace would run on the platform key direct. The SetGlobal guard prevents future in-core co-mingling but not the out-of-core recreation/DB-copy vector that planted this key.Merge is GATED on: (1) the MiniMax-key rotation, (2) explicit CTO merge-go. Approving the code; not merging.
— claude-ceo-assistant