fix(e2e): staging BYOK arms must explicitly opt workspace into byok before vendor-key write #2313
Problem (root-caused, evidence-backed)
tests/e2e/test_staging_full_saas.sh provisions its parent (and child) workspace by POSTing /workspaces with the customer's OWN LLM key in secrets. After #2311/#2312 made bare MiniMax-M2.7 registry-valid, a real staging run (job 295385, main f1558b54) now PASSES model-validation but FAILS at parent-create with a 400 response.
This 400 is INTENDED product behavior, not a product bug.
workspace-server/internal/handlers/secrets.go's rejectPlatformManagedDirectLLMBypassForWorkspace blocks direct writes of any strip-listed vendor key while a workspace resolves to platform_managed (the org/CTO default). A bare vendor key in the create payload does not auto-derive byok: at create time no auth-env is present yet, so the resolver derives platform_managed. The resolver's org rung was retired (internal#718 P2-B): ResolveLLMBillingMode ignores the org default entirely. The secret-write gate deliberately requires an EXPLICIT byok opt-in, separate from model-derived routing.
Mechanism used: per-workspace override (NOT org-default)
I investigated both candidate mechanisms:
- /cp/admin/orgs: REJECTED. The org rung is retired (internal#718 P2-B); ResolveLLMBillingMode ignores the org default, so even if /cp/admin/orgs accepted a billing-mode field it could not satisfy the secret-write gate. (Confirmed in workspace-server/internal/handlers/llm_billing_mode.go:266-322 + the "// org env IGNORED now" resolver tests.)
- PUT /admin/workspaces/:id/llm-billing-mode {"mode":"byok"} is the only explicit opt-in the gate honors (precedence-1 workspace_override in the resolver). It uses the per-tenant admin token the test already fetches at step 3.
New provisioning flow for any arm whose secrets contain strip-listed keys:
Before: create workspace WITH vendor key in secrets → 400 blocked.
After:
1. create the workspace WITHOUT the strip-listed keys (CREATE_SECRETS_JSON) → create succeeds as platform_managed;
2. PUT /admin/workspaces/:id/llm-billing-mode {"mode":"byok"} and assert resolved_mode=byok;
3. POST /workspaces/:id/secrets with the deferred vendor keys (now allowed);
then continue to the online/A2A steps. Applied to both parent and child.
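The split-then-opt-in flow can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the PR: the helper names (split_secrets, opt_in_byok, write_deferred_secrets) and DEFERRED_SECRETS_JSON are hypothetical, the strip-list shown is only a subset of keys named in this PR, and the Bearer auth header, the shape of the PUT response, and the body accepted by the secrets endpoint are all guesses.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the code in tests/e2e/test_staging_full_saas.sh.
# Assumptions: jq and curl are installed, auth is a Bearer token, the PUT
# response carries a "resolved_mode" field, and the secrets endpoint accepts
# a JSON object of key/value pairs.

# Illustrative subset; the real list must mirror
# platformManagedDirectLLMBypassKeys in secrets.go.
BYOK_STRIP_KEYS=(OPENAI_API_KEY GEMINI_API_KEY CLAUDE_CODE_OAUTH_TOKEN)

# split_secrets SECRETS_JSON
# Sets CREATE_SECRETS_JSON (safe at create time) and DEFERRED_SECRETS_JSON
# (strip-listed keys, to be written only after the byok opt-in).
split_secrets() {
  local secrets_json="$1" strip
  strip="$(printf '%s\n' "${BYOK_STRIP_KEYS[@]}" | jq -R . | jq -s -c .)"
  CREATE_SECRETS_JSON="$(jq -c --argjson strip "$strip" \
    'with_entries(select(.key as $k | any($strip[]; . == $k) | not))' \
    <<<"$secrets_json")"
  DEFERRED_SECRETS_JSON="$(jq -c --argjson strip "$strip" \
    'with_entries(select(.key as $k | any($strip[]; . == $k)))' \
    <<<"$secrets_json")"
}

# opt_in_byok BASE_URL ADMIN_TOKEN WORKSPACE_ID
# Sets the per-workspace override and asserts the resolver reports byok.
opt_in_byok() {
  local base_url="$1" admin_token="$2" ws_id="$3" resp
  resp="$(curl -sS -X PUT \
    -H "Authorization: Bearer ${admin_token}" \
    -H 'Content-Type: application/json' \
    -d '{"mode":"byok"}' \
    "${base_url}/admin/workspaces/${ws_id}/llm-billing-mode")" || return 1
  if ! grep -Eq '"resolved_mode"[[:space:]]*:[[:space:]]*"byok"' <<<"$resp"; then
    echo "byok opt-in not reflected in response: ${resp}" >&2
    return 1
  fi
}

# write_deferred_secrets BASE_URL TOKEN WORKSPACE_ID DEFERRED_JSON
# Writes the deferred vendor keys; a no-op when nothing was deferred.
write_deferred_secrets() {
  local base_url="$1" token="$2" ws_id="$3" deferred_json="$4"
  if [ "$deferred_json" = '{}' ]; then
    return 0
  fi
  curl -sS --fail -X POST \
    -H "Authorization: Bearer ${token}" \
    -H 'Content-Type: application/json' \
    -d "$deferred_json" \
    "${base_url}/workspaces/${ws_id}/secrets" >/dev/null
}
```

A caller would run split_secrets on the arm's secrets, create the workspace with CREATE_SECRETS_JSON, and call opt_in_byok followed by write_deferred_secrets only when DEFERRED_SECRETS_JSON is not {}. With an empty secrets object both outputs are {}, so neither the opt-in nor the deferred write fires.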
The strip-list is mirrored byte-for-byte from secrets.go platformManagedDirectLLMBypassKeys (a comment flags the sync requirement). This generalizes correctly: the MiniMax/Anthropic/Google (GEMINI_API_KEY)/OpenAI-hermes arms all ship strip-listed keys and were all blocked by this gate; the split defers exactly those keys and writes them post-opt-in.
Untouched / preserved
- Platform path (E2E_LLM_PATH=platform): produces SECRETS_JSON='{}', carries no strip-listed key → CREATE_SECRETS_JSON stays {}, no opt-in fires. It remains platform_managed, preserving the moonshot/kimi NOT_CONFIGURED regression guard. Deliberately not byok-ified.
- #1994 byok-routing guard: still runs, now after the explicit opt-in (resolved_mode=byok). Not removed or weakened.
- No .go changes.
- E2E_INTENTIONAL_FAILURE=1 sanity self-check: that run passes no vendor key → opt-in is a no-op → it still fails at the original poison point and tears down cleanly.
Scope
Touches only tests/e2e/test_staging_full_saas.sh. The other two staging e2e scripts do not hit the identical block in their CI config:
- test_staging_external_runtime.sh: writes no vendor secrets.
- test_priority_runtimes_e2e.sh: CLAUDE_CODE_OAUTH_TOKEN is strip-listed but sits behind a skip-if-unset guard, and that token is not in the e2e-api.yml job env. OPENAI_API_KEY is also strip-listed, but the OpenAI arms run against a LOCAL platform whose billing-mode env may differ; they currently pass, so I left the script untouched per "only if they hit the identical block".
Verification
- bash -n tests/e2e/test_staging_full_saas.sh: clean.
- shellcheck -x tests/e2e/test_staging_full_saas.sh: clean (0.11.0, no warnings).
Do NOT merge: CTO holds merge pending a billing-mode confirmation.
🤖 Generated with Claude Code
APPROVED (CTO review). Verified diff: tests/e2e/test_staging_full_saas.sh ONLY (125+/1-), zero production code. Correct option-B fix — the secret-write gate (secrets.go) requires an EXPLICIT per-workspace byok override; the org-default rung is RETIRED (internal#718 P2-B) so there is NO auto-derive path to regress, confirming this masks nothing. Flow: create WITHOUT strip-listed keys (platform_managed OK) → PUT /admin/workspaces/:id/llm-billing-mode byok (assert resolved_mode=byok) → write deferred vendor keys (now allowed). Applied to parent+child + all strip-listed arms. Platform path (E2E_LLM_PATH=platform) stays platform_managed (moonshot/kimi NOT_CONFIGURED guard preserved); #1994 byok-routing guard runs AFTER the legit opt-in (not masked); E2E_INTENTIONAL_FAILURE sanity still fails+tears-down. bash -n + shellcheck clean. Follow-up logged: BYOK_STRIP_KEYS mirrors secrets.go (drift risk, sync-comment present). Approving.
Code Reviewer (2) approval — reviewed molecule-core#2313 at current head. Test-only BYOK opt-in split is limited to tests/e2e/test_staging_full_saas.sh; platform path remains platform_managed, BYOK routing guard still runs after explicit opt-in, and no gating/continue-on-error semantics are changed.