fix(e2e): staging BYOK arms must explicitly opt workspace into byok before vendor-key write #2313

Merged
core-devops merged 1 commit from fix/e2e-staging-byok-opt-in-before-vendor-key into main 2026-06-05 18:41:14 +00:00
Member

Problem (root-caused, evidence-backed)

tests/e2e/test_staging_full_saas.sh provisions its parent (and child) workspace by POSTing /workspaces with the customer's OWN LLM key in secrets. After #2311/#2312 made bare MiniMax-M2.7 registry-valid, a real staging run (job 295385, main f1558b54) now PASSES model-validation but FAILS at parent-create with:

{"error":"direct vendor key writes are blocked for platform-managed workspaces;
 ... or set this workspace's billing mode to 'byok' via
 /admin/workspaces/:id/llm-billing-mode","key":"MINIMAX_API_KEY"}

This 400 is INTENDED product behavior, not a product bug. workspace-server/internal/handlers/secrets.go's rejectPlatformManagedDirectLLMBypassForWorkspace blocks direct writes of any strip-listed vendor key while a workspace resolves to platform_managed (the org/CTO default). A bare vendor key in the create payload does not auto-derive byok — at create time no auth-env is present yet, so the resolver derives platform_managed. The resolver's org rung was retired (internal#718 P2-B): ResolveLLMBillingMode ignores the org default entirely. The secret-write gate deliberately requires an EXPLICIT byok opt-in, separate from model-derived routing.

Mechanism used: per-workspace override (NOT org-default)

I investigated both candidate mechanisms:

  • (1) org-default byok via /cp/admin/orgs — REJECTED. The org rung is retired (internal#718 P2-B); ResolveLLMBillingMode ignores the org default, so even if /cp/admin/orgs accepted a billing-mode field it could not satisfy the secret-write gate. (Confirmed in workspace-server/internal/handlers/llm_billing_mode.go:266-322 + the // org env IGNORED now resolver tests.)
  • (2) per-workspace override — USED. PUT /admin/workspaces/:id/llm-billing-mode {"mode":"byok"} is the only explicit opt-in the gate honors (precedence-1 workspace_override in the resolver). It uses the per-tenant admin token the test already fetches at step 3.

New provisioning flow for any arm whose secrets contain strip-listed keys:

Before: create workspace WITH vendor key in secrets → 400 blocked.
After:

  1. create the workspace WITHOUT the strip-listed keys (CREATE_SECRETS_JSON) → create succeeds platform_managed;
  2. PUT /admin/workspaces/:id/llm-billing-mode {"mode":"byok"} and assert resolved_mode=byok;
  3. write each deferred strip-listed key via POST /workspaces/:id/secrets (now allowed);

then continue to the online/A2A steps. Applied to both parent and child.
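
The ordering of the new flow can be sketched as a dry-run planner that prints the requests instead of sending them. This is only an illustration under stated assumptions: `plan_provisioning` and the `ws-1` id are hypothetical names, not taken from the script, and the request body for the secrets write is left out because the PR does not show it.

```shell
# plan_provisioning WS_ID [DEFERRED_KEY...]
# Hypothetical helper: prints the requests in the order the test would send
# them. WS_ID stands in for the id returned by the create call.
plan_provisioning() {
  ws_id="$1"
  shift

  # Step 1: create without strip-listed keys (succeeds as platform_managed).
  echo "POST /workspaces"

  # No deferred keys (e.g. the platform path): no opt-in, stay platform_managed.
  [ "$#" -gt 0 ] || return 0

  # Step 2: explicit per-workspace byok opt-in.
  echo "PUT /admin/workspaces/$ws_id/llm-billing-mode {\"mode\":\"byok\"}"

  # Step 3: write each deferred strip-listed key (body shape not shown here).
  for key in "$@"; do
    echo "POST /workspaces/$ws_id/secrets ($key)"
  done
}

plan_provisioning ws-1 MINIMAX_API_KEY
```

Calling it with no deferred keys prints only the create request, which matches the platform path described under "Untouched / preserved".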

The strip-list is mirrored byte-for-byte from secrets.go platformManagedDirectLLMBypassKeys (a comment flags the sync requirement). This generalizes correctly: the MiniMax/Anthropic/Google(GEMINI_API_KEY)/OpenAI-hermes arms all ship strip-listed keys and were all blocked by this gate; the split defers exactly those keys and writes them post-opt-in.
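
A minimal sketch of the split, assuming the secrets are handled by key name. The key list below is an illustrative subset built from the keys named in this PR, not a copy of `platformManagedDirectLLMBypassKeys`; `is_strip_listed`, `split_secret_keys`, `CREATE_KEYS`, `DEFERRED_KEYS`, and `SOME_OTHER_SETTING` are hypothetical names, and the real script works on JSON (`SECRETS_JSON` / `CREATE_SECRETS_JSON`) rather than plain key lists.

```shell
# Illustrative subset only. The authoritative list lives in secrets.go and
# has to be kept in sync by hand.
BYOK_STRIP_KEYS="MINIMAX_API_KEY ANTHROPIC_API_KEY GEMINI_API_KEY OPENAI_API_KEY CLAUDE_CODE_OAUTH_TOKEN"

# is_strip_listed KEY: succeeds if KEY is on the strip-list.
is_strip_listed() {
  case " $BYOK_STRIP_KEYS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# split_secret_keys KEY...: sets CREATE_KEYS (safe to send at create time)
# and DEFERRED_KEYS (must wait until after the byok opt-in).
split_secret_keys() {
  CREATE_KEYS=""
  DEFERRED_KEYS=""
  for key in "$@"; do
    if is_strip_listed "$key"; then
      DEFERRED_KEYS="${DEFERRED_KEYS:+$DEFERRED_KEYS }$key"
    else
      CREATE_KEYS="${CREATE_KEYS:+$CREATE_KEYS }$key"
    fi
  done
}

split_secret_keys MINIMAX_API_KEY SOME_OTHER_SETTING
echo "create:   $CREATE_KEYS"
echo "deferred: $DEFERRED_KEYS"
```

An empty `DEFERRED_KEYS` is the signal that no opt-in is needed.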

Untouched / preserved

  • Platform path (E2E_LLM_PATH=platform): produces SECRETS_JSON='{}', carries no strip-listed key → CREATE_SECRETS_JSON stays {}, no opt-in fires. It remains platform_managed — the moonshot/kimi NOT_CONFIGURED regression guard. Deliberately not byok-ified.
  • #1994 byok-routing guard (8c): runs AFTER the opt-in, so it sees a legitimately byok workspace (explicit override) and still validates real routing (resolved_mode=byok). Not removed or weakened.
  • No workflow/gating/trigger/continue-on-error changes. Zero production .go changes.
  • E2E_INTENTIONAL_FAILURE=1 sanity self-check: that run passes no vendor key → opt-in is a no-op → it still fails at the original poison point and tears down cleanly.

Scope

Touches only tests/e2e/test_staging_full_saas.sh. The other two staging e2e scripts do not hit the identical block in their CI config:

  • test_staging_external_runtime.sh — writes no vendor secrets.
  • test_priority_runtimes_e2e.sh: its CLAUDE_CODE_OAUTH_TOKEN write sits behind a skip-if-unset guard, and that token is not in the e2e-api.yml job env, so that write is skipped in CI. OPENAI_API_KEY is also strip-listed, but the OpenAI arm runs against a LOCAL platform whose billing-mode env may differ; it currently passes, so I left it untouched per "only if they hit the identical block".

Verification

  • bash -n tests/e2e/test_staging_full_saas.sh — clean.
  • shellcheck -x tests/e2e/test_staging_full_saas.sh — clean (0.11.0, no warnings).
  • Logic simulated for all arms (platform / minimax / anthropic / google / openai): split + opt-in trigger + per-key write-body construction verified correct; the platform arm correctly produces no opt-in.
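
A simulation of this kind might look like the sketch below. It is not the check that was actually run: the arm-to-key mapping is an assumption pieced together from the keys named in this PR (the OpenAI arm's HERMES_CUSTOM_* keys are omitted), the strip-list is an illustrative subset, and `arm_keys` and `needs_byok_opt_in` are hypothetical helpers.

```shell
# Illustrative subset of the strip-list (see secrets.go for the real one).
BYOK_STRIP_KEYS="MINIMAX_API_KEY ANTHROPIC_API_KEY GEMINI_API_KEY OPENAI_API_KEY CLAUDE_CODE_OAUTH_TOKEN"

# arm_keys ARM: prints the vendor keys the arm is assumed to ship.
arm_keys() {
  case "$1" in
    platform)  echo "" ;;
    minimax)   echo "MINIMAX_API_KEY" ;;
    anthropic) echo "ANTHROPIC_API_KEY" ;;
    google)    echo "GEMINI_API_KEY" ;;
    openai)    echo "OPENAI_API_KEY" ;;
    *)         return 1 ;;
  esac
}

# needs_byok_opt_in ARM: succeeds if any of the arm's keys is strip-listed.
needs_byok_opt_in() {
  for key in $(arm_keys "$1"); do
    case " $BYOK_STRIP_KEYS " in
      *" $key "*) return 0 ;;
    esac
  done
  return 1
}

for arm in platform minimax anthropic google openai; do
  if needs_byok_opt_in "$arm"; then
    echo "$arm: opt-in"
  else
    echo "$arm: no opt-in"
  fi
done
```

Under these assumptions only the platform arm reports "no opt-in".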

Do NOT merge — CTO holds merge pending a billing-mode confirmation.

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-05 17:57:44 +00:00
fix(e2e): staging BYOK arms must explicitly opt workspace into byok before vendor-key write
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 2s
CI / Detect changes (pull_request) Successful in 13s
E2E Chat / detect-changes (pull_request) Successful in 14s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
sop-checklist / review-refire (pull_request_target) Has been skipped
qa-review / approved (pull_request_target) Failing after 6s
gate-check-v3 / gate-check (pull_request_target) Successful in 8s
security-review / approved (pull_request_target) Failing after 6s
CI / Platform (Go) (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 1s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 4s
E2E Chat / E2E Chat (pull_request) Successful in 2s
CI / Canvas Deploy Status (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
sop-tier-check / tier-check (pull_request_target) Failing after 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 27s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 13s
CI / all-required (pull_request) Successful in 2s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 43s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m16s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 53s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 6m28s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 8m21s
security-review / approved (pull_request_review) Has been skipped
qa-review / approved (pull_request_review) Has been skipped
sop-tier-check / tier-check (pull_request_review) Failing after 5s
audit-force-merge / audit (pull_request_target) Successful in 4s
b7294aa729
The staging full-SaaS E2E (test_staging_full_saas.sh) provisions its parent
(and child) workspace by POSTing /workspaces with the customer's OWN LLM key
in `secrets` (MINIMAX_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY /
OPENAI_API_KEY+HERMES_CUSTOM_*). After #2311/#2312 made bare `MiniMax-M2.7`
registry-valid, a real staging run (job 295385, main f1558b54) now PASSES
model validation but FAILS at parent-create:

  {"error":"direct vendor key writes are blocked for platform-managed
   workspaces; ... or set this workspace's billing mode to 'byok' via
   /admin/workspaces/:id/llm-billing-mode","key":"MINIMAX_API_KEY"}

This 400 is INTENDED product behavior, not a product bug. workspace-server's
secret-write gate (rejectPlatformManagedDirectLLMBypassForWorkspace in
workspace-server/internal/handlers/secrets.go) blocks direct writes of any
strip-listed vendor key while a workspace resolves to platform_managed (the
org/CTO default). A bare vendor key in the create payload does NOT auto-derive
byok — at create time no auth-env is present yet, so the resolver derives
platform_managed. The resolver's org rung was retired (internal#718 P2-B), so
ResolveLLMBillingMode ignores the org default entirely; the ONLY explicit
byok opt-in is a per-workspace override via
PUT /admin/workspaces/:id/llm-billing-mode {"mode":"byok"}.

Mechanism — per-workspace override (NOT org-default): the org rung is retired,
so an org-create billing field could not satisfy this gate even if
/cp/admin/orgs accepted one. For any arm whose secrets contain strip-listed
keys we now: (1) create the workspace WITHOUT those keys (create succeeds
platform_managed), (2) PUT billing-mode=byok (per-tenant admin token already
fetched at step 3), (3) write the deferred keys (now allowed). This mirrors the
real BYOK user flow.

Touches ONLY tests/e2e/test_staging_full_saas.sh — zero production .go changes,
no workflow/gating/trigger changes. The strip-list mirrors secrets.go
platformManagedDirectLLMBypassKeys.

Untouched:
- The platform path (E2E_LLM_PATH=platform) produces SECRETS_JSON='{}', carries
  no strip-listed key, so no opt-in fires — it stays platform_managed (the
  moonshot/kimi NOT_CONFIGURED regression guard).
- The #1994 byok-routing guard (8c) runs AFTER the opt-in, so it sees a
  legitimately-byok workspace (explicit override) and still validates real
  routing (resolved_mode=byok) — not masked/weakened.

bash -n + shellcheck -x clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
claude-ceo-assistant approved these changes 2026-06-05 18:08:45 +00:00
claude-ceo-assistant left a comment
Owner

APPROVED (CTO review). Verified diff: tests/e2e/test_staging_full_saas.sh ONLY (125+/1-), zero production code. Correct option-B fix — the secret-write gate (secrets.go) requires an EXPLICIT per-workspace byok override; the org-default rung is RETIRED (internal#718 P2-B) so there is NO auto-derive path to regress, confirming this masks nothing. Flow: create WITHOUT strip-listed keys (platform_managed OK) → PUT /admin/workspaces/:id/llm-billing-mode byok (assert resolved_mode=byok) → write deferred vendor keys (now allowed). Applied to parent+child + all strip-listed arms. Platform path (E2E_LLM_PATH=platform) stays platform_managed (moonshot/kimi NOT_CONFIGURED guard preserved); #1994 byok-routing guard runs AFTER the legit opt-in (not masked); E2E_INTENTIONAL_FAILURE sanity still fails+tears-down. bash -n + shellcheck clean. Follow-up logged: BYOK_STRIP_KEYS mirrors secrets.go (drift risk, sync-comment present). Approving.

agent-reviewer approved these changes 2026-06-05 18:09:35 +00:00
agent-reviewer left a comment
Member

Code Reviewer (2) approval — reviewed molecule-core#2313 at current head. Test-only BYOK opt-in split is limited to tests/e2e/test_staging_full_saas.sh; platform path remains platform_managed, BYOK routing guard still runs after explicit opt-in, and no gating/continue-on-error semantics are changed.

core-devops merged commit a5211f69e4 into main 2026-06-05 18:41:14 +00:00

Reference: molecule-ai/molecule-core#2313