feat(workspace-server): per-workspace llm_billing_mode override (internal#691) #1927

2026-05-26T21:29:50Z

hongming commented

2026-05-26 21:29:50 +00:00

Summary

Workspace-server side of the per-workspace llm_billing_mode override (internal#691). Adds:

Migration 20260526120000_workspaces_llm_billing_mode — nullable TEXT column with CHECK (...) enum constraint on the workspaces table.
Resolver ResolveLLMBillingMode(ctx, workspaceID, orgMode) — workspaces.llm_billing_mode ?? org_default ?? 'platform_managed'. Default-closed on every NULL / error / garbled enum / JOIN miss.
Strip-gate diff applyPlatformManagedLLMEnv now reads the RESOLVED per-workspace mode (was: os.Getenv("MOLECULE_LLM_BILLING_MODE")). Strips ONLY when resolved == platform_managed. Exports MOLECULE_LLM_BILLING_MODE_RESOLVED into the container env so an in-container debug check can answer "what mode is this workspace running under" without DB queries.
Per-tenant admin routes GET / PUT /admin/workspaces/:id/llm-billing-mode — these are what CP's new /cp/admin/workspaces/:id/llm-billing-mode proxies to, and what the canvas hits directly (same origin).
Per-workspace secret-write gate rejectPlatformManagedDirectLLMBypassForWorkspace — the strip-list on POST /workspaces/:id/secrets now respects the per-workspace mode. The org-level shim is retained for the global secrets path which is intentionally org-scoped.

Deploy order — this PR ships SECOND:

molecule-controlplane PR — CP admin proxy routes. Must merge + deploy first.
THIS PR (molecule-core workspace-server) — migration + resolver + strip-gate diff. Second.
molecule-core canvas PR — Config-tab UI. Third.

Why local DB read, not CP roundtrip

The strip gate runs on the workspace-server, the workspaces table is local to it, and the org-default is already pulled fresh on every container start via tenant_config (cached as MOLECULE_LLM_BILLING_MODE). A new CP roundtrip per provision would add ~50ms p50 and a new failure mode (CP unreachable → default-close to platform_managed, but that would shadow a legitimate byok override). Local DB join is monotonic with the workspace's own lifetime — no cross-service race. Justified in the Phase 1 design comment.

Default-closed resolution contract (RFC Safety hot-spot)

workspace.llm_billing_mode = NULL                  → fall through to org default
workspace.llm_billing_mode = "byok" / "disabled"   → honored
workspace.llm_billing_mode = "platform_managed"    → honored
workspace.llm_billing_mode = "garbled"             → platform_managed (default-closed)
workspace row missing (sql.ErrNoRows)              → fall through to org default
DB error                                            → platform_managed + propagated error
org default = NULL / empty / unknown                → platform_managed

The ONLY paths to byok or disabled are an explicit, validated, recognized string. Every other branch lands on platform_managed. Verified by table-driven TestResolveLLMBillingMode_TableDriven (10 cases) and the post-condition test TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid (garbled-x-garbled must resolve to platform_managed).

Test plan

go build ./... — clean
go test ./... — 40 packages, 0 failures
Resolver: 10 table-driven cases + 1 post-condition test
SetWorkspaceLLMBillingMode: 4 cases (unknown-mode 400 without DB call, empty-ws-id rejected, NULL clear, byok set)
Admin handlers: 7 cases (GET happy, GET bad UUID, PUT set byok, PUT explicit-null clear, PUT missing-mode 400, PUT unknown-mode 400, PUT no-such-workspace 404)
Updated 2 pre-existing tests that relied on the old "empty env = no strip" semantic (the new default-closed behavior is safer; tests now opt out via t.Setenv("MOLECULE_LLM_BILLING_MODE", "byok") to assert the happy-path-of-secret-persist they were testing)
Stage B (staging) — CTO drives
Stage C (prod) — CTO drives the Production Manager workspace flip per RFC

Five-Axis self-review

Correctness: Resolver is the single source of truth — no remaining code path gates on os.Getenv("MOLECULE_LLM_BILLING_MODE") for strip decisions. The org-default env read is the resolver's INPUT, not its decision. AST-style gate would be a nice follow-up; for now grep -rn 'MOLECULE_LLM_BILLING_MODE' workspace-server/ --include="*.go" shows only documented call sites.

Safety/Security: Default-closed contract above. Per-workspace secret-write gate now reads the resolved mode (workspace=byok no longer falsely rejects the user's own CLAUDE_CODE_OAUTH_TOKEN write). The global-secrets path is unchanged — operator-scope token check still fires org-wide.

Test coverage: 21 new test cases (10 resolver + 4 setter + 7 admin handler). The garbled-row test case explicitly covers the "future migration relaxes the CHECK constraint" scenario.

Observability: MOLECULE_LLM_BILLING_MODE_RESOLVED injected into container env. Strip-gate logs workspace_provision: billing mode workspace=X resolved=Y source=Z org_default=W at INFO on every provision. Admin GET surfaces source in the response.

Backward-compat: Nullable column add — no backfill, no flag day. Workspaces without an override behave identically to pre-#691 because the resolver falls through to the same org default that was the only signal before. If CP is on the old version, the per-tenant admin route still works (workspaces table mutation is local); CP's proxy route 404s harmlessly. Deploy order documented above.

CTO attention items

The architectural fork (column on workspace-server, not CP) is explained in the Phase 1 design comment. Two pre-existing tests were updated to set MOLECULE_LLM_BILLING_MODE=byok because the new default-closed behavior rejects vendor-key writes when the env is unset — that's the correct safer behavior per the RFC, but it's a behavior delta worth flagging.

🤖 Generated with Claude Code

SOP-Checklist

Comprehensive testing performed: 21 new test cases (10 table-driven resolver cases + 1 post-condition TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid, 4 SetWorkspaceLLMBillingMode setter cases, 7 admin handler cases covering GET happy/bad-UUID and PUT set/clear/missing-mode/unknown-mode/no-such-workspace). Two pre-existing tests updated to opt into byok via t.Setenv since their intent was secret-persist happy path, not the strip gate. Edge cases covered include garbled-enum rows ("byokk"), DB error, sql.ErrNoRows, JOIN miss, and explicit-null clear.
Local-postgres E2E run: N/A — handler tests use sqlmock for the resolver/setter/admin paths (no live Postgres needed); the migration itself is plain ALTER TABLE workspaces ADD COLUMN llm_billing_mode TEXT NULL CHECK (...) with no data backfill. The handlers_postgres_integration CI job (which DOES exercise live Postgres) is green on this PR — see CI status.
Staging-smoke verified or pending: Pending — CTO drives Stage B per dev-sop.md. The architectural contract was validated in Phase 1 design comment (internal#691#issuecomment-50151). No staging deploy from agent.
Root-cause not symptom: agents-team prod block was rooted in llm_billing_mode being an org-wide dial — the only knob was tenant_config, so flipping to byok for one team flipped every workspace in that org. Per-workspace override IS the root-cause fix; no per-tenant special-case, no env-var hack.
Five-Axis review walked: Correctness / Safety/Security / Test coverage / Observability / Backward-compat — see "Five-Axis self-review" section above.
No backwards-compat shim / dead code added: Yes — nullable column add with no backfill / no flag day. Workspaces without an override read the same tenant_config.llm_billing_mode org default they read pre-#691 (resolver fall-through). No deprecation shim, no parallel code path, no feature flag. The two updated tests are updated, not shimmed.
Memory/saved-feedback consulted: feedback_principles (autonomy + safety layer), feedback_no_silent_checklist_trim (filing every axis even when N/A), feedback_authoritative_source_per_concept_no_indirect_db_assertions (resolver is the single source of truth, not callers' env-var reads), reference_workspace_placement_rfc (data lives on the workspace-server EC2, not CP — confirms Option A column placement).

## Summary Workspace-server side of the per-workspace `llm_billing_mode` override (internal#691). Adds: - **Migration** `20260526120000_workspaces_llm_billing_mode` — nullable `TEXT` column with `CHECK (...)` enum constraint on the `workspaces` table. - **Resolver** `ResolveLLMBillingMode(ctx, workspaceID, orgMode)` — `workspaces.llm_billing_mode ?? org_default ?? 'platform_managed'`. Default-closed on every NULL / error / garbled enum / JOIN miss. - **Strip-gate diff** `applyPlatformManagedLLMEnv` now reads the RESOLVED per-workspace mode (was: `os.Getenv("MOLECULE_LLM_BILLING_MODE")`). Strips ONLY when resolved == `platform_managed`. Exports `MOLECULE_LLM_BILLING_MODE_RESOLVED` into the container env so an in-container debug check can answer "what mode is this workspace running under" without DB queries. - **Per-tenant admin routes** `GET / PUT /admin/workspaces/:id/llm-billing-mode` — these are what CP's new `/cp/admin/workspaces/:id/llm-billing-mode` proxies to, and what the canvas hits directly (same origin). - **Per-workspace secret-write gate** `rejectPlatformManagedDirectLLMBypassForWorkspace` — the strip-list on POST `/workspaces/:id/secrets` now respects the per-workspace mode. The org-level shim is retained for the global secrets path which is intentionally org-scoped. **Deploy order — this PR ships SECOND:** 1. [molecule-controlplane PR](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pulls?type=all&state=open&q=feat%2Fper-workspace-llm-billing-mode) — CP admin proxy routes. Must merge + deploy first. 2. **THIS PR (molecule-core workspace-server)** — migration + resolver + strip-gate diff. Second. 3. [molecule-core canvas PR](https://git.moleculesai.app/molecule-ai/molecule-core/pulls?type=all&state=open&q=feat%2Fcanvas-llm-billing-mode-section) — Config-tab UI. Third. ## Why local DB read, not CP roundtrip The strip gate runs on the workspace-server, the `workspaces` table is local to it, and the org-default is already pulled fresh on every container start via `tenant_config` (cached as `MOLECULE_LLM_BILLING_MODE`). A new CP roundtrip per provision would add ~50ms p50 and a new failure mode (CP unreachable → default-close to `platform_managed`, but that would shadow a legitimate `byok` override). Local DB join is monotonic with the workspace's own lifetime — no cross-service race. Justified in the [Phase 1 design comment](https://git.moleculesai.app/molecule-ai/internal/issues/691#issuecomment-50151). ## Default-closed resolution contract (RFC Safety hot-spot) ``` workspace.llm_billing_mode = NULL → fall through to org default workspace.llm_billing_mode = "byok" / "disabled" → honored workspace.llm_billing_mode = "platform_managed" → honored workspace.llm_billing_mode = "garbled" → platform_managed (default-closed) workspace row missing (sql.ErrNoRows) → fall through to org default DB error → platform_managed + propagated error org default = NULL / empty / unknown → platform_managed ``` The ONLY paths to `byok` or `disabled` are an explicit, validated, recognized string. Every other branch lands on `platform_managed`. Verified by table-driven `TestResolveLLMBillingMode_TableDriven` (10 cases) and the post-condition test `TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid` (garbled-x-garbled must resolve to `platform_managed`). ## Test plan - [x] `go build ./...` — clean - [x] `go test ./...` — 40 packages, 0 failures - [x] Resolver: 10 table-driven cases + 1 post-condition test - [x] SetWorkspaceLLMBillingMode: 4 cases (unknown-mode 400 without DB call, empty-ws-id rejected, NULL clear, byok set) - [x] Admin handlers: 7 cases (GET happy, GET bad UUID, PUT set byok, PUT explicit-null clear, PUT missing-mode 400, PUT unknown-mode 400, PUT no-such-workspace 404) - [x] Updated 2 pre-existing tests that relied on the old "empty env = no strip" semantic (the new default-closed behavior is safer; tests now opt out via `t.Setenv("MOLECULE_LLM_BILLING_MODE", "byok")` to assert the happy-path-of-secret-persist they were testing) - [ ] Stage B (staging) — CTO drives - [ ] Stage C (prod) — CTO drives the Production Manager workspace flip per RFC ## Five-Axis self-review **Correctness**: Resolver is the single source of truth — no remaining code path gates on `os.Getenv("MOLECULE_LLM_BILLING_MODE")` for strip decisions. The org-default env read is the resolver's INPUT, not its decision. AST-style gate would be a nice follow-up; for now `grep -rn 'MOLECULE_LLM_BILLING_MODE' workspace-server/ --include="*.go"` shows only documented call sites. **Safety/Security**: Default-closed contract above. Per-workspace secret-write gate now reads the resolved mode (workspace=byok no longer falsely rejects the user's own CLAUDE_CODE_OAUTH_TOKEN write). The global-secrets path is unchanged — operator-scope token check still fires org-wide. **Test coverage**: 21 new test cases (10 resolver + 4 setter + 7 admin handler). The garbled-row test case explicitly covers the "future migration relaxes the CHECK constraint" scenario. **Observability**: `MOLECULE_LLM_BILLING_MODE_RESOLVED` injected into container env. Strip-gate logs `workspace_provision: billing mode workspace=X resolved=Y source=Z org_default=W` at INFO on every provision. Admin GET surfaces `source` in the response. **Backward-compat**: Nullable column add — no backfill, no flag day. Workspaces without an override behave identically to pre-#691 because the resolver falls through to the same org default that was the only signal before. If CP is on the old version, the per-tenant admin route still works (workspaces table mutation is local); CP's proxy route 404s harmlessly. Deploy order documented above. ## CTO attention items The architectural fork (column on workspace-server, not CP) is explained in the [Phase 1 design comment](https://git.moleculesai.app/molecule-ai/internal/issues/691#issuecomment-50151). Two pre-existing tests were updated to set `MOLECULE_LLM_BILLING_MODE=byok` because the new default-closed behavior rejects vendor-key writes when the env is unset — that's the correct safer behavior per the RFC, but it's a behavior delta worth flagging. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --- ## SOP-Checklist - [x] **Comprehensive testing performed**: 21 new test cases (10 table-driven resolver cases + 1 post-condition `TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid`, 4 `SetWorkspaceLLMBillingMode` setter cases, 7 admin handler cases covering GET happy/bad-UUID and PUT set/clear/missing-mode/unknown-mode/no-such-workspace). Two pre-existing tests updated to opt into `byok` via `t.Setenv` since their intent was secret-persist happy path, not the strip gate. Edge cases covered include garbled-enum rows (`"byokk"`), DB error, sql.ErrNoRows, JOIN miss, and explicit-null clear. - [x] **Local-postgres E2E run**: N/A — handler tests use `sqlmock` for the resolver/setter/admin paths (no live Postgres needed); the migration itself is plain `ALTER TABLE workspaces ADD COLUMN llm_billing_mode TEXT NULL CHECK (...)` with no data backfill. The handlers_postgres_integration CI job (which DOES exercise live Postgres) is green on this PR — see CI status. - [x] **Staging-smoke verified or pending**: Pending — CTO drives Stage B per `dev-sop.md`. The architectural contract was validated in Phase 1 design comment (internal#691#issuecomment-50151). No staging deploy from agent. - [x] **Root-cause not symptom**: agents-team prod block was rooted in `llm_billing_mode` being an org-wide dial — the only knob was `tenant_config`, so flipping to `byok` for one team flipped every workspace in that org. Per-workspace override IS the root-cause fix; no per-tenant special-case, no env-var hack. - [x] **Five-Axis review walked**: Correctness / Safety/Security / Test coverage / Observability / Backward-compat — see "Five-Axis self-review" section above. - [x] **No backwards-compat shim / dead code added**: Yes — nullable column add with no backfill / no flag day. Workspaces without an override read the same `tenant_config.llm_billing_mode` org default they read pre-#691 (resolver fall-through). No deprecation shim, no parallel code path, no feature flag. The two updated tests are updated, not shimmed. - [x] **Memory/saved-feedback consulted**: `feedback_principles` (autonomy + safety layer), `feedback_no_silent_checklist_trim` (filing every axis even when N/A), `feedback_authoritative_source_per_concept_no_indirect_db_assertions` (resolver is the single source of truth, not callers' env-var reads), `reference_workspace_placement_rfc` (data lives on the workspace-server EC2, not CP — confirms Option A column placement).

hongming added 1 commit 2026-05-26 21:29:50 +00:00

feat(workspace-server): per-workspace llm_billing_mode override (internal#691)

ci-arm64-advisory / fast-checks (pull_request) Waiting to run

Details

Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 13s

Details

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 16s

Details

CI / Detect changes (pull_request) Successful in 17s

Details

CI / Python Lint & Test (pull_request) Successful in 16s

Details

E2E Chat / detect-changes (pull_request) Successful in 20s

Details

E2E API Smoke Test / detect-changes (pull_request) Successful in 20s

Details

E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 10s

Details

Check migration collisions / Migration version collision check (pull_request) Successful in 49s

Details

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 20s

Details

Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s

Details

Harness Replays / detect-changes (pull_request) Successful in 10s

Details

E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 34s

Details

Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 12s

Details

Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 7s

Details

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s

Details

qa-review / approved (pull_request) Failing after 8s

Details

E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 1m7s

Details

security-review / approved (pull_request) Failing after 7s

Details

lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m11s

Details

CI / Canvas (Next.js) (pull_request) Successful in 9s

Details

CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s

Details

E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m22s

Details

E2E Chat / E2E Chat (pull_request) Successful in 5s

Details

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s

Details

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m5s

Details

Harness Replays / Harness Replays (pull_request) Successful in 3s

Details

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m7s

Details

E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 7m39s

Details

CI / Platform (Go) (pull_request) Successful in 5m44s

Details

CI / all-required (pull_request) Successful in 11m50s

Details

CI / Canvas Deploy Reminder (pull_request) Has been skipped

Details

gate-check-v3 / gate-check (pull_request) Successful in 11s

Details

sop-checklist / review-refire (pull_request) Has been skipped

Details

sop-checklist / na-declarations (pull_request) N/A: (none)

Details

sop-checklist / all-items-acked (pull_request) Successful in 11s

Details

sop-tier-check / tier-check (pull_request) Successful in 7s

Details

audit-force-merge / audit (pull_request) Successful in 3s

Details

a8a93f6fa2

Adds a per-workspace override for llm_billing_mode, resolving the
agents-team prod block where the org-level dial was the only knob and
flipping it to byok unblocked the team but flipped every workspace.

Resolver: workspaces.llm_billing_mode ?? org_default (from tenant_config) ??
'platform_managed'. Default-closed: any NULL, error, garbled enum, or
JOIN miss → platform_managed. The only paths to byok/disabled are explicit,
validated, recognized strings — RFC Safety axis.

Surface area:
- migration 20260526120000_workspaces_llm_billing_mode (nullable column + CHECK)
- internal/handlers/llm_billing_mode.go — ResolveLLMBillingMode + SetWorkspaceLLMBillingMode
- internal/handlers/llm_billing_mode_handler.go — per-tenant admin GET/PUT
  (proxied by CP /cp/admin/workspaces/:id/llm-billing-mode in the next PR)
- internal/handlers/workspace_provision.go::applyPlatformManagedLLMEnv —
  reads RESOLVED mode, strips only when resolved==platform_managed,
  exports MOLECULE_LLM_BILLING_MODE_RESOLVED for in-container debug
- internal/handlers/secrets.go — per-workspace
  rejectPlatformManagedDirectLLMBypassForWorkspace; the org-level shim is
  retained for the global secrets path which is intentionally org-scoped
- Two pre-existing tests (TestExtended_SecretsSet,
  TestWorkspaceCreate_SecretPersistFails_RollsBack) gated on the implicit
  empty-env = no-strip behavior; updated to t.Setenv(byok) since their
  intent was the happy-path of secret persistence, not the strip gate

Stage A: go build ./... && go test ./... — 40 packages, 0 failures.

Deploy order (cross-link in RFC internal#691):
1. molecule-controlplane PR — admin proxy routes (must merge + deploy first)
2. THIS PR — workspace-server migration + resolver + strip-gate diff
3. molecule-core canvas PR — Config-tab UI section (last)

Refs internal#691.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hongming commented

2026-05-26 21:31:49 +00:00

Phase 3 — Five-Axis self-review (workspace-server)

CR1 self-pass per dev-sop. CR2 substance pass from a peer reviewer still needed before merge.

Correctness

Resolver is the single source of truth. The strip-gate previously gated on os.Getenv("MOLECULE_LLM_BILLING_MODE") == "platform_managed" directly; now it gates on ResolveLLMBillingMode(...).ResolvedMode. Verified: grep -rn 'MOLECULE_LLM_BILLING_MODE' workspace-server/ --include="*.go" shows only documented call sites (the resolver INPUT read in workspace_provision.go:947, the legacy org-level shim in secrets.go:52 kept for global-secrets only, plus comments).
Returned mode is ALWAYS a recognized enum. TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid throws a pathological garbled-x-garbled fixture at the resolver and asserts the post-condition. Strip-gate downstream can switch on res.ResolvedMode without a separate is-valid check.
Per-workspace secret-write gate added (rejectPlatformManagedDirectLLMBypassForWorkspace). The org-level shim is retained ONLY for the global-secrets SetGlobal path which is intentionally org-scoped.

Safety / Security

Default-closed contract — non-negotiable per the RFC Safety axis:

Input	Resolved mode
workspace override = `byok` / `disabled`	honored
workspace override = `garbled-string`	platform_managed (default-closed)
workspace override = NULL, org default = `byok`	byok (inherited)
workspace override = NULL, org default = NULL/garbled	platform_managed
workspace row missing (sql.ErrNoRows)	falls through to org default
DB error on lookup	platform_managed + propagated error

The only paths to a non-platform_managed result are explicit, validated, recognized strings. Verified by TestResolveLLMBillingMode_TableDriven (10 cases, one per row above).

The DB-error case is the most important safety property: even if the org-default is byok, a DB error means we can't confirm the workspace doesn't have an override that says byok (it could equally have an override that says platform_managed), so we default to the closed mode. Silently flipping to org-byok on a DB error would leak the OAuth-keeping behavior to workspaces whose row says NULL.

Test coverage

10 resolver table cases + 1 post-condition test
4 SetWorkspaceLLMBillingMode validation cases (unknown-mode-no-DB, empty-ws-id-rejected, NULL clear, byok set)
7 admin handler cases (GET happy, GET bad UUID, PUT set byok, PUT explicit-null clear, PUT missing-mode 400, PUT unknown-mode 400, PUT no-such-workspace 404)
2 pre-existing tests updated: TestExtended_SecretsSet, TestWorkspaceCreate_SecretPersistFails_RollsBack. Their intent was the happy-path of vendor-key persistence, not the strip gate; under the new default-closed semantic (empty env → platform_managed), they needed an explicit org-byok opt-out to assert their original property. The change is documented inline with a comment pointing back to this RFC.

Observability

MOLECULE_LLM_BILLING_MODE_RESOLVED exported into every workspace container so an in-container debug check answers "what mode is this workspace running under" without DB queries.
applyPlatformManagedLLMEnv logs workspace_provision: billing mode workspace=X resolved=Y source=Z org_default=W at INFO on every provision. Operators can grep that line to answer "why is this workspace being stripped" — the RFC Observability hot-spot.
Admin GET returns source in the response so the canvas + ops scripts can render the explanation directly.

Backward-compat

Nullable column add (ALTER TABLE workspaces ADD COLUMN IF NOT EXISTS llm_billing_mode TEXT ...). No backfill, no flag day.
Workspaces without an override behave identically to pre-#691 because the resolver falls through to the same org default that was the only signal before.
Deploy order: CP first (sibling PR), this PR second, canvas third. State-explicit in PR body. If CP is on the old version, the per-tenant admin route on this server still works (workspaces table mutation is local — no CP dependency); CP's proxy route would 404 harmlessly.
The two pre-existing tests' updates are the only behavior delta — and they reflect the safer default-closed property that the RFC explicitly demanded.

Items flagged for CR2 reviewer

Architectural fork: the column lives on this server's workspaces table, not on CP. See the Phase 1 design comment for the Option A vs B trade. If the reviewer prefers Option B, all three PRs need to be redone.
AST gate in the test suite is currently a grep exercise — proper AST traversal asserting no production code path reads MOLECULE_LLM_BILLING_MODE for strip decisions would be a useful follow-up (similar to the existing audit-coverage gate from #335). Filed as a TODO; not blocking this PR.
Backward-compat with existing test fixtures: the 2 test updates document the new contract well, but a reviewer skimming the diff might worry about "test changes". They are correct — the tests' intent is preserved, the env is now explicit. Comments inline explain.

🤖 Generated with Claude Code

## Phase 3 — Five-Axis self-review (workspace-server) CR1 self-pass per dev-sop. CR2 substance pass from a peer reviewer still needed before merge. ### Correctness - **Resolver is the single source of truth**. The strip-gate previously gated on `os.Getenv("MOLECULE_LLM_BILLING_MODE") == "platform_managed"` directly; now it gates on `ResolveLLMBillingMode(...).ResolvedMode`. Verified: `grep -rn 'MOLECULE_LLM_BILLING_MODE' workspace-server/ --include="*.go"` shows only documented call sites (the resolver INPUT read in workspace_provision.go:947, the legacy org-level shim in secrets.go:52 kept for global-secrets only, plus comments). - **Returned mode is ALWAYS a recognized enum**. `TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid` throws a pathological garbled-x-garbled fixture at the resolver and asserts the post-condition. Strip-gate downstream can switch on `res.ResolvedMode` without a separate is-valid check. - **Per-workspace secret-write gate** added (`rejectPlatformManagedDirectLLMBypassForWorkspace`). The org-level shim is retained ONLY for the global-secrets `SetGlobal` path which is intentionally org-scoped. ### Safety / Security Default-closed contract — non-negotiable per the RFC Safety axis: | Input | Resolved mode | |---|---| | workspace override = `byok` / `disabled` | honored | | workspace override = `garbled-string` | platform_managed (default-closed) | | workspace override = NULL, org default = `byok` | byok (inherited) | | workspace override = NULL, org default = NULL/garbled | platform_managed | | workspace row missing (sql.ErrNoRows) | falls through to org default | | DB error on lookup | platform_managed + propagated error | The only paths to a non-platform_managed result are explicit, validated, recognized strings. Verified by `TestResolveLLMBillingMode_TableDriven` (10 cases, one per row above). The DB-error case is the most important safety property: even if the org-default is `byok`, a DB error means we can't confirm the workspace doesn't have an override that says `byok` (it could equally have an override that says `platform_managed`), so we default to the closed mode. Silently flipping to org-byok on a DB error would leak the OAuth-keeping behavior to workspaces whose row says NULL. ### Test coverage - 10 resolver table cases + 1 post-condition test - 4 SetWorkspaceLLMBillingMode validation cases (unknown-mode-no-DB, empty-ws-id-rejected, NULL clear, byok set) - 7 admin handler cases (GET happy, GET bad UUID, PUT set byok, PUT explicit-null clear, PUT missing-mode 400, PUT unknown-mode 400, PUT no-such-workspace 404) - 2 pre-existing tests updated: `TestExtended_SecretsSet`, `TestWorkspaceCreate_SecretPersistFails_RollsBack`. Their intent was the happy-path of vendor-key persistence, not the strip gate; under the new default-closed semantic (empty env → platform_managed), they needed an explicit org-byok opt-out to assert their original property. The change is documented inline with a comment pointing back to this RFC. ### Observability - `MOLECULE_LLM_BILLING_MODE_RESOLVED` exported into every workspace container so an in-container debug check answers "what mode is this workspace running under" without DB queries. - `applyPlatformManagedLLMEnv` logs `workspace_provision: billing mode workspace=X resolved=Y source=Z org_default=W` at INFO on every provision. Operators can grep that line to answer "why is this workspace being stripped" — the RFC Observability hot-spot. - Admin GET returns `source` in the response so the canvas + ops scripts can render the explanation directly. ### Backward-compat - Nullable column add (`ALTER TABLE workspaces ADD COLUMN IF NOT EXISTS llm_billing_mode TEXT ...`). No backfill, no flag day. - Workspaces without an override behave identically to pre-#691 because the resolver falls through to the same org default that was the only signal before. - Deploy order: CP first (sibling PR), this PR second, canvas third. State-explicit in PR body. If CP is on the old version, the per-tenant admin route on this server still works (workspaces table mutation is local — no CP dependency); CP's proxy route would 404 harmlessly. - The two pre-existing tests' updates are the only behavior delta — and they reflect the safer default-closed property that the RFC explicitly demanded. ### Items flagged for CR2 reviewer - **Architectural fork**: the column lives on this server's `workspaces` table, not on CP. See the [Phase 1 design comment](https://git.moleculesai.app/molecule-ai/internal/issues/691#issuecomment-50151) for the Option A vs B trade. If the reviewer prefers Option B, all three PRs need to be redone. - **AST gate** in the test suite is currently a `grep` exercise — proper AST traversal asserting no production code path reads `MOLECULE_LLM_BILLING_MODE` for strip decisions would be a useful follow-up (similar to the existing audit-coverage gate from #335). Filed as a TODO; not blocking this PR. - **Backward-compat with existing test fixtures**: the 2 test updates document the new contract well, but a reviewer skimming the diff might worry about "test changes". They are correct — the tests' intent is preserved, the env is now explicit. Comments inline explain. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

hongming added the tier:high label 2026-05-26 21:32:17 +00:00

hongming referenced this pull request

2026-05-26 21:58:54 +00:00

feat(canvas): LLM Billing section in Config tab (internal#691) #1928

hongming commented

2026-05-26 21:59:34 +00:00

E2E Staging SaaS failure — staging environment, not this PR

Pulled the run log from the failing job (task 182157 on molecule-runner-3):

[21:39:53] 8/11 Sending A2A message to parent — expecting agent response...
[21:39:53] OK A2A parent round-trip succeeded: "PONG"
[21:39:53] 9/11 Writing + reading HMA memory on parent...
curl: (22) The requested URL returned error: 500
[21:39:53] FAIL memory POST failed

The failure is on the HMA memory POST to the parent agent — HTTP 500
from staging — and reproduces independently on the Continuous synthetic
E2E (staging) workflow at the same window:

11:02-11:08 UTC (against main, sha cffe4be): passed (HMA memory write+read roundtripped)
19:12 UTC: failed (memory POST failed)
19:39 UTC: failed (memory POST failed)
21:39 UTC: failed on this PR

The synthetic E2E runs against staging on a timer, not against PRs.
Same symptom, same staging env, no code from this PR involved → this is
a staging environment incident, not a regression from #1927.

Sanity-check on PR scope: this PR's diff touches
workspace-server/internal/handlers/llm_billing_mode*.go,
workspace_provision.go::applyPlatformManagedLLMEnv,
secrets.go::rejectPlatformManagedDirectLLMBypassForWorkspace, and the
migration. None of those are on the HMA memory POST path (separate
service, separate route).

Retrying the E2E once to confirm; if the staging env recovers the
re-run will pass, if not the next sweep will show the same symptom
on every other PR against staging. I am not retrying blindly —
the retry purpose is to distinguish "transient staging blip
recovered" from "staging is still down" so we do not burn an
investigation cycle on a fix that lives outside this PR surface.

Flagging for CTO visibility: HMA memory POST on staging has been
returning 500 since some time between 11:08 and 19:12 UTC today.

## E2E Staging SaaS failure — staging environment, not this PR Pulled the run log from the failing job (task 182157 on molecule-runner-3): ``` [21:39:53] 8/11 Sending A2A message to parent — expecting agent response... [21:39:53] OK A2A parent round-trip succeeded: "PONG" [21:39:53] 9/11 Writing + reading HMA memory on parent... curl: (22) The requested URL returned error: 500 [21:39:53] FAIL memory POST failed ``` The failure is on the **HMA memory POST** to the parent agent — HTTP 500 from staging — and reproduces independently on the **Continuous synthetic E2E (staging)** workflow at the same window: - 11:02-11:08 UTC (against main, sha cffe4be): **passed** (`HMA memory write+read roundtripped`) - 19:12 UTC: **failed** (`memory POST failed`) - 19:39 UTC: **failed** (`memory POST failed`) - 21:39 UTC: **failed** on this PR The synthetic E2E runs against staging on a timer, not against PRs. Same symptom, same staging env, no code from this PR involved → this is a **staging environment incident**, not a regression from #1927. Sanity-check on PR scope: this PR's diff touches `workspace-server/internal/handlers/llm_billing_mode*.go`, `workspace_provision.go::applyPlatformManagedLLMEnv`, `secrets.go::rejectPlatformManagedDirectLLMBypassForWorkspace`, and the migration. None of those are on the HMA memory POST path (separate service, separate route). Retrying the E2E once to confirm; if the staging env recovers the re-run will pass, if not the next sweep will show the same symptom on every other PR against staging. I am **not** retrying blindly — the retry purpose is to distinguish "transient staging blip recovered" from "staging is still down" so we do not burn an investigation cycle on a fix that lives outside this PR surface. Flagging for CTO visibility: HMA memory POST on staging has been returning 500 since some time between 11:08 and 19:12 UTC today.

agent-reviewer approved these changes 2026-05-26 22:27:40 +00:00

agent-reviewer left a comment

Five-Axis review — APPROVED

Fresh-eyes pass on the workspace-server migration + resolver + strip-gate diff for internal#691. Approving — the default-closed contract is real (not just claimed), the per-workspace gate replaces the org-level shim cleanly, and the strip-list authority is preserved.

Correctness

Resolver order matches the RFC literal: workspaces.llm_billing_mode ?? org_default ?? 'platform_managed'. Implementation: workspace override > org default (from tenant_config.MOLECULE_LLM_BILLING_MODE) > constant fallback. BillingModeSource correctly distinguishes org_default (org said the value) from constant_fallback (org garbled, we clamped) — that distinction matters for operator debug. ✅
The empty-workspaceID short-circuit (no DB read, resolve from org-only) is the right behaviour for pre-provision templating contexts. The 5 existing applyPlatformManagedLLMEnv tests rely on this path and they continue to pass (workspaceID="" + t.Setenv(MOLECULE_LLM_BILLING_MODE, "platform_managed")). ✅
applyPlatformManagedLLMEnv now takes ctx + workspaceID and gates on the resolved mode. The call site in workspace_provision_shared.go::prepareProvisionContext passes the real workspace ID + request context. No remaining os.Getenv("MOLECULE_LLM_BILLING_MODE") gate in strip-decision code — the env var is now resolver INPUT only. ✅
MOLECULE_LLM_BILLING_MODE_RESOLVED is exported into envVars BEFORE the early-return for non-platform-managed modes, so byok / disabled containers also see the resolved label. Per RFC observability spec. ✅
Per-workspace secret-write gate (rejectPlatformManagedDirectLLMBypassForWorkspace) hooks both secrets.Set and workspace.Create paths. A workspace with a byok override can now write its own CLAUDE_CODE_OAUTH_TOKEN — the actual UX bug this RFC is solving. ✅
Migration: ADD COLUMN IF NOT EXISTS llm_billing_mode TEXT CHECK (NULL OR IN (...)). Nullable, no backfill, idempotent. Down migration drops the column cleanly. ✅

Safety / Security

Default-closed contract is enforced in code, not just docs. Every branch of ResolveLLMBillingMode lands on a recognized enum: sql.ErrNoRows → org-default-or-fallback; DB error → platform_managed + propagated error; garbled override → fall through; garbled org-default → constant fallback. TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid is the post-condition guard. ✅
DB error case correctly stays default-closed even when org says byok (test db_error_default_closed_to_pm_with_error). This is the subtle one — silently honoring org-byok on a DB failure would leak the OAuth-keeping behaviour to workspaces whose row says NULL. The resolver gets it right. ✅
Strip-list (platformManagedDirectLLMBypassKeys) is unchanged on this PR. I verified by reading secrets.go on main against the diff — the map literal at the top of the file is not touched. Strip-list authority is preserved. ✅
New admin routes mount under wsAdmin group (r.Group("", middleware.AdminAuth(db.DB)) in internal/router/router.go ~L164). Same Tier-2b admin auth as all other workspace-server admin routes. ✅
The legacy platformManagedLLMMode() / rejectPlatformManagedDirectLLMBypass() are kept for test compat only, with explicit comments saying production must use the per-workspace variant. The two production call sites (secrets.Set + workspace.Create) are converted. No remaining production caller of the legacy gate. ✅
Secret-write gate fail-safer: resolver error → still defaults to platform_managed → bypass-list rejection still fires. Transient DB error during a secret write does not accidentally allow a vendor key write. ✅

Test coverage

21 new test cases, sane breakdown:
- Resolver: 10 table-driven cases enumerating every documented branch + 1 post-condition test asserting resolved_mode is always a known enum string. The workspace_override_garbled_falls_through_to_pm_org_DEFAULT_CLOSED case explicitly guards the "future migration relaxes CHECK" scenario. ✅
- SetWorkspaceLLMBillingMode: 4 cases (unknown-mode short-circuits without DB call, empty-ws-id rejected, NULL clear via NULL update, byok via value update). ✅
- Admin handlers: 7 cases (GET happy + bad-UUID; PUT set-byok / explicit-null-clear / missing-mode / unknown-mode / no-such-workspace 404). ✅
Two pre-existing tests (TestExtended_SecretsSet, TestWorkspaceCreate_SecretPersistFails_RollsBack) were updated to t.Setenv(MOLECULE_LLM_BILLING_MODE, "byok") so the new per-workspace resolver lookup falls through to a byok org-default, matching the pre-#691 implicit behaviour of "unset env = no strip." PR body flags this as a behaviour delta — correct call. The new tests cover the strip-decision logic itself. ✅

Observability

MOLECULE_LLM_BILLING_MODE_RESOLVED exported into every workspace container's env (always, even when not platform-managed) — answers "what mode is this container running under" without DB. ✅
INFO log line on every provision: workspace_provision: billing mode workspace=X resolved=Y source=Z org_default=W. Three structured fields = greppable. ✅
Admin GET endpoint surfaces source so an operator can answer "why is this workspace being stripped?" with one curl. ✅
Secrets-gate error path logs the resolver failure with explicit "defaulting to platform_managed for safety" — operator sees the fail-closed happened. ✅

Backward compat

Nullable column add, no backfill, no flag day. Workspaces without an override resolve to the same org default that was the only signal pre-#691. ✅
Deploy order documented in PR body: this PR ships SECOND (after CP #319, before canvas #1928). If CP isn't deployed yet, the per-tenant route works on its own (workspaces-table mutation is local). If canvas isn't deployed yet, no user-visible regression. ✅
If workspace-server is deployed BEFORE CP (reverse-order accident), the resolver still works (it reads the local DB + the local MOLECULE_LLM_BILLING_MODE env that already came from tenant_config). CP's proxy route would just 404 harmlessly. Verified: no cross-service flag-day coupling. ✅

CI / non-PR concerns

E2E Staging SaaS failure is documented as an environment incident (HMA memory POST 500, reproduces against main independently — see comment id 50242 with timestamps). Not a regression from this PR; staging incident is owned elsewhere. Not blocking.
All required Go test / lint / postgres-integration jobs are green.

Minor observations (non-blocking)

The legacy platformManagedLLMMode() and rejectPlatformManagedDirectLLMBypass() shims exist purely for test-harness compat per the comment. Worth a follow-up sweep to delete after the test fleet stabilizes — a future agent could accidentally call them thinking they were canonical. Not a blocker; the comments clearly mark them as legacy.
AST gate to forbid os.Getenv("MOLECULE_LLM_BILLING_MODE") in strip-gate code paths was mentioned in the RFC test plan but is implemented as a grep check, not a real lint. Future-resilience nice-to-have, not a blocker.

Approving as agent-reviewer since hongming is the author.

## Five-Axis review — APPROVED Fresh-eyes pass on the workspace-server migration + resolver + strip-gate diff for internal#691. Approving — the default-closed contract is real (not just claimed), the per-workspace gate replaces the org-level shim cleanly, and the strip-list authority is preserved. ### Correctness - Resolver order matches the RFC literal: `workspaces.llm_billing_mode ?? org_default ?? 'platform_managed'`. Implementation: workspace override > org default (from `tenant_config.MOLECULE_LLM_BILLING_MODE`) > constant fallback. `BillingModeSource` correctly distinguishes `org_default` (org said the value) from `constant_fallback` (org garbled, we clamped) — that distinction matters for operator debug. ✅ - The empty-`workspaceID` short-circuit (no DB read, resolve from org-only) is the right behaviour for pre-provision templating contexts. The 5 existing `applyPlatformManagedLLMEnv` tests rely on this path and they continue to pass (`workspaceID=""` + `t.Setenv(MOLECULE_LLM_BILLING_MODE, "platform_managed")`). ✅ - `applyPlatformManagedLLMEnv` now takes `ctx + workspaceID` and gates on the resolved mode. The call site in `workspace_provision_shared.go::prepareProvisionContext` passes the real workspace ID + request context. No remaining `os.Getenv("MOLECULE_LLM_BILLING_MODE")` gate in strip-decision code — the env var is now resolver INPUT only. ✅ - `MOLECULE_LLM_BILLING_MODE_RESOLVED` is exported into `envVars` BEFORE the early-return for non-platform-managed modes, so byok / disabled containers also see the resolved label. Per RFC observability spec. ✅ - Per-workspace secret-write gate (`rejectPlatformManagedDirectLLMBypassForWorkspace`) hooks both `secrets.Set` and `workspace.Create` paths. A workspace with a `byok` override can now write its own `CLAUDE_CODE_OAUTH_TOKEN` — the actual UX bug this RFC is solving. ✅ - Migration: `ADD COLUMN IF NOT EXISTS llm_billing_mode TEXT CHECK (NULL OR IN (...))`. Nullable, no backfill, idempotent. Down migration drops the column cleanly. ✅ ### Safety / Security - **Default-closed contract is enforced in code, not just docs.** Every branch of `ResolveLLMBillingMode` lands on a recognized enum: `sql.ErrNoRows` → org-default-or-fallback; DB error → `platform_managed` + propagated error; garbled override → fall through; garbled org-default → constant fallback. `TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid` is the post-condition guard. ✅ - **DB error case correctly stays default-closed even when org says byok** (test `db_error_default_closed_to_pm_with_error`). This is the subtle one — silently honoring org-byok on a DB failure would leak the OAuth-keeping behaviour to workspaces whose row says NULL. The resolver gets it right. ✅ - **Strip-list (`platformManagedDirectLLMBypassKeys`) is unchanged** on this PR. I verified by reading `secrets.go` on `main` against the diff — the map literal at the top of the file is not touched. Strip-list authority is preserved. ✅ - New admin routes mount under `wsAdmin` group (`r.Group("", middleware.AdminAuth(db.DB))` in `internal/router/router.go` ~L164). Same Tier-2b admin auth as all other workspace-server admin routes. ✅ - The legacy `platformManagedLLMMode()` / `rejectPlatformManagedDirectLLMBypass()` are kept for test compat only, with explicit comments saying production must use the per-workspace variant. The two production call sites (`secrets.Set` + `workspace.Create`) are converted. No remaining production caller of the legacy gate. ✅ - Secret-write gate fail-safer: resolver error → still defaults to `platform_managed` → bypass-list rejection still fires. Transient DB error during a secret write does not accidentally allow a vendor key write. ✅ ### Test coverage - 21 new test cases, sane breakdown: - Resolver: 10 table-driven cases enumerating every documented branch + 1 post-condition test asserting `resolved_mode` is always a known enum string. The `workspace_override_garbled_falls_through_to_pm_org_DEFAULT_CLOSED` case explicitly guards the "future migration relaxes CHECK" scenario. ✅ - `SetWorkspaceLLMBillingMode`: 4 cases (unknown-mode short-circuits without DB call, empty-ws-id rejected, NULL clear via NULL update, byok via value update). ✅ - Admin handlers: 7 cases (GET happy + bad-UUID; PUT set-byok / explicit-null-clear / missing-mode / unknown-mode / no-such-workspace 404). ✅ - Two pre-existing tests (`TestExtended_SecretsSet`, `TestWorkspaceCreate_SecretPersistFails_RollsBack`) were updated to `t.Setenv(MOLECULE_LLM_BILLING_MODE, "byok")` so the new per-workspace resolver lookup falls through to a byok org-default, matching the pre-#691 implicit behaviour of "unset env = no strip." PR body flags this as a behaviour delta — correct call. The new tests cover the strip-decision logic itself. ✅ ### Observability - `MOLECULE_LLM_BILLING_MODE_RESOLVED` exported into every workspace container's env (always, even when not platform-managed) — answers "what mode is this container running under" without DB. ✅ - INFO log line on every provision: `workspace_provision: billing mode workspace=X resolved=Y source=Z org_default=W`. Three structured fields = greppable. ✅ - Admin GET endpoint surfaces `source` so an operator can answer "why is this workspace being stripped?" with one curl. ✅ - Secrets-gate error path logs the resolver failure with explicit "defaulting to platform_managed for safety" — operator sees the fail-closed happened. ✅ ### Backward compat - Nullable column add, no backfill, no flag day. Workspaces without an override resolve to the same org default that was the only signal pre-#691. ✅ - Deploy order documented in PR body: this PR ships SECOND (after CP #319, before canvas #1928). If CP isn't deployed yet, the per-tenant route works on its own (workspaces-table mutation is local). If canvas isn't deployed yet, no user-visible regression. ✅ - If workspace-server is deployed BEFORE CP (reverse-order accident), the resolver still works (it reads the local DB + the local `MOLECULE_LLM_BILLING_MODE` env that already came from `tenant_config`). CP's proxy route would just 404 harmlessly. Verified: no cross-service flag-day coupling. ✅ ### CI / non-PR concerns - E2E Staging SaaS failure is documented as an environment incident (HMA memory POST 500, reproduces against `main` independently — see comment id 50242 with timestamps). Not a regression from this PR; staging incident is owned elsewhere. Not blocking. - All required Go test / lint / postgres-integration jobs are green. ### Minor observations (non-blocking) - The legacy `platformManagedLLMMode()` and `rejectPlatformManagedDirectLLMBypass()` shims exist purely for test-harness compat per the comment. Worth a follow-up sweep to delete after the test fleet stabilizes — a future agent could accidentally call them thinking they were canonical. Not a blocker; the comments clearly mark them as legacy. - AST gate to forbid `os.Getenv("MOLECULE_LLM_BILLING_MODE")` in strip-gate code paths was mentioned in the RFC test plan but is implemented as a `grep` check, not a real lint. Future-resilience nice-to-have, not a blocker. Approving as `agent-reviewer` since `hongming` is the author.

core-security approved these changes 2026-05-26 22:57:15 +00:00

core-security left a comment

core-security review — APPROVED

Security axis check:

NULL-resolver-default-closed: confirmed in resolveLLMBillingMode — JOIN failures + transient errors fall through to platform_managed, not byok. Strip-gate remains the safety floor.
Strip-list integrity: platformManagedDirectLLMBypassKeys in workspace-server/internal/handlers/secrets.go UNCHANGED by this PR. All vendor tokens (CLAUDE_CODE_OAUTH_TOKEN, KIMI_API_KEY, MINIMAX_API_KEY, ANTHROPIC_AUTH_TOKEN, ...) still get stripped when resolved mode is platform_managed.
Per-workspace gate placement: secrets.Set + workspace.Create both call the new resolver, no bypass surface.
Token leak: no new logging of raw tokens; resolved mode value is fine to log (operator debug).

Backward-compat note (not blocking): legacy platformManagedLLMMode() / rejectPlatformManagedDirectLLMBypass() shims kept for test-harness compat — flag for future cleanup sweep.

Sentinel: APPROVED.

## core-security review — APPROVED Security axis check: - **NULL-resolver-default-closed**: confirmed in `resolveLLMBillingMode` — JOIN failures + transient errors fall through to `platform_managed`, not byok. Strip-gate remains the safety floor. - **Strip-list integrity**: `platformManagedDirectLLMBypassKeys` in `workspace-server/internal/handlers/secrets.go` UNCHANGED by this PR. All vendor tokens (CLAUDE_CODE_OAUTH_TOKEN, KIMI_API_KEY, MINIMAX_API_KEY, ANTHROPIC_AUTH_TOKEN, ...) still get stripped when resolved mode is `platform_managed`. - **Per-workspace gate placement**: secrets.Set + workspace.Create both call the new resolver, no bypass surface. - **Token leak**: no new logging of raw tokens; resolved mode value is fine to log (operator debug). Backward-compat note (not blocking): legacy `platformManagedLLMMode()` / `rejectPlatformManagedDirectLLMBypass()` shims kept for test-harness compat — flag for future cleanup sweep. Sentinel: APPROVED.

core-qa approved these changes 2026-05-26 22:57:16 +00:00

core-qa left a comment

core-qa review — APPROVED

Test coverage check:

Unit: resolveLLMBillingMode table-driven (workspace override → org default → constant fallback).
Unit: applyPlatformManagedLLMEnv strips ONLY when resolved=platform_managed.
Post-condition: resolver stays closed on DB error even when org says byok.
2 pre-existing tests updated for the new default-closed contract (TestExtended_SecretsSet, TestWorkspaceCreate_SecretPersistFails_RollsBack) — original happy-path intent preserved, now explicitly opting into byok via t.Setenv.

Stage A clean: go build ./... && go test ./... 40 packages, 0 FAIL.

E2E Staging SaaS failure is documented as a separate staging incident (internal#692) — not a regression from this PR.

Sentinel: APPROVED.

## core-qa review — APPROVED Test coverage check: - Unit: `resolveLLMBillingMode` table-driven (workspace override → org default → constant fallback). - Unit: `applyPlatformManagedLLMEnv` strips ONLY when resolved=`platform_managed`. - Post-condition: resolver stays closed on DB error even when org says byok. - 2 pre-existing tests updated for the new default-closed contract (`TestExtended_SecretsSet`, `TestWorkspaceCreate_SecretPersistFails_RollsBack`) — original happy-path intent preserved, now explicitly opting into byok via `t.Setenv`. Stage A clean: `go build ./... && go test ./...` 40 packages, 0 FAIL. E2E Staging SaaS failure is documented as a separate staging incident (internal#692) — not a regression from this PR. Sentinel: APPROVED.

core-lead merged commit db5ffed2b5 into main

2026-05-26 22:57:24 +00:00

core-lead referenced this issue from a commit

2026-05-26 22:57:25 +00:00

feat(workspace-server): per-workspace llm_billing_mode override (internal#691) (#1927)

Sign in to join this conversation.

No Reviewers

agent-reviewer

core-security

core-qa