feat(workspace-server): per-workspace llm_billing_mode override (internal#691) #1927
Reference in New Issue
Block a user
Delete Branch "feat/per-workspace-llm-billing-mode"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Workspace-server side of the per-workspace
llm_billing_modeoverride (internal#691). Adds:20260526120000_workspaces_llm_billing_mode— nullableTEXTcolumn withCHECK (...)enum constraint on theworkspacestable.ResolveLLMBillingMode(ctx, workspaceID, orgMode)—workspaces.llm_billing_mode ?? org_default ?? 'platform_managed'. Default-closed on every NULL / error / garbled enum / JOIN miss.applyPlatformManagedLLMEnvnow reads the RESOLVED per-workspace mode (was:os.Getenv("MOLECULE_LLM_BILLING_MODE")). Strips ONLY when resolved ==platform_managed. ExportsMOLECULE_LLM_BILLING_MODE_RESOLVEDinto the container env so an in-container debug check can answer "what mode is this workspace running under" without DB queries.GET / PUT /admin/workspaces/:id/llm-billing-mode— these are what CP's new/cp/admin/workspaces/:id/llm-billing-modeproxies to, and what the canvas hits directly (same origin).rejectPlatformManagedDirectLLMBypassForWorkspace— the strip-list on POST/workspaces/:id/secretsnow respects the per-workspace mode. The org-level shim is retained for the global secrets path which is intentionally org-scoped.Deploy order — this PR ships SECOND:
Why local DB read, not CP roundtrip
The strip gate runs on the workspace-server, the
workspacestable is local to it, and the org-default is already pulled fresh on every container start viatenant_config(cached asMOLECULE_LLM_BILLING_MODE). A new CP roundtrip per provision would add ~50ms p50 and a new failure mode (CP unreachable → default-close toplatform_managed, but that would shadow a legitimatebyokoverride). Local DB join is monotonic with the workspace's own lifetime — no cross-service race. Justified in the Phase 1 design comment.Default-closed resolution contract (RFC Safety hot-spot)
The ONLY paths to
byokordisabledare an explicit, validated, recognized string. Every other branch lands onplatform_managed. Verified by table-drivenTestResolveLLMBillingMode_TableDriven(10 cases) and the post-condition testTestResolveLLMBillingMode_ResolvedModeIsAlwaysValid(garbled-x-garbled must resolve toplatform_managed).Test plan
go build ./...— cleango test ./...— 40 packages, 0 failurest.Setenv("MOLECULE_LLM_BILLING_MODE", "byok")to assert the happy-path-of-secret-persist they were testing)Five-Axis self-review
Correctness: Resolver is the single source of truth — no remaining code path gates on
os.Getenv("MOLECULE_LLM_BILLING_MODE")for strip decisions. The org-default env read is the resolver's INPUT, not its decision. AST-style gate would be a nice follow-up; for nowgrep -rn 'MOLECULE_LLM_BILLING_MODE' workspace-server/ --include="*.go"shows only documented call sites.Safety/Security: Default-closed contract above. Per-workspace secret-write gate now reads the resolved mode (workspace=byok no longer falsely rejects the user's own CLAUDE_CODE_OAUTH_TOKEN write). The global-secrets path is unchanged — operator-scope token check still fires org-wide.
Test coverage: 21 new test cases (10 resolver + 4 setter + 7 admin handler). The garbled-row test case explicitly covers the "future migration relaxes the CHECK constraint" scenario.
Observability:
MOLECULE_LLM_BILLING_MODE_RESOLVEDinjected into container env. Strip-gate logsworkspace_provision: billing mode workspace=X resolved=Y source=Z org_default=Wat INFO on every provision. Admin GET surfacessourcein the response.Backward-compat: Nullable column add — no backfill, no flag day. Workspaces without an override behave identically to pre-#691 because the resolver falls through to the same org default that was the only signal before. If CP is on the old version, the per-tenant admin route still works (workspaces table mutation is local); CP's proxy route 404s harmlessly. Deploy order documented above.
CTO attention items
The architectural fork (column on workspace-server, not CP) is explained in the Phase 1 design comment. Two pre-existing tests were updated to set
MOLECULE_LLM_BILLING_MODE=byokbecause the new default-closed behavior rejects vendor-key writes when the env is unset — that's the correct safer behavior per the RFC, but it's a behavior delta worth flagging.🤖 Generated with Claude Code
SOP-Checklist
TestResolveLLMBillingMode_ResolvedModeIsAlwaysValid, 4SetWorkspaceLLMBillingModesetter cases, 7 admin handler cases covering GET happy/bad-UUID and PUT set/clear/missing-mode/unknown-mode/no-such-workspace). Two pre-existing tests updated to opt intobyokviat.Setenvsince their intent was secret-persist happy path, not the strip gate. Edge cases covered include garbled-enum rows ("byokk"), DB error, sql.ErrNoRows, JOIN miss, and explicit-null clear.sqlmockfor the resolver/setter/admin paths (no live Postgres needed); the migration itself is plainALTER TABLE workspaces ADD COLUMN llm_billing_mode TEXT NULL CHECK (...)with no data backfill. The handlers_postgres_integration CI job (which DOES exercise live Postgres) is green on this PR — see CI status.dev-sop.md. The architectural contract was validated in Phase 1 design comment (internal#691#issuecomment-50151). No staging deploy from agent.llm_billing_modebeing an org-wide dial — the only knob wastenant_config, so flipping tobyokfor one team flipped every workspace in that org. Per-workspace override IS the root-cause fix; no per-tenant special-case, no env-var hack.tenant_config.llm_billing_modeorg default they read pre-#691 (resolver fall-through). No deprecation shim, no parallel code path, no feature flag. The two updated tests are updated, not shimmed.feedback_principles(autonomy + safety layer),feedback_no_silent_checklist_trim(filing every axis even when N/A),feedback_authoritative_source_per_concept_no_indirect_db_assertions(resolver is the single source of truth, not callers' env-var reads),reference_workspace_placement_rfc(data lives on the workspace-server EC2, not CP — confirms Option A column placement).Phase 3 — Five-Axis self-review (workspace-server)
CR1 self-pass per dev-sop. CR2 substance pass from a peer reviewer still needed before merge.
Correctness
os.Getenv("MOLECULE_LLM_BILLING_MODE") == "platform_managed"directly; now it gates onResolveLLMBillingMode(...).ResolvedMode. Verified:grep -rn 'MOLECULE_LLM_BILLING_MODE' workspace-server/ --include="*.go"shows only documented call sites (the resolver INPUT read in workspace_provision.go:947, the legacy org-level shim in secrets.go:52 kept for global-secrets only, plus comments).TestResolveLLMBillingMode_ResolvedModeIsAlwaysValidthrows a pathological garbled-x-garbled fixture at the resolver and asserts the post-condition. Strip-gate downstream can switch onres.ResolvedModewithout a separate is-valid check.rejectPlatformManagedDirectLLMBypassForWorkspace). The org-level shim is retained ONLY for the global-secretsSetGlobalpath which is intentionally org-scoped.Safety / Security
Default-closed contract — non-negotiable per the RFC Safety axis:
byok/disabledgarbled-stringbyokThe only paths to a non-platform_managed result are explicit, validated, recognized strings. Verified by
TestResolveLLMBillingMode_TableDriven(10 cases, one per row above).The DB-error case is the most important safety property: even if the org-default is
byok, a DB error means we can't confirm the workspace doesn't have an override that saysbyok(it could equally have an override that saysplatform_managed), so we default to the closed mode. Silently flipping to org-byok on a DB error would leak the OAuth-keeping behavior to workspaces whose row says NULL.Test coverage
TestExtended_SecretsSet,TestWorkspaceCreate_SecretPersistFails_RollsBack. Their intent was the happy-path of vendor-key persistence, not the strip gate; under the new default-closed semantic (empty env → platform_managed), they needed an explicit org-byok opt-out to assert their original property. The change is documented inline with a comment pointing back to this RFC.Observability
MOLECULE_LLM_BILLING_MODE_RESOLVEDexported into every workspace container so an in-container debug check answers "what mode is this workspace running under" without DB queries.applyPlatformManagedLLMEnvlogsworkspace_provision: billing mode workspace=X resolved=Y source=Z org_default=Wat INFO on every provision. Operators can grep that line to answer "why is this workspace being stripped" — the RFC Observability hot-spot.sourcein the response so the canvas + ops scripts can render the explanation directly.Backward-compat
ALTER TABLE workspaces ADD COLUMN IF NOT EXISTS llm_billing_mode TEXT ...). No backfill, no flag day.Items flagged for CR2 reviewer
workspacestable, not on CP. See the Phase 1 design comment for the Option A vs B trade. If the reviewer prefers Option B, all three PRs need to be redone.grepexercise — proper AST traversal asserting no production code path readsMOLECULE_LLM_BILLING_MODEfor strip decisions would be a useful follow-up (similar to the existing audit-coverage gate from #335). Filed as a TODO; not blocking this PR.🤖 Generated with Claude Code
E2E Staging SaaS failure — staging environment, not this PR
Pulled the run log from the failing job (task 182157 on molecule-runner-3):
The failure is on the HMA memory POST to the parent agent — HTTP 500
from staging — and reproduces independently on the Continuous synthetic
E2E (staging) workflow at the same window:
cffe4be): passed (HMA memory write+read roundtripped)memory POST failed)memory POST failed)The synthetic E2E runs against staging on a timer, not against PRs.
Same symptom, same staging env, no code from this PR involved → this is
a staging environment incident, not a regression from #1927.
Sanity-check on PR scope: this PR's diff touches
workspace-server/internal/handlers/llm_billing_mode*.go,workspace_provision.go::applyPlatformManagedLLMEnv,secrets.go::rejectPlatformManagedDirectLLMBypassForWorkspace, and themigration. None of those are on the HMA memory POST path (separate
service, separate route).
Retrying the E2E once to confirm; if the staging env recovers the
re-run will pass, if not the next sweep will show the same symptom
on every other PR against staging. I am not retrying blindly —
the retry purpose is to distinguish "transient staging blip
recovered" from "staging is still down" so we do not burn an
investigation cycle on a fix that lives outside this PR surface.
Flagging for CTO visibility: HMA memory POST on staging has been
returning 500 since some time between 11:08 and 19:12 UTC today.
Five-Axis review — APPROVED
Fresh-eyes pass on the workspace-server migration + resolver + strip-gate diff for internal#691. Approving — the default-closed contract is real (not just claimed), the per-workspace gate replaces the org-level shim cleanly, and the strip-list authority is preserved.
Correctness
workspaces.llm_billing_mode ?? org_default ?? 'platform_managed'. Implementation: workspace override > org default (fromtenant_config.MOLECULE_LLM_BILLING_MODE) > constant fallback.BillingModeSourcecorrectly distinguishesorg_default(org said the value) fromconstant_fallback(org garbled, we clamped) — that distinction matters for operator debug. ✅workspaceIDshort-circuit (no DB read, resolve from org-only) is the right behaviour for pre-provision templating contexts. The 5 existingapplyPlatformManagedLLMEnvtests rely on this path and they continue to pass (workspaceID=""+t.Setenv(MOLECULE_LLM_BILLING_MODE, "platform_managed")). ✅applyPlatformManagedLLMEnvnow takesctx + workspaceIDand gates on the resolved mode. The call site inworkspace_provision_shared.go::prepareProvisionContextpasses the real workspace ID + request context. No remainingos.Getenv("MOLECULE_LLM_BILLING_MODE")gate in strip-decision code — the env var is now resolver INPUT only. ✅MOLECULE_LLM_BILLING_MODE_RESOLVEDis exported intoenvVarsBEFORE the early-return for non-platform-managed modes, so byok / disabled containers also see the resolved label. Per RFC observability spec. ✅rejectPlatformManagedDirectLLMBypassForWorkspace) hooks bothsecrets.Setandworkspace.Createpaths. A workspace with abyokoverride can now write its ownCLAUDE_CODE_OAUTH_TOKEN— the actual UX bug this RFC is solving. ✅ADD COLUMN IF NOT EXISTS llm_billing_mode TEXT CHECK (NULL OR IN (...)). Nullable, no backfill, idempotent. Down migration drops the column cleanly. ✅Safety / Security
ResolveLLMBillingModelands on a recognized enum:sql.ErrNoRows→ org-default-or-fallback; DB error →platform_managed+ propagated error; garbled override → fall through; garbled org-default → constant fallback.TestResolveLLMBillingMode_ResolvedModeIsAlwaysValidis the post-condition guard. ✅db_error_default_closed_to_pm_with_error). This is the subtle one — silently honoring org-byok on a DB failure would leak the OAuth-keeping behaviour to workspaces whose row says NULL. The resolver gets it right. ✅platformManagedDirectLLMBypassKeys) is unchanged on this PR. I verified by readingsecrets.goonmainagainst the diff — the map literal at the top of the file is not touched. Strip-list authority is preserved. ✅wsAdmingroup (r.Group("", middleware.AdminAuth(db.DB))ininternal/router/router.go~L164). Same Tier-2b admin auth as all other workspace-server admin routes. ✅platformManagedLLMMode()/rejectPlatformManagedDirectLLMBypass()are kept for test compat only, with explicit comments saying production must use the per-workspace variant. The two production call sites (secrets.Set+workspace.Create) are converted. No remaining production caller of the legacy gate. ✅platform_managed→ bypass-list rejection still fires. Transient DB error during a secret write does not accidentally allow a vendor key write. ✅Test coverage
resolved_modeis always a known enum string. Theworkspace_override_garbled_falls_through_to_pm_org_DEFAULT_CLOSEDcase explicitly guards the "future migration relaxes CHECK" scenario. ✅SetWorkspaceLLMBillingMode: 4 cases (unknown-mode short-circuits without DB call, empty-ws-id rejected, NULL clear via NULL update, byok via value update). ✅TestExtended_SecretsSet,TestWorkspaceCreate_SecretPersistFails_RollsBack) were updated tot.Setenv(MOLECULE_LLM_BILLING_MODE, "byok")so the new per-workspace resolver lookup falls through to a byok org-default, matching the pre-#691 implicit behaviour of "unset env = no strip." PR body flags this as a behaviour delta — correct call. The new tests cover the strip-decision logic itself. ✅Observability
MOLECULE_LLM_BILLING_MODE_RESOLVEDexported into every workspace container's env (always, even when not platform-managed) — answers "what mode is this container running under" without DB. ✅workspace_provision: billing mode workspace=X resolved=Y source=Z org_default=W. Three structured fields = greppable. ✅sourceso an operator can answer "why is this workspace being stripped?" with one curl. ✅Backward compat
MOLECULE_LLM_BILLING_MODEenv that already came fromtenant_config). CP's proxy route would just 404 harmlessly. Verified: no cross-service flag-day coupling. ✅CI / non-PR concerns
mainindependently — see comment id 50242 with timestamps). Not a regression from this PR; staging incident is owned elsewhere. Not blocking.Minor observations (non-blocking)
platformManagedLLMMode()andrejectPlatformManagedDirectLLMBypass()shims exist purely for test-harness compat per the comment. Worth a follow-up sweep to delete after the test fleet stabilizes — a future agent could accidentally call them thinking they were canonical. Not a blocker; the comments clearly mark them as legacy.os.Getenv("MOLECULE_LLM_BILLING_MODE")in strip-gate code paths was mentioned in the RFC test plan but is implemented as agrepcheck, not a real lint. Future-resilience nice-to-have, not a blocker.Approving as
agent-reviewersincehongmingis the author.core-security review — APPROVED
Security axis check:
resolveLLMBillingMode— JOIN failures + transient errors fall through toplatform_managed, not byok. Strip-gate remains the safety floor.platformManagedDirectLLMBypassKeysinworkspace-server/internal/handlers/secrets.goUNCHANGED by this PR. All vendor tokens (CLAUDE_CODE_OAUTH_TOKEN, KIMI_API_KEY, MINIMAX_API_KEY, ANTHROPIC_AUTH_TOKEN, ...) still get stripped when resolved mode isplatform_managed.Backward-compat note (not blocking): legacy
platformManagedLLMMode()/rejectPlatformManagedDirectLLMBypass()shims kept for test-harness compat — flag for future cleanup sweep.Sentinel: APPROVED.
core-qa review — APPROVED
Test coverage check:
resolveLLMBillingModetable-driven (workspace override → org default → constant fallback).applyPlatformManagedLLMEnvstrips ONLY when resolved=platform_managed.TestExtended_SecretsSet,TestWorkspaceCreate_SecretPersistFails_RollsBack) — original happy-path intent preserved, now explicitly opting into byok viat.Setenv.Stage A clean:
go build ./... && go test ./...40 packages, 0 FAIL.E2E Staging SaaS failure is documented as a separate staging incident (internal#692) — not a regression from this PR.
Sentinel: APPROVED.