BUG: provision-time billing resolve diverges from read endpoint → byok workspaces provisioned platform_managed → drain platform LLM key #1994

Closed
opened 2026-05-28 18:31:54 +00:00 by hongming · 1 comment
Owner

Impact (live, costs money)

Reno-stars Marketing Agent (6b66de8d-9337-4fb4-be8d-6d49dca0d809, opus, claude-code) is configured byok (cc-subscription, reno's own CLAUDE_CODE_OAUTH_TOKEN) but its provisioned EC2 env routes through the platform proxy on platform_managed, billing the PLATFORM Anthropic key for the customer's usage. Drained a $5 platform recharge in ~hours. Re-provisioning does NOT fix it (verified by canary today — it came back identical).

The divergence (confirmed live, same workspace, same minute)

| Source | Result |
|---|---|
| Admin `GET /admin/workspaces/6b66de8d.../llm-billing-mode` | `resolved_mode=byok`, `source=derived_provider`, `provider_selection=anthropic-oauth` |
| Actual provisioned EC2 env (`docker exec printenv`) | `MOLECULE_LLM_BILLING_MODE_RESOLVED=platform_managed`, `ANTHROPIC_BASE_URL=<platform proxy>`, `secrets.d` EMPTY, `llm-auth: no ANTHROPIC_AUTH_TOKEN set` |

The SSOT billing fix (internal#718 P2-B) is wired into the read endpoint but the provisioning env-emission path resolves differently and bakes platform_managed.

Suspected mechanism (for Phase-1 falsification — verify, don't trust)

  1. readWorkspaceDeriveInputs (llm_billing_mode.go:334, used by the read path ResolveLLMBillingMode) queries ONLY SELECT ... FROM workspace_secrets WHERE workspace_id=$1 (line ~352) — it does NOT union global_secrets. The customer oauth (CLAUDE_CODE_OAUTH_TOKEN) lives in global_secrets (workspace_provision.go:838-844). So the availableAuthEnv passed to DeriveProvider(runtime, model, availableAuthEnv) may differ between the read path and the provision path.
  2. The provision path (workspace_provision.go ~line 892: ResolveLLMBillingModeDerived(ctx, workspaceID, runtime, model, availableAuthEnv)) builds availableAuthEnv from a different source than the read endpoint → DeriveProvider tie-breaks differently for opus (platform-anthropic vs anthropic-oauth) → platform_managed.
  3. Once platform_managed resolves, applyPlatformManagedLLMEnv (workspace_provision.go:882) + stripGlobalOriginLLMCreds (line 995) STRIP the global-origin CLAUDE_CODE_OAUTH_TOKEN and inject the platform proxy key — so the oauth never materializes into secrets.d (observed EMPTY). This is CIRCULAR: resolves platform_managed → strips oauth → no oauth in env → can't derive byok next time.
    • Compare: agents-team PM (also claude-code/opus) admin GET returned source=org_default platform_managed while marketing returned source=derived_provider byok — so the derive result is input-sensitive; nail WHY they differ.

Required fix

The provision-time billing resolution MUST be consistent with the read endpoint for the same workspace: a workspace whose registry-derived provider is non-platform (anthropic-oauth) must provision byok → NOT override ANTHROPIC_BASE_URL (leave it direct / unset so claude-code talks to api.anthropic.com) → AND materialize the customer's own oauth (from global_secrets) into the workspace env/secrets.d rather than stripping it. Net: IsPlatform(derived)==false ⇒ byok-direct using the customer's own credential, no platform-key billing (the CTO principle: non-platform provider ⇒ byok, no fallback).

Acceptance criteria

  • For (claude-code, opus) with the customer's CLAUDE_CODE_OAUTH_TOKEN present in global_secrets: provisioned env has billing_mode=byok, ANTHROPIC_BASE_URL is api.anthropic.com (or unset/direct, NOT the platform proxy), and the oauth token is present in the workspace (secrets.d / env).
  • Admin GET and the provisioned env AGREE on billing_mode for the same workspace (add a regression test asserting parity).
  • No platform-proxy /v1/messages calls from a byok workspace's IP after re-provision.
  • Mutation-load-bearing test: forcing the provision path back to platform_managed for an anthropic-oauth workspace turns the new parity test RED.
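
The parity and mutation criteria above could be sketched roughly as follows. This is only an illustration: `resolver`, `checkParity`, and the stubbed resolvers are hypothetical names, not the repo's API, and a real regression test would call the actual read and provision paths from a `testing` test.

```go
package main

import "fmt"

// resolver is a hypothetical signature standing in for either billing path.
type resolver func(runtime, model string) string

// checkParity returns an error when the read path and the provision path
// resolve different billing modes for the same (runtime, model) pair.
func checkParity(read, provision resolver, runtime, model string) error {
	r, p := read(runtime, model), provision(runtime, model)
	if r != p {
		return fmt.Errorf("billing mode mismatch for (%s, %s): read=%s provision=%s",
			runtime, model, r, p)
	}
	return nil
}

func main() {
	// Stubs (assumptions): the read path resolves byok for this workspace.
	readPath := func(runtime, model string) string { return "byok" }
	fixedProvision := readPath
	// Mutation: force the provision path back to platform_managed.
	mutatedProvision := func(runtime, model string) string { return "platform_managed" }

	fmt.Println(checkParity(readPath, fixedProvision, "claude-code", "opus") == nil)   // true
	fmt.Println(checkParity(readPath, mutatedProvision, "claude-code", "opus") != nil) // true
}
```

The second call is the mutation check: if the forced `platform_managed` provision path did not produce an error, the parity test would not be load-bearing.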

Repro / verification

AWS SSM into the workspace EC2, docker exec molecule-workspace printenv | grep ANTHROPIC_BASE_URL → currently shows the platform proxy. After fix + re-provision → should show api.anthropic.com direct. Confirm via CP proxy logs: the workspace IP stops hitting /api/v1/internal/llm/anthropic/v1/messages.
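
To make the manual check repeatable, the `printenv` output could be fed through a small checker along these lines. It is a sketch under assumptions: `parseEnv` and `byokViolations` are hypothetical helpers, the proxy URL in `main` is a placeholder, and the host check is a loose substring match rather than real URL parsing.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseEnv turns `printenv` output (one KEY=VALUE per line) into a map.
func parseEnv(out string) map[string]string {
	env := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		if k, v, ok := strings.Cut(sc.Text(), "="); ok {
			env[k] = v
		}
	}
	return env
}

// byokViolations lists the ways an env contradicts the byok expectations:
// the resolved mode must be byok, and ANTHROPIC_BASE_URL must be unset or
// point at api.anthropic.com (loose substring check, an assumption).
func byokViolations(env map[string]string) []string {
	var problems []string
	if mode := env["MOLECULE_LLM_BILLING_MODE_RESOLVED"]; mode != "byok" {
		problems = append(problems, "billing mode is "+mode)
	}
	if base := env["ANTHROPIC_BASE_URL"]; base != "" && !strings.Contains(base, "api.anthropic.com") {
		problems = append(problems, "ANTHROPIC_BASE_URL is not direct: "+base)
	}
	return problems
}

func main() {
	// Placeholder values mimicking the broken state described in this issue.
	sample := "MOLECULE_LLM_BILLING_MODE_RESOLVED=platform_managed\n" +
		"ANTHROPIC_BASE_URL=https://proxy.example.invalid\n"
	for _, p := range byokViolations(parseEnv(sample)) {
		fmt.Println(p)
	}
}
```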

Cross-ref: project_llm_billing_mode_drain_root (the original drain incident), internal#718 P2-B (the read-side fix), the cc-subscription marketing-agent observation 2026-05-28.

Author
Owner

Root cause pinned + fix opened as #1995 (NOT merged - CTO merge-go required).

Mechanism: provision-time applyPlatformManagedLLMEnv derived billing from the RAW payload.Model, which is empty on re-provision (withStoredCompute backfills Compute but not Model), so DeriveProvider(claude-code, "") errored -> default-closed platform_managed + platform proxy. The read endpoint reads the MODEL workspace_secret (opus -> anthropic-oauth -> byok). Divergence.
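
A toy model of that failure, for illustration only: `deriveProvider` and `resolveBillingMode` below are simplified stand-ins, not the real `DeriveProvider` or `applyPlatformManagedLLMEnv`. The only behaviour taken from this comment is that `opus` maps to `anthropic-oauth`, that an empty model errors, and that an error fails closed to `platform_managed`.

```go
package main

import (
	"errors"
	"fmt"
)

// deriveProvider is a toy stand-in: it knows a single exact model-id match
// and rejects an empty model, mirroring the behaviour described above.
func deriveProvider(runtime, model string) (string, error) {
	if model == "" {
		return "", errors.New("empty model")
	}
	if runtime == "claude-code" && model == "opus" {
		return "anthropic-oauth", nil
	}
	return "", fmt.Errorf("unknown model %q for runtime %q", model, runtime)
}

// resolveBillingMode fails closed: any derive error, or a platform provider,
// yields platform_managed; everything else is byok.
func resolveBillingMode(runtime, model string) string {
	provider, err := deriveProvider(runtime, model)
	if err != nil || provider == "platform-anthropic" {
		return "platform_managed"
	}
	return "byok"
}

func main() {
	// Read path: the model comes from the stored MODEL secret.
	fmt.Println(resolveBillingMode("claude-code", "opus")) // byok
	// Re-provision path before the fix: the raw payload model is empty.
	fmt.Println(resolveBillingMode("claude-code", "")) // platform_managed
}
```

Same workspace, same resolver, two different answers: the only difference is which model string each path hands in.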

The auth-env / global_secrets-union hypothesis was FALSIFIED: opus resolves by EXACT model-id match in the claude-code native set (auth-env-insensitive). The "materialize the global-origin oauth" hypothesis was also FALSIFIED - that token is the PLATFORM's (the original drain source); internal#711 correctly strips it and fails closed (MISSING_BYOK_CREDENTIAL). The customer must supply their OWN oauth as a workspace_secret. Fix: derive billing from the EFFECTIVE model (payload.Model -> MOLECULE_MODEL -> MODEL, the same chain applyRuntimeModelEnv uses).
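
The effective-model chain could look roughly like this. It is a minimal sketch, not the code in #1995: `effectiveModel` and the `env` map are hypothetical, and only the lookup order (`payload.Model`, then `MOLECULE_MODEL`, then `MODEL`) comes from the description above.

```go
package main

import "fmt"

// effectiveModel returns the first non-empty value in the chain
// payload model -> MOLECULE_MODEL -> MODEL, or "" if none is set.
func effectiveModel(payloadModel string, env map[string]string) string {
	if payloadModel != "" {
		return payloadModel
	}
	for _, key := range []string{"MOLECULE_MODEL", "MODEL"} {
		if v := env[key]; v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Re-provision case: the payload carries no model, but MODEL is stored.
	env := map[string]string{"MODEL": "opus"}
	fmt.Println(effectiveModel("", env)) // opus
}
```

Deriving billing from this value instead of the raw payload field means a re-provision sees the same model the read endpoint does.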

Live-confirmed today via SSM on the marketing agent EC2 (i-0ad447e8433902973): MODEL=opus but MOLECULE_LLM_BILLING_MODE_RESOLVED=platform_managed + ANTHROPIC_BASE_URL=<platform proxy>.


Reference: molecule-ai/molecule-core#1994