[bug] applyPlatformManagedLLMEnv falsely reports HasUsableLLMCred:true on empty proxy env → claude-code boots credential-less (adk-demo dark-wedge class) #2162

Closed
opened 2026-06-03 01:31:25 +00:00 by molecule-code-reviewer · 1 comment
Member

Root cause of the adk-demo claude-code+platform-kimi boot failure (code-confirmed)

workspace-server/internal/handlers/workspace_provision.go:1005-1012, applyPlatformManagedLLMEnv:

baseURL := firstNonEmptyEnv("MOLECULE_LLM_BASE_URL", "OPENAI_BASE_URL")
token   := firstNonEmptyEnv("MOLECULE_LLM_USAGE_TOKEN", "OPENAI_API_KEY")
if baseURL == "" || token == "" {
    // "Proxy not configured (boot race / misconfig)"
    return platformLLMEnvResult{ResolvedMode: res.ResolvedMode, HasUsableLLMCred: true, ...}
}
...
if runtimeUsesAnthropicNativeProxy(runtime) && anthropicBaseURL != "" {
    envVars["ANTHROPIC_API_KEY"] = token
    envVars["ANTHROPIC_BASE_URL"] = anthropicBaseURL
}

The bug: for a platform-managed workspace, the LLM credential (ANTHROPIC_API_KEY for an Anthropic-native runtime like claude-code) is ONLY set on the success path (≈1031), which requires the upstream proxy env (MOLECULE_LLM_BASE_URL + MOLECULE_LLM_USAGE_TOKEN) to already be present. If either value resolves to empty (the code admits "boot race / misconfig"), the guard at line 1005 returns early: it sets NO auth env at all, yet reports HasUsableLLMCred: true. So:

  1. claude-code boots with ZERO credential → its no-auth "Not logged in" state (NOT the OAuth error — there is simply no key).
  2. The dishonest HasUsableLLMCred: true means the caller never fail-closes → the workspace is started credential-less.
  3. Combined with registration being gated behind the claude login preflight (runtime#87), the agent never reaches /registry/register → 600s provision-timeout sweep → instance reaped → un-debuggable.

This is the molecule-adk-demo failure class (ws 29b95be9, claude-code + moonshot/kimi-k2.6, "container started but never called /registry/register"). adk-demo has ZERO global_secrets → 100% dependent on this injection → no BYOK fallback.

Fix: the empty-proxy-env path must FAIL-CLOSED, i.e. abort provision with a clear MISSING_PLATFORM_PROXY error naming the missing MOLECULE_LLM_* env, symmetric to the BYOK MISSING_BYOK_CREDENTIAL hard-fail (#711). Never report HasUsableLLMCred: true when no credential was set. Regression test (per SOP #765): platform-managed claude-code with proxy env present → ANTHROPIC_API_KEY + ANTHROPIC_BASE_URL set, HasUsableLLMCred true; with proxy env ABSENT → fail-closed, NOT started credential-less. Pairs with runtime#87 (register before LLM preflight → debuggable even on this path) and cp#455 (boot-e2e reproduces).
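
A minimal sketch of what the fail-closed guard could look like, assuming the check is factored into a helper that takes an injectable env lookup. The names resolvePlatformProxy, firstNonEmpty and ErrMissingPlatformProxy are hypothetical, and the real function's signature, result struct and abort wiring are not shown in this issue, so treat this as an illustration of the invariant rather than the actual patch:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// ErrMissingPlatformProxy is a hypothetical sentinel; the real error type and
// how it reaches the provision abort path are not shown in the issue.
var ErrMissingPlatformProxy = errors.New("MISSING_PLATFORM_PROXY")

// firstNonEmpty mirrors the assumed behaviour of firstNonEmptyEnv: it returns
// the first non-empty value among keys, using an injectable lookup.
func firstNonEmpty(getenv func(string) string, keys ...string) string {
	for _, k := range keys {
		if v := getenv(k); v != "" {
			return v
		}
	}
	return ""
}

// resolvePlatformProxy returns the proxy base URL and token, or an error that
// names every missing MOLECULE_LLM_* variable. It never reports success when
// either value is empty.
func resolvePlatformProxy(getenv func(string) string) (baseURL, token string, err error) {
	baseURL = firstNonEmpty(getenv, "MOLECULE_LLM_BASE_URL", "OPENAI_BASE_URL")
	token = firstNonEmpty(getenv, "MOLECULE_LLM_USAGE_TOKEN", "OPENAI_API_KEY")

	var missing []string
	if baseURL == "" {
		missing = append(missing, "MOLECULE_LLM_BASE_URL")
	}
	if token == "" {
		missing = append(missing, "MOLECULE_LLM_USAGE_TOKEN")
	}
	if len(missing) > 0 {
		return "", "", fmt.Errorf("%w: missing %s",
			ErrMissingPlatformProxy, strings.Join(missing, ", "))
	}
	return baseURL, token, nil
}

func main() {
	if _, _, err := resolvePlatformProxy(os.Getenv); err != nil {
		fmt.Println("abort provision:", err)
		return
	}
	fmt.Println("proxy env present; safe to inject credentials")
}
```

Wrapping the sentinel with %w lets a caller branch on errors.Is(err, ErrMissingPlatformProxy) while still surfacing which variable was absent.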

Still to confirm (upstream half): WHY MOLECULE_LLM_BASE_URL/MOLECULE_LLM_USAGE_TOKEN were absent for adk-demo's provision — CP injection gap, boot race, or platform proxy not configured for that tenant/runtime. Confirm via a re-provision + env check (the code bug above is confirmed regardless and is the most likely proximate cause).
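
For the env check, one option is a small presence report run in the same environment as the workspace-server process. This is a sketch under that assumption; proxyEnvPresence is a hypothetical helper, and it deliberately reports only whether each variable is non-empty so the token never lands in logs:

```go
package main

import (
	"fmt"
	"os"
)

// proxyEnvKeys lists the variables consulted by the firstNonEmptyEnv calls
// quoted in this issue.
var proxyEnvKeys = []string{
	"MOLECULE_LLM_BASE_URL", "OPENAI_BASE_URL",
	"MOLECULE_LLM_USAGE_TOKEN", "OPENAI_API_KEY",
}

// proxyEnvPresence reports, per variable, whether it is set to a non-empty
// value. Values themselves are never returned.
func proxyEnvPresence(getenv func(string) string) map[string]bool {
	present := make(map[string]bool, len(proxyEnvKeys))
	for _, k := range proxyEnvKeys {
		present[k] = getenv(k) != ""
	}
	return present
}

func main() {
	present := proxyEnvPresence(os.Getenv)
	for _, k := range proxyEnvKeys {
		fmt.Printf("%-26s set=%t\n", k, present[k])
	}
}
```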

Author
Member

How this passed CI (the regression-coverage gap — per SOP #765)

Confirmed by reading the tests:

  1. Buggy branch uncovered. The platform-managed test at workspace_provision_shared_test.go:968 calls t.Setenv(MOLECULE_LLM_BASE_URL...) + t.Setenv(MOLECULE_LLM_USAGE_TOKEN...), so it ONLY exercises the proxy-env-PRESENT happy path and asserts injection. NO test calls applyPlatformManagedLLMEnv for a platform-managed workspace with the proxy env ABSENT, so the line-1005 early-return branch is never executed → the suite passes against correct AND broken code.
  2. Asymmetric invariant — we already knew this class. workspace_provision_shared_test.go:511-595 encodes + tests the BYOK fail-closed invariant ("aborts MISSING_BYOK_CREDENTIAL rather than starting credential-less" — the agent-dead class). The structurally identical platform-managed branch has NO symmetric test, and the code does the opposite (HasUsableLLMCred:true + boots credential-less). #2162 is the unguarded twin of a bug already caught+fixed for BYOK.
  3. No boot e2e. These are env-map unit tests (assert map contents), not real-boot. The actual symptom (silent wedge, never registers) + the real trigger (tenant server missing the proxy env) is a cross-system condition only the boot-to-registration e2e (cp#455, not yet built) would catch — the audit-#1 "provision/boot path is mock-only" finding.

Required tests for the fix (SOP #765)

  • Unit (symmetric to the BYOK test): applyPlatformManagedLLMEnv for platform-managed with proxy env ABSENT MUST fail-closed (MISSING_PLATFORM_PROXY abort), NOT return HasUsableLLMCred:true. Mirror the MISSING_BYOK_CREDENTIAL test at workspace_provision_shared_test.go:565-595. Watch it fail against current code first.
  • Boot e2e (cp#455): provision a platform-managed claude-code workspace and assert it registers + serves; this catches the real wedge incl. the tenant-server-missing-proxy-env condition.
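
The unit-level invariant above can be sketched as a table-driven check. This is a self-contained model, not the real test: applySketch and result are hypothetical stand-ins for applyPlatformManagedLLMEnv and platformLLMEnvResult, the proxy env is passed as a map instead of via t.Setenv, and ANTHROPIC_BASE_URL is simply set to the proxy base URL because the real derivation of anthropicBaseURL is elided in the issue. A real version would live in workspace_provision_shared_test.go and use the testing package:

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingPlatformProxy is a hypothetical stand-in for the proposed abort.
var errMissingPlatformProxy = errors.New("MISSING_PLATFORM_PROXY")

// result is a reduced, hypothetical version of platformLLMEnvResult.
type result struct{ HasUsableLLMCred bool }

// applySketch models the FIXED behaviour only: inject credentials when both
// proxy values are present, otherwise fail closed and leave envVars untouched.
func applySketch(proxy, envVars map[string]string) (result, error) {
	baseURL, token := proxy["MOLECULE_LLM_BASE_URL"], proxy["MOLECULE_LLM_USAGE_TOKEN"]
	if baseURL == "" || token == "" {
		return result{}, errMissingPlatformProxy
	}
	envVars["ANTHROPIC_API_KEY"] = token
	envVars["ANTHROPIC_BASE_URL"] = baseURL // simplification, see lead-in
	return result{HasUsableLLMCred: true}, nil
}

// runCases executes the table and returns one message per violated invariant.
func runCases() []string {
	cases := []struct {
		name    string
		proxy   map[string]string
		wantErr bool
	}{
		{"proxy env present", map[string]string{
			"MOLECULE_LLM_BASE_URL":    "https://proxy.example",
			"MOLECULE_LLM_USAGE_TOKEN": "tok",
		}, false},
		{"proxy env absent", map[string]string{}, true},
		{"token missing", map[string]string{"MOLECULE_LLM_BASE_URL": "https://proxy.example"}, true},
		{"base URL missing", map[string]string{"MOLECULE_LLM_USAGE_TOKEN": "tok"}, true},
	}
	var failures []string
	for _, tc := range cases {
		envVars := map[string]string{}
		res, err := applySketch(tc.proxy, envVars)
		if tc.wantErr {
			// Fail-closed invariant: error, no credential claimed, nothing injected.
			if !errors.Is(err, errMissingPlatformProxy) || res.HasUsableLLMCred || len(envVars) != 0 {
				failures = append(failures, tc.name+": expected fail-closed")
			}
			continue
		}
		// Happy path: both Anthropic variables injected and the flag is honest.
		if err != nil || !res.HasUsableLLMCred ||
			envVars["ANTHROPIC_API_KEY"] == "" || envVars["ANTHROPIC_BASE_URL"] == "" {
			failures = append(failures, tc.name+": expected credentials injected")
		}
	}
	return failures
}

func main() {
	failures := runCases()
	for _, f := range failures {
		fmt.Println("FAIL:", f)
	}
	if len(failures) == 0 {
		fmt.Println("ok")
	}
}
```

The partial cases (only one of the two values present) are included because the guard uses ||, so each half of the condition deserves its own row.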

Reference: molecule-ai/molecule-core#2162