RCA: lifecycle-real advisory red after #2704 merge is MiniMax/claude-code auth projection, not restart-token loss #2709

Closed
opened 2026-06-13 05:41:55 +00:00 by agent-researcher · 1 comment
Member

MECHANISM: The main-head red is the advisory `Local Provision Lifecycle E2E (real image + MiniMax LLM)` job, not a required gate. The workflow intentionally runs `LIFECYCLE_LLM=minimax` in `.gitea/workflows/local-provision-e2e.yml:473`, and the script writes a MiniMax key plus `MODEL_PROVIDER=minimax` and seeds `provider: minimax` into config (`tests/e2e/test_local_provision_lifecycle_e2e.sh:437`, `:461`). After restart, the container has `MINIMAX_API_KEY=set` but `ANTHROPIC_AUTH_TOKEN` and `ANTHROPIC_BASE_URL` are unset, so the real claude-code SDK attempts an invalid Anthropic-compatible auth path, logs a 401, and the workspace remains `degraded` until the restart poll at `tests/e2e/test_local_provision_lifecycle_e2e.sh:547` times out at `:559`.

EVIDENCE: run `357183`, job `485015`, main `179ec8fb`: seven checks passed, then `FAIL: workspace back online after restart (status=degraded)`. The platform did reinject restart auth (`Provisioner: injected fresh auth token ... into config volume`) and the container can reach `/health`, so this is not the #2530 token-loss or #2696/#2701 UUID-normalization class. Key log excerpt: `Failed to authenticate. API Error: 401 invalid api key`; nearby env dump shows `MINIMAX_API_KEY=set` and `ANTHROPIC_AUTH_TOKEN=unset`.
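
The triage above rests on which log markers appear together in one run. A minimal sketch of that check, assuming the job log has been saved to a file and contains the lines quoted in this issue verbatim. The function name `classify_restart_failure` and the class labels it prints are hypothetical, and a missing injection line only suggests token loss rather than proving it.

```shell
# Hypothetical triage helper: classify a saved job log using the markers quoted in this RCA.
classify_restart_failure() {
  log_file=$1
  injected=no
  auth_401=no
  # Platform-side marker: the restart auth token was re-injected.
  if grep -q 'Provisioner: injected fresh auth token' "$log_file"; then
    injected=yes
  fi
  # Runtime-side marker: the SDK's credentials were rejected upstream.
  if grep -q 'API Error: 401' "$log_file"; then
    auth_401=yes
  fi
  if [ "$injected" = yes ] && [ "$auth_401" = yes ]; then
    echo "provider-auth-projection"
  elif [ "$injected" = no ]; then
    echo "possible-restart-token-loss"
  else
    echo "unclassified"
  fi
}
```

Under these assumptions, a log that contains both the injection line and the 401 line lands in the first class, which matches the pattern reported for job `485015`.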

RECOMMENDED FIX SHAPE: Keep this scoped to the advisory real-image lane. Responsible area is the claude-code template/runtime adapter or the local-provision E2E config path, not the core restart token path. Either ensure the real claude-code MiniMax provider projection converts `MINIMAX_API_KEY` into the Anthropic-compatible variables the SDK actually consumes after restart, or update `tests/e2e/test_local_provision_lifecycle_e2e.sh` to seed the exact template config/env expected by the current adapter. Add a regression assertion that a MiniMax real-image container has the required Anthropic-compatible base/auth env before the restart-online poll.
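
The regression assertion could look roughly like the sketch below. It is not the script's existing code: the function name `assert_anthropic_env` and its messages are hypothetical, and it assumes the check is fed raw `KEY=value` lines (for example from `docker exec <container> env`), not the redacted `set`/`unset` dump shown in the job log.

```shell
# Hypothetical guard: fail fast if a MiniMax container lacks the Anthropic-compatible env.
# Reads raw KEY=value lines on stdin.
assert_anthropic_env() {
  env_dump=$(cat)
  missing=""
  for name in ANTHROPIC_AUTH_TOKEN ANTHROPIC_BASE_URL; do
    # A variable counts as present only if it has a non-empty value.
    if ! printf '%s\n' "$env_dump" | grep -q "^${name}=."; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "FAIL: missing Anthropic-compatible env:$missing" >&2
    return 1
  fi
  echo "PASS: Anthropic-compatible env present"
}
```

Placed just before the restart-online poll, a call such as `docker exec "$CONTAINER" env | assert_anthropic_env || exit 1` (with `$CONTAINER` standing in for whatever the script calls the workspace container) would turn the poll timeout into an immediate failure that names the missing variables.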

Author
Member

MECHANISM: Recurrence confirmed on newer main head `aaca82fe` in run `357266`, job `485184`; this is the same advisory real-image lifecycle failure described above, not a separate restart-token regression. The local-provision script reaches the restart-survival poll in `tests/e2e/test_local_provision_lifecycle_e2e.sh:547-559`, but the real claude-code MiniMax workspace remains `degraded` because the runtime logs an SDK auth failure after restart. Platform-side restart token injection still occurs during this run, so the failure stays scoped to MiniMax/claude-code auth projection.

EVIDENCE: job `485184` has setup/build/platform/health all green and then `7 passed, 1 failed`; the failing assertion is `workspace back online after restart (status=degraded)`. The container dump again shows `MINIMAX_API_KEY=set` while `ANTHROPIC_AUTH_TOKEN=unset`, followed by `Failed to authenticate. API Error: 401 invalid api key`. Platform log shows fresh token writes for the same workspace at restart time: `Provisioner: injected fresh auth token ... into config volume`.

RECOMMENDED FIX SHAPE: Same fix shape as the original RCA: in the claude-code template/runtime adapter or local-provision E2E config path, make the MiniMax BYOK route project the key/base URL into the exact Anthropic-compatible environment consumed by the SDK after restart, then add a regression guard that asserts those env vars are present before the restart-online poll. No separate core restart/UUID fix is indicated by this recurrence.
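
One possible shape for the projection step is sketched below. It is not the adapter's actual code. It assumes the step runs in the container's startup path on every start, including restarts, before the SDK is launched. `project_minimax_env` and `MINIMAX_ANTHROPIC_BASE_URL` are hypothetical names, and no endpoint default is hard-coded because this issue does not state one.

```shell
# Hypothetical projection step: map the MiniMax BYOK key onto the
# Anthropic-compatible variables that this RCA reports as unset.
project_minimax_env() {
  if [ "${MODEL_PROVIDER:-}" != "minimax" ]; then
    return 0  # Other providers are left untouched.
  fi
  if [ -z "${MINIMAX_API_KEY:-}" ]; then
    echo "MINIMAX_API_KEY is empty; nothing to project" >&2
    return 1
  fi
  # MINIMAX_ANTHROPIC_BASE_URL is an assumed operator-supplied setting.
  if [ -z "${MINIMAX_ANTHROPIC_BASE_URL:-}" ]; then
    echo "MINIMAX_ANTHROPIC_BASE_URL is empty; cannot set ANTHROPIC_BASE_URL" >&2
    return 1
  fi
  export ANTHROPIC_AUTH_TOKEN="$MINIMAX_API_KEY"
  export ANTHROPIC_BASE_URL="$MINIMAX_ANTHROPIC_BASE_URL"
}
```

Failing loudly when either input is missing keeps a misconfigured container from reaching the SDK with half-projected credentials, which is the state the 401 in this run points to.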

Reference: molecule-ai/molecule-core#2709