fix(e2e): reconciler platform-path model + surface boot error #2316
Reference in New Issue
Block a user
Delete Branch "fix/e2e-reconciler-platform-model-and-boot-error"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Root cause (confirmed with evidence)
The
e2e-staging-reconciler.ymlworkflow set bothE2E_LLM_PATH: platform(which makes the test sendsecrets={}and use platform-managed billing) andE2E_MODEL_SLUG: MiniMax-M2.In
pick_model_slug(tests/e2e/lib/model_slug.sh),E2E_MODEL_SLUGwins before theE2E_LLM_PATH=platformbranch. So the workspace was created with the bare idMiniMax-M2— a member of theproviders.yamlclaude-codeminimaxBYOK arm (provider=minimax, requiresMINIMAX_API_KEY) — while no key was injected (platform path sends{}).A keyless BYOK-minimax model cannot resolve a serving path → the workspace boots straight to
status=failedand never reachesonline.Mechanism: platform × BYOK-model config contradiction. Test-config bug, not a workspace-server boot bug.
Evidence — run 223233, job 295646 (
E2E Staging Reconciler)The log prints the contradiction itself: it claims "PLATFORM-MANAGED ... moonshot/kimi-k2.6" (hardcoded string) but
MODEL_SLUG=MiniMax-M2.last_sample_errorwas empty because the agent failed before its first heartbeat (hence the opaqueerr=).providers.yaml:MiniMax-M2is in theminimaxarm; theplatformarm holdsmoonshot/kimi-k2.6,minimax/MiniMax-M2.7(slash), etc. — never bareMiniMax-M2.Fix
1. Workflow (the contradiction). Drop
E2E_MODEL_SLUGand the misleadingE2E_*_API_KEYwiring so the platform path is coherent.pick_model_slugnow returns the platform defaultmoonshot/kimi-k2.6(aplatform-arm member →provider=platform, CP-proxy billed, no tenant key). This mirrors thee2e-staging-platform-bootjob ine2e-staging-saas.yml— the proven-clean keyless platform create combo.2. Diagnostic (#2310-class). On the online-timeout, dump model/llm_path/secrets, every plausible error field, and the full
/workspaces/<id>record — so a future boot-failure names its own cause without a re-run (the emptylast_sample_errorwas the reason this one was opaque).Test-only / workflow-only. No production code touched.
Verification
bash -n+shellcheck -xclean on the testtests/e2e/test_model_slug.sh: 21 passed / 0 failedE2E_LLM_PATH=platform pick_model_slug claude-code→moonshot/kimi-k2.6Recommendation
This is purely a contradictory test config; the reconciler and workspace-server are not at fault. Once merged, the next scheduled/path-triggered reconciler run should boot the workspace
onlineon the platform path and exercise the real heal assertion. Thecontinue-on-error: truemask stays until it has a green track record (per the in-file de-flake note + mc#1982). CTO to review.🤖 Generated with Claude Code
The e2e-staging-reconciler workflow set E2E_LLM_PATH=platform (sends secrets={}, platform-managed billing) AND E2E_MODEL_SLUG=MiniMax-M2. In pick_model_slug (tests/e2e/lib/model_slug.sh) E2E_MODEL_SLUG wins over the E2E_LLM_PATH=platform branch, so the workspace was created with the BARE id `MiniMax-M2` — a member of the providers.yaml claude-code `minimax` BYOK arm (provider=minimax, requires MINIMAX_API_KEY) — while NO key was injected. A keyless BYOK-minimax model cannot resolve a serving path, so the workspace booted straight to status=failed and never reached online ("never reached status=online within 900s, last status=failed"). This is a test-config contradiction, not a workspace-server boot bug: the log even prints the mismatch — "LLM path: PLATFORM-MANAGED ... moonshot/kimi-k2.6" immediately followed by "MODEL_SLUG=MiniMax-M2" then "→ failed" (run 223233, job 295646). Fix (workflow-only): drop E2E_MODEL_SLUG and the misleading E2E_*_API_KEY wiring so the platform path is coherent — pick_model_slug now returns the platform default moonshot/kimi-k2.6 (a providers.yaml claude-code `platform` arm member → provider=platform, CP-proxy billed, no tenant key). Mirrors the e2e-staging-platform-boot job in e2e-staging-saas.yml, which is the proven-clean keyless platform create combo. Also (#2310-class): on the online-timeout, last_sample_error came back EMPTY (the agent failed before its first heartbeat), so "err=" was opaque. Add a diagnostic burst that dumps the model/llm_path/secrets, every plausible error field, and the full /workspaces/<id> record — so a future boot-failure names its own cause without a re-run. Test-only/workflow-only. bash -n + shellcheck clean; test_model_slug.sh 21/0; YAML valid. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>APPROVED (CTO review). Verified: tests/e2e + 1 workflow ONLY (+36/-9), zero prod code. Confirms+fixes the reconciler platform×BYOK-model contradiction (E2E_MODEL_SLUG=MiniMax-M2 BYOK on platform-no-key path → boot=failed); drops the override so pick_model_slug returns platform default moonshot/kimi-k2.6; adds boot-error diagnostic capture. No gating change. Approving.