Staging SaaS E2E red: workspace-server registry rejects bare MiniMax-M2 on claude-code (deploy-skew) + un-retried A2A edge 502 #2263

Open
opened 2026-06-04 23:35:44 +00:00 by core-devops · 0 comments

core main 9b19759c red on E2E Staging SaaS (full lifecycle) — two distinct, named, non-regression mechanisms (NOT flaky). Investigated via the on-disk runner logs; neither is introduced by 9b19759c (both reproduce on e8dd34fc/9fbb5468/aa9ea5f9/ddc4e818 before the merge).

Job 1 — E2E Staging SaaS (task 268858): workspace-create HTTP 400

Failure line: 5/11 Provisioning parent workspace (runtime=claude-code) MODEL_SLUG=MiniMax-M2 → curl (22) 400. The tenant booted fine (~2 min, TLS reachable); the failure is POST /workspaces {"runtime":"claude-code","model":"MiniMax-M2"} → 400.

Mechanism: deploy-skew on the workspace-server model-registry enforcer (internal#718). The deployed staging tenant runtime image rejects the bare id MiniMax-M2 under claude-code because that image’s compiled registry_gen.go predates the bare-id entry; validateRegisteredModelForRuntime returns false → 400.

Differential proof: the sibling Platform Boot job, on the SAME tenant image, provisioned claude-code successfully with the namespaced moonshot/kimi-k2.6. Only the model id differs. Both are registered in source HEAD (providers.yaml 743/760). Namespaced resolves on the deployed build; bare does not ⇒ deployed registry lags source ⇒ skew, not a source bug. Same class as the runtime-image stale-pin/deploy-lag issues.
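The skew argument can be modelled in a few lines. This is a minimal sketch, not the ws-server code: the registry contents, the Python function names, and the 200 success status are assumptions for illustration. Only the 400 on rejection, the two model ids, and the `claude-code` runtime come from the logs above.

```python
from typing import Dict, FrozenSet

Registry = Dict[str, FrozenSet[str]]

# Assumed shape of the registry compiled into the deployed (older) image:
# the namespaced id is present, the bare id is not yet wired in.
DEPLOYED_REGISTRY: Registry = {
    "claude-code": frozenset({"moonshot/kimi-k2.6"}),
}

# Assumed shape of the registry generated from source HEAD:
# both ids are registered for claude-code.
HEAD_REGISTRY: Registry = {
    "claude-code": frozenset({"moonshot/kimi-k2.6", "MiniMax-M2"}),
}


def validate_registered_model_for_runtime(
    registry: Registry, runtime: str, model: str
) -> bool:
    """Return True only if the exact model id is registered for the runtime."""
    return model in registry.get(runtime, frozenset())


def create_workspace_status(registry: Registry, runtime: str, model: str) -> int:
    """Hypothetical handler: 400 on an unregistered model, 200 otherwise."""
    if not validate_registered_model_for_runtime(registry, runtime, model):
        return 400
    return 200  # success code is an assumption; the issue only confirms the 400
```

Under these assumptions, the same request succeeds against `HEAD_REGISTRY` and fails against `DEPLOYED_REGISTRY`, while the namespaced id passes on both, which is the pattern the two jobs show.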

Blast radius: staging harness + staging tenant image only; NON-gating (does not block PR merges). Prod tenants using namespaced ids (anthropic/, moonshot/) are unaffected. Narrow latent risk only where the bare MiniMax-M2 id is used on a ws-server build older than the bare-id wiring.

Fix options: (a) promote/redeploy the staging tenant ws-server runtime image to a build that includes the bare MiniMax-M2 claude-code registry entry; or (b, faster) change the SaaS canary pick_model_slug default for claude-code from bare MiniMax-M2 to namespaced minimax/MiniMax-M2.7 (matching how kimi works).
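Option (b) amounts to a one-line default change. A rough sketch of what that could look like follows; the real `pick_model_slug` lives in the canary harness and may be shaped quite differently. The defaults table and the `MODEL_SLUG` override behaviour are assumptions; only the function name and the two slugs come from this issue.

```python
import os
from typing import Mapping, Optional

# Hypothetical per-runtime defaults. Only the claude-code entry is taken
# from the issue text; it was previously the bare id "MiniMax-M2".
DEFAULT_MODEL_SLUGS = {
    "claude-code": "minimax/MiniMax-M2.7",
}


def pick_model_slug(runtime: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model slug for a runtime, honouring an explicit override.

    Assumes a MODEL_SLUG environment variable can override the default.
    """
    env = os.environ if env is None else env
    override = env.get("MODEL_SLUG")
    if override:
        return override
    return DEFAULT_MODEL_SLUGS[runtime]
```

With this shape, the canary sends a namespaced id by default and no longer depends on the deployed image carrying the bare-id entry.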

Job 2 — E2E Staging Platform Boot (task 268859): HTTP 502 on 2nd A2A POST

Workspace-create succeeded (used moonshot/kimi-k2.6), reached online+routable, passed image round-trip + config PUT + first A2A PONG, then the known-answer A2A POST failed curl_rc=22 http=502 on the FIRST attempt immediately after a healthy PONG to the same endpoint.

Mechanism: a single un-retried edge/gateway 502 (Cloudflare-shaped) right after a proven-healthy round-trip, not a boot/registry fault. Fix: the harness already retries the known-answer POST elsewhere; apply the same retry here, and have ops check the staging A2A edge for 502s. Not flaky; named: un-retried single edge 502.
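The retry for the known-answer POST could look roughly like the sketch below. It is not the harness's existing retry helper: the function name, attempt count, backoff schedule, and the choice to treat 503/504 as transient alongside 502 are all assumptions. The request itself is passed in as a callable so the sketch stays independent of how the harness issues the POST.

```python
import time
from typing import Callable

# Gateway-style statuses treated as transient. Only 502 was observed in
# the logs; 503 and 504 are included here as an assumption.
TRANSIENT_STATUSES = frozenset({502, 503, 504})


def post_with_retry(
    send: Callable[[], int],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-transient status or attempts run out.

    send() performs one POST and returns the HTTP status code.
    Delays grow exponentially: base_delay, 2 * base_delay, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status = 0
    for attempt in range(attempts):
        status = send()
        if status not in TRANSIENT_STATUSES:
            return status
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status  # still transient after the final attempt
```

A persistent 502 still surfaces as a failure after the last attempt, so a real edge outage is not masked; only a one-off 502 like the one in task 268859 is absorbed.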

Found no existing tracker (searched MiniMax-M2 / cp529 / workspace-create+400 / validateRegisteredModelForRuntime). Related but DIFFERENT: #425 (Gitea secret-store migration). Filed per § No flakes (mechanisms named, not dispositions).

Reference: molecule-ai/molecule-core#2263