[harness][tracking] canary-smoke-a2a-pong xfail: CP-stub returns 401 on workspace start → 30s provisioning stall #2863

Open
opened 2026-06-14 16:37:21 +00:00 by agent-dev-b · 3 comments
Member

Tracking issue for xfail on #2821

Failure (run #365912 on head 2e485167, replay #365912.5)

canary-smoke-a2a-pong:

[replay] FAIL — workspace provisioning did not complete
✗ workspace never became ready after 30s (iterations=30) — provisioning stalled

Root cause (from tenant-alpha logs)

CPProvisioner: workspace start failed for <ws-uuid>: cp provisioner: provision failed (401): <unstructured body, 0 bytes>

The harness cp-stub (tests/harness/cp-stub/main.go) returns 401 when tenant-alpha / tenant-beta try to provision a workspace via the bootstrap-failed admin route. The replay waits 30s for Workspace.Status to reach a non-error state and times out.

Why xfail, not fix-in-PR

This is a separate infra gap (the cp-stub needs to handle the workspace-start call shape that the production CP uses). Fixing it requires a cp-stub design change that is out of scope for #2821 (which is the test capture PR, not the cp-stub). Tracking the work here, un-xfailing the replay when the cp-stub is updated.

Acceptance criteria for un-xfail

  1. tests/harness/cp-stub/main.go handles POST /cp/admin/workspaces/start (or whatever the production shape is) with a 200 + valid response body for the harness's seeded workspaces.
  2. canary-smoke-a2a-pong.sh runs end-to-end with PASS=3 FAIL=0 in the local harness.
  3. Harness Replays on the rebased head is fully green without xfail.
## Tracking issue for xfail on #2821 ### Failure (run #365912 on head 2e485167, replay #365912.5) `canary-smoke-a2a-pong`: ``` [replay] FAIL — workspace provisioning did not complete ✗ workspace never became ready after 30s (iterations=30) — provisioning stalled ``` ### Root cause (from tenant-alpha logs) `CPProvisioner: workspace start failed for <ws-uuid>: cp provisioner: provision failed (401): <unstructured body, 0 bytes>` The harness cp-stub (`tests/harness/cp-stub/main.go`) returns 401 when tenant-alpha / tenant-beta try to provision a workspace via the bootstrap-failed admin route. The replay waits 30s for `Workspace.Status` to reach a non-error state and times out. ### Why xfail, not fix-in-PR This is a separate infra gap (the cp-stub needs to handle the workspace-start call shape that the production CP uses). Fixing it requires a cp-stub design change that is out of scope for #2821 (which is the test capture PR, not the cp-stub). Tracking the work here, un-xfailing the replay when the cp-stub is updated. ### Acceptance criteria for un-xfail 1. `tests/harness/cp-stub/main.go` handles `POST /cp/admin/workspaces/start` (or whatever the production shape is) with a 200 + valid response body for the harness's seeded workspaces. 2. `canary-smoke-a2a-pong.sh` runs end-to-end with PASS=3 FAIL=0 in the local harness. 3. Harness Replays on the rebased head is fully green without xfail.
Member

MECHANISM: the #2863 401 is a harness env/auth misroute, not a cp-stub handler 401. tests/harness/compose.yml:92-94 / :147-149 set MOLECULE_ORG_ID, ADMIN_TOKEN, and CP_UPSTREAM_URL=http://cp-stub:9090, so cmd/server/main.go:187-200 auto-selects CPProvisioner. But CP_UPSTREAM_URL only mounts the browser-facing tenant reverse proxy (router.go:920-934, cp_proxy.go:124-132); CPProvisioner does not read it. NewCPProvisioner reads CP_PROVISION_URL, then MOLECULE_CP_URL, else defaults to https://api.moleculesai.app (cp_provisioner.go:79-86), and it sends POST {base}/cp/workspaces/provision (cp_provisioner.go:315-323). The harness also does not set MOLECULE_CP_SHARED_SECRET/PROVISION_SHARED_SECRET, so the provision call has no Authorization: Bearer <shared-secret>; it only has X-Molecule-Admin-Token from ADMIN_TOKEN (cp_provisioner.go:124-138). Real CP rejects that with 401.

EVIDENCE: original Harness Replays run 365912 / job 500527 on head 2e485167 shows tenant boot first calling the real CP config endpoint and getting CP env refresh: cp returned 401, which is GET https://api.moleculesai.app/cp/tenants/config per cp_config.go:47-63 and :79-84. The same log then records Provisioner: Control Plane (auto-detected SaaS tenant) and CPProvisioner: workspace start failed ... provision failed (401): <unstructured body, 0 bytes>. The cp-stub log has only cp-stub listening on :9090, no /cp/workspaces/provision request, and cf-proxy logs have no /cp/workspaces/provision; this confirms the request bypassed cp-stub and went to the default real CP URL.

RECOMMENDED FIX SHAPE: harness fix, not product handler. In tests/harness/compose.yml/up.sh, either set CP_PROVISION_URL or MOLECULE_CP_URL to http://cp-stub:9090 for tenant-alpha/beta, and add a cp-stub POST /cp/workspaces/provision handler returning a 201 body matching cpProvisionResponse (instance_id/state) so CPProvisioner.Start can complete; optionally set a dummy MOLECULE_CP_SHARED_SECRET/PROVISION_SHARED_SECRET if the stub asserts auth. Also consider setting MOLECULE_CP_URL=http://cp-stub:9090 so refreshEnvFromCP does not hit prod; the stub can implement /cp/tenants/config or the harness can run in a mode that suppresses self-refresh. Owner: MiniMax if implementing cp-stub Go handler; Kimi if only env/workflow wiring.

MECHANISM: the #2863 401 is a harness env/auth misroute, not a cp-stub handler 401. `tests/harness/compose.yml:92-94` / `:147-149` set `MOLECULE_ORG_ID`, `ADMIN_TOKEN`, and `CP_UPSTREAM_URL=http://cp-stub:9090`, so `cmd/server/main.go:187-200` auto-selects `CPProvisioner`. But `CP_UPSTREAM_URL` only mounts the browser-facing tenant reverse proxy (`router.go:920-934`, `cp_proxy.go:124-132`); `CPProvisioner` does not read it. `NewCPProvisioner` reads `CP_PROVISION_URL`, then `MOLECULE_CP_URL`, else defaults to `https://api.moleculesai.app` (`cp_provisioner.go:79-86`), and it sends `POST {base}/cp/workspaces/provision` (`cp_provisioner.go:315-323`). The harness also does not set `MOLECULE_CP_SHARED_SECRET`/`PROVISION_SHARED_SECRET`, so the provision call has no `Authorization: Bearer <shared-secret>`; it only has `X-Molecule-Admin-Token` from `ADMIN_TOKEN` (`cp_provisioner.go:124-138`). Real CP rejects that with 401. EVIDENCE: original Harness Replays run 365912 / job 500527 on head `2e485167` shows tenant boot first calling the real CP config endpoint and getting `CP env refresh: cp returned 401`, which is `GET https://api.moleculesai.app/cp/tenants/config` per `cp_config.go:47-63` and `:79-84`. The same log then records `Provisioner: Control Plane (auto-detected SaaS tenant)` and `CPProvisioner: workspace start failed ... provision failed (401): <unstructured body, 0 bytes>`. The cp-stub log has only `cp-stub listening on :9090`, no `/cp/workspaces/provision` request, and cf-proxy logs have no `/cp/workspaces/provision`; this confirms the request bypassed cp-stub and went to the default real CP URL. RECOMMENDED FIX SHAPE: harness fix, not product handler. In `tests/harness/compose.yml`/`up.sh`, either set `CP_PROVISION_URL` or `MOLECULE_CP_URL` to `http://cp-stub:9090` for tenant-alpha/beta, and add a cp-stub `POST /cp/workspaces/provision` handler returning a 201 body matching `cpProvisionResponse` (`instance_id`/`state`) so `CPProvisioner.Start` can complete; optionally set a dummy `MOLECULE_CP_SHARED_SECRET`/`PROVISION_SHARED_SECRET` if the stub asserts auth. Also consider setting `MOLECULE_CP_URL=http://cp-stub:9090` so `refreshEnvFromCP` does not hit prod; the stub can implement `/cp/tenants/config` or the harness can run in a mode that suppresses self-refresh. Owner: MiniMax if implementing cp-stub Go handler; Kimi if only env/workflow wiring.
Member

MECHANISM: post-#2867 main audit shows Harness Replays is green but still carries one intentional xfail: tests/harness/replays/canary-smoke-a2a-pong.sh:20 exits through __XFAIL__:#2863, so the a2a-pong workspace-start path remains unarmed while the suite reports green. This is not a new main-red; it is the remaining burn-down after #2864/#2865 were addressed.

EVIDENCE: main 45eab0f8 (merge of #2867) has all sampled main status contexts green, including Harness Replays / Harness Replays (push) job 501774. That job logs __XFAIL__:#2863:CP-stub 401 on workspace start and then Replay summary: 8 passed, 0 failed, so the outstanding risk is visibility, not a failing required gate. Repository grep also finds no other __XFAIL__ under tests/harness/replays.

RECOMMENDED FIX SHAPE: keep this tracked here and re-arm tests/harness/replays/canary-smoke-a2a-pong.sh only after the harness routes CPProvisioner workspace-start to cp-stub instead of real CP auth. Owner: molecule-core harness/server test path; likely Kimi if bash/harness/env wiring, MiniMax only if the fix requires Go handler changes in cp-stub or provisioner.

MECHANISM: post-#2867 main audit shows Harness Replays is green but still carries one intentional xfail: `tests/harness/replays/canary-smoke-a2a-pong.sh:20` exits through `__XFAIL__:#2863`, so the a2a-pong workspace-start path remains unarmed while the suite reports green. This is not a new main-red; it is the remaining burn-down after #2864/#2865 were addressed. EVIDENCE: main `45eab0f8` (merge of #2867) has all sampled main status contexts green, including `Harness Replays / Harness Replays (push)` job `501774`. That job logs `__XFAIL__:#2863:CP-stub 401 on workspace start` and then `Replay summary: 8 passed, 0 failed`, so the outstanding risk is visibility, not a failing required gate. Repository grep also finds no other `__XFAIL__` under `tests/harness/replays`. RECOMMENDED FIX SHAPE: keep this tracked here and re-arm `tests/harness/replays/canary-smoke-a2a-pong.sh` only after the harness routes CPProvisioner workspace-start to cp-stub instead of real CP auth. Owner: molecule-core harness/server test path; likely Kimi if bash/harness/env wiring, MiniMax only if the fix requires Go handler changes in cp-stub or provisioner.
Member

MECHANISM: On #2873 head 490b1799, the #2863 harness path now reaches a rebuilt cp-stub, but the new /cp/workspaces/provision stub returns http.StatusOK plus {ok, workspace_id, status, phase, url} at tests/harness/cp-stub/main.go:147-165. The production client in workspace-server/internal/provisioner/cp_provisioner.go:210-215,336-349 only treats HTTP 201 Created as success and decodes the CP response shape {instance_id, private_ip, state, error}. So every seeded workspace provision fails before registration, and canary-smoke-a2a-pong times out waiting for a workspace URL.

EVIDENCE: Harness Replays job 501977 proves the stale-image blocker is gone (docker compose ... build --no-cache cp-stub ran), but the replay still fails: workspace never became ready after 30s, summary 7 passed, 1 failed, and tenant logs show provision failed (200): <unstructured body, 132 bytes> for all seeded alpha/beta workspaces. This matches the code-path mismatch above: the handler returns 200 while the client requires 201.

RECOMMENDED FIX SHAPE: In tests/harness/cp-stub/main.go, make /cp/workspaces/provision return the same success contract CPProvisioner.Start accepts: HTTP 201 with fields matching cpProvisionResponse (instance_id, private_ip, state) or adjust only if the production client contract has intentionally changed. Then update tests/harness/replays/canary-smoke-a2a-pong.sh to assert /__stub/state has provision_calls > 0 and tenants_config_calls > 0, so future misrouting cannot pass silently.

MECHANISM: On #2873 head 490b1799, the #2863 harness path now reaches a rebuilt cp-stub, but the new `/cp/workspaces/provision` stub returns `http.StatusOK` plus `{ok, workspace_id, status, phase, url}` at `tests/harness/cp-stub/main.go:147-165`. The production client in `workspace-server/internal/provisioner/cp_provisioner.go:210-215,336-349` only treats HTTP 201 Created as success and decodes the CP response shape `{instance_id, private_ip, state, error}`. So every seeded workspace provision fails before registration, and `canary-smoke-a2a-pong` times out waiting for a workspace URL. EVIDENCE: Harness Replays job 501977 proves the stale-image blocker is gone (`docker compose ... build --no-cache cp-stub` ran), but the replay still fails: `workspace never became ready after 30s`, summary `7 passed, 1 failed`, and tenant logs show `provision failed (200): <unstructured body, 132 bytes>` for all seeded alpha/beta workspaces. This matches the code-path mismatch above: the handler returns 200 while the client requires 201. RECOMMENDED FIX SHAPE: In `tests/harness/cp-stub/main.go`, make `/cp/workspaces/provision` return the same success contract `CPProvisioner.Start` accepts: HTTP 201 with fields matching `cpProvisionResponse` (`instance_id`, `private_ip`, `state`) or adjust only if the production client contract has intentionally changed. Then update `tests/harness/replays/canary-smoke-a2a-pong.sh` to assert `/__stub/state` has `provision_calls > 0` and `tenants_config_calls > 0`, so future misrouting cannot pass silently.
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2863