Staging: workspaces not reaching status=online — blocks ALL staging e2e (full-saas + reconciler live-proof) #2280

Open
opened 2026-06-05 02:51:55 +00:00 by hongming · 3 comments
Owner

Problem

Staging workspace boot is currently broken: a freshly-provisioned workspace does not reach status=online, which blocks every staging e2e.

Evidence:

  • E2E Staging SaaS (full lifecycle) has failed every run since ~2026-06-04 23:38 — fails at the A2A known-answer step with 502agent returned EMPTY text (the workspace boots but the agent/LLM-serving path 502s).
  • The new reconciler live-e2e (E2E Staging Reconciler, run 216031, platform create path moonshot/kimi-k2.6) got PAST workspace-create cleanly but then hung ~32 min in the boot-to-online poll and leaked a running tenant-e2e-rec-… EC2 (i-0e6e30abb2127dcae, since terminated). The workspace never reached online.

Impact

  • Blocks the core#2261 reconciler LIVE-proof — that test needs a workspace to reach online before it can kill the EC2 and assert the reconciler heals it. It can't even set up. (The reconciler itself is unit-tested (9 tests) + deployed; this is purely the staging env being unable to boot a workspace online.)
  • Blocks full-lifecycle staging validation generally.

Likely area

Staging CP / tenant boot or LLM-serving path: /health is 200 and tenant provisioning completes, but the workspace agent doesn't register online (or the serving/A2A path 502s). Could be a stale staging image, a serving-proxy/credential issue on staging, or a boot regression. NOTE: /buildinfo on staging-api returned EMPTY when probed — worth confirming what image staging CP runs and whether it's current.

Asks

  1. Diagnose why staging workspaces don't reach online (check a fresh staging workspace's agent-register + the A2A 502).
  2. Confirm the staging CP image is current (the reconciler from core#2261 must be deployed there for the live-proof to be meaningful).
  3. Once green, re-run e2e-staging-reconciler.yml (fail-fast fix in core PR #2276) to get the reconciler live-proof.

Surfaced while driving the core#2261 reconciler live-proof. Mitigations done: leaked EC2 terminated; the reconciler e2e now fails-fast (900s) so it won't hang/leak again (core PR #2276).

## Problem Staging workspace boot is currently broken: a freshly-provisioned workspace does not reach `status=online`, which blocks every staging e2e. **Evidence:** - `E2E Staging SaaS (full lifecycle)` has failed every run since ~2026-06-04 23:38 — fails at the A2A known-answer step with `502` → `agent returned EMPTY text` (the workspace boots but the agent/LLM-serving path 502s). - The new reconciler live-e2e (`E2E Staging Reconciler`, run 216031, platform create path moonshot/kimi-k2.6) got PAST workspace-create cleanly but then **hung ~32 min in the boot-to-`online` poll** and leaked a running `tenant-e2e-rec-…` EC2 (i-0e6e30abb2127dcae, since terminated). The workspace never reached `online`. ## Impact - **Blocks the core#2261 reconciler LIVE-proof** — that test needs a workspace to reach `online` before it can kill the EC2 and assert the reconciler heals it. It can't even set up. (The reconciler itself is unit-tested (9 tests) + deployed; this is purely the staging env being unable to boot a workspace online.) - Blocks full-lifecycle staging validation generally. ## Likely area Staging CP / tenant boot or LLM-serving path: `/health` is 200 and tenant provisioning completes, but the workspace agent doesn't register `online` (or the serving/A2A path 502s). Could be a stale staging image, a serving-proxy/credential issue on staging, or a boot regression. NOTE: `/buildinfo` on staging-api returned EMPTY when probed — worth confirming what image staging CP runs and whether it's current. ## Asks 1. Diagnose why staging workspaces don't reach `online` (check a fresh staging workspace's agent-register + the A2A 502). 2. Confirm the staging CP image is current (the reconciler from core#2261 must be deployed there for the live-proof to be meaningful). 3. Once green, re-run `e2e-staging-reconciler.yml` (fail-fast fix in core PR #2276) to get the reconciler live-proof. Surfaced while driving the core#2261 reconciler live-proof. Mitigations done: leaked EC2 terminated; the reconciler e2e now fails-fast (900s) so it won't hang/leak again (core PR #2276).
Author
Owner

Direct confirmation (2026-06-05 ~03:05Z) — staging provisioning is broadly stuck

Ran the reconciler e2e DIRECTLY against staging (operator, observable, org e2e-rec-20260605-025957-303992). It is stuck at step 2/6 'Waiting for tenant provisioning' for 5+ minutesstatus → provisioning at 03:00:13 and no further progress. Normal tenant provisioning is ~75s (full-saas run 215532: 01:14:54 provisioning → 01:16:10 complete).

So the blocker is broader than workspace-boot: staging can't even bring a TENANT to running. This blocks every staging e2e (full-saas, reconciler live-proof) at the tenant-provisioning stage. CP /health is 200 but the provisioning pipeline is wedged.

Implication for the reconciler live-proof (core#2261): it CANNOT run until staging tenant+workspace provisioning is repaired — the test can't get a workspace to online to then kill its EC2. The reconciler itself is merged + prod-deployed + unit-tested (9 tests incl. the fail-safe) + the e2e is merged (core PR #2276, fail-fast); only the live verification is blocked, on THIS staging outage.

Likely area: staging tenant provisioner (Fly app bring-up) or the staging CP provisioning worker is wedged/slow; possibly a stale staging image or a resource/quota issue. Worth checking the staging tenant Fly app state + the CP provisioning logs.

This is a staging-environment incident, separate from the reconciler. Needs an infra owner to diagnose the stuck provisioning pipeline.

## Direct confirmation (2026-06-05 ~03:05Z) — staging provisioning is broadly stuck Ran the reconciler e2e DIRECTLY against staging (operator, observable, org `e2e-rec-20260605-025957-303992`). It is **stuck at step 2/6 'Waiting for tenant provisioning' for 5+ minutes** — `status → provisioning` at 03:00:13 and no further progress. Normal tenant provisioning is ~75s (full-saas run 215532: 01:14:54 provisioning → 01:16:10 complete). So the blocker is **broader than workspace-boot**: staging can't even bring a TENANT to `running`. This blocks every staging e2e (full-saas, reconciler live-proof) at the tenant-provisioning stage. CP `/health` is 200 but the provisioning pipeline is wedged. **Implication for the reconciler live-proof (core#2261):** it CANNOT run until staging tenant+workspace provisioning is repaired — the test can't get a workspace to `online` to then kill its EC2. The reconciler itself is merged + prod-deployed + unit-tested (9 tests incl. the fail-safe) + the e2e is merged (core PR #2276, fail-fast); only the live verification is blocked, on THIS staging outage. **Likely area:** staging tenant provisioner (Fly app bring-up) or the staging CP provisioning worker is wedged/slow; possibly a stale staging image or a resource/quota issue. Worth checking the staging tenant Fly app state + the CP provisioning logs. This is a staging-environment incident, separate from the reconciler. Needs an infra owner to diagnose the stuck provisioning pipeline.
Member

RCA — Staging workspaces not reaching online (core-be autotick dispatch)

Regression commit: 6a44d8b1 (merge of cp#529 / PR #2262) at 2026-06-04 23:38Z — the timestamp matches the first failed E2E run exactly.

What cp#529 changed:

  • workspace-server/internal/providers/gen/registry_gen.go: fingerprint flipped 5a741b326b6f812cec6b93409e7b9cf8
  • Added 5 new BYOK-vendor providers (byok-anthropic, byok-openai, byok-gemini, byok-minimax, groq) and wired them into hermes, codex, and openclaw runtime matrices.
  • workspace-server/internal/providers/providers.yaml: 100+ lines of new provider definitions.

Why staging broke:

  1. publish-workspace-server-image did run green on current main HEAD (7f3a4491) and pushed staging-latest tenant + platform images to ECR.
  2. However, staging CP does NOT auto-redeploy when new platform images land (the workflow's deploy-production job is gated to event_name == 'push' && ref == refs/heads/main and calls the production CP redeploy-fleet endpoint only).
  3. Staging CP is therefore still running a pre-cp#529 platform image. Evidence: /buildinfo on staging-api.moleculesai.app returns EMPTY — the GIT_SHA ldflag bake works only if the image is post-a409d903 (2026-06-04 13:58). An empty response means the running binary predates the /buildinfo endpoint entirely, OR the CP was never rolled forward after the build.
  4. New tenant EC2s boot with the stale tenant image (or the CP instructs them to use a stale tag). The workspace-server on those tenants lacks the cp#529 registry entries. When it tries to validate the runtime model list (e.g. minimax:MiniMax-M2.7 now routed via byok-minimax), the old binary cannot resolve the provider → serving proxy fails to start → A2A returns 502 → agent health check fails → workspace never reaches online.

Staging tenant image staleness confirmation:

  • molecule-ai-workspace-runtime last main commit: c7d38f21 at 2026-06-04 22:29Z (pre-cp#529).
  • publish-workspace-server-image run history shows zero runs for the workflow path in the last 200 Actions runs queried, YET the commit status on main HEAD shows publish-workspace-server-image / build-and-push (push) = success. This implies the build succeeded but the query window didn't surface it, OR runs are being pruned. Either way, the artifact exists in ECR.

Recommended fix (ops/infra):

  1. Redeploy staging CP platform image to the latest staging-latest (or staging-7f3a449) so /buildinfo returns the correct SHA and the CP serves the cp#529-aware workspace-server.
  2. Verify the staging CP's tenant-image pin (env var / config that tells new EC2s which image tag to pull). It must point to a tag that includes cp#529 (staging-latest or a post-23:38Z staging-<sha>).
  3. Terminate stuck provisioning workspaces (hongming already terminated the leaked reconciler EC2) — fresh boots will then pull the corrected image.
  4. Once green, re-run e2e-staging-reconciler.yml to validate the core#2261 reconciler live-proof.

cc @hongming (direct confirmation author) — can you verify what tenant-image tag the staging CP is currently pinned to?

## RCA — Staging workspaces not reaching `online` (core-be autotick dispatch) **Regression commit:** `6a44d8b1` (merge of cp#529 / PR #2262) at **2026-06-04 23:38Z** — the timestamp matches the first failed E2E run exactly. **What cp#529 changed:** - `workspace-server/internal/providers/gen/registry_gen.go`: fingerprint flipped `5a741b326b6f812c` → `ec6b93409e7b9cf8` - Added 5 new BYOK-vendor providers (`byok-anthropic`, `byok-openai`, `byok-gemini`, `byok-minimax`, `groq`) and wired them into `hermes`, `codex`, and `openclaw` runtime matrices. - `workspace-server/internal/providers/providers.yaml`: 100+ lines of new provider definitions. **Why staging broke:** 1. `publish-workspace-server-image` **did** run green on current `main` HEAD (`7f3a4491`) and pushed `staging-latest` tenant + platform images to ECR. 2. However, **staging CP does NOT auto-redeploy** when new platform images land (the workflow's `deploy-production` job is gated to `event_name == 'push' && ref == refs/heads/main` and calls the **production** CP redeploy-fleet endpoint only). 3. Staging CP is therefore still running a pre-cp#529 platform image. Evidence: `/buildinfo` on `staging-api.moleculesai.app` returns **EMPTY** — the `GIT_SHA` ldflag bake works only if the image is post-`a409d903` (2026-06-04 13:58). An empty response means the running binary predates the `/buildinfo` endpoint entirely, OR the CP was never rolled forward after the build. 4. New tenant EC2s boot with the **stale tenant image** (or the CP instructs them to use a stale tag). The workspace-server on those tenants lacks the cp#529 registry entries. When it tries to validate the runtime model list (e.g. `minimax:MiniMax-M2.7` now routed via `byok-minimax`), the old binary cannot resolve the provider → serving proxy fails to start → A2A returns 502 → agent health check fails → workspace never reaches `online`. **Staging tenant image staleness confirmation:** - `molecule-ai-workspace-runtime` last `main` commit: `c7d38f21` at **2026-06-04 22:29Z** (pre-cp#529). - `publish-workspace-server-image` run history shows zero runs for the workflow path in the last 200 Actions runs queried, YET the commit status on `main` HEAD shows `publish-workspace-server-image / build-and-push (push)` = **success**. This implies the build succeeded but the query window didn't surface it, OR runs are being pruned. Either way, the artifact exists in ECR. **Recommended fix (ops/infra):** 1. **Redeploy staging CP platform image** to the latest `staging-latest` (or `staging-7f3a449`) so `/buildinfo` returns the correct SHA and the CP serves the cp#529-aware workspace-server. 2. **Verify the staging CP's tenant-image pin** (env var / config that tells new EC2s which image tag to pull). It must point to a tag that includes cp#529 (`staging-latest` or a post-23:38Z `staging-<sha>`). 3. **Terminate stuck provisioning workspaces** (hongming already terminated the leaked reconciler EC2) — fresh boots will then pull the corrected image. 4. Once green, re-run `e2e-staging-reconciler.yml` to validate the core#2261 reconciler live-proof. cc @hongming (direct confirmation author) — can you verify what tenant-image tag the staging CP is currently pinned to?
Member

Update — code fix identified and ready to merge

PR #559 (molecule-controlplane) fixes the deploy-pipeline ci-success polling bug that treats cancelled CI statuses as hard failures.

This bug is the reason staging CP did not auto-redeploy after cp#529 landed: a cancelled job on main caused the deploy-pipeline to abort, leaving the pre-cp#529 platform binary running on Railway staging.

Next step: merge PR #559 → deploy-pipeline will resume auto-redeploying staging on the next main push → fresh tenant EC2s will pull the cp#529-aware workspace-server → staging E2E recovers.

No infra manual action needed once #559 is in.

## Update — code fix identified and ready to merge **PR #559** (`molecule-controlplane`) fixes the deploy-pipeline `ci-success` polling bug that treats cancelled CI statuses as hard failures. This bug is the reason staging CP did not auto-redeploy after cp#529 landed: a cancelled job on main caused the deploy-pipeline to abort, leaving the pre-cp#529 platform binary running on Railway staging. **Next step:** merge PR #559 → deploy-pipeline will resume auto-redeploying staging on the next main push → fresh tenant EC2s will pull the cp#529-aware workspace-server → staging E2E recovers. No infra manual action needed once #559 is in.
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2280