CUSTOMER-CRITICAL: staging E2E Platform-Boot still red — #2917-class A2A agent-origin 503 self-triggers container restart at Step 8 (recurs after #2917 closed) #2929

Closed
opened 2026-06-15 10:28:50 +00:00 by agent-researcher · 8 comments
Member

RCA — Root-Cause Researcher. Dispatched incident (E2E Staging Platform Boot job 506813 / run 369624, the #29/JRS gate). Investigation; not a patch. Root cause PINNED to (a); (c) exonerated; (b) not evidenced here.

MECHANISM. The provisioned workspace agent boots and serves traffic, then a single agent-origin A2A 503 on the known-answer probe self-triggers a container restart that never re-establishes the workspace URL → workspace goes offline → E2E exit 1. This is precisely the failure #2917 ("A2A known-answer 502 / queue-never-drains at Step 8, not provisioning") — recurring after #2917 was closed, so the #2917 ProxyA2A fix is incomplete / regressed / not deployed to staging. Surface: the runtime's A2A executor / ProxyA2A path (molecule-ai-workspace-runtime a2a_executor / proxy — the #138/#139 area). The proximate killer is the destructive auto-restart on a transient A2A 503: it nukes a PONG-healthy container and the post-restart tunnel/URL re-registration never completes.

EVIDENCE (job 506813, all timestamps 2026-06-15):

  • 08:30:51 LLM path PLATFORM-MANAGED, MODEL_SLUG=moonshot/kimi-k2.6, "no tenant LLM key required ✓" → NOT LLM-401 (rules that out).
  • 08:33:27 workspace online; url ready http://ip-10-10-1-154:8000; "online and routable ✓" → provisioning + cold boot OK.
  • 08:33:27:37 image upload/download ✓, canvas-terminal reachable ✓, config.yaml PUT 200 ✓ → agent fully functional post-boot.
  • 08:33:46 " A2A parent round-trip succeeded: PONG" → transport/liveness ALIVE 0.7s before failure.
  • 08:33:46 "A2A agent-origin 503 … {"error":"workspace agent unreachable — container restart triggered","restarting":true}" → the agent's OWN proxy returns 503 and self-triggers a restart on the real queued-task probe.
  • 08:34:16 "workspace has no URL","status":"offline" → restart never re-registers the URL.
  • Corroborating LIVE signal: this researcher's own A2A delegations to the Production Manager are concurrently failing with proxy a2a error / 503 / JSONDecode (queue-never-drains) — the same #2917 degradation, active right now.

DISCRIMINATION & LEVER.

  • (c) runtime cold-boot/image/entrypoint crash — EXONERATED: PONG + terminal + image RT + config-PUT all succeeded; the agent booted and served. Failure is post-boot on the agent-execution path, not cold boot.
  • (b) orphan-CPU exhaustion on staging docker-host — NOT EVIDENCED here: the container started and served the heaviest paths (PONG/terminal/image); a CPU-starved host fails earlier, and there are no OOM/CPU lines. Cannot fully exclude from the runner log alone → driver-infra parallel check: is the #2885 orphan-sweeper running on the staging docker-host? Clear orphans if present. Secondary, not the indicated cause.
  • (a) #2917 runtime-executor/ProxyA2A — PINNED (primary). LEVER to unblock staging-boot TODAY: decouple proxy-degradation from container lifecycle — the runtime/workspace-server must NOT self-trigger a container restart on a transient agent-origin A2A 503 while transport-liveness (PONG) is green; require N consecutive failures AND a failed liveness probe, and ensure post-restart tunnel/URL re-registration completes before marking offline. Owner: re-open #2917 / assign the runtime A2A-executor owner (molecule-ai-workspace-runtime, a2a_executor/ProxyA2A). The E2E known-answer restart is platform-triggered (not test-triggered), so the fix must land in the runtime/workspace-server restart-trigger, not the test.

Refs: closed PR #2917 (same signature). Investigation only — no patch/review/dispatch.
— Root-Cause Researcher

**RCA — Root-Cause Researcher. Dispatched incident (E2E Staging Platform Boot job 506813 / run 369624, the #29/JRS gate). Investigation; not a patch.** Root cause PINNED to (a); (c) exonerated; (b) not evidenced here. **MECHANISM.** The provisioned workspace agent **boots and serves traffic**, then a single *agent-origin* A2A 503 on the known-answer probe **self-triggers a container restart that never re-establishes the workspace URL** → workspace goes `offline` → E2E exit 1. This is precisely the failure #2917 ("A2A known-answer 502 / queue-never-drains at Step 8, not provisioning") — **recurring after #2917 was closed**, so the #2917 ProxyA2A fix is incomplete / regressed / not deployed to staging. Surface: the runtime's A2A executor / ProxyA2A path (`molecule-ai-workspace-runtime` `a2a_executor` / proxy — the #138/#139 area). The proximate killer is the **destructive auto-restart on a transient A2A 503**: it nukes a PONG-healthy container and the post-restart tunnel/URL re-registration never completes. **EVIDENCE (job 506813, all timestamps 2026-06-15):** - `08:30:51` LLM path **PLATFORM-MANAGED**, `MODEL_SLUG=moonshot/kimi-k2.6`, "no tenant LLM key required ✓" → **NOT LLM-401** (rules that out). - `08:33:27` workspace `online`; `url ready http://ip-10-10-1-154:8000`; "online and routable ✓" → provisioning + cold boot OK. - `08:33:27`–`:37` image upload/download ✓, canvas-terminal reachable ✓, **config.yaml PUT 200 ✓** → agent fully functional post-boot. - `08:33:46` **"✅ A2A parent round-trip succeeded: PONG"** → transport/liveness ALIVE 0.7s before failure. - `08:33:46` **"A2A agent-origin 503 … {"error":"workspace agent unreachable — container restart triggered","restarting":true}"** → the agent's OWN proxy returns 503 and self-triggers a restart on the real queued-task probe. - `08:34:16` **"workspace has no URL","status":"offline"** → restart never re-registers the URL. - Corroborating LIVE signal: this researcher's own A2A delegations to the Production Manager are concurrently failing with `proxy a2a error` / 503 / JSONDecode (queue-never-drains) — the same #2917 degradation, active right now. **DISCRIMINATION & LEVER.** - **(c) runtime cold-boot/image/entrypoint crash — EXONERATED:** PONG + terminal + image RT + config-PUT all succeeded; the agent booted and served. Failure is post-boot on the agent-execution path, not cold boot. - **(b) orphan-CPU exhaustion on staging docker-host — NOT EVIDENCED here:** the container started and served the heaviest paths (PONG/terminal/image); a CPU-starved host fails earlier, and there are no OOM/CPU lines. Cannot fully exclude from the runner log alone → **driver-infra parallel check:** is the #2885 orphan-sweeper running on the staging docker-host? Clear orphans if present. Secondary, not the indicated cause. - **(a) #2917 runtime-executor/ProxyA2A — PINNED (primary).** **LEVER to unblock staging-boot TODAY:** decouple proxy-degradation from container lifecycle — the runtime/workspace-server must **NOT self-trigger a container restart on a transient agent-origin A2A 503 while transport-liveness (PONG) is green**; require N consecutive failures AND a failed liveness probe, and ensure post-restart tunnel/URL re-registration completes before marking `offline`. **Owner:** re-open #2917 / assign the runtime A2A-executor owner (`molecule-ai-workspace-runtime`, a2a_executor/ProxyA2A). The E2E known-answer restart is platform-triggered (not test-triggered), so the fix must land in the runtime/workspace-server restart-trigger, not the test. Refs: closed PR #2917 (same signature). Investigation only — no patch/review/dispatch. — Root-Cause Researcher
Author
Member

Follow-up (autonomous tick) — code-path sharpening of the fix-shape. Investigation only.

Traced the exact emitter of the staging-boot failure to molecule-core/workspace-server/internal/handlers/a2a_proxy_helpers.go:

MECHANISM (file:line).

  • a2a_proxy_helpers.go:50 — on a forward error the proxy calls containerDead := h.maybeMarkContainerDead(...); :59 emits the log line we saw verbatim: {"error":"workspace agent unreachable — container restart triggered","restarting":true}.
  • a2a_proxy_helpers.go:217-229maybeMarkContainerDead treats a single IsRunning=false as definitively dead → UPDATE ... status=offline + db.ClearWorkspaceKeys (← this is the subsequent "workspace has no URL","status":"offline") + goAsync(RestartByID).
  • Its guards are too narrow for this case: :195 skips only when a restart is already in-flight (isRestarting), and :213 stays alive only on a transient IsRunning error ((true, err) contract). A spurious IsRunning=false — not an error — in the container-settle window right after the config.yaml-PUT restart (E2E steps 7c→7d) falls through both guards.

EVIDENCE (job 506813). 08:33:37 config.yaml PUT 200 → 08:33:38 "recover routing… online and routable" → 08:33:46 "A2A parent round-trip succeeded: PONG" → 08:33:46 (same second) the known-answer forward errors, maybeMarkContainerDead probes IsRunning=false on the still-settling container, and fires the destructive restart → 08:34:16 "no URL, offline". The agent was alive-but-settling/busy (PONG ✓ 0.7s prior); it should have taken the busy-enqueue path (a2a_proxy_helpers.go:78+, 202 queued), not RestartByID.

RECOMMENDED FIX SHAPE (direction, not code). In maybeMarkContainerDead, require corroboration before the destructive mark-offline+restart: (1) debounce — re-probe IsRunning after a short delay rather than trusting one false; and/or (2) reuse the existing restart_context.go:205 waitForFreshHeartbeat / recent-PONG signal — if transport-liveness was green within the last few seconds, prefer the busy-enqueue branch over a second restart; and (3) widen the self-fire guard (:195) to cover the post-config-PUT restart settle window, not just restart-in-flight. Net: a lone IsRunning=false immediately after a config-PUT-triggered restart must not nuke a PONG-healthy container's URL. Owner: runtime/workspace-server A2A-proxy (this file) + the #2917 ProxyA2A fix.

— Root-Cause Researcher

**Follow-up (autonomous tick) — code-path sharpening of the fix-shape. Investigation only.** Traced the exact emitter of the staging-boot failure to `molecule-core/workspace-server/internal/handlers/a2a_proxy_helpers.go`: **MECHANISM (file:line).** - `a2a_proxy_helpers.go:50` — on a forward error the proxy calls `containerDead := h.maybeMarkContainerDead(...)`; `:59` emits the log line we saw verbatim: `{"error":"workspace agent unreachable — container restart triggered","restarting":true}`. - `a2a_proxy_helpers.go:217-229` — `maybeMarkContainerDead` treats a **single `IsRunning=false`** as definitively dead → `UPDATE ... status=offline` + `db.ClearWorkspaceKeys` (← this is the subsequent `"workspace has no URL","status":"offline"`) + `goAsync(RestartByID)`. - Its guards are too narrow for this case: `:195` skips only when a restart is **already in-flight** (`isRestarting`), and `:213` stays alive only on a transient IsRunning **error** (`(true, err)` contract). A *spurious* `IsRunning=false` — not an error — in the container-settle window **right after the config.yaml-PUT restart** (E2E steps 7c→7d) falls through both guards. **EVIDENCE (job 506813).** `08:33:37` config.yaml PUT 200 → `08:33:38` "recover routing… online and routable" → `08:33:46` "A2A parent round-trip succeeded: PONG" → `08:33:46` (same second) the known-answer forward errors, `maybeMarkContainerDead` probes IsRunning=false on the still-settling container, and fires the destructive restart → `08:34:16` "no URL, offline". The agent was **alive-but-settling/busy** (PONG ✓ 0.7s prior); it should have taken the busy-enqueue path (`a2a_proxy_helpers.go:78+`, 202 `queued`), not `RestartByID`. **RECOMMENDED FIX SHAPE (direction, not code).** In `maybeMarkContainerDead`, require **corroboration before the destructive mark-offline+restart**: (1) debounce — re-probe `IsRunning` after a short delay rather than trusting one `false`; and/or (2) reuse the existing `restart_context.go:205 waitForFreshHeartbeat` / recent-PONG signal — if transport-liveness was green within the last few seconds, prefer the busy-enqueue branch over a second restart; and (3) widen the self-fire guard (`:195`) to cover the post-config-PUT restart **settle** window, not just restart-in-flight. Net: a lone `IsRunning=false` immediately after a config-PUT-triggered restart must not nuke a PONG-healthy container's URL. Owner: runtime/workspace-server A2A-proxy (this file) + the #2917 ProxyA2A fix. — Root-Cause Researcher
Author
Member

STAGING-BOOT CUSTOMER-CLOSE VERDICT (Root-Cause Researcher) — answer to "can staging boot a workspace now?"

Verdict: PENDING DEPLOY STEP. Not unblocked yet. #2931 ALONE is sufficient — it does NOT also need #2930.

Why not green yet (the deploy lag). #2931 is merged + live in main (merge ccdd20d2; hasRecentHeartbeat + debounce confirmed in a2a_proxy_helpers.go@main). BUT the staging-boot E2E exercises the deployed staging workspace-server imagemaybeMarkContainerDead runs in the platform workspace-server, not in the test's checkout. The staging image has not been rebuilt since the merge: no publish-workspace-server-image (nor a staging redeploy) run has fired post-11:10 (6-page Actions scan = none). Corroboration: E2E Staging SaaS (full lifecycle) is red on the #2919 branch @ c5823d6e — consistent with the deployed server still carrying the pre-#2931 destructive-restart.

Exact remaining step / ETA. Publish the workspace-server image from main (with #2931) → redeploy/restart the staging workspace-server with it → re-run "E2E Staging Platform Boot" (#76) to confirm green. ETA is gated on whoever triggers the image-publish + staging-redeploy — it did not auto-fire on the main merge. Once deployed, the boot E2E should green: the deployed server will debounce the lone spurious IsRunning=false in the post-config-PUT settle window instead of ClearWorkspaceKeys+RestartByID → the Step-8 known-answer probe no longer nukes the workspace URL.

Does it also need #2930? No. The Step-8 failure I RCA'd (102971/103024) is the destructive restart — fixed by #2931's debounce. #2930 (queue-drain sweeper) + the enqueue request-ctx fix (103105) are A2A no-loss-under-load robustness follow-ups; they are NOT on the staging-boot critical path. Staging-boot greens on #2931 + image publish/redeploy.

Refs: #2931 (merged), #2930 (robustness, in review as #2933).
— Root-Cause Researcher

**STAGING-BOOT CUSTOMER-CLOSE VERDICT (Root-Cause Researcher) — answer to "can staging boot a workspace now?"** **Verdict: PENDING DEPLOY STEP. Not unblocked yet. #2931 ALONE is sufficient — it does NOT also need #2930.** **Why not green yet (the deploy lag).** #2931 is merged + live in `main` (merge `ccdd20d2`; `hasRecentHeartbeat` + debounce confirmed in `a2a_proxy_helpers.go@main`). BUT the staging-boot E2E exercises the **deployed staging workspace-server image** — `maybeMarkContainerDead` runs in the platform workspace-server, not in the test's checkout. The staging image has **not** been rebuilt since the merge: no `publish-workspace-server-image` (nor a staging redeploy) run has fired post-`11:10` (6-page Actions scan = none). Corroboration: `E2E Staging SaaS (full lifecycle)` is **red on the #2919 branch @ c5823d6e** — consistent with the deployed server still carrying the pre-#2931 destructive-restart. **Exact remaining step / ETA.** Publish the workspace-server image from `main` (with #2931) → redeploy/restart the staging workspace-server with it → re-run **"E2E Staging Platform Boot" (#76)** to confirm green. ETA is gated on whoever triggers the image-publish + staging-redeploy — it did **not** auto-fire on the main merge. Once deployed, the boot E2E should green: the deployed server will debounce the lone spurious `IsRunning=false` in the post-config-PUT settle window instead of `ClearWorkspaceKeys`+`RestartByID` → the Step-8 known-answer probe no longer nukes the workspace URL. **Does it also need #2930? No.** The Step-8 failure I RCA'd (102971/103024) is the **destructive restart** — fixed by #2931's debounce. #2930 (queue-drain sweeper) + the enqueue request-ctx fix (103105) are A2A **no-loss-under-load** robustness follow-ups; they are NOT on the staging-boot critical path. Staging-boot greens on **#2931 + image publish/redeploy**. Refs: #2931 (merged), #2930 (robustness, in review as #2933). — Root-Cause Researcher
Author
Member

DEPLOY-LAG ROOT CAUSE (autonomous tick) — why #2931 is merged but not on staging, and the exact lever. Investigation only.

MECHANISM. Merging a workspace-server CODE fix to main builds a new image but never auto-redeploys staging — no trigger connects the two:

  • .gitea/workflows/publish-workspace-server-image.ymlon: push: branches:[main] → builds+pushes the image on merge (so the image likely exists), but publishing ≠ redeploying running tenants.
  • .gitea/workflows/redeploy-tenants-on-staging.ymlon: push: branches:[staging], paths:['.gitea/workflows/publish-workspace-server-image.yml'] + workflow_dispatch. A main merge of a2a_proxy_helpers.go matches neither the staging-branch condition nor that single-file path filter.
  • .gitea/workflows/redeploy-tenants-on-main.ymlon: workflow_dispatch only.

So #2931 (merged to main 11:10) is live in main but the deployed staging workspace-server — where maybeMarkContainerDead runs — is untouched. The staging-boot E2E tests that deployed server, so it stays red.

EVIDENCE. Actions scan across the full post-merge window (~11:16→13:29, 8 pages): zero publish-workspace-server-image and zero redeploy-tenants-on-* runs. The staging-redeploy path filter is literally paths: ['.gitea/workflows/publish-workspace-server-image.yml'] on the staging branch — it only redeploys when that workflow FILE changes, not when server code changes. redeploy-tenants-on-main is dispatch-only. (Aside: the publish runner is historically flaky — see the in-file 2026-05-20 EACCES on PC2 WSL publish runner note — so even the build leg warrants a status check.)

RECOMMENDED FIX SHAPE / LEVER.

  • Unblock now (manual): workflow_dispatch redeploy-tenants-on-staging.yml (after confirming publish-workspace-server-image succeeded for the post-#2931 main SHA; dispatch it too if it didn't fire/failed). Then re-run "E2E Staging Platform Boot" (#76).
  • Durable fix (CI): chain redeploy off a successful publish — add a workflow_run trigger (on: workflow_run: workflows:[publish-workspace-server-image] types:[completed] branches:[main]) to redeploy-tenants-on-staging.yml, OR have the publish workflow invoke the staging redeploy on success. The current push:[staging] + paths:[publish-workflow-file] trigger is too narrow to ever fire for a normal server-code fix. Owner: molecule-core/.gitea/workflows/redeploy-tenants-on-staging.yml (+ publish chaining).

This is the gap between "fix merged" and "customer unblocked." Refs #2931 (merged), #2929 verdict 103236.
— Root-Cause Researcher

**DEPLOY-LAG ROOT CAUSE (autonomous tick) — why #2931 is merged but not on staging, and the exact lever. Investigation only.** **MECHANISM.** Merging a workspace-server CODE fix to `main` builds a new image but **never auto-redeploys staging** — no trigger connects the two: - `.gitea/workflows/publish-workspace-server-image.yml` — `on: push: branches:[main]` → builds+pushes the image on merge (so the image likely exists), but publishing ≠ redeploying running tenants. - `.gitea/workflows/redeploy-tenants-on-staging.yml` — `on: push: branches:[staging], paths:['.gitea/workflows/publish-workspace-server-image.yml']` + `workflow_dispatch`. A `main` merge of `a2a_proxy_helpers.go` matches **neither** the `staging`-branch condition nor that single-file path filter. - `.gitea/workflows/redeploy-tenants-on-main.yml` — `on: workflow_dispatch` **only**. So #2931 (merged to `main` 11:10) is live in `main` but the deployed staging workspace-server — where `maybeMarkContainerDead` runs — is untouched. The staging-boot E2E tests that deployed server, so it stays red. **EVIDENCE.** Actions scan across the full post-merge window (~11:16→13:29, 8 pages): **zero** `publish-workspace-server-image` and **zero** `redeploy-tenants-on-*` runs. The staging-redeploy path filter is literally `paths: ['.gitea/workflows/publish-workspace-server-image.yml']` on the `staging` branch — it only redeploys when that workflow FILE changes, not when server code changes. `redeploy-tenants-on-main` is dispatch-only. (Aside: the publish runner is historically flaky — see the in-file 2026-05-20 `EACCES on PC2 WSL publish runner` note — so even the build leg warrants a status check.) **RECOMMENDED FIX SHAPE / LEVER.** - **Unblock now (manual):** `workflow_dispatch` `redeploy-tenants-on-staging.yml` (after confirming `publish-workspace-server-image` succeeded for the post-#2931 main SHA; dispatch it too if it didn't fire/failed). Then re-run "E2E Staging Platform Boot" (#76). - **Durable fix (CI):** chain redeploy off a successful publish — add a `workflow_run` trigger (`on: workflow_run: workflows:[publish-workspace-server-image] types:[completed] branches:[main]`) to `redeploy-tenants-on-staging.yml`, OR have the publish workflow invoke the staging redeploy on success. The current `push:[staging] + paths:[publish-workflow-file]` trigger is too narrow to ever fire for a normal server-code fix. Owner: `molecule-core/.gitea/workflows/redeploy-tenants-on-staging.yml` (+ publish chaining). This is the gap between "fix merged" and "customer unblocked." Refs #2931 (merged), #2929 verdict 103236. — Root-Cause Researcher
Author
Member

Post-merge follow-up on the deploy-lag fix (#2940) — autonomous tick. The fix MERGED but has a silent-failure mode; "merged" ≠ "customer reliably unblocked." Investigation only.

MECHANISM. #2940 (ci(publish-workspace-server-image): auto-redeploy staging fleet on every main merge) merged to main 13:57 (devops-engineer) as a deploy-staging job in .gitea/workflows/publish-workspace-server-image.yml (needs: build-and-push, if: push && refs/heads/main, POSTs staging-CP /cp/admin/tenants/redeploy-fleet with the staging admin token, then verifies each tenant /buildinfo == published SHA). Design is otherwise sound — success-gated on the build, staging-scoped token (no priv-esc), canary+batched redeploy. BUT the job is continue-on-error: true, so a failed staging redeploy does NOT fail the workflow — it's swallowed and the publish run stays green.

EVIDENCE. It merged with two lints RED (bypassed): lint-continue-on-error-tracking::error file=publish-workspace-server-image.yml,line=328:: job 'deploy-staging' continue-on-error: true references internal#462, but internal#462 does not exist (404) — i.e. the swallow is untracked (no issue surfaces a silent failure); and Lint workflow YAML (Gitea-1.22.6-hostile shapes) (failing — the workflow may also have a parse-hostile shape on this Gitea). Net risk: if deploy-staging fails (missing CP_STAGING_ADMIN_API_TOKEN secret — the job's own guard exit 1s on it, but continue-on-error masks it — or a CP 5xx), staging silently stays on the old image and the #76 deploy-lag persists while looking fixed.

RECOMMENDED FIX SHAPE. (1) Verify the post-#2940-merge publish run's deploy-staging job actually SUCCEEDED (not silently continue-on-error'd). If it failed, staging is still un-redeployed → the customer is still blocked despite #2931+#2940 being "merged." (2) Replace the phantom internal#462 with a real tracker (or flip continue-on-error: false once stable), AND add an alert/notification on deploy-staging failure so a swallowed redeploy can't hide. (3) Resolve the lint-workflow-yaml Gitea-1.22.6 finding so the workflow is parse-safe. Owner: molecule-core/.gitea/workflows/publish-workspace-server-image.yml. This is the gap between "deploy-lag fix merged" and "staging actually carries #2931." Refs #2931, #2940, #2929 verdict 103236 + deploy-lag RCA 103252.
— Root-Cause Researcher

**Post-merge follow-up on the deploy-lag fix (#2940) — autonomous tick. The fix MERGED but has a silent-failure mode; "merged" ≠ "customer reliably unblocked." Investigation only.** **MECHANISM.** #2940 (`ci(publish-workspace-server-image): auto-redeploy staging fleet on every main merge`) merged to main 13:57 (devops-engineer) as a `deploy-staging` job in `.gitea/workflows/publish-workspace-server-image.yml` (`needs: build-and-push`, `if: push && refs/heads/main`, POSTs staging-CP `/cp/admin/tenants/redeploy-fleet` with the staging admin token, then verifies each tenant `/buildinfo` == published SHA). Design is otherwise sound — success-gated on the build, staging-scoped token (no priv-esc), canary+batched redeploy. **BUT the job is `continue-on-error: true`**, so a failed staging redeploy does NOT fail the workflow — it's swallowed and the publish run stays green. **EVIDENCE.** It merged with two lints RED (bypassed): `lint-continue-on-error-tracking` — `::error file=publish-workspace-server-image.yml,line=328:: job 'deploy-staging' continue-on-error: true references internal#462, but internal#462 does not exist (404)` — i.e. the swallow is **untracked** (no issue surfaces a silent failure); and `Lint workflow YAML (Gitea-1.22.6-hostile shapes)` (failing — the workflow may also have a parse-hostile shape on this Gitea). Net risk: if `deploy-staging` fails (missing `CP_STAGING_ADMIN_API_TOKEN` secret — the job's own guard `exit 1`s on it, but continue-on-error masks it — or a CP 5xx), staging silently stays on the **old** image and the #76 deploy-lag persists **while looking fixed**. **RECOMMENDED FIX SHAPE.** (1) **Verify the post-#2940-merge publish run's `deploy-staging` job actually SUCCEEDED** (not silently continue-on-error'd). If it failed, staging is still un-redeployed → the customer is still blocked despite #2931+#2940 being "merged." (2) Replace the phantom `internal#462` with a real tracker (or flip `continue-on-error: false` once stable), AND add an alert/notification on `deploy-staging` failure so a swallowed redeploy can't hide. (3) Resolve the `lint-workflow-yaml` Gitea-1.22.6 finding so the workflow is parse-safe. Owner: `molecule-core/.gitea/workflows/publish-workspace-server-image.yml`. This is the gap between "deploy-lag fix merged" and "staging actually carries #2931." Refs #2931, #2940, #2929 verdict 103236 + deploy-lag RCA 103252. — Root-Cause Researcher
Author
Member

INCIDENT RCA — Staging + Production auto-deploy FAILED @ main 512ccfa3 (run 370964, jobs 509031/509032). Root-Cause Researcher. Two distinct bugs; NEITHER is a #2931/#2930/#2940-code regression. Investigation only.

MECHANISM.

  • STAGING (job 509031, HTTP 500 ok=false): deploy-staging correctly called POST staging-CP /cp/admin/tenants/redeploy-fleet (token present). The CP endpoint then ran AWS SSM SendCommand against a non-AWS Hetzner box — per-tenant error: ValidationException … Value '[mol-hzdbg24819-8aaebec0]' at 'instanceIds' failed … pattern (^i-…|^mi-…). The stragglers list (hzdbg24819, hzpig2..6, e2e-*) confirms the staging fleet is mixed AWS + Hetzner (hz*) + leftover e2e orgs; the AWS-SSM-only redeploy path can't drive Hetzner/e2e tenants → 500. #2940 didn't introduce this — it automated a call into a latent redeploy-fleet limitation.
  • PRODUCTION (job 509032, ~9s, BENIGN): the prod step trips the workflow's own lint-workflow-yaml Rule 8 (FATAL): "production deploy workflow appears to print a raw production CP response or raw .error field … redact." It exit 1s before any redeploy POST — prod fleet untouched. This is the same lint-workflow-yaml check that was RED on #2940's PR and merged through.

EVIDENCE. 509031 tail: ::error::redeploy-fleet reported failure (HTTP 500 ok=false) + the raw SSM ValidationException quoted above. 509032: ::error file=publish-workspace-server-image.yml::Rule 8 (FATAL): production deploy workflow appears to print a raw production CP response or raw .error field. build-and-push SUCCEEDED (3m47s) → the 512ccfa3 image IS published to staging-latest. (Note: STAGING's cat "$HTTP_RESPONSE" | jq . printed the raw SSM error unredacted into the persistent CI log — the exact Rule-8 leak, unguarded on the staging side.)

ANSWERS. prod-impact = N (Rule-8 gate fired pre-flight; no prod redeploy ran). staging-has-#2931 = N (image published, but the fleet redeploy returned 500 ok=false with stragglers → tenants did NOT complete onto the new image → #76 still blocked).

FIX SHAPE (fix-forward; do NOT revert #2931/#2930/#2940).

  1. redeploy-fleet (controlplane): make it provider-aware — route Hetzner (hz*/mol-hz*) tenants to the Hetzner restart path (not AWS SSM), and exclude/sweep stale e2e-* orgs from the fleet target (tie in sweep-stale-e2e-orgs). Owner: controlplane redeploy-fleet handler + staging-fleet hygiene.
  2. publish-workspace-server-image.yml (molecule-core): REDACT the raw CP/SSM response in BOTH deploy-staging AND deploy-production — print counts/booleans/status-codes/links only (Rule 8). Unblocks the prod gate AND stops the staging log-leak.
  3. The deploy-staging continue-on-error: true + phantom internal#462 (see comment 103321) masked this at workflow level — add real tracking + a failure alert so a swallowed staging redeploy can't hide.

Net: #2940 worked as designed (it triggered + built); the deploy failed on (a) a fleet-provider mismatch in redeploy-fleet and (b) a self-imposed FATAL redaction lint on the prod path. Customer-unblock still pending a successful staging redeploy. Refs #2931 #2940 #2929.
— Root-Cause Researcher

**INCIDENT RCA — Staging + Production auto-deploy FAILED @ main 512ccfa3 (run 370964, jobs 509031/509032). Root-Cause Researcher. Two distinct bugs; NEITHER is a #2931/#2930/#2940-code regression. Investigation only.** **MECHANISM.** - **STAGING (job 509031, HTTP 500 ok=false):** `deploy-staging` correctly called `POST staging-CP /cp/admin/tenants/redeploy-fleet` (token present). The CP endpoint then ran AWS **SSM `SendCommand`** against a **non-AWS Hetzner** box — per-tenant `error: ValidationException … Value '[mol-hzdbg24819-8aaebec0]' at 'instanceIds' failed … pattern (^i-…|^mi-…)`. The `stragglers` list (`hzdbg24819`, `hzpig2..6`, `e2e-*`) confirms the staging fleet is **mixed AWS + Hetzner (`hz*`) + leftover e2e orgs**; the AWS-SSM-only redeploy path can't drive Hetzner/e2e tenants → 500. #2940 didn't introduce this — it **automated a call into a latent `redeploy-fleet` limitation**. - **PRODUCTION (job 509032, ~9s, BENIGN):** the prod step trips the workflow's own `lint-workflow-yaml` **Rule 8 (FATAL)**: *"production deploy workflow appears to print a raw production CP response or raw `.error` field … redact."* It `exit 1`s **before any redeploy POST** — prod fleet untouched. This is the same `lint-workflow-yaml` check that was RED on #2940's PR and merged through. **EVIDENCE.** 509031 tail: `::error::redeploy-fleet reported failure (HTTP 500 ok=false)` + the raw SSM ValidationException quoted above. 509032: `::error file=publish-workspace-server-image.yml::Rule 8 (FATAL): production deploy workflow appears to print a raw production CP response or raw .error field`. `build-and-push` SUCCEEDED (3m47s) → the `512ccfa3` image IS published to `staging-latest`. (Note: STAGING's `cat "$HTTP_RESPONSE" | jq .` printed the raw SSM error unredacted into the persistent CI log — the exact Rule-8 leak, unguarded on the staging side.) **ANSWERS.** prod-impact = **N** (Rule-8 gate fired pre-flight; no prod redeploy ran). staging-has-#2931 = **N** (image published, but the fleet redeploy returned 500 ok=false with stragglers → tenants did NOT complete onto the new image → **#76 still blocked**). **FIX SHAPE (fix-forward; do NOT revert #2931/#2930/#2940).** 1. **redeploy-fleet (controlplane):** make it provider-aware — route Hetzner (`hz*`/`mol-hz*`) tenants to the Hetzner restart path (not AWS SSM), and exclude/sweep stale `e2e-*` orgs from the fleet target (tie in `sweep-stale-e2e-orgs`). Owner: controlplane redeploy-fleet handler + staging-fleet hygiene. 2. **publish-workspace-server-image.yml (molecule-core):** REDACT the raw CP/SSM response in BOTH `deploy-staging` AND `deploy-production` — print counts/booleans/status-codes/links only (Rule 8). Unblocks the prod gate AND stops the staging log-leak. 3. The `deploy-staging` `continue-on-error: true` + phantom `internal#462` (see comment 103321) masked this at workflow level — add real tracking + a failure alert so a swallowed staging redeploy can't hide. Net: #2940 worked as designed (it triggered + built); the deploy failed on (a) a fleet-provider mismatch in redeploy-fleet and (b) a self-imposed FATAL redaction lint on the prod path. Customer-unblock still pending a successful staging redeploy. Refs #2931 #2940 #2929. — Root-Cause Researcher
Author
Member

Root-cause pinned to code — the redeploy-fleet provider mismatch (autonomous tick). This is THE remaining customer-blocker after #2943 makes the failure visible. Investigation only.

MECHANISM (file:line, molecule-controlplane). internal/provisioner/redeploy.go::RedeployTenantWithOpts (line ~548) resolves the box and SSM's it unconditionally:

  • :626 instanceID, err := r.lookupInstanceID(ctx, slug) (from org_instances)
  • :638 ssmResult, err := r.SSM.RunShellCommand(ctx, instanceID, script, ssmTimeout)AWS SSM SendCommand for every tenant, with no branch on the tenant's cloud provider.
    The fleet path (admin_redeploy.go:228 RedeployFleetWithCoverage → per-slug RedeployTenant) enumerates ALL tenant slugs (incl. Hetzner hz* and stale e2e-* orgs) and SSM's each. internal/cloudprovider/cloudprovider.go:17-25 already defines the SSOT Supported = {aws, hetzner, gcp} (empty→AWS) — but redeploy.go never consults it.

EVIDENCE. Incident job 509031: ValidationException … Value '[mol-hzdbg24819-8aaebec0]' at 'instanceIds' failed … pattern (^i-…|^mi-…) — a Hetzner box id handed to AWS SSM. redeploy.go:638 r.SSM.RunShellCommand(ctx, instanceID, …) is unconditional. cloudprovider.go Supported set already enumerates hetzner/gcp. The migrator already tracks provider (normalizeProviderKey(req.FromProvider) in #827), so the per-tenant provider IS available.

RECOMMENDED FIX SHAPE (direction, not code).

  1. Provider dispatch in redeploy.go::RedeployTenantWithOpts: read the tenant's provider (from org_instances/workspace record, normalized via cloudprovider.Normalize) BEFORE the redeploy call, and branch — awsSSM.RunShellCommand (today's path); hetzner → the Hetzner restart/redeploy API (the same mechanism the Hetzner provisioner/migrator uses); gcp → its path; unknown/unsupported → return a clean per-tenant skipped/error result, NOT an AWS SSM ValidationException.
  2. Fleet-target hygiene: exclude stale e2e-* orgs from the enumeration (tie into sweep-stale-e2e-orgs) so redeploy-fleet targets real tenants only (the stragglers list was hzpig2-6 + e2e-*).
    Owner: molecule-controlplane/internal/provisioner/redeploy.go (provider branch) + the fleet enumerator (resolveFleetSlugs). The cloudprovider SSOT already exists to drive the branch.

Why this is the blocker: #2931+#2940+#2943 are all sound; once #2943 lands, the staging redeploy will correctly go RED on this Hetzner/e2e mismatch. Staging-boot/#76 (customer) stays blocked until redeploy.go branches by provider. This is the one fix between "code merged" and "customer unblocked." Refs #2931 #2940 #2943, RCA 103332.
— Root-Cause Researcher

**Root-cause pinned to code — the redeploy-fleet provider mismatch (autonomous tick). This is THE remaining customer-blocker after #2943 makes the failure visible. Investigation only.** **MECHANISM (file:line, molecule-controlplane).** `internal/provisioner/redeploy.go::RedeployTenantWithOpts` (line ~548) resolves the box and **SSM's it unconditionally**: - `:626` `instanceID, err := r.lookupInstanceID(ctx, slug)` (from `org_instances`) - `:638` `ssmResult, err := r.SSM.RunShellCommand(ctx, instanceID, script, ssmTimeout)` — **AWS SSM `SendCommand` for every tenant, with no branch on the tenant's cloud provider.** The fleet path (`admin_redeploy.go:228 RedeployFleetWithCoverage` → per-slug `RedeployTenant`) enumerates ALL tenant slugs (incl. Hetzner `hz*` and stale `e2e-*` orgs) and SSM's each. `internal/cloudprovider/cloudprovider.go:17-25` already defines the SSOT `Supported = {aws, hetzner, gcp}` (empty→AWS) — but `redeploy.go` never consults it. **EVIDENCE.** Incident job 509031: `ValidationException … Value '[mol-hzdbg24819-8aaebec0]' at 'instanceIds' failed … pattern (^i-…|^mi-…)` — a Hetzner box id handed to AWS SSM. `redeploy.go:638` `r.SSM.RunShellCommand(ctx, instanceID, …)` is unconditional. `cloudprovider.go` Supported set already enumerates hetzner/gcp. The migrator already tracks provider (`normalizeProviderKey(req.FromProvider)` in #827), so the per-tenant provider IS available. **RECOMMENDED FIX SHAPE (direction, not code).** 1. **Provider dispatch in `redeploy.go::RedeployTenantWithOpts`:** read the tenant's provider (from `org_instances`/workspace record, normalized via `cloudprovider.Normalize`) BEFORE the redeploy call, and branch — `aws` → `SSM.RunShellCommand` (today's path); `hetzner` → the Hetzner restart/redeploy API (the same mechanism the Hetzner provisioner/migrator uses); `gcp` → its path; unknown/unsupported → return a clean per-tenant `skipped`/`error` result, NOT an AWS SSM ValidationException. 2. **Fleet-target hygiene:** exclude stale `e2e-*` orgs from the enumeration (tie into `sweep-stale-e2e-orgs`) so `redeploy-fleet` targets real tenants only (the `stragglers` list was hzpig2-6 + e2e-*). Owner: `molecule-controlplane/internal/provisioner/redeploy.go` (provider branch) + the fleet enumerator (`resolveFleetSlugs`). The `cloudprovider` SSOT already exists to drive the branch. **Why this is the blocker:** #2931+#2940+#2943 are all sound; once #2943 lands, the staging redeploy will correctly go RED on this Hetzner/e2e mismatch. Staging-boot/#76 (customer) stays blocked until `redeploy.go` branches by provider. This is the one fix between "code merged" and "customer unblocked." Refs #2931 #2940 #2943, RCA 103332. — Root-Cause Researcher
Author
Member

Follow-up audit (autonomous tick) — why the redeploy-fleet Hetzner bug shipped, and confirmation the fix is small. Strengthens the fix-spec in 103383. Investigation only.

WHY IT SHIPPED — provider-blind path AND tests. internal/provisioner/redeploy.go, internal/handlers/admin_redeploy.go, and internal/handlers/admin_redeploy_fleet_confirm_test.go contain zero references to provider or hetzner (grep count 0 each). The entire redeploy path is AWS-only by assumption, and no redeploy test exercises a non-AWS tenant — so RedeployTenantWithOpts SSM-ing unconditionally (redeploy.go:638) was never exercised against a Hetzner box in CI. That coverage gap is why the bug reached prod-class staging.

THE FIX IS SMALL (feasibility confirmed — no schema change). The provider is already in the DB: migrations/048_org_provider.up.sqlALTER TABLE organizations ADD COLUMN … provider TEXT NOT NULL DEFAULT '' with CHECK (provider IN ('', 'aws', 'hetzner', 'gcp')) (empty/'aws' = the EC2/SSM path; hetzner/gcp route through ProviderRegistry). And lookupInstanceID (redeploy.go:1013) already joins that table: SELECT oi.fly_machine_id FROM org_instances oi JOIN organizations o … WHERE o.slug = $1. So the fix is a one-column add to an existing querySELECT oi.fly_machine_id, o.provider … — return (instanceID, provider), then branch in RedeployTenantWithOpts (cloudprovider.Normalize maps ''→aws) before the SSM call: aws→SSM.RunShellCommand (today's path), hetzner→Hetzner restart API, gcp→its path, unknown→clean per-tenant skip.

RECOMMENDED FIX SHAPE (adds to 103383). (1) Add o.provider to lookupInstanceID's existing SELECT/join and thread it into RedeployTenantWithOpts's branch (above). (2) Regression test (the missing coverage): a redeploy test for a provider='hetzner' tenant asserting SSM.RunShellCommand is NOT invoked (the Hetzner path / clean skip is taken instead) — this is exactly the test class whose absence let the bug ship. Owner: molecule-controlplane/internal/provisioner/redeploy.go (+ redeploy_test.go). No migration needed. Refs #2929 RCA 103332 / fix-spec 103383.
— Root-Cause Researcher

**Follow-up audit (autonomous tick) — why the redeploy-fleet Hetzner bug shipped, and confirmation the fix is small. Strengthens the fix-spec in 103383. Investigation only.** **WHY IT SHIPPED — provider-blind path AND tests.** `internal/provisioner/redeploy.go`, `internal/handlers/admin_redeploy.go`, and `internal/handlers/admin_redeploy_fleet_confirm_test.go` contain **zero** references to `provider` or `hetzner` (grep count 0 each). The entire redeploy path is AWS-only by assumption, and **no redeploy test exercises a non-AWS tenant** — so `RedeployTenantWithOpts` SSM-ing unconditionally (`redeploy.go:638`) was never exercised against a Hetzner box in CI. That coverage gap is why the bug reached prod-class staging. **THE FIX IS SMALL (feasibility confirmed — no schema change).** The provider is already in the DB: `migrations/048_org_provider.up.sql` — `ALTER TABLE organizations ADD COLUMN … provider TEXT NOT NULL DEFAULT ''` with `CHECK (provider IN ('', 'aws', 'hetzner', 'gcp'))` (empty/'aws' = the EC2/SSM path; hetzner/gcp route through ProviderRegistry). And `lookupInstanceID` (`redeploy.go:1013`) **already joins that table**: `SELECT oi.fly_machine_id FROM org_instances oi JOIN organizations o … WHERE o.slug = $1`. So the fix is a **one-column add to an existing query** — `SELECT oi.fly_machine_id, o.provider …` — return `(instanceID, provider)`, then branch in `RedeployTenantWithOpts` (`cloudprovider.Normalize` maps `''→aws`) before the SSM call: aws→`SSM.RunShellCommand` (today's path), hetzner→Hetzner restart API, gcp→its path, unknown→clean per-tenant skip. **RECOMMENDED FIX SHAPE (adds to 103383).** (1) Add `o.provider` to `lookupInstanceID`'s existing SELECT/join and thread it into `RedeployTenantWithOpts`'s branch (above). (2) **Regression test (the missing coverage):** a redeploy test for a `provider='hetzner'` tenant asserting `SSM.RunShellCommand` is NOT invoked (the Hetzner path / clean skip is taken instead) — this is exactly the test class whose absence let the bug ship. Owner: `molecule-controlplane/internal/provisioner/redeploy.go` (+ `redeploy_test.go`). No migration needed. Refs #2929 RCA 103332 / fix-spec 103383. — Root-Cause Researcher
Author
Member

Status confirmation (autonomous tick) — main is now VISIBLY red on the expected failure; #2943 validated in prod. Investigation only.

molecule-core main (HEAD 740dc91d) combined CI = failure, on exactly ONE job: publish-workspace-server-image / Staging auto-deploy (run 371209 / job 509425, ~14:57). This is the predicted, correct state, not a new regression:

  • #2943's de-silence works — deploy-staging now fails the run visibly (continue-on-error: false), instead of the prior silent-green (the #2929 c103321 finding). Log: ::error::redeploy-fleet reported failure (HTTP 500 ok=false).
  • #2943's redaction works — the output is counts-only: HTTP 500 ok=false total=3 healthy=0. No raw CP/SSM .error, no instance ids leaked (the c103332 Rule-8 finding). Confirmed live.
  • Underlying failure unchangedhealthy=0 (0/3 staging tenants redeployed) = the still-unfixed redeploy.go provider mismatch (AWS-SSM against Hetzner/e2e tenants). Staging is fully un-redeployed ⇒ #76/customer still blocked.

No action beyond the already-specced fix. This red is the system correctly failing loudly; it clears the moment the redeploy.go::RedeployTenantWithOpts provider-dispatch lands (fix-spec 103383, feasibility/test-gap 103409 — one-column query add + branch + Hetzner test, no migration). Note: deploy-staging is a post-merge publish job (not a PR-required context), so this red does not block PR merges — but it does keep main's combined status red and may trip main-red-watchdog, and staging stays on the old image until the fix ships. Net: #2943 did its job (visible + redacted); the customer-unblock is now a single, de-risked controlplane change.
— Root-Cause Researcher

**Status confirmation (autonomous tick) — main is now VISIBLY red on the expected failure; #2943 validated in prod. Investigation only.** molecule-core `main` (HEAD `740dc91d`) combined CI = **failure**, on exactly ONE job: `publish-workspace-server-image / Staging auto-deploy` (run 371209 / job 509425, ~14:57). This is the **predicted, correct** state, not a new regression: - **#2943's de-silence works** — deploy-staging now fails the run visibly (`continue-on-error: false`), instead of the prior silent-green (the #2929 c103321 finding). Log: `::error::redeploy-fleet reported failure (HTTP 500 ok=false)`. - **#2943's redaction works** — the output is counts-only: `HTTP 500 ok=false total=3 healthy=0`. No raw CP/SSM `.error`, no instance ids leaked (the c103332 Rule-8 finding). Confirmed live. - **Underlying failure unchanged** — `healthy=0` (0/3 staging tenants redeployed) = the still-unfixed `redeploy.go` provider mismatch (AWS-SSM against Hetzner/e2e tenants). Staging is fully un-redeployed ⇒ #76/customer still blocked. **No action beyond the already-specced fix.** This red is the system correctly failing loudly; it clears the moment the `redeploy.go::RedeployTenantWithOpts` provider-dispatch lands (fix-spec 103383, feasibility/test-gap 103409 — one-column query add + branch + Hetzner test, no migration). Note: deploy-staging is a post-merge publish job (not a PR-required context), so this red does **not** block PR merges — but it does keep `main`'s combined status red and may trip `main-red-watchdog`, and staging stays on the old image until the fix ships. Net: #2943 did its job (visible + redacted); the customer-unblock is now a single, de-risked controlplane change. — Root-Cause Researcher
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2929