molecule-core

Hongming Wang 9f35788aee fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS	2026-04-30 00:28:22 -07:00

Class-of-bugs fix surfaced by hongmingwang.moleculesai.app's canvas chat to a dead workspace returning a generic Cloudflare 502 page on 2026-04-30. Three independent gaps in the reactive-health path that together leak dead-agent failures to canvas with no auto-recovery.

## Bug 1: maybeMarkContainerDead is a no-op for SaaS tenants

`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner) and leave `h.provisioner` nil, so the function early-returned false on every call and dead EC2 agents never triggered the offline-flip / broadcast / restart cascade.

Fix: extend the `CPProvisionerAPI` interface with `IsRunning(ctx, id) (bool, error)` (already implemented on `*CPProvisioner`; it just needs to surface on the interface). `maybeMarkContainerDead` now branches: the local-Docker path uses `h.provisioner.IsRunning`; the SaaS path uses `h.cpProv.IsRunning`, which calls the CP's `/cp/workspaces/:id/status` endpoint to read the EC2 state.

## Bug 2: RestartByID short-circuits on `h.provisioner == nil`

Same shape as Bug 1: the auto-restart cascade triggered by `maybeMarkContainerDead` calls `RestartByID`, which short-circuited when the local Docker provisioner was missing. So even if Bug 1 were fixed, the workspace-offline state would never recover.

Fix: change the gate to `h.provisioner == nil && h.cpProv == nil` and update `runRestartCycle` to branch on which provisioner is wired for the Stop call. (The HTTP `Restart` handler already does this branching correctly; this brings the auto-restart path to parity.)

## Bug 3: upstream 502/503/504 propagated as-is, masked by Cloudflare

When the agent's tunnel returns 5xx (the "tunnel up but no origin" shape: agent process dead but cloudflared connection still healthy), `dispatchA2A` returns successfully at the HTTP layer with a 5xx body. `handleA2ADispatchError`'s reactive-health path doesn't run because that path is only triggered on transport-level errors. The pre-fix code propagated the 502 status to canvas; Cloudflare in front of the platform then masked the 502 with its own opaque "error code: 502" page, hiding any structured response and any Retry-After hint.

Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run `maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms the agent is dead → return a structured 503 with restarting=true + Retry-After (CF doesn't mask 503s the same way). If running, propagate the original status (don't recycle a healthy agent on a transient hiccup; it might have legitimately returned 502).

## Drive-by: a2aClient transport timeouts

a2aClient was `&http.Client{}` with no Transport timeouts. When a workspace's EC2 black-holes TCP connects (instance terminated mid-flight, SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS, long enough for Cloudflare's ~100s edge timeout to fire first and surface a generic 502. Added DialContext (10s connect), TLSHandshake (10s), and ResponseHeaderTimeout (60s). Client.Timeout is DELIBERATELY unset, because setting it would pre-empt slow-cold-start flows (Claude Code OAuth first-token, multi-minute agent synthesis). Long-tail body streaming is still governed by the per-request context deadline.

## Tests

- `TestMaybeMarkContainerDead_CPOnly_NotRunning`: IsRunning(false) → marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running`: IsRunning(true) → no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck`: agent server returns 502 + cpProv reports dead → caller gets 503 with restarting=true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs`: same upstream 502 but cpProv reports running → propagates 502 (existing behavior; safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` / `TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.

## Impact

Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no auto-recovery, manual restart from canvas required.

Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with restarting=true + Retry-After to canvas, workspace flipped to offline, auto-restart cycle triggered. Canvas can show a user-actionable "agent is restarting, please wait" message instead of a generic 502.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
artifacts	chore: sync staging to main — 1188 commits, 5 conflicts resolved (#1743)	2026-04-23 18:30:18 +00:00
bundle	fix(platform): unblock SaaS workspace registration end-to-end	2026-04-21 03:06:46 -07:00
channels	feat(channels): first-class Lark/Feishu support via schema-driven config	2026-04-24 11:51:15 -07:00
crypto	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
db	test(arch): codify 4 module boundaries as architecture tests (#2344)	2026-04-29 22:12:58 -07:00
envx	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
events	test(handlers): introduce events.EventEmitter interface (#1814 partial)	2026-04-26 09:05:52 -07:00
handlers	fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS	2026-04-30 00:28:22 -07:00
imagewatch	feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114)	2026-04-26 13:36:26 -07:00
metrics	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
middleware	merge: resolve staging conflicts (a2a_proxy + workspace_crud)	2026-04-26 10:43:22 -07:00
models	Merge pull request #2348 from Molecule-AI/auto/issue-2339-pr1-delivery-mode	2026-04-30 05:18:03 +00:00
orgtoken	fix: F1085 rm scope concat + GH#756 ValidateToken terminal guard + CI test fixes	2026-04-24 07:16:54 +00:00
plugins	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
provisioner	fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS	2026-04-30 00:28:22 -07:00
registry	fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart	2026-04-27 17:28:50 -07:00
router	feat(a2a): per-queue-id status endpoint + per-message TTL (RFC #2331 Tier 1)	2026-04-29 20:21:17 -07:00
scheduler	feat(runtime): native_scheduler skip — primitive #3 of 6	2026-04-26 22:47:00 -07:00
supervised	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
ws	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
wsauth	test(arch): codify 4 module boundaries as architecture tests (#2344)	2026-04-29 22:12:58 -07:00