molecule-core

Hongming Wang 9f35788aee fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS

Class-of-bugs fix surfaced by hongmingwang.moleculesai.app's canvas chat to a dead workspace returning a generic Cloudflare 502 page on 2026-04-30. Three independent gaps in the reactive-health path that together leak dead-agent failures to canvas with no auto-recovery.

## Bug 1: maybeMarkContainerDead is a no-op for SaaS tenants

`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner) and leave `h.provisioner` nil, so the function early-returned false on every call and dead EC2 agents never triggered the offline-flip / broadcast / restart cascade.

Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id) (bool, error)` (already implemented on `*CPProvisioner`; just needs to surface on the interface). `maybeMarkContainerDead` now branches: local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses `h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status` endpoint to read the EC2 state.

## Bug 2: RestartByID short-circuits on `h.provisioner == nil`

Same shape as Bug 1: the auto-restart cascade triggered by `maybeMarkContainerDead` calls `RestartByID` which short-circuited when the local Docker provisioner was missing. So even if Bug 1 were fixed, the workspace-offline state would never recover.

Fix: change the gate to `h.provisioner == nil && h.cpProv == nil` and update `runRestartCycle` to branch on which provisioner is wired for the Stop call. (The HTTP `Restart` handler already does this branching correctly; we're just bringing the auto-restart path to parity.)

## Bug 3: upstream 502/503/504 propagated as-is, masked by Cloudflare

When the agent's tunnel returns 5xx (the "tunnel up but no origin" shape: agent process dead but cloudflared connection still healthy), `dispatchA2A` returns successfully at the HTTP layer with a 5xx body. `handleA2ADispatchError`'s reactive-health path doesn't run because that path is only triggered on transport-level errors. The pre-fix code propagated the 502 status to canvas; Cloudflare in front of the platform then masked the 502 with its own opaque "error code: 502" page, hiding any structured response and any Retry-After hint.

Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run `maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms the agent is dead → return a structured 503 with restarting=true + Retry-After (CF doesn't mask 503s the same way). If running, propagate the original status (don't recycle a healthy agent on a transient hiccup; it might have legitimately returned 502).

## Drive-by: a2aClient transport timeouts

a2aClient was `&http.Client{}` with no Transport timeouts. When a workspace's EC2 black-holes TCP connects (instance terminated mid-flight, SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS, long enough for Cloudflare's ~100s edge timeout to fire first and surface a generic 502. Added DialContext (10s connect), TLSHandshake (10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY unset; that would pre-empt slow-cold-start flows (Claude Code OAuth first-token, multi-minute agent synthesis). Long-tail body streaming is still governed by per-request context deadline.

## Tests

- `TestMaybeMarkContainerDead_CPOnly_NotRunning`: IsRunning(false) → marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running`: IsRunning(true) → no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck`: agent server returns 502 + cpProv reports dead → caller gets 503 with restarting=true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs`: same upstream 502 but cpProv reports running → propagates 502 (existing behavior; safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` / `TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.

## Impact

Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no auto-recovery, manual restart from canvas required.

Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with restarting=true + Retry-After to canvas, workspace flipped to offline, auto-restart cycle triggered. Canvas can show a user-actionable "agent is restarting, please wait" message instead of a generic 502.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 00:28:22 -07:00
cmd/server	feat(runtime): native_scheduler skip — primitive #3 of 6	2026-04-26 22:47:00 -07:00
internal	fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS	2026-04-30 00:28:22 -07:00
migrations	feat(workspaces): delivery_mode column + poll-mode register flow (#2339 PR 1)	2026-04-29 21:47:14 -07:00
pkg/provisionhook	feat(#1957): wire gh-identity plugin into workspace-server	2026-04-24 15:01:41 +00:00
.ci-force	chore: force Platform(Go) CI run on main — validate go vet clean	2026-04-21 15:43:19 +00:00
.gitignore	feat(ws-server): pull env from CP on startup	2026-04-19 02:41:15 -07:00
.golangci.yaml	chore(workspace-server): add golangci.yaml disabling errcheck	2026-04-24 07:16:54 +00:00
Dockerfile	chore: extract ContextMenu Zustand fix + a2a_proxy local-docker SSRF bypass + workspace-server Dockerfile GID entrypoint	2026-04-22 20:00:16 -07:00
Dockerfile.tenant	feat(terminal): remote path via aws ec2-instance-connect + pty	2026-04-21 18:13:29 -07:00
entrypoint-tenant.sh	fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155)	2026-04-20 23:51:33 +00:00
go.mod	chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave	2026-04-28 16:25:46 -07:00
go.sum	chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave	2026-04-28 16:25:46 -07:00