fix(workspace-server): a2a-proxy preflight container check (closes #36) #37
Closes #36. Same SSOT-divergence pattern as #10 (closed by #12), but on the a2a-proxy code path that #12 didn't reach.
Symptom this fixes
Canvas message-send returns a generic 503 / "agent may be unreachable" when the workspace's container is missing on its EC2, even though `workspaces.status` reads `online`. Surfaced today on Claude Code Agent (workspace `46a9517a-341f-4460-b2ad-33be35aed553`) post molecule-controlplane#20 (EC2 Name renamed to include org slug). The CP rename itself is observability-only; the redeploy that landed it triggered the post-replace reconciler gap that exposed this latent SSOT divergence.

What was happening
`proxyA2ARequest` in `internal/handlers/a2a_proxy.go` calls `resolveAgentURL` → forwards to `provisioner.InternalURL(workspaceID)` → `http://ws-<wsShort>:8000` (Docker-DNS form when the platform is itself in Docker). When the agent container is missing, `handleA2ADispatchError` runs, which calls `maybeMarkContainerDead` *reactively*. Plus the workspace-server log line is the generic `ProxyA2A forward error: <network err>`, not a structured "container missing" signal — operators couldn't tell at a glance whether the failure was missing-container vs upstream-busy vs network-flake without grepping subsequent log lines.

What this PR changes
A pre-flight `Provisioner.IsRunning` check inserted between `resolveAgentURL` and `dispatchA2A`, gated on the conditions where we know we're talking to a sibling Docker container we own (`h.provisioner != nil && platformInDocker && url == http://<ContainerName(id)>:port`); a sketch follows the outcome list below.

Three outcomes via the SSOT helper (which itself wraps `RunningContainerName` per #12), depending on what `IsRunning` returns:

- `(true, nil)` — container alive; forward as before.
- `(false, nil)` — short-circuit with a structured 503: `error="workspace container not running — restart triggered"`, `restarting=true`, `preflight=true`. Triggers the same offline-flip + `WORKSPACE_OFFLINE` broadcast + async restart that `maybeMarkContainerDead` produces. Saves the caller 2-30s of network timeout.
- `(true, err)` — forward anyway; `maybeMarkContainerDead` still runs after the forward if it actually fails.
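A minimal sketch of the inserted check, not the literal diff: `IsRunning`, `ContainerName`, the gate shape, and the 503 fields come from this PR; the handler struct, the `preflightContainerCheck` name, and the `markOfflineAndRestartAsync` plumbing are illustrative assumptions.

```go
package handlers

import (
	"context"
	"encoding/json"
	"net/http"
	"strings"
)

// Provisioner stubs the two calls this sketch needs; the real interface
// lives in the provisioner package referenced by this PR.
type Provisioner interface {
	IsRunning(ctx context.Context, workspaceID string) (bool, error)
	ContainerName(workspaceID string) string
}

// A2AHandler is a pared-down stand-in for the real handler struct.
type A2AHandler struct {
	provisioner      Provisioner
	platformInDocker bool
}

// preflightContainerCheck reports whether the forward should proceed.
// It sits between resolveAgentURL and dispatchA2A.
func (h *A2AHandler) preflightContainerCheck(ctx context.Context, w http.ResponseWriter, workspaceID, agentURL string) bool {
	// Gate: only fire when we own a sibling Docker container for this workspace.
	inGate := h.provisioner != nil && h.platformInDocker &&
		strings.HasPrefix(agentURL, "http://"+h.provisioner.ContainerName(workspaceID)+":")
	if !inGate {
		return true // e.g. the cpProv / SaaS-EC2 path: forward optimistically as before
	}

	running, err := h.provisioner.IsRunning(ctx, workspaceID)
	if err != nil {
		return true // inspect errored: fail open; the reactive path still catches a dead forward
	}
	if running {
		return true // container alive: proceed to dispatchA2A
	}

	// Container missing: same side effects maybeMarkContainerDead produces,
	// minus the 2-30s network timeout the caller would otherwise wait out.
	h.markOfflineAndRestartAsync(ctx, workspaceID)

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusServiceUnavailable)
	_ = json.NewEncoder(w).Encode(map[string]any{
		"error":      "workspace container not running — restart triggered",
		"restarting": true,
		"preflight":  true,
	})
	return false
}

// markOfflineAndRestartAsync stands in for the shared offline-flip,
// WORKSPACE_OFFLINE broadcast, and async-restart path.
func (h *A2AHandler) markOfflineAndRestartAsync(ctx context.Context, workspaceID string) {}
```

Returning a bool keeps the call site a one-liner between resolve and dispatch: `if !h.preflightContainerCheck(ctx, w, id, agentURL) { return }`.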
The `preflight=true` field in the response distinguishes the proactive short-circuit from the reactive `maybeMarkContainerDead` 503, so canvas or downstream callers can render distinct messages later (separate canvas-side PR).
SSOT decision

`Provisioner.IsRunning` (which wraps `provisioner.RunningContainerName` per #12) is the canonical "is this workspace's container alive right now" helper. The a2a-proxy now joins the plugins handler in routing through it. Two consumers, zero parallel impl. Pinned by an AST gate (test 4 below) — mirror of `TestFindRunningContainer_RoutesThroughProvisionerSSOT` from #12.
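For flavor, a sketch of what such an AST gate can look like, using only the standard `go/parser` / `go/ast` packages; the real test 4's name and exact assertions may differ.

```go
package handlers

import (
	"go/ast"
	"go/parser"
	"go/token"
	"testing"
)

// TestA2APreflight_RoutesThroughProvisionerSSOT (illustrative name) parses
// a2a_proxy.go and requires that it calls the SSOT helper IsRunning while
// never touching the docker client directly, pinning the two-consumers,
// zero-parallel-impl decision at the source level.
func TestA2APreflight_RoutesThroughProvisionerSSOT(t *testing.T) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "a2a_proxy.go", nil, 0)
	if err != nil {
		t.Fatalf("parse a2a_proxy.go: %v", err)
	}

	callsIsRunning := false
	ast.Inspect(file, func(n ast.Node) bool {
		sel, ok := n.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		switch sel.Sel.Name {
		case "IsRunning":
			callsIsRunning = true
		case "ContainerList", "ContainerInspect":
			// A direct docker-client call here would be a parallel impl.
			t.Errorf("a2a_proxy.go calls %s directly; route through Provisioner.IsRunning", sel.Sel.Name)
		}
		return true
	})
	if !callsIsRunning {
		t.Error("a2a_proxy.go never calls IsRunning; the preflight lost its SSOT routing")
	}
}
```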
Tests

`internal/handlers/a2a_proxy_preflight_test.go` (4 tests, all PASS). Plus `go test ./internal/handlers/...` full suite — green; `go vet ./...` clean.

Mutation tests
Mutation 1: `if running { return nil }` → `if true { return nil }` (skip the not-running guard). A compile error stops it before runtime (the mutation leaves `running` unused, which Go rejects), stricter than a runtime test failure.
Mutation 2: Replace the `!running` branch with `return nil` (silent pass-through). The `TestPreflight_ContainerNotRunning_StructuredFastFail` test fails at sqlmock's "expected DB call did not occur" — the offline-flip UPDATE is never executed because the function exited before reaching the side-effect path. Verified end-to-end.
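A sketch of the sqlmock expectation that makes mutation 2 visible; the table/column names and the inline `db.Exec` stand-in are assumptions (the real test drives the handler with a provisioner stub returning `(false, nil)`).

```go
package handlers

import (
	"testing"

	"github.com/DATA-DOG/go-sqlmock"
)

func TestOfflineFlip_Expectation_Sketch(t *testing.T) {
	db, mock, err := sqlmock.New()
	if err != nil {
		t.Fatalf("sqlmock.New: %v", err)
	}
	defer db.Close()

	// The not-running path must flip workspaces.status; if a mutant returns
	// early, this expectation is never matched.
	mock.ExpectExec("UPDATE workspaces").
		WillReturnResult(sqlmock.NewResult(0, 1))

	// Stand-in for exercising the preflight's side-effect path.
	if _, err := db.Exec("UPDATE workspaces SET status = 'offline' WHERE id = $1", "46a9517a"); err != nil {
		t.Fatalf("exec: %v", err)
	}

	// Mutation 2 trips here: the expectation was registered but never satisfied.
	if err := mock.ExpectationsWereMet(); err != nil {
		t.Errorf("offline-flip UPDATE never executed: %v", err)
	}
}
```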
Hostile self-review — three weakest spots

Adds one docker-inspect per a2a call when `platformInDocker`: ~1-5ms cost. At the volume reno-stars sees (~10 a2a calls/sec at peak), that's a measurable but small overhead. It could be cached with a TTL if hot-path latency budgets tighten (see the sketch below); not doing that today because cache invalidation correctness is its own design problem and the current cost is acceptable.
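For concreteness, the parked idea would look roughly like this; `ttlRunningCache` is hypothetical and deliberately not in this PR, since a stale `true` would reopen exactly the divergence window the preflight closes.

```go
package handlers

import (
	"sync"
	"time"
)

// ttlRunningCache memoizes "is the container running" answers for a short
// window to amortize the per-call docker-inspect. Sketch only: it holds the
// lock across fetch (serializing inspects) and has no invalidation hook for
// restarts, which is the correctness problem noted above.
type ttlRunningCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]runningEntry
}

type runningEntry struct {
	running bool
	expires time.Time
}

func (c *ttlRunningCache) get(workspaceID string, fetch func() (bool, error)) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[workspaceID]; ok && time.Now().Before(e.expires) {
		return e.running, nil
	}
	running, err := fetch() // the real Provisioner.IsRunning call would go here
	if err != nil {
		return running, err // don't cache errors
	}
	if c.entries == nil {
		c.entries = make(map[string]runningEntry)
	}
	c.entries[workspaceID] = runningEntry{running: running, expires: time.Now().Add(c.ttl)}
	return running, nil
}
```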
The gate condition is `strings.HasPrefix(agentURL, "http://"+provisioner.ContainerName(workspaceID)+":")`, which is sensitive to a future change in `ContainerName` or `InternalURL`. If those drift, the preflight silently stops firing. Mitigation: the existing `Provisioner.IsRunning` is unchanged in shape, and `dispatchA2A`'s reactive path still catches the failure (just slower). So this is a degrade-gracefully drift, not a fail-loudly one.
Doesn't address the `cpProv` (SaaS / EC2 status) path. The preflight only fires for local-Docker tenants. For SaaS-EC2 deployments where `h.provisioner == nil && h.cpProv != nil`, the forward still runs optimistically and `maybeMarkContainerDead` catches dead EC2s reactively. Extending preflight to the `cpProv` path is a logical follow-up but a separate code path with different cost/latency tradeoffs (CP HTTP call vs Docker-inspect) — out of scope for this PR.

Versioning + backwards-compat
"preflight": true. Canvas already renders both shapes as the same toast today; tightening the canvas rendering is the orthogonal follow-up I noted in the issue.Rollout / rollback
`git revert` the merge OR set the gate to always-false (`if false &&`). The reactive path is preserved either way.

Out of scope (parked as separate follow-ups)
- Canvas-side distinct rendering of the two 503 shapes (`preflight: true` vs reactive `containerDead`). Separate canvas PR.
- Extending the preflight to the `cpProv` (SaaS / EC2) path, per the self-review note above.

🤖 Generated with Claude Code