fix(workspace-server): a2a-proxy SSOT — proactive RunningContainerName check before optimistic forward #36
Symptom
POST /workspaces/:id/a2a returns a generic 503 / "agent may be unreachable" when the workspace's container is missing on the host EC2, even though the workspace's status column reads online. Same SSOT-divergence shape as #10 (which was fixed for the plugins handler in #12), but on the a2a-proxy code path.
Real-world example today: Claude Code Agent (workspace 46a9517a-341f-4460-b2ad-33be35aed553) on hongming.moleculesai.app:
- status: online
- 503 on /workspaces/46a9517a.../a2a (twice) + wss://...../ws connection failed
- molecule-controlplane#20 (EC2 Name renamed to include org slug): the EC2 was provisioned/replaced fresh; the agent's per-workspace Docker container (ws-46a9517a-341) hasn't been respawned by the reconciler yet OR is on a different Docker network than molecule-tenant.
Why #12 didn't catch this
PR #12 routed findRunningContainer (used by POST /plugins, DELETE /plugins/:name, GET /plugins, etc.) through provisioner.RunningContainerName for SSOT. The a2a-proxy at internal/handlers/a2a_proxy.go:608 uses provisioner.InternalURL(workspaceID), which builds http://ws-<wsShort>:8000 and forwards optimistically; the container-health check (maybeMarkContainerDead at a2a_proxy_helpers.go:162) only runs reactively, after the network call fails.
That's defensible (avoids a docker-inspect on every a2a hop) but produces:
- a structured 503 body ({"error": "workspace agent unreachable — container restart triggered", "restarting": true}) that doesn't surface upstream; this needs a separate canvas-side render fix that reads the response body
Proposed approach
Option A (preferred): Proactive RunningContainerName pre-flight in a2a-proxy.
Before proxyA2ARequest builds the request, call provisioner.RunningContainerName(ctx, h.docker, workspaceID):
- ("", nil) → container is genuinely not running. Return a structured 503 with error="workspace container not running", restarting=true, and trigger the same async restart that maybeMarkContainerDead does today. Skip the forward attempt entirely (saves ~2-30s of network timeout).
- ("", err) → transient docker-daemon error. Either (a) fall through to the optimistic forward (current behavior) or (b) return 503 with error="workspace daemon unreachable". Document the choice.
- (name, nil) → forward as today.
This consolidates the SSOT for the a2a path on the same helper PR #12 introduced, and surfaces a cleaner error before canvas hits a network timeout.
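
The three-way branch in Option A can be sketched as a small, side-effect-light function. This is only an illustration under stated assumptions: runningNameFunc, preflightResult, preflight, the restart hook, and the stub lookups are hypothetical names, not code from workspace-server; the lookup stands in for provisioner.RunningContainerName with the Docker client already bound. The sketch picks choice (a) for the daemon-error case and calls the restart hook synchronously, where the real handler would presumably trigger it asynchronously.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// runningNameFunc is a hypothetical stand-in for provisioner.RunningContainerName.
// It follows the ("", nil) / ("", err) / (name, nil) contract described above.
type runningNameFunc func(ctx context.Context, workspaceID string) (string, error)

// preflightResult tells the caller whether to forward and, if not, what to return.
type preflightResult struct {
	Forward bool
	Status  int
	Body    map[string]any
}

// preflight applies the three-way branch before any request is built.
// restart is a hypothetical hook for the restart maybeMarkContainerDead triggers today.
func preflight(ctx context.Context, lookup runningNameFunc, restart func(string), workspaceID string) preflightResult {
	name, err := lookup(ctx, workspaceID)
	switch {
	case err != nil:
		// Transient docker-daemon error: choice (a), keep the optimistic forward.
		return preflightResult{Forward: true}
	case name == "":
		// Container is not running: request a restart and skip the forward.
		restart(workspaceID)
		return preflightResult{
			Status: http.StatusServiceUnavailable,
			Body: map[string]any{
				"error":      "workspace container not running",
				"restarting": true,
			},
		}
	default:
		// Container is running: forward as today.
		return preflightResult{Forward: true}
	}
}

// Stub lookups covering the three outcomes.
func stubMissing(context.Context, string) (string, error) { return "", nil }
func stubDaemonErr(context.Context, string) (string, error) {
	return "", errors.New("docker daemon timeout")
}
func stubRunning(context.Context, string) (string, error) { return "ws-46a9517a-341", nil }

func main() {
	restart := func(id string) { fmt.Println("restart requested for", id) }
	for _, lookup := range []runningNameFunc{stubMissing, stubDaemonErr, stubRunning} {
		res := preflight(context.Background(), lookup, restart, "46a9517a")
		fmt.Println(res.Forward, res.Status, res.Body)
	}
}
```

Returning a decision value rather than writing the HTTP response directly keeps the branch easy to unit-test without a Docker daemon; switching to choice (b) would only change the first case.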
Option B: Same as today, plus canvas-side rendering fix.
Leave workspace-server alone (it already returns structured 503 with container_dead/busy/etc.); update the canvas to read the response body and render distinct toasts. Doesn't fix the latency issue (canvas still waits for the failed forward) and doesn't touch the SSOT.
Option C: Both A and B.
Best of both: workspace-server short-circuits faster + canvas renders the distinct messages.
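
To illustrate what "reading the response body" means for a caller, here is a minimal sketch in Go (chosen to match workspace-server; the canvas would do the equivalent in its own language). The proxyError type, the describe function, and the user-facing message texts are all hypothetical; only the two error strings and the restarting field come from this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// proxyError mirrors the 503 body shape quoted in this issue.
type proxyError struct {
	Error      string `json:"error"`
	Restarting bool   `json:"restarting"`
}

// describe maps a failed a2a response to a distinct message.
// The message texts are illustrative assumptions, not canvas copy.
func describe(status int, body []byte) string {
	if status != http.StatusServiceUnavailable {
		return fmt.Sprintf("unexpected status %d", status)
	}
	var pe proxyError
	if err := json.Unmarshal(body, &pe); err != nil {
		// Body is not the structured shape: fall back to the generic message.
		return "agent may be unreachable"
	}
	switch pe.Error {
	case "workspace container not running":
		return "container not running, restart requested"
	case "workspace agent unreachable — container restart triggered":
		return "agent unreachable, restart triggered"
	default:
		return "agent may be unreachable"
	}
}

func main() {
	body := []byte(`{"error": "workspace container not running", "restarting": true}`)
	fmt.Println(describe(http.StatusServiceUnavailable, body))
}
```

Matching on the exact error string is brittle; if callers are expected to branch on it, a stable machine-readable code field might be worth considering, though that is outside what this issue proposes.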
SSOT decision
Per #12: provisioner.RunningContainerName is the canonical "is this workspace's container alive right now" check. The a2a-proxy already calls provisioner.IsRunning reactively via maybeMarkContainerDead; folding the proactive check onto the same helper means two consumers of the SSOT (plugins handler + a2a-proxy) and zero duplication.
Anti-drift gate (parallel to #12's AST gate)
Add a third AST gate in internal/handlers/a2a_proxy_drift_test.go (or wherever fits) that pins: a2a-proxy's pre-flight check MUST call provisioner.RunningContainerName, MUST NOT carry a parallel cli.ContainerInspect of its own. Mirrors TestFindRunningContainer_RoutesThroughProvisionerSSOT from #12.
Security check
validateCallerToken; pre-flight is post-auth.
Versioning + backwards-compat
error string ("workspace container not running") that callers may want to distinguish from the existing "workspace agent unreachable — container restart triggered". Canvas already renders both as the same toast today; tightening the canvas rendering is an orthogonal follow-up.
Acceptance criteria
proxyA2ARequest calls provisioner.RunningContainerName before the forward
Out of scope (parked as follow-ups)
If molecule-tenant and ws-<wsShort> are on different Docker networks (e.g., post-redeploy), the pre-flight RunningContainerName won't catch it (it inspects by name only). Worth a separate gate that verifies network membership; not blocking this RFC.
Relationship to molecule-controlplane#20
The issue surfaced visibly today after CP#20 (feat(provisioner): include org slug in workspace EC2 Name) merged at 10:43Z. CP#20 explicitly says it touches only the AWS EC2 Name tag (observability), not container naming. So the rename itself is innocent, but the redeploy that landed it triggered the post-replace reconcile gap that exposed this latent SSOT divergence. The RFC closes the divergence; CP-side reconciler timing is the orthogonal concern.
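
For reference, the anti-drift gate proposed earlier in this issue could be sketched with the standard go/parser and go/ast packages. This is a rough illustration, not the gate from #12: callsIn, checkGate, and the inline sample source are hypothetical, and a real gate would live in a _test.go file and parse the actual a2a_proxy.go instead of a string constant. It also assumes the pre-flight call sits directly inside proxyA2ARequest.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// callsIn returns the set of pkg.Func style calls made inside the named function.
func callsIn(src, funcName string) (map[string]bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "a2a_proxy.go", src, 0)
	if err != nil {
		return nil, err
	}
	found := map[string]bool{}
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Name.Name != funcName || fn.Body == nil {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			// Record calls of the form ident.Name(...), e.g. provisioner.RunningContainerName(...).
			if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
				if recv, ok := sel.X.(*ast.Ident); ok {
					found[recv.Name+"."+sel.Sel.Name] = true
				}
			}
			return true
		})
	}
	return found, nil
}

// checkGate enforces the two pins: MUST call the SSOT helper, MUST NOT inspect directly.
func checkGate(src string) error {
	calls, err := callsIn(src, "proxyA2ARequest")
	if err != nil {
		return err
	}
	if !calls["provisioner.RunningContainerName"] {
		return fmt.Errorf("proxyA2ARequest must call provisioner.RunningContainerName")
	}
	if calls["cli.ContainerInspect"] {
		return fmt.Errorf("proxyA2ARequest must not call cli.ContainerInspect directly")
	}
	return nil
}

// sample is a made-up stand-in for the handler source; it only needs to parse.
const sample = `package handlers

func proxyA2ARequest(id string) {
	name, err := provisioner.RunningContainerName(ctx, docker, id)
	_, _ = name, err
}
`

func main() {
	if err := checkGate(sample); err != nil {
		fmt.Println("gate failed:", err)
		return
	}
	fmt.Println("gate passed")
}
```

Because the check is purely syntactic, it matches on the identifier names as written; an import alias or a wrapper method around the Docker client would slip past it, so the real gate may need to resolve imports as well.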