fix(workspace-server): a2a-proxy SSOT — proactive RunningContainerName check before optimistic forward #36

Closed
opened 2026-05-07 18:07:40 +00:00 by Ghost · 0 comments

Symptom

POST /workspaces/:id/a2a returns generic 503 / "agent may be unreachable" when the workspace's container is missing on the host EC2, even though the workspace's status column reads online. Same SSOT-divergence shape as #10 (which was fixed for the plugins handler in #12), but on the a2a-proxy code path.

Real-world example today: Claude Code Agent (workspace 46a9517a-341f-4460-b2ad-33be35aed553) on hongming.moleculesai.app:

  • Registry STATUS: online
  • DevTools console: 503 on /workspaces/46a9517a.../a2a (twice) + wss://...../ws connection failed
  • Canvas toast: "Failed to send message — agent may be unreachable"
  • Likely root cause: after molecule-controlplane#20 (EC2 Name renamed to include the org slug), the EC2 was provisioned/replaced fresh; the agent's per-workspace Docker container (ws-46a9517a-341) either hasn't been respawned by the reconciler yet or is on a different Docker network than molecule-tenant.

Why #12 didn't catch this

PR #12 routed findRunningContainer (used by POST /plugins, DELETE /plugins/:name, GET /plugins, etc.) through provisioner.RunningContainerName for SSOT. The a2a-proxy at internal/handlers/a2a_proxy.go:608 uses provisioner.InternalURL(workspaceID) which builds http://ws-<wsShort>:8000 and forwards optimistically — the container-health check (maybeMarkContainerDead at a2a_proxy_helpers.go:162) only runs reactively after the network call fails.

That's defensible (avoids a docker-inspect on every a2a hop) but produces:

  • Long timeout before the canvas sees an error (network call has to fail first)
  • Generic "ProxyA2A forward error" log line — operators can't tell at a glance whether the container is gone, on the wrong network, or upstream-busy
  • Canvas-side rendering collapses every 503 into "agent may be unreachable" because the workspace-server's structured response ({"error": "workspace agent unreachable — container restart triggered", "restarting": true}) never surfaces upstream — fixing that needs a separate canvas-side render change that reads the response body

Proposed approach

Option A (preferred): Proactive RunningContainerName pre-flight in a2a-proxy.

Before proxyA2ARequest builds the request, call provisioner.RunningContainerName(ctx, h.docker, workspaceID):

  • ("", nil) → container is genuinely not running. Return a structured 503 with error="workspace container not running", restarting=true, plus trigger the same async restart maybeMarkContainerDead does today. Skip the forward attempt entirely (saves ~2-30s of network timeout).
  • ("", err) → transient docker-daemon error. Either (a) fall through to optimistic forward (current behavior) or (b) return 503 with error="workspace daemon unreachable". Document the choice.
  • (name, nil) → forward as today.

This consolidates the SSOT for the a2a path on the same helper PR #12 introduced, AND surfaces a cleaner error before canvas hits a network timeout.

Option B: Same as today, plus canvas-side rendering fix.

Leave workspace-server alone (it already returns structured 503 with container_dead/busy/etc.); update the canvas to read the response body and render distinct toasts. Doesn't fix the latency issue (canvas still waits for the failed forward) and doesn't touch the SSOT.

Option C: Both A and B.

Best of both: workspace-server short-circuits faster + canvas renders the distinct messages.

SSOT decision

Per #12: provisioner.RunningContainerName is the canonical "is this workspace's container alive right now" check. The a2a-proxy already calls provisioner.IsRunning reactively via maybeMarkContainerDead; folding the proactive check onto the same helper means two consumers of the SSOT (plugins handler + a2a-proxy) and zero duplication.

Anti-drift gate (parallel to #12's AST gate)

Add a third AST gate in internal/handlers/a2a_proxy_drift_test.go (or wherever fits) that pins: a2a-proxy's pre-flight check MUST call provisioner.RunningContainerName, MUST NOT carry a parallel cli.ContainerInspect of its own. Mirrors TestFindRunningContainer_RoutesThroughProvisionerSSOT from #12.
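A minimal sketch of what such a gate could look like, walking the AST for qualified calls. For brevity it parses an inline source string; the real gate would parse internal/handlers/a2a_proxy.go from disk. callsInSource and the sample source are illustrative, not the actual test from #12.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// callsInSource collects "pkg.Method"-shaped selector calls from a Go
// source string, e.g. "provisioner.RunningContainerName".
func callsInSource(src string) map[string]bool {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		panic(err)
	}
	calls := map[string]bool{}
	ast.Inspect(f, func(n ast.Node) bool {
		if call, ok := n.(*ast.CallExpr); ok {
			if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
				if ident, ok := sel.X.(*ast.Ident); ok {
					calls[ident.Name+"."+sel.Sel.Name] = true
				}
			}
		}
		return true
	})
	return calls
}

func main() {
	src := `package handlers
func preflight() { provisioner.RunningContainerName(nil, nil, "") }`
	calls := callsInSource(src)
	fmt.Println(calls["provisioner.RunningContainerName"]) // true: SSOT routed
	fmt.Println(calls["cli.ContainerInspect"])             // false: no parallel inspect
}
```

The gate then asserts the first lookup is true and the second false, which is exactly the "MUST call / MUST NOT carry" pinning described above.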

Security check

  • Untrusted input? No new input path.
  • Auth/sessions? a2a-proxy already gates via validateCallerToken; pre-flight is post-auth.
  • Data collection? Adds one log line on the not-running short-circuit. No PII.
  • Access change? Slightly tighter — a missing container produces an immediate "not running" instead of a vague timeout. Same workspace state machine; just earlier+clearer.

Versioning + backwards-compat

  • The 503 response body shape gets a new error string ("workspace container not running") that callers may want to distinguish from the existing "workspace agent unreachable — container restart triggered". Canvas already renders both as the same toast today; tightening the canvas rendering is an orthogonal follow-up.
  • No schema, API version, or migration impact.

Acceptance criteria

  • PR opens, links this issue
  • proxyA2ARequest calls provisioner.RunningContainerName before the forward
  • Three structured outcomes (running / not-running / transient-error) covered with explicit code paths + log lines
  • Unit tests for each outcome (table-driven, t.Setenv-style if env-bound, otherwise mock-based)
  • AST gate test pinning the SSOT routing (mirror of #12's gate)
  • Mutation test: deleting the pre-flight short-circuit makes the not-running test fail
  • No regression in existing a2a_proxy_test.go or a2a integration tests

Out of scope (parked as follow-ups)

  • Canvas-side rendering fix: read the workspace-server's structured 503 response body and show distinct toasts. Separate canvas PR.
  • EC2 reconciler timing: if the post-CP#20 reconciler is too slow to respawn agent containers after EC2 swap, that's a separate fix in the controlplane / provisioner. This RFC only narrows the symptom on the workspace-server side.
  • Docker network audit: if molecule-tenant and ws-<wsShort> are sometimes on different Docker networks (e.g., post-redeploy), pre-flight RunningContainerName won't catch it (it inspects by name only). Worth a separate gate that verifies network membership; not blocking this RFC.

Relationship to molecule-controlplane#20

The issue surfaced visibly today after CP#20 (feat(provisioner): include org slug in workspace EC2 Name) merged at 10:43Z. CP#20 explicitly says it touches only the AWS EC2 Name tag (observability), not container naming. So the rename itself is innocent — but the redeploy that landed it triggered the post-replace reconcile gap that exposed this latent SSOT divergence. The RFC closes the divergence; CP-side reconciler timing is the orthogonal concern.

Reference: molecule-ai/molecule-core#36