fix(workspace-server): a2a-proxy preflight container check (closes #36) #37

Merged
claude-ceo-assistant merged 1 commit from fix/issue36-a2a-proxy-preflight into main 2026-05-07 18:25:07 +00:00
First-time contributor

Closes #36. Same SSOT-divergence pattern as #10 (closed by #12), but on the a2a-proxy code path that #12 didn't reach.

Symptom this fixes

Canvas message-send returns generic 503 / "agent may be unreachable" when the workspace's container is missing on its EC2, even though workspaces.status reads online. Surfaced today on Claude Code Agent (workspace 46a9517a-341f-4460-b2ad-33be35aed553) post-molecule-controlplane#20 (EC2 Name renamed to include org slug). The CP rename itself is observability-only; the redeploy that landed it triggered the post-replace reconciler gap that exposed this latent SSOT divergence.

What was happening

proxyA2ARequest in internal/handlers/a2a_proxy.go calls resolveAgentURL → forwards to provisioner.InternalURL(workspaceID) → http://ws-<wsShort>:8000 (Docker-DNS form when the platform is itself in Docker). When the agent container is missing:

  1. The forward attempt fails with a network/DNS error after the connect timeout (2–30s).
  2. handleA2ADispatchError runs, which calls maybeMarkContainerDead REACTIVELY.
  3. The reactive check correctly identifies the dead container and returns a structured 503 — but the caller already paid the timeout cost.

Plus the workspace-server log line is the generic ProxyA2A forward error: <network err>, not a structured "container missing" signal — operators couldn't tell at a glance whether the failure was missing-container vs upstream-busy vs network-flake without grepping subsequent log lines.

What this PR changes

A pre-flight Provisioner.IsRunning check is inserted between resolveAgentURL and dispatchA2A, gated on the conditions where we know we're talking to a sibling Docker container we own (h.provisioner != nil && platformInDocker && url == http://<ContainerName(id)>:port).
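
For concreteness, the gate reduces to a pure predicate. A minimal sketch, with hasProvisioner and platformInDocker as stand-ins for the handler fields (this is also the condition whose drift behavior is discussed in the self-review below):

```go
package handlers

import "strings"

// preflightEligible is the gate condition lifted into a pure predicate for
// illustration. hasProvisioner and platformInDocker are stand-ins for
// h.provisioner != nil and the handler's in-Docker flag.
func preflightEligible(hasProvisioner, platformInDocker bool, agentURL, containerName string) bool {
	return hasProvisioner &&
		platformInDocker &&
		strings.HasPrefix(agentURL, "http://"+containerName+":")
}
```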

Three outcomes via the SSOT helper (which itself wraps RunningContainerName per #12):

IsRunning returns → Action

  (true,  nil) → Forward as today. Net cost: one extra docker-inspect (~1–5ms).
  (false, nil) → Fast-503 with error="workspace container not running — restart
                 triggered", restarting=true, preflight=true. Same offline-flip +
                 WORKSPACE_OFFLINE broadcast + async restart that
                 maybeMarkContainerDead produces. Saves the caller 2–30s of
                 network timeout.
  (true,  err) → Fall through to optimistic forward. Matches IsRunning's
                 "fail-soft as alive" contract — a flaky daemon must NOT trigger
                 a restart cascade. Reactive maybeMarkContainerDead still runs
                 after the forward if it actually fails.

The preflight=true field in the response distinguishes the proactive short-circuit from the reactive maybeMarkContainerDead 503, so canvas or downstream callers can render distinct messages later (separate canvas-side PR).
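
A minimal sketch of the helper's three-way branch, under assumed signatures. Only IsRunning's (bool, error) shape and the 503 body fields come from this PR; the interface, hook funcs, and sentinel error are stand-ins for illustration (the real helper reuses maybeMarkContainerDead's machinery):

```go
package handlers

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"
)

// Stand-in for the provisioner's liveness method (SSOT per #12).
type containerProvisioner interface {
	IsRunning(ctx context.Context, workspaceID string) (bool, error)
}

var errPreflightShortCircuit = errors.New("a2a preflight: container not running")

// nil return means "go ahead and forward"; errPreflightShortCircuit means
// the fast-503 has already been written to w.
func preflightContainerHealth(ctx context.Context, w http.ResponseWriter,
	p containerProvisioner, workspaceID string,
	markOffline func(), asyncRestart func()) error {

	running, err := p.IsRunning(ctx, workspaceID) // SSOT liveness check
	if err != nil {
		// Fail-soft as alive: a flaky daemon must NOT trigger a restart
		// cascade. The reactive path still catches a real dead container.
		return nil
	}
	if running {
		return nil // forward as today; cost was one docker-inspect (~1-5ms)
	}
	// Dead container: same side effects maybeMarkContainerDead produces,
	// but before the 2-30s connect timeout instead of after it.
	markOffline()     // offline-flip + WORKSPACE_OFFLINE broadcast
	go asyncRestart() // container restart kicked off in the background
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusServiceUnavailable)
	_ = json.NewEncoder(w).Encode(map[string]any{
		"error":      "workspace container not running — restart triggered",
		"restarting": true,
		"preflight":  true, // proactive short-circuit, not the reactive 503
	})
	return errPreflightShortCircuit
}
```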

SSOT decision

Provisioner.IsRunning (which wraps provisioner.RunningContainerName per #12) is the canonical "is this workspace's container alive right now" helper. The a2a-proxy now joins the plugins handler in routing through it. Two consumers, zero parallel impl. Pinned by an AST gate (test 4 below) — mirror of TestFindRunningContainer_RoutesThroughProvisionerSSOT from #12.
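
Roughly the shape such an AST gate takes; a sketch, and the real gate (mirroring #12's) may pin more, such as the receiver expression being h.provisioner:

```go
package handlers

import (
	"go/ast"
	"go/parser"
	"go/token"
	"testing"
)

// Sketch of the SSOT AST gate: parse the helper file and require that it
// calls IsRunning somewhere, so a parallel liveness impl can't sneak in.
func TestProxyA2A_Preflight_RoutesThroughProvisionerSSOT_Sketch(t *testing.T) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "a2a_proxy_helpers.go", nil, 0)
	if err != nil {
		t.Fatalf("parse: %v", err)
	}
	found := false
	ast.Inspect(file, func(n ast.Node) bool {
		if sel, ok := n.(*ast.SelectorExpr); ok && sel.Sel.Name == "IsRunning" {
			found = true
		}
		return true
	})
	if !found {
		t.Fatal("preflight must route through Provisioner.IsRunning (SSOT)")
	}
}
```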

Tests

internal/handlers/a2a_proxy_preflight_test.go (4 tests, all PASS):

TestPreflight_ContainerRunning_ReturnsNil                PASS
TestPreflight_ContainerNotRunning_StructuredFastFail     PASS  (sqlmock pins offline-flip UPDATE + structure_events insert)
TestPreflight_TransientError_FailsSoftAsAlive            PASS
TestProxyA2A_Preflight_RoutesThroughProvisionerSSOT      PASS  (AST gate)

Plus go test ./internal/handlers/... full suite — green. go vet ./... clean.
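
The sqlmock pinning in test 2 (the same expectation Mutation 2 below trips) follows this pattern. A sketch: the SQL text and the stand-in function are assumptions, since the real test drives the handler rather than raw queries:

```go
package handlers

import (
	"database/sql"
	"testing"

	sqlmock "github.com/DATA-DOG/go-sqlmock"
)

// Hypothetical stand-in for the offline-flip side effect; in the real code
// this happens inside the preflight's dead-container branch.
func markWorkspaceOffline(db *sql.DB, workspaceID string) error {
	if _, err := db.Exec(
		`UPDATE workspaces SET status = 'offline' WHERE id = $1`, workspaceID); err != nil {
		return err
	}
	_, err := db.Exec(
		`INSERT INTO structure_events (workspace_id, kind) VALUES ($1, 'WORKSPACE_OFFLINE')`,
		workspaceID)
	return err
}

func TestOfflineFlip_SqlmockPin_Sketch(t *testing.T) {
	db, mock, err := sqlmock.New()
	if err != nil {
		t.Fatalf("sqlmock: %v", err)
	}
	defer db.Close()

	// Pin both side effects: if either query never executes (Mutation 2's
	// silent pass-through), ExpectationsWereMet reports the unmet call.
	mock.ExpectExec(`UPDATE workspaces SET status`).
		WithArgs("ws-id").
		WillReturnResult(sqlmock.NewResult(0, 1))
	mock.ExpectExec(`INSERT INTO structure_events`).
		WithArgs("ws-id").
		WillReturnResult(sqlmock.NewResult(1, 1))

	if err := markWorkspaceOffline(db, "ws-id"); err != nil {
		t.Fatalf("offline flip: %v", err)
	}
	if err := mock.ExpectationsWereMet(); err != nil {
		t.Errorf("unmet DB expectations: %v", err)
	}
}
```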

Mutation tests

Mutation 1: if running { return nil } → if true { return nil } (skip the not-running guard)

internal/handlers/a2a_proxy_helpers.go:221:2: declared and not used: running
build failed

Compile error stops it before runtime. Stricter than a runtime test failure.

Mutation 2: Replace the !running branch with return nil (silent pass-through)

The TestPreflight_ContainerNotRunning_StructuredFastFail test fails at sqlmock's "expected DB call did not occur" — the offline-flip UPDATE is never executed because the function exited before reaching the side-effect path. Verified end-to-end.

Hostile self-review — three weakest spots

  1. Adds one docker-inspect per a2a-call when platformInDocker. ~1-5ms cost. At the volume reno-stars sees (~10 a2a/sec at peak), that's a measurable but small overhead. Could be cached with a TTL if hot-path latency budgets tighten; not doing that today because cache invalidation correctness is its own design problem and the current cost is acceptable.

  2. The gate condition is strings.HasPrefix(agentURL, "http://"+provisioner.ContainerName(workspaceID)+":"). Sensitive to a future change in ContainerName or InternalURL. If those drift, the preflight silently stops firing. Mitigation: the existing Provisioner.IsRunning is unchanged shape; dispatchA2A's reactive path still catches the failure (just slower). So this is a degrade-gracefully, not a fail-loudly, drift.

  3. Doesn't address the cpProv (SaaS / EC2 status) path. The preflight only fires for local-Docker tenants. For SaaS-EC2 deployments where h.provisioner == nil && h.cpProv != nil, the forward still runs optimistically and maybeMarkContainerDead catches dead EC2s reactively. Extending preflight to the cpProv path is a logical follow-up but a separate code path with different cost/latency tradeoffs (CP HTTP call vs Docker-inspect) — out of scope for this PR.

Versioning + backwards-compat

  • New 503 response shape includes "preflight": true. Canvas already renders both shapes as the same toast today; tightening the canvas rendering is the orthogonal follow-up I noted in the issue.
  • No schema, API version, or migration impact.
  • Operationally additive: when preflight doesn't fire (non-Docker URL, no provisioner), behavior is identical to today.

Rollout / rollback

  • Rollout: merge → next workspace-server release picks it up. No multi-step rollout.
  • Rollback: git revert the merge OR set the gate to always-false (if false &&). Reactive path is preserved either way.

Out of scope (parked as separate follow-ups)

  • Canvas-side rendering fix to read the structured 503 body and show distinct toasts (preflight: true vs reactive containerDead). Separate canvas PR.
  • CP-provisioner preflight for SaaS-EC2 tenants. Different cost/latency profile; warrants its own design.
  • EC2 reconciler timing post-CP#20 replaces — root-cause for why the agent container is missing in the first place. CP-side concern; this PR closes the symptom-side latency.

🤖 Generated with Claude Code

Ghost added 1 commit 2026-05-07 18:16:07 +00:00
fix(workspace-server): a2a-proxy preflight container check (closes #36)
Some checks failed
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 56s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 1m25s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m25s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m37s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m38s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m46s
CI / Platform (Go) (pull_request) Successful in 2m44s
be5fbb5ad3
Same SSOT-divergence shape as #10 / fixed in #12, but on the a2a-proxy
code path. The plugins handler was already routed through `provisioner.RunningContainerName`;
a2a-proxy was forwarding optimistically and only catching missing containers
REACTIVELY via `maybeMarkContainerDead` after the network call timed out.

Result on tenants whose agent containers had been recycled (e.g. post-EC2
replace from molecule-controlplane#20): canvas waits 2-30s for the network
forward to fail before getting a 503, and the workspace-server logs only
"ProxyA2A forward error" without the "container is dead" signal.

This PR adds a proactive `Provisioner.IsRunning` check in `proxyA2ARequest`
between `resolveAgentURL` and `dispatchA2A`, gated on the conditions where
we know we're talking to a sibling Docker container we own (`h.provisioner
!= nil` AND `platformInDocker` AND the URL was rewritten to Docker-DNS form).

Three outcomes via the SSOT helper:
  (true,  nil) → forward as today
  (false, nil) → fast-503 with `error="workspace container not running —
                 restart triggered"`, `restarting=true`, `preflight=true`,
                 plus the same offline-flip + WORKSPACE_OFFLINE broadcast +
                 async restart that `maybeMarkContainerDead` produces
  (true,  err) → fall through to optimistic forward (matches IsRunning's
                 "fail-soft as alive" contract — flaky daemon must not
                 trigger a restart cascade)

The `preflight=true` flag in the response distinguishes the proactive
short-circuit from the reactive `maybeMarkContainerDead` path so canvas
or downstream callers can render distinct messages later.

* `internal/handlers/a2a_proxy.go` — preflight call site between
  resolveAgentURL and dispatchA2A; gated on `h.provisioner != nil &&
  platformInDocker && url == http://<ContainerName(id)>:port`.
* `internal/handlers/a2a_proxy_helpers.go` — `preflightContainerHealth`
  helper. Routes through `h.provisioner.IsRunning` (which itself wraps
  `RunningContainerName`). Identical offline-flip side-effects as
  `maybeMarkContainerDead` for the dead-container case.
* `internal/handlers/a2a_proxy_preflight_test.go` — 4 tests: running →
  nil; not-running → structured 503 + sqlmock expectations on the
  offline-flip + structure_events insert; transient error → nil
  (fail-soft); AST gate pinning the SSOT routing (mirror of #12's gate).

Mutation-tested: removing the `if running { return nil }` guard makes
the production code fail to compile (unused var). A subtler mutation
(replacing the !running branch with `return nil`) would make
TestPreflight_ContainerNotRunning_StructuredFastFail fail at runtime
with sqlmock's "expected DB call did not occur."

Refs: molecule-core#36. Companion to #12 (issue #10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-ceo-assistant merged commit d2da0c8d34 into main 2026-05-07 18:25:07 +00:00