Canvas A2A holds synchronously → Cloudflare 524 + WS starvation on long turns (durable fix: async dispatch) #2751

Open
opened 2026-06-13 12:26:07 +00:00 by core-devops · 1 comment

Root cause (DevTools on JRS, long migrate turn ~300s)

The canvas→agent /a2a POST is held open synchronously for the entire agent turn. On turns longer than Cloudflare's ~100s edge limit this produces:

Failed to load /workspaces/<id>/a2a → 524   (CF 'A Timeout Occurred')
WebSocket connection to wss://<tenant>/ws failed   (concurrent)
  • 524: CF gives up at ~100s while the proxy still holds the connection waiting for the agent. Raising server-side timeouts (#2727 idle, #2749 ResponseHeaderTimeout) does NOT help — CF caps at 100s first.
  • WS failures: coincide with the held turn. Ruled out CORS (CheckOrigin passes for the tenant origin — bad origin→403, correct origin→not-403; CORS_ORIGINS is set to the tenant URL in provisioner ec2.go). The WS drop appears to be connection-path saturation/starvation during the long synchronous hold, not an auth/routing bug.

Mitigations already shipped (stop the user-facing symptom)

  • #2745: clear the 'unreachable' banner whenever the agent is thinking.
  • #2750: treat a 524/522/504 as 'still processing' (not 'unreachable'); reply arrives via WS.
    These stop the false banner, but live reply delivery still depends on the WS, and the 524 still aborts the held request.
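
To pin down the rule from #2750, here is a minimal sketch of the status classification. The canvas client is not written in Go, so this is illustrative only; the function name `stillProcessing` is hypothetical, and treating every other status as outside the rule is an assumption (the issue lists only 524, 522, and 504).

```go
package main

import (
	"fmt"
	"net/http"
)

// stillProcessing reports whether a status seen on the held /a2a request
// should be shown as "still processing" rather than "unreachable".
// 522 and 524 are Cloudflare-specific codes with no net/http constant.
// Hypothetical helper: the name and the default branch are assumptions.
func stillProcessing(status int) bool {
	switch status {
	case http.StatusGatewayTimeout, 522, 524:
		return true
	default:
		return false
	}
}

func main() {
	for _, s := range []int{200, 502, 504, 522, 524} {
		fmt.Printf("%d still processing: %v\n", s, stillProcessing(s))
	}
}
```

Even with this classification in place, the reply itself still has to arrive over the WS, which is why the durable fix is needed.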

Durable fix (the real one — needs design review)

Async canvas dispatch: /a2a from the canvas should return promptly (well under 100s) with {status:"queued"} and deliver the agent's reply via the AGENT_MESSAGE WebSocket event — the SAME contract already used for poll-mode/external workspaces. Today push-mode workspaces (those with a URL, e.g. JRS SEO agent) hold the HTTP connection for the whole turn; switching the canvas path to always-async removes the 100s ceiling entirely and frees the connection path (which should also resolve the concurrent WS failures).

Secondary: confirm WS reliability under load, and add a polling fallback for chat history so that a dropped WS still surfaces the reply without a manual reload.

This is a core chat-flow change — flagging for CTO review rather than a hot rewrite. It supersedes the timeout-raise workarounds (#2727/#2749) for the canvas path.


Design proposal — async canvas A2A dispatch (cap-and-queue)

Grounded findings (this codebase):

  • The canvas /a2a POST is held synchronously by proxyA2ARequest for the whole agent turn; a turn > Cloudflare's ~100s edge limit returns 524 (the recurring "Failed to send").
  • The agent's reply reaches the canvas via the events.Broadcaster (AGENT_MESSAGE over WS) independently of the held HTTP response — proven by applyIdleTimeout keying off broadcaster activity, and by the existing client-side contract (useChatSend abandons at 120s and the reply still arrives via WS). So dropping the held connection does NOT lose the reply.

Proposal: for canvas callers (callerID == ""), cap the synchronous wait at a CF-safe window (A2A_CANVAS_SYNC_BUDGET, default ~90s). If the agent hasn't returned headers by then, return {status:"queued", delivery_mode:"poll"} to the canvas (the exact shape the client already handles), while the dispatch to the agent continues on a detached context (NOT the request context) so the turn is never cancelled. The reply lands via the existing AGENT_MESSAGE broadcast. Turns < 90s are unchanged (inline reply as today).

The one real risk to validate (the crux): detaching the agent dispatch from the request context without (a) cancelling the in-flight turn, (b) double-delivering (inline AND broadcast), or (c) leaking goroutines. Mitigations: background the upstream call with a context derived from a long-lived parent (idle/ceiling-bounded, as today) rather than the gin request ctx; suppress the inline reply once we've returned queued; reuse the existing dedup (the client already dedups AGENT_MESSAGE by messageId).
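
Risk (b) can be illustrated with two small guards. This is a sketch under assumptions, not the existing code: `responseGate` and `seenMessages` are hypothetical types. The gate models "suppress the inline reply once we've returned queued" on the server side, and the set models the dedup by messageId that the client is described as already doing.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// responseGate lets exactly one path (the inline reply or the queued
// acknowledgement) write the HTTP response. Hypothetical type.
type responseGate struct{ state int32 }

// claim returns true for the first caller only.
func (g *responseGate) claim() bool {
	return atomic.CompareAndSwapInt32(&g.state, 0, 1)
}

// seenMessages models dedup of AGENT_MESSAGE events by message ID.
// Hypothetical type; the real dedup lives in the canvas client.
type seenMessages struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenMessages() *seenMessages {
	return &seenMessages{seen: make(map[string]struct{})}
}

// firstTime reports whether id has not been delivered before and
// records it as delivered.
func (s *seenMessages) firstTime(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[id]; ok {
		return false
	}
	s.seen[id] = struct{}{}
	return true
}

func main() {
	var g responseGate
	fmt.Println(g.claim(), g.claim()) // true false

	d := newSeenMessages()
	fmt.Println(d.firstTime("m1"), d.firstTime("m1")) // true false
}
```

Whichever of the budget timer and the agent reply fires first would call `claim()`; the loser skips writing to the response, so the canvas never sees both an inline reply and a queued acknowledgement for the same turn.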

Test + CI wiring: unit tests in a2a_proxy_test.go (queued-at-budget, no-cancel, no-double-deliver) in the blocking Go gate; an e2e-chat scenario for a >90s turn asserting queued-then-WS-reply. This removes the need for the #2727/#2749 timeout raises on the canvas path (they stay as the agent-to-agent backstop).

Decision needed: approve this approach (cap-and-queue + detached dispatch) before I implement — it touches the core chat path for every turn. Alternatives considered: (1) raise CF's edge timeout — not possible on the current plan; (2) chunked/streaming 200 early from the runtime — a runtime change, heavier, can layer on later.

cc @ for sign-off. Filed from the long-turn-timeout work (#2723/#2727/#2749/#2745/#2750).

Reference: molecule-ai/molecule-core#2751