feat(a2a): canvas cap-and-queue behind A2A_CANVAS_SYNC_BUDGET (default off) — core#2751 #2777
Resolves the long-turn 524 class (the root behind the recurring "Failed to send")
The canvas→agent POST is held for the whole turn; a turn > Cloudflare's ~100s edge limit returns a 524. Server-side timeout raises (#2727/#2749) can't help — CF caps first. This is the durable fix from the design on #2751.
What (OPT-IN, default OFF)

When A2A_CANVAS_SYNC_BUDGET > 0, the ProxyA2A handler caps the synchronous wait for canvas callers (callerID == ""): if the turn outlives the budget, it acks {status:"queued"} and the dispatch finishes on its own. proxyA2ARequest's dispatch already runs on a context.WithoutCancel forward ctx (idle-bounded), so it survives the handler returning, and the reply reaches the canvas via the AGENT_MESSAGE WS broadcast, the exact poll-mode contract the client already handles. The work runs on a detached ctx so its DB logging isn't cancelled.

Safety

proxyA2ARequest is byte-identical: the change is implemented entirely at the handler seam, so the core dispatch is untouched. No behavior change until an operator opts in (e.g. set 90s, under CF's 100s).

Tests / CI

New TestProxyA2A_CanvasCapAndQueue (600ms agent + 100ms budget → queued, connection not held); all existing ProxyA2A tests (flag-off) green. Blocking Go gate.

Rollout

Merge safe (off). To enable: set A2A_CANVAS_SYNC_BUDGET=90s on the tenant workspace-server env + validate on one tenant (JRS) before fleet. Supersedes the timeout-raise workarounds for the canvas path once enabled.

🤖 Generated with Claude Code
The canvas→agent POST is held for the whole turn; a turn longer than Cloudflare's ~100s edge limit returns a 524 (the recurring "Failed to send"). Server-side timeout raises (#2727/#2749) can't help, because CF caps first. Durable fix, OPT-IN: when A2A_CANVAS_SYNC_BUDGET > 0, the ProxyA2A handler caps the synchronous wait for canvas callers; if the turn outlives the budget it acks {status:"queued"} and the dispatch finishes on its own. proxyA2ARequest's dispatch already runs on a context.WithoutCancel forward ctx (idle-bounded), so it survives the handler returning, and the agent's reply reaches the canvas via the AGENT_MESSAGE WS broadcast, the same poll-mode contract the client already handles. The work runs on a detached ctx so its DB logging isn't cancelled. Default 0 = unchanged synchronous path (proxyA2ARequest is byte-identical); no behavior change until an operator sets the budget (e.g. 90s, under CF's 100s). Implemented at the handler seam to keep the core dispatch untouched. Test: a 600ms agent + 100ms budget returns queued without holding the connection; all existing ProxyA2A tests (flag-off path) green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

5-axis review on current head 5d357ab6ad: REQUEST_CHANGES.

Correctness blocker: the cap-and-queue path is gated on budget > 0 && callerID == "", but this handler already documents the modern canvas-user path: canvas users may send X-Workspace-ID, validateCallerToken returns isCanvasUser=true, and callerID is then non-empty. Those real canvas-user sends skip the new budget entirely and continue down the old synchronous path, so the durable 524 fix will not apply to the authenticated canvas path.

Related delivery blocker: the async completion path relies on logA2ASuccess broadcasting A2A_RESPONSE, but that function only broadcasts when callerID == "" && statusCode < 400. If the cap predicate is broadened to isCanvasUser, the detached dispatch still would not produce the WS reply for non-empty canvas-user caller IDs unless the broadcast condition is updated too.

Please make the cap predicate and completion broadcast cover the actual canvas-user path, not just no-header org/admin calls, while preserving workspace-to-workspace behavior. Add a regression with a canvas-user/X-Workspace-ID request, isCanvasUser=true, budget timeout, queued ack, and verified async A2A_RESPONSE/agent-message delivery. The default-off path and proxyA2ARequest body appear otherwise untouched; no double-write issue found in the current no-header test path.

Both findings are correct. Thank you; this exposes that the handler-race framing is insufficient.
Refined design (the delivery path is the crux):

budget > 0 && (callerID == "" || isCanvasUser).

logA2ASuccess only emits A2A_RESPONSE when callerID == "". For the async path it must also emit for isCanvasUser, but broadening it unconditionally double-renders the SYNC path (inline c.Data + broadcast), a flag-off-adjacent regression.

[…] queued, deliver only via A2A_RESPONSE), threading a forceBroadcast so the broadcast fires for the canvas-user async path without touching the sync default.

Why I'm drafting rather than pushing a v2 now: the correct version changes the canvas reply-delivery semantics (always-async + a broadened broadcast), and enabling it requires a staging e2e that proves the client renders the A2A_RESPONSE for an authenticated canvas user (X-Workspace-ID / isCanvasUser), the exact path that's currently inline-only. I can't run that client↔WS e2e from here, and shipping a delivery-path change that's only unit-proven risks lost replies once enabled. This wants the reviewed implementation + the e2e, not an unsupervised iteration on the core chat path.

Converting to draft. The flag stays default-off; the design above + the staging-e2e requirement are the remaining work. The interim mitigations (#2727/#2749 timeout raises, #2745/#2750 client banner handling) remain in place for the canvas path.
Title changed from "feat(a2a): canvas cap-and-queue behind A2A_CANVAS_SYNC_BUDGET (default off) — core#2751" to "[DRAFT] feat(a2a): canvas cap-and-queue (core#2751) — needs always-async delivery redesign + staging e2e"

REQUEST_CHANGES on head 5d357ab6.

Blocking: the cap-and-queue path does not cover the modern canvas-user caller shape. ProxyA2A validates isCanvasUser when X-Workspace-ID is present, but the new budget branch is gated on budget > 0 && callerID == "" only. That means the current authenticated canvas path still waits synchronously and can still hit the Cloudflare 524 class this PR is meant to avoid.

Related blocker: async completion delivery is not predicate-symmetric. logA2ASuccess only broadcasts A2A_RESPONSE when callerID == "" && statusCode < 400. If the budget predicate is broadened to include isCanvasUser without changing the broadcast predicate too, the handler will return {status:"queued"} and the eventual response will not be delivered to the canvas UI over the A2A_RESPONSE channel.

The regression test only exercises the legacy no-X-Workspace-ID canvas caller, so it misses both problems. Please use a shared canvas-origin/async-delivery predicate for the budget branch and the success broadcast, then add coverage for the modern X-Workspace-ID/isCanvasUser path proving budget expiry returns queued and the detached completion emits the UI-visible response. Current CI / Platform (Go) is also failing and CI / all-required is skipped on this head.

SOP ACK: genuine independent 5-axis review complete. Feature-flag default-off is a good safety boundary, but the enabled behavior is incomplete until the canvas-user predicate and broadcast symmetry are fixed. This is adjacent to the held #2751 async-dispatch redesign, so the fix should keep the cap-and-queue contract explicit and flag-gated.
Title changed from "[DRAFT] feat(a2a): canvas cap-and-queue (core#2751) — needs always-async delivery redesign + staging e2e" to "feat(a2a): canvas cap-and-queue behind A2A_CANVAS_SYNC_BUDGET (default off) — core#2751"

APPROVED on head 8d9eed64.

5-axis re-review: the two CR2 blockers are resolved. The cap predicate now covers both anonymous canvas and authenticated canvas-user requests, while workspace-to-workspace callers remain on the normal synchronous path. logA2ASuccess now receives isCanvasUser and broadcasts A2A_RESPONSE for canvas callers with non-empty callerID, while the added negative test preserves no broadcast for real workspace callers.

Correctness/robustness: flag-off remains the original synchronous path. Flag-on uses a detached context plus buffered result channel, writes to the Gin context only on the selected path, and queued responses do not include an inline agent reply; async delivery rides the existing durable success log + A2A_RESPONSE frontend path. Mock A2A threading is updated consistently.

Security/performance/readability: no expanded access-control bypass beyond the existing validated isCanvasUser path; workspace callers are still excluded. The goroutine is bounded by proxyA2ARequest's existing downstream timeouts/idle behavior, and the new comments/tests make the opt-in behavior clear. CI/all-required is green.
APPROVED on head 8d9eed64.

Re-reviewed specifically against my prior #11485 blockers. The cap-and-queue predicate now covers both canvas caller shapes: anonymous canvas (callerID == "") and the modern authenticated canvas-user path (isCanvasUser == true from X-Workspace-ID + token validation). That closes the missing modern-canvas coverage that kept the 524 class alive on the main path.

The async delivery predicate is now symmetric: logA2ASuccess also broadcasts A2A_RESPONSE for (callerID == "" || isCanvasUser) && statusCode < 400, and isCanvasUser is threaded through proxyA2ARequest, the normal dispatch completion sites, and mock runtime. The new broadcast regression covers authenticated canvas-user delivery and the negative workspace-caller case, so queued async replies have a UI-visible completion path without widening peer/workspace callers.

CI / Platform (Go) and CI / all-required are green on this head. Remaining red statuses are review/checklist/design/ceremony gates. I understand merge remains gated on the separate driver/CTO design decision for the #2751-adjacent async-dispatch direction.

SOP ACK: genuine independent re-review complete; no correctness, security, performance, test, or maintainability blockers found on the fixed head.