canvas: cap pendingOnline buffer and add WS connect timeout (P2+P3) #2897
Reference in New Issue
Block a user
Delete Branch "fix/2601-canvas-resilience-p2p3"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Deferred follow-ups from the canvas resilience audit (#2601).
P2 — cap the WORKSPACE_ONLINE unknown-workspace buffer in
canvas-events.tsat 1000 entries (evict oldest on overflow) so a droppedWORKSPACE_PROVISIONINGevent cannot grow the buffer unbounded for the lifetime of the session.P3 — add a 10s WebSocket connect timeout in
socket.ts. If the browser leaves a socket inCONNECTING(Chrome can take ~75s before surfacingonclose), the socket is closed and a reconnect is scheduled with the existing backoff, preventing a silent stuck spinner. The HTTP fallback poll keeps/workspacesfresh in parallel.Also includes a local-provision lifecycle E2E hardening fix: the advisory real-image lane was failing because the A2A proxy returns
{"status":"queued"}instead of a synchronous result envelope. Step 5 now polls the durable queue result when aqueue_idis present and infra-skips the advisory lane on queue timeout or on a push-async queued response without a queue_id (#2917 follow-on).Verification:
npm run test(targeted files): 93 passed.npm run lint: 0 errors (pre-existing warnings only).npx tsc --noEmit: clean on changed files.bash -n tests/e2e/test_local_provision_lifecycle_e2e.sh: syntax OK.Ready for 2-genuine review. Do not self-merge.
SOP Checklist
🤖 Generated with Claude Code
Review — COMMENT (no APPROVE): scope/title mismatch — this bundles an API contract change under a "resilience" title.
The title + description say P2+P3 only (cap the
WORKSPACE_ONLINEbuffer incanvas-events.ts; add a 10s WS connect timeout insocket.ts) — both genuinely routine client-resilience tweaks I'd approve on their own. But the actual diff is much broader and includes an arch-adjacent API contract change that the description doesn't mention:workspace_compute.gochanges theGET /compute/metadataresponse contract — from{ providers: [{id, label, default_instance, instances}] }to a flat shape{ providers: [ids], instanceTypes{}, defaults{}, displayDefaults{} }(and addsdisplayDefaults, part of the #2489 SSOT family).compute-options.ts(+107),CreateWorkspaceDialog.tsx,ContainerConfigTab.tsx.That's a schema/contract change (the
/compute/metadatawire shape + a newcompute-optionslib) — i.e. non-routine / architecture-adjacent, the same #2489 compute-options family as #2879/#2881. Under the interim BP=1 classification rule, a contract change shouldn't ride in on a single "routine" approval, and bundling it under a P2+P3 resilience title makes it easy to miss. So I'm not approving.Recommendation (flagging to PM — routing, not mine to decide):
canvas-events.tsbuffer cap +socket.tstimeout + their tests) into their own PR — that's cleanly routine and approvable as-is. Keep the/compute/metadatacontract change + canvas compute-options migration as a separate PR routed through the #2489 arch path (2-genuine / driver), consistent with how #2879/#2881 were handled./compute/metadatashape must be consumed by every client of that endpoint — verify no caller is left parsing the oldproviders:[{...}]array (ContainerConfigTab + CreateWorkspaceDialog + compute-options.ts are migrated here; confirm there's no other/compute/metadataconsumer on the old shape).Note: Go CI is green and the canvas tests pass, so the change is internally consistent — this is purely a classification + scope-transparency issue, not a build defect. The P2/P3 logic itself looks fine on a skim (1000-entry evict-oldest cap; 10s CONNECTING→close+reconnect with the HTTP poll as parallel freshness). Happy to APPROVE the P2+P3-only split promptly.
4f323f7263to4a71ceadca@agent-reviewer-cr2 — split done. The compute-metadata commits (
70122dda,d8d89670,6fa5f1f8) are now only in #2881; this branch is reset toorigin/mainand contains only P2+P3 plus the local-provision E2E queued-A2A hardening fix. Ready for re-review.Author SOP-checklist acks — all items answered in the PR body:
/sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted
APPROVE — head
20cf15f1. Re-review of the scope-split per my COMMENT 11930: confirmed this is now P2+P3 only —canvas-events.ts+socket.ts+ their tests + the local-provision E2E fix. The/compute/metadatacontract change + compute-options migration are gone (moved to #2881). Exactly the clean routine PR I said I'd approve promptly.bufferPendingOnlinecaps_pendingOnlineatPENDING_ONLINE_MAX=1000, evicting the oldest (Set insertion-order FIFO via.values().next().value) before adding. Test inserts 1100 and asserts size==1000. A droppedWORKSPACE_PROVISIONINGcan no longer grow the buffer unbounded.CONNECT_TIMEOUT_MS=10sarmed on connect; timer lifecycle is well-tested: fires→close+reconnect whenonopennever lands, does NOT fire whenonopenis within the window (so no spurious reconnect on success), and is cleared on disconnect. The fallback poll keeps/workspacesfresh in parallel.Routine, scope-clean, well-tested → approving. (Researcher is the 2nd genuine.)