fix(canvas): polite tasks/cancel before /workspaces/:id/restart for Stop All (task #377 companion) #1619
Reference in New Issue
Block a user
Delete Branch "fix/377-canvas-polite-cancel-before-restart"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Companion to
molecule-ai-workspace-template-claude-codePR#40 (fix/377-stop-all-propagation, core-be). Task #377.PR#40 adds a fast-cancel path on the runtime side (
executor.cancel()->killpgthe CLI subprocess group), gated onMOLECULE_STOP_PROPAGATE=true. Until canvas issues the A2Atasks/cancelJSON-RPC at the workspace before the heavy/restart, that runtime path is INERT in production - nothing ever reachesexecutor.cancel(). Flipping the env var would give zero canary signal. This PR closes that gap.Fix - two-phase Stop All in
Toolbar.tsx/workspaces/:id/a2awith{method:"tasks/cancel", params:{}}for every active workspace in parallel.workspace-server/internal/handlers/a2a_proxy.goforwards the envelope verbatim to the runtime; the a2a-sdk framework dispatchestasks/canceltoAgentExecutor.cancel()(template-claude-codeclaude_sdk_executor.py:853).TASK_UPDATEDWS pushes viacanvas-events.ts:400) every 250ms for up to 8000ms. Drained workspaces (activeTasks=0) are excluded from phase 3./workspaces/:id/restart. Behavior is a strict superset of pre-fix Stop All - stuck workspaces still get the hammer, well-behaved ones are spared.Upstream citation
Per
feedback_upstream_docs_first_before_hypothesizing:a2a/compat/v0_3/types.py:1125pins the wire method literal:method: Literal['tasks/cancel'] = 'tasks/cancel'. Matches the slash-notation our codebase already uses for"message/send"(workspace-server/internal/handlers/delegation.go:155,canvas/src/components/tabs/ScheduleTab.tsx:168).Test plan
Four new specs in
Toolbar.test.tsx:tasks/cancelvia/a2afor every active workspace before any/restart(order + envelope shape assertion).activeTasksdrains to 0 during the poll window,/restartis NOT called.activeTasksdoes NOT drain inside the 8s timeout,/restartis called for each stuck workspace (with phase-1-before-phase-3 order assertion)./restartis called only for the stuck one.Full canvas vitest suite locally: 3360 passed, 1 skipped, 0 failed. Toolbar file: 25 tests (21 prior + 4 new).
No new dependencies. No new env vars. No DB migrations.
Refs: task #377, template-claude-code PR#40.
Companion to template-claude-code PR#40 (fix/377-stop-all-propagation, core-be). PR#40 adds a fast-cancel path on the runtime side (executor.cancel -> killpg the CLI subprocess group), gated on MOLECULE_STOP_PROPAGATE=true. But until canvas issues the A2A tasks/cancel JSON-RPC at the workspace before the heavy /restart, that runtime path is INERT in production - nothing ever reaches executor.cancel(). Flipping the env var would produce zero canary signal. Fix - two-phase Stop All in Toolbar.tsx: Phase 1: POST /workspaces/:id/a2a with method "tasks/cancel" and empty params for every workspace that has activeTasks>0, in parallel. The workspace-server a2a_proxy.go forwards the envelope verbatim to the runtime; a2a-sdk dispatches tasks/cancel to AgentExecutor.cancel() on the runtime side (claude_sdk_executor.py cancel() at line 853). Phase 2: poll the canvas Zustand store (TASK_UPDATED pushes drive the active_tasks field via canvas-events.ts:400) every 250ms for up to 8000ms. Drained workspaces (activeTasks=0) are removed from the to-restart set. Phase 3: for any workspace that did NOT drain inside the timeout - runtime on an old image without the cancel hook, or cancel propagation stuck - fall through to the original heavy /workspaces/:id/restart. Behavior is a strict superset of pre-fix Stop All: stuck workspaces still get the hammer, well-behaved workspaces are spared. Upstream wire-shape citations (per feedback_upstream_docs_first_before_hypothesizing): - A2A protocol spec section 9.4.5 "CancelTask" - JSON-RPC binding for the abstract Cancel Task operation (https://a2a-protocol.org/latest/specification/) - a2a-sdk 1.0.3 a2a/compat/v0_3/types.py line 1125 pins the wire method literal: `method: Literal['tasks/cancel'] = 'tasks/cancel'`. Matches the slash-notation our codebase already uses for "message/send" (workspace-server/internal/handlers/delegation.go:155, canvas/src/components/tabs/ScheduleTab.tsx:168). Tests - 4 new specs in Toolbar.test.tsx covering each phase: 1. phase 1 dispatches tasks/cancel via /a2a for every active workspace BEFORE any /restart (order assertion + envelope shape). 2. when activeTasks drains to 0 during the poll window, /restart is NOT called. 3. when activeTasks does NOT drain inside the 8s timeout, /restart is called for each stuck workspace (with phase-1-before-phase-3 order assertion). 4. selective drain - one workspace drains, the other doesn't; /restart is called only for the stuck one. Full canvas vitest suite: 3360 passed, 1 skipped, 0 failed. Toolbar file alone: 25 tests (21 prior + 4 new). Refs: task #377, template-claude-code PR#40core-qa five-axis lens review (PR#1619 @
52304d9)Reviewing Toolbar.tsx stopAll three-phase polite-cancel + 4 new vitest specs against task #377 acceptance criteria.
1. Correctness (drain-poll behaviour + UI feedback). Verified:
setStopping(true)flips at L114 BEFORE phase 1, button text/aria switch to "Stopping..." at L297/291 immediately,setStopping(false)clears only after phase 3 completes (L168). User does NOT see a frozen UI during the 8s drain — they see the button labelled "Stopping..." and disabled. Poll cadence (250ms) is reasonable: 32 polls max per drain, store-read only (no network), no perceivable jitter. No finding.2. Timeout sizing. 8000ms drain timeout, justified in-line against template-claude-code
_SIGTERM_GRACE_S=5splus WS round-trip buffer for TASK_UPDATED push. Sound bound: short enough that a wedged workspace gets /restart within ~8s (user-acceptable for a "Stop All" click), long enough to cover the 5s grace + 1-2s WS broadcast. No finding.3. Selective fallthrough. Phase-3 selective test (L469-486) directly exercises ws-0 drains + ws-1 stuck → only
/workspaces/ws-1/restartfires, expects 1 restart call, asserts target URL. Implementation maintainsundrainedSet granularly per-id (L142-154). No finding.4. Backwards compat (older runtimes lacking
tasks/cancel). Phase-1 errors are .catch'd to no-op (L129), so a 4xx/5xx/timeout from an old runtime simply skips the cancel and falls into the drain-poll. activeTasks will not decrement (no cancel happened), drain times out at 8s, phase 3 hits/restart— strictly status-quo behaviour. Test phase-3-fallthrough (L438-465) pins this ordering. No finding.5. Test-suite signal. 4 new specs cover the spec-mandated cases (phase ordering, drain-skip, timeout-fallthrough, partial selective). Test setup at L94-160 wires the live useCanvasStore so drain-poll reads fresh state. vitest 3360 passing claim per PR body. No finding.
Verdict: APPROVED. Logic is clean, tests cover all stated invariants, fallthrough is conservative.
core-qa five-axis lens review (PR#1619 @
52304d9)Reviewing Toolbar.tsx stopAll three-phase polite-cancel + 4 new vitest specs against task #377 acceptance criteria.
1. Correctness (drain-poll behaviour + UI feedback). Verified:
setStopping(true)flips at L114 BEFORE phase 1, button text/aria switch to "Stopping..." at L297/291 immediately,setStopping(false)clears only after phase 3 completes (L168). User does NOT see a frozen UI during the 8s drain — they see the button labelled "Stopping..." and disabled. Poll cadence (250ms) is reasonable: 32 polls max per drain, store-read only (no network), no perceivable jitter. No finding.2. Timeout sizing. 8000ms drain timeout, justified in-line against template-claude-code
_SIGTERM_GRACE_S=5splus WS round-trip buffer for TASK_UPDATED push. Sound bound: short enough that a wedged workspace gets /restart within ~8s (user-acceptable for a "Stop All" click), long enough to cover the 5s grace + 1-2s WS broadcast. No finding.3. Selective fallthrough. Phase-3 selective test (L469-486) directly exercises ws-0 drains + ws-1 stuck → only
/workspaces/ws-1/restartfires, expects 1 restart call, asserts target URL. Implementation maintainsundrainedSet granularly per-id (L142-154). No finding.4. Backwards compat (older runtimes lacking
tasks/cancel). Phase-1 errors are.catch'd to no-op (L129), so a 4xx/5xx/timeout from an old runtime simply skips the cancel and falls into the drain-poll. activeTasks will not decrement (no cancel happened), drain times out at 8s, phase 3 hits/restart— strictly status-quo behaviour. Test phase-3-fallthrough (L438-465) pins this ordering. No finding.5. Test-suite signal. 4 new specs cover the spec-mandated cases (phase ordering, drain-skip, timeout-fallthrough, partial selective). Test setup at L94-160 wires the live useCanvasStore so drain-poll reads fresh state. vitest 3360 passing claim per PR body. No finding.
Verdict: APPROVED. Logic is clean, tests cover all stated invariants, fallthrough is conservative.
core-qa five-axis lens review (PR#1619 @
52304d9)Reviewing Toolbar.tsx stopAll three-phase polite-cancel + 4 new vitest specs against task #377 acceptance criteria.
1. Correctness (drain-poll behaviour + UI feedback). Verified:
setStopping(true)flips at L114 BEFORE phase 1, button text/aria switch to "Stopping..." at L297/291 immediately,setStopping(false)clears only after phase 3 completes (L168). User does NOT see a frozen UI during the 8s drain — they see the button labelled "Stopping..." and disabled. Poll cadence (250ms) is reasonable: 32 polls max per drain, store-read only (no network), no perceivable jitter. No finding.2. Timeout sizing. 8000ms drain timeout, justified in-line against template-claude-code
_SIGTERM_GRACE_S=5splus WS round-trip buffer for TASK_UPDATED push. Sound bound: short enough that a wedged workspace gets /restart within ~8s (user-acceptable for a "Stop All" click), long enough to cover the 5s grace + 1-2s WS broadcast. No finding.3. Selective fallthrough. Phase-3 selective test (L469-486) directly exercises ws-0 drains + ws-1 stuck → only
/workspaces/ws-1/restartfires, expects 1 restart call, asserts target URL. Implementation maintainsundrainedSet granularly per-id (L142-154). No finding.4. Backwards compat (older runtimes lacking
tasks/cancel). Phase-1 errors are.catch'd to no-op (L129), so a 4xx/5xx/timeout from an old runtime simply skips the cancel and falls into the drain-poll. activeTasks will not decrement (no cancel happened), drain times out at 8s, phase 3 hits/restart— strictly status-quo behaviour. Test phase-3-fallthrough (L438-465) pins this ordering. No finding.5. Test-suite signal. 4 new specs cover the spec-mandated cases (phase ordering, drain-skip, timeout-fallthrough, partial selective). Test setup at L94-160 wires the live useCanvasStore so drain-poll reads fresh state. vitest 3360 passing claim per PR body. No finding.
Verdict: APPROVED. Logic is clean, tests cover all stated invariants, fallthrough is conservative.
core-devops five-axis lens review (PR#1619 @
52304d9)1. A2A endpoint accepts
tasks/cancel(empirical). Read workspace-server/internal/handlers/a2a_proxy.go @ main —normalizeA2APayloadwraps in JSON-RPC envelope and forwards themethodas-is. NO method allowlist, NO switch on method. The v0.2/v0.3params.messageshape-check (L742-779) only fires whenparams.messageis present; fortasks/cancelwithparams:{}it is bypassed.a2aMethodsurfaces inlogA2ASuccess/logA2AReceiveQueued(L509/L418/L553), so attempts will appear in activity_logs withmethod="tasks/cancel". No finding.2. Tiny-PR rule. +91/-2 in Toolbar.tsx is surgical: replaces the single
stopAllcallback body with the three-phase variant + explanatory block-comment. No drive-by refactor. +174/-4 test file is the 4 new specs + shared store-wiring helper. No finding.3. Observability.
tasks/cancelflows through the same A2A log path as every method — Lokimethod="tasks/cancel"will be populated. Canvas side adds no dedicated log for "drain succeeded" vs "fell through to /restart"; activity_logs is the only signal. Non-blocking nit: aconsole.infoat phase-3 entry countingundrained.sizewould help ops correlate Stop-All clicks with fall-through frequency. Non-blocking finding.4. Companion-PR ordering (central devops Q). Two safe orderings. (a) Canvas-first:
tasks/cancelhits an old runtime, a2a-sdk 1.0.3 defaultAgentExecutor.cancel()raisesServerError(UnsupportedOperationError)→ non-2xx → canvas.catchswallows → drain times out → phase 3/restartfires. Strictly status-quo. (b) Runtime-first (PR#40): canvas keeps old behaviour, calls/restartdirectly — also status-quo. Either order is safe. RECOMMENDED: ship PR#40 first, then this PR — that exercises the polite-cancel path against an executor that honors it the moment canvas starts calling, eliminating a transient window where every Stop All is a no-op cancel + restart. Not blocking; just optimal. No blocking finding.5. Risk surface. Fallthrough is conservative (drained skip /restart, undrained get heavy stop). No new route, no new state. Worst case: cancel succeeds but TASK_UPDATED push slow → drain times out → /restart fires — same as pre-fix behaviour. No finding.
Verdict: APPROVED. Generic proxy verified to accept
tasks/cancel, fallthrough is status-quo. Ship PR#40 first if convenient, not blocking./sop-tier-recheck
/qa-recheck
/security-recheck