fix(canvas): polite tasks/cancel before /workspaces/:id/restart for Stop All (task #377 companion) #1619

2026-05-20T21:25:06Z

core-fe commented

2026-05-20 21:25:06 +00:00

Summary

Companion to molecule-ai-workspace-template-claude-code PR#40 (https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-claude-code/pulls/40; fix/377-stop-all-propagation, core-be). Task #377.

PR#40 adds a fast-cancel path on the runtime side (executor.cancel() -> killpg the CLI subprocess group), gated on MOLECULE_STOP_PROPAGATE=true. Until canvas issues the A2A tasks/cancel JSON-RPC at the workspace before the heavy /restart, that runtime path is INERT in production - nothing ever reaches executor.cancel(). Flipping the env var would give zero canary signal. This PR closes that gap.

Fix - three-phase Stop All in `Toolbar.tsx`

- Phase 1: POST /workspaces/:id/a2a with {method:"tasks/cancel", params:{}} for every active workspace in parallel. workspace-server/internal/handlers/a2a_proxy.go forwards the envelope verbatim to the runtime; the a2a-sdk framework dispatches tasks/cancel to AgentExecutor.cancel() (template-claude-code claude_sdk_executor.py:853).
- Phase 2: poll the canvas Zustand store (driven by TASK_UPDATED WS pushes via canvas-events.ts:400) every 250ms for up to 8000ms. Drained workspaces (activeTasks=0) are excluded from phase 3.
- Phase 3: for any workspace that did NOT drain inside the timeout (old runtime image, or cancel propagation stuck) fall through to the original heavy /workspaces/:id/restart. Behavior is a strict superset of pre-fix Stop All - stuck workspaces still get the hammer, well-behaved ones are spared.
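
The three phases above can be sketched as a single async function. This is a minimal sketch, not the code in `Toolbar.tsx`: the `StopAllDeps` interface, the `stopAll` name, and the constant names are hypothetical, and the HTTP helper, store read, and sleep are injected so the flow can be exercised without a browser.

```typescript
// Hypothetical dependency surface; the real Toolbar.tsx wiring is not shown here.
interface StopAllDeps {
  post(path: string, body?: unknown): Promise<unknown>; // HTTP POST helper
  activeTasks(id: string): number; // current count read from the canvas store
  sleep(ms: number): Promise<void>;
}

const POLL_INTERVAL_MS = 250;
const DRAIN_TIMEOUT_MS = 8000;

// Returns the ids that had to be restarted.
async function stopAll(ids: string[], deps: StopAllDeps): Promise<string[]> {
  // Phase 1: polite cancel for every workspace, in parallel. Errors are
  // swallowed so a runtime that rejects tasks/cancel falls through to phase 3.
  await Promise.all(
    ids.map((id) =>
      deps
        .post(`/workspaces/${id}/a2a`, { method: "tasks/cancel", params: {} })
        .catch(() => undefined),
    ),
  );

  // Phase 2: poll until every workspace drains or the time budget is spent.
  let undrained = ids.filter((id) => deps.activeTasks(id) > 0);
  let waitedMs = 0;
  while (undrained.length > 0 && waitedMs < DRAIN_TIMEOUT_MS) {
    await deps.sleep(POLL_INTERVAL_MS);
    waitedMs += POLL_INTERVAL_MS;
    undrained = undrained.filter((id) => deps.activeTasks(id) > 0);
  }

  // Phase 3: heavy restart only for workspaces that did not drain.
  // Error handling for /restart is not described in the PR, so none is shown.
  await Promise.all(undrained.map((id) => deps.post(`/workspaces/${id}/restart`)));
  return undrained;
}
```

Injecting `sleep` lets a test advance the poll loop without real timers; in the browser it would presumably wrap `setTimeout`.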

Upstream citation

Per feedback_upstream_docs_first_before_hypothesizing:

- A2A protocol spec §9.4.5 CancelTask - JSON-RPC binding for the abstract Cancel Task operation (https://a2a-protocol.org/latest/specification/).
- a2a-sdk 1.0.3 a2a/compat/v0_3/types.py:1125 pins the wire method literal: method: Literal['tasks/cancel'] = 'tasks/cancel'. Matches the slash-notation our codebase already uses for "message/send" (workspace-server/internal/handlers/delegation.go:155, canvas/src/components/tabs/ScheduleTab.tsx:168).
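
For illustration, the wire shape cited above can be written as a typed JSON-RPC 2.0 request. This is a sketch under assumptions: the PR body only states `method` and `params`; the `jsonrpc` and `id` members come from the JSON-RPC 2.0 specification, and whether canvas sets them itself is not stated here. `buildCancelRequest` is a hypothetical helper.

```typescript
// Generic JSON-RPC 2.0 request envelope.
interface JsonRpcRequest<P> {
  jsonrpc: "2.0";
  id: string;
  method: string;
  params: P;
}

// Hypothetical helper: builds the tasks/cancel envelope with empty params,
// matching the {method:"tasks/cancel", params:{}} shape described in this PR.
function buildCancelRequest(requestId: string): JsonRpcRequest<Record<string, never>> {
  return { jsonrpc: "2.0", id: requestId, method: "tasks/cancel", params: {} };
}

console.log(JSON.stringify(buildCancelRequest("req-1")));
// prints {"jsonrpc":"2.0","id":"req-1","method":"tasks/cancel","params":{}}
```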

Test plan

Four new specs in Toolbar.test.tsx:

- [x] Phase 1 dispatches tasks/cancel via /a2a for every active workspace before any /restart (order + envelope shape assertion).
- [x] When activeTasks drains to 0 during the poll window, /restart is NOT called.
- [x] When activeTasks does NOT drain inside the 8s timeout, /restart is called for each stuck workspace (with phase-1-before-phase-3 order assertion).
- [x] Selective drain - one workspace drains, the other doesn't; /restart is called only for the stuck one.
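
The order assertion used in the first and third specs can be reduced to a check over the recorded request paths. A sketch, assuming the test records every POST path in call order; `cancelPrecedesRestart` is a hypothetical helper, not a function from `Toolbar.test.tsx`.

```typescript
// True when every /a2a call was recorded before the first /restart call.
function cancelPrecedesRestart(paths: string[]): boolean {
  const firstRestart = paths.findIndex((p) => p.endsWith("/restart"));
  if (firstRestart === -1) return true; // no restart issued, nothing to order
  return paths.every((p, i) => !p.endsWith("/a2a") || i < firstRestart);
}
```

A spec would collect the paths from its mocked POST helper and assert that this returns `true`.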

Full canvas vitest suite locally: 3360 passed, 1 skipped, 0 failed. Toolbar file: 25 tests (21 prior + 4 new).

No new dependencies. No new env vars. No DB migrations.

Refs: task #377, template-claude-code PR#40.

core-fe added 1 commit 2026-05-20 21:25:08 +00:00

fix(canvas): polite tasks/cancel before /workspaces/:id/restart for Stop All (task #377 companion)

Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s

CI / Detect changes (pull_request) Successful in 6s

CI / Shellcheck (E2E scripts) (pull_request) Successful in 22s

E2E API Smoke Test / detect-changes (pull_request) Successful in 11s

E2E Chat / detect-changes (pull_request) Successful in 11s

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s

Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s

Harness Replays / detect-changes (pull_request) Successful in 7s

Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s

Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s

lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m8s

Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 8s

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s

gate-check-v3 / gate-check (pull_request) Successful in 3s

sop-checklist / review-refire (pull_request) Has been skipped

sop-checklist / na-declarations (pull_request) N/A: (none)

sop-checklist / all-items-acked (pull_request) Successful in 4s

sop-tier-check / tier-check (pull_request) Successful in 5s

CI / Platform (Go) (pull_request) Successful in 4m46s

CI / Canvas (Next.js) (pull_request) Successful in 6m14s

CI / Python Lint & Test (pull_request) Successful in 6m38s

CI / all-required (pull_request) Successful in 6m28s

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 7s

Harness Replays / Harness Replays (pull_request) Successful in 3s

Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 7s

CI / Canvas Deploy Reminder (pull_request) Has been skipped

E2E Chat / E2E Chat (pull_request) Failing after 5m39s

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7m19s

qa-review / approved (pull_request) Refired via /qa-recheck by unknown

security-review / approved (pull_request) Refired via /security-recheck; security-review failed

audit-force-merge / audit (pull_request) Successful in 3s

52304d99a2

Companion to template-claude-code PR#40 (fix/377-stop-all-propagation,
core-be). PR#40 adds a fast-cancel path on the runtime side
(executor.cancel -> killpg the CLI subprocess group), gated on
MOLECULE_STOP_PROPAGATE=true. But until canvas issues the A2A
tasks/cancel JSON-RPC at the workspace before the heavy /restart,
that runtime path is INERT in production - nothing ever reaches
executor.cancel(). Flipping the env var would produce zero canary
signal.

Fix - three-phase Stop All in Toolbar.tsx:

  Phase 1: POST /workspaces/:id/a2a with method "tasks/cancel" and
           empty params for every workspace that has activeTasks>0,
           in parallel. The workspace-server a2a_proxy.go forwards
           the envelope verbatim to the runtime; a2a-sdk dispatches
           tasks/cancel to AgentExecutor.cancel() on the runtime
           side (claude_sdk_executor.py cancel() at line 853).

  Phase 2: poll the canvas Zustand store (TASK_UPDATED pushes drive
           the active_tasks field via canvas-events.ts:400) every
           250ms for up to 8000ms. Drained workspaces (activeTasks=0)
           are removed from the to-restart set.

  Phase 3: for any workspace that did NOT drain inside the timeout
           - runtime on an old image without the cancel hook, or
           cancel propagation stuck - fall through to the original
           heavy /workspaces/:id/restart. Behavior is a strict
           superset of pre-fix Stop All: stuck workspaces still get
           the hammer, well-behaved workspaces are spared.

Upstream wire-shape citations (per
feedback_upstream_docs_first_before_hypothesizing):

- A2A protocol spec section 9.4.5 "CancelTask" - JSON-RPC binding
  for the abstract Cancel Task operation
  (https://a2a-protocol.org/latest/specification/)
- a2a-sdk 1.0.3 a2a/compat/v0_3/types.py line 1125 pins the wire
  method literal: `method: Literal['tasks/cancel'] = 'tasks/cancel'`.
  Matches the slash-notation our codebase already uses for
  "message/send" (workspace-server/internal/handlers/delegation.go:155,
  canvas/src/components/tabs/ScheduleTab.tsx:168).

Tests - 4 new specs in Toolbar.test.tsx covering each phase:

  1. phase 1 dispatches tasks/cancel via /a2a for every active
     workspace BEFORE any /restart (order assertion + envelope shape).
  2. when activeTasks drains to 0 during the poll window, /restart
     is NOT called.
  3. when activeTasks does NOT drain inside the 8s timeout, /restart
     is called for each stuck workspace (with phase-1-before-phase-3
     order assertion).
  4. selective drain - one workspace drains, the other doesn't;
     /restart is called only for the stuck one.

Full canvas vitest suite: 3360 passed, 1 skipped, 0 failed. Toolbar
file alone: 25 tests (21 prior + 4 new).

Refs: task #377, template-claude-code PR#40

core-be referenced this pull request from molecule-ai/molecule-ai-workspace-template-claude-code

2026-05-20 21:25:35 +00:00

fix(executor): propagate Stop All signal to in-flight tool subprocesses (#377) #40

core-be referenced this pull request from molecule-ai/molecule-ai-workspace-template-claude-code

2026-05-20 21:25:52 +00:00

fix(executor): propagate Stop All signal to in-flight tool subprocesses (#377) #40

core-qa approved these changes 2026-05-20 21:37:35 +00:00

Dismissed

core-qa left a comment

core-qa five-axis lens review (PR#1619 @ `52304d9`)

Reviewing Toolbar.tsx stopAll three-phase polite-cancel + 4 new vitest specs against task #377 acceptance criteria.

1. Correctness (drain-poll behaviour + UI feedback). Verified: setStopping(true) flips at L114 BEFORE phase 1, button text/aria switch to "Stopping..." at L297/291 immediately, setStopping(false) clears only after phase 3 completes (L168). User does NOT see a frozen UI during the 8s drain — they see the button labelled "Stopping..." and disabled. Poll cadence (250ms) is reasonable: 32 polls max per drain, store-read only (no network), no perceivable jitter. No finding.

2. Timeout sizing. 8000ms drain timeout, justified in-line against template-claude-code _SIGTERM_GRACE_S=5s plus WS round-trip buffer for TASK_UPDATED push. Sound bound: short enough that a wedged workspace gets /restart within ~8s (user-acceptable for a "Stop All" click), long enough to cover the 5s grace + 1-2s WS broadcast. No finding.

3. Selective fallthrough. Phase-3 selective test (L469-486) directly exercises ws-0 drains + ws-1 stuck → only /workspaces/ws-1/restart fires, expects 1 restart call, asserts target URL. Implementation maintains undrained Set granularly per-id (L142-154). No finding.

4. Backwards compat (older runtimes lacking tasks/cancel). Phase-1 errors are .catch'd to no-op (L129), so a 4xx/5xx/timeout from an old runtime simply skips the cancel and falls into the drain-poll. activeTasks will not decrement (no cancel happened), drain times out at 8s, phase 3 hits /restart — strictly status-quo behaviour. Test phase-3-fallthrough (L438-465) pins this ordering. No finding.

5. Test-suite signal. 4 new specs cover the spec-mandated cases (phase ordering, drain-skip, timeout-fallthrough, partial selective). Test setup at L94-160 wires the live useCanvasStore so drain-poll reads fresh state. vitest 3360 passing claim per PR body. No finding.

Verdict: APPROVED. Logic is clean, tests cover all stated invariants, fallthrough is conservative.
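
The poll count and timeout headroom quoted in points 1 and 2 follow from simple arithmetic. A sketch using the figures stated in this review; the function and field names are illustrative, and the 2000ms WS figure takes the upper end of the 1-2s estimate.

```typescript
interface DrainBudget {
  maxPolls: number; // store reads before the drain gives up
  headroomMs: number; // slack left after SIGTERM grace plus WS broadcast
}

// Illustrative helper: derives the poll budget from the timeout settings.
function drainBudget(
  timeoutMs: number,
  pollMs: number,
  sigtermGraceMs: number,
  wsBroadcastMs: number,
): DrainBudget {
  return {
    maxPolls: Math.floor(timeoutMs / pollMs),
    headroomMs: timeoutMs - (sigtermGraceMs + wsBroadcastMs),
  };
}

const budget = drainBudget(8000, 250, 5000, 2000);
console.log(budget.maxPolls, budget.headroomMs); // prints 32 1000
```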

core-qa approved these changes 2026-05-20 21:38:51 +00:00

Dismissed

core-qa approved these changes 2026-05-20 21:38:57 +00:00