POST /workspaces/:id/pause: implicit recursive cascade to descendants — make opt-in via ?cascade=true #1991

Closed
opened 2026-05-28 07:06:44 +00:00 by hongming · 0 comments
Owner

Problem

Pause (workspace_restart.go:878) recursively collects every descendant of the target workspace with a WITH RECURSIVE descendants SQL query and stops them all. The response field paused_count is the caller's only signal that more than one workspace was affected.

Callers expecting Pause to affect only the named workspace will silently disrupt a whole subtree. Combined with the destructive backup-then-terminate path on subsequent Stop+Provision, this amplifies blast radius from 1 -> N.
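To make the 1 -> N amplification concrete, here is a rough in-memory sketch of what a recursive descendant collection amounts to. It is not the code in workspace_restart.go: the parent-to-children map, the collectSubtree name, and the workspace IDs are all hypothetical stand-ins for the SQL query.

```go
package main

import "fmt"

// collectSubtree returns root plus every transitive descendant, breadth-first.
// Hypothetical stand-in for the WITH RECURSIVE descendants query; the real
// schema and query are not shown in this issue.
func collectSubtree(children map[string][]string, root string) []string {
	out := []string{root}
	seen := map[string]bool{root: true}
	// out doubles as the work queue: each newly found ID is scanned later.
	for i := 0; i < len(out); i++ {
		for _, c := range children[out[i]] {
			if !seen[c] {
				seen[c] = true
				out = append(out, c)
			}
		}
	}
	return out
}

func main() {
	// Hypothetical tree: one manager workspace with four descendants.
	tree := map[string][]string{
		"manager": {"a", "b"},
		"a":       {"c"},
		"b":       {"d"},
	}
	ids := collectSubtree(tree, "manager")
	// A caller that named only "manager" ends up affecting len(ids) workspaces.
	fmt.Println(len(ids), ids)
}
```

With a tree shaped like this, a single Pause call on the root touches five workspaces, which matches the paused_count seen in the incident.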

Confirmed today 2026-05-28 06:50Z

Pause(deedcb61) returned {"paused_count":5} and terminated all 5 EC2-backed agents-team workspaces (Production Manager + 4 descendants in the parent tree). 4 of those were working; all 4 are now in status=failed. See incident internal#722.

Proposed fix

  • Default behavior: Pause/Resume affect ONLY the named workspace. If descendants exist and cascade=true is not set, return 409 with a body listing the descendants so the caller can decide.
  • New query param: ?cascade=true opts into the recursive behavior. ?cascade=false (default) treats the operation as single-workspace.
  • Migration: for a deprecation period, requests that omit cascade still cascade implicitly but carry the warning header X-Molecule-Implicit-Cascade: deprecated; after one minor version the default flips to single-workspace.

Acceptance criteria

  • Pause without cascade=true on a parent workspace returns 409 + descendants list.
  • Pause?cascade=true behaves as today.
  • Resume mirrors Pause's contract.
  • Test: parent-with-2-children setup; Pause(parent) without cascade returns 409; Pause(parent)?cascade=true returns 200 with paused_count=3.

Cross-refs

  • internal#722 — original incident.
  • The destructive Stop+Provision underneath Pause is tracked separately (linked to /restart fail-closed issue).
Reference: molecule-ai/molecule-core#1991