POST /workspaces/:id/restart: terminate-anyway under SnapshotCreationPerVolumeRateExceeded is unsafe — fail-closed with 503 instead #1990

Open
opened 2026-05-28 07:06:43 +00:00 by hongming · 1 comment
Owner

Problem

The Restart handler (workspace_restart.go:209) currently calls BackupWorkspaceEBS -> awsapi.CreateSnapshot. When AWS returns SnapshotCreationPerVolumeRateExceeded (per-volume rate ~5/hr), the provisioner logs LEAK-SUSPECT ... — proceeding with terminate (state will NOT be restorable on next recreate) and terminates the EC2 anyway.

This converts a transient snapshot-quota signal into PERMANENT WORKSPACE STATE LOSS. The new EC2 boots without any restored /home/agent data.

Confirmed today 2026-05-28 06:08:49Z

Production Manager (deedcb61) state was permanently lost via this path while attempting an in-place restart. See incident internal#722.

Proposed fix

Fail-closed: when BackupWorkspaceEBS returns SnapshotCreationPerVolumeRateExceeded, return 503 Service Unavailable with Retry-After: 720 (12min — enough for one snapshot slot to free) and DO NOT terminate. The user can retry once quota drains.

For other backup failures (volume not found, IAM denied, etc.) the same fail-closed posture applies: terminating without a recoverable backup should NEVER happen silently. Only an explicit ?force_terminate=true&accept_state_loss=true query string should permit the destructive path, and only for ops-tier callers.

Acceptance criteria

  • /restart on a workspace whose volume has hit its snapshot rate limit returns 503 instead of destroying state.
  • Unit test: mock BackupWorkspaceEBS returning SnapshotCreationPerVolumeRateExceeded; assert handler returns 503 and EC2 is NOT terminated.
  • Integration test: synthesize 5 rapid /restart calls on the same volume; the 6th returns 503 (rate-limit cleanly surfaced).
  • Drift gate: search for the log string "proceeding with terminate (state will NOT be restorable" and lint-flag any usage outside an explicit opt-in path.
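The drift gate in the last criterion could be sketched roughly as follows. This is an illustration only: `findDrift`, the allowlist, and the file names in `main` are hypothetical, and a real gate would read files from the repository rather than from an in-memory map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// driftMarker is the log fragment quoted in the acceptance criteria.
const driftMarker = "proceeding with terminate (state will NOT be restorable"

// findDrift returns, in sorted order, the names of sources that contain the
// marker but are not on the explicit opt-in allowlist.
func findDrift(sources map[string]string, allowed map[string]bool) []string {
	var flagged []string
	for name, content := range sources {
		if strings.Contains(content, driftMarker) && !allowed[name] {
			flagged = append(flagged, name)
		}
	}
	sort.Strings(flagged)
	return flagged
}

func main() {
	// Hypothetical file names and contents, for illustration only.
	sources := map[string]string{
		"provisioner.go":       `log.Printf("LEAK-SUSPECT: proceeding with terminate (state will NOT be restorable on next recreate)")`,
		"force_terminate.go":   `log.Printf("opt-in: proceeding with terminate (state will NOT be restorable on next recreate)")`,
		"workspace_restart.go": `// no destructive fallback here`,
	}
	allowed := map[string]bool{"force_terminate.go": true}
	for _, name := range findDrift(sources, allowed) {
		fmt.Println("drift:", name)
	}
}
```

Matching on a fixed substring is deliberately crude; it catches copy-pasted log lines but not a reworded fallback, so it complements the unit test rather than replacing it.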

Cross-refs

  • internal#722 — original incident.
  • feedback_workspace_restart_terminates_when_snapshot_rate_limited.md — memory.
  • Sibling endpoints Pause, Resume, Hibernate likely share the same backup-then-terminate hazard — audit per fix.
Member

Triage finding: bug-gone in the current architecture. Recommend close.

Investigated on fix/3168-rc13392-rest-delegate-gate (2026-06-23, molecule-core HEAD 924c0fdd + pending RC 13392 fix). The Restart path in workspace-server/internal/handlers/workspace_restart.go + workspace_dispatchers.go no longer matches the architecture this issue describes.

What the issue describes (snapshot/terminate architecture)

The Restart handler (workspace_restart.go:209) currently calls BackupWorkspaceEBS -> awsapi.CreateSnapshot. When AWS returns SnapshotCreationPerVolumeRateExceeded (per-volume rate ~5/hr), the provisioner logs LEAK-SUSPECT ... — proceeding with terminate (state will NOT be restorable on next recreate) and terminates the EC2 anyway.

What's in the code today (cpProv/Stop architecture)

  1. No BackupWorkspaceEBS function exists in workspace-server/. Verified via grep -rn BackupWorkspaceEBS /workspace/molecule-core/workspace-server/ — zero hits.
  2. No CreateSnapshot call in the Restart path. The Restart handler (line 314) does:
    • Read workspace row (status, name, tier, runtime, template)
    • Update status to provisioning
    • Capture restart-context data
    • Dispatch via h.goAsync to RestartWorkspaceAutoOpts (line 482)
  3. RestartWorkspaceAutoOpts (workspace_dispatchers.go:423) calls cpProv.Stop (control-plane), not a direct AWS API. The stop leg is cpStopWithRetry (line 835), which is a bounded retry against the control-plane's stop endpoint, not an AWS-direct EBS snapshot.
  4. No LEAK-SUSPECT snapshot line in the restart path. LEAK-SUSPECT cpProv.Stop is the EXISTING log marker (line 901) — but it signals a control-plane stop failure (the CP can't stop the container), not a snapshot rate limit. Different failure mode, different concern.

Why the architecture changed

The 2026-04 SaaS-migration split restart into:

  • CP (control-plane) first: cpProv (a service that talks to AWS on the workspace-server's behalf) owns EC2 lifecycle. It does the Stop, returns success/failure.
  • Workspace-server retries on CP stop failures (cpStopWithRetry / cpStopWithRetryErr with bounded retry) and proceeds with reprovision regardless. The docstring explicitly says "Restart's contract is 'make the workspace alive again': it proceeds with reprovision regardless of the Stop outcome" (lines 820-825).
  • No snapshot-before-stop anywhere in the new path. A new EC2 from cpProv.Start does NOT restore from a snapshot — it boots from a fresh AMI with the workspace's config volume attached (the same Docker volume mount that's been preserved through stop/start).

Could the LEAK-SUSPECT cpProv.Stop path be a similar concern?

Different but adjacent. If cpProv.Stop still fails after its bounded retries are exhausted:

  • The old EC2 is in an unknown state (running? terminated? mid-stop?)
  • The provision leg still runs, creating a NEW container
  • This is a "split-brain" risk (two compute instances briefly), not a state-loss risk (the new EC2 has its own fresh state)
  • That's a different issue (potentially: "Restart after cpProv.Stop exhaustion → split-brain") and is out of scope for #1990.

Recommendation

Close #1990 as stale-by-architecture-change. The SnapshotCreationPerVolumeRateExceeded scenario can't occur in the current code because we don't snapshot at restart.

If a snapshot-before-restart is ever wanted back (e.g. for compliance with a backup-window SLA), that should be filed as a new feature issue, not as a fix for a bug that no longer exists.

(No code change proposed; closing per scope discipline.)

Reference: molecule-ai/molecule-core#1990