fix(workspace-server): bounded retry on delete-path EC2 stop + durable leak event #1932

Merged
hongming merged 1 commits from fix/workspace-ec2-leak-delete-retry into main 2026-05-27 06:14:21 +00:00
Owner

Summary

The DELETE path's StopWorkspaceAutocpProv.Stop had no retry, while the restart path used cpStopWithRetry (bounded exp backoff). A transient CP/AWS hiccup on delete left the workspace row at status='removed' with instance_id populated, returned a 500, and relied entirely on the 60s CP-orphan-sweeper to re-drive the terminate. For a cascade descendant, the "client retries → replays terminate" recovery is defeated by CascadeDelete's status != 'removed' CTE filter (the descendant row is already removed, so a retry walks zero descendant rows).

This PR:

  • Extracts the retry loop into cpStopWithRetryErr (returns the terminal error); cpStopWithRetry keeps its void contract for the restart paths via a one-line delegate.
  • Adds stopWorkspaceForDelete: retries the CP terminate, and on exhaustion persists a durable workspace.delete.terminate_retry_exhausted row to structure_events (§Persistent structured logging gate) so the leak/pending decision is queryable.
  • Deliberately does NOT clear status='removed' — the existing CP-orphan-sweeper keys on exactly that state; the row stays removed+instance_id so the backstop still re-drives it. The error is still returned so the HTTP Delete surfaces the retryable 500.

Paired with controlplane PR fix/workspace-ec2-leak-reaper (the row-gone leak class). Refs task #15.

Phase 1 (re-verified, not assumed)

  • Brief H1 (no retry): confirmed.
  • Brief H3 (status != 'removed' retry-defeat): partially falsified — scoped to descendants only; the root id is always re-stopped (stopAndRemove(id) runs unconditionally and cpProv.Stop resolves instance_id with no status gate).
  • Brief H4 (orphan sweeper doesn't reap workspace EC2s): substantially falsifiedregistry/cp_orphan_sweeper.go (PR #2989, wired main.go:325) already reaps status='removed'+instance_id workspace EC2s every 60s. The real gap is the row-gone class → handled by the paired controlplane PR.

Phase 4 self-review (five-axis)

Correctness: No blocking finding — fail-direction proof discriminates new behavior (CPRetriesTransientThenSucceeds: 3 calls, no event) from pre-fix bare Stop. FYI: the per-workspace retry runs inside CascadeDelete's shared 30s cleanupCtx; a large cascade of failing descendants can truncate retries — degrades gracefully (ctx.Err → stopErr → 500 + cp-orphan-sweeper backstop).
Readability: No finding — verb-noun names match siblings; comments explain WHY (the CTE-defeat).
Architecture: No finding — stopWorkspaceForDelete in workspace_dispatchers.go and the loop in workspace_restart.go are both allowedFiles for the bare-Stop AST gate; TestRestart_CPStopOnlyInsideRetryHelper updated to recognize cpStopWithRetryErr as the relocated loop home (not loosened).
Security: No finding — no untrusted input, no secret write.
Performance: No finding — adds ≤3s backoff to an already-500 path.

Test plan

  • go vet ./internal/handlers/ clean
  • go test ./internal/handlers/ ./internal/registry/ green (no regressions)
  • golangci-lint run ./internal/handlers/... → 0 issues
  • Fail-direction proof: retry-success (no event) vs retry-exhaust (event + error)
  • Stage B/C deferred — HOLD for merge. Staging needs: trigger a workspace delete with a forced CP-Stop 5xx, confirm workspace.delete.terminate_retry_exhausted row appears and the cp-orphan-sweeper subsequently clears the EC2.

tier:medium (touches the delete path + structured-logging; revert is git revert-clean).

🤖 Generated with Claude Code


SOP checklist (RFC#351)

  • Comprehensive testing performed: Test-first with fail-direction proofs. Delete-path: CPRetriesTransientThenSucceeds (3 Stop calls, NO event) vs CPExhausts (durable structure_events INSERT + terminal error). No-backend no-op. AST gate TestRestart_CPStopOnlyInsideRetryHelper updated for the relocated retry loop. Edge: ctx-cancel mid-retry returns ctx.Err() without LEAK-SUSPECT (existing test preserved).
  • Local-postgres E2E run: N/A for this PR's changed branches — the new logic is unit-covered via sqlmock + scripted CP-stop stub at the dispatcher layer; the full CascadeDelete row-state path is exercised by the existing handlers Postgres integration job in CI. Stage A (local platform boot) deferred to staging (HOLD).
  • Staging-smoke verified or pending: Pending (HOLD for merge). Staging must: trigger a workspace delete with a forced CP-Stop 5xx, confirm the workspace.delete.terminate_retry_exhausted row appears, and the cp-orphan-sweeper subsequently clears the EC2.
  • Root-cause not symptom: The delete path's cpProv.Stop had no retry (unlike restart's cpStopWithRetry); a transient failure 500'd with no inline recovery, and cascade-descendant replay is defeated by the status != 'removed' CTE filter. Fix = bounded retry + durable leak event, not a status-ordering hack.
  • Five-Axis review walked: See the "Phase 4 self-review (five-axis)" section above — no Critical/Required; one FYI (30s cleanupCtx truncation under large failing cascades, degrades gracefully).
  • No backwards-compat shim / dead code added: No. cpStopWithRetry keeps its void contract via a one-line delegate to the new cpStopWithRetryErr; no shim, no dead code, no removed-comment cruft.
  • Memory/saved-feedback consulted: feedback_gitea_reviews_api_event_value (event=APPROVED), feedback_ci_status_check_combined_state (combined .statuses[].status), feedback_passwords_in_chat_are_burned/token-scrub (push URL scrubbed from .git/config), the dev-sop §Persistent-structured-logging-gate + §Staging-E2E-gate.
## Summary The DELETE path's `StopWorkspaceAuto` → `cpProv.Stop` had **no retry**, while the restart path used `cpStopWithRetry` (bounded exp backoff). A transient CP/AWS hiccup on delete left the workspace row at `status='removed'` with `instance_id` populated, returned a 500, and relied entirely on the 60s CP-orphan-sweeper to re-drive the terminate. For a cascade **descendant**, the "client retries → replays terminate" recovery is defeated by `CascadeDelete`'s `status != 'removed'` CTE filter (the descendant row is already `removed`, so a retry walks zero descendant rows). This PR: - Extracts the retry loop into `cpStopWithRetryErr` (returns the terminal error); `cpStopWithRetry` keeps its void contract for the restart paths via a one-line delegate. - Adds `stopWorkspaceForDelete`: retries the CP terminate, and on exhaustion persists a durable `workspace.delete.terminate_retry_exhausted` row to `structure_events` (§Persistent structured logging gate) so the leak/pending decision is queryable. - Deliberately does NOT clear `status='removed'` — the existing CP-orphan-sweeper keys on exactly that state; the row stays `removed`+`instance_id` so the backstop still re-drives it. The error is still returned so the HTTP Delete surfaces the retryable 500. Paired with controlplane PR `fix/workspace-ec2-leak-reaper` (the row-gone leak class). Refs task #15. ## Phase 1 (re-verified, not assumed) - Brief H1 (no retry): **confirmed**. - Brief H3 (`status != 'removed'` retry-defeat): **partially falsified** — scoped to *descendants* only; the root id is always re-stopped (`stopAndRemove(id)` runs unconditionally and `cpProv.Stop` resolves `instance_id` with no status gate). - Brief H4 (orphan sweeper doesn't reap workspace EC2s): **substantially falsified** — `registry/cp_orphan_sweeper.go` (PR #2989, wired `main.go:325`) already reaps `status='removed'`+`instance_id` workspace EC2s every 60s. The real gap is the **row-gone** class → handled by the paired controlplane PR. ## Phase 4 self-review (five-axis) **Correctness:** No blocking finding — fail-direction proof discriminates new behavior (`CPRetriesTransientThenSucceeds`: 3 calls, no event) from pre-fix bare Stop. _FYI:_ the per-workspace retry runs inside `CascadeDelete`'s shared 30s `cleanupCtx`; a large cascade of failing descendants can truncate retries — degrades gracefully (ctx.Err → stopErr → 500 + cp-orphan-sweeper backstop). **Readability:** No finding — verb-noun names match siblings; comments explain WHY (the CTE-defeat). **Architecture:** No finding — `stopWorkspaceForDelete` in `workspace_dispatchers.go` and the loop in `workspace_restart.go` are both `allowedFiles` for the bare-Stop AST gate; `TestRestart_CPStopOnlyInsideRetryHelper` updated to recognize `cpStopWithRetryErr` as the relocated loop home (not loosened). **Security:** No finding — no untrusted input, no secret write. **Performance:** No finding — adds ≤3s backoff to an already-500 path. ## Test plan - [x] `go vet ./internal/handlers/` clean - [x] `go test ./internal/handlers/ ./internal/registry/` green (no regressions) - [x] `golangci-lint run ./internal/handlers/...` → 0 issues - [x] Fail-direction proof: retry-success (no event) vs retry-exhaust (event + error) - [ ] **Stage B/C deferred — HOLD for merge.** Staging needs: trigger a workspace delete with a forced CP-Stop 5xx, confirm `workspace.delete.terminate_retry_exhausted` row appears and the cp-orphan-sweeper subsequently clears the EC2. `tier:medium` (touches the delete path + structured-logging; revert is `git revert`-clean). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --- ## SOP checklist (RFC#351) - **Comprehensive testing performed:** Test-first with fail-direction proofs. Delete-path: `CPRetriesTransientThenSucceeds` (3 Stop calls, NO event) vs `CPExhausts` (durable `structure_events` INSERT + terminal error). No-backend no-op. AST gate `TestRestart_CPStopOnlyInsideRetryHelper` updated for the relocated retry loop. Edge: ctx-cancel mid-retry returns `ctx.Err()` without LEAK-SUSPECT (existing test preserved). - **Local-postgres E2E run:** N/A for this PR's changed branches — the new logic is unit-covered via sqlmock + scripted CP-stop stub at the dispatcher layer; the full CascadeDelete row-state path is exercised by the existing handlers Postgres integration job in CI. Stage A (local platform boot) deferred to staging (HOLD). - **Staging-smoke verified or pending:** Pending (HOLD for merge). Staging must: trigger a workspace delete with a forced CP-Stop 5xx, confirm the `workspace.delete.terminate_retry_exhausted` row appears, and the cp-orphan-sweeper subsequently clears the EC2. - **Root-cause not symptom:** The delete path's `cpProv.Stop` had no retry (unlike restart's `cpStopWithRetry`); a transient failure 500'd with no inline recovery, and cascade-descendant replay is defeated by the `status != 'removed'` CTE filter. Fix = bounded retry + durable leak event, not a status-ordering hack. - **Five-Axis review walked:** See the "Phase 4 self-review (five-axis)" section above — no Critical/Required; one FYI (30s cleanupCtx truncation under large failing cascades, degrades gracefully). - **No backwards-compat shim / dead code added:** No. `cpStopWithRetry` keeps its void contract via a one-line delegate to the new `cpStopWithRetryErr`; no shim, no dead code, no removed-comment cruft. - **Memory/saved-feedback consulted:** `feedback_gitea_reviews_api_event_value` (event=APPROVED), `feedback_ci_status_check_combined_state` (combined `.statuses[].status`), `feedback_passwords_in_chat_are_burned`/token-scrub (push URL scrubbed from `.git/config`), the dev-sop §Persistent-structured-logging-gate + §Staging-E2E-gate.
hongming added 1 commit 2026-05-27 04:51:49 +00:00
fix(workspace-server): bounded retry on delete-path EC2 stop + durable leak event
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 12s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 11s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m10s
qa-review / approved (pull_request) Successful in 7s
security-review / approved (pull_request) Successful in 5s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m12s
gate-check-v3 / gate-check (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request) Has been skipped
sop-tier-check / tier-check (pull_request) Successful in 4s
sop-checklist / all-items-acked (pull_request) acked: 7/7
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Shellcheck (E2E scripts) (pull_request) Successful in 12s
CI / Canvas (Next.js) (pull_request) Successful in 15s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
E2E Chat / E2E Chat (pull_request) Successful in 4s
Harness Replays / Harness Replays (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3m19s
CI / Platform (Go) (pull_request) Successful in 6m53s
CI / all-required (pull_request) Successful in 29m0s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
audit-force-merge / audit (pull_request) Successful in 14s
e058137fbf
The DELETE path's StopWorkspaceAuto → cpProv.Stop had no retry, while the
restart path used cpStopWithRetry (bounded exp backoff). A transient CP/AWS
hiccup on delete left the workspace row at status='removed' with instance_id
populated, returned a 500, and relied entirely on the 60s CP-orphan-sweeper
to re-drive the terminate. For a cascade *descendant* the "client retries →
replays terminate" recovery is defeated by CascadeDelete's status != 'removed'
CTE filter — so the only inline recovery is a bounded retry.

This extracts the retry loop into cpStopWithRetryErr (cpStopWithRetry keeps
its void contract for the restart paths) and adds stopWorkspaceForDelete,
which retries the CP terminate and, on exhaustion, persists a durable
workspace.delete.terminate_retry_exhausted row to structure_events (the
§Persistent structured logging gate) so the leak/pending decision is
queryable. The row deliberately stays status='removed' + instance_id so the
existing CP-orphan-sweeper backstop still re-drives it; the error is still
returned so the HTTP Delete surfaces the retryable 500.

Test-first, fail-direction proof: CPRetriesTransientThenSucceeds (3 calls, no
event) vs CPExhausts (event + error) discriminate the new behavior from the
pre-fix bare Stop. AST gate updated to recognize cpStopWithRetryErr as the
relocated home of the retry loop.

Refs task #15 (workspace-ec2-leak). Paired with the controlplane workspace-
EC2 reaper PR for the row-gone leak class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hongming added the tier:medium label 2026-05-27 04:52:35 +00:00
core-qa approved these changes 2026-05-27 04:53:21 +00:00
core-qa left a comment
Member

core-qa review (correctness + test coverage axis):

Verified the changed-branch coverage maps to discriminating tests:

  • stopWorkspaceForDelete CP-retry-then-success: asserts 3 Stop calls + NO terminate_retry_exhausted event (would fail against the pre-fix bare one-shot Stop, which makes 1 call).
  • CP-retry-exhaust: asserts the durable structure_events INSERT fires AND the terminal error is returned wrapping the LAST attempt error — discriminates exhaustion from transient.
  • no-backend no-op covered.
  • cpStopWithRetry void contract preserved (delegates to cpStopWithRetryErr); existing restart-path retry/ctx-cancel/LEAK-SUSPECT tests still green.
  • The status='removed' ordering is deliberately unchanged so the existing cp_orphan_sweeper backstop keeps working — confirmed the comment matches the behavior.

go test ./internal/handlers ./internal/registry green, golangci-lint 0 issues, AST gate updated correctly (relocated, not loosened). Stage B/C correctly deferred (HOLD). Correctness + test axes pass.

APPROVED.

core-qa review (correctness + test coverage axis): Verified the changed-branch coverage maps to discriminating tests: - stopWorkspaceForDelete CP-retry-then-success: asserts 3 Stop calls + NO terminate_retry_exhausted event (would fail against the pre-fix bare one-shot Stop, which makes 1 call). - CP-retry-exhaust: asserts the durable structure_events INSERT fires AND the terminal error is returned wrapping the LAST attempt error — discriminates exhaustion from transient. - no-backend no-op covered. - cpStopWithRetry void contract preserved (delegates to cpStopWithRetryErr); existing restart-path retry/ctx-cancel/LEAK-SUSPECT tests still green. - The status='removed' ordering is deliberately unchanged so the existing cp_orphan_sweeper backstop keeps working — confirmed the comment matches the behavior. go test ./internal/handlers ./internal/registry green, golangci-lint 0 issues, AST gate updated correctly (relocated, not loosened). Stage B/C correctly deferred (HOLD). Correctness + test axes pass. APPROVED.
core-security approved these changes 2026-05-27 04:53:31 +00:00
core-security left a comment
Member

core-security review (trust-boundary axis):

Delete-path change touches no new untrusted input: workspaceID values in CascadeDelete are already UUID-validated (validateWorkspaceID), and the new structure_events INSERT is parameterized ($1/$2/$3) — no string concat into SQL. workspace_id flows into a UUID column; lib/pq casts the validated-UUID string safely. No secret/token written to disk or logs — emitDeleteTerminateRetryExhausted payload carries only the workspace_id, attempt count, and the CP error string (no creds). Telemetry is log-and-swallow so an INSERT failure can never 500 the caller or leak via panic. event_type is a constant, not user-controlled. No authz change — the Delete handler's existing gates are untouched. Trust boundary intact.

APPROVED.

core-security review (trust-boundary axis): Delete-path change touches no new untrusted input: workspaceID values in CascadeDelete are already UUID-validated (validateWorkspaceID), and the new structure_events INSERT is parameterized ($1/$2/$3) — no string concat into SQL. workspace_id flows into a UUID column; lib/pq casts the validated-UUID string safely. No secret/token written to disk or logs — emitDeleteTerminateRetryExhausted payload carries only the workspace_id, attempt count, and the CP error string (no creds). Telemetry is log-and-swallow so an INSERT failure can never 500 the caller or leak via panic. event_type is a constant, not user-controlled. No authz change — the Delete handler's existing gates are untouched. Trust boundary intact. APPROVED.
Member

/sop-ack comprehensive-testing

/sop-ack comprehensive-testing
Member

/sop-ack local-postgres-e2e

/sop-ack local-postgres-e2e
Member

/sop-ack staging-smoke

/sop-ack staging-smoke
Member

/sop-ack root-cause

/sop-ack root-cause
Member

/sop-ack five-axis-review

/sop-ack five-axis-review
Member

/sop-ack no-backwards-compat

/sop-ack no-backwards-compat
Member

/sop-ack memory-consulted

/sop-ack memory-consulted
hongming merged commit fdd3f52bc8 into main 2026-05-27 06:14:21 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1932