fix(workspace-server): bounded retry on delete-path EC2 stop + durable leak event #1932
Reference in New Issue
Block a user
Delete Branch "fix/workspace-ec2-leak-delete-retry"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
The DELETE path's
StopWorkspaceAuto→cpProv.Stophad no retry, while the restart path usedcpStopWithRetry(bounded exp backoff). A transient CP/AWS hiccup on delete left the workspace row atstatus='removed'withinstance_idpopulated, returned a 500, and relied entirely on the 60s CP-orphan-sweeper to re-drive the terminate. For a cascade descendant, the "client retries → replays terminate" recovery is defeated byCascadeDelete'sstatus != 'removed'CTE filter (the descendant row is alreadyremoved, so a retry walks zero descendant rows).This PR:
cpStopWithRetryErr(returns the terminal error);cpStopWithRetrykeeps its void contract for the restart paths via a one-line delegate.stopWorkspaceForDelete: retries the CP terminate, and on exhaustion persists a durableworkspace.delete.terminate_retry_exhaustedrow tostructure_events(§Persistent structured logging gate) so the leak/pending decision is queryable.status='removed'— the existing CP-orphan-sweeper keys on exactly that state; the row staysremoved+instance_idso the backstop still re-drives it. The error is still returned so the HTTP Delete surfaces the retryable 500.Paired with controlplane PR
fix/workspace-ec2-leak-reaper(the row-gone leak class). Refs task #15.Phase 1 (re-verified, not assumed)
status != 'removed'retry-defeat): partially falsified — scoped to descendants only; the root id is always re-stopped (stopAndRemove(id)runs unconditionally andcpProv.Stopresolvesinstance_idwith no status gate).registry/cp_orphan_sweeper.go(PR #2989, wiredmain.go:325) already reapsstatus='removed'+instance_idworkspace EC2s every 60s. The real gap is the row-gone class → handled by the paired controlplane PR.Phase 4 self-review (five-axis)
Correctness: No blocking finding — fail-direction proof discriminates new behavior (
CPRetriesTransientThenSucceeds: 3 calls, no event) from pre-fix bare Stop. FYI: the per-workspace retry runs insideCascadeDelete's shared 30scleanupCtx; a large cascade of failing descendants can truncate retries — degrades gracefully (ctx.Err → stopErr → 500 + cp-orphan-sweeper backstop).Readability: No finding — verb-noun names match siblings; comments explain WHY (the CTE-defeat).
Architecture: No finding —
stopWorkspaceForDeleteinworkspace_dispatchers.goand the loop inworkspace_restart.goare bothallowedFilesfor the bare-Stop AST gate;TestRestart_CPStopOnlyInsideRetryHelperupdated to recognizecpStopWithRetryErras the relocated loop home (not loosened).Security: No finding — no untrusted input, no secret write.
Performance: No finding — adds ≤3s backoff to an already-500 path.
Test plan
go vet ./internal/handlers/cleango test ./internal/handlers/ ./internal/registry/green (no regressions)golangci-lint run ./internal/handlers/...→ 0 issuesworkspace.delete.terminate_retry_exhaustedrow appears and the cp-orphan-sweeper subsequently clears the EC2.tier:medium(touches the delete path + structured-logging; revert isgit revert-clean).🤖 Generated with Claude Code
SOP checklist (RFC#351)
CPRetriesTransientThenSucceeds(3 Stop calls, NO event) vsCPExhausts(durablestructure_eventsINSERT + terminal error). No-backend no-op. AST gateTestRestart_CPStopOnlyInsideRetryHelperupdated for the relocated retry loop. Edge: ctx-cancel mid-retry returnsctx.Err()without LEAK-SUSPECT (existing test preserved).workspace.delete.terminate_retry_exhaustedrow appears, and the cp-orphan-sweeper subsequently clears the EC2.cpProv.Stophad no retry (unlike restart'scpStopWithRetry); a transient failure 500'd with no inline recovery, and cascade-descendant replay is defeated by thestatus != 'removed'CTE filter. Fix = bounded retry + durable leak event, not a status-ordering hack.cpStopWithRetrykeeps its void contract via a one-line delegate to the newcpStopWithRetryErr; no shim, no dead code, no removed-comment cruft.feedback_gitea_reviews_api_event_value(event=APPROVED),feedback_ci_status_check_combined_state(combined.statuses[].status),feedback_passwords_in_chat_are_burned/token-scrub (push URL scrubbed from.git/config), the dev-sop §Persistent-structured-logging-gate + §Staging-E2E-gate.core-qa review (correctness + test coverage axis):
Verified the changed-branch coverage maps to discriminating tests:
go test ./internal/handlers ./internal/registry green, golangci-lint 0 issues, AST gate updated correctly (relocated, not loosened). Stage B/C correctly deferred (HOLD). Correctness + test axes pass.
APPROVED.
core-security review (trust-boundary axis):
Delete-path change touches no new untrusted input: workspaceID values in CascadeDelete are already UUID-validated (validateWorkspaceID), and the new structure_events INSERT is parameterized ($1/$2/$3) — no string concat into SQL. workspace_id flows into a UUID column; lib/pq casts the validated-UUID string safely. No secret/token written to disk or logs — emitDeleteTerminateRetryExhausted payload carries only the workspace_id, attempt count, and the CP error string (no creds). Telemetry is log-and-swallow so an INSERT failure can never 500 the caller or leak via panic. event_type is a constant, not user-controlled. No authz change — the Delete handler's existing gates are untouched. Trust boundary intact.
APPROVED.
/sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted