fix(workspace-server): CP orphan sweeper closes deprovision split-write race (#2989) #2

claude-ceo-assistant · 2026-05-06T23:44:12Z

2026-05-06 23:44:12 +00:00

Summary

Closes #2989. SaaS-mode counterpart to the existing Docker orphan_sweeper — re-issues cpProv.Stop on a 60s tick for any workspace at status='removed' with non-NULL instance_id, healing the deprovision split-write race that left 9 prod EC2s orphaned over 3 days.

Why this shape (Option E)

RFC discussion in #2989 (was on github.com — repo now mirrored to Gitea). Two heavier options were considered:

Option A (CP-first ordering): protocol change + reverse callback. ~3 PRs, cutover risk.
Option B (durable outbox): new table + new worker + cutover. ~400 LOC for a class with 9 known orphans/quarter.
Option E (this PR): extend the existing 60s sweeper pattern to the SaaS backend. ~30 LOC of production code, no schema change, no protocol change.

The inline path already returns a loud 500 when CP fails — the only missing piece is automatic retry. A 60s sweeper provides exactly that. cpProv.Stop is idempotent on already-terminated EC2s and already-deleted DNS records, so retries are safe.

Test plan

9 unit tests covering happy path, Stop failure, UPDATE failure, multiple orphans, DB query error, nil-DB defense, nil-reaper short-circuit, boot-immediate cadence
Mutation-tested: status='running' substitution → tests fail; remove UPDATE block → tests fail
go build ./... clean on rebased base (gitea/staging)

Note on origin

Originally pushed as github.com PR #3029 (CI green, queued for merge) before the GitHub org suspension on 2026-05-06. Re-pushed to Gitea after rebase onto current staging tip; same commit content.

🤖 Generated with Claude Code

## Summary Closes #2989. SaaS-mode counterpart to the existing Docker `orphan_sweeper` — re-issues `cpProv.Stop` on a 60s tick for any workspace at status='removed' with non-NULL instance_id, healing the deprovision split-write race that left 9 prod EC2s orphaned over 3 days. ## Why this shape (Option E) RFC discussion in #2989 (was on github.com — repo now mirrored to Gitea). Two heavier options were considered: - **Option A (CP-first ordering)**: protocol change + reverse callback. ~3 PRs, cutover risk. - **Option B (durable outbox)**: new table + new worker + cutover. ~400 LOC for a class with 9 known orphans/quarter. - **Option E (this PR)**: extend the existing 60s sweeper pattern to the SaaS backend. ~30 LOC of production code, no schema change, no protocol change. The inline path already returns a loud 500 when CP fails — the only missing piece is automatic retry. A 60s sweeper provides exactly that. `cpProv.Stop` is idempotent on already-terminated EC2s and already-deleted DNS records, so retries are safe. ## Test plan - [x] 9 unit tests covering happy path, Stop failure, UPDATE failure, multiple orphans, DB query error, nil-DB defense, nil-reaper short-circuit, boot-immediate cadence - [x] Mutation-tested: status='running' substitution → tests fail; remove UPDATE block → tests fail - [x] `go build ./...` clean on rebased base (gitea/staging) ## Note on origin Originally pushed as github.com PR #3029 (CI green, queued for merge) before the GitHub org suspension on 2026-05-06. Re-pushed to Gitea after rebase onto current staging tip; same commit content. 🤖 Generated with Claude Code

claude-ceo-assistant added 1 commit 2026-05-06 23:44:12 +00:00

fix(workspace-server): CP orphan sweeper closes deprovision split-write race (#2989 )

CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s

Details

Harness Replays / detect-changes (pull_request) Successful in 8s

Details

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s

Details

CI / Detect changes (pull_request) Successful in 6s

Details

Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s

Details

E2E API Smoke Test / detect-changes (pull_request) Successful in 6s

Details

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s

Details

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s

Details

Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s

Details

CI / Python Lint & Test (pull_request) Successful in 3s

Details

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s

Details

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s

Details

Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s

Details

CI / Canvas (Next.js) (pull_request) Successful in 18s

Details

CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 43s

Details

CI / Canvas Deploy Reminder (pull_request) Has been skipped

Details

CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 1m19s

Details

CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m22s

Details

Harness Replays / Harness Replays (pull_request) Failing after 37s

Details

CI / Platform (Go) (pull_request) Failing after 2m33s

Details

E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 4m48s

Details

3cdb67f27e

The deprovision path marks `workspaces.status='removed'` BEFORE calling
the controlplane DELETE. If that CP call fails (transient 5xx, network
hiccup, AWS provider error), the DB row stays at 'removed' with
`instance_id` populated and there's no retry — the EC2 lives forever.
9 prod orphans accumulated over 3 days under this bug.

Adds a SaaS-mode counterpart to the existing Docker `orphan_sweeper`:
- 60s tick (matches the Docker sweeper cadence)
- LIMIT 100 per cycle so a sustained CP outage drains over multiple
  cycles without blowing the request timeout
- Re-issues `cpProv.Stop` for any workspace at status='removed' with a
  non-NULL `instance_id`. Stop is idempotent (AWS terminate on
  already-terminated is a no-op; CP's Deprovision tolerates already-
  deleted DNS) so retries are safe.
- On Stop success, NULLs `instance_id` so the next cycle skips the row.
- On Stop failure, leaves `instance_id` populated for next cycle.

The existing Docker sweeper is gated on `prov != nil`; the new sweeper
is gated on `cpProv != nil`. SaaS tenants get exactly one of the two,
self-hosted tenants get the Docker one — no overlap.

Why this shape over option A (CP-first ordering) or B (durable outbox):
the existing inline path already returns a loud 500 to the user when
CP fails — the only missing piece is automatic retry, which a 60s
sweeper provides without protocol changes, new tables, or new workers.
~30 LOC of production code vs. ~400 for an outbox. RFC discussion in
#2989 comment chain.

Tests:
- 9 unit tests covering happy path, Stop failure, UPDATE failure,
  multiple orphans (one-fails-others-still-process), DB query error,
  nil-DB defense, nil-reaper short-circuit, and the boot-immediate-then-
  tick cadence contract.
- Mutation-tested: status='running' substitution and removed-UPDATE-
  block both fail at least one test.

Out of scope:
- Backfilling the 9 named orphans — they'll heal automatically on the
  first sweep cycle after this lands; no manual cleanup needed.
- Long-term durable-outbox architecture — separate RFC.

~~Ghost referenced this pull request 2026-05-07 05:51:10 +00:00~~

bug(workspace-server): plugin-install returns 503 container-not-running while status=online (split-state) #10

hongming was assigned by claude-ceo-assistant

2026-05-07 10:29:25 +00:00

Ghost approved these changes 2026-05-07 11:38:46 +00:00

Ghost left a comment

First-time contributor

Hongming-approved (chat 2026-05-07): CP orphan sweeper closes deprovision (#2989). devops-engineer reviewed earlier; ready to merge.