molecule-core/workspace-server
Hongming Wang 4915d1d59e fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB)
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.

Fix shape:

1. Provisioner stamps every ws-* container + volume with
   `molecule.platform.managed=true`. Without a label, the sweeper
   would have to assume any unlabeled ws-* container might belong
   to a sibling platform stack on a shared Docker daemon.

2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
   counterpart to the existing name-filter.

3. Sweeper splits sweepOnce into two independent passes:
     - sweepRemovedRows (unchanged behavior; status='removed' only)
     - sweepLabeledOrphansWithoutRows (new; labeled containers whose
       workspace_id has no row in the table at all)
   Each pass has its own short-circuit so an empty result or transient
   error in one doesn't block the other — load-bearing because the
   wiped-DB pass exists precisely for cases where the removed-row
   pass finds nothing.

Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)

Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.

Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.

Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:33:41 -07:00
..
cmd/server feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114) 2026-04-26 13:36:26 -07:00
internal fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB) 2026-04-26 14:33:41 -07:00
migrations chore: second-pass review polish — symmetry + clearer test fixtures 2026-04-25 08:48:30 -07:00
pkg/provisionhook feat(#1957): wire gh-identity plugin into workspace-server 2026-04-24 15:01:41 +00:00
.ci-force chore: force Platform(Go) CI run on main — validate go vet clean 2026-04-21 15:43:19 +00:00
.gitignore feat(ws-server): pull env from CP on startup 2026-04-19 02:41:15 -07:00
.golangci.yaml chore(workspace-server): add golangci.yaml disabling errcheck 2026-04-24 07:16:54 +00:00
Dockerfile chore: extract ContextMenu Zustand fix + a2a_proxy local-docker SSRF bypass + workspace-server Dockerfile GID entrypoint 2026-04-22 20:00:16 -07:00
Dockerfile.tenant feat(terminal): remote path via aws ec2-instance-connect + pty 2026-04-21 18:13:29 -07:00
entrypoint-tenant.sh fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155) 2026-04-20 23:51:33 +00:00
go.mod feat(#1957): wire gh-identity plugin into workspace-server 2026-04-24 15:01:41 +00:00
go.sum feat(#1957): wire gh-identity plugin into workspace-server 2026-04-24 18:28:18 +00:00