molecule-core/workspace-server/internal/registry
Hongming Wang 4915d1d59e fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB)
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.

Fix shape:

1. Provisioner stamps every ws-* container + volume with
   `molecule.platform.managed=true`. Without a label, the sweeper
   would have to assume any unlabeled ws-* container might belong
   to a sibling platform stack on a shared Docker daemon.

2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
   counterpart to the existing name-filter.

3. Sweeper splits sweepOnce into two independent passes:
     - sweepRemovedRows (unchanged behavior; status='removed' only)
     - sweepLabeledOrphansWithoutRows (new; labeled containers whose
       workspace_id has no row in the table at all)
   Each pass has its own short-circuit so an empty result or transient
   error in one doesn't block the other — load-bearing because the
   wiped-DB pass exists precisely for cases where the removed-row
   pass finds nothing.

Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)

Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.

Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.

Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:33:41 -07:00
..
access_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
access.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
healthsweep_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
healthsweep.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
hibernation_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
hibernation.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
liveness_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
liveness.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
orphan_sweeper_test.go fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB) 2026-04-26 14:33:41 -07:00
orphan_sweeper.go fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB) 2026-04-26 14:33:41 -07:00
provisiontimeout_test.go fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min 2026-04-26 01:44:09 -07:00
provisiontimeout.go fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min 2026-04-26 01:44:09 -07:00