forked from molecule-ai/molecule-core
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.
Two compounding bugs in handlers/workspace_crud.go's delete cascade:
1. The cleanup loop used `c.Request.Context()` for the Docker
stop/remove calls. When the canvas's `api.del` resolved on the
platform's 200, gin cancelled the request ctx — and any in-flight
Docker call cancelled with `context canceled`, leaving the
container alive. Old logs:
"Delete descendant <id> volume removal warning:
... context canceled"
2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
ran unconditionally afterward. When Stop didn't actually kill the
container (transient daemon error, ctx cancellation as in #1), the
volume removal would predictably fail with "volume in use" and
the container kept running with the volume mounted. Old logs:
"Delete descendant <id> volume removal warning:
Error response from daemon: remove ... volume is in use"
Fix layered in two parts:
- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
+ a 30s bounded timeout. Stop's error is now checked and on
failure we skip RemoveVolume entirely (the orphan sweeper below
catches what we deferred).
- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
initial run on boot). Lists running ws-* containers via Docker name
filter, intersects with DB rows where status='removed', stops +
removes volumes for the leaks. Defence in depth — even a brand-new
Stop failure mode heals on the next sweep instead of leaking
forever.
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.
main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.
6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.
Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| cp_config_test.go | ||
| cp_config.go | ||
| dotenv_test.go | ||
| dotenv.go | ||
| main.go | ||