molecule-core/workspace-server/internal/provisioner
Hongming Wang cb12601414 fix(platform): make Provisioner.Stop return real errors so cleanup gates fire
Review caught a critical issue with 12c49183: the headline "skip
RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop`
unconditionally `return nil`'d after logging the underlying
ContainerRemove error, so the new `if err := h.provisioner.Stop(...);
err != nil { skip volume }` guard in workspace_crud.go AND the same
guard in the orphan sweeper could never fire. RemoveVolume always
ran, predictably failing with "volume in use" when Stop hadn't
actually killed the container — which is the exact production bug
the commit claimed to fix.

Now Stop:
  - returns nil on successful remove (no change)
  - returns nil when the container is already gone (uses the existing
    isContainerNotFound helper — that's the cleanup post-condition,
    not a failure)
  - returns the wrapped Docker error otherwise (daemon timeout, ctx
    cancellation, socket EOF — anything that means the container
    might still be alive)

Audited every Provisioner.Stop caller in the tree (team.go,
workspace_restart.go ×4, workspace.go) — all of them already
discard the return value, so the widened error surface is purely
opt-in for the new cleanup paths and breaks no existing behaviour.

Other review-driven fixes in this commit:

- workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from
  the request ctx too. RecordAndBroadcast does INSERT INTO
  structure_events + Redis Publish; if the canvas hangs up, a
  request-ctx-bound INSERT can be cancelled mid-write and the
  WORKSPACE_REMOVED event never lands, leaving other WS clients
  ignorant of the cascade.

- orphan_sweeper.go: added isLikelyWorkspaceID guard before turning
  Docker container prefixes into SQL LIKE patterns. The Docker
  name filter is a SUBSTRING match (not prefix), so non-workspace
  containers like `my-ws-tool` slip through; the in-loop HasPrefix
  in provisioner trims most, but the in-sweeper alphabet check
  (hex + dashes only) is the second line of defence and also
  blocks SQL LIKE wildcards (`_`, `%`) from reaching the query.
  Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes
  and TestIsLikelyWorkspaceID with 10 alphabet cases.

- provisioner.go: comment added to ListWorkspaceContainerIDPrefixes
  flagging the substring/HasPrefix relationship as load-bearing.

Verified: full Go test suite passes; all 8 sweeper tests pass
(2 new for the LIKE-pattern guard); existing dispatch / delete /
provisioner tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 23:32:48 -07:00
..
backend_contract_test.go docs: testing strategy + PR hygiene + backend parity matrix + boot-event postmortem (#1824) 2026-04-23 19:59:38 +00:00
cp_provisioner_instance_id_test.go test: regression guard for #1738 — cp-provisioner uses real instance_id 2026-04-23 17:45:13 -07:00
cp_provisioner_test.go fix(cp-provisioner): look up real EC2 instance_id for Stop + IsRunning (#1738) 2026-04-23 18:25:29 +00:00
cp_provisioner.go fix(cp-provisioner): look up real EC2 instance_id for Stop + IsRunning (#1738) 2026-04-23 18:25:29 +00:00
isrunning_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
platform_test.go fix(provisioner): force linux/amd64 pull + create on Apple Silicon hosts (#1875) 2026-04-23 14:55:34 -07:00
provisioner_test.go feat(provisioner): pull workspace-template images from GHCR 2026-04-22 12:39:56 -07:00
provisioner.go fix(platform): make Provisioner.Stop return real errors so cleanup gates fire 2026-04-25 23:32:48 -07:00