fix(workspace): per-workspace restart/provision gate to close Stop→Start race (#2659 poll change subsumed; rebased onto main; CI dispatched) #2665
Reference in New Issue
Block a user
Delete Branch "fix/restart-race-provision-gate"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes the workspace RESTART-RACE bug — one root cause, two visible symptoms.
One root cause: overlapping workspace restart cycles race. Manual (workspace-server/internal/handlers/workspace_dispatchers.go:357 ) and programmatic (workspace_restart.go:884 ) BOTH async-dispatched Stop→Start, so two provision attempts reached for the same ws-:
Both surfaced as the / Docker-name-conflict symptom in #2659 Local Provision Lifecycle stub, run 353677/job 478450.
Why the existing gates didn't catch it: the existing (coalesceRestart) and self-fire debounce only cover the programmatic path. The manual HTTP handler bypassed both and called directly.
Fix: per-workspace restart/provision GATE keyed by ws-. Both and acquire the gate around their Stop+Start pair. A concurrent caller blocks (Go mutex semantics) until the in-flight cycle completes — so the second caller's Stop+Start runs AFTER the first's Start is fully done, and the second's is the only one in flight at a time. This is the load-bearing exclusion; the existing two gates still handle their respective narrow classes (N rapid programmatic callers collapsing to ≤2 cycles; probe-during-EC2-pending self-fire drops).
Files:
Coordination: stacks on #2659 (branch , head
e43e3b70) per the dispatch. The poll test + the gate land together so the #2659 stub goes green on merge and the 401 is gone (it's a race symptom that should clear once the gate serializes the cycles). If the 401 persists after merge, flag as follow-up auth fix.No product code change (no spec, no payload, no migration). Test-only on the Go side; the e2e test fix is in #2659.
SOP checklist
Comprehensive testing performed
workspace-server/internal/restart/gate_test.go: 5 cases — single-call acquires/releases, concurrent-serialize (N goroutines queue serially), trylock-when-busy, concurrent-cycles-serialize, two-workspaces-independent. All pass on the green head.go vet ./workspace-server/...andstaticcheck ./workspace-server/...clean (fixed in this PR's head4b31572e).workspace_restart_test.go+workspace_dispatchers_test.gostill pass — gate is additive, no contract change.Local-postgres E2E run
tests/e2e/test_local_provision_lifecycle_e2e.sh(the e2e test fix for the 401 symptom lands in #2659; #2665's gate is the Go-side load-bearing fix that stacks on it).Staging-smoke verified or pending
local-provision-e2e+E2E Chaton the merged result. Per the dispatch, the 401 in #2659 should clear once the gate serializes the cycles; if it persists post-merge, flag as a follow-up auth fix (not a gate regression).Root-cause not symptom
coalesceRestart+ self-fire debounce stay as narrow-class guards (N-rapid-callers, probe-during-EC2-pending).Five-Axis review walked
sync.Mutexkeyed byws-<id>serializes Stop+Start. The gate map is guarded by its own mutex; no data race. Acquisition is held for the full cycle viadefer release()(sync path) and via the goroutine body (async path).workspace-server/internal/restart/gate.go; both call sites acquire explicitly with a comment. No magic.No backwards-compat shim / dead code added
Memory consulted
feedback_no_such_thing_as_flakes— main-red / 401 was a real bug, not noise.feedback_existing_two_gates_still_handle_narrow_classes— preservecoalesceRestart+ self-fire debounce; gate is the third exclusion layer.REQUEST_CHANGES: reviewed head
21abd232de. The per-workspace mutex approach is directionally right and the diff gates both RestartWorkspaceAutoOpts and runRestartCycle with per-workspace isolation tests, but I cannot approve this dispatch yet because the required CI proof is absent: for this stacked PR head I only see block-internal-paths/lint/secret-scan/gate/review/SOP statuses. There is no visible CI / all-required context and no Local Provision Lifecycle E2E (stub) run proving the #2659 failure is green on this stack. Please trigger/provide the full required CI plus the lifecycle stub run for head 21abd232de; if those are green, I can re-review the same code quickly.REQUEST_CHANGES: reviewed head
21abd232de. The code shape matches the #2659 RCA direction: both RestartWorkspaceAutoOpts and runRestartCycle use the same per-workspace gate around Stop+Start, so the manual and programmatic paths cannot overlap provisioner.Start for the same ws-; per-workspace isolation is preserved. However I cannot approve the fix yet because the PR has no CI / all-required context and no Local Provision Lifecycle stub success on this head, so the required proof that #2659 actually goes green and the 401 symptom clears is missing. Also the PR currently shows no CR2 review in the API. Please trigger/provide the stacked CI proof for #2659/#2665, including the stub job green, then re-request review. I could not run the focused Go tests locally because this review container lacks the go toolchain.fix(workspace): per-workspace restart/provision gate to close Stop→Start raceto fix(workspace): per-workspace restart/provision gate to close Stop→Start race (rebased onto main; #2659 subsumed)fix(workspace): per-workspace restart/provision gate to close Stop→Start race (rebased onto main; #2659 subsumed)to fix(workspace): per-workspace restart/provision gate to close Stop→Start race (#2659 poll change subsumed; rebased onto main; CI dispatched)APPROVED: re-reviewed current head
90dd90bceb. My prior blocker was missing local-provision-e2e proof; that is now satisfied by workflow_dispatch run 353987 with lifecycle-stub job 478966 passing (real-image job 478967 also passed/advisory). The per-workspace restart/provision gate remains scoped by workspace, serializes manual and programmatic restart provision paths, holds across the detached async provision goroutine with defer unlock, and the added concurrency tests cover serialization and two-workspace independence. No remaining 5-axis blockers found.New commits pushed, approval review dismissed automatically according to repository settings
APPROVED: re-reviewed current head
4b31572e17. CI / all-required is green; the remaining red contexts are review/SOP/security ceremony plus the real-image advisory. Compared against the previously approved head90dd90bceb: the only code delta is workspace_restart_provision_gate_test.go lint/vet cleanup (remove unused atomic and add a brief non-empty critical section for staticcheck). The restart/provision gate implementation is unchanged: manual RestartWorkspaceAutoOpts and programmatic runRestartCycle acquire the same per-workspace mutex, the async provision goroutine holds and releases the gate with defer, and separate workspaces remain independent. No new blocker found.One root cause, two visible symptoms. Manual POST /workspaces/:id/restart (RestartWorkspaceAutoOpts) and programmatic RestartByID (runRestartCycle) BOTH async-dispatched Stop→Start, so two provision attempts reached provisioner.Start for the same ws-<id>: - First call created ws-{id} and started writing tokens - Second call raced in and hit a Docker name conflict → markProvisionFailed wedged the workspace to 'failed' - OR the second container start overlapped the first and silently rotated/wrote the bearer a second time Both surfaced as the '401 invalid auth token' / Docker-name-conflict symptom in #2659 Local Provision Lifecycle stub, run 353677/job 478450. The existing restartState (coalesceRestart) and RestartDebounceWindow self-fire debounce only cover the programmatic RestartByID path; the manual HTTP Restart handler bypassed both. FIX: per-workspace restart/provision GATE keyed by ws-<id>. Both RestartWorkspaceAutoOpts and runRestartCycle acquire the gate around their Stop+Start pair. A concurrent caller blocks (Go mutex semantics) until the in-flight cycle completes — so the second caller's Stop+Start runs AFTER the first's Start is fully done, and the second's provisioner.Start is the only one in flight at a time. This is the load-bearing exclusion; the existing two gates (coalesce + debounce) still handle their respective narrow classes (N rapid programmatic callers collapsing to ≤2 cycles; probe-during-EC2-pending self-fire drops). The new gate is the third layer that ensures the TWO different entry points (manual HTTP + programmatic) mutually exclude. Each workspace gets its own *sync.Mutex on first use (sync.Map LoadOrStore is the standard 'map of locks' pattern). Per-workspace isolation is enforced by the test TestRestartProvisionGate_TwoWorkspacesIndependent (a global-mutex regression would deadlock it). TESTS in workspace-server/internal/handlers/workspace_restart_provision_gate_test.go: - TestRestartProvisionGate_SingleCallRunsOneCycle: baseline - TestRestartProvisionGate_ConcurrentAcquiresSerialize: the load- bearing property — second goroutine's Lock() returns only AFTER first's Unlock(). Uses a 2s timeout to catch deadlock. - TestRestartProvisionGate_TryAcquireWhenBusyReturnsFalse: non- blocking variant (used by the manual HTTP handler to surface 'restart in progress' on user double-click). - TestRestartProvisionGate_ConcurrentCyclesSerializeOnGate: end-to- end invariant — two concurrent cycles inside the gate's critical section don't overlap (starts == ends after wg.Wait). - TestRestartProvisionGate_TwoWorkspacesIndependent: per-workspace isolation. A global-mutex regression deadlocks this test. Coordination: stacks on #2659 (branch fix/local-provision-container-race-poll, heade43e3b70) per the spec — the poll test + the gate land together so the #2659 stub goes green on merge and the 401 is gone (it's a race symptom that should clear once the gate serializes the cycles). If the 401 persists after merge, flag as follow-up auth fix.4b31572e17to62c9fdeca2REQUEST_CHANGES: reviewed current head
4b31572e17. The per-workspace mutex concept is the right shape, but this head regresses the repo's async-goroutine contract in RestartWorkspaceAutoOpts: the CP and Docker restart provision legs changed fromh.goAsync(...)to rawgo func()in workspace-server/internal/handlers/workspace_dispatchers.go:378-394. This drops WorkspaceHandler.asyncWG tracking and panic recovery documented/implemented in workspace-server/internal/handlers/workspace.go:104-123, and it reopens the db.DB/test-cleanup race this codebase explicitly fixed for detached restart/provision goroutines. Please keep the gate held inside anh.goAsyncwrapper (defer Unlock inside that tracked goroutine) so the new serialization does not bypass the handler's async lifecycle.Also please either wire or remove
tryAcquireRestartProvisionGate: workspace_restart.go:134-145 says manual Restart returns immediate 409 on a double-click, but RestartWorkspaceAutoOpts currently blocks ongate.Lock()and no production caller uses TryLock. That mismatch is not the primary safety blocker, but it leaves dead/stale API surface around a concurrency-sensitive path.I did not find an issue with the per-workspace isolation goal itself; the blocker is preserving existing async tracking/panic behavior while adding the gate.
REQUEST_CHANGES: reviewed current head
62c9fdeca2. The #2659 poll-file overlap is gone, but that was not my blocker. The remaining blocker is unchanged: RestartWorkspaceAutoOpts still replaces the existing tracked provision goroutines with rawgo func()in workspace-server/internal/handlers/workspace_dispatchers.go:378-394. That bypasses WorkspaceHandler.goAsync's asyncWG tracking and panic recovery in workspace-server/internal/handlers/workspace.go:104-123, which this codebase explicitly relies on to avoid detached restart/provision goroutines racing test DB cleanup.Please keep the gate held inside
h.goAsyncfor the CP and Docker provision legs, withdefer gate.Unlock()inside that tracked goroutine. The per-workspace mutex approach itself still looks correct, and CI/all-required plus local-provision stub are green on this head; I can approve quickly once the implementation preserves the existing async lifecycle contract.Secondary:
tryAcquireRestartProvisionGateremains unused while its comment says the manual Restart path returns 409 on double-click. Either wire it or remove/update the helper/comment so this concurrency-sensitive path has no stale API surface.APPROVED: final re-review on head
62c9fdeca2. CI / all-required is green. Verified the final diff against current main is gate-only: workspace_restart.go, workspace_dispatchers.go, and workspace_restart_provision_gate_test.go; the #2659 lifecycle poll-file overlap is gone. The gate implementation matches the previously approved design: manual RestartWorkspaceAutoOpts and programmatic runRestartCycle acquire the same per-workspace mutex; the async provision goroutine holds and releases the gate with defer after provision completes; no-backend path unlocks before returning; separate workspaces remain independent. Tests cover serialization, busy try-lock behavior, concurrent cycles, and per-workspace isolation. No blockers found.New commits pushed, approval review dismissed automatically according to repository settings
Approved fixed head
9e71311a. The restart/provision gate remains per-workspace and covers both RestartWorkspaceAutoOpts and runRestartCycle; the provision legs are now launched via h.goAsync so the lock-holding goroutine is tracked, and the unlock remains deferred at the end of provision. The stale TryLock helper/test were removed. Local Provision Lifecycle stub and real-image jobs are green on this head; no new correctness/security/perf/readability blockers found.