fix(core#2771): acquire provision gate in create path so create+restart serialize #2772
Reference in New Issue
Block a user
Delete Branch "fix/core2771-create-restart-gate"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
provisionWorkspaceAutonow acquiresacquireRestartProvisionGate(workspaceID)synchronously BEFORE spawning the async provision goroutine. The goroutine holds the gate viadefer Unlockat the tail of the provision, matching the patternRestartWorkspaceAutoOptsalready uses.provisionWorkspaceAutoSyncgets the same treatment (sync path usesLock + defer Unlockdirectly).Acquire is synchronous in the create-side HTTP handler. A concurrent restart blocks in the create caller until the in-flight restart cycle is done.
Why (core#2771, local-provision stub main-red, run 360209/job 490401)
Pre-fix:
provisionWorkspaceAuto→ asyncprovisionWorkspace(NO gate)/restart→RestartWorkspaceAutoOpts→ acquires gate, async restart provisionprovisioner.Startfor the samews-<id>container name already in use→markProvisionFailed→ workspace wedged failedboot_register_failed status=401Post-fix: both serialize through the same per-workspace mutex.
Test plan
Scope kept tight
Generation-aware failure-marking is explicitly out of scope (requires DB generation counter + migration — multi-day). The load-bearing fix closes the live incident's repro. Generation-aware marking is the next-layer defense, tracked as a follow-up in the issue body.
Refs core#2771.
5-axis review on current head
b04d9ec11f: REQUEST_CHANGES.Correctness blocker:
runRestartCyclealready acquires the per-workspace restart/provision gate, then callsprovisionWorkspaceAutoSync, which this PR changed to acquire the same gate again.sync.Mutexis not reentrant, so the programmatic restart path deadlocks before it can reprovision.Concrete path:
workspace_restart.go:runRestartCyclelocksacquireRestartProvisionGate(workspaceID)and defers unlock.h.provisionWorkspaceAutoSync(...).workspace_dispatchers.go:provisionWorkspaceAutoSyncnow callsgate.Lock()on the same workspace ID and waits forever.This changes restart-side behavior and can wedge the auto-restart path. The create-side async gate direction looks right, and the per-workspace map preserves independence for different IDs, but the sync helper needs either an already-held-gate variant or the outer restart cycle must not double-lock. Please add/adjust a regression that exercises
runRestartCycleor the restart-sync call path under an already-held gate so this cannot recur.Security/performance/readability: no new auth/secret surface found; no global serialization bottleneck from the keyed sync.Map; comments are clear. Blocking solely on the self-deadlock in the restart path.
REQUEST_CHANGES: the async create-path fix is directionally correct, and the local-provision stub lane is green, but this head introduces a self-deadlock in the synchronous restart path.
runRestartCyclealready acquires the per-workspace provision gate before its Stop -> provision cycle (workspace-server/internal/handlers/workspace_restart.go:885-887) and then callsh.provisionWorkspaceAutoSync(...)while still holding that gate (workspace-server/internal/handlers/workspace_restart.go:963). This PR adds anothergate.Lock()/defer gate.Unlock()insideprovisionWorkspaceAutoSync(workspace-server/internal/handlers/workspace_dispatchers.go:196-198).sync.Mutexis not reentrant, so any programmatic RestartByID cycle that reaches this path will block itself forever before provisioning. That would wedge auto/programmatic restarts and likely leave the workspace inprovisioningafter the status update atworkspace_restart.go:932-936.Please keep the create-side async gate, but do not double-lock the already-gated sync restart path. One reasonable shape is to leave
provisionWorkspaceAutoSyncungated and document it as caller-gated, or split an internal sync helper sorunRestartCyclecan call the already-held variant while any truly ungated sync caller can opt into locking. Add/adjust a regression test that exercisesrunRestartCycle/sync provision under an already-held gate or otherwise proves the sync path does not self-deadlock.APPROVED on head
b000a4a4.Verified the #2771 fix against the prior RC: create-path provisioning now synchronously acquires
acquireRestartProvisionGatebefore spawning the async provision goroutine, so create+restart for the same workspace serialize through the same per-workspace mutex. The gate is held through the full async CP/Docker provision and explicitly released on the no-backend path; different workspace IDs remain independent.Verified the deadlock RC is resolved:
runRestartCyclestill holds the gate across Stop+Start, but now callsprovisionWorkspaceAutoSyncLocked, while the publicprovisionWorkspaceAutoSyncremains the unlocked entry that acquires the gate for callers that do not already hold it. I found no remaining double-lock path in the touched code. The new tests cover shared create/restart serialization, per-workspace independence, held-until-completion, no-deadlock locked variant, and the negative-control unlocked-under-held-gate behavior.CI:
CI / all-requiredgreen onb000a4a4; local-provision stub is green, and the real-image MiniMax advisory lane is also green. Generation-aware failure marking remains a reasonable follow-up, not a blocker for this race fix.5-axis re-review on current head
b000a4a439: APPROVED.Verified the CR2 #11473 deadlock is fixed.
provisionWorkspaceAutoSyncnow owns the acquire/defer-unlock path for callers that do not already hold the per-workspace gate, whileprovisionWorkspaceAutoSyncLockedperforms the same sync provision work under an already-held gate.runRestartCyclenow calls the locked variant after acquiringacquireRestartProvisionGate(workspaceID), so the previous non-reentrant self-lock is gone.Create-side
provisionWorkspaceAutoremains gated before async dispatch and holds the gate through the async provision via defer, so create+restart for the same workspace still serialize through the same per-workspace mutex. Different workspace IDs remain independent through the keyedsync.Map; no global serialization bottleneck added. No-backend paths release/return appropriately, and restart-side behavior is preserved apart from the deadlock fix.Tests pin the shared gate, different-workspace independence, async hold-until-completion, locked restart no-deadlock, and negative reacquire behavior. Stub local-provision and CI/all-required are green on this head. No security, performance, or readability blockers found.