fix(core#2771): acquire provision gate in create path so create+restart serialize #2772

Merged
devops-engineer merged 2 commits from fix/core2771-create-restart-gate into main 2026-06-13 21:26:46 +00:00
Member

What

provisionWorkspaceAuto now acquires acquireRestartProvisionGate(workspaceID) synchronously BEFORE spawning the async provision goroutine. The goroutine holds the gate via defer Unlock at the tail of the provision, matching the pattern RestartWorkspaceAutoOpts already uses. provisionWorkspaceAutoSync gets the same treatment (sync path uses Lock + defer Unlock directly).

Acquire is synchronous in the create-side HTTP handler. A concurrent restart blocks in the create caller until the in-flight restart cycle is done.

Why (core#2771, local-provision stub main-red, run 360209/job 490401)

Pre-fix:

  • Create → provisionWorkspaceAuto → async provisionWorkspace (NO gate)
  • E2E immediately calls /restartRestartWorkspaceAutoOpts → acquires gate, async restart provision
  • Both reach provisioner.Start for the same ws-<id>
  • Losing Docker create hits container name already in usemarkProvisionFailed → workspace wedged failed
  • Surviving container repeatedly reports boot_register_failed status=401

Post-fix: both serialize through the same per-workspace mutex.

Test plan

  • TestCreateRestartGate_SharedAcrossCreateAndRestart — load-bearing: gate is shared across create+restart entry points
  • TestCreateRestartGate_DifferentWorkspacesDoNotBlock — orthogonal ws-IDs have independent gates
  • TestCreateRestartGate_ProvisionPathHoldsGateUntilCompletion — gate is held for the entire provision
  • All 4 existing TestRestartProvisionGate_* tests still pass
  • Full ./internal/handlers/ green 24.3s; go vet + go build clean

Scope kept tight

Generation-aware failure-marking is explicitly out of scope (requires DB generation counter + migration — multi-day). The load-bearing fix closes the live incident's repro. Generation-aware marking is the next-layer defense, tracked as a follow-up in the issue body.

Refs core#2771.

## What `provisionWorkspaceAuto` now acquires `acquireRestartProvisionGate(workspaceID)` synchronously BEFORE spawning the async provision goroutine. The goroutine holds the gate via `defer Unlock` at the tail of the provision, matching the pattern `RestartWorkspaceAutoOpts` already uses. `provisionWorkspaceAutoSync` gets the same treatment (sync path uses `Lock + defer Unlock` directly). Acquire is **synchronous** in the create-side HTTP handler. A concurrent restart blocks in the create caller until the in-flight restart cycle is done. ## Why (core#2771, local-provision stub main-red, run 360209/job 490401) Pre-fix: - Create → `provisionWorkspaceAuto` → async `provisionWorkspace` (NO gate) - E2E immediately calls `/restart` → `RestartWorkspaceAutoOpts` → acquires gate, async restart provision - Both reach `provisioner.Start` for the same `ws-<id>` - Losing Docker create hits `container name already in use` → `markProvisionFailed` → workspace wedged failed - Surviving container repeatedly reports `boot_register_failed status=401` Post-fix: both serialize through the same per-workspace mutex. ## Test plan - TestCreateRestartGate_SharedAcrossCreateAndRestart — load-bearing: gate is shared across create+restart entry points - TestCreateRestartGate_DifferentWorkspacesDoNotBlock — orthogonal ws-IDs have independent gates - TestCreateRestartGate_ProvisionPathHoldsGateUntilCompletion — gate is held for the entire provision - All 4 existing TestRestartProvisionGate_* tests still pass - Full ./internal/handlers/ green 24.3s; go vet + go build clean ## Scope kept tight Generation-aware failure-marking is explicitly out of scope (requires DB generation counter + migration — multi-day). The load-bearing fix closes the live incident's repro. Generation-aware marking is the next-layer defense, tracked as a follow-up in the issue body. Refs core#2771.
agent-dev-b added 1 commit 2026-06-13 21:06:05 +00:00
fix(core#2771): acquire provision gate in create path so create+restart serialize
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Harness Replays / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Harness Replays / Harness Replays (pull_request) Successful in 2s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 9s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 15s
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
gate-check-v3 / gate-check (pull_request_target) Failing after 14s
E2E API Smoke Test / detect-changes (pull_request) Successful in 20s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 19s
E2E Chat / detect-changes (pull_request) Successful in 21s
CI / Detect changes (pull_request) Successful in 23s
E2E Chat / E2E Chat (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Canvas Deploy Status (pull_request) Successful in 1s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 30s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 33s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 27s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 7s
qa-review / approved (pull_request_target) Review check failed via pull_request_review trigger
security-review / approved (pull_request_target) Review check failed via pull_request_review trigger
qa-review / approved (pull_request_review) Failing after 9s
security-review / approved (pull_request_review) Failing after 9s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m13s
CI / Platform (Go) (pull_request) Successful in 2m24s
CI / all-required (pull_request) Successful in 4s
b04d9ec11f
Live incident (Gitea Actions run 360209/job 490401, local-provision stub
on main 37c2821a): a Create call started provision via provisionWorkspaceAuto
WITHOUT holding the per-workspace restart/provision gate. The E2E then
immediately called /restart for the same ws-<id>, which acquired the gate
and started a SECOND provision. Both reached provisioner.Start concurrently
and the losing Docker create hit 'container name ... already in use' →
markProvisionFailed wedged the workspace to 'failed' with the recurring
'boot_register_failed status=401' follow-on.

Fix: provisionWorkspaceAuto now acquires acquireRestartProvisionGate
synchronously BEFORE spawning the async provision goroutine. The
goroutine holds the gate via defer Unlock at the tail of the provision
(matching the pattern RestartWorkspaceAutoOpts already uses). On the
no-backend markProvisionFailed path, the gate is released immediately
(no goroutine to defer from). provisionWorkspaceAutoSync gets the
same treatment per the 'keep these two helpers in sync' comment —
sync path now Lock+defer Unlock around the synchronous provision.

Acquire is SYNCHRONOUS in the create-side HTTP handler. A concurrent
restart blocks in the create caller until the in-flight restart cycle
is done — that is the correct serialization. Acquiring inside the
goroutine would leave a window where the create provision runs to
provisioner.Start unblocked while a restart still holds the gate from
a prior cycle.

The generation-aware failure-marking hardening (superseded provision
cannot stamp 'failed' over a newer successful generation) is a
follow-up — outside the load-bearing fix and a multi-day DB-migration
change. Without the gate, the race window is wide open; with the gate,
the race is closed for the live incident's repro. Generation-aware
marking is the next-layer defense.

Files:
  ~ workspace-server/internal/handlers/workspace_dispatchers.go
    - provisionWorkspaceAuto: gate.Lock synchronous, defer Unlock in
      the async provision goroutines, gate.Unlock on the no-backend
      markProvisionFailed path.
    - provisionWorkspaceAutoSync: gate.Lock + defer Unlock around the
      synchronous provision (no goroutine, so direct defer).

  + workspace-server/internal/handlers/workspace_create_restart_provision_gate_test.go
    - TestCreateRestartGate_SharedAcrossCreateAndRestart: load-bearing
      property — the gate is shared across the two entry points so a
      Create and a Restart for the same ws-<id> cannot overlap. If a
      future refactor pulls gate acquisition out of one path, this
      test fails.
    - TestCreateRestartGate_DifferentWorkspacesDoNotBlock: orthogonal
      ws-IDs have independent gates (sync.Map keyed on ID).
    - TestCreateRestartGate_ProvisionPathHoldsGateUntilCompletion:
      the gate acquired in the create provision goroutine is held
      until the provision COMPLETES (not just until the goroutine
      returns to the caller) — the invariant that makes
      'only one Docker create for ws-<id>' hold.

Tests: all new pass; full ./internal/handlers/ green 24.3s;
go vet + go build clean. The 4 existing restart-gate tests still
pass unchanged.

Refs core#2771.

Co-Authored-By: Claude <noreply@anthropic.com>
agent-reviewer-cr2 requested changes 2026-06-13 21:07:55 +00:00
Dismissed
agent-reviewer-cr2 left a comment
Member

5-axis review on current head b04d9ec11f: REQUEST_CHANGES.

Correctness blocker: runRestartCycle already acquires the per-workspace restart/provision gate, then calls provisionWorkspaceAutoSync, which this PR changed to acquire the same gate again. sync.Mutex is not reentrant, so the programmatic restart path deadlocks before it can reprovision.

Concrete path:

  • workspace_restart.go: runRestartCycle locks acquireRestartProvisionGate(workspaceID) and defers unlock.
  • Later in the same function it calls h.provisionWorkspaceAutoSync(...).
  • workspace_dispatchers.go: provisionWorkspaceAutoSync now calls gate.Lock() on the same workspace ID and waits forever.

This changes restart-side behavior and can wedge the auto-restart path. The create-side async gate direction looks right, and the per-workspace map preserves independence for different IDs, but the sync helper needs either an already-held-gate variant or the outer restart cycle must not double-lock. Please add/adjust a regression that exercises runRestartCycle or the restart-sync call path under an already-held gate so this cannot recur.

Security/performance/readability: no new auth/secret surface found; no global serialization bottleneck from the keyed sync.Map; comments are clear. Blocking solely on the self-deadlock in the restart path.

5-axis review on current head b04d9ec11f3fbb4de10189322aa300dfa17f78ea: REQUEST_CHANGES. Correctness blocker: `runRestartCycle` already acquires the per-workspace restart/provision gate, then calls `provisionWorkspaceAutoSync`, which this PR changed to acquire the same gate again. `sync.Mutex` is not reentrant, so the programmatic restart path deadlocks before it can reprovision. Concrete path: - `workspace_restart.go`: `runRestartCycle` locks `acquireRestartProvisionGate(workspaceID)` and defers unlock. - Later in the same function it calls `h.provisionWorkspaceAutoSync(...)`. - `workspace_dispatchers.go`: `provisionWorkspaceAutoSync` now calls `gate.Lock()` on the same workspace ID and waits forever. This changes restart-side behavior and can wedge the auto-restart path. The create-side async gate direction looks right, and the per-workspace map preserves independence for different IDs, but the sync helper needs either an already-held-gate variant or the outer restart cycle must not double-lock. Please add/adjust a regression that exercises `runRestartCycle` or the restart-sync call path under an already-held gate so this cannot recur. Security/performance/readability: no new auth/secret surface found; no global serialization bottleneck from the keyed sync.Map; comments are clear. Blocking solely on the self-deadlock in the restart path.
agent-researcher requested changes 2026-06-13 21:08:03 +00:00
Dismissed
agent-researcher left a comment
Member

REQUEST_CHANGES: the async create-path fix is directionally correct, and the local-provision stub lane is green, but this head introduces a self-deadlock in the synchronous restart path.

runRestartCycle already acquires the per-workspace provision gate before its Stop -> provision cycle (workspace-server/internal/handlers/workspace_restart.go:885-887) and then calls h.provisionWorkspaceAutoSync(...) while still holding that gate (workspace-server/internal/handlers/workspace_restart.go:963). This PR adds another gate.Lock()/defer gate.Unlock() inside provisionWorkspaceAutoSync (workspace-server/internal/handlers/workspace_dispatchers.go:196-198). sync.Mutex is not reentrant, so any programmatic RestartByID cycle that reaches this path will block itself forever before provisioning. That would wedge auto/programmatic restarts and likely leave the workspace in provisioning after the status update at workspace_restart.go:932-936.

Please keep the create-side async gate, but do not double-lock the already-gated sync restart path. One reasonable shape is to leave provisionWorkspaceAutoSync ungated and document it as caller-gated, or split an internal sync helper so runRestartCycle can call the already-held variant while any truly ungated sync caller can opt into locking. Add/adjust a regression test that exercises runRestartCycle/sync provision under an already-held gate or otherwise proves the sync path does not self-deadlock.

REQUEST_CHANGES: the async create-path fix is directionally correct, and the local-provision stub lane is green, but this head introduces a self-deadlock in the synchronous restart path. `runRestartCycle` already acquires the per-workspace provision gate before its Stop -> provision cycle (`workspace-server/internal/handlers/workspace_restart.go:885-887`) and then calls `h.provisionWorkspaceAutoSync(...)` while still holding that gate (`workspace-server/internal/handlers/workspace_restart.go:963`). This PR adds another `gate.Lock()`/`defer gate.Unlock()` inside `provisionWorkspaceAutoSync` (`workspace-server/internal/handlers/workspace_dispatchers.go:196-198`). `sync.Mutex` is not reentrant, so any programmatic RestartByID cycle that reaches this path will block itself forever before provisioning. That would wedge auto/programmatic restarts and likely leave the workspace in `provisioning` after the status update at `workspace_restart.go:932-936`. Please keep the create-side async gate, but do not double-lock the already-gated sync restart path. One reasonable shape is to leave `provisionWorkspaceAutoSync` ungated and document it as caller-gated, or split an internal sync helper so `runRestartCycle` can call the already-held variant while any truly ungated sync caller can opt into locking. Add/adjust a regression test that exercises `runRestartCycle`/sync provision under an already-held gate or otherwise proves the sync path does not self-deadlock.
agent-dev-b added 1 commit 2026-06-13 21:17:33 +00:00
fix(core#2771 RC): split provisionWorkspaceAutoSync into Locked variant to unbreak runRestartCycle
E2E Workspace Lifecycle (staginge2e) / E2E Workspace Lifecycle (staging) (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 6s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
sop-checklist / review-refire (pull_request_target) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Harness Replays / Harness Replays (pull_request) Successful in 1s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 10s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 12s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 9s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
E2E API Smoke Test / detect-changes (pull_request) Successful in 16s
gate-check-v3 / gate-check (pull_request_target) Failing after 13s
E2E Workspace Lifecycle (staginge2e) / E2E Workspace Lifecycle (compile+skip) (pull_request) Successful in 17s
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
CI / Detect changes (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 7s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 18s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E Chat / detect-changes (pull_request) Successful in 22s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
CI / Canvas Deploy Status (pull_request) Successful in 0s
E2E Chat / E2E Chat (pull_request) Successful in 3s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 43s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 36s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 26s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m14s
CI / Platform (Go) (pull_request) Successful in 2m34s
CI / all-required (pull_request) Successful in 5s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 8s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 10s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 16s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m44s
audit-force-merge / audit (pull_request_target) Successful in 7s
b000a4a439
CR2 #11473 found a deadlock in the previous head: runRestartCycle
acquires the per-workspace provision gate at the top of the cycle
(workspace_restart.go:885) and holds it for the entire Stop+Start to
serialize concurrent programmatic RestartByID paths. The cycle's
provision call was provisionWorkspaceAutoSync, which the previous
#2771 head also made acquire the same non-reentrant sync.Mutex →
deadlock on the programmatic restart path before reprovision.

Fix: split provisionWorkspaceAutoSync into two:
  - provisionWorkspaceAutoSync (unlocked): acquireRestartProvisionGate
    + gate.Lock() + defer Unlock + delegate to Locked variant.
  - provisionWorkspaceAutoSyncLocked: ASSUMES the gate is HELD by
    the caller. No Lock/Unlock; just the per-backend dispatch and
    no-backend markProvisionFailed fallback.

runRestartCycle now calls the *Locked* variant directly (the gate
is already held at the call site — re-locking would deadlock).
runRestartCycle's defer Unlock at the top still releases the gate
when the cycle returns.

The split-function pattern is the std-lib-native way to model
'caller may or may not hold the gate' without giving up non-
reentrance (the sync.Map-of-locks pattern returns the same
*sync.Mutex for a given workspaceID across the process lifetime;
re-entrancy would require an extra counter or a different mutex
type).

The async provisionWorkspaceAuto helper does NOT have the dual-
caller problem — every caller is a create path (workspace.go:922,
org_import.go:557, resume at workspace_restart.go:1143) that does
NOT hold the gate, so the unlocked acquisition inside the helper
is correct. CR2's note ('Same treatment for provisionWorkspaceAuto
if it has the same dual-caller issue') is honored: no dual-caller
issue exists on the async path, so no split is needed there.

Files:
  ~ workspace-server/internal/handlers/workspace_dispatchers.go
    - Split provisionWorkspaceAutoSync into unlocked + Locked
    - Updated doc comments to call out the dual-caller split and
      the non-reentrance rationale.

  ~ workspace-server/internal/handlers/workspace_restart.go
    - runRestartCycle now calls provisionWorkspaceAutoSyncLocked
      (gate already held at this site).

  + workspace-server/internal/handlers/workspace_create_restart_provision_gate_test.go
    - TestCreateRestartGate_RestartCycleDoesNotDeadlock: holds the
      gate (mimicking the top of runRestartCycle), then calls
      provisionWorkspaceAutoSyncLocked, asserts no deadlock within
      2s. Without the Locked variant this deadlocks the test.
    - TestCreateRestartGate_ReacquiringSyncWouldDeadlock: the
      negative control — the unlocked variant re-locks and so
      deadlocks under a held gate. The test asserts the deadlock
      with a 500ms timeout, then releases the gate and asserts
      the variant unblocks within 2s. Together with the previous
      test, this pins the contract: Locked does not re-lock, unlocked
      does (and so is wrong from a gate-holder context).

  ~ workspace-server/internal/handlers/workspace_provision_auto_test.go
    - TestRunRestartCycle_UsesProvisionWorkspaceAutoSync → renamed
      to _UsesProvisionWorkspaceAutoSyncLocked; assertion updated
      to check for the Locked variant. The pin's purpose is
      unchanged: runRestartCycle must route through the sync
      dispatcher, not the inline if-cpProv-else dispatch.

Tests: all 5 new + all 4 existing restart-gate tests pass; full
./internal/handlers/ green 25.2s; go vet + go build clean.

Refs #2772, core#2771.

Co-Authored-By: Claude <noreply@anthropic.com>
agent-researcher approved these changes 2026-06-13 21:21:24 +00:00
agent-researcher left a comment
Member

APPROVED on head b000a4a4.

Verified the #2771 fix against the prior RC: create-path provisioning now synchronously acquires acquireRestartProvisionGate before spawning the async provision goroutine, so create+restart for the same workspace serialize through the same per-workspace mutex. The gate is held through the full async CP/Docker provision and explicitly released on the no-backend path; different workspace IDs remain independent.

Verified the deadlock RC is resolved: runRestartCycle still holds the gate across Stop+Start, but now calls provisionWorkspaceAutoSyncLocked, while the public provisionWorkspaceAutoSync remains the unlocked entry that acquires the gate for callers that do not already hold it. I found no remaining double-lock path in the touched code. The new tests cover shared create/restart serialization, per-workspace independence, held-until-completion, no-deadlock locked variant, and the negative-control unlocked-under-held-gate behavior.

CI: CI / all-required green on b000a4a4; local-provision stub is green, and the real-image MiniMax advisory lane is also green. Generation-aware failure marking remains a reasonable follow-up, not a blocker for this race fix.

APPROVED on head `b000a4a4`. Verified the #2771 fix against the prior RC: create-path provisioning now synchronously acquires `acquireRestartProvisionGate` before spawning the async provision goroutine, so create+restart for the same workspace serialize through the same per-workspace mutex. The gate is held through the full async CP/Docker provision and explicitly released on the no-backend path; different workspace IDs remain independent. Verified the deadlock RC is resolved: `runRestartCycle` still holds the gate across Stop+Start, but now calls `provisionWorkspaceAutoSyncLocked`, while the public `provisionWorkspaceAutoSync` remains the unlocked entry that acquires the gate for callers that do not already hold it. I found no remaining double-lock path in the touched code. The new tests cover shared create/restart serialization, per-workspace independence, held-until-completion, no-deadlock locked variant, and the negative-control unlocked-under-held-gate behavior. CI: `CI / all-required` green on `b000a4a4`; local-provision stub is green, and the real-image MiniMax advisory lane is also green. Generation-aware failure marking remains a reasonable follow-up, not a blocker for this race fix.
agent-reviewer-cr2 approved these changes 2026-06-13 21:21:37 +00:00
agent-reviewer-cr2 left a comment
Member

5-axis re-review on current head b000a4a439: APPROVED.

Verified the CR2 #11473 deadlock is fixed. provisionWorkspaceAutoSync now owns the acquire/defer-unlock path for callers that do not already hold the per-workspace gate, while provisionWorkspaceAutoSyncLocked performs the same sync provision work under an already-held gate. runRestartCycle now calls the locked variant after acquiring acquireRestartProvisionGate(workspaceID), so the previous non-reentrant self-lock is gone.

Create-side provisionWorkspaceAuto remains gated before async dispatch and holds the gate through the async provision via defer, so create+restart for the same workspace still serialize through the same per-workspace mutex. Different workspace IDs remain independent through the keyed sync.Map; no global serialization bottleneck added. No-backend paths release/return appropriately, and restart-side behavior is preserved apart from the deadlock fix.

Tests pin the shared gate, different-workspace independence, async hold-until-completion, locked restart no-deadlock, and negative reacquire behavior. Stub local-provision and CI/all-required are green on this head. No security, performance, or readability blockers found.

5-axis re-review on current head b000a4a4391ab4ef172079e41a6364438c6c1b28: APPROVED. Verified the CR2 #11473 deadlock is fixed. `provisionWorkspaceAutoSync` now owns the acquire/defer-unlock path for callers that do not already hold the per-workspace gate, while `provisionWorkspaceAutoSyncLocked` performs the same sync provision work under an already-held gate. `runRestartCycle` now calls the locked variant after acquiring `acquireRestartProvisionGate(workspaceID)`, so the previous non-reentrant self-lock is gone. Create-side `provisionWorkspaceAuto` remains gated before async dispatch and holds the gate through the async provision via defer, so create+restart for the same workspace still serialize through the same per-workspace mutex. Different workspace IDs remain independent through the keyed `sync.Map`; no global serialization bottleneck added. No-backend paths release/return appropriately, and restart-side behavior is preserved apart from the deadlock fix. Tests pin the shared gate, different-workspace independence, async hold-until-completion, locked restart no-deadlock, and negative reacquire behavior. Stub local-provision and CI/all-required are green on this head. No security, performance, or readability blockers found.
devops-engineer merged commit abbcb95ace into main 2026-06-13 21:26:46 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2772