fix(molecule-core#2422): pre-claim gates stop (CR2 #11473) + provision via Auto (RCA tick) #2778
Reference in New Issue
Block a user
Delete Branch "fix/2422-deadlock-and-routing"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Picks up stalled PR #2422 (feat/ws-switch-provider-endpoint, head
4d7256c6) on the devops-engineer branch and addresses the two remaining RCs:CR2 (agent-reviewer-cr2) blocking finding on
4d7256c6: SwitchProvider performedcpStopWithRetryErrBEFORE the status<>provisioning CAS, so a request against a workspace that was already provisioning (or a racing duplicate) could still execute the stop side effect and only then return ALREADY_SWITCHING — the loser would stop a box it didn't own.Researcher RCA tick on
4d7256c6: workspace_switch_provider.go was callingh.provisionWorkspaceCP()directly, bypassing the central Auto dispatcher. The source-level pinTestNoCallSiteCallsDirectProvisionerExceptAutowas red on the Platform Go lane as a result.Fix
Pre-claim gates the stop — split the original combined UPDATE into a PRE-CLAIM and a provider-write, with the stop strictly between them:
UPDATE workspaces SET status="provisioning" WHERE id=$1 AND status<>$2 AND COALESCE(compute->>"provider","") IS NOT DISTINCT FROM $3. 0 rows → 409 ALREADY_SWITCHING, NO stop.h.provisionWorkspaceAuto(noth.provisionWorkspaceCPdirectly — see #2).Route the new-box provision through
provisionWorkspaceAuto(centralized dispatcher) instead ofh.provisionWorkspaceCPdirectly. The pre-claim already setstatus="provisioning", so Auto's per-workspace provision gate (from core#2771) serializes the new provision against any concurrent restart for the same ws-.CR3 (already addressed)
The CR3 content-security RC on
f38dc96fwas ALREADY addressed in4d7256c6(sanitize comments/test prose). Verified: the current source no longer exposes orphan/CP-side reconciler/internal mechanics in the test or handler prose.Test plan
TestSwitchProvider_PreClaimGatesStop(NEW) — load-bearing source-level position check that the ALREADY_SWITCHING 409 path AND the pre-claim appear BEFORE thecpStopWithRetryErrcall. This is the regression test for the CR2 #11473 blocking finding.TestSwitchProvider_ConcurrencyGuardAndAudit(UPDATED) — assertion moved from the OLD combined UPDATE to the new PRE-CLAIM, plus a negative pin flagging any future re-introduction ofh.provisionWorkspaceCPdirectly.TestSwitchProvider_StopBeforeProviderWrite(unchanged, still passes).TestSwitchProvider_RejectsBadProvider(unchanged, still passes).TestSwitchProvider_RouteRegistered(unchanged, still passes).TestNoCallSiteCallsDirectProvisionerExceptAuto(the Platform Go red per the RCA tick) — now passes../internal/handlers/green: 19.7s.go vet+go buildclean.PR strategy (per PM guidance)
This is a SUPERSEDING push on a fresh branch (
fix/2422-deadlock-and-routing) offorigin/feat/ws-switch-provider-endpoint. The original PR #2422 on the devops-engineer branch is still open; a new PR on this fix branch will reference #2422 and request the original be closed in favor of the new one once both reviewers approve.Closes #2422 (the two outstanding RCs; CR3 was already addressed).
REQUEST_CHANGES on head
8f046f42.5-axis review: the original #2422 stop-before-claim bug is fixed: the pre-claim happens before cpStopWithRetryErr, and a 0-row pre-claim returns 409 ALREADY_SWITCHING before any stop side effect. Routing the new-box provision through provisionWorkspaceAuto also addresses the direct-provisioner bypass.
Blocking robustness issue: after the pre-claim sets status=provisioning, the handler continues the critical stop/provider-write sequence on c.Request.Context(). If the client disconnects or the HTTP request is canceled after the pre-claim, cpStopWithRetryErr can return ctx.Err and the subsequent provider-write ExecContext(ctx, ...) will fail with the canceled context. That leaves the row claimed as provisioning with the old provider/instance still on the row and no new provision launched; retries then hit ALREADY_SWITCHING. The provider-write error path has the same shape: after pre-claim, it returns 500 without rolling the claim back or otherwise completing/recovering the switch.
Please make the post-claim critical section independent of the client request lifetime (bounded background context is fine) and/or add explicit rollback/fail-state handling so every path after a successful pre-claim either completes the provider write + provisions, or leaves the workspace in a retryable/diagnosable terminal state. Add a regression for cancellation/error after pre-claim so this does not come back.
Security/perf/readability otherwise look acceptable; this is a correctness/robustness blocker introduced by moving the claim before the stop.
APPROVED on head
8f046f42.Independent review focused on the two previously-blocking paths. The switch-provider handler now gates the stop behind the pre-claim CAS: it sets
status='provisioning'while preserving the old provider, checksRowsAffected, and returnsALREADY_SWITCHINGbefore anycpStopWithRetryErrside effect on a losing/racing request. The old-box stop still happens before the provider write, so teardown reads the old provider/instance metadata.The new-box provision now routes through
provisionWorkspaceAutoinstead of callingprovisionWorkspaceCPdirectly. I also simulated the merge with currentmain; the merge is clean and the merged result uses current-mainprovisionWorkspaceAuto, including the #2771 per-workspace provision gate, so create/restart serialization is preserved without a double-lock path from this handler. The source-level direct-provisioner pin and ordering tests cover the regression shape.CI / Platform (Go)andCI / all-requiredare green on this head. Remaining red statuses I saw are review/checklist/non-required lint gates, not code-test failures for this review.SOP ACK: genuine independent 5-axis review complete; no correctness, security, performance, test, or maintainability blockers found.
New commits pushed, approval review dismissed automatically according to repository settings
REQUEST_CHANGES on head
36ec65e7.The #11486 stranding fix is directionally right, but the rollback defer is armed too early and can now clobber a concurrent switch/provision that owns the claim.
Current flow:
online; capturepriorStatus := status.committed == false).provisioning.ALREADY_SWITCHING.UPDATE workspaces SET status = priorStatus WHERE id=$1 AND status='provisioning', changing the other request's active claim fromprovisioningback toonline.That is the same ownership class as the original stop-before-claim bug, just on status instead of Stop: a losing request can mutate state it did not claim. The
status='provisioning'guard is not sufficient because the competing owner is also represented asprovisioning.Please arm rollback only after this request has actually won the pre-claim, e.g. with a
claimed := falseflag set true only after RowsAffected==1, and have the defer no-op unlessclaimed && !committed. A stronger guard such as an owner/switch token would also work, but the minimal fix is to avoid rollback on pre-claim failure/error paths where this request never owned the row.After the pre-claim succeeds, the fresh-context rollback is the right shape for request cancellation, and removing
url=''from the pre-claim is good. But until rollback is claim-scoped, the fix introduces a new race. CI was still pending at review time; this is a correctness/robustness blocker independent of CI.REQUEST_CHANGES on head
36ec65e7.The prior pre-claim/stop ordering and Auto-routing fixes are still present, but the new rollback guard is armed too early.
committed := falseand the rollbackdeferare installed before the pre-claim UPDATE runs. On a normal lost-race path, the request can readpriorStatus='online', another switch wins and sets the row tostatus='provisioning', then this request's pre-claim affects 0 rows and returnsALREADY_SWITCHING. The defer still runs withcommitted=falseand executesUPDATE workspaces SET status=$priorStatus WHERE id=$1 AND status='provisioning', which can clobber the winning switch/provision back to the stale prior status.The
status='provisioning'rollback guard is not sufficient because the winning request's legitimate pre-claim is also exactlystatus='provisioning'. The rollback needs to be armed only after this request actually acquired the pre-claim, or it needs an ownership/generation guard that distinguishes this request's claim from another request's claim. A simple shape isclaimed := false; set it only afterRowsAffected()>0, and make the defer no-op unlessclaimed && !committed.Please add a regression test for the lost-preclaim interleaving: initial lookup observes prior status, another request has already moved the row to provisioning before this request's pre-claim returns 0 rows, and the loser must return 409 without issuing the rollback UPDATE. Until then, the fix can reintroduce the exact "loser mutates winner-owned switch" class, just via rollback instead of stop.
SOP ACK: genuine independent re-review complete. Blocking correctness issue found in the new rollback path.
APPROVED on head
b2353f55.Re-reviewed the ownership-scoped rollback fix against my #11494 blocker. The rollback defer is now armed only after the pre-claim succeeds: the 0-row
ALREADY_SWITCHINGpath returns beforecommitted := false/defer func()are installed, so a losing request can no longer roll back another request's legitimatestatus='provisioning'claim. The newTestSwitchProvider_PreClaimLoserDoesNotArmRollbackpins that ordering.The winner-side rollback from #11486 is still intact: after a successful pre-claim, uncommitted error/cancel paths use a fresh
context.Background()timeout and revert only rows still atstatus='provisioning'. The earlier #11473 behavior is also preserved: pre-claim gates the stop, stop still precedes provider write, and new-box provisioning still routes throughprovisionWorkspaceAuto. I simulated the merge with current main; it merges cleanly and the merged result uses the current #2771-gated Auto dispatcher.CI / Platform (Go),E2E API Smoke Test,Handlers Postgres Integration, andCI / all-requiredare green on this head. Remaining red statuses are review/checklist/ceremony or non-code gates.SOP ACK: genuine independent re-review complete; no correctness, security, performance, test, or maintainability blockers found on the fixed head.
APPROVED on head
b2353f55.5-axis re-review: my prior blockers are resolved. The stop remains gated behind a successful pre-claim, so a losing switch returns 409 before any stop side effect. The commit-or-rollback defer is now armed only after this request wins the pre-claim, so a losing request can no longer roll back another request's active
provisioningclaim. On the winning path, the fresh-context rollback remains in place until provision dispatch commits, so request cancellation or provider-write errors no longer strand the row inprovisioning.Correctness/robustness: the pre-claim no longer mutates
url, provider write still happens after the old-provider stop, and the new-box path goes throughprovisionWorkspaceAuto. Stop-exhaustion audit behavior is preserved. The remaining rollback guard (status='provisioning') is acceptable once rollback is claim-scoped by defer placement.Security/performance/readability: no new external input surface beyond the existing provider allowlist; no global serialization; comments and source-level regression tests pin the ordering. CI/all-required is green on this head.
New commits pushed, approval review dismissed automatically according to repository settings
New commits pushed, approval review dismissed automatically according to repository settings
REQUEST_CHANGES on head
454c8e71.The intended workflow cherry-pick itself looks directionally correct:
.gitea/workflows/e2e-peer-visibility.ymlno longer has top-levelon.paths, it uses the single always-emitting requiredE2E Peer Visibilityjob with per-step path gating, and staging secrets are scoped to non-PR steps. The previously approved provider-switch code is unchanged fromb2353f55and still routes new-box provisioning throughh.provisionWorkspaceAuto(...), so the #2771 per-workspace provision gate integration remains intact.I can’t approve this head because the live required lint state is still red, and the failure is consistent with the stale-base risk called out in the review request. On
454c8e71,lint-required-no-pathsis green, butlint-continue-on-error-trackingfails with 30 expired mc#1982continue-on-errortracker violations across workflow files, andlint-required-context-exists-in-bpfails onlint-shellcheck-arm64-pilot.ymlmissingbp-requireddirectives for its PR/push emissions.CI / Platform (Go)was still pending when checked. These are not the qa-review/gate-check-v3 ceremony reds.Fix shape: rebase/sync the branch onto current main (or otherwise carry the current workflow tree) so the PR contains only the provider-switch changes plus the clean e2e-peer-visibility required-emitter update, then rerun the required lint/Platform lanes. Once those are green, my prior code approval should still hold.
REQUEST_CHANGES on head
454c8e71.The narrow e2e-peer-visibility fix is correct:
.gitea/workflows/e2e-peer-visibility.ymlis byte-identical to currentorigin/main, has no top-levelon.paths, preserves the single always-emittingE2E Peer Visibilityrequired-check job, keeps path scoping insidedetect-changes/per-step gates, and keeps staging secrets bound only on non-PR steps. The switch-provider code is also unchanged from my prior approved headb2353f55: the loser pre-claim returns 409 before stop/defer, the rollback is armed only after this request wins the claim and uses a fresh context, and new provisioning still routes throughprovisionWorkspaceAuto. I do not see a fresh code-level regression in the provider-switch path.I cannot approve this head because the stale-base risk is now concrete in the live status set.
CI / all-requiredandlint-required-no-pathsare green, but the head status is still failure with fresh red lint/governance contexts:lint-continue-on-error-trackingandlint-required-context-exists-in-bpare failing on454c8e71, alongside gate/review ceremony reds. That matches the 689-commits-behind concern: cherry-picking only this workflow leaves the branch carrying stale workflow/lint state elsewhere.Fix: rebase/sync the branch onto current main (or otherwise carry the current workflow tree) so the head has the provider-switch changes plus the clean current CI workflow set, then rerun the required lint/governance lanes. Once those are green, my prior provider-switch approval should still hold.
454c8e7138to6e424ca36bAPPROVED on head
6e424ca3.Re-confirmed the code delta after the full rebase.
workspace_switch_provider.goandworkspace_switch_provider_test.goare byte-identical to the previously approved head; against current main the router delta is only the expectedPOST /switch-providerregistration. The prior SwitchProvider review still holds.Verified the stale workflow branch issue is gone: the PR diff no longer carries the old e2e-peer-visibility workflow hunk. Live status on
6e424ca3showslint-required-no-pathsgreen,CI / Platform (Go)green,E2E API Smoke Testgreen, andCI / all-requiredgreen. The runtime checks actually executed (Platform/API Smoke each ran ~2.5 min; all-required completed after them), so this is not a no-op/false-green case.Remaining red statuses are governance/review ceremony (
qa-review,security-review,gate-check-v3, SOP checklist), not code/required-lint failures.