fix(test): TestBootstrapFailed rescue-dispatch orphan-goroutine race (#2490) #2554
Reference in New Issue
Block a user
Delete Branch "fix/core-2490-bootstrapfailed-rescue-race"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
CR-A diagnosed a cp-wide Platform-Go flake on the bootstrap-failed test path. Root cause:
BootstrapFailed(workspace_bootstrap.go:104) callscaptureRescueBundlewhich dispatches the rescue capture viarescueDispatch = func(fn func()) { go fn() }(rescue_wiring.go:129). The dispatched goroutine readsdb.DB(viarescuestore.NewPostgres(db.DB).Persist(ctx, b)), and the test'ssetupTestDBt.Cleanup swapsdb.DBback to prevDB — but the existingdrainTestAsynconly tracks handler-internalgoAsyncgoroutines (vianewHandlerHook+liveTestHandlers), NOT therescueDispatchgoroutines. So the rescue goroutine outlives the test and races thedb.DBwrite — the 0x...d548 data race onplatform/internal/db.DBdocumented at handlers_test.go:32.Fix (test-only, no production touch)
Add
rescueTestHarness(t, instanceID)to the two affected tests (TestBootstrapFailed_HappyPathandTestBootstrapFailed_TruncatesOversizedLogTail) — both reach the rescue-dispatch path (affected==1 → callcaptureRescueBundle).The existing
rescueTestHarnesshelper (rescue_wiring_test.go:25) makesrescueDispatchsynchronous AND stubsrescue.RunRemote+rescue.Redactso no real EIC/SSH fires; the test asserts the HTTP response + sqlmock expectations, not the rescue side effect.Tests that DO NOT need the harness (verified by tracing the handler)
TestBootstrapFailed_AlreadyTransitioned: returns at workspace_bootstrap.go:87 (affected==0, no_change) BEFOREcaptureRescueBundleat line 104.TestBootstrapFailed_EmptyIDIs400: returns at line 39 (empty id) BEFORE the DB call.TestConsole_ReturnsNotImplementedWhenNoCP:Console()doesn't callcaptureRescueBundle.Branch hygiene (lesson from cp#709 bundling RC)
Branched off clean
origin/main(NOT off another feature branch). Branch:fix/core-2490-bootstrapfailed-rescue-race@717d1c06. File diff:workspace-server/internal/handlers/workspace_bootstrap_test.goonly (+18/-0).Verifier path
should pass cleanly post-merge. (No local go toolchain in this container; CI is source of truth.)
Sibling fix
fix/cp-709-bootstrapwatcher-test-race(PR #709 in molecule-controlplane) fixed a related but distinct race in the provisioner-pkgBootstrapWatcher(package-level timing-var mutation). This PR is the handlers-pkg counterpart — different file, different race, same theme (orphan test-goroutine racing package-level state in t.Cleanup).Co-Authored-By: Claude Fable 5 noreply@anthropic.com
CR-A diagnosed a cp-wide Platform-Go flake on the bootstrap-failed test path. Root cause: `BootstrapFailed` (workspace_bootstrap.go:104) calls `captureRescueBundle` which dispatches the rescue capture via `rescueDispatch = func(fn func()) { go fn() }` (rescue_wiring.go:129). The dispatched goroutine reads `db.DB` (via `rescuestore.NewPostgres(db.DB).Persist(ctx, b)`), and the test's `setupTestDB` t.Cleanup swaps `db.DB` back to prevDB — but the existing `drainTestAsync` only tracks handler-internal `goAsync` goroutines (via `newHandlerHook` + `liveTestHandlers`), NOT the `rescueDispatch` goroutines. So the rescue goroutine outlives the test and races the `db.DB` write — the 0x...d548 data race on platform/internal/db.DB documented at handlers_test.go:32. Fix (test-only, no production touch): Add `rescueTestHarness(t, instanceID)` to the two affected tests (`TestBootstrapFailed_HappyPath` and `TestBootstrapFailed_TruncatesOversizedLogTail`) — both reach the rescue-dispatch path (affected==1 → call `captureRescueBundle`). The existing `rescueTestHarness` helper (rescue_wiring_test.go:25) makes `rescueDispatch` synchronous AND stubs `rescue.RunRemote` + `rescue.Redact` so no real EIC/SSH fires; the test asserts the HTTP response + sqlmock expectations, not the rescue side effect. Tests that DO NOT need the harness (verified by tracing the handler): - `TestBootstrapFailed_AlreadyTransitioned`: returns at workspace_bootstrap.go:87 (affected==0, no_change) BEFORE captureRescueBundle at line 104. - `TestBootstrapFailed_EmptyIDIs400`: returns at line 39 (empty id) BEFORE the DB call. - `TestConsole_ReturnsNotImplementedWhenNoCP`: Console() doesn't call captureRescueBundle. Branched off clean `origin/main` (lesson from the cp#709 bundling RC: never branch off another feature branch). Branch: `fix/core-2490-bootstrapfailed-rescue-race`. File diff: `workspace-server/internal/handlers/workspace_bootstrap_test.go` only (test-only, +16/-0). Verifier path: `go test -race -count=50 ./workspace-server/internal/handlers/ -run BootstrapFailed` should pass cleanly post-merge. (No local go toolchain in this container; CI is source of truth.) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>Security 5-axis — APPROVE. Test-only race-guard (TestBootstrapFailed rescue-dispatch orphan).
Diff = ONLY workspace_bootstrap_test.go: adds
rescueTestHarness(t, …)to TestBootstrapFailed_HappyPath + _TruncatesOversizedLogTail. It makes the rescue-dispatch (rescueDispatch = go fn()) synchronous in-test and stubs remote/redact, so the rescue goroutine no longer outlives the test and races setupTestDB's t.Cleanup db.DB swap — closes the #2490 db.DB race (the 0x…d548 diagnosis) for these two tests.Shellcheck (E2E scripts)is pending but non-required + path-irrelevant (no shell files changed).No security issues. APPROVE → 2-distinct with CR-B qa → merge.
qa 2nd-lane (5-axis, full diff read) — APPROVE. Correctly closes the exact data race I root-caused in the #2490 Platform-Go CI-log diagnosis.
rescueDispatch = go fn()(async); that rescue goroutine reads db.DB and OUTLIVES the test, while setupTestDB's t.Cleanup swaps db.DB back to prevDB -> concurrent read/write -> -race flags it (my diagnosis: setupTestDB cleanup vs the async rescue goroutine reading a package global). FIX: wrap the two affected tests (the affected==1 -> captureRescueBundle path: "i-crashed" + "i-spammy") in rescueTestHarness, which makes rescueDispatch SYNCHRONOUS so the goroutine completes WITHIN the test body (before t.Cleanup) -> no race; and stubs the remote/redact so no real EIC/SSH fires. This is the correct fix shape: it's a TEST-harness fix (production rescueDispatch stays asyncgo fn()), because the race was a test-only artifact (the t.Cleanup db.DB swap racing the test-spawned goroutine; production has no such cleanup). Both affected tests guarded.2-distinct now (agent-researcher 10609 + agent-reviewer). REQUIRED gates GREEN (advisory-red excluded). Proceeding to merge-probe — closes the CI-stability race the #2490 diagnosis surfaced. +1.