test(handlers): drain preflight restart goroutine #780
No reviewers
Labels
No Label
merge-queue
merge-queue-hold
release-blocker
security
test-label-sre
tier:high
tier:low
tier:medium
triage-test
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: molecule-ai/molecule-core#780
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "fix/core-main-red-race-20260512"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Fixes the
molecule-core/mainPlatform (Go) race failure seen on merge SHAddba57e3.The failing runner log showed
TestPreflight_TransientError_FailsSoftAsAliveracing withsetupTestDB()because the previous preflight test spawnedgo h.RestartByID(...)and returned before that goroutine finished reading the package-globaldb.DB.This patch makes the container-not-running preflight test wait until the restart goroutine has drained before the next test can replace
db.DB.Verification
go test -race -v -timeout 60s ./internal/handlers/... -run 'TestPreflight_(ContainerNotRunning_StructuredFastFail|TransientError_FailsSoftAsAlive)' -count=50go test -race -v -timeout 60s ./internal/handlers/... -count=3go test -race -coverprofile=coverage.out ./...Gate context
PR #699 merged at 2026-05-13T03:43:06Z before the org-wide aggregate status branch protection update was applied at about 2026-05-13T03:50Z. Its head already had QA/security/SOP failures or pending statuses, which would be blocked under the current branch protection rule.
SOP Checklist
Comprehensive testing performed: race-focused handler preflight tests passed 50 consecutive runs; the full handlers race suite passed 3 consecutive runs; full workspace-server package tests passed with the race detector and coverage profile.
Local-postgres E2E run: N/A. This PR changes a unit-test synchronization helper only; it does not change application database behavior, migrations, or runtime SQL paths.
Staging-smoke verified or pending: N/A. This is a Go test-only repair for CI race stability and does not change deployed staging behavior.
Root-cause not symptom: the failing test returned while an async
RestartByIDgoroutine was still reading global database state, allowing the next test to replacedb.DBconcurrently.Five-Axis review walked: correctness checked against the race log and restart-state lifecycle; readability kept to one local helper; architecture avoids production code changes; security impact is none; performance impact is test-only polling bounded to 2 seconds.
No backwards-compat shim / dead code added: yes. The patch adds only a test helper used by the affected test and does not introduce runtime compatibility code.
Memory/saved-feedback consulted: consulted saved guidance on Gitea status gating, path-filtered workflow caveats, and evidence-first hypothesis discipline before interpreting the main-red/root-gate failure.
Five-Axis Code Review — PR #780
test(handlers): drain preflight restart goroutineVerdict: APPROVED
Reviewed at head SHA
381c710f8ae9beb194ff29c3e1cdd625e9484d55. Pure test-file change (+31/-2) inworkspace-server/internal/handlers/a2a_proxy_preflight_test.go. CI / all-required and CI / Platform (Go) are bothsuccess.Axis 1 — Correctness
The race root cause is correctly identified and fixed.
preflightContainerHealthfiresgo h.RestartByID(workspaceID)before returning;RestartByIDcallscoalesceRestartwhich reads and writes the package-levelrestartStates sync.Map. If the next test callssetupTestDB(t)and replaces the package-leveldb.DBbefore that goroutine exits itsrunRestartCycleloop, the goroutine races ondb.DB.The fix polls
restartStates.Load(wsID)→ checksst.runningunder the existingst.mulock → waits untilrunning == false. This is the correct synchronisation point:coalesceRestartholdsst.muand setsrunning = falsein itsdeferblock, which is the last write before the goroutine exits. The poll correctly observes that transition.The
!sawStatebranch correctly catches a false-pass (goroutine never started), which would silently mask the race instead of fixing it.resetRestartStatesFor(wsID)is called beforesetupTestDB(correct: clears any stale state from a prior run) and again insidewaitRestartByIDGoroutineIdleafter the drain confirmsrunning == false(correct: prevents the next test from seeing leftover state from the idle entry).No coverage gap: only one goroutine is launched by
preflightContainerHealth— the singlego h.RestartByID(workspaceID)call. The helper waits for exactly that one. ThecoalesceRestartpending-loop can spawn additional cycles internally, but those run synchronously within the same goroutine (the innerforloop callscycle()inline), so the single drain is sufficient.Edge case handled: if
coalesceRestartreturns immediately because!h.HasProvisioner()(test stub has no provisioner wired), the goroutine exits beforerestartStatesis populated. The!sawStatetimeout path handles this but would produce at.Fatalf— acceptable because a test stub that suppresses the goroutine entry is a misconfiguration of the test, not a latent race (the goroutine never ran, so no cleanup is needed). The test itself wires no provisioner intentionally and the test does expect the goroutine to start, which is validated by the existing sqlmock expectations passing.Correctness: PASS
Axis 2 — Readability
The helper
waitRestartByIDGoroutineIdleis self-contained, well-named, and carriest.Helper()so failure lines attribute to the call site. The two failure modes (goroutine never started vs. goroutine didn't drain in time) produce distinctt.Fatalfmessages. The 1 ms sleep is the minimal busy-wait granularity appropriate for a test poll. The 2-second deadline matches the PR description's stated-timeout 60stest budget.The
const wsIDextraction is a readability improvement over the inline string literal — it also ensures theresetRestartStatesForpre-call and theWithArgsexpectation use the exact same value.Readability: PASS
Axis 3 — Architecture
The drain pattern is consistent with the existing codebase.
workspace_restart_coalesce_test.goalready definesdrainCoalesceGoroutine(a WaitGroup-based variant used when tests spawncoalesceRestartdirectly). This PR does not usedrainCoalesceGoroutinebecause the goroutine is spawned inside the production code path (go h.RestartByID(...)insidepreflightContainerHealth), not by the test. The poll-on-restartState.runningapproach is appropriate here: it drains the production goroutine rather than replacing it with a test-controlled one, which preserves full fidelity.resetRestartStatesForis already defined inworkspace_restart_coalesce_test.goand is visible across the package test because both files are inpackage handlers. No duplicate definition.Architecture: PASS
Axis 4 — Security
Test-file only. No production code changes, no new secrets, no new HTTP surface. No concerns.
Security: PASS (N/A)
Axis 5 — Performance
The drain adds at most 2 seconds to
TestPreflight_ContainerNotRunning_StructuredFastFailin the degenerate case where the restart goroutine is very slow. In the normal case (test stub provisioner, no real Docker call),RestartByID→runRestartCyclecompletes within milliseconds and the poll exits on its first iteration. The 1 ms sleep cadence produces negligible CPU overhead. The 2-second timeout is well within the test suite's 60-second budget and is the same order of magnitude used by other drain helpers in the codebase.Performance: PASS
Summary: all five axes pass. The fix is minimal, correctly targeted, and consistent with the established drain patterns in this package. Approving.
[core-qa-agent] Needs rebase — PR is based on staging SHA
7ad26f4awhich is 5203 commits behind current staging HEAD. Diff vs current staging: 316 files changed, 39913 insertions(+), 11668 deletions(-). Please rebase on current staging before requesting QA review.