test(handlers): drain preflight restart goroutine #780

Merged
hongming merged 1 commit from fix/core-main-red-race-20260512 into main 2026-05-13 04:30:08 +00:00

Summary

Fixes the molecule-core/main Platform (Go) race failure seen on merge SHA ddba57e3.

The failing runner log showed TestPreflight_TransientError_FailsSoftAsAlive racing with setupTestDB() because the previous preflight test spawned go h.RestartByID(...) and returned before that goroutine finished reading the package-global db.DB.

This patch makes the container-not-running preflight test wait until the restart goroutine has drained before the next test can replace db.DB.

Verification

  • go test -race -v -timeout 60s ./internal/handlers/... -run 'TestPreflight_(ContainerNotRunning_StructuredFastFail|TransientError_FailsSoftAsAlive)' -count=50
  • go test -race -v -timeout 60s ./internal/handlers/... -count=3
  • go test -race -coverprofile=coverage.out ./...

Gate context

PR #699 merged at 2026-05-13T03:43:06Z, before the org-wide aggregate-status branch protection update was applied at about 2026-05-13T03:50Z. Its head already carried failing or pending QA/security/SOP statuses, which would be blocked under the current branch protection rule.

SOP Checklist

Comprehensive testing performed: race-focused handler preflight tests passed 50 consecutive runs; the full handlers race suite passed 3 consecutive runs; full workspace-server package tests passed with the race detector and coverage profile.

Local-postgres E2E run: N/A. This PR changes a unit-test synchronization helper only; it does not change application database behavior, migrations, or runtime SQL paths.

Staging-smoke verified or pending: N/A. This is a Go test-only repair for CI race stability and does not change deployed staging behavior.

Root-cause not symptom: the failing test returned while an async RestartByID goroutine was still reading global database state, allowing the next test to replace db.DB concurrently.

Five-Axis review walked: correctness checked against the race log and restart-state lifecycle; readability kept to one local helper; architecture avoids production code changes; security impact is none; performance impact is test-only polling bounded to 2 seconds.

No backwards-compat shim / dead code added: yes. The patch adds only a test helper used by the affected test and does not introduce runtime compatibility code.

Memory/saved-feedback consulted: consulted saved guidance on Gitea status gating, path-filtered workflow caveats, and evidence-first hypothesis discipline before interpreting the main-red/root-gate failure.

hongming-codex-laptop added 1 commit 2026-05-13 04:08:16 +00:00
test(handlers): drain preflight restart goroutine
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 9s
CI / Detect changes (pull_request) Successful in 20s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 11s
E2E API Smoke Test / detect-changes (pull_request) Successful in 27s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 28s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 29s
qa-review / approved (pull_request) Failing after 14s
security-review / approved (pull_request) Failing after 12s
Harness Replays / Harness Replays (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 29s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m9s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m23s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
sop-tier-check / tier-check (pull_request) Successful in 11s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3m9s
sop-checklist-gate / gate (pull_request) Successful in 11s
gate-check-v3 / gate-check (pull_request) Successful in 13s
CI / Platform (Go) (pull_request) Successful in 5m11s
CI / all-required (pull_request) Successful in 1s
audit-force-merge / audit (pull_request) Successful in 22s
381c710f8a
core-qa approved these changes 2026-05-13 04:27:36 +00:00
core-qa left a comment
Member

Five-Axis Code Review — PR #780 test(handlers): drain preflight restart goroutine

Verdict: APPROVED

Reviewed at head SHA 381c710f8ae9beb194ff29c3e1cdd625e9484d55. Pure test-file change (+31/-2) in workspace-server/internal/handlers/a2a_proxy_preflight_test.go. CI / all-required and CI / Platform (Go) are both success.


Axis 1 — Correctness

The race root cause is correctly identified and fixed. preflightContainerHealth fires go h.RestartByID(workspaceID) before returning; RestartByID calls coalesceRestart which reads and writes the package-level restartStates sync.Map. If the next test calls setupTestDB(t) and replaces the package-level db.DB before that goroutine exits its runRestartCycle loop, the goroutine races on db.DB.

The fix polls restartStates.Load(wsID) → checks st.running under the existing st.mu lock → waits until running == false. This is the correct synchronisation point: coalesceRestart holds st.mu and sets running = false in its defer block, which is the last write before the goroutine exits. The poll correctly observes that transition.

The !sawState branch correctly catches a false-pass (goroutine never started), which would silently mask the race instead of fixing it.

resetRestartStatesFor(wsID) is called before setupTestDB (correct: clears any stale state from a prior run) and again inside waitRestartByIDGoroutineIdle after the drain confirms running == false (correct: prevents the next test from seeing leftover state from the idle entry).

No coverage gap: only one goroutine is launched by preflightContainerHealth — the single go h.RestartByID(workspaceID) call. The helper waits for exactly that one. The coalesceRestart pending-loop can spawn additional cycles internally, but those run synchronously within the same goroutine (the inner for loop calls cycle() inline), so the single drain is sufficient.

Edge case handled: if coalesceRestart returns immediately because !h.HasProvisioner() (test stub has no provisioner wired), the goroutine exits before restartStates is populated. The !sawState timeout path handles this, but it would produce a t.Fatalf; that is acceptable because a test stub that suppresses the goroutine entry is a misconfiguration of the test, not a latent race (the goroutine never ran, so no cleanup is needed). The test intentionally wires no provisioner, yet it does expect the goroutine to start, which is validated by the existing sqlmock expectations passing.

Correctness: PASS


Axis 2 — Readability

The helper waitRestartByIDGoroutineIdle is self-contained, well-named, and carries t.Helper() so failure lines attribute to the call site. The two failure modes (goroutine never started vs. goroutine didn't drain in time) produce distinct t.Fatalf messages. The 1 ms sleep is the minimal busy-wait granularity appropriate for a test poll. The 2-second deadline matches the PR description's stated -timeout 60s test budget.

The const wsID extraction is a readability improvement over the inline string literal — it also ensures the resetRestartStatesFor pre-call and the WithArgs expectation use the exact same value.

Readability: PASS


Axis 3 — Architecture

The drain pattern is consistent with the existing codebase. workspace_restart_coalesce_test.go already defines drainCoalesceGoroutine (a WaitGroup-based variant used when tests spawn coalesceRestart directly). This PR does not use drainCoalesceGoroutine because the goroutine is spawned inside the production code path (go h.RestartByID(...) inside preflightContainerHealth), not by the test. The poll-on-restartState.running approach is appropriate here: it drains the production goroutine rather than replacing it with a test-controlled one, which preserves full fidelity.

resetRestartStatesFor is already defined in workspace_restart_coalesce_test.go and is visible to this test file because both files are in package handlers. No duplicate definition.

Architecture: PASS


Axis 4 — Security

Test-file only. No production code changes, no new secrets, no new HTTP surface. No concerns.

Security: PASS (N/A)


Axis 5 — Performance

The drain adds at most 2 seconds to TestPreflight_ContainerNotRunning_StructuredFastFail in the degenerate case where the restart goroutine is very slow. In the normal case (test stub provisioner, no real Docker call), RestartByID → runRestartCycle completes within milliseconds and the poll exits on its first iteration. The 1 ms sleep cadence produces negligible CPU overhead. The 2-second timeout is well within the test suite's 60-second budget and is the same order of magnitude used by other drain helpers in the codebase.

Performance: PASS


Summary: all five axes pass. The fix is minimal, correctly targeted, and consistent with the established drain patterns in this package. Approving.

hongming merged commit 03e7a2d8a5 into main 2026-05-13 04:30:08 +00:00
hongming deleted branch fix/core-main-red-race-20260512 2026-05-13 04:30:11 +00:00
core-qa reviewed 2026-05-13 04:37:53 +00:00
core-qa left a comment
Member

[core-qa-agent] Needs rebase — PR is based on staging SHA 7ad26f4a which is 5203 commits behind current staging HEAD. Diff vs current staging: 316 files changed, 39913 insertions(+), 11668 deletions(-). Please rebase on current staging before requesting QA review.
