fix(test): drain coalesceRestart goroutines before t.Cleanup (Class H, #170) #39

Merged
claude-ceo-assistant merged 1 commits from fix/170-goroutine-bleed-test-isolation into main 2026-05-07 20:27:10 +00:00

Closes Task #170 (Class H — Go test goroutine bleed across t.Run boundaries).

Summary

  • Add drainCoalesceGoroutine(t, wsID, cycle) test helper that spawns coalesceRestart on a goroutine matching the production callsite shape (go h.RestartByID(...)) and registers a t.Cleanup with sync.WaitGroup.Wait so the test cannot return while a goroutine is still draining.
  • Convert TestCoalesceRestart_PanicInCycleClearsState to use the helper. Previously it called coalesceRestart synchronously, which never exercised the production goroutine-survival contract.
  • Add TestCoalesceRestart_DrainHelperWaitsForGoroutineExit as the deterministic regression guard for the Class H bleed.

Why the bleed mattered

With the synchronous test, a future refactor that swapped to go coalesceRestart(...) would silently let a panicking goroutine outlive the test. In production-shape tests (sqlmock-using neighbours), the goroutine's runRestartCycle calls LogActivity to write kind=DELEGATION_FAILED / kind=WORKSPACE_PROVISION_FAILED rows. After mockDB.Close() runs in t.Cleanup, that INSERT lands in the next test's sqlmock connection as INSERT-not-expected — surface-symptom that the brief flagged on TestPooledWithEICTunnel_PreservesFnErr.

Regression-guard contract

TestCoalesceRestart_DrainHelperWaitsForGoroutineExit:

  • Cycle blocks for 150ms inside a t.Run then panics.
  • After t.Run returns, asserts elapsed >= 150ms (proves the Wait barrier engaged).
  • Asserts the cycle's deferred close(exited) ran (proves the panic-recovery defer chain ran on the goroutine).
  • Asserts state.running was cleared (sticky-running guard).

Mutation-tested: removing t.Cleanup(wg.Wait) makes the test fail deterministically with elapsed < 300µs.

Per saved-memory feedback_assert_exact_not_substring: regression test asserts an exact-shape contract (elapsed >= blockFor), not a substring-in-output, so it discriminates between "drain works" and "drain skipped".

Test plan

  • go test -race -run "TestCoalesceRestart" -count=10 ./workspace-server/internal/handlers/... — 10/10 pass
  • go test ./workspace-server/internal/handlers/... — full handlers suite green
  • go vet ./workspace-server/internal/handlers/... — clean
  • Mutation test confirms regression guard is real (fails when t.Cleanup(wg.Wait) removed)

Coordination

Scope is workspace_restart_coalesce_test.go only — no overlap with sister agents on #173 (workspace-server/Dockerfile.tenant + publish workflow) or #168 (Class G post-suspension URL audit).

🤖 Generated with Claude Code

Closes Task #170 (Class H — Go test goroutine bleed across t.Run boundaries). ## Summary - Add `drainCoalesceGoroutine(t, wsID, cycle)` test helper that spawns `coalesceRestart` on a goroutine matching the production callsite shape (`go h.RestartByID(...)`) and registers a `t.Cleanup` with `sync.WaitGroup.Wait` so the test cannot return while a goroutine is still draining. - Convert `TestCoalesceRestart_PanicInCycleClearsState` to use the helper. Previously it called `coalesceRestart` synchronously, which never exercised the production goroutine-survival contract. - Add `TestCoalesceRestart_DrainHelperWaitsForGoroutineExit` as the deterministic regression guard for the Class H bleed. ## Why the bleed mattered With the synchronous test, a future refactor that swapped to `go coalesceRestart(...)` would silently let a panicking goroutine outlive the test. In production-shape tests (sqlmock-using neighbours), the goroutine's `runRestartCycle` calls `LogActivity` to write `kind=DELEGATION_FAILED` / `kind=WORKSPACE_PROVISION_FAILED` rows. After `mockDB.Close()` runs in `t.Cleanup`, that INSERT lands in the next test's sqlmock connection as `INSERT-not-expected` — surface-symptom that the brief flagged on `TestPooledWithEICTunnel_PreservesFnErr`. ## Regression-guard contract `TestCoalesceRestart_DrainHelperWaitsForGoroutineExit`: - Cycle blocks for 150ms inside a `t.Run` then panics. - After `t.Run` returns, asserts elapsed >= 150ms (proves the Wait barrier engaged). - Asserts the cycle's deferred `close(exited)` ran (proves the panic-recovery defer chain ran on the goroutine). - Asserts `state.running` was cleared (sticky-running guard). Mutation-tested: removing `t.Cleanup(wg.Wait)` makes the test fail deterministically with elapsed < 300µs. Per saved-memory `feedback_assert_exact_not_substring`: regression test asserts an exact-shape contract (elapsed >= blockFor), not a substring-in-output, so it discriminates between "drain works" and "drain skipped". ## Test plan - [x] `go test -race -run "TestCoalesceRestart" -count=10 ./workspace-server/internal/handlers/...` — 10/10 pass - [x] `go test ./workspace-server/internal/handlers/...` — full handlers suite green - [x] `go vet ./workspace-server/internal/handlers/...` — clean - [x] Mutation test confirms regression guard is real (fails when `t.Cleanup(wg.Wait)` removed) ## Coordination Scope is `workspace_restart_coalesce_test.go` only — no overlap with sister agents on #173 (`workspace-server/Dockerfile.tenant` + publish workflow) or #168 (Class G post-suspension URL audit). 🤖 Generated with [Claude Code](https://claude.com/claude-code)
claude-ceo-assistant added 1 commit 2026-05-07 20:05:25 +00:00
fix(test): drain coalesceRestart goroutines before t.Cleanup (Class H, #170)
Some checks failed
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 13s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 14s
CI / Detect changes (pull_request) Successful in 19s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 14s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 16s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 12s
Harness Replays / detect-changes (pull_request) Successful in 17s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 15s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 16s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 51s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 1m47s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 2m8s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 2m9s
CI / Canvas (Next.js) (pull_request) Successful in 11s
CI / Python Lint & Test (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 23s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 8s
Harness Replays / Harness Replays (pull_request) Failing after 1m18s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 4m15s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5m7s
CI / Platform (Go) (pull_request) Successful in 13m16s
694c05552b
TestPooledWithEICTunnel_PreservesFnErr (and any sqlmock-using neighbour
test) was at risk of inheriting stale INSERT calls from a previous
test's coalesceRestart goroutine that survived its t.Cleanup boundary.

The production callsite shape is `go h.RestartByID(...)` from
a2a_proxy.go, a2a_proxy_helpers.go and main.go. When that goroutine's
runRestartCycle panics, coalesceRestart's deferred recover swallows it
to keep the platform process alive — but in tests, nothing waits for
the goroutine to fully exit. If it's still draining LogActivity-shaped
work after the test returns, those INSERTs land in the next test's
sqlmock connection as kind=DELEGATION_FAILED /
kind=WORKSPACE_PROVISION_FAILED, surfacing as "INSERT-not-expected".

Fix: introduce drainCoalesceGoroutine(t, wsID, cycle) test helper that
spawns coalesceRestart on a goroutine (matching production) and
registers a t.Cleanup with sync.WaitGroup.Wait so the test can't
declare itself done while a goroutine is still alive.

Convert TestCoalesceRestart_PanicInCycleClearsState to use the helper
(previously it called coalesceRestart synchronously, which never
exercised the production goroutine-survival contract).

Add TestCoalesceRestart_DrainHelperWaitsForGoroutineExit as the
regression guard: cycle blocks 150ms then panics; the test asserts
t.Run elapsed >= 150ms (proving the Wait barrier engaged) AND the
deferred close ran (proving the panic-recovery defer chain executed)
AND state.running was cleared. Verified the assertion is real by
mutation-testing: removing t.Cleanup(wg.Wait) makes this test FAIL
deterministically with elapsed <300µs.

Per saved memory feedback_assert_exact_not_substring: the regression
test asserts an exact-shape contract (elapsed >= blockFor) rather than
a substring-in-output, so it discriminates between "drain works" and
"drain skipped".

Per Phase 3: 10/10 race-detector runs pass for all TestCoalesceRestart_*
tests. Full ./internal/handlers/... suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-ceo-assistant merged commit c1e32ff4a7 into main 2026-05-07 20:27:10 +00:00
Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#39
No description provided.