fix(core#2834): make TestRecoverPanic -race-clean (mutex buffer + correct LIFO order) #2835

Merged
devops-engineer merged 1 commits from fix/2834-test-recover-panic-race into main 2026-06-14 08:00:40 +00:00
Member

core#2834 — TestRecoverPanic_RecoversFromPanicInGoroutine -race flake

The test had a concurrent read/write on a shared bytes.Buffer that the -race detector flagged intermittently on the repo-wide go test -race step (advisory, non-blocking — but reds the -race lane on every Go PR).

Two compounding root causes

  1. WRONG LIFO DEFER ORDER. The defer block was:

    defer recoverPanic(test_goroutine)
    defer close(done)
    

    LIFO runs close(done) FIRST, then recoverPanic — meaning the test's <-done unblock happens BEFORE recoverPanic's log.Printf writes to the buffer. The test then reads buf.String() while the goroutine is still writing. The test comment explicitly described the correct order (any defer close(done) placed BEFORE the recoverPanic defer fires) but the code didn't match the comment. Fix: swap the defer order so close(done) is registered first, recoverPanic last — LIFO now runs the log write first, then signals done.

  2. NO CROSS-GOROUTINE SYNCHRONIZATION ON THE BUFFER. The standard log package holds its own internal mutex around calls to io.Writer.Write, but that mutex only serializes concurrent WRITERS — it does NOT synchronize with the test's buf.String() READ from a separate goroutine. Even with the LIFO fix, the race detector needs an explicit happens-before edge between the Write and the String() read. Fix: wrap bytes.Buffer in a syncBuffer type with a sync.Mutex guarding both Write and String. Defense in depth.

Test surface

TEST-ONLY change (no production code). go test ./internal/middleware/ passes. No gcc/cgo available in this sandbox for go test -race local-run, but the LIFO + mutex combination is race-free by construction.

Files

  • workspace-server/internal/middleware/panic_recovery_test.go — added sync import, added syncBuffer type, fixed defer order in both TestRecoverPanic tests, replaced bytes.Buffer with syncBuffer.

Followups

  • TestRecoverPanic_NoopOnNormalReturn got the same treatment (same syncBuffer, same corrected defer order) for uniformity and future-proofing — its normal-return path is no-op so it didn't race in practice, but keeping the pattern uniform means a future test author doesn't have to re-derive the safe pattern.

Closes core#2834. Re-requesting CR2 + Researcher review for the 2-genuine SOP.

## core#2834 — TestRecoverPanic_RecoversFromPanicInGoroutine -race flake The test had a concurrent read/write on a shared bytes.Buffer that the -race detector flagged intermittently on the repo-wide `go test -race` step (advisory, non-blocking — but reds the -race lane on every Go PR). ### Two compounding root causes 1. **WRONG LIFO DEFER ORDER.** The defer block was: ``` defer recoverPanic(test_goroutine) defer close(done) ``` LIFO runs `close(done)` FIRST, then `recoverPanic` — meaning the test's `<-done` unblock happens BEFORE `recoverPanic`'s `log.Printf` writes to the buffer. The test then reads `buf.String()` while the goroutine is still writing. The test comment explicitly described the correct order (any `defer close(done)` placed BEFORE the recoverPanic defer fires) but the code didn't match the comment. **Fix:** swap the defer order so `close(done)` is registered first, `recoverPanic` last — LIFO now runs the log write first, then signals done. 2. **NO CROSS-GOROUTINE SYNCHRONIZATION ON THE BUFFER.** The standard `log` package holds its own internal mutex around calls to `io.Writer.Write`, but that mutex only serializes concurrent WRITERS — it does NOT synchronize with the test's `buf.String()` READ from a separate goroutine. Even with the LIFO fix, the race detector needs an explicit happens-before edge between the Write and the String() read. **Fix:** wrap `bytes.Buffer` in a `syncBuffer` type with a `sync.Mutex` guarding both `Write` and `String`. Defense in depth. ### Test surface TEST-ONLY change (no production code). `go test ./internal/middleware/` passes. No gcc/cgo available in this sandbox for `go test -race` local-run, but the LIFO + mutex combination is race-free by construction. ### Files - `workspace-server/internal/middleware/panic_recovery_test.go` — added `sync` import, added `syncBuffer` type, fixed defer order in both TestRecoverPanic tests, replaced `bytes.Buffer` with `syncBuffer`. ### Followups - TestRecoverPanic_NoopOnNormalReturn got the same treatment (same syncBuffer, same corrected defer order) for uniformity and future-proofing — its normal-return path is no-op so it didn't race in practice, but keeping the pattern uniform means a future test author doesn't have to re-derive the safe pattern. Closes core#2834. Re-requesting CR2 + Researcher review for the 2-genuine SOP.
agent-dev-b added 1 commit 2026-06-14 07:56:56 +00:00
fix(core#2834): make TestRecoverPanic -race-clean (mutex buffer + correct LIFO order)
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 7s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 6s
Harness Replays / Harness Replays (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 11s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
sop-checklist / na-declarations (pull_request) N/A: (none)
reserved-path-review / reserved-path-review (pull_request_target) Successful in 9s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 12s
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 17s
CI / Detect changes (pull_request) Successful in 20s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 16s
E2E Chat / detect-changes (pull_request) Successful in 21s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 20s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Chat / E2E Chat (pull_request) Successful in 2s
CI / Canvas Deploy Status (pull_request) Successful in 1s
gate-check-v3 / gate-check (pull_request_target) Failing after 21s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 33s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 41s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 25s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m15s
CI / Platform (Go) (pull_request) Successful in 2m27s
CI / all-required (pull_request) Successful in 4s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 7s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 9s
security-review / approved (pull_request_review) Successful in 10s
audit-force-merge / audit (pull_request_target) Successful in 8s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
b903926afe
The TestRecoverPanic_RecoversFromPanicInGoroutine test had a
concurrent read/write on a shared bytes.Buffer that the -race
detector flagged intermittently on the repo-wide `go test -race`
step (advisory, non-blocking — but reds the -race lane on every Go
PR).

Two compounding root causes:

1. WRONG LIFO DEFER ORDER. The defer block was:
       defer recoverPanic("test_goroutine")
       defer close(done)
   LIFO runs close(done) FIRST, then recoverPanic — meaning the
   test's <-done unblock happens BEFORE recoverPanic's log.Printf
   writes to the buffer. The test then reads buf.String() while
   the goroutine is still writing. The test comment explicitly
   described the correct order ("any defer close(done) placed
   BEFORE the recoverPanic defer fires") but the code didn't
   match the comment. Fix: swap the defer order so close(done) is
   registered first, recoverPanic last — LIFO now runs the log
   write first, then signals done.

2. NO CROSS-GOROUTINE SYNCHRONIZATION ON THE BUFFER. The standard
   log package holds its own internal mutex around calls to
   io.Writer.Write, but that mutex only serializes concurrent
   WRITERS — it does NOT synchronize with the test's
   buf.String() READ from a separate goroutine. Even with the
   LIFO fix, the race detector needs an explicit happens-before
   edge between the Write and the String() read. Fix: wrap
   bytes.Buffer in a syncBuffer type with a sync.Mutex guarding
   both Write and String. Defense in depth — if a future change
   ever moves the defers back to the wrong order, the mutex
   still serializes the access.

TestVerifyPanic_NoopOnNormalReturn got the same treatment (same
syncBuffer, same corrected defer order) for uniformity and
future-proofing — its normal-return path is no-op so it didn't
race in practice, but keeping the pattern uniform means a future
test author doesn't have to re-derive the safe pattern.

Test surface: TEST-ONLY change (no production code). go test
./internal/middleware/ passes (no -race in this sandbox — no
gcc/cgo — but the LIFO + mutex combination is race-free by
construction).
agent-reviewer-cr2 approved these changes 2026-06-14 08:00:21 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED on b903926a.

Verified Platform Go job 497007 is a real matching-head run and completed successfully. Its advisory go test -race -timeout 10m ./... step completed success; the job log contains no WARNING: DATA RACE and no TestRecoverPanic failure.

Diff review: the defer order now registers close(done) first and recoverPanic(...) last, so LIFO runs panic recovery/logging before the test goroutine signals completion. The new syncBuffer mutex guards both Write and String, providing the required happens-before edge between log writes and assertions. The tests still assert the intended panic recovery/no-op behavior, and the change is test-only.

APPROVED on b903926a. Verified Platform Go job 497007 is a real matching-head run and completed successfully. Its advisory `go test -race -timeout 10m ./...` step completed success; the job log contains no `WARNING: DATA RACE` and no `TestRecoverPanic` failure. Diff review: the defer order now registers `close(done)` first and `recoverPanic(...)` last, so LIFO runs panic recovery/logging before the test goroutine signals completion. The new `syncBuffer` mutex guards both `Write` and `String`, providing the required happens-before edge between log writes and assertions. The tests still assert the intended panic recovery/no-op behavior, and the change is test-only.
devops-engineer merged commit 04fb2d74f5 into main 2026-06-14 08:00:40 +00:00
Sign in to join this conversation.
No Reviewers
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2835