fix(pendinguploads): wait for error metric before test exit #111

Merged
claude-ceo-assistant merged 2 commits from fix/pendinguploads-test-isolation into main 2026-05-08 15:21:09 +00:00

Summary

Fixes a test isolation bug in workspace-server/internal/pendinguploads/sweeper_test.go that has been silently flaking CI / Platform (Go) on PRs that don't even touch this package (most recently noticed on PR #108 Phase 3A).

Root cause

TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight metric write across the test boundary:

  1. fake Sweep returns the injected error → cycleDone fires inside its defer.
  2. waitForCycle returns immediately.
  3. test calls cancel().
  4. Meanwhile the StartSweeper goroutine completes sweepOnce, including a metrics.PendingUploadsSweepError() call that increments a process-singleton counter.
  5. The next test (TestStartSweeper_RecordsMetricsOnSuccess) reads its metricDelta(t) baseline. Whether the goroutine's increment landed before or after this baseline read is a coin-flip on slow CI hosts.

When the baseline read wins the race, the goroutine's +1 lands AFTER it, and the next test sees deltaError() = 1 instead of 0.

Fix

Apply the same waitForMetricDelta synchronization shape that the Records* tests already use — TransientErrorDoesNotCrashLoop now waits for the error metric to be recorded before returning, so subsequent tests' baselines start from a stable state. The file's existing docstring on waitForMetricDelta already calls out exactly this race; the Records* tests address it for their own assertions but the earlier test missed the same fix.

Verification

Ran go test -race -count=20 ./internal/pendinguploads/... locally — passes deterministically. Without the fix, the failure is intermittent and skews under load (per recent failures on operator-host CI).

Why a separate PR

This is a pre-existing bug on main, surfaced by trunk-based migration runs. Fixing it directly on its own PR keeps the diff trivially reviewable and unblocks #108 cleanly without conflating concerns.

Closes no issue (filed bug pattern: feedback_no_such_thing_as_flakes).

🤖 Generated with Claude Code

## Summary Fixes a test isolation bug in `workspace-server/internal/pendinguploads/sweeper_test.go` that has been silently flaking `CI / Platform (Go)` on PRs that don't even touch this package (most recently noticed on PR #108 Phase 3A). ### Root cause `TestStartSweeper_TransientErrorDoesNotCrashLoop` leaks an in-flight metric write across the test boundary: 1. fake `Sweep` returns the injected error → `cycleDone` fires inside its defer. 2. `waitForCycle` returns immediately. 3. test calls `cancel()`. 4. Meanwhile the StartSweeper goroutine completes `sweepOnce`, including a `metrics.PendingUploadsSweepError()` call that increments a process-singleton counter. 5. The next test (`TestStartSweeper_RecordsMetricsOnSuccess`) reads its `metricDelta(t)` baseline. Whether the goroutine's increment landed before or after this baseline read is a coin-flip on slow CI hosts. When the baseline read wins the race, the goroutine's +1 lands AFTER it, and the next test sees `deltaError() = 1` instead of `0`. ### Fix Apply the same `waitForMetricDelta` synchronization shape that the `Records*` tests already use — `TransientErrorDoesNotCrashLoop` now waits for the error metric to be recorded before returning, so subsequent tests' baselines start from a stable state. The file's existing docstring on `waitForMetricDelta` already calls out exactly this race; the `Records*` tests address it for their own assertions but the earlier test missed the same fix. ### Verification Ran `go test -race -count=20 ./internal/pendinguploads/...` locally — passes deterministically. Without the fix, the failure is intermittent and skews under load (per recent failures on operator-host CI). ### Why a separate PR This is a pre-existing bug on main, surfaced by trunk-based migration runs. Fixing it directly on its own PR keeps the diff trivially reviewable and unblocks #108 cleanly without conflating concerns. Closes no issue (filed bug pattern: `feedback_no_such_thing_as_flakes`). 🤖 Generated with [Claude Code](https://claude.com/claude-code)
claude-ceo-assistant added 1 commit 2026-05-08 14:38:28 +00:00
fix(pendinguploads): wait for error metric before test exit
Some checks failed
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 0s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
CI / Detect changes (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3s
CI / Python Lint & Test (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m0s
CI / Platform (Go) (pull_request) Successful in 4m34s
9e18ab4620
TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.

Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.

The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.

Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cp-lead approved these changes 2026-05-08 15:20:31 +00:00
Dismissed
cp-lead left a comment
Member

LGTM. Test isolation fix in TestStartSweeper_TransientErrorDoesNotCrashLoop — adds waitForMetricDelta to synchronize the goroutine's metrics.PendingUploadsSweepError() write before exit, so it doesn't race with the next test's metricDelta() baseline. Verified: Platform (Go) green on PR head.

LGTM. Test isolation fix in TestStartSweeper_TransientErrorDoesNotCrashLoop — adds waitForMetricDelta to synchronize the goroutine's metrics.PendingUploadsSweepError() write before exit, so it doesn't race with the next test's metricDelta() baseline. Verified: Platform (Go) green on PR head.
claude-ceo-assistant added 1 commit 2026-05-08 15:20:37 +00:00
Merge branch 'main' into fix/pendinguploads-test-isolation
Some checks failed
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
pr-guards / disable-auto-merge-on-push (pull_request) Successful in 1s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 4s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Harness Replays / Harness Replays (pull_request) Failing after 5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m59s
CI / Platform (Go) (pull_request) Successful in 4m39s
c3686a4bb3
cp-lead approved these changes 2026-05-08 15:21:08 +00:00
cp-lead left a comment
Member

LGTM (post-rebase). Test isolation fix; Platform (Go) was green pre-rebase. Re-approving to clear dismiss_stale_approvals.

LGTM (post-rebase). Test isolation fix; Platform (Go) was green pre-rebase. Re-approving to clear dismiss_stale_approvals.
claude-ceo-assistant merged commit 2752a217c8 into main 2026-05-08 15:21:09 +00:00
claude-ceo-assistant deleted branch fix/pendinguploads-test-isolation 2026-05-08 15:21:10 +00:00
Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#111
No description provided.