fix(pendinguploads): wait for error metric before test exit

TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.

Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.

The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.

Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
dev-lead 2026-05-08 07:37:45 -07:00
parent ae2d9eabf6
commit 9e18ab4620

View File

@ -207,20 +207,25 @@ func TestStartSweeper_TransientErrorDoesNotCrashLoop(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// 50ms ticker so the second cycle fires quickly enough for the test.
// We re-export SweepInterval as a const, but tests use the public
// StartSweeper that takes its own interval — wait, the public
// StartSweeper signature uses the package-level SweepInterval. Hmm,
// this means the test takes ~5 minutes. Let me reconsider.
//
// (We patch the test below to just look at the immediate-sweep call
// + an error path, since the immediate call is enough to prove the
// "error doesn't crash" contract — the loop continues afterward
// regardless of timing.)
// Capture metric baseline so we can wait for the error counter to
// settle before returning — otherwise this test's leaked metric
// write races with the next test's metricDelta() baseline read and
// causes a non-deterministic +1 leak (manifests as
// TestStartSweeper_RecordsMetricsOnSuccess: "error counter delta=1,
// want 0"). cycleDone fires inside the fake's Sweep defer, BEFORE
// sweepOnce records the error metric — so cancel() right after
// waitForCycle is too early.
_, _, deltaError := metricDelta(t)
go pendinguploads.StartSweeper(ctx, store, time.Hour)
// Wait for the first (errored) cycle.
store.waitForCycle(t, 1, 2*time.Second)
// Wait for the goroutine to record the error metric. After this
// returns, sweepOnce has fully completed and a subsequent cancel()
// stops the loop on the next select pass with no in-flight metric
// writes outstanding.
waitForMetricDelta(t, deltaError, 1, 2*time.Second)
// Cancel — the goroutine returns cleanly, proving the error path
// didn't crash the loop. Without this fix the goroutine would have
// either panicked (process abort visible at exit) or stuck (this