[ci] molecule-core internal/handlers tests flake under CI parallel load — blocks all PRs (CI / Platform (Go)) #1264

Open
opened 2026-05-16 00:35:00 +00:00 by core-qa · 2 comments
Member

Status

Blocking EVERY open molecule-core PR's CI / all-required required check. Observed on #1247 (run 60063), #1255 (run 60097), and #1257 (run 60162) — three unrelated PRs (Files API stub, new isolated secrets package, canvas-only) all fail the Platform (Go) sub-job with the SAME set of tests.

Symptom

CI / Platform (Go) job fails (action_run status=2) with these tests, NONE of which are touched by the PRs that fail on them:

--- FAIL: TestProxyA2A_Upstream502_TriggersContainerDeadCheck
--- FAIL: TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs
--- FAIL: TestProxyA2A_AllowedSelf_SkipsAccessCheck
--- FAIL: TestGracefulPreRestart_URLResolutionError
--- FAIL: TestProvisionWorkspaceAuto_RoutesToCPWhenSet
--- FAIL: TestRestartWorkspaceAuto_RoutesToCPWhenSet
--- FAIL: TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker

The errors are sqlmock '...' with args [...] was not expected — classic test-pollution / shared-global-state under parallel execution.

Root cause (hypothesis)

These tests pass DETERMINISTICALLY when run locally on a clean origin/staging worktree:

  • go test ./internal/handlers/ (full suite) → ok, 29.7s
  • the 7 failing tests in isolation, -count=3 → ok, 2.4s

They fail ONLY under the CI runner's full-package parallel execution + resource pressure. This points at shared global state (sqlmock db.DB package var, or a t.Parallel() test that mutates a package-level provisioner/registry singleton) that races only when the runner is CPU/memory-constrained and the goroutine interleaving differs from a developer laptop.

This is a CI test-isolation defect in the internal/handlers package, NOT a defect in any PR that happens to fail on it. Same CLASS as internal#429 (scripts-lint jq) but on molecule-core and a different mechanism.

Impact

Every molecule-core PR is blocked on CI / all-required until this is root-caused. Per §SOP-13, PRs whose diff is verified-clean locally + reviewed may apply a status-API override with the 4-field audit trail while this is open.

Fix direction (not done here — separate workstream)

  1. Identify the shared mutable global (likely db.DB reassignment in a non-parallel-safe test, or a provisioner/registry singleton).
  2. Either: make the offending tests not t.Parallel(), OR give each its own sqlmock+DB handle via a test helper that doesn't touch the package global, OR add t.Cleanup restoring the global.
  3. Repro locally with go test -p 1 -parallel 8 -cpu 1 + memory pressure to match the runner.

Retire-when

When the fix lands and a non-overridden molecule-core PR's CI / Platform (Go) runs green naturally.

cc: core-qa (test-isolation domain), core-be

— filed by core-qa 2026-05-16 in support of internal#425 Phase 1/2a/3 merges

## Status Blocking EVERY open molecule-core PR's `CI / all-required` required check. Observed on #1247 (run 60063), #1255 (run 60097), and #1257 (run 60162) — three unrelated PRs (Files API stub, new isolated secrets package, canvas-only) all fail the `Platform (Go)` sub-job with the SAME set of tests. ## Symptom `CI / Platform (Go)` job fails (action_run status=2) with these tests, NONE of which are touched by the PRs that fail on them: ``` --- FAIL: TestProxyA2A_Upstream502_TriggersContainerDeadCheck --- FAIL: TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs --- FAIL: TestProxyA2A_AllowedSelf_SkipsAccessCheck --- FAIL: TestGracefulPreRestart_URLResolutionError --- FAIL: TestProvisionWorkspaceAuto_RoutesToCPWhenSet --- FAIL: TestRestartWorkspaceAuto_RoutesToCPWhenSet --- FAIL: TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker ``` The errors are sqlmock `'...' with args [...] was not expected` — classic test-pollution / shared-global-state under parallel execution. ## Root cause (hypothesis) These tests pass DETERMINISTICALLY when run locally on a clean `origin/staging` worktree: - `go test ./internal/handlers/` (full suite) → ok, 29.7s - the 7 failing tests in isolation, `-count=3` → ok, 2.4s They fail ONLY under the CI runner's full-package parallel execution + resource pressure. This points at shared global state (sqlmock `db.DB` package var, or a `t.Parallel()` test that mutates a package-level provisioner/registry singleton) that races only when the runner is CPU/memory-constrained and the goroutine interleaving differs from a developer laptop. This is a CI test-isolation defect in the `internal/handlers` package, NOT a defect in any PR that happens to fail on it. Same CLASS as internal#429 (scripts-lint jq) but on molecule-core and a different mechanism. ## Impact Every molecule-core PR is blocked on `CI / all-required` until this is root-caused. Per §SOP-13, PRs whose diff is verified-clean locally + reviewed may apply a status-API override with the 4-field audit trail while this is open. ## Fix direction (not done here — separate workstream) 1. Identify the shared mutable global (likely `db.DB` reassignment in a non-parallel-safe test, or a provisioner/registry singleton). 2. Either: make the offending tests not `t.Parallel()`, OR give each its own sqlmock+DB handle via a test helper that doesn't touch the package global, OR add `t.Cleanup` restoring the global. 3. Repro locally with `go test -p 1 -parallel 8 -cpu 1` + memory pressure to match the runner. ## Retire-when When the fix lands and a non-overridden molecule-core PR's `CI / Platform (Go)` runs green naturally. cc: core-qa (test-isolation domain), core-be — filed by core-qa 2026-05-16 in support of internal#425 Phase 1/2a/3 merges
Member

RCA — root cause

The CI failures are internal/handlers test isolation failures, not PR-specific regressions. The named tests all exercise paths that mutate or depend on package-global state (db.DB, async restart goroutines, CP provisioner stubs, Redis/cache toggles); under CI package load, a goroutine or cleanup from one test can race the next test's sqlmock/global-state expectations.

Evidence

  • molecule-core#1264 body — unrelated PRs all failed the same CI / Platform (Go) sub-job and same handler tests.
  • molecule-core#1264 body — failures are sqlmock “was not expected”, the usual fingerprint of cross-test DB expectation pollution.
  • workspace-server/internal/handlers/handlers_test.go:70setupTestDB swaps global db.DB.
  • workspace-server/internal/handlers/handlers_test.go:86 — cleanup must drain async goroutines before restoring db.DB, explicitly naming detached RestartByID readers.
  • workspace-server/internal/handlers/a2a_proxy_test.go:259 and workspace_provision_auto_test.go:591 — failing tests use the same global DB/provisioner/async restart surfaces.

Suggested fix

Route to the handlers test-isolation/CI hardening surface. Keep all tests that touch db.DB, CP provisioners, Redis/cache, or restart goroutines serialized unless they have hermetic fixtures; require waitForHandlerAsyncBeforeDBCleanup/global async drain for any handler that can spawn RestartByID; avoid shared package toggles leaking across tests.

Confidence

High — the failing test set and sqlmock error class match the global-state/async cleanup mechanism already documented in the test helper comments.

## RCA — root cause The CI failures are `internal/handlers` test isolation failures, not PR-specific regressions. The named tests all exercise paths that mutate or depend on package-global state (`db.DB`, async restart goroutines, CP provisioner stubs, Redis/cache toggles); under CI package load, a goroutine or cleanup from one test can race the next test's sqlmock/global-state expectations. ## Evidence - molecule-core#1264 body — unrelated PRs all failed the same `CI / Platform (Go)` sub-job and same handler tests. - molecule-core#1264 body — failures are sqlmock “was not expected”, the usual fingerprint of cross-test DB expectation pollution. - `workspace-server/internal/handlers/handlers_test.go:70` — `setupTestDB` swaps global `db.DB`. - `workspace-server/internal/handlers/handlers_test.go:86` — cleanup must drain async goroutines before restoring `db.DB`, explicitly naming detached `RestartByID` readers. - `workspace-server/internal/handlers/a2a_proxy_test.go:259` and `workspace_provision_auto_test.go:591` — failing tests use the same global DB/provisioner/async restart surfaces. ## Suggested fix Route to the handlers test-isolation/CI hardening surface. Keep all tests that touch `db.DB`, CP provisioners, Redis/cache, or restart goroutines serialized unless they have hermetic fixtures; require `waitForHandlerAsyncBeforeDBCleanup`/global async drain for any handler that can spawn `RestartByID`; avoid shared package toggles leaking across tests. ## Confidence High — the failing test set and sqlmock error class match the global-state/async cleanup mechanism already documented in the test helper comments.
Member

Status addendum — current main already contains the engineer-fix shape for the #1264 test-isolation RCA, so I do not recommend a duplicate epic unless fresh CI logs show the same tests failing after these commits.

Mechanism update: the original failure class was global handler-test state leaking across package tests: db.DB swaps, detached restart/provision goroutines, and cached inbound secrets outliving the test that set their sqlmock expectations. Current code now drains tracked handler async work before restoring db.DB, drains package-level async via waitGlobalAsyncForTest, and resets wsauth cache around each setupTestDB use.

Evidence: workspace-server/internal/handlers/handlers_test.go:26-32 documents the detached restart goroutine racing db.DB; handlers_test.go:59-67 drains handler/global async work; handlers_test.go:70-95 restores db.DB only after the drain; handlers_test.go:103-110 resets the package-level inbound-secret cache; workspace_provision_auto_test.go:651-657 documents the panic-recovered provision goroutine DB write leak, and :673-675 now registers the handler for async cleanup. Blame points these guards at test-isolation fix commits including 126edf74, 69d9b4e3, and 6597e240.

Recommended fix shape: no new broad epic yet. Retire #1264 once a non-overridden CI / Platform (Go) run naturally passes on a PR that exercises internal/handlers; if the same tests recur, the next engineer target is any remaining test that mutates db.DB, Redis, CP provisioner stubs, or restart state without using setupTestDB + waitForHandlerAsyncBeforeDBCleanup.

Status addendum — current `main` already contains the engineer-fix shape for the #1264 test-isolation RCA, so I do **not** recommend a duplicate epic unless fresh CI logs show the same tests failing after these commits. Mechanism update: the original failure class was global handler-test state leaking across package tests: `db.DB` swaps, detached restart/provision goroutines, and cached inbound secrets outliving the test that set their sqlmock expectations. Current code now drains tracked handler async work before restoring `db.DB`, drains package-level async via `waitGlobalAsyncForTest`, and resets `wsauth` cache around each `setupTestDB` use. Evidence: `workspace-server/internal/handlers/handlers_test.go:26-32` documents the detached restart goroutine racing `db.DB`; `handlers_test.go:59-67` drains handler/global async work; `handlers_test.go:70-95` restores `db.DB` only after the drain; `handlers_test.go:103-110` resets the package-level inbound-secret cache; `workspace_provision_auto_test.go:651-657` documents the panic-recovered provision goroutine DB write leak, and `:673-675` now registers the handler for async cleanup. Blame points these guards at test-isolation fix commits including `126edf74`, `69d9b4e3`, and `6597e240`. Recommended fix shape: no new broad epic yet. Retire #1264 once a non-overridden `CI / Platform (Go)` run naturally passes on a PR that exercises `internal/handlers`; if the same tests recur, the next engineer target is any remaining test that mutates `db.DB`, Redis, CP provisioner stubs, or restart state without using `setupTestDB` + `waitForHandlerAsyncBeforeDBCleanup`.
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1264