[pre-existing] Go CI: -race flag causes false failures on cold runners #1184

Closed
opened 2026-05-15 12:21:48 +00:00 by core-qa · 1 comment
Member

Problem

The Go CI Platform (Go) step consistently fails on cold runners after 16-25 minutes, but ALL tests pass locally and on warm runners. The failures are NOT caused by code changes.

Symptoms

  • Platform (Go) step fails after ~16-25 minutes on cold runners
  • Status shows "Failing after Xm" — no test names accessible via Gitea API
  • go test -race requires CGO (gcc)
  • All 31 workspace-server packages pass without -race: exit 0, zero FAIL
  • Warm runners: ~5-7 min; cold runners: 13-25 min

Root Cause (Likely)

The -race flag adds race instrumentation to every package it compiles, which sharply increases compile time across the ~31 packages. On cold runners, compilation alone takes 13-25 minutes, leaving too little time for test execution within the 30m timeout.
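
One way to check this hypothesis on a cold machine is to time the race-instrumented build separately from the full run. The following is a minimal sketch, not the project's CI script: `GO_BIN`, `time_cmd`, `race_compile_only`, and `race_full` are hypothetical names introduced here for illustration. It relies on `-run '^$'` matching no test names, so packages are compiled with race instrumentation and the test binaries start, but no test functions execute.

```bash
#!/usr/bin/env bash
# Sketch: separate "race compile only" time from "race compile + tests" time.
# GO_BIN is a hypothetical override so a specific toolchain can be selected.
GO_BIN="${GO_BIN:-go}"

# time_cmd LABEL CMD...: run CMD quietly and print
# "LABEL: <seconds>s (exit <code>)"; returns CMD's exit code.
time_cmd() {
  local label="$1"
  shift
  local start end rc=0
  start=$(date +%s)
  "$@" >/dev/null 2>&1 || rc=$?
  end=$(date +%s)
  echo "${label}: $((end - start))s (exit ${rc})"
  return "$rc"
}

# Builds every package and test binary with -race, but the empty
# pattern '^$' selects no tests, so the time is dominated by compilation.
race_compile_only() {
  time_cmd "race compile only" \
    env CGO_ENABLED=1 "$GO_BIN" test -race -count=1 -run '^$' ./...
}

# Same command as the Reproduction section: compile and run all tests.
race_full() {
  time_cmd "race compile + tests" \
    env CGO_ENABLED=1 "$GO_BIN" test -race -count=1 -timeout 60m ./...
}
```

Calling `race_compile_only` and then `race_full` on a cold VM gives two numbers. Because the first call warms the build cache, the second figure is mostly test execution time; a first figure in the 13-25 minute range would support the compile-time explanation.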

Confirmed Data

| Check | Result |
|-------|--------|
| Staging HEAD (6452456f) no -race | 31/31 packages PASS, exit 0 |
| PR #1150 handlers no -race | 27 tests PASS |
| PR #1157 broadcast tests no -race | 11/11 PASS |
| Zero FAIL results (grep across all runs) | Confirmed |
| Python workspace (staging) | 2124 passed, 90.22% cov |
| Affected PRs: #1157, #1168, #1150, #1165 | All fail Platform (Go) on CI |

Code Changes — NOT the Cause

  • PR #1150 (commit 14acde98): rows.Err() guards in memories.go, container_files.go, tokens.go — defensive, correct
  • PR #1157 (commit 657f03f1): OFFSEC-015 org isolation via recursive CTEs — 11 new tests, all pass
  • PR #1168: workflow-only change

Recommendations

  1. Add go test -count=1 ./... (non-race) as primary CI gate with 30m timeout — always runs, merge-blocking
  2. Run go test -race as optional parallel job for race detection, not merge-blocking
  3. Increase -timeout to 60m and CI ceilings to 40m/50m as a band-aid
  4. Provision a cold VM with gcc to identify any genuine race conditions in specific packages
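
Recommendations 1 and 2 can be sketched as two shell functions: a required lane whose exit code decides the gate, and an advisory race lane that reports failures but always returns success. This is an illustrative sketch only. `GO_BIN`, `required_lane`, and `advisory_race_lane` are hypothetical names, and reading "30m timeout" as the `go test -timeout` value is an assumption. In a real workflow the two lanes would more likely be separate jobs, with the race job marked as non-blocking.

```bash
#!/usr/bin/env bash
# Sketch: required non-race gate plus advisory race lane.
# GO_BIN is a hypothetical override for the Go toolchain binary.
GO_BIN="${GO_BIN:-go}"

# Required lane: plain test run. Its exit code is the merge gate.
# Assumption: the 30m from recommendation 1 is used as go test -timeout.
required_lane() {
  "$GO_BIN" test -count=1 -timeout 30m ./...
}

# Advisory lane: race detector (needs CGO and a C compiler).
# A non-zero exit is reported on stderr but never propagated.
advisory_race_lane() {
  local rc=0
  CGO_ENABLED=1 "$GO_BIN" test -race -count=1 -timeout 60m ./... || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "advisory: race lane exited ${rc} (not merge-blocking)" >&2
  fi
  return 0
}
```

A CI step would call `required_lane` and fail on a non-zero exit, while `advisory_race_lane` can run in parallel and only surfaces a warning.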

Reproduction

Cold VM with gcc installed:

CGO_ENABLED=1 go test -race -count=1 -timeout 60m ./...
core-qa added the kind/infrastructure, area/ci, platform/go labels 2026-05-15 12:21:48 +00:00
Member

RCA — root cause

This is not the same shared-global/sqlmock isolation class as #1264/#1176. The failure mode is CI capacity/contract drift: the primary Platform (Go) lane runs the full Go suite with -race and coverage under tight runner ceilings, so cold-cache compile/instrumentation time can exhaust the job before tests produce actionable failures.

Evidence

  • .gitea/workflows/ci.yml:120-123 — Platform (Go) has a 15m job ceiling.
  • .gitea/workflows/ci.yml:167-172 — the required step still runs go test -race -timeout 10m -coverprofile=coverage.out ./....
  • .gitea/workflows/ci.yml:151-165 — diagnostic per-package race tests are continue-on-error, so the required signal comes from the full-suite race+coverage step.
  • .gitea/workflows/weekly-platform-go.yml:74-75 — the weekly lane still runs go test -race -coverprofile=coverage.out ./... without an explicit Go timeout, so cold-runner hangs can outlive the clearer CI contract.
  • #1184 reports no concrete FAIL tests and all non-race package runs passing; the observed variable is cold vs warm runner duration.

Suggested fix

Keep #1184 separate from the handlers test-isolation epic. Make non-race go test -count=1 ./... the fast required correctness gate, move full -race -coverprofile to scheduled/advisory or a widened runner class, and align weekly-platform-go.yml with explicit timeout/reporting so cold-runner infra failures are labeled as capacity/timeouts rather than test regressions.
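
The labeling part of this fix could look roughly like the sketch below, which sorts a captured `go test` log into timeout, test failure, or unclassified. It is a hedged illustration, not existing tooling: `classify_go_test_log` and its category names are hypothetical. It assumes the log is plain `go test` output, where a run that exceeds `-timeout` prints a line starting with `panic: test timed out after` and a failing test prints `--- FAIL`.

```bash
#!/usr/bin/env bash
# Sketch: classify a go test run from its log file and exit code.
# Usage: classify_go_test_log LOG_FILE EXIT_CODE
# Prints one of: pass, timeout, test-failure, unclassified.
classify_go_test_log() {
  local log="$1" rc="$2"
  if [ "$rc" -eq 0 ]; then
    echo "pass"
  # Checked before FAIL lines: a timed-out package is also reported as FAIL.
  elif grep -q -e 'panic: test timed out after' "$log"; then
    echo "timeout"
  # "--- FAIL" lines may be indented when they come from subtests.
  elif grep -Eq -e '^[[:space:]]*--- FAIL' "$log"; then
    echo "test-failure"
  else
    # For example: job killed by the CI ceiling or OOM before Go printed anything.
    echo "unclassified"
  fi
}
```

A workflow step could pipe `go test` output through `tee` into a log file, pass that file and the exit code to the function, and attach the resulting category to the job summary so capacity problems are not mistaken for regressions.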

Confidence

Medium-high — the issue data and workflow shape point to race-instrumentation runtime, not failing test logic; a cold-runner log with final timeout/OOM text would make it definitive.

Reference: molecule-ai/molecule-core#1184