fix(ci): port 3 verified-green CI fixes from staging to main
Some checks failed
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
branch-protection drift check / Branch protection drift (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 9s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 57s
Harness Replays / Harness Replays (pull_request) Failing after 57s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m20s
CI / Canvas (Next.js) (pull_request) Successful in 1m56s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4m12s
Some checks failed
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
branch-protection drift check / Branch protection drift (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 8s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 9s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 57s
Harness Replays / Harness Replays (pull_request) Failing after 57s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m20s
CI / Canvas (Next.js) (pull_request) Successful in 1m56s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4m12s
Files copied from staging tip (a4ab623b):
- canvas/vitest.config.ts (vitest testTimeout 5000→30000ms on CI; was PR #97 unblocking 4 canvas-test PRs)
- .github/workflows/handlers-postgres-integration.yml (parallel-safe pg containers; was PR #98 unblocking #84)
- .github/workflows/e2e-api.yml (parallel-safe pg+redis; was PR #100 unblocking #84 + #99)
- docs/runbooks/handlers-postgres-integration-port-collision.md (new — substrate runbook)
Why a separate cherry-pick PR rather than promote-staging-to-main: PR #99 (full
staging→main promote) hit a Platform (Go) sqlmock regression in some other
staging commit (under Phase 1 investigation by sister agent a283d938). To
unblock prod with the verified-green CI fixes WITHOUT carrying the Go-test
regression, port these 3 workflow/config files surgically.
Verified clean: workflow-YAML + vitest config + runbook only — zero Go code
touched, so the Platform (Go) failure on PR #99 cannot apply here.
Co-authored-by: Claude (orchestrator)
This commit is contained in:
parent
068c968206
commit
570f456436
130
.github/workflows/e2e-api.yml
vendored
130
.github/workflows/e2e-api.yml
vendored
@ -12,6 +12,59 @@ name: E2E API Smoke Test
|
||||
# spending CI cycles. See the in-job comment on the `e2e-api` job for
|
||||
# why this is one job (not two-jobs-sharing-name) and the 2026-04-29
|
||||
# PR #2264 incident that drove the consolidation.
|
||||
#
|
||||
# Parallel-safety (Class B Hongming-owned CICD red sweep, 2026-05-08)
|
||||
# -------------------------------------------------------------------
|
||||
# Same substrate hazard as PR #98 (handlers-postgres-integration). Our
|
||||
# Gitea act_runner runs with `container.network: host` (operator host
|
||||
# `/opt/molecule/runners/config.yaml`), which means:
|
||||
#
|
||||
# * Two concurrent runs both try to bind their `-p 15432:5432` /
|
||||
# `-p 16379:6379` host ports — the second postgres/redis FATALs
|
||||
# with `Address in use` and `docker run` returns exit 125 with
|
||||
# `Conflict. The container name "/molecule-ci-postgres" is already
|
||||
# in use by container ...`. Verified in run a7/2727 on 2026-05-07.
|
||||
# * The fixed container names `molecule-ci-postgres` / `-redis` (the
|
||||
# pre-fix shape) collide on name AS WELL AS port. The cleanup-with-
|
||||
# `docker rm -f` at the start of the second job KILLS the first
|
||||
# job's still-running postgres/redis.
|
||||
#
|
||||
# Fix shape (mirrors PR #98's bridge-net pattern, adapted because
|
||||
# platform-server is a Go binary on the host, not a containerised
|
||||
# step):
|
||||
#
|
||||
# 1. Unique container names per run:
|
||||
# pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
|
||||
# redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
|
||||
# `${RUN_ID}-${RUN_ATTEMPT}` is unique even across reruns of the
|
||||
# same run_id.
|
||||
# 2. Ephemeral host port per run (`-p 0:5432`), then read the actual
|
||||
# bound port via `docker port` and export DATABASE_URL/REDIS_URL
|
||||
# pointing at it. No fixed host-port → no port collision.
|
||||
# 3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve was
|
||||
# the original flake fixed in #92 and the script's still IPv6-
|
||||
# enabled.
|
||||
# 4. `if: always()` cleanup so containers don't leak when test steps
|
||||
# fail.
|
||||
#
|
||||
# Issue #94 items #2 + #3 (also fixed here):
|
||||
# * Pre-pull `alpine:latest` so the platform-server's provisioner
|
||||
# (`internal/handlers/container_files.go`) can stand up its
|
||||
# ephemeral token-write helper without a daemon.io round-trip.
|
||||
# * Create `molecule-monorepo-net` bridge network if missing so the
|
||||
# provisioner's container.HostConfig {NetworkMode: ...} attach
|
||||
# succeeds.
|
||||
# Item #1 (timeouts) — evidence on recent runs (77/3191, ae/4270, 0e/
|
||||
# 2318) shows Postgres ready in 3s, Redis in 1s, Platform in 1s when
|
||||
# they DO come up. Timeouts are not the bottleneck; not bumped.
|
||||
#
|
||||
# Item explicitly NOT fixed here: failing test `Status back online`
|
||||
# fails because the platform's langgraph workspace template image
|
||||
# (ghcr.io/molecule-ai/workspace-template-langgraph:latest) returns
|
||||
# 403 Forbidden post-2026-05-06 GitHub org suspension. That is a
|
||||
# template-registry resolution issue (ADR-002 / local-build mode) and
|
||||
# belongs in a separate change that touches workspace-server, not
|
||||
# this workflow file.
|
||||
|
||||
on:
|
||||
push:
|
||||
@ -78,11 +131,14 @@ jobs:
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 15
|
||||
env:
|
||||
DATABASE_URL: postgres://dev:dev@localhost:15432/molecule?sslmode=disable
|
||||
REDIS_URL: redis://localhost:16379
|
||||
# Unique per-run container names so concurrent runs on the host-
|
||||
# network act_runner don't collide on name OR port.
|
||||
# `${RUN_ID}-${RUN_ATTEMPT}` stays unique across reruns of the
|
||||
# same run_id. PORT is set later (after docker port lookup) since
|
||||
# we let Docker assign an ephemeral host port.
|
||||
PG_CONTAINER: pg-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
|
||||
REDIS_CONTAINER: redis-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
|
||||
PORT: "8080"
|
||||
PG_CONTAINER: molecule-ci-postgres
|
||||
REDIS_CONTAINER: molecule-ci-redis
|
||||
steps:
|
||||
- name: No-op pass (paths filter excluded this commit)
|
||||
if: needs.detect-changes.outputs.api != 'true'
|
||||
@ -97,11 +153,53 @@ jobs:
|
||||
go-version: 'stable'
|
||||
cache: true
|
||||
cache-dependency-path: workspace-server/go.sum
|
||||
- name: Pre-pull alpine + ensure provisioner network (Issue #94 items #2 + #3)
|
||||
if: needs.detect-changes.outputs.api == 'true'
|
||||
run: |
|
||||
# Provisioner uses alpine:latest for ephemeral token-write
|
||||
# containers (workspace-server/internal/handlers/container_files.go).
|
||||
# Pre-pull so the first provision in test_api.sh doesn't race
|
||||
# the daemon's pull cache. Idempotent — `docker pull` is a no-op
|
||||
# when the image is already present.
|
||||
docker pull alpine:latest >/dev/null
|
||||
# Provisioner attaches workspace containers to
|
||||
# molecule-monorepo-net (workspace-server/internal/provisioner/
|
||||
# provisioner.go::DefaultNetwork). The bridge already exists on
|
||||
# the operator host's docker daemon — `network create` is
|
||||
# idempotent via `|| true`.
|
||||
docker network create molecule-monorepo-net >/dev/null 2>&1 || true
|
||||
echo "alpine:latest pre-pulled; molecule-monorepo-net ensured."
|
||||
- name: Start Postgres (docker)
|
||||
if: needs.detect-changes.outputs.api == 'true'
|
||||
run: |
|
||||
# Defensive cleanup — only matches THIS run's container name,
|
||||
# so it cannot kill a sibling run's postgres. (Pre-fix the
|
||||
# name was static and this rm hit other runs' containers.)
|
||||
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
|
||||
docker run -d --name "$PG_CONTAINER" -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule -p 15432:5432 postgres:16
|
||||
# `-p 0:5432` requests an ephemeral host port; we read it back
|
||||
# below and export DATABASE_URL.
|
||||
docker run -d --name "$PG_CONTAINER" \
|
||||
-e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \
|
||||
-p 0:5432 postgres:16 >/dev/null
|
||||
# Resolve the host-side port assignment. `docker port` prints
|
||||
# `0.0.0.0:NNNN` (and on host-net runners may also print an
|
||||
# IPv6 line — take the first IPv4 line).
|
||||
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
|
||||
if [ -z "$PG_PORT" ]; then
|
||||
# Fallback: any first line. Some Docker versions print only
|
||||
# one line.
|
||||
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}')
|
||||
fi
|
||||
if [ -z "$PG_PORT" ]; then
|
||||
echo "::error::Could not resolve host port for $PG_CONTAINER"
|
||||
docker port "$PG_CONTAINER" 5432/tcp || true
|
||||
docker logs "$PG_CONTAINER" || true
|
||||
exit 1
|
||||
fi
|
||||
# 127.0.0.1 (NOT localhost) — IPv6 first-resolve flake (#92).
|
||||
echo "PG_PORT=${PG_PORT}" >> "$GITHUB_ENV"
|
||||
echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
|
||||
echo "Postgres host port: ${PG_PORT}"
|
||||
for i in $(seq 1 30); do
|
||||
if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then
|
||||
echo "Postgres ready after ${i}s"
|
||||
@ -116,7 +214,20 @@ jobs:
|
||||
if: needs.detect-changes.outputs.api == 'true'
|
||||
run: |
|
||||
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
|
||||
docker run -d --name "$REDIS_CONTAINER" -p 16379:6379 redis:7
|
||||
docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null
|
||||
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
|
||||
if [ -z "$REDIS_PORT" ]; then
|
||||
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}')
|
||||
fi
|
||||
if [ -z "$REDIS_PORT" ]; then
|
||||
echo "::error::Could not resolve host port for $REDIS_CONTAINER"
|
||||
docker port "$REDIS_CONTAINER" 6379/tcp || true
|
||||
docker logs "$REDIS_CONTAINER" || true
|
||||
exit 1
|
||||
fi
|
||||
echo "REDIS_PORT=${REDIS_PORT}" >> "$GITHUB_ENV"
|
||||
echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV"
|
||||
echo "Redis host port: ${REDIS_PORT}"
|
||||
for i in $(seq 1 15); do
|
||||
if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then
|
||||
echo "Redis ready after ${i}s"
|
||||
@ -135,13 +246,15 @@ jobs:
|
||||
if: needs.detect-changes.outputs.api == 'true'
|
||||
working-directory: workspace-server
|
||||
run: |
|
||||
# DATABASE_URL + REDIS_URL exported by the start-postgres /
|
||||
# start-redis steps point at this run's per-run host ports.
|
||||
./platform-server > platform.log 2>&1 &
|
||||
echo $! > platform.pid
|
||||
- name: Wait for /health
|
||||
if: needs.detect-changes.outputs.api == 'true'
|
||||
run: |
|
||||
for i in $(seq 1 30); do
|
||||
if curl -sf http://localhost:8080/health > /dev/null; then
|
||||
if curl -sf http://127.0.0.1:8080/health > /dev/null; then
|
||||
echo "Platform up after ${i}s"
|
||||
exit 0
|
||||
fi
|
||||
@ -185,6 +298,9 @@ jobs:
|
||||
kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
|
||||
fi
|
||||
- name: Stop service containers
|
||||
# always() so containers don't leak when test steps fail. The
|
||||
# cleanup is best-effort: if the container is already gone
|
||||
# (e.g. concurrent rerun race), don't fail the job.
|
||||
if: always() && needs.detect-changes.outputs.api == 'true'
|
||||
run: |
|
||||
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
|
||||
|
||||
139
.github/workflows/handlers-postgres-integration.yml
vendored
139
.github/workflows/handlers-postgres-integration.yml
vendored
@ -14,12 +14,42 @@ name: Handlers Postgres Integration
|
||||
# self-review caught it took 2 minutes to set up and would have caught
|
||||
# the bug at PR-time.
|
||||
#
|
||||
# This job spins a Postgres service container, applies the migration,
|
||||
# and runs `go test -tags=integration` against a live DB. Required
|
||||
# check on staging branch protection — backend handler PRs cannot
|
||||
# merge without a real-DB regression gate.
|
||||
# Why this workflow does NOT use `services: postgres:` (Class B fix)
|
||||
# ------------------------------------------------------------------
|
||||
# Our act_runner config has `container.network: host` (operator host
|
||||
# /opt/molecule/runners/config.yaml), which act_runner applies to BOTH
|
||||
# the job container AND every service container. With host-net, two
|
||||
# concurrent runs of this workflow both try to bind 0.0.0.0:5432 — the
|
||||
# second postgres FATALs with `could not create any TCP/IP sockets:
|
||||
# Address in use`, and Docker auto-removes it (act_runner sets
|
||||
# AutoRemove:true on service containers). By the time the migrations
|
||||
# step runs `psql`, the postgres container is gone, hence
|
||||
# `Connection refused` then `failed to remove container: No such
|
||||
# container` at cleanup time.
|
||||
#
|
||||
# Cost: ~30s job (postgres pull from GH cache + go build + 4 tests).
|
||||
# Per-job `container.network` override is silently ignored by
|
||||
# act_runner — `--network and --net in the options will be ignored.`
|
||||
# appears in the runner log. Documented constraint.
|
||||
#
|
||||
# So we sidestep `services:` entirely. The job container still uses
|
||||
# host-net (inherited from runner config; required for cache server
|
||||
# discovery on the bridge IP 172.18.0.17:42631). We launch a sibling
|
||||
# postgres on the existing `molecule-monorepo-net` bridge with a
|
||||
# UNIQUE name per run — `pg-handlers-${RUN_ID}-${RUN_ATTEMPT}` — and
|
||||
# read its bridge IP via `docker inspect`. A host-net job container
|
||||
# can reach a bridge-net container directly via the bridge IP (verified
|
||||
# manually on operator host 2026-05-08).
|
||||
#
|
||||
# Trade-offs vs. the original `services:` shape:
|
||||
# + No host-port collision; N parallel runs share the bridge cleanly
|
||||
# + `if: always()` cleanup runs even on test-step failure
|
||||
# - One more step in the workflow (+~3 lines)
|
||||
# - Requires `molecule-monorepo-net` to exist on the operator host
|
||||
# (it does; declared in docker-compose.yml + docker-compose.infra.yml)
|
||||
#
|
||||
# Class B Hongming-owned CICD red sweep, 2026-05-08.
|
||||
#
|
||||
# Cost: ~30s job (postgres pull from cache + go build + 4 tests).
|
||||
|
||||
on:
|
||||
push:
|
||||
@ -59,20 +89,14 @@ jobs:
|
||||
name: Handlers Postgres Integration
|
||||
needs: detect-changes
|
||||
runs-on: ubuntu-latest
|
||||
services:
|
||||
postgres:
|
||||
image: postgres:15-alpine
|
||||
env:
|
||||
POSTGRES_PASSWORD: test
|
||||
POSTGRES_DB: molecule
|
||||
ports:
|
||||
- 5432:5432
|
||||
# GHA spins this with --health-cmd built in for postgres images.
|
||||
options: >-
|
||||
--health-cmd pg_isready
|
||||
--health-interval 5s
|
||||
--health-timeout 5s
|
||||
--health-retries 10
|
||||
# Unique name per run so concurrent jobs don't collide on the
|
||||
# bridge network. ${RUN_ID}-${RUN_ATTEMPT} is unique even across
|
||||
# workflow_dispatch reruns of the same run_id.
|
||||
PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
|
||||
# Bridge network already exists on the operator host (declared
|
||||
# in docker-compose.yml + docker-compose.infra.yml).
|
||||
PG_NETWORK: molecule-monorepo-net
|
||||
defaults:
|
||||
run:
|
||||
working-directory: workspace-server
|
||||
@ -89,16 +113,57 @@ jobs:
|
||||
with:
|
||||
go-version: 'stable'
|
||||
|
||||
- if: needs.detect-changes.outputs.handlers == 'true'
|
||||
name: Start sibling Postgres on bridge network
|
||||
working-directory: .
|
||||
run: |
|
||||
# Sanity: the bridge network must exist on the operator host.
|
||||
# Hard-fail loud if it doesn't — easier to spot than a silent
|
||||
# auto-create that diverges from the rest of the stack.
|
||||
if ! docker network inspect "${PG_NETWORK}" >/dev/null 2>&1; then
|
||||
echo "::error::Bridge network '${PG_NETWORK}' missing on operator host. Re-run docker-compose.infra.yml or check ops handbook."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# If a stale container with the same name exists (rerun on
|
||||
# the same run_id), wipe it first.
|
||||
docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
|
||||
|
||||
docker run -d \
|
||||
--name "${PG_NAME}" \
|
||||
--network "${PG_NETWORK}" \
|
||||
--health-cmd "pg_isready -U postgres" \
|
||||
--health-interval 5s \
|
||||
--health-timeout 5s \
|
||||
--health-retries 10 \
|
||||
-e POSTGRES_PASSWORD=test \
|
||||
-e POSTGRES_DB=molecule \
|
||||
postgres:15-alpine >/dev/null
|
||||
|
||||
# Read back the bridge IP. Always present immediately after
|
||||
# `docker run -d` for bridge networks.
|
||||
PG_HOST=$(docker inspect "${PG_NAME}" \
|
||||
--format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
|
||||
if [ -z "${PG_HOST}" ]; then
|
||||
echo "::error::Could not resolve PG_HOST for ${PG_NAME} on ${PG_NETWORK}"
|
||||
docker logs "${PG_NAME}" || true
|
||||
exit 1
|
||||
fi
|
||||
echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
|
||||
echo "INTEGRATION_DB_URL=postgres://postgres:test@${PG_HOST}:5432/molecule?sslmode=disable" >> "$GITHUB_ENV"
|
||||
echo "Started ${PG_NAME} at ${PG_HOST}:5432"
|
||||
|
||||
- if: needs.detect-changes.outputs.handlers == 'true'
|
||||
name: Apply migrations to Postgres service
|
||||
env:
|
||||
PGPASSWORD: test
|
||||
run: |
|
||||
# Wait for postgres to actually accept connections (the
|
||||
# GHA --health-cmd is best-effort but psql can still race).
|
||||
# Wait for postgres to actually accept connections. Docker's
|
||||
# health-cmd handles container-side readiness, but the wire
|
||||
# to the bridge IP is best-tested with pg_isready directly.
|
||||
for i in {1..15}; do
|
||||
if pg_isready -h localhost -p 5432 -U postgres -q; then break; fi
|
||||
echo "waiting for postgres..."; sleep 2
|
||||
if pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q; then break; fi
|
||||
echo "waiting for postgres at ${PG_HOST}:5432..."; sleep 2
|
||||
done
|
||||
|
||||
# Apply every .up.sql in lexicographic order with
|
||||
@ -131,7 +196,7 @@ jobs:
|
||||
# not fine once a cross-table atomicity test came in.
|
||||
set +e
|
||||
for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do
|
||||
if psql -h localhost -U postgres -d molecule -v ON_ERROR_STOP=1 \
|
||||
if psql -h "${PG_HOST}" -U postgres -d molecule -v ON_ERROR_STOP=1 \
|
||||
-f "$migration" >/dev/null 2>&1; then
|
||||
echo "✓ $(basename "$migration")"
|
||||
else
|
||||
@ -145,7 +210,7 @@ jobs:
|
||||
# fail if any didn't land — that would be a real regression we
|
||||
# want loud.
|
||||
for tbl in delegations workspaces activity_logs pending_uploads; do
|
||||
if ! psql -h localhost -U postgres -d molecule -tA \
|
||||
if ! psql -h "${PG_HOST}" -U postgres -d molecule -tA \
|
||||
-c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \
|
||||
| grep -q 1; then
|
||||
echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless"
|
||||
@ -156,16 +221,32 @@ jobs:
|
||||
|
||||
- if: needs.detect-changes.outputs.handlers == 'true'
|
||||
name: Run integration tests
|
||||
env:
|
||||
INTEGRATION_DB_URL: postgres://postgres:test@localhost:5432/molecule?sslmode=disable
|
||||
run: |
|
||||
# INTEGRATION_DB_URL is exported by the start-postgres step;
|
||||
# points at the per-run bridge IP, not 127.0.0.1, so concurrent
|
||||
# workflow runs don't fight over a host-net 5432 port.
|
||||
go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_"
|
||||
|
||||
- if: needs.detect-changes.outputs.handlers == 'true' && failure()
|
||||
- if: failure() && needs.detect-changes.outputs.handlers == 'true'
|
||||
name: Diagnostic dump on failure
|
||||
env:
|
||||
PGPASSWORD: test
|
||||
run: |
|
||||
echo "::group::delegations table state"
|
||||
psql -h localhost -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
|
||||
echo "::group::postgres container status"
|
||||
docker ps -a --filter "name=${PG_NAME}" --format '{{.Status}} {{.Names}}' || true
|
||||
docker logs "${PG_NAME}" 2>&1 | tail -50 || true
|
||||
echo "::endgroup::"
|
||||
echo "::group::delegations table state"
|
||||
psql -h "${PG_HOST}" -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
|
||||
echo "::endgroup::"
|
||||
|
||||
- if: always() && needs.detect-changes.outputs.handlers == 'true'
|
||||
name: Stop sibling Postgres
|
||||
working-directory: .
|
||||
run: |
|
||||
# always() so containers don't leak when migrations or tests
|
||||
# fail. The cleanup is best-effort: if the container is
|
||||
# already gone (e.g. concurrent rerun race), don't fail the job.
|
||||
docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
|
||||
echo "Cleaned up ${PG_NAME}"
|
||||
|
||||
|
||||
@ -7,6 +7,32 @@ export default defineConfig({
|
||||
test: {
|
||||
environment: 'node',
|
||||
exclude: ['e2e/**', 'node_modules/**', '**/dist/**'],
|
||||
// CI-conditional test timeout (issue #96).
|
||||
//
|
||||
// Vitest's 5000ms default is too tight for the first test in any
|
||||
// file under our CI shape: `npx vitest run --coverage` on the
|
||||
// self-hosted Gitea Actions Docker runner. The cold-start cost
|
||||
// (v8 coverage instrumentation init + JSDOM bootstrap + module-
|
||||
// graph import for @/components/* and @/lib/* + first React
|
||||
// render) consistently consumes 5-7 seconds for the first
|
||||
// synchronous test in heavyweight component files
|
||||
// (ActivityTab.test.tsx, CreateWorkspaceDialog.test.tsx,
|
||||
// ConfigTab.provider.test.tsx) — even though every subsequent
|
||||
// test in the same file completes in 100-1500ms.
|
||||
//
|
||||
// Empirically the worst observed first-test was 6453ms in a
|
||||
// single file (CreateWorkspaceDialog). 30000ms gives ~5x
|
||||
// headroom over that on CI; we still keep 5000ms locally so
|
||||
// genuine waitFor races / hung promises stay sensitive in dev.
|
||||
//
|
||||
// Same vitest pattern documented at:
|
||||
// https://vitest.dev/config/testtimeout
|
||||
// https://vitest.dev/guide/coverage#profiling-test-performance
|
||||
//
|
||||
// Per-test duration is still emitted to the CI log; if a test
|
||||
// ever silently approaches 25-30s under this raised ceiling that
|
||||
// will surface as a duration regression and we revisit.
|
||||
testTimeout: process.env.CI ? 30000 : 5000,
|
||||
// Coverage is instrumented but NOT yet a CI gate — first land
|
||||
// observability so we can see the baseline, then dial in
|
||||
// thresholds + a hard gate in a follow-up PR (#1815). Today's
|
||||
|
||||
137
docs/runbooks/handlers-postgres-integration-port-collision.md
Normal file
137
docs/runbooks/handlers-postgres-integration-port-collision.md
Normal file
@ -0,0 +1,137 @@
|
||||
# Runbook — Handlers Postgres Integration port-collision substrate
|
||||
|
||||
**Status:** Resolved 2026-05-08 (PR for class B Hongming-owned CICD red sweep).
|
||||
|
||||
## Symptom
|
||||
|
||||
`Handlers Postgres Integration` workflow fails on staging push and PRs.
|
||||
Step `Apply migrations to Postgres service` shows:
|
||||
|
||||
```
|
||||
psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
|
||||
```
|
||||
|
||||
Job-cleanup step further down logs:
|
||||
|
||||
```
|
||||
Cleaning up services for job Handlers Postgres Integration
|
||||
failed to remove container: Error response from daemon: No such container: <id>
|
||||
```
|
||||
|
||||
…confirming the postgres service container was already gone before
|
||||
cleanup ran.
|
||||
|
||||
## Root cause
|
||||
|
||||
Our Gitea act_runner (operator host `5.78.80.188`,
|
||||
`/opt/molecule/runners/config.yaml`) sets:
|
||||
|
||||
```yaml
|
||||
container:
|
||||
network: host
|
||||
```
|
||||
|
||||
…which act_runner applies to BOTH the job container AND every
|
||||
`services:` container in a workflow. Multiple workflow instances
|
||||
running concurrently across the 16 parallel runners each try to bind
|
||||
postgres on `0.0.0.0:5432`. The first wins; subsequent instances exit
|
||||
immediately with:
|
||||
|
||||
```
|
||||
LOG: could not bind IPv4 address "0.0.0.0": Address in use
|
||||
HINT: Is another postmaster already running on port 5432?
|
||||
FATAL: could not create any TCP/IP sockets
|
||||
```
|
||||
|
||||
act_runner sets `AutoRemove:true` on service containers, so Docker
|
||||
garbage-collects them as soon as they exit. By the time the migrations
|
||||
step runs `pg_isready` / `psql`, the container is gone and connection
|
||||
refused.
|
||||
|
||||
Reproduction (operator host):
|
||||
|
||||
```bash
|
||||
docker run --rm -d --name pg-A --network host \
|
||||
-e POSTGRES_PASSWORD=test postgres:15-alpine
|
||||
docker run -d --name pg-B --network host \
|
||||
-e POSTGRES_PASSWORD=test postgres:15-alpine
|
||||
docker logs pg-B # FATAL: could not create any TCP/IP sockets
|
||||
```
|
||||
|
||||
## Why per-job override doesn't work
|
||||
|
||||
The natural fix — per-job `container.network` override — is silently
|
||||
ignored by act_runner. The runner log emits:
|
||||
|
||||
```
|
||||
--network and --net in the options will be ignored.
|
||||
```
|
||||
|
||||
This is a documented act_runner constraint: container network is a
|
||||
runner-wide setting, not per-job. Source: gitea/act_runner config docs
|
||||
+ vegardit/docker-gitea-act-runner issue #7.
|
||||
|
||||
Flipping the global `container.network` to `bridge` would break every
|
||||
other workflow in the repo (cache server discovery,
|
||||
`molecule-monorepo-net` peer access during integration tests, etc.) —
|
||||
unacceptable blast radius for a per-test bug.
|
||||
|
||||
## Fix shape
|
||||
|
||||
`handlers-postgres-integration.yml` no longer uses `services: postgres:`.
|
||||
It launches a sibling postgres container manually on the existing
|
||||
`molecule-monorepo-net` bridge network with a per-run unique name:
|
||||
|
||||
```yaml
|
||||
env:
|
||||
PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
|
||||
PG_NETWORK: molecule-monorepo-net
|
||||
|
||||
steps:
|
||||
- name: Start sibling Postgres on bridge network
|
||||
run: |
|
||||
docker run -d --name "${PG_NAME}" --network "${PG_NETWORK}" \
|
||||
...
|
||||
postgres:15-alpine
|
||||
PG_HOST=$(docker inspect "${PG_NAME}" \
|
||||
--format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
|
||||
echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
|
||||
|
||||
# … migrations + tests use ${PG_HOST}, not 127.0.0.1 …
|
||||
|
||||
- if: always() && …
|
||||
name: Stop sibling Postgres
|
||||
run: docker rm -f "${PG_NAME}" || true
|
||||
```
|
||||
|
||||
The host-net job container can reach a bridge-net container via the
|
||||
bridge IP directly (verified manually, 2026-05-08). Two parallel runs
|
||||
use different names + different bridge IPs — no collision.
|
||||
|
||||
## Future-proofing
|
||||
|
||||
Other workflows that hit the same shape (any `services:` with a
|
||||
fixed-port image) will exhibit the same failure mode under
|
||||
host-network runner config. Translate using this same pattern:
|
||||
|
||||
1. Drop the `services:` block.
|
||||
2. Use `${{ github.run_id }}-${{ github.run_attempt }}` for unique
|
||||
container name.
|
||||
3. Launch on `molecule-monorepo-net` (already trusted bridge in
|
||||
`docker-compose.infra.yml`).
|
||||
4. Read back the bridge IP via `docker inspect` and export as a step env.
|
||||
5. `if: always()` cleanup step at the end.
|
||||
|
||||
If the count of such workflows grows, factor into a composite action
|
||||
(`./.github/actions/sibling-postgres`) so the substrate logic lives
|
||||
in one place.
|
||||
|
||||
## Related
|
||||
|
||||
- Issue #88 (closed by #92): localhost → 127.0.0.1 fix that unmasked
|
||||
this collision; the IPv6 fix is correct, port collision is the new
|
||||
layer.
|
||||
- Issue #94 created `molecule-monorepo-net` + `alpine:latest` as
|
||||
prereqs.
|
||||
- Saved memory `feedback_act_runner_github_server_url` documents
|
||||
another act_runner-vs-GHA divergence (server URL).
|
||||
Loading…
Reference in New Issue
Block a user