From 241859b5529b6657c8c0c7a52bd411eb28679e26 Mon Sep 17 00:00:00 2001
From: devops-engineer
Date: Thu, 7 May 2026 18:21:12 -0700
Subject: [PATCH] =?UTF-8?q?fix(ci):=20handlers-postgres=20=E2=80=94=20side?=
 =?UTF-8?q?step=20port=20collision=20under=20host-network=20runner?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Class B Hongming-owned CICD red sweep.

The Handlers Postgres Integration workflow has been silently failing on
staging push and PRs ever since #92 fixed the IPv6 flake. That fix
correctly pinned 127.0.0.1, but unmasked a deeper issue: with our
act_runner global `container.network: host` config, multiple concurrent
runs of this workflow each tried to bind 0.0.0.0:5432 on the operator
host. The first wins; subsequent postgres service containers exit with
`FATAL: could not create any TCP/IP sockets` plus `Address already in
use`. Docker auto-removes them (act_runner sets AutoRemove:true), so by
the time `Apply migrations` runs `psql`, the container is gone:
Connection refused, then `failed to remove container: No such
container` at cleanup time.

A per-job container.network override is silently ignored by act_runner
(`--network and --net in the options will be ignored.`), so we sidestep
`services:` entirely. The job container still uses host-net (required
for cache server discovery on the operator's bridge IP). We launch a
sibling postgres on the existing molecule-monorepo-net bridge with a
unique name per run (run_id + run_attempt) and connect via the bridge
IP read from `docker inspect`.

Verified manually on the operator host 2026-05-08: two postgres
containers on host-net collide, but on the bridge with unique names and
distinct IPs both succeed, and each is reachable from a host-net job
container.
Adds:
- always()-cleanup step so containers don't leak on test failure
- Diagnostic dump now includes the postgres container's docker logs
- Runbook at docs/runbooks/ documenting the substrate behavior plus the
  pattern future workflows should adopt for any `services:`-shaped need

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../handlers-postgres-integration.yml              | 147 +++++++++++++-----
 ...ers-postgres-integration-port-collision.md      | 137 ++++++++++++++++
 2 files changed, 247 insertions(+), 37 deletions(-)
 create mode 100644 docs/runbooks/handlers-postgres-integration-port-collision.md

diff --git a/.github/workflows/handlers-postgres-integration.yml b/.github/workflows/handlers-postgres-integration.yml
index 41f00b83..ae03e6d5 100644
--- a/.github/workflows/handlers-postgres-integration.yml
+++ b/.github/workflows/handlers-postgres-integration.yml
@@ -14,12 +14,42 @@ name: Handlers Postgres Integration
 # self-review caught it took 2 minutes to set up and would have caught
 # the bug at PR-time.
 #
-# This job spins a Postgres service container, applies the migration,
-# and runs `go test -tags=integration` against a live DB. Required
-# check on staging branch protection — backend handler PRs cannot
-# merge without a real-DB regression gate.
+# Why this workflow does NOT use `services: postgres:` (Class B fix)
+# ------------------------------------------------------------------
+# Our act_runner config has `container.network: host` (operator host
+# /opt/molecule/runners/config.yaml), which act_runner applies to BOTH
+# the job container AND every service container. With host-net, two
+# concurrent runs of this workflow both try to bind 0.0.0.0:5432 — the
+# second postgres FATALs with `could not create any TCP/IP sockets:
+# Address already in use`, and Docker auto-removes it (act_runner sets
+# AutoRemove:true on service containers).
+# By the time the migrations step runs `psql`, the postgres container
+# is gone, hence `Connection refused` then `failed to remove
+# container: No such container` at cleanup time.
 #
-# Cost: ~30s job (postgres pull from GH cache + go build + 4 tests).
+# Per-job `container.network` override is silently ignored by
+# act_runner — `--network and --net in the options will be ignored.`
+# appears in the runner log. Documented constraint.
+#
+# So we sidestep `services:` entirely. The job container still uses
+# host-net (inherited from runner config; required for cache server
+# discovery on the bridge IP 172.18.0.17:42631). We launch a sibling
+# postgres on the existing `molecule-monorepo-net` bridge with a
+# UNIQUE name per run — `pg-handlers-${RUN_ID}-${RUN_ATTEMPT}` — and
+# read its bridge IP via `docker inspect`. A host-net job container
+# can reach a bridge-net container directly via the bridge IP (verified
+# manually on operator host 2026-05-08).
+#
+# Trade-offs vs. the original `services:` shape:
+# + No host-port collision; N parallel runs share the bridge cleanly
+# + `if: always()` cleanup runs even on test-step failure
+# - Two extra steps in the workflow (start + cleanup)
+# - Requires `molecule-monorepo-net` to exist on the operator host
+#   (it does; declared in docker-compose.yml + docker-compose.infra.yml)
+#
+# Class B Hongming-owned CICD red sweep, 2026-05-08.
+#
+# Cost: ~30s job (postgres pull from cache + go build + 4 tests).

 on:
   push:
@@ -59,20 +89,14 @@ jobs:
     name: Handlers Postgres Integration
     needs: detect-changes
     runs-on: ubuntu-latest
-    services:
-      postgres:
-        image: postgres:15-alpine
-        env:
-          POSTGRES_PASSWORD: test
-          POSTGRES_DB: molecule
-        ports:
-          - 5432:5432
-        # GHA spins this with --health-cmd built in for postgres images.
-        options: >-
-          --health-cmd pg_isready
-          --health-interval 5s
-          --health-timeout 5s
-          --health-retries 10
+    env:
+      # Unique name per run so concurrent jobs don't collide on the
+      # bridge network.
+      # ${RUN_ID}-${RUN_ATTEMPT} is unique even across
+      # workflow_dispatch reruns of the same run_id.
+      PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
+      # Bridge network already exists on the operator host (declared
+      # in docker-compose.yml + docker-compose.infra.yml).
+      PG_NETWORK: molecule-monorepo-net
     defaults:
       run:
         working-directory: workspace-server
@@ -89,16 +113,57 @@ jobs:
         with:
           go-version: 'stable'

+      - if: needs.detect-changes.outputs.handlers == 'true'
+        name: Start sibling Postgres on bridge network
+        working-directory: .
+        run: |
+          # Sanity: the bridge network must exist on the operator host.
+          # Hard-fail loud if it doesn't — easier to spot than a silent
+          # auto-create that diverges from the rest of the stack.
+          if ! docker network inspect "${PG_NETWORK}" >/dev/null 2>&1; then
+            echo "::error::Bridge network '${PG_NETWORK}' missing on operator host. Re-run docker-compose.infra.yml or check ops handbook."
+            exit 1
+          fi
+
+          # If a stale container with the same name exists (rerun on
+          # the same run_id), wipe it first.
+          docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
+
+          docker run -d \
+            --name "${PG_NAME}" \
+            --network "${PG_NETWORK}" \
+            --health-cmd "pg_isready -U postgres" \
+            --health-interval 5s \
+            --health-timeout 5s \
+            --health-retries 10 \
+            -e POSTGRES_PASSWORD=test \
+            -e POSTGRES_DB=molecule \
+            postgres:15-alpine >/dev/null
+
+          # Read back the bridge IP. Always present immediately after
+          # `docker run -d` for bridge networks.
+          PG_HOST=$(docker inspect "${PG_NAME}" \
+            --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
+          if [ -z "${PG_HOST}" ]; then
+            echo "::error::Could not resolve PG_HOST for ${PG_NAME} on ${PG_NETWORK}"
+            docker logs "${PG_NAME}" || true
+            exit 1
+          fi
+          echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
+          echo "INTEGRATION_DB_URL=postgres://postgres:test@${PG_HOST}:5432/molecule?sslmode=disable" >> "$GITHUB_ENV"
+          echo "Started ${PG_NAME} at ${PG_HOST}:5432"
+
       - if: needs.detect-changes.outputs.handlers == 'true'
         name: Apply migrations to Postgres service
         env:
           PGPASSWORD: test
         run: |
-          # Wait for postgres to actually accept connections (the
-          # GHA --health-cmd is best-effort but psql can still race).
+          # Wait for postgres to actually accept connections. Docker's
+          # health-cmd handles container-side readiness, but the wire
+          # to the bridge IP is best tested with pg_isready directly.
           for i in {1..15}; do
-            if pg_isready -h 127.0.0.1 -p 5432 -U postgres -q; then break; fi
-            echo "waiting for postgres..."; sleep 2
+            if pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q; then break; fi
+            echo "waiting for postgres at ${PG_HOST}:5432..."; sleep 2
           done

           # Apply every .up.sql in lexicographic order with
@@ -131,7 +196,7 @@ jobs:
           # not fine once a cross-table atomicity test came in.
           set +e
           for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do
-            if psql -h 127.0.0.1 -U postgres -d molecule -v ON_ERROR_STOP=1 \
+            if psql -h "${PG_HOST}" -U postgres -d molecule -v ON_ERROR_STOP=1 \
               -f "$migration" >/dev/null 2>&1; then
               echo "✓ $(basename "$migration")"
             else
@@ -145,7 +210,7 @@ jobs:
           # fail if any didn't land — that would be a real regression we
           # want loud.
           for tbl in delegations workspaces activity_logs pending_uploads; do
-            if ! psql -h 127.0.0.1 -U postgres -d molecule -tA \
+            if ! psql -h "${PG_HOST}" -U postgres -d molecule -tA \
               -c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \
               | grep -q 1; then
               echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless"
@@ -156,23 +221,31 @@ jobs:

       - if: needs.detect-changes.outputs.handlers == 'true'
         name: Run integration tests
-        env:
-          # 127.0.0.1, NOT localhost. On Gitea / act_runner the runner host
-          # has IPv6 enabled, so `localhost` resolves to `::1` first, and
-          # the Postgres service container only listens on IPv4 → lib/pq's
-          # first dial hits ECONNREFUSED. The migration step uses psql -h
-          # localhost which falls back to IPv4 cleanly, so the flake hides
-          # there and surfaces only at test time. Pinning IPv4 makes the
-          # whole job deterministic. (Issue #88, item 3.)
-          INTEGRATION_DB_URL: postgres://postgres:test@127.0.0.1:5432/molecule?sslmode=disable
         run: |
+          # INTEGRATION_DB_URL is exported by the start-postgres step; it
+          # points at the per-run bridge IP, not 127.0.0.1, so concurrent
+          # workflow runs don't fight over a host-net 5432 port.
           go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_"

-      - if: needs.detect-changes.outputs.handlers == 'true' && failure()
+      - if: failure() && needs.detect-changes.outputs.handlers == 'true'
         name: Diagnostic dump on failure
         env:
           PGPASSWORD: test
         run: |
-          echo "::group::delegations table state"
-          psql -h 127.0.0.1 -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
+          echo "::group::postgres container status"
+          docker ps -a --filter "name=${PG_NAME}" --format '{{.Status}} {{.Names}}' || true
+          docker logs "${PG_NAME}" 2>&1 | tail -50 || true
           echo "::endgroup::"
+          echo "::group::delegations table state"
+          psql -h "${PG_HOST}" -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
+          echo "::endgroup::"
+
+      - if: always() && needs.detect-changes.outputs.handlers == 'true'
+        name: Stop sibling Postgres
+        working-directory: .
+        run: |
+          # always() so containers don't leak when migrations or tests
+          # fail. The cleanup is best-effort: if the container is
+          # already gone (e.g. a concurrent rerun race), don't fail the job.
+          docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
+          echo "Cleaned up ${PG_NAME}"
diff --git a/docs/runbooks/handlers-postgres-integration-port-collision.md b/docs/runbooks/handlers-postgres-integration-port-collision.md
new file mode 100644
index 00000000..0b9df483
--- /dev/null
+++ b/docs/runbooks/handlers-postgres-integration-port-collision.md
@@ -0,0 +1,137 @@
+# Runbook — Handlers Postgres Integration port-collision substrate
+
+**Status:** Resolved 2026-05-08 (PR for the Class B Hongming-owned CICD red sweep).
+
+## Symptom
+
+The `Handlers Postgres Integration` workflow fails on staging push and PRs.
+Step `Apply migrations to Postgres service` shows:
+
+```
+psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
+```
+
+The job-cleanup step further down logs:
+
+```
+Cleaning up services for job Handlers Postgres Integration
+failed to remove container: Error response from daemon: No such container:
+```
+
+…confirming the postgres service container was already gone before
+cleanup ran.
+
+## Root cause
+
+Our Gitea act_runner (operator host `5.78.80.188`,
+`/opt/molecule/runners/config.yaml`) sets:
+
+```yaml
+container:
+  network: host
+```
+
+…which act_runner applies to BOTH the job container AND every
+`services:` container in a workflow. Multiple workflow instances
+running concurrently across the 16 parallel runners each try to bind
+postgres on `0.0.0.0:5432`. The first wins; subsequent instances exit
+immediately with:
+
+```
+LOG: could not bind IPv4 address "0.0.0.0": Address already in use
+HINT: Is another postmaster already running on port 5432?
+FATAL: could not create any TCP/IP sockets
+```
+
+act_runner sets `AutoRemove:true` on service containers, so Docker
+garbage-collects them as soon as they exit.
+By the time the migrations step runs `pg_isready` / `psql`, the
+container is gone and the connection is refused.
+
+Reproduction (operator host):
+
+```bash
+docker run --rm -d --name pg-A --network host \
+  -e POSTGRES_PASSWORD=test postgres:15-alpine
+docker run -d --name pg-B --network host \
+  -e POSTGRES_PASSWORD=test postgres:15-alpine
+docker logs pg-B  # FATAL: could not create any TCP/IP sockets
+```
+
+## Why per-job override doesn't work
+
+The natural fix — a per-job `container.network` override — is silently
+ignored by act_runner. The runner log emits:
+
+```
+--network and --net in the options will be ignored.
+```
+
+This is a documented act_runner constraint: container network is a
+runner-wide setting, not a per-job one. Source: gitea/act_runner config
+docs plus vegardit/docker-gitea-act-runner issue #7.
+
+Flipping the global `container.network` to `bridge` would break every
+other workflow in the repo (cache server discovery,
+`molecule-monorepo-net` peer access during integration tests, etc.) —
+an unacceptable blast radius for a per-test bug.
+
+## Fix shape
+
+`handlers-postgres-integration.yml` no longer uses `services: postgres:`.
+It launches a sibling postgres container manually on the existing
+`molecule-monorepo-net` bridge network with a per-run unique name:
+
+```yaml
+env:
+  PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
+  PG_NETWORK: molecule-monorepo-net
+
+steps:
+  - name: Start sibling Postgres on bridge network
+    run: |
+      docker run -d --name "${PG_NAME}" --network "${PG_NETWORK}" \
+        ...
+        postgres:15-alpine
+      PG_HOST=$(docker inspect "${PG_NAME}" \
+        --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
+      echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
+
+  # … migrations + tests use ${PG_HOST}, not 127.0.0.1 …
+
+  - if: always() && …
+    name: Stop sibling Postgres
+    run: docker rm -f "${PG_NAME}" || true
+```
+
+The host-net job container can reach a bridge-net container directly
+via the bridge IP (verified manually, 2026-05-08). Two parallel runs
+use different names and different bridge IPs — no collision.
+
+## Future-proofing
+
+Other workflows that hit the same shape (any `services:` block with a
+fixed-port image) will exhibit the same failure mode under the
+host-network runner config. Translate them using this same pattern:
+
+1. Drop the `services:` block.
+2. Use `${{ github.run_id }}-${{ github.run_attempt }}` for a unique
+   container name.
+3. Launch on `molecule-monorepo-net` (already a trusted bridge in
+   `docker-compose.infra.yml`).
+4. Read back the bridge IP via `docker inspect` and export it as a
+   step env.
+5. Add an `if: always()` cleanup step at the end.
+
+If the count of such workflows grows, factor the pattern into a
+composite action (`./.github/actions/sibling-postgres`) so the
+substrate logic lives in one place.
+
+## Related
+
+- Issue #88 (closed by #92): the localhost → 127.0.0.1 fix that
+  unmasked this collision; the IPv6 fix is correct, the port collision
+  is the new layer.
+- Issue #94 created `molecule-monorepo-net` + `alpine:latest` as
+  prereqs.
+- Saved memory `feedback_act_runner_github_server_url` documents
+  another act_runner-vs-GHA divergence (server URL).
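+## Appendix: composite-action sketch
+
+If the composite action floated under Future-proofing gets built, it
+could look roughly like this. This is a sketch, not an existing file:
+the `action.yml` path, input names, and defaults are all illustrative,
+and the caller still owns the `if: always()` cleanup step (composite
+actions cannot register post-job cleanup).
+
+```yaml
+# .github/actions/sibling-postgres/action.yml (hypothetical)
+name: sibling-postgres
+description: Start a per-run sibling Postgres on an existing bridge network.
+inputs:
+  image:
+    description: Postgres image to run
+    default: postgres:15-alpine
+  network:
+    description: Existing bridge network on the operator host
+    default: molecule-monorepo-net
+outputs:
+  pg-host:
+    description: Bridge IP of the started container
+    value: ${{ steps.start.outputs.pg-host }}
+runs:
+  using: composite
+  steps:
+    - id: start
+      shell: bash
+      run: |
+        # Same unique-name scheme as the workflow: run_id + run_attempt.
+        NAME="pg-sibling-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}"
+        docker rm -f "${NAME}" >/dev/null 2>&1 || true
+        docker run -d --name "${NAME}" --network "${{ inputs.network }}" \
+          -e POSTGRES_PASSWORD=test -e POSTGRES_DB=molecule \
+          "${{ inputs.image }}" >/dev/null
+        # Read back the bridge IP and export it for later steps.
+        PG_HOST=$(docker inspect "${NAME}" \
+          --format "{{(index .NetworkSettings.Networks \"${{ inputs.network }}\").IPAddress}}")
+        echo "pg-host=${PG_HOST}" >> "$GITHUB_OUTPUT"
+        echo "PG_NAME=${NAME}" >> "$GITHUB_ENV"
+        echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
+```
+
+A caller would replace its start step with
+`uses: ./.github/actions/sibling-postgres`, then read
+`${{ steps.<id>.outputs.pg-host }}` (or the exported `PG_HOST`) exactly
+as the current workflow does.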