molecule-core/docs/runbooks/handlers-postgres-integration-port-collision.md
tech-debt: rename molecule-monorepo-net -> molecule-core-net
Renames Docker network across all code, configs, scripts, and docs.

Per issue #93: the network was named molecule-monorepo-net as a holdover
from when the repo was called molecule-monorepo. The canonical repo name is
now molecule-core, so the network should be molecule-core-net.

Files changed:
- docker-compose.yml, docker-compose.infra.yml: network definition
- infra/scripts/setup.sh: docker network create
- scripts/nuke-and-rebuild.sh: docker network rm
- workspace-server/internal/provisioner/provisioner.go: DefaultNetwork
- All comments/docs: updated wording

Acceptance: grep -rn 'molecule-monorepo-net' returns zero matches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 20:51:48 +00:00

Runbook — Handlers Postgres Integration port-collision substrate

Status: Resolved 2026-05-08 (PR for class B Hongming-owned CICD red sweep).

Symptom

The Handlers Postgres Integration workflow fails on pushes to staging and on PRs. The step "Apply migrations to Postgres service" shows:

psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused

The job-cleanup step further down logs:

Cleaning up services for job Handlers Postgres Integration
failed to remove container: Error response from daemon: No such container: <id>

…confirming the postgres service container was already gone before cleanup ran.

Root cause

Our Gitea act_runner (operator host 5.78.80.188, /opt/molecule/runners/config.yaml) sets:

container:
  network: host

…which act_runner applies to BOTH the job container AND every services: container in a workflow. Multiple workflow instances running concurrently across the 16 parallel runners each try to bind postgres on 0.0.0.0:5432. The first wins; subsequent instances exit immediately with:

LOG:  could not bind IPv4 address "0.0.0.0": Address in use
HINT: Is another postmaster already running on port 5432?
FATAL: could not create any TCP/IP sockets

act_runner sets AutoRemove:true on service containers, so Docker garbage-collects them as soon as they exit. By the time the migrations step runs pg_isready / psql, the container is gone and the connection is refused.

Reproduction (operator host):

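# pg-A binds 0.0.0.0:5432 in the host network namespace.
docker run --rm -d --name pg-A --network host \
  -e POSTGRES_PASSWORD=test postgres:15-alpine
# pg-B omits --rm so its logs survive after it exits.
docker run -d --name pg-B --network host \
  -e POSTGRES_PASSWORD=test postgres:15-alpine
docker logs pg-B   # FATAL: could not create any TCP/IP sockets
docker rm -f pg-A pg-B   # clean up both containers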
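
Because the job container shares the host network namespace under network: host, a plain TCP connect shows whether something already listens on 5432. The following is a minimal sketch, not part of the original fix; port_in_use is a hypothetical helper name, and it assumes bash (it relies on bash's /dev/tcp pseudo-device, which plain sh does not provide).

```shell
#!/usr/bin/env bash
# port_in_use HOST PORT  (hypothetical helper, not from the workflow)
# Returns 0 if a TCP connect to HOST:PORT succeeds, non-zero otherwise.
port_in_use() {
  local host="$1" port="$2"
  # The subshell confines fd 3, so the socket is closed as soon as the
  # check finishes. Connection errors are discarded.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}
```

For example, `port_in_use 127.0.0.1 5432 && echo "5432 already has a listener"` run on the operator host would report a listener that another run (or anything else on the host) has already bound.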

Why per-job override doesn't work

The natural fix, a per-job container.network override, is ignored by act_runner without failing the job. The only trace is a line in the runner log:

--network and --net in the options will be ignored.

This is a documented act_runner constraint: container network is a runner-wide setting, not per-job. Sources: gitea/act_runner config docs; vegardit/docker-gitea-act-runner issue #7.

Flipping the global container.network to bridge would break every other workflow in the repo (cache server discovery, molecule-core-net peer access during integration tests, etc.) — unacceptable blast radius for a per-test bug.

Fix shape

handlers-postgres-integration.yml no longer uses services: postgres:. It launches a sibling postgres container manually on the existing molecule-core-net bridge network with a per-run unique name:

env:
  PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
  PG_NETWORK: molecule-core-net

steps:
  - name: Start sibling Postgres on bridge network
    run: |
      docker run -d --name "${PG_NAME}" --network "${PG_NETWORK}" \
        ...
        postgres:15-alpine
      PG_HOST=$(docker inspect "${PG_NAME}" \
        --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
      echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"

  # … migrations + tests use ${PG_HOST}, not 127.0.0.1 …

  - if: always() && …
    name: Stop sibling Postgres
    run: docker rm -f "${PG_NAME}" || true

The host-net job container can reach a bridge-net container via the bridge IP directly (verified manually, 2026-05-08). Two parallel runs use different names + different bridge IPs — no collision.
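
Since the sibling container is started detached, Postgres may not be accepting connections yet when the next step runs. One way to bridge that gap is a small retry loop around pg_isready. This is a sketch under assumptions: retry is a hypothetical helper, and the attempt count and delay in the usage line are illustrative, not values taken from the workflow.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

A step could then call `retry 30 1 pg_isready -q -h "${PG_HOST}" -p 5432` before applying migrations, failing the step if Postgres never becomes ready.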

Future-proofing

Other workflows that hit the same shape (any services: with a fixed-port image) will exhibit the same failure mode under host-network runner config. Translate using this same pattern:

  1. Drop the services: block.
  2. Use ${{ github.run_id }}-${{ github.run_attempt }} to build a unique container name.
  3. Launch on molecule-core-net (already a trusted bridge in docker-compose.infra.yml).
  4. Read back the bridge IP via docker inspect and export it through $GITHUB_ENV for later steps.
  5. Add an if: always() cleanup step at the end.
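
The steps above can be sketched as a handful of shell helpers. The function names (sibling_name, start_sibling, sibling_ip, stop_sibling) are hypothetical and do not exist in the repo; the docker invocations mirror the ones already used in handlers-postgres-integration.yml.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating steps 2-5; names are not from the repo.

# Step 2: build a unique container name from the run id and attempt.
sibling_name() {
  local prefix="$1" run_id="$2" run_attempt="$3"
  printf '%s-%s-%s\n' "$prefix" "$run_id" "$run_attempt"
}

# Step 3: start a detached container on the given bridge network.
# Remaining arguments (options, image) are passed through to docker run.
start_sibling() {
  local name="$1" network="$2"
  shift 2
  docker run -d --name "$name" --network "$network" "$@"
}

# Step 4: print the container's IP address on the given network.
sibling_ip() {
  local name="$1" network="$2"
  docker inspect "$name" \
    --format "{{(index .NetworkSettings.Networks \"${network}\").IPAddress}}"
}

# Step 5: remove the container; never fail the cleanup step.
stop_sibling() {
  docker rm -f "$1" || true
}
```

In a workflow step these would be called with "${PG_NAME}" and "${PG_NETWORK}", followed by the image and whatever options the workflow already passes to docker run. The output of sibling_ip is what gets appended to "$GITHUB_ENV" as PG_HOST.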

If the count of such workflows grows, factor into a composite action (./.github/actions/sibling-postgres) so the substrate logic lives in one place.
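
To gauge how many workflows are candidates, a rough audit can list the files that still declare a services: block. This is a heuristic sketch: list_service_workflows is a hypothetical helper, the .github/workflows path in the usage line is an assumption about where the workflows live, and the pattern matches any indented services: key, so hits should be reviewed by hand.

```shell
#!/usr/bin/env bash
# list_service_workflows DIR  (hypothetical helper)
# Prints, in sorted order, the .yml/.yaml files under DIR that contain
# an indented "services:" key on its own line.
list_service_workflows() {
  local dir="$1"
  find "$dir" -type f \( -name '*.yml' -o -name '*.yaml' \) \
    -exec grep -lE '^[[:space:]]+services:[[:space:]]*(#.*)?$' {} + |
    sort
}
```

Running `list_service_workflows .github/workflows | wc -l` from the repo root gives a count to weigh against the cost of building the composite action.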

Related

  • Issue #88 (closed by #92): localhost → 127.0.0.1 fix that unmasked this collision. The IPv6 fix is correct; the port collision is the next layer it exposed.
  • Issue #94 created molecule-core-net + alpine:latest as prereqs.
  • Saved memory feedback_act_runner_github_server_url documents another act_runner-vs-GHA divergence (server URL).