molecule-core/docs/runbooks/handlers-postgres-integration-port-collision.md
devops-engineer 241859b552
All checks were successful
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 7s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 12s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 14s
CI / Detect changes (pull_request) Successful in 17s
branch-protection drift check / Branch protection drift (pull_request) Successful in 19s
E2E API Smoke Test / detect-changes (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 20s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 17s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 17s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 18s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 16s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
CI / Canvas (Next.js) (pull_request) Successful in 9s
CI / Platform (Go) (pull_request) Successful in 10s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 10s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m47s
fix(ci): handlers-postgres — sidestep port collision under host-network runner
Class B Hongming-owned CICD red sweep. The Handlers Postgres Integration
workflow has been silently failing on staging push and PRs ever since
#92 fixed the IPv6 flake — the IPv6 fix correctly pinned 127.0.0.1, but
unmasked a deeper issue: with our act_runner global container.network=host
config, multiple concurrent runs of this workflow each tried to bind
0.0.0.0:5432 on the operator host. The first wins; subsequent postgres
service containers exit with `FATAL: could not create any TCP/IP sockets`
+ `Address in use`. Docker auto-removes them (act_runner sets
AutoRemove:true), so by the time `Apply migrations` runs `psql`, the
container is gone — Connection refused, then `failed to remove container:
No such container` at cleanup time.

Per-job container.network override is silently ignored by act_runner
(`--network and --net in the options will be ignored.`), so we sidestep
`services:` entirely. The job container still uses host-net (required
for cache server discovery on the operator's bridge IP). We launch a
sibling postgres on the existing molecule-monorepo-net bridge with a
unique name per run (run_id+run_attempt) and connect via the bridge IP
read from `docker inspect`.

Verified manually on operator host 2026-05-08: 2× postgres on host-net
collides, but on the bridge with unique names + different IPs, both
succeed and each is reachable from a host-net job container.

Adds:
- always()-cleanup step so containers don't leak on test failure
- Diagnostic dump now includes the postgres container's docker logs
- Runbook at docs/runbooks/ documenting the substrate behavior + the
  pattern future workflows should adopt for any `services:`-shaped need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:21:12 -07:00

138 lines
4.4 KiB
Markdown

# Runbook — Handlers Postgres Integration port-collision substrate
**Status:** Resolved 2026-05-08 (PR for class B Hongming-owned CICD red sweep).
## Symptom
`Handlers Postgres Integration` workflow fails on staging push and PRs.
Step `Apply migrations to Postgres service` shows:
```
psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
```
Job-cleanup step further down logs:
```
Cleaning up services for job Handlers Postgres Integration
failed to remove container: Error response from daemon: No such container: <id>
```
…confirming the postgres service container was already gone before
cleanup ran.
## Root cause
Our Gitea act_runner (operator host `5.78.80.188`,
`/opt/molecule/runners/config.yaml`) sets:
```yaml
container:
network: host
```
…which act_runner applies to BOTH the job container AND every
`services:` container in a workflow. Multiple workflow instances
running concurrently across the 16 parallel runners each try to bind
postgres on `0.0.0.0:5432`. The first wins; subsequent instances exit
immediately with:
```
LOG: could not bind IPv4 address "0.0.0.0": Address in use
HINT: Is another postmaster already running on port 5432?
FATAL: could not create any TCP/IP sockets
```
act_runner sets `AutoRemove:true` on service containers, so Docker
garbage-collects them as soon as they exit. By the time the migrations
step runs `pg_isready` / `psql`, the container is gone and connection
refused.
Reproduction (operator host):
```bash
docker run --rm -d --name pg-A --network host \
-e POSTGRES_PASSWORD=test postgres:15-alpine
docker run -d --name pg-B --network host \
-e POSTGRES_PASSWORD=test postgres:15-alpine
docker logs pg-B # FATAL: could not create any TCP/IP sockets
```
## Why per-job override doesn't work
The natural fix — per-job `container.network` override — is silently
ignored by act_runner. The runner log emits:
```
--network and --net in the options will be ignored.
```
This is a documented act_runner constraint: container network is a
runner-wide setting, not per-job. Source: gitea/act_runner config docs
+ vegardit/docker-gitea-act-runner issue #7.
Flipping the global `container.network` to `bridge` would break every
other workflow in the repo (cache server discovery,
`molecule-monorepo-net` peer access during integration tests, etc.) —
unacceptable blast radius for a per-test bug.
## Fix shape
`handlers-postgres-integration.yml` no longer uses `services: postgres:`.
It launches a sibling postgres container manually on the existing
`molecule-monorepo-net` bridge network with a per-run unique name:
```yaml
env:
PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
PG_NETWORK: molecule-monorepo-net
steps:
- name: Start sibling Postgres on bridge network
run: |
docker run -d --name "${PG_NAME}" --network "${PG_NETWORK}" \
...
postgres:15-alpine
PG_HOST=$(docker inspect "${PG_NAME}" \
--format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
# … migrations + tests use ${PG_HOST}, not 127.0.0.1 …
- if: always() && …
name: Stop sibling Postgres
run: docker rm -f "${PG_NAME}" || true
```
The host-net job container can reach a bridge-net container via the
bridge IP directly (verified manually, 2026-05-08). Two parallel runs
use different names + different bridge IPs — no collision.
## Future-proofing
Other workflows that hit the same shape (any `services:` with a
fixed-port image) will exhibit the same failure mode under
host-network runner config. Translate using this same pattern:
1. Drop the `services:` block.
2. Use `${{ github.run_id }}-${{ github.run_attempt }}` for unique
container name.
3. Launch on `molecule-monorepo-net` (already trusted bridge in
`docker-compose.infra.yml`).
4. Read back the bridge IP via `docker inspect` and export as a step env.
5. `if: always()` cleanup step at the end.
If the count of such workflows grows, factor into a composite action
(`./.github/actions/sibling-postgres`) so the substrate logic lives
in one place.
## Related
- Issue #88 (closed by #92): localhost → 127.0.0.1 fix that unmasked
this collision; the IPv6 fix is correct, port collision is the new
layer.
- Issue #94 created `molecule-monorepo-net` + `alpine:latest` as
prereqs.
- Saved memory `feedback_act_runner_github_server_url` documents
another act_runner-vs-GHA divergence (server URL).