From 570f456436521a7414a7146405b99f851630a50f Mon Sep 17 00:00:00 2001
From: devops-engineer <devops-engineer@molecule.ai>
Date: Fri, 8 May 2026 02:19:01 +0000
Subject: [PATCH] fix(ci): port 3 verified-green CI fixes from staging to main
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Files copied from staging tip (a4ab623b):
- canvas/vitest.config.ts          (vitest testTimeout 5000→30000ms on CI; was PR #97 unblocking 4 canvas-test PRs)
- .github/workflows/handlers-postgres-integration.yml  (parallel-safe pg containers; was PR #98 unblocking #84)
- .github/workflows/e2e-api.yml    (parallel-safe pg+redis; was PR #100 unblocking #84 + #99)
- docs/runbooks/handlers-postgres-integration-port-collision.md (new — substrate runbook)

Why a separate cherry-pick PR rather than promote-staging-to-main: PR #99 (full
staging→main promote) hit a Platform (Go) sqlmock regression in some other
staging commit (under Phase 1 investigation by sister agent a283d938). To
unblock prod with the verified-green CI fixes WITHOUT carrying the Go-test
regression, port these 3 workflow/config files (plus the runbook) surgically.

Verified clean: workflow-YAML + vitest config + runbook only — zero Go code
touched, so the Platform (Go) failure on PR #99 cannot apply here.

Co-authored-by: Claude (orchestrator)
---
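Reviewer note (not part of the commit): the Postgres and Redis steps in
e2e-api.yml both repeat the same host-port lookup. A minimal sketch of how it
could be factored into one function, assuming the `docker port` output is piped
in on stdin. The name `parse_host_port` is hypothetical and does not exist in
the repo.

```shell
# Hypothetical helper (assumption, not in the patch): print the host port
# found in `docker port <container> <port>/tcp` output read from stdin.
parse_host_port() {
  local out port
  out=$(cat)
  # Prefer the first IPv4 wildcard binding, e.g. "0.0.0.0:49153".
  port=$(printf '%s\n' "$out" | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
  if [ -z "$port" ]; then
    # Fallback: last colon-separated field of the first line, which also
    # covers IPv6-style lines such as "[::]:49153".
    port=$(printf '%s\n' "$out" | head -n 1 | awk -F: '{print $NF}')
  fi
  # Succeed only when a non-empty, all-digit port was found.
  case "$port" in
    ''|*[!0-9]*) return 1 ;;
  esac
  printf '%s\n' "$port"
}
```

A step could then call it as
`PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | parse_host_port) || exit 1`,
which keeps the IPv4-first preference and the fallback in one place.
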
 .github/workflows/e2e-api.yml                 | 130 +++++++++++++++-
 .../handlers-postgres-integration.yml         | 141 ++++++++++++++----
 canvas/vitest.config.ts                       |  26 ++++
 ...ers-postgres-integration-port-collision.md | 137 +++++++++++++++++
 4 files changed, 397 insertions(+), 37 deletions(-)
 create mode 100644 docs/runbooks/handlers-postgres-integration-port-collision.md
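
Reviewer note (not part of the commit): the "zero Go code touched" claim can be
checked mechanically against a list of changed paths such as the diffstat
above. A minimal sketch; `no_go_files_touched` is a hypothetical helper, and
counting go.mod / go.sum as Go-relevant is an assumption of this sketch.

```shell
# Hypothetical helper (assumption, not in the patch): read changed file
# paths on stdin, one per line, and succeed only if none of them is a Go
# source file or Go module metadata (go.mod / go.sum).
no_go_files_touched() {
  ! grep -Eq '(\.go|(^|/)go\.(mod|sum))$'
}
```

For a patch file, one possible invocation is
`git apply --numstat the.patch | cut -f3 | no_go_files_touched`, where
`the.patch` stands in for this file. Renamed paths may need extra handling.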

diff --git a/.github/workflows/e2e-api.yml b/.github/workflows/e2e-api.yml
index 782cbedc..da7dbcd3 100644
--- a/.github/workflows/e2e-api.yml
+++ b/.github/workflows/e2e-api.yml
@@ -12,6 +12,59 @@ name: E2E API Smoke Test
 # spending CI cycles. See the in-job comment on the `e2e-api` job for
 # why this is one job (not two-jobs-sharing-name) and the 2026-04-29
 # PR #2264 incident that drove the consolidation.
+#
+# Parallel-safety (Class B Hongming-owned CICD red sweep, 2026-05-08)
+# -------------------------------------------------------------------
+# Same substrate hazard as PR #98 (handlers-postgres-integration). Our
+# Gitea act_runner runs with `container.network: host` (operator host
+# `/opt/molecule/runners/config.yaml`), which means:
+#
+#   * Two concurrent runs both try to bind their `-p 15432:5432` /
+#     `-p 16379:6379` host ports — the second postgres/redis FATALs
+#     with `Address in use` and `docker run` returns exit 125 with
+#     `Conflict. The container name "/molecule-ci-postgres" is already
+#     in use by container ...`. Verified in run a7/2727 on 2026-05-07.
+#   * The fixed container names `molecule-ci-postgres` / `-redis` (the
+#     pre-fix shape) collide on name AS WELL AS port. The cleanup-with-
+#     `docker rm -f` at the start of the second job KILLS the first
+#     job's still-running postgres/redis.
+#
+# Fix shape (mirrors PR #98's bridge-net pattern, adapted because
+# platform-server is a Go binary on the host, not a containerised
+# step):
+#
+#   1. Unique container names per run:
+#         pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
+#         redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
+#      `${RUN_ID}-${RUN_ATTEMPT}` is unique even across reruns of the
+#      same run_id.
+#   2. Ephemeral host port per run (`-p 0:5432`), then read the actual
+#      bound port via `docker port` and export DATABASE_URL/REDIS_URL
+#      pointing at it. No fixed host-port → no port collision.
+#   3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve was
+#      the original flake fixed in #92 and the script's still IPv6-
+#      enabled.
+#   4. `if: always()` cleanup so containers don't leak when test steps
+#      fail.
+#
+# Issue #94 items #2 + #3 (also fixed here):
+#   * Pre-pull `alpine:latest` so the platform-server's provisioner
+#     (`internal/handlers/container_files.go`) can stand up its
+#     ephemeral token-write helper without a registry pull mid-test.
+#   * Create `molecule-monorepo-net` bridge network if missing so the
+#     provisioner's container.HostConfig {NetworkMode: ...} attach
+#     succeeds.
+# Item #1 (timeouts) — evidence on recent runs (77/3191, ae/4270, 0e/
+# 2318) shows Postgres ready in 3s, Redis in 1s, Platform in 1s when
+# they DO come up. Timeouts are not the bottleneck; not bumped.
+#
+# Item explicitly NOT fixed here: the `Status back online` test
+# fails because the platform's langgraph workspace template image
+# (ghcr.io/molecule-ai/workspace-template-langgraph:latest) returns
+# 403 Forbidden post-2026-05-06 GitHub org suspension. That is a
+# template-registry resolution issue (ADR-002 / local-build mode) and
+# belongs in a separate change that touches workspace-server, not
+# this workflow file.
 
 on:
   push:
@@ -78,11 +131,14 @@ jobs:
     runs-on: ubuntu-latest
     timeout-minutes: 15
     env:
-      DATABASE_URL: postgres://dev:dev@localhost:15432/molecule?sslmode=disable
-      REDIS_URL: redis://localhost:16379
+      # Unique per-run container names so concurrent runs on the host-
+      # network act_runner don't collide on name OR port.
+      # `${RUN_ID}-${RUN_ATTEMPT}` stays unique across reruns of the
+      # same run_id. DATABASE_URL/REDIS_URL are exported later (after
+      # the `docker port` lookup) since Docker assigns the host ports.
+      PG_CONTAINER: pg-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
+      REDIS_CONTAINER: redis-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
       PORT: "8080"
-      PG_CONTAINER: molecule-ci-postgres
-      REDIS_CONTAINER: molecule-ci-redis
     steps:
       - name: No-op pass (paths filter excluded this commit)
         if: needs.detect-changes.outputs.api != 'true'
@@ -97,11 +153,53 @@ jobs:
           go-version: 'stable'
           cache: true
           cache-dependency-path: workspace-server/go.sum
+      - name: Pre-pull alpine + ensure provisioner network (Issue #94 items #2 + #3)
+        if: needs.detect-changes.outputs.api == 'true'
+        run: |
+          # Provisioner uses alpine:latest for ephemeral token-write
+          # containers (workspace-server/internal/handlers/container_files.go).
+          # Pre-pull so the first provision in test_api.sh doesn't race
+          # the daemon's pull cache. Safe to repeat: `docker pull` fetches
+          # nothing when the local image is already up to date.
+          docker pull alpine:latest >/dev/null
+          # Provisioner attaches workspace containers to
+          # molecule-monorepo-net (workspace-server/internal/provisioner/
+          # provisioner.go::DefaultNetwork). The bridge already exists on
+          # the operator host's docker daemon, so `network create` errors
+          # there; `|| true` swallows that and keeps the step idempotent.
+          docker network create molecule-monorepo-net >/dev/null 2>&1 || true
+          echo "alpine:latest pre-pulled; molecule-monorepo-net ensured."
       - name: Start Postgres (docker)
         if: needs.detect-changes.outputs.api == 'true'
         run: |
+          # Defensive cleanup — only matches THIS run's container name,
+          # so it cannot kill a sibling run's postgres. (Pre-fix the
+          # name was static and this rm hit other runs' containers.)
           docker rm -f "$PG_CONTAINER" 2>/dev/null || true
-          docker run -d --name "$PG_CONTAINER" -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule -p 15432:5432 postgres:16
+          # `-p 0:5432` requests an ephemeral host port; we read it back
+          # below and export DATABASE_URL.
+          docker run -d --name "$PG_CONTAINER" \
+            -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \
+            -p 0:5432 postgres:16 >/dev/null
+          # Resolve the host-side port assignment. `docker port` prints
+          # `0.0.0.0:NNNN` (and on host-net runners may also print an
+          # IPv6 line — take the first IPv4 line).
+          PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
+          if [ -z "$PG_PORT" ]; then
+            # Fallback: any first line. Some Docker versions print only
+            # one line.
+            PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}')
+          fi
+          if [ -z "$PG_PORT" ]; then
+            echo "::error::Could not resolve host port for $PG_CONTAINER"
+            docker port "$PG_CONTAINER" 5432/tcp || true
+            docker logs "$PG_CONTAINER" || true
+            exit 1
+          fi
+          # 127.0.0.1 (NOT localhost) — IPv6 first-resolve flake (#92).
+          echo "PG_PORT=${PG_PORT}" >> "$GITHUB_ENV"
+          echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
+          echo "Postgres host port: ${PG_PORT}"
           for i in $(seq 1 30); do
             if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then
               echo "Postgres ready after ${i}s"
@@ -116,7 +214,20 @@ jobs:
         if: needs.detect-changes.outputs.api == 'true'
         run: |
           docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
-          docker run -d --name "$REDIS_CONTAINER" -p 16379:6379 redis:7
+          docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null
+          REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
+          if [ -z "$REDIS_PORT" ]; then
+            REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}')
+          fi
+          if [ -z "$REDIS_PORT" ]; then
+            echo "::error::Could not resolve host port for $REDIS_CONTAINER"
+            docker port "$REDIS_CONTAINER" 6379/tcp || true
+            docker logs "$REDIS_CONTAINER" || true
+            exit 1
+          fi
+          echo "REDIS_PORT=${REDIS_PORT}" >> "$GITHUB_ENV"
+          echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV"
+          echo "Redis host port: ${REDIS_PORT}"
           for i in $(seq 1 15); do
             if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then
               echo "Redis ready after ${i}s"
@@ -135,13 +246,15 @@ jobs:
         if: needs.detect-changes.outputs.api == 'true'
         working-directory: workspace-server
         run: |
+          # DATABASE_URL + REDIS_URL exported by the start-postgres /
+          # start-redis steps point at this run's per-run host ports.
           ./platform-server > platform.log 2>&1 &
           echo $! > platform.pid
       - name: Wait for /health
         if: needs.detect-changes.outputs.api == 'true'
         run: |
           for i in $(seq 1 30); do
-            if curl -sf http://localhost:8080/health > /dev/null; then
+            if curl -sf http://127.0.0.1:8080/health > /dev/null; then
               echo "Platform up after ${i}s"
               exit 0
             fi
@@ -185,6 +298,9 @@ jobs:
             kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
           fi
       - name: Stop service containers
+        # always() so containers don't leak when test steps fail. The
+        # cleanup is best-effort: if the container is already gone
+        # (e.g. concurrent rerun race), don't fail the job.
         if: always() && needs.detect-changes.outputs.api == 'true'
         run: |
           docker rm -f "$PG_CONTAINER" 2>/dev/null || true
diff --git a/.github/workflows/handlers-postgres-integration.yml b/.github/workflows/handlers-postgres-integration.yml
index 98927ac9..05216b59 100644
--- a/.github/workflows/handlers-postgres-integration.yml
+++ b/.github/workflows/handlers-postgres-integration.yml
@@ -14,12 +14,42 @@ name: Handlers Postgres Integration
 # self-review caught it took 2 minutes to set up and would have caught
 # the bug at PR-time.
 #
-# This job spins a Postgres service container, applies the migration,
-# and runs `go test -tags=integration` against a live DB. Required
-# check on staging branch protection — backend handler PRs cannot
-# merge without a real-DB regression gate.
+# Why this workflow does NOT use `services: postgres:` (Class B fix)
+# ------------------------------------------------------------------
+# Our act_runner config has `container.network: host` (operator host
+# /opt/molecule/runners/config.yaml), which act_runner applies to BOTH
+# the job container AND every service container. With host-net, two
+# concurrent runs of this workflow both try to bind 0.0.0.0:5432 — the
+# second postgres FATALs with `could not create any TCP/IP sockets:
+# Address in use`, and Docker auto-removes it (act_runner sets
+# AutoRemove:true on service containers). By the time the migrations
+# step runs `psql`, the postgres container is gone, hence
+# `Connection refused` then `failed to remove container: No such
+# container` at cleanup time.
 #
-# Cost: ~30s job (postgres pull from GH cache + go build + 4 tests).
+# Per-job `container.network` override is silently ignored by
+# act_runner — `--network and --net in the options will be ignored.`
+# appears in the runner log. Documented constraint.
+#
+# So we sidestep `services:` entirely. The job container still uses
+# host-net (inherited from runner config; required for cache server
+# discovery on the bridge IP 172.18.0.17:42631). We launch a sibling
+# postgres on the existing `molecule-monorepo-net` bridge with a
+# UNIQUE name per run — `pg-handlers-${RUN_ID}-${RUN_ATTEMPT}` — and
+# read its bridge IP via `docker inspect`. A host-net job container
+# can reach a bridge-net container directly via the bridge IP (verified
+# manually on operator host 2026-05-08).
+#
+# Trade-offs vs. the original `services:` shape:
+#   + No host-port collision; N parallel runs share the bridge cleanly
+#   + `if: always()` cleanup runs even on test-step failure
+#   - Two more steps in the workflow (start + cleanup)
+#   - Requires `molecule-monorepo-net` to exist on the operator host
+#     (it does; declared in docker-compose.yml + docker-compose.infra.yml)
+#
+# Class B Hongming-owned CICD red sweep, 2026-05-08.
+#
+# Cost: ~30s job (postgres pull from cache + go build + 4 tests).
 
 on:
   push:
@@ -59,20 +89,14 @@ jobs:
     name: Handlers Postgres Integration
     needs: detect-changes
     runs-on: ubuntu-latest
-    services:
-      postgres:
-        image: postgres:15-alpine
-        env:
-          POSTGRES_PASSWORD: test
-          POSTGRES_DB: molecule
-        ports:
-          - 5432:5432
-        # GHA spins this with --health-cmd built in for postgres images.
-        options: >-
-          --health-cmd pg_isready
-          --health-interval 5s
-          --health-timeout 5s
-          --health-retries 10
+    env:
+      # Unique name per run so concurrent jobs don't collide on the
+      # bridge network. ${RUN_ID}-${RUN_ATTEMPT} is unique even across
+      # workflow_dispatch reruns of the same run_id.
+      PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
+      # Bridge network already exists on the operator host (declared
+      # in docker-compose.yml + docker-compose.infra.yml).
+      PG_NETWORK: molecule-monorepo-net
     defaults:
       run:
         working-directory: workspace-server
@@ -89,16 +113,57 @@ jobs:
         with:
           go-version: 'stable'
 
+      - if: needs.detect-changes.outputs.handlers == 'true'
+        name: Start sibling Postgres on bridge network
+        working-directory: .
+        run: |
+          # Sanity: the bridge network must exist on the operator host.
+          # Hard-fail loud if it doesn't — easier to spot than a silent
+          # auto-create that diverges from the rest of the stack.
+          if ! docker network inspect "${PG_NETWORK}" >/dev/null 2>&1; then
+            echo "::error::Bridge network '${PG_NETWORK}' missing on operator host. Re-run docker-compose.infra.yml or check ops handbook."
+            exit 1
+          fi
+
+          # PG_NAME is unique per run attempt, so this is normally a
+          # no-op; it only guards against a leftover with the same name.
+          docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
+
+          docker run -d \
+            --name "${PG_NAME}" \
+            --network "${PG_NETWORK}" \
+            --health-cmd "pg_isready -U postgres" \
+            --health-interval 5s \
+            --health-timeout 5s \
+            --health-retries 10 \
+            -e POSTGRES_PASSWORD=test \
+            -e POSTGRES_DB=molecule \
+            postgres:15-alpine >/dev/null
+
+          # Read back the bridge IP. Always present immediately after
+          # `docker run -d` for bridge networks.
+          PG_HOST=$(docker inspect "${PG_NAME}" \
+            --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
+          if [ -z "${PG_HOST}" ]; then
+            echo "::error::Could not resolve PG_HOST for ${PG_NAME} on ${PG_NETWORK}"
+            docker logs "${PG_NAME}" || true
+            exit 1
+          fi
+          echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
+          echo "INTEGRATION_DB_URL=postgres://postgres:test@${PG_HOST}:5432/molecule?sslmode=disable" >> "$GITHUB_ENV"
+          echo "Started ${PG_NAME} at ${PG_HOST}:5432"
+
       - if: needs.detect-changes.outputs.handlers == 'true'
         name: Apply migrations to Postgres service
         env:
           PGPASSWORD: test
         run: |
-          # Wait for postgres to actually accept connections (the
-          # GHA --health-cmd is best-effort but psql can still race).
+          # Wait for postgres to actually accept connections. Docker's
+          # health-cmd handles container-side readiness, but the wire
+          # to the bridge IP is best-tested with pg_isready directly.
           for i in {1..15}; do
-            if pg_isready -h localhost -p 5432 -U postgres -q; then break; fi
-            echo "waiting for postgres..."; sleep 2
+            if pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q; then break; fi
+            echo "waiting for postgres at ${PG_HOST}:5432..."; sleep 2
           done
 
           # Apply every .up.sql in lexicographic order with
@@ -131,7 +196,7 @@ jobs:
           # not fine once a cross-table atomicity test came in.
           set +e
           for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do
-            if psql -h localhost -U postgres -d molecule -v ON_ERROR_STOP=1 \
+            if psql -h "${PG_HOST}" -U postgres -d molecule -v ON_ERROR_STOP=1 \
                   -f "$migration" >/dev/null 2>&1; then
               echo "✓ $(basename "$migration")"
             else
@@ -145,7 +210,7 @@ jobs:
           # fail if any didn't land — that would be a real regression we
           # want loud.
           for tbl in delegations workspaces activity_logs pending_uploads; do
-            if ! psql -h localhost -U postgres -d molecule -tA \
+            if ! psql -h "${PG_HOST}" -U postgres -d molecule -tA \
                 -c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \
                 | grep -q 1; then
               echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless"
@@ -156,16 +221,32 @@ jobs:
 
       - if: needs.detect-changes.outputs.handlers == 'true'
         name: Run integration tests
-        env:
-          INTEGRATION_DB_URL: postgres://postgres:test@localhost:5432/molecule?sslmode=disable
         run: |
+          # INTEGRATION_DB_URL is exported by the start-postgres step;
+          # points at the per-run bridge IP, not 127.0.0.1, so concurrent
+          # workflow runs don't fight over a host-net 5432 port.
           go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_"
 
-      - if: needs.detect-changes.outputs.handlers == 'true' && failure()
+      - if: failure() && needs.detect-changes.outputs.handlers == 'true'
         name: Diagnostic dump on failure
         env:
           PGPASSWORD: test
         run: |
-          echo "::group::delegations table state"
-          psql -h localhost -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
+          echo "::group::postgres container status"
+          docker ps -a --filter "name=${PG_NAME}" --format '{{.Status}} {{.Names}}' || true
+          docker logs "${PG_NAME}" 2>&1 | tail -50 || true
           echo "::endgroup::"
+          echo "::group::delegations table state"
+          psql -h "${PG_HOST}" -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
+          echo "::endgroup::"
+
+      - if: always() && needs.detect-changes.outputs.handlers == 'true'
+        name: Stop sibling Postgres
+        working-directory: .
+        run: |
+          # always() so containers don't leak when migrations or tests
+          # fail. The cleanup is best-effort: if the container is
+          # already gone (e.g. concurrent rerun race), don't fail the job.
+          docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
+          echo "Cleaned up ${PG_NAME}"
+
diff --git a/canvas/vitest.config.ts b/canvas/vitest.config.ts
index 15fb4195..0d290378 100644
--- a/canvas/vitest.config.ts
+++ b/canvas/vitest.config.ts
@@ -7,6 +7,32 @@ export default defineConfig({
   test: {
     environment: 'node',
     exclude: ['e2e/**', 'node_modules/**', '**/dist/**'],
+    // CI-conditional test timeout (issue #96).
+    //
+    // Vitest's 5000ms default is too tight for the first test in any
+    // file under our CI shape: `npx vitest run --coverage` on the
+    // self-hosted Gitea Actions Docker runner. The cold-start cost
+    // (v8 coverage instrumentation init + JSDOM bootstrap + module-
+    // graph import for @/components/* and @/lib/* + first React
+    // render) consistently consumes 5-7 seconds for the first
+    // synchronous test in heavyweight component files
+    // (ActivityTab.test.tsx, CreateWorkspaceDialog.test.tsx,
+    // ConfigTab.provider.test.tsx) — even though every subsequent
+    // test in the same file completes in 100-1500ms.
+    //
+    // Empirically the worst observed first-test was 6453ms in a
+    // single file (CreateWorkspaceDialog). 30000ms gives ~4.6x
+    // headroom over that on CI; we still keep 5000ms locally so
+    // genuine waitFor races / hung promises stay sensitive in dev.
+    //
+    // Same vitest pattern documented at:
+    //   https://vitest.dev/config/testtimeout
+    //   https://vitest.dev/guide/coverage#profiling-test-performance
+    //
+    // Per-test duration is still emitted to the CI log; if a test
+    // ever silently approaches 25-30s under this raised ceiling that
+    // will surface as a duration regression and we revisit.
+    testTimeout: process.env.CI ? 30000 : 5000,
     // Coverage is instrumented but NOT yet a CI gate — first land
     // observability so we can see the baseline, then dial in
     // thresholds + a hard gate in a follow-up PR (#1815). Today's
diff --git a/docs/runbooks/handlers-postgres-integration-port-collision.md b/docs/runbooks/handlers-postgres-integration-port-collision.md
new file mode 100644
index 00000000..0b9df483
--- /dev/null
+++ b/docs/runbooks/handlers-postgres-integration-port-collision.md
@@ -0,0 +1,137 @@
+# Runbook — Handlers Postgres Integration port-collision substrate
+
+**Status:** Resolved 2026-05-08 (PR for class B Hongming-owned CICD red sweep).
+
+## Symptom
+
+`Handlers Postgres Integration` workflow fails on staging push and PRs.
+Step `Apply migrations to Postgres service` shows:
+
+```
+psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
+```
+
+Job-cleanup step further down logs:
+
+```
+Cleaning up services for job Handlers Postgres Integration
+failed to remove container: Error response from daemon: No such container: <id>
+```
+
+…confirming the postgres service container was already gone before
+cleanup ran.
+
+## Root cause
+
+Our Gitea act_runner (operator host `5.78.80.188`,
+`/opt/molecule/runners/config.yaml`) sets:
+
+```yaml
+container:
+  network: host
+```
+
+…which act_runner applies to BOTH the job container AND every
+`services:` container in a workflow. Multiple workflow instances
+running concurrently across the 16 parallel runners each try to bind
+postgres on `0.0.0.0:5432`. The first wins; subsequent instances exit
+immediately with:
+
+```
+LOG:  could not bind IPv4 address "0.0.0.0": Address in use
+HINT: Is another postmaster already running on port 5432?
+FATAL: could not create any TCP/IP sockets
+```
+
+act_runner sets `AutoRemove:true` on service containers, so Docker
+garbage-collects them as soon as they exit. By the time the migrations
+step runs `pg_isready` / `psql`, the container is gone and the
+connection is refused.
+
+Reproduction (operator host):
+
+```bash
+docker run --rm -d --name pg-A --network host \
+  -e POSTGRES_PASSWORD=test postgres:15-alpine
+docker run -d --name pg-B --network host \
+  -e POSTGRES_PASSWORD=test postgres:15-alpine
+docker logs pg-B   # FATAL: could not create any TCP/IP sockets
+```
+
+## Why per-job override doesn't work
+
+The natural fix — per-job `container.network` override — is silently
+ignored by act_runner. The runner log emits:
+
+```
+--network and --net in the options will be ignored.
+```
+
+This is a documented act_runner constraint: container network is a
+runner-wide setting, not per-job. Source: gitea/act_runner config docs
++ vegardit/docker-gitea-act-runner issue #7.
+
+Flipping the global `container.network` to `bridge` would break every
+other workflow in the repo (cache server discovery,
+`molecule-monorepo-net` peer access during integration tests, etc.) —
+unacceptable blast radius for a per-test bug.
+
+## Fix shape
+
+`handlers-postgres-integration.yml` no longer uses `services: postgres:`.
+It launches a sibling postgres container manually on the existing
+`molecule-monorepo-net` bridge network with a per-run unique name:
+
+```yaml
+env:
+  PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
+  PG_NETWORK: molecule-monorepo-net
+
+steps:
+  - name: Start sibling Postgres on bridge network
+    run: |
+      docker run -d --name "${PG_NAME}" --network "${PG_NETWORK}" \
+        ...
+        postgres:15-alpine
+      PG_HOST=$(docker inspect "${PG_NAME}" \
+        --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
+      echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
+
+  # … migrations + tests use ${PG_HOST}, not 127.0.0.1 …
+
+  - if: always() && …
+    name: Stop sibling Postgres
+    run: docker rm -f "${PG_NAME}" || true
+```
+
+The host-net job container can reach a bridge-net container via the
+bridge IP directly (verified manually, 2026-05-08). Two parallel runs
+use different names + different bridge IPs — no collision.
+
+## Future-proofing
+
+Other workflows that hit the same shape (any `services:` with a
+fixed-port image) will exhibit the same failure mode under
+host-network runner config. Translate using this same pattern:
+
+1. Drop the `services:` block.
+2. Use `${{ github.run_id }}-${{ github.run_attempt }}` for unique
+   container name.
+3. Launch on `molecule-monorepo-net` (already trusted bridge in
+   `docker-compose.infra.yml`).
+4. Read back the bridge IP via `docker inspect` and export it via `$GITHUB_ENV`.
+5. `if: always()` cleanup step at the end.
+
+If the count of such workflows grows, factor into a composite action
+(`./.github/actions/sibling-postgres`) so the substrate logic lives
+in one place.
+
+## Related
+
+- Issue #88 (closed by #92): localhost → 127.0.0.1 fix that unmasked
+  this collision; the IPv6 fix is correct, port collision is the new
+  layer.
+- Issue #94 (items #2 + #3): `molecule-monorepo-net` and a pre-pulled
+  `alpine:latest` are prereqs for the e2e-api provisioner steps.
+- Saved memory `feedback_act_runner_github_server_url` documents
+  another act_runner-vs-GHA divergence (server URL).