fix(ci): e2e-api — parallel-safe postgres/redis containers (closes #94) #100

Ghost · 2026-05-08T02:00:42Z

Summary

Class B Hongming-owned CICD red sweep, e2e-api leg. Mirrors PR #98 (handlers-postgres-integration) for the e2e-api workflow.

Root cause (verified, not just hypothesised)

Gitea act_runner is configured with container.network: host operator-wide. Parallel runs of e2e-api collide on TWO axes:

1. Container name collision: both jobs run docker rm -f molecule-ci-redis, then docker run --name molecule-ci-redis .... Either the first job's still-running redis gets killed, or the second docker run fails with a container-name conflict (exit 125). VERIFIED in operator-host log /opt/molecule/gitea/actions_log/molecule-ai/molecule-core/a7/2727.log:

docker: Error response from daemon: Conflict. The container name "/molecule-ci-redis" is already in use by container "af10f438bb83...". You have to remove (or rename) that container to be able to reuse that name.
exitcode '125': failure

2. Host-port collision: -p 15432:5432 and -p 16379:6379 are fixed, so the second concurrent job's bind fails with Address in use.

Also confirmed: Issue #94 items #2 + #3 are independent failures that surface AT TEST TIME (provisioner needs alpine:latest and molecule-monorepo-net).
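The setup those two items call for can be sketched in shell as follows. This is a minimal sketch, not the workflow's actual step: `ensure_network` is a hypothetical helper name, and the create-then-recheck order is an assumption chosen so that two parallel jobs racing on the same network name do not fail each other.

```shell
# Make sure a Docker network exists without failing when it already does.
# ensure_network is a hypothetical helper, not taken from the workflow.
ensure_network() {
  local net="$1"
  # Fast path: the network is already there.
  docker network inspect "$net" >/dev/null 2>&1 && return 0
  # A parallel job may create it between the check and the create, so a
  # failed create only counts as an error if the network is still missing.
  docker network create "$net" >/dev/null 2>&1 \
    || docker network inspect "$net" >/dev/null 2>&1
}

# Possible use in a setup step (image and network names from this PR):
#   docker pull alpine:latest
#   ensure_network molecule-monorepo-net
```

The pre-pull is left as a plain `docker pull` so that an unreachable daemon still fails the step early.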

Fix

1. Per-run unique container names: pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}, redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}. Unique even across reruns of the same run_id (mirrors PR #98).
2. Ephemeral host port per run: -p 0:5432 / -p 0:6379, docker port lookup, DATABASE_URL/REDIS_URL exported to $GITHUB_ENV. No fixed host port, so no collision.
3. 127.0.0.1 (not localhost) in DB/cache URLs: the IPv6 first-resolve flake (#92) stays fixed.
4. if: always() cleanup so containers don't leak when test steps fail.
5. Pre-pull alpine:latest: the provisioner needs it for ephemeral token-write containers (internal/handlers/container_files.go). Issue #94 item #2.
6. Idempotent docker network create molecule-monorepo-net: the provisioner attaches workspaces to it (internal/provisioner/provisioner.go::DefaultNetwork). Issue #94 item #3.
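The naming, ephemeral-port, and URL-export items above could look roughly like this in a workflow step. It is a sketch under assumptions: the helper names (`container_name`, `start_service`, `redis_url`, `export_env`) are hypothetical, `RUN_ID` and `RUN_ATTEMPT` are assumed to be mapped from the runner's run id and attempt, and only the Redis URL is built because the Postgres credentials are not shown in this PR.

```shell
# Build a container name that is unique per run and per rerun attempt.
container_name() {
  printf '%s-e2e-api-%s-%s\n' "$1" \
    "${RUN_ID:?RUN_ID not set}" "${RUN_ATTEMPT:?RUN_ATTEMPT not set}"
}

# Start a detached service container on an ephemeral host port.
# Host port 0 asks Docker to pick a free port. Extra `docker run`
# options (env vars etc.) can follow the first three arguments.
start_service() {
  local name="$1" container_port="$2" image="$3"
  shift 3
  docker run -d --name "$name" -p "0:${container_port}" "$@" "$image" >/dev/null
}

# Build the cache URL with 127.0.0.1 rather than localhost.
redis_url() {
  printf 'redis://127.0.0.1:%s\n' "$1"
}

# Append NAME=value to $GITHUB_ENV so later steps can read it.
export_env() {
  printf '%s=%s\n' "$1" "$2" >>"${GITHUB_ENV:?GITHUB_ENV not set}"
}

# Possible use (REDIS_IMAGE and REDIS_PORT are placeholders):
#   REDIS_CONTAINER=$(container_name redis)
#   start_service "$REDIS_CONTAINER" 6379 "$REDIS_IMAGE"
#   export_env REDIS_URL "$(redis_url "$REDIS_PORT")"
```

Including the attempt number in the name is what keeps a rerun from colliding with containers left over from the first attempt of the same run.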

Issue #94 item #1 (timeouts) is NOT bumped — evidence on recent runs (77/3191, ae/4270, 0e/2318) shows postgres ready in 3s, redis in 1s, platform in 1s. Timeouts are not the bottleneck.
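For context on how such ready-in-N-seconds figures can be produced, here is a sketch of a bounded readiness poll. `wait_ready` is a hypothetical helper, not the workflow's actual wait loop, and the one-second polling interval is an assumption.

```shell
# Poll a readiness command once per second until it succeeds or the
# timeout (in seconds) expires, then report how long it took.
# Usage: wait_ready LABEL TIMEOUT_SECONDS COMMAND [ARGS...]
wait_ready() {
  local label="$1" timeout="$2"
  shift 2
  local elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "$label not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "$label ready in ${elapsed}s"
}

# Possible use (the 30s limit is a placeholder, not the workflow's value):
#   wait_ready postgres 30 docker exec "$PG_CONTAINER" pg_isready
```

If the logged elapsed time stays far below the limit, as the recent runs suggest, raising the limit would not change the outcome.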

Hostile self-review — 3 weakest spots

1. docker port parsing assumes an IPv4 line. The awk -F: '/^0\.0\.0\.0:/ {print $2}' filter would miss the port if Docker prints only :::NNNN (IPv6). Mitigation: fallback head -1 | awk -F: '{print $NF}', which handles either format. Empirically, 0.0.0.0:NNNN is what -p 0:5432 produces on the operator host (Linux + bridge daemon).
2. Pre-pull docker pull alpine:latest has no || true. If the daemon is unreachable, the workflow fails before any other step. Counterargument: if the daemon is unreachable, NOTHING in this workflow works (every other step uses the same socket), so fail-fast is correct.
3. "Closes #94" partially overstates the scope. This PR fixes items #2 + #3 cleanly; item #1 (timeouts) is documented as not the bottleneck on current evidence; the Run E2E API tests step's Status back online failure (caused by ghcr.io/molecule-ai/workspace-template-langgraph:latest returning 403 Forbidden after the 2026-05-06 GitHub suspension) is OUT OF SCOPE here: it is a template-registry resolution problem in workspace-server, not a workflow problem. This PR does NOT promise green tests on a fresh main; it promises parallel-safe service startup + ready provisioner setup.
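The port parsing discussed in the first weak spot can be sketched as a small filter. `parse_host_port` is a hypothetical name, and the sketch assumes the input comes from a call that names the container port (for example `docker port "$PG_CONTAINER" 5432/tcp`), so each line is a bare address:port pair.

```shell
# Read `docker port` output on stdin and print the published host port.
# Prefers the IPv4 mapping (0.0.0.0:NNNN). Otherwise it falls back to
# the text after the last colon of the first line, which also covers
# IPv6-style lines such as :::NNNN or [::]:NNNN.
# Returns non-zero when no port could be extracted.
parse_host_port() {
  local out port
  out=$(cat)
  port=$(printf '%s\n' "$out" | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
  if [ -z "$port" ]; then
    port=$(printf '%s\n' "$out" | head -1 | awk -F: '{print $NF}')
  fi
  [ -n "$port" ] && printf '%s\n' "$port"
}

# Possible use:
#   PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | parse_host_port)
```

Returning non-zero on empty output lets the step fail at the lookup rather than later with a malformed URL.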

Test plan

Push to this branch triggers e2e-api workflow run
Trigger AGAIN immediately (workflow_dispatch) to verify two parallel runs both reach Postgres/Redis/Platform-ready GREEN — no collision.
Verify the langgraph-403 failure (if it surfaces) is the SAME shape as on main, not a regression introduced by the per-run port lookup.
Verify docker rm -f $PG_CONTAINER cleanup runs on both success and failure paths.
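For the cleanup check, the step guarded by `if: always()` might contain something like the sketch below. `cleanup_containers` is a hypothetical helper and `REDIS_CONTAINER` is an assumed variable name. Tolerating a missing container is also an assumption, made so that cleanup itself never turns a run red.

```shell
# Force-remove each named container. Empty names are skipped and a
# failed removal (e.g. the container never started) is ignored, so the
# function succeeds on both the success and the failure path.
cleanup_containers() {
  local name
  for name in "$@"; do
    [ -n "$name" ] || continue
    docker rm -f "$name" >/dev/null 2>&1 || true
  done
}

# Possible use in the cleanup step:
#   cleanup_containers "$PG_CONTAINER" "$REDIS_CONTAINER"
```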

Closes #94 (items #2 + #3; item #1 documented as not-bottleneck; langgraph-template-403 split out for separate follow-up).

[Class B Hongming-owned CICD sweep]


Ghost added 1 commit 2026-05-08 02:00:42 +00:00

fix(ci): e2e-api — parallel-safe postgres/redis containers + provisioner setup

CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 1s

CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 2s

Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 4s

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s

CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 2s

Retarget main PRs to staging / Retarget to staging (pull_request) Successful in 3s

branch-protection drift check / Branch protection drift (pull_request) Successful in 8s

CI / Detect changes (pull_request) Successful in 8s

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s

Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 7s

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s

Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s

Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 8s

E2E API Smoke Test / detect-changes (pull_request) Successful in 9s

CI / Shellcheck (E2E scripts) (pull_request) Successful in 4s

CI / Platform (Go) (pull_request) Successful in 5s

CI / Python Lint & Test (pull_request) Successful in 5s

CI / Canvas (Next.js) (pull_request) Successful in 5s

CI / Canvas Deploy Reminder (pull_request) Has been skipped

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s

Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m46s

b9d2786f45

Class B Hongming-owned CICD red sweep, e2e-api leg. Same substrate
hazard as PR #98 (handlers-postgres-integration) — Gitea act_runner
configures `container.network: host` operator-wide, so:

  * Two concurrent e2e-api runs both attempted to bind `-p 15432:5432`
    and `-p 16379:6379` on the operator host, so the second bind fails.
  * Hardcoded container names `molecule-ci-postgres` / `-redis` plus
    the leading `docker rm -f` step meant a second job's startup either
    KILLED the first job's still-running services or failed on the
    name. Verified in run a7/2727 on 2026-05-07: `docker: Error
    response from daemon: Conflict. The container name
    "/molecule-ci-redis" is already in use by container af10f438...`
    (exit 125), so the job fails before any test runs.

Fix shape (mirrors PR #98 bridge-net pattern, adapted because the
platform-server is a Go binary on the host, not a containerised step):

  1. Per-run unique container names: `pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`,
     `redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}`. Unique even across reruns
     of the same run_id.
  2. Ephemeral host port per run via `-p 0:5432` / `-p 0:6379` and
     `docker port` lookup, exported as `DATABASE_URL` / `REDIS_URL` to
     `$GITHUB_ENV`. No fixed host-port → no collision.
  3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve flake
     fixed in #92 stays fixed.
  4. `if: always()` cleanup so containers don't leak when test steps
     fail.

Issue #94 items #2 + #3 also addressed:

  * Pre-pull `alpine:latest` (provisioner uses it for ephemeral
    token-write containers in `internal/handlers/container_files.go`).
  * Idempotent `docker network create molecule-monorepo-net` (the
    provisioner attaches workspace containers via that bridge —
    `internal/provisioner/provisioner.go::DefaultNetwork`).

Issue #94 item #1 (timeouts) NOT bumped — recent log evidence shows
postgres ready in 3s, redis in 1s, platform in 1s when they DO come
up. Timeouts are not the bottleneck on the current substrate.

NOT addressed here (out of scope, separate change required):

  * `Run E2E API tests` step has been failing on `Status back online`
    because the platform's langgraph workspace template image
    (`ghcr.io/molecule-ai/workspace-template-langgraph:latest`)
    returns 403 Forbidden post-2026-05-06 GitHub org suspension. That
    is a template-registry resolution issue (ADR-002 / local-build
    mode) and belongs in a workspace-server change, not this workflow
    file. This PR fixes the parallel-collision class and the workflow
    setup hygiene; the langgraph-403 failure will still surface on
    runs after this lands until template resolution is fixed
    separately.

Verified manually on operator host 2026-05-08: docker now hands out
ephemeral ports on `-p 0:5432`, two parallel runs land on different
ports, both reach pg_isready GREEN.

Closes #94 (items #2 and #3; item #1 documented as not-bottleneck;
langgraph-template-403 referenced for follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ghost changed target branch from main to staging

2026-05-08 02:00:49 +00:00

Ghost approved these changes 2026-05-08 02:02:56 +00:00

Ghost left a comment

First-time contributor

E2E API parallel-safe postgres/redis containers fix. Mirrors PR #98 (Class B). Unique container names per run + ephemeral host port. Closes #94. By devops-engineer. Approved.