[bug] [test-infra] E2E API Smoke Test: brittle service-readiness windows + missing alpine:latest + missing Docker network #94

Closed
opened 2026-05-08 00:36:22 +00:00 by claude-ceo-assistant · 1 comment

Summary

The E2E API Smoke Test / E2E API Smoke Test job is brittle on the operator host runners: it fails with timing-dependent service-readiness errors, a missing alpine:latest image, and a missing Docker network.

Reproduces on baseline main and every migration PR; the failure is independent of code changes.

Symptoms

From recent runner logs (core#82 / e353b54a):

::error::Postgres did not become ready in 30s
::error::Redis did not become ready in 15s
::error::Platform did not become healthy in 30s
::error::Migrations did not apply
FAIL: Echo is online
Provisioner: warning — pre-write token to volume failed for ...:
  failed to create token-write container: Error response from daemon:
  No such image: alpine:latest
Provisioner: workspace start failed for ...:
  failed to set up container networking:
  network molecule-monorepo-net not found

Root causes (multiple, intertwined)

  1. Service readiness windows are too tight for shared-runner load. Postgres 30s + Redis 15s + Platform 30s are sensible on a quiet box; on the operator host where 8 act_runners + the molecule fleet share resources, services routinely take longer to come up under load.
  2. alpine:latest not pre-pulled. Provisioner's pre-write-token step uses alpine:latest opportunistically; if the runner's Docker image cache doesn't have it, the provision fails. The non-fatal warning ("token still injected via WriteFilesToContainer after start") covers some of this but not all paths.
  3. molecule-monorepo-net Docker network not present. Tracked separately as a follow-up to internal#71 (network rename / coexistence). The E2E job doesn't create the network as a setup step; it expects it to exist.
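
Causes 2 and 3 can be confirmed on a runner before the job starts. The following is a minimal sketch, not part of the existing workflow: `preflight_docker` is a hypothetical helper name, and the `DOCKER` override exists only so the check can be dry-run without a daemon. It relies on `docker image inspect` and `docker network inspect` exiting non-zero when the object is absent.

```shell
#!/usr/bin/env bash
# preflight_docker: hypothetical helper (not from the existing workflow).
# Usage: preflight_docker <image> <network>
# Prints one line per missing prerequisite and returns 1 if any is missing.
preflight_docker() {
  local docker_bin="${DOCKER:-docker}"   # overridable for dry runs (assumption)
  local image="$1"
  local network="$2"
  local missing=0

  # `docker image inspect` fails when the image is not in the local cache.
  if ! "$docker_bin" image inspect "$image" >/dev/null 2>&1; then
    echo "missing image: $image"
    missing=1
  fi

  # `docker network inspect` fails when the network does not exist.
  if ! "$docker_bin" network inspect "$network" >/dev/null 2>&1; then
    echo "missing network: $network"
    missing=1
  fi

  return "$missing"
}
```

On a runner this could be called as `preflight_docker alpine:latest molecule-monorepo-net`, which separates a host-setup problem from a slow-service problem before any readiness timeout is hit.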

Fix shape

  1. Bump readiness timeouts to load-tolerant values (e.g., Postgres 60s, Redis 30s, Platform 60s). Each timeout should be at least as long as the slowest startup observed for that service on a shared runner.
  2. Pre-pull required images as a job setup step:
    - run: docker pull alpine:latest
    
  3. Create the Docker network as a setup step in the workflow:
    - run: docker network create molecule-core-net 2>/dev/null || true
    
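    (Use molecule-core-net per the network-rename follow-up, OR keep molecule-monorepo-net until the rename lands. The name created here must match the one the provisioner looks up, which in the logs above is still molecule-monorepo-net.)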
  4. Add structured retry on service-ready probes with exponential backoff, not a fixed-window check.
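
The retry in item 4 could look roughly like the sketch below. `retry_with_backoff` is a hypothetical helper, not existing code. The 1s initial delay and 8s cap are illustrative assumptions, and the error line only mimics the `::error::` format seen in the runner logs.

```shell
#!/usr/bin/env bash
# retry_with_backoff: hypothetical helper (not from the existing workflow).
# Usage: retry_with_backoff <deadline_seconds> <command> [args...]
# Re-runs the command until it succeeds or the deadline elapses, doubling the
# sleep between attempts (1s, 2s, 4s, ...) up to a cap.
retry_with_backoff() {
  local deadline_s="$1"
  shift
  local delay=1        # initial sleep in seconds (illustrative value)
  local max_delay=8    # cap on a single sleep (illustrative value)
  local start now remaining
  start=$(date +%s)

  while true; do
    # Probe first, so a service that is already up costs no sleep at all.
    if "$@"; then
      return 0
    fi

    now=$(date +%s)
    remaining=$(( deadline_s - (now - start) ))
    if [ "$remaining" -le 0 ]; then
      echo "::error::'$*' did not succeed within ${deadline_s}s" >&2
      return 1
    fi

    # Never sleep past the overall deadline.
    if [ "$delay" -gt "$remaining" ]; then
      delay=$remaining
    fi
    sleep "$delay"

    # Exponential growth, clamped to max_delay.
    delay=$(( delay * 2 ))
    if [ "$delay" -gt "$max_delay" ]; then
      delay=$max_delay
    fi
  done
}
```

A probe would be passed as the command, for example `retry_with_backoff 60 pg_isready -h localhost -p 5432` or `retry_with_backoff 30 redis-cli ping`; the host and port here are assumptions about the job's service layout. Because the probe runs before any sleep, a fast runner pays no extra wait, while a loaded one gets the full deadline.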

Out of scope

  • internal#71 (Go module path migration) — surfaces this issue as a CI signal, but doesn't introduce it.
  • Re-architecting the E2E to use Testcontainers / similar — bigger change, separate RFC.

Class

Pre-existing CI infrastructure brittleness. NOT a regression of internal#71 or any migration PR.

Reporter

Discovered while watching CI on internal#71 migration sweep. 2026-05-08.

Author
Owner

Operator-side prereqs unblocked: alpine:latest pre-pulled (13.1MB) + Docker network molecule-monorepo-net created (bridge, subnet 172.27.0.0/16) on the runner host. The staged fix in this issue can now be merged.

Reference: molecule-ai/molecule-core#94