fix(ci): wait for platform /health on a migration-chain-proof budget (#2205) #2206

Merged
claude-ceo-assistant merged 1 commit from fix/e2e-api-health-wait-migration-chain into main 2026-06-04 05:32:29 +00:00

What

Fixes the E2E API Smoke Test REQUIRED branch-protection gate going red on main (RCA in #2205), plus the three sibling local-platform E2E workflows that share the identical brittle pattern.

Root cause (#2205)

The platform was started in the background, and the workflow then waited for /health with a fixed 30×1s loop. The platform binds /health only after applying the full migration chain on cold start, and that chain now takes longer than 30s (the run log reaches 20260523000000_schedule_consecutive_sdk_errors.up.sql before printing Platform starting on :PORT). The 30s budget expired before the server was reachable → downstream E2E assertions never ran → red. A fixed budget is brittle by construction: the migration chain keeps growing.

Fix — deterministic, not a bigger magic number

For each platform /health wait (e2e-api, e2e-chat, e2e-peer-visibility, e2e-legacy-advisory):

  • Poll /health on a generous, clearly-commented 180s wall-clock budget that comfortably exceeds cold-start + full-migration time and stays robust as the chain grows. A 200 from /health is the real readiness signal (migrations done + server listening).
  • Fast-fail + loud on a genuinely dead platform: if the backgrounded platform-server PID has exited (e.g. a broken migration crashed it), stop immediately and dump the platform log — never mask a real startup failure, never wait out the full budget for a process that's already gone.
  • On true timeout, dump the platform log tail and fail with ::error::.
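
The three rules above can be sketched as a single bash helper. This is a minimal illustration, not the code merged in this PR: the names wait_for_health and probe_health, the argument order, and the 100-line log tail are all assumptions made for the example. The probe relies on curl's -f exit status instead of a -w '%{http_code}' capture.

```shell
#!/usr/bin/env bash
# Sketch only: function names, argument order, and tail length are hypothetical.

# One readiness probe. curl -f exits non-zero on HTTP >= 400 or on a
# connection error, so no status-code capture is needed.
probe_health() {
  curl -fsS -o /dev/null --max-time 2 "$1" 2>/dev/null
}

# wait_for_health PID URL BUDGET_SECONDS LOG_FILE
# Returns 0 once the probe succeeds; returns 1 if the process has exited
# or the wall-clock budget is spent.
wait_for_health() {
  local pid="$1" url="$2" budget="$3" log="$4"
  local start=$SECONDS
  local deadline=$(( start + budget ))

  while (( SECONDS < deadline )); do
    if probe_health "$url"; then
      echo "platform healthy after $(( SECONDS - start ))s"
      return 0
    fi
    # kill -0 delivers no signal; it only reports whether the PID still exists.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "::error::platform process ${pid} exited before /health was ready"
      cat "$log" 2>/dev/null      # dead process: dump the whole log
      return 1
    fi
    sleep 1
  done

  echo "::error::/health not ready within ${budget}s"
  tail -n 100 "$log" 2>/dev/null  # true timeout: dump the log tail
  return 1
}
```

In a workflow step, the helper would be called right after backgrounding the server, passing `$!` as the PID, the /health URL, 180 as the budget, and the file the server's output was redirected to. The liveness check runs on every iteration, so a crashed platform is reported on the first poll and the budget only matters for a process that is alive but not yet listening.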

The unrelated Postgres-readiness seq 1 30 waits (not gated on the migration chain) are intentionally left unchanged.

Why it's robust + still fails loud

  • Migrations take 45s → old 30s loop failed; new logic keeps polling, sees /health 200, proceeds. (Verified by simulation: server binding late → success.)
  • Platform never binds (dead process) → PID-liveness check trips at once, dumps log, exits 1. (Verified: 0s fast-fail with log dump, not a 180s hang.)
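
The scenarios above can be traced with a toy model that uses a simulated clock in place of real processes. This is not the simulation the author ran; the function name outcome and its arguments are hypothetical, and one loop tick stands for one 1s poll. The 45s, 30s, and 180s figures come from the description.

```shell
#!/usr/bin/env bash
# Toy model (hypothetical): no real server, no sleeps, one tick = one 1s poll.

# outcome READY_AT BUDGET [DIES_AT]
#   READY_AT  second at which /health would first return 200
#   BUDGET    seconds the wait loop is allowed to poll
#   DIES_AT   second at which the process exits (omit if it stays alive)
outcome() {
  local ready_at="$1" budget="$2" dies_at="${3:--1}" t
  for (( t = 0; t < budget; t++ )); do
    if (( dies_at >= 0 && t >= dies_at )); then
      echo "dead at ${t}s"      # liveness check trips, no further waiting
      return
    fi
    if (( t >= ready_at )); then
      echo "ready at ${t}s"     # /health answered inside the budget
      return
    fi
  done
  echo "timeout after ${budget}s"
}

outcome 45 30      # old loop, 45s migrations   → timeout after 30s
outcome 45 180     # new budget, 45s migrations → ready at 45s
outcome 45 180 0   # process crashed at startup → dead at 0s
```

The model shows why the larger budget does not slow down the failure path: the budget bounds only the case where the process stays alive without ever becoming ready.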

Lints

  • lint-curl-status-capture: passes — curl usage avoids the -w '%{http_code}' capture shape.
  • lint-workflow-yaml: passes (56 files, 0 warnings). YAML parses; bash shellcheck-clean.

Closes #2205

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-04 04:48:26 +00:00
fix(ci): wait for platform /health on a migration-chain-proof budget (#2205)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 2s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 10s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 8s
E2E Chat / detect-changes (pull_request) Successful in 9s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 3s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 15s
CI / Detect changes (pull_request) Successful in 16s
gate-check-v3 / gate-check (pull_request_target) Successful in 5s
security-review / approved (pull_request_target) Failing after 6s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 3s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-tier-check / tier-check (pull_request_target) Successful in 3s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 24s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 23s
qa-review / approved (pull_request_target) Failing after 21s
E2E Chat / E2E Chat (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
CI / Platform (Go) (pull_request) Successful in 6s
CI / all-required (pull_request) Successful in 13s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m0s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m6s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 59s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m7s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 1m22s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 4s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m13s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m9s
audit-force-merge / audit (pull_request_target) Successful in 3s
382a894f53
The `E2E API Smoke Test` REQUIRED gate (and the sibling local-platform E2E
workflows) started the platform in the background and waited for /health with
a fixed 30×1s loop (~30s). The platform binds /health only AFTER applying the
FULL migration chain on cold start; that chain now reaches past the 30s window
(the run log gets to 20260523000000_schedule_consecutive_sdk_errors.up.sql
before "Platform starting on :PORT"), so the health loop expired before the
server was reachable → downstream E2E never ran → main went red. A fixed budget
is brittle by construction because the migration chain grows every release.

Fix (deterministic, not a bigger magic number):
- Poll /health on a generous, clearly-commented wall-clock budget (180s) that
  comfortably exceeds cold-start + full-migration time and is robust to the
  chain continuing to grow. /health returning 200 is the real readiness signal
  (migrations done + server listening).
- Still fail fast + loud on a genuinely dead platform: if the backgrounded
  platform-server PID has exited (e.g. a broken migration crashed it), stop
  immediately and dump the platform log — we never mask a real startup failure,
  and we never wait out the full budget for a process that is already gone.
- On true timeout, dump the platform log tail and fail with ::error::.

Applied identically to the four workflows sharing the 30×1s platform-/health
pattern: e2e-api, e2e-chat, e2e-peer-visibility, e2e-legacy-advisory. The
unrelated Postgres-readiness `seq 1 30` waits (which are not gated on the
migration chain) are intentionally left unchanged.

curl usage avoids the -w '%{http_code}' status-capture shape, so
lint-curl-status-capture passes; lint-workflow-yaml passes on all 56 files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
claude-ceo-assistant merged commit 7f25373309 into main 2026-06-04 05:32:29 +00:00

Owner force-merged (honest bypass). Clears the E2E API Smoke required-gate flake (#2205): 30x1s health-wait was shorter than the growing migration chain; now a 180s readiness budget + kill -0 liveness (dead platform still fails fast+loud). Applied to e2e-api/e2e-chat/e2e-peer-visibility/e2e-legacy-advisory. Required CI green. Token revoked.


Reference: molecule-ai/molecule-core#2206