fix(ci): wait for platform /health on a migration-chain-proof budget (#2205) #2206
What

Fixes the `E2E API Smoke Test` REQUIRED branch-protection gate going red on main (RCA in #2205), plus the three sibling local-platform E2E workflows that share the identical brittle pattern.

Root cause (#2205)
The platform is started in the background, and the workflow then waited for `/health` with a fixed 30×1s loop. The platform binds `/health` only after applying the full migration chain on cold start, and that chain has grown past 30s (the run log reaches `20260523000000_schedule_consecutive_sdk_errors.up.sql` before printing `Platform starting on :PORT`). The 30s budget expired before the server was reachable → downstream E2E assertions never ran → red. A fixed budget is brittle by construction: the migration chain keeps growing.

Fix: deterministic, not a bigger magic number
For each platform `/health` wait (`e2e-api`, `e2e-chat`, `e2e-peer-visibility`, `e2e-legacy-advisory`):

- Poll `/health` on a generous, clearly-commented 180s wall-clock budget that comfortably exceeds cold-start + full-migration time and stays robust as the chain grows. A 200 from `/health` is the real readiness signal (migrations done + server listening).
- If the backgrounded `platform-server` PID has exited (e.g. a broken migration crashed it), stop immediately and dump the platform log: never mask a real startup failure, never wait out the full budget for a process that's already gone.
- On true timeout, dump the platform log tail and fail with `::error::`.

The unrelated Postgres-readiness `seq 1 30` waits (not gated on the migration chain) are intentionally left unchanged.

Why it's robust + still fails loud
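A minimal bash sketch of the wait described under the Fix heading may make these properties easier to check. It is not the code from the workflows: the names `wait_for_health` and `probe_health`, the 100-line log tail, and the 2s per-request curl timeout are assumptions made for illustration. Only the 180s budget, the `/health` 200 readiness signal, the `kill -0` liveness check, and the `::error::` annotation come from this PR.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names and log-tail size are hypothetical.

# Readiness probe. With -f, curl exits non-zero on HTTP errors, so the exit
# status alone is the signal and no -w '%{http_code}' capture is needed.
probe_health() {
  curl -fs -o /dev/null --max-time 2 "$1"
}

# wait_for_health URL PID LOGFILE [BUDGET_SECONDS]
# Returns 0 once the probe succeeds; returns 1 if the process has exited
# or the wall-clock budget runs out.
wait_for_health() {
  local url="$1" pid="$2" log="$3" budget="${4:-180}"
  local start=$SECONDS

  # Wall-clock budget rather than a fixed iteration count.
  while (( SECONDS - start < budget )); do
    if probe_health "$url"; then
      echo "platform healthy after $(( SECONDS - start ))s"
      return 0
    fi
    # kill -0 delivers no signal; it only checks that the PID still exists.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "::error::platform-server (PID $pid) exited before /health was reachable"
      tail -n 100 "$log" || true
      return 1
    fi
    sleep 1
  done

  echo "::error::/health not ready within ${budget}s"
  tail -n 100 "$log" || true
  return 1
}

# Hypothetical usage in a workflow step (binary path and log name assumed):
#   ./platform-server > platform.log 2>&1 &
#   wait_for_health "http://localhost:${PORT}/health" "$!" platform.log 180
```

Checking liveness only after a failed probe means a healthy server is never delayed by the PID check, while a crashed one is reported within about one second instead of after the full budget.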
As soon as `/health` returns 200, the workflow proceeds. (Verified by simulation: server binding late → success.)

Lints
- `lint-curl-status-capture`: passes; curl usage avoids the `-w '%{http_code}'` capture shape.
- `lint-workflow-yaml`: passes (56 files, 0 warnings). YAML parses; bash is `shellcheck`-clean.

Closes #2205
🤖 Generated with Claude Code
The `E2E API Smoke Test` REQUIRED gate (and the sibling local-platform E2E workflows) started the platform in the background and waited for /health with a fixed 30×1s loop (~30s). The platform binds /health only AFTER applying the FULL migration chain on cold start; that chain now reaches past the 30s window (the run log gets to 20260523000000_schedule_consecutive_sdk_errors.up.sql before "Platform starting on :PORT"), so the health loop expired before the server was reachable → downstream E2E never ran → main went red. A fixed budget is brittle by construction because the migration chain grows every release.

Fix (deterministic, not a bigger magic number):

- Poll /health on a generous, clearly-commented wall-clock budget (180s) that comfortably exceeds cold-start + full-migration time and is robust to the chain continuing to grow. /health returning 200 is the real readiness signal (migrations done + server listening).
- Still fail fast + loud on a genuinely dead platform: if the backgrounded platform-server PID has exited (e.g. a broken migration crashed it), stop immediately and dump the platform log; we never mask a real startup failure, and we never wait out the full budget for a process that is already gone.
- On true timeout, dump the platform log tail and fail with ::error::.

Applied identically to the four workflows sharing the 30×1s platform-/health pattern: e2e-api, e2e-chat, e2e-peer-visibility, e2e-legacy-advisory. The unrelated Postgres-readiness `seq 1 30` waits (which are not gated on the migration chain) are intentionally left unchanged.

curl usage avoids the -w '%{http_code}' status-capture shape, so lint-curl-status-capture passes; lint-workflow-yaml passes on all 56 files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Owner force-merged (honest bypass).
Clears the E2E API Smoke required-gate flake (#2205): 30x1s health-wait was shorter than the growing migration chain; now a 180s readiness budget + kill -0 liveness (dead platform still fails fast+loud). Applied to e2e-api/e2e-chat/e2e-peer-visibility/e2e-legacy-advisory. Required CI green. Token revoked.