From 382a894f538015763d449c1cde65c52615a7c0b3 Mon Sep 17 00:00:00 2001
From: "Hongming Wang (CTO)"
Date: Wed, 3 Jun 2026 21:48:03 -0700
Subject: [PATCH] fix(ci): wait for platform /health on a migration-chain-proof budget (#2205)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The `E2E API Smoke Test` REQUIRED gate (and the sibling local-platform
E2E workflows) started the platform in the background and waited for
/health with a fixed 30×1s loop (~30s). The platform binds /health only
AFTER applying the FULL migration chain on cold start; that chain now
reaches past the 30s window (the run log gets to
20260523000000_schedule_consecutive_sdk_errors.up.sql before
"Platform starting on :PORT"), so the health loop expired before the
server was reachable → downstream E2E never ran → main went red. A fixed
budget is brittle by construction because the migration chain grows
every release.

Fix (deterministic, not a bigger magic number):
- Poll /health on a generous, clearly-commented wall-clock budget (180s)
  that comfortably exceeds cold-start + full-migration time and is
  robust to the chain continuing to grow. /health returning 200 is the
  real readiness signal (migrations done + server listening).
- Still fail fast + loud on a genuinely dead platform: if the
  backgrounded platform-server PID has exited (e.g. a broken migration
  crashed it), stop immediately and dump the platform log — we never
  mask a real startup failure, and we never wait out the full budget for
  a process that is already gone.
- On true timeout, dump the platform log tail and fail with ::error::.

Applied identically to the four workflows sharing the 30×1s
platform-/health pattern: e2e-api, e2e-chat, e2e-peer-visibility,
e2e-legacy-advisory. The unrelated Postgres-readiness `seq 1 30` waits
(which are not gated on the migration chain) are intentionally left
unchanged.
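
For reference, the wait pattern described above can be sketched as one
standalone shell function. This is an illustrative sketch, not code from
the patch: the function name `wait_for_health`, its argument order, and
the distinct return codes (1 for a dead process, 2 for a timeout) are
assumptions made here; the workflows inline the loop and exit 1 in both
failure cases.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_health and its arguments are hypothetical names,
# not part of the patch.
#   $1 = health URL, $2 = PID of the backgrounded server (may be empty),
#   $3 = wall-clock budget in seconds (default 180), $4 = optional log file.
wait_for_health() {
  local url="$1" pid="$2" deadline_secs="${3:-180}" log="${4:-}"
  local start
  start=$(date +%s)
  while :; do
    # A successful /health response is the readiness signal.
    if curl -sf "$url" > /dev/null 2>&1; then
      echo "healthy after $(( $(date +%s) - start ))s"
      return 0
    fi
    # Fast-fail: a process that has exited will never answer.
    if [ -n "$pid" ] && ! kill -0 "$pid" 2>/dev/null; then
      echo "process ${pid} exited before ${url} became reachable" >&2
      if [ -n "$log" ]; then cat "$log" >&2 || true; fi
      return 1
    fi
    # Wall-clock deadline, measured from the start of the wait.
    if [ "$(( $(date +%s) - start ))" -ge "$deadline_secs" ]; then
      echo "not healthy within ${deadline_secs}s" >&2
      if [ -n "$log" ]; then cat "$log" >&2 || true; fi
      return 2
    fi
    sleep 1
  done
}
```

A caller might run `wait_for_health "$BASE/health" "$PLATFORM_PID" 180
workspace-server/platform.log`. Measuring elapsed time with `date +%s`
rather than counting iterations keeps the budget honest even when each
curl attempt itself takes noticeable time.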

curl usage avoids the -w '%{http_code}' status-capture shape, so
lint-curl-status-capture passes; lint-workflow-yaml passes on all 56
files.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .gitea/workflows/e2e-api.yml             | 34 ++++++++++++++++++++----
 .gitea/workflows/e2e-chat.yml            | 30 +++++++++++++++++----
 .gitea/workflows/e2e-legacy-advisory.yml | 34 ++++++++++++++++++++----
 .gitea/workflows/e2e-peer-visibility.yml | 32 +++++++++++++++++++---
 4 files changed, 111 insertions(+), 19 deletions(-)

diff --git a/.gitea/workflows/e2e-api.yml b/.gitea/workflows/e2e-api.yml
index 843fe2af5..c1b9eea6e 100644
--- a/.gitea/workflows/e2e-api.yml
+++ b/.gitea/workflows/e2e-api.yml
@@ -330,16 +330,40 @@ jobs:
       - name: Wait for /health
         if: needs.detect-changes.outputs.api == 'true'
         run: |
-          for i in $(seq 1 30); do
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction (it WILL be exceeded as migrations accrue).
+          # Use a generous wall-clock budget that comfortably exceeds
+          # cold-start + full-migration time, polling fast. This is robust to a
+          # growing chain WITHOUT masking a genuinely dead platform: if the
+          # background platform-server process has exited (e.g. a broken
+          # migration crashed it), we stop and fail loudly at once instead of
+          # waiting out the whole budget.
+          DEADLINE_SECS=180 # cold-start + full migration chain headroom
+          PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
+          start=$(date +%s)
+          while :; do
             if curl -sf "$BASE/health" > /dev/null; then
-              echo "Platform up after ${i}s"
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
               exit 0
             fi
+            # Fast-fail: if the platform process died, /health will never come.
+            if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          echo "::error::Platform did not become healthy in 30s"
-          cat workspace-server/platform.log || true
-          exit 1
       - name: Assert migrations applied
         if: needs.detect-changes.outputs.api == 'true'
         run: |
diff --git a/.gitea/workflows/e2e-chat.yml b/.gitea/workflows/e2e-chat.yml
index de8df292a..1ffadf3f4 100644
--- a/.gitea/workflows/e2e-chat.yml
+++ b/.gitea/workflows/e2e-chat.yml
@@ -242,16 +242,36 @@ jobs:
       - name: Wait for /health
         if: needs.detect-changes.outputs.chat == 'true'
         run: |
-          for i in $(seq 1 30); do
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction. Use a generous wall-clock budget that
+          # comfortably exceeds cold-start + full-migration time, polling fast.
+          # Robust to a growing chain WITHOUT masking a dead platform: if the
+          # background platform-server process has exited, fail loudly at once.
+          DEADLINE_SECS=180 # cold-start + full migration chain headroom
+          PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
+          start=$(date +%s)
+          while :; do
             if curl -sf "http://127.0.0.1:${PLATFORM_PORT}/health" > /dev/null; then
-              echo "Platform up after ${i}s"
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
               exit 0
             fi
+            if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          echo "::error::Platform did not become healthy in 30s"
-          cat workspace-server/platform.log || true
-          exit 1
 
       - name: Install canvas dependencies
         if: needs.detect-changes.outputs.chat == 'true'
diff --git a/.gitea/workflows/e2e-legacy-advisory.yml b/.gitea/workflows/e2e-legacy-advisory.yml
index aeeb83f07..2bb943af3 100644
--- a/.gitea/workflows/e2e-legacy-advisory.yml
+++ b/.gitea/workflows/e2e-legacy-advisory.yml
@@ -130,13 +130,37 @@ jobs:
         run: |
           set -euo pipefail
           ./workspace-server/platform-server > workspace-server/platform.log 2>&1 &
-          echo $! > workspace-server/platform.pid
-          for i in $(seq 1 30); do
-            curl -sf "$BASE/health" >/dev/null && exit 0
+          PLATFORM_PID=$!
+          echo "$PLATFORM_PID" > workspace-server/platform.pid
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction. Use a generous wall-clock budget that
+          # comfortably exceeds cold-start + full-migration time, polling fast.
+          # Robust to a growing chain WITHOUT masking a dead platform: if the
+          # background platform-server process has exited, fail loudly at once.
+          DEADLINE_SECS=180 # cold-start + full migration chain headroom
+          start=$(date +%s)
+          while :; do
+            if curl -sf "$BASE/health" >/dev/null; then
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
+              exit 0
+            fi
+            if ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          cat workspace-server/platform.log || true
-          exit 1
 
       - name: Run comprehensive E2E
         run: bash tests/e2e/test_comprehensive_e2e.sh
diff --git a/.gitea/workflows/e2e-peer-visibility.yml b/.gitea/workflows/e2e-peer-visibility.yml
index fd2725717..e5b103972 100644
--- a/.gitea/workflows/e2e-peer-visibility.yml
+++ b/.gitea/workflows/e2e-peer-visibility.yml
@@ -267,12 +267,36 @@ jobs:
           echo $! > platform.pid
       - name: Wait for /health
         run: |
-          for i in $(seq 1 30); do
-            curl -sf "$BASE/health" > /dev/null && { echo "Platform up after ${i}s"; exit 0; }
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction. Use a generous wall-clock budget that
+          # comfortably exceeds cold-start + full-migration time, polling fast.
+          # Robust to a growing chain WITHOUT masking a dead platform: if the
+          # background platform-server process has exited, fail loudly at once.
+          DEADLINE_SECS=180 # cold-start + full migration chain headroom
+          PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
+          start=$(date +%s)
+          while :; do
+            if curl -sf "$BASE/health" > /dev/null; then
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
+              exit 0
+            fi
+            if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          echo "::error::Platform did not become healthy in 30s"
-          cat workspace-server/platform.log || true; exit 1
       - name: Run LOCAL fresh-provision peer-visibility E2E (literal MCP list_peers)
         # HONEST gate — NO continue-on-error. The local backend uses
         # external-mode workspaces so this context tests the literal MCP
-- 
2.52.0