From 382a894f538015763d449c1cde65c52615a7c0b3 Mon Sep 17 00:00:00 2001
From: "Hongming Wang (CTO)" <hongmingwang@moleculesai.app>
Date: Wed, 3 Jun 2026 21:48:03 -0700
Subject: [PATCH] fix(ci): wait for platform /health on a migration-chain-proof
 budget (#2205)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The `E2E API Smoke Test` REQUIRED gate (and the sibling local-platform E2E
workflows) started the platform in the background and waited for /health with
a fixed 30×1s loop (~30s). The platform binds /health only AFTER applying the
FULL migration chain on cold start; that chain now reaches past the 30s window
(the run log gets to 20260523000000_schedule_consecutive_sdk_errors.up.sql
before "Platform starting on :PORT"), so the health loop expired before the
server was reachable → downstream E2E never ran → main went red. A fixed budget
is brittle by construction because the migration chain grows every release.

Fix (deterministic, not a bigger magic number):
- Poll /health on a generous, clearly-commented wall-clock budget (180s) that
  comfortably exceeds cold-start + full-migration time and is robust to the
  chain continuing to grow. /health returning 200 is the real readiness signal
  (migrations done + server listening).
- Still fail fast + loud on a genuinely dead platform: if the backgrounded
  platform-server PID has exited (e.g. a broken migration crashed it), stop
  immediately and dump the platform log — we never mask a real startup failure,
  and we never wait out the full budget for a process that is already gone.
- On true timeout, dump the platform log tail and fail with ::error::.

Applied identically to the four workflows sharing the 30×1s platform-/health
pattern: e2e-api, e2e-chat, e2e-peer-visibility, e2e-legacy-advisory. The
unrelated Postgres-readiness `seq 1 30` waits (which are not gated on the
migration chain) are intentionally left unchanged.

curl usage avoids the -w '%{http_code}' status-capture shape, so
lint-curl-status-capture passes; lint-workflow-yaml passes on all 56 files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .gitea/workflows/e2e-api.yml             | 34 ++++++++++++++++++++----
 .gitea/workflows/e2e-chat.yml            | 30 +++++++++++++++++----
 .gitea/workflows/e2e-legacy-advisory.yml | 34 ++++++++++++++++++++----
 .gitea/workflows/e2e-peer-visibility.yml | 32 +++++++++++++++++++---
 4 files changed, 111 insertions(+), 19 deletions(-)

diff --git a/.gitea/workflows/e2e-api.yml b/.gitea/workflows/e2e-api.yml
index 843fe2af5..c1b9eea6e 100644
--- a/.gitea/workflows/e2e-api.yml
+++ b/.gitea/workflows/e2e-api.yml
@@ -330,16 +330,40 @@ jobs:
       - name: Wait for /health
         if: needs.detect-changes.outputs.api == 'true'
         run: |
-          for i in $(seq 1 30); do
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction (it WILL be exceeded as migrations accrue).
+          # Use a generous wall-clock budget that comfortably exceeds
+          # cold-start + full-migration time, polling fast. This is robust to a
+          # growing chain WITHOUT masking a genuinely dead platform: if the
+          # background platform-server process has exited (e.g. a broken
+          # migration crashed it), we stop and fail loudly at once instead of
+          # waiting out the whole budget.
+          DEADLINE_SECS=180          # cold-start + full migration chain headroom
+          PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
+          start=$(date +%s)
+          while :; do
             if curl -sf "$BASE/health" > /dev/null; then
-              echo "Platform up after ${i}s"
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
               exit 0
             fi
+            # Fast-fail: if the platform process died, /health will never come.
+            if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          echo "::error::Platform did not become healthy in 30s"
-          cat workspace-server/platform.log || true
-          exit 1
       - name: Assert migrations applied
         if: needs.detect-changes.outputs.api == 'true'
         run: |
diff --git a/.gitea/workflows/e2e-chat.yml b/.gitea/workflows/e2e-chat.yml
index de8df292a..1ffadf3f4 100644
--- a/.gitea/workflows/e2e-chat.yml
+++ b/.gitea/workflows/e2e-chat.yml
@@ -242,16 +242,36 @@ jobs:
       - name: Wait for /health
         if: needs.detect-changes.outputs.chat == 'true'
         run: |
-          for i in $(seq 1 30); do
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction. Use a generous wall-clock budget that
+          # comfortably exceeds cold-start + full-migration time, polling fast.
+          # Robust to a growing chain WITHOUT masking a dead platform: if the
+          # background platform-server process has exited, fail loudly at once.
+          DEADLINE_SECS=180          # cold-start + full migration chain headroom
+          PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
+          start=$(date +%s)
+          while :; do
             if curl -sf "http://127.0.0.1:${PLATFORM_PORT}/health" > /dev/null; then
-              echo "Platform up after ${i}s"
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
               exit 0
             fi
+            if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          echo "::error::Platform did not become healthy in 30s"
-          cat workspace-server/platform.log || true
-          exit 1
 
       - name: Install canvas dependencies
         if: needs.detect-changes.outputs.chat == 'true'
diff --git a/.gitea/workflows/e2e-legacy-advisory.yml b/.gitea/workflows/e2e-legacy-advisory.yml
index aeeb83f07..2bb943af3 100644
--- a/.gitea/workflows/e2e-legacy-advisory.yml
+++ b/.gitea/workflows/e2e-legacy-advisory.yml
@@ -130,13 +130,37 @@ jobs:
         run: |
           set -euo pipefail
           ./workspace-server/platform-server > workspace-server/platform.log 2>&1 &
-          echo $! > workspace-server/platform.pid
-          for i in $(seq 1 30); do
-            curl -sf "$BASE/health" >/dev/null && exit 0
+          PLATFORM_PID=$!
+          echo "$PLATFORM_PID" > workspace-server/platform.pid
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction. Use a generous wall-clock budget that
+          # comfortably exceeds cold-start + full-migration time, polling fast.
+          # Robust to a growing chain WITHOUT masking a dead platform: if the
+          # background platform-server process has exited, fail loudly at once.
+          DEADLINE_SECS=180          # cold-start + full migration chain headroom
+          start=$(date +%s)
+          while :; do
+            if curl -sf "$BASE/health" >/dev/null; then
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
+              exit 0
+            fi
+            if ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          cat workspace-server/platform.log || true
-          exit 1
 
       - name: Run comprehensive E2E
         run: bash tests/e2e/test_comprehensive_e2e.sh
diff --git a/.gitea/workflows/e2e-peer-visibility.yml b/.gitea/workflows/e2e-peer-visibility.yml
index fd2725717..e5b103972 100644
--- a/.gitea/workflows/e2e-peer-visibility.yml
+++ b/.gitea/workflows/e2e-peer-visibility.yml
@@ -267,12 +267,36 @@ jobs:
           echo $! > platform.pid
       - name: Wait for /health
         run: |
-          for i in $(seq 1 30); do
-            curl -sf "$BASE/health" > /dev/null && { echo "Platform up after ${i}s"; exit 0; }
+          # Readiness signal: the platform binds /health only AFTER the full
+          # migration chain has been applied on cold start (it prints
+          # "Platform starting on :PORT" at that point). So a 200 from /health
+          # is the real "migrations done + server listening" signal.
+          #
+          # The migration chain grows every release, so a fixed ~30s budget is
+          # brittle by construction. Use a generous wall-clock budget that
+          # comfortably exceeds cold-start + full-migration time, polling fast.
+          # Robust to a growing chain WITHOUT masking a dead platform: if the
+          # background platform-server process has exited, fail loudly at once.
+          DEADLINE_SECS=180          # cold-start + full migration chain headroom
+          PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
+          start=$(date +%s)
+          while :; do
+            if curl -sf "$BASE/health" > /dev/null; then
+              echo "Platform healthy after $(( $(date +%s) - start ))s"
+              exit 0
+            fi
+            if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
+              echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
+            if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
+              echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
+              cat workspace-server/platform.log || true
+              exit 1
+            fi
             sleep 1
           done
-          echo "::error::Platform did not become healthy in 30s"
-          cat workspace-server/platform.log || true; exit 1
       - name: Run LOCAL fresh-provision peer-visibility E2E (literal MCP list_peers)
         # HONEST gate — NO continue-on-error. The local backend uses
         # external-mode workspaces so this context tests the literal MCP
-- 
2.52.0