Merge pull request 'fix(ci): wait for platform /health on a migration-chain-proof budget (#2205)' (#2206) from fix/e2e-api-health-wait-migration-chain into main
ci-arm64-advisory / fast-checks (push) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (push) Successful in 1s
publish-workspace-server-image / build-and-push (push) Successful in 3m22s
Block internal-flavored paths / Block forbidden paths (push) Successful in 2s
CI / Python Lint & Test (push) Successful in 3s
CI / Detect changes (push) Successful in 5s
E2E API Smoke Test / detect-changes (push) Successful in 5s
E2E Chat / detect-changes (push) Successful in 6s
Handlers Postgres Integration / detect-changes (push) Successful in 3s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (push) Successful in 3s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (push) Successful in 3s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 11s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (push) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (push) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 4s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (push) Successful in 30s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (push) Successful in 1m10s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (push) Successful in 1m36s
CI / Platform (Go) (push) Successful in 1s
CI / Canvas (Next.js) (push) Successful in 1s
CI / Shellcheck (E2E scripts) (push) Successful in 2s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (push) Failing after 2m13s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 1s
CI / all-required (push) Successful in 38s
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 52s
CI / Canvas Deploy Reminder (push) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 1m9s
E2E Chat / E2E Chat (push) Successful in 2m26s
publish-workspace-server-image / Production auto-deploy (push) Failing after 5m10s

This commit was merged in pull request #2206.
This commit is contained in:
2026-06-04 05:32:28 +00:00
4 changed files with 111 additions and 19 deletions
+29 -5
View File
@@ -330,16 +330,40 @@ jobs:
- name: Wait for /health
if: needs.detect-changes.outputs.api == 'true'
run: |
for i in $(seq 1 30); do
# Readiness signal: the platform binds /health only AFTER the full
# migration chain has been applied on cold start (it prints
# "Platform starting on :PORT" at that point). So a 200 from /health
# is the real "migrations done + server listening" signal.
#
# The migration chain grows every release, so a fixed ~30s budget is
# brittle by construction (it WILL be exceeded as migrations accrue).
# Use a generous wall-clock budget that comfortably exceeds
# cold-start + full-migration time, polling fast. This is robust to a
# growing chain WITHOUT masking a genuinely dead platform: if the
# background platform-server process has exited (e.g. a broken
# migration crashed it), we stop and fail loudly at once instead of
# waiting out the whole budget.
DEADLINE_SECS=180 # cold-start + full migration chain headroom
PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
start=$(date +%s)
while :; do
if curl -sf "$BASE/health" > /dev/null; then
echo "Platform up after ${i}s"
echo "Platform healthy after $(( $(date +%s) - start ))s"
exit 0
fi
# Fast-fail: if the platform process died, /health will never come.
if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
cat workspace-server/platform.log || true
exit 1
fi
if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
cat workspace-server/platform.log || true
exit 1
fi
sleep 1
done
echo "::error::Platform did not become healthy in 30s"
cat workspace-server/platform.log || true
exit 1
- name: Assert migrations applied
if: needs.detect-changes.outputs.api == 'true'
run: |
+25 -5
View File
@@ -242,16 +242,36 @@ jobs:
- name: Wait for /health
if: needs.detect-changes.outputs.chat == 'true'
run: |
for i in $(seq 1 30); do
# Readiness signal: the platform binds /health only AFTER the full
# migration chain has been applied on cold start (it prints
# "Platform starting on :PORT" at that point). So a 200 from /health
# is the real "migrations done + server listening" signal.
#
# The migration chain grows every release, so a fixed ~30s budget is
# brittle by construction. Use a generous wall-clock budget that
# comfortably exceeds cold-start + full-migration time, polling fast.
# Robust to a growing chain WITHOUT masking a dead platform: if the
# background platform-server process has exited, fail loudly at once.
DEADLINE_SECS=180 # cold-start + full migration chain headroom
PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
start=$(date +%s)
while :; do
if curl -sf "http://127.0.0.1:${PLATFORM_PORT}/health" > /dev/null; then
echo "Platform up after ${i}s"
echo "Platform healthy after $(( $(date +%s) - start ))s"
exit 0
fi
if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
cat workspace-server/platform.log || true
exit 1
fi
if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
cat workspace-server/platform.log || true
exit 1
fi
sleep 1
done
echo "::error::Platform did not become healthy in 30s"
cat workspace-server/platform.log || true
exit 1
- name: Install canvas dependencies
if: needs.detect-changes.outputs.chat == 'true'
+29 -5
View File
@@ -130,13 +130,37 @@ jobs:
run: |
set -euo pipefail
./workspace-server/platform-server > workspace-server/platform.log 2>&1 &
echo $! > workspace-server/platform.pid
for i in $(seq 1 30); do
curl -sf "$BASE/health" >/dev/null && exit 0
PLATFORM_PID=$!
echo "$PLATFORM_PID" > workspace-server/platform.pid
# Readiness signal: the platform binds /health only AFTER the full
# migration chain has been applied on cold start (it prints
# "Platform starting on :PORT" at that point). So a 200 from /health
# is the real "migrations done + server listening" signal.
#
# The migration chain grows every release, so a fixed ~30s budget is
# brittle by construction. Use a generous wall-clock budget that
# comfortably exceeds cold-start + full-migration time, polling fast.
# Robust to a growing chain WITHOUT masking a dead platform: if the
# background platform-server process has exited, fail loudly at once.
DEADLINE_SECS=180 # cold-start + full migration chain headroom
start=$(date +%s)
while :; do
if curl -sf "$BASE/health" >/dev/null; then
echo "Platform healthy after $(( $(date +%s) - start ))s"
exit 0
fi
if ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
cat workspace-server/platform.log || true
exit 1
fi
if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
cat workspace-server/platform.log || true
exit 1
fi
sleep 1
done
cat workspace-server/platform.log || true
exit 1
- name: Run comprehensive E2E
run: bash tests/e2e/test_comprehensive_e2e.sh
+28 -4
View File
@@ -267,12 +267,36 @@ jobs:
echo $! > platform.pid
- name: Wait for /health
run: |
for i in $(seq 1 30); do
curl -sf "$BASE/health" > /dev/null && { echo "Platform up after ${i}s"; exit 0; }
# Readiness signal: the platform binds /health only AFTER the full
# migration chain has been applied on cold start (it prints
# "Platform starting on :PORT" at that point). So a 200 from /health
# is the real "migrations done + server listening" signal.
#
# The migration chain grows every release, so a fixed ~30s budget is
# brittle by construction. Use a generous wall-clock budget that
# comfortably exceeds cold-start + full-migration time, polling fast.
# Robust to a growing chain WITHOUT masking a dead platform: if the
# background platform-server process has exited, fail loudly at once.
DEADLINE_SECS=180 # cold-start + full migration chain headroom
PLATFORM_PID="$(cat workspace-server/platform.pid 2>/dev/null || true)"
start=$(date +%s)
while :; do
if curl -sf "$BASE/health" > /dev/null; then
echo "Platform healthy after $(( $(date +%s) - start ))s"
exit 0
fi
if [ -n "$PLATFORM_PID" ] && ! kill -0 "$PLATFORM_PID" 2>/dev/null; then
echo "::error::platform-server (pid ${PLATFORM_PID}) exited before /health became reachable — see log below"
cat workspace-server/platform.log || true
exit 1
fi
if [ "$(( $(date +%s) - start ))" -ge "$DEADLINE_SECS" ]; then
echo "::error::Platform did not become healthy within ${DEADLINE_SECS}s — see log below"
cat workspace-server/platform.log || true
exit 1
fi
sleep 1
done
echo "::error::Platform did not become healthy in 30s"
cat workspace-server/platform.log || true; exit 1
- name: Run LOCAL fresh-provision peer-visibility E2E (literal MCP list_peers)
# HONEST gate — NO continue-on-error. The local backend uses
# external-mode workspaces so this context tests the literal MCP