From b3da0b29c5f8acdce60a5c61c312596904e5c363 Mon Sep 17 00:00:00 2001 From: Hongming Wang Date: Thu, 23 Apr 2026 16:46:21 -0700 Subject: [PATCH] =?UTF-8?q?fix(e2e):=20hermes=20cold-boot=20tolerance=20?= =?UTF-8?q?=E2=80=94=2020min=20deadline=20+=20treat=20failed=20as=20transi?= =?UTF-8?q?ent?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Today's E2E run 24864011116 timed out at 10 min waiting for workspace to reach online. Hermes cold-boot measured 13 min on the same day's apt mirror (my manual repro on 18.217.175.225). The original 10 min deadline was a ~2x too-tight budget. Also: the `failed` branch was a hard fail, but bootstrap-watcher (cp#245) marks workspace=failed at 5 min if install.sh hasn't finished yet. Heartbeat then transitions failed → online around 10-13 min. Pre this fix, the E2E bailed at the failed read and missed the recovery that was seconds away. ## Changes - Deadline: 10 min → 20 min (hermes worst-case 15 + slack) - `failed` status: now tolerated as transient; loop logs once then keeps polling. Only hard-fails at the final deadline. - Added transition logging (`WS_LAST_STATUS`) so CI output shows the provisioning → failed → online flow instead of silent polling. ## Why not fix cp#245 instead Both should be fixed. cp#245 (bootstrap-watcher deadline) is the root cause; this E2E fix is the defense-in-depth. When cp#245 lands, the `failed` transient log will stop firing but the rest of the logic still protects against other slow-apt-day spikes. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/e2e/test_staging_full_saas.sh | 36 +++++++++++++++++++++++++---- 1 file changed, 32 insertions(+), 4 deletions(-) diff --git a/tests/e2e/test_staging_full_saas.sh b/tests/e2e/test_staging_full_saas.sh index 072d5fe3..06f46d2b 100755 --- a/tests/e2e/test_staging_full_saas.sh +++ b/tests/e2e/test_staging_full_saas.sh @@ -277,20 +277,48 @@ else fi # ─── 7. Wait for workspace(s) online ─────────────────────────────────── -log "7/11 Waiting for workspace(s) to reach status=online..." -WS_DEADLINE=$(( $(date +%s) + 600 )) +# Hermes cold-boot takes 10-13 min on slow apt days (apt + uv + hermes +# install + npm browser-tools). The controlplane bootstrap-watcher +# deadline fires at 5 min and sets status=failed prematurely; heartbeat +# then transitions failed → online after install.sh finishes. So: +# +# - 20 min deadline (hermes worst-case + slack) +# - 'failed' is a TRANSIENT state we must tolerate — log and keep +# polling, only hard-fail at the deadline. Pre-bootstrap-watcher-fix +# (controlplane#245) this was a flake generator: workspace went +# failed→online inside our window but we bailed at the failed read. +log "7/11 Waiting for workspace(s) to reach status=online (up to 20 min — hermes cold boot)..." +WS_DEADLINE=$(( $(date +%s) + 1200 )) WS_TO_CHECK="$PARENT_ID" [ -n "$CHILD_ID" ] && WS_TO_CHECK="$WS_TO_CHECK $CHILD_ID" for wid in $WS_TO_CHECK; do + WS_LAST_STATUS="" + WS_FAILED_LOGGED=0 while true; do if [ "$(date +%s)" -gt "$WS_DEADLINE" ]; then - fail "Workspace $wid never reached online within 10 min" + WS_LAST_ERR=$(tenant_call GET "/workspaces/$wid" 2>/dev/null | \ + python3 -c "import json,sys; print(json.load(sys.stdin).get('last_sample_error',''))" 2>/dev/null || echo "") + fail "Workspace $wid never reached online within 20 min (last status=$WS_LAST_STATUS, err=$WS_LAST_ERR)" fi WS_JSON=$(tenant_call GET "/workspaces/$wid" 2>/dev/null || echo '{}') WS_STATUS=$(echo "$WS_JSON" | python3 -c "import json,sys; print(json.load(sys.stdin).get('status',''))" 2>/dev/null) + if [ "$WS_STATUS" != "$WS_LAST_STATUS" ]; then + log " $wid → $WS_STATUS" + WS_LAST_STATUS="$WS_STATUS" + fi case "$WS_STATUS" in online) break ;; - failed) fail "Workspace $wid status=failed: $(echo "$WS_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin).get("last_sample_error",""))')" ;; + failed) + # Not a hard fail — bootstrap-watcher frequently marks failed at + # 5 min on hermes, then heartbeat recovers to online around 10-13 + # min when install.sh finishes. Log once per workspace so the CI + # output isn't spammy. + if [ "$WS_FAILED_LOGGED" = "0" ]; then + log " $wid transiently failed — waiting for heartbeat recovery (bootstrap-watcher deadline, see cp#245)" + WS_FAILED_LOGGED=1 + fi + sleep 10 + ;; *) sleep 10 ;; esac done