From 2e92152c34e18151d14e4e7e7b2261761fba229c Mon Sep 17 00:00:00 2001
From: Molecule AI Infra Lead
Date: Fri, 24 Apr 2026 14:12:40 +0000
Subject: [PATCH] fix(e2e): increase hermes workspace wait from 20 to 30 min
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Root cause of PR #1981 E2E failures (step 7 timeout):
- hermes-agent install from NousResearch (Node 22 tarball + Python deps
  from source) + gateway health wait takes 15-25 min on staging
- install.sh runs BEFORE molecule-runtime launches, blocking heartbeats
- bootstrap-watcher fires at 5 min (cp#245) → workspace=failed
- workspace never recovers because molecule-runtime never starts in time

Fix: increase WS_DEADLINE from 1200s (20 min) to 1800s (30 min) to give
hermes cold-boot enough runway. Also bump job timeout-minutes from 30 → 45
to accommodate the longer wait.

Medium-term: fix cp#245 (bootstrap-watcher hermes deadline too short) in
molecule-controlplane to reduce false-failed noise.

Co-Authored-By: Claude Sonnet 4.6
---
 .github/workflows/e2e-staging-saas.yml | 4 ++--
 tests/e2e/test_staging_full_saas.sh    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/.github/workflows/e2e-staging-saas.yml b/.github/workflows/e2e-staging-saas.yml
index c1e2b878..8ef1c950 100644
--- a/.github/workflows/e2e-staging-saas.yml
+++ b/.github/workflows/e2e-staging-saas.yml
@@ -5,7 +5,7 @@ name: E2E Staging SaaS (full lifecycle)
 # HMA memory → activity → peers), then tears down and asserts leak-free.
 #
 # Why a separate workflow (not folded into ci.yml):
-# - The run takes ~20 min (EC2 boot + cloudflared DNS + provision sweeps +
+# - The run takes ~25-35 min (EC2 boot + cloudflared DNS + provision sweeps +
 # agent bootstrap), way too slow for every PR.
 # - Needs its own concurrency group so two pushes don't fight over the
 # same staging org slug prefix.
@@ -68,7 +68,7 @@ jobs:
   e2e-staging-saas:
     name: E2E Staging SaaS
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 45
     permissions:
       contents: read
 
diff --git a/tests/e2e/test_staging_full_saas.sh b/tests/e2e/test_staging_full_saas.sh
index aea0f8a0..ba0fc7a9 100755
--- a/tests/e2e/test_staging_full_saas.sh
+++ b/tests/e2e/test_staging_full_saas.sh
@@ -308,8 +308,8 @@ fi
 # polling, only hard-fail at the deadline. Pre-bootstrap-watcher-fix
 # (controlplane#245) this was a flake generator: workspace went
 # failed→online inside our window but we bailed at the failed read.
-log "7/11 Waiting for workspace(s) to reach status=online (up to 20 min — hermes cold boot)..."
-WS_DEADLINE=$(( $(date +%s) + 1200 ))
+log "7/11 Waiting for workspace(s) to reach status=online (up to 30 min — hermes cold boot)..."
+WS_DEADLINE=$(( $(date +%s) + 1800 ))
 WS_TO_CHECK="$PARENT_ID"
 [ -n "$CHILD_ID" ] && WS_TO_CHECK="$WS_TO_CHECK $CHILD_ID"
 for wid in $WS_TO_CHECK; do