Today's E2E run 24864011116 timed out at 10 min waiting for the workspace to reach `online`. Hermes cold boot measured 13 min against the same day's apt mirror (my manual repro on 18.217.175.225), so the original 10 min deadline was roughly half the budget actually needed.

Also: the `failed` branch was a hard fail, but bootstrap-watcher (cp#245) marks workspace=failed at 5 min if install.sh hasn't finished yet. Heartbeat then transitions failed → online around 10-13 min. Before this fix, the E2E bailed at the `failed` read and missed a recovery that was seconds away.

## Changes

- Deadline: 10 min → 20 min (hermes worst case 15 min + slack)
- `failed` status is now tolerated as transient: the loop logs it once, then keeps polling, and only hard-fails at the final deadline.
- Added transition logging (`WS_LAST_STATUS`) so CI output shows the provisioning → failed → online flow instead of silent polling.

## Why not fix cp#245 instead

Both should be fixed. cp#245 (the bootstrap-watcher deadline) is the root cause; this E2E fix is defense in depth. When cp#245 lands, the `failed` transient log will stop firing, but the rest of the logic still protects against other slow-apt-day spikes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
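For reference, the tolerant poll loop can be sketched in shell. This is a minimal sketch, not the actual diff: `get_workspace_status` and `wait_for_online` are hypothetical names, and the status source is stubbed to simulate the provisioning → failed → online flow instead of querying the real API.

```shell
#!/usr/bin/env bash
# Stub standing in for the real API check; it walks a canned sequence of
# statuses so the loop below can be exercised without a live workspace.
STATUSES=(provisioning provisioning failed failed online)
i=0
get_workspace_status() {
  status="${STATUSES[i]}"
  i=$((i + 1))
}

wait_for_online() {
  local deadline_secs=$((20 * 60))   # was 10 min; hermes worst case 15 min + slack
  local start=$SECONDS
  WS_LAST_STATUS=""
  while (( SECONDS - start < deadline_secs )); do
    get_workspace_status
    if [[ "$status" != "$WS_LAST_STATUS" ]]; then
      # Transition logging: CI output shows each status change instead of
      # polling silently.
      echo "workspace status: ${WS_LAST_STATUS:-<none>} -> ${status}"
      WS_LAST_STATUS="$status"
    fi
    [[ "$status" == "online" ]] && return 0
    # Note what is NOT here: no hard fail on 'failed'. It is treated as
    # transient (cp#245 marks failed at 5 min; heartbeat flips it back),
    # so we keep polling until the deadline. A real loop would sleep ~15s
    # between iterations.
  done
  echo "workspace never reached online within ${deadline_secs}s" >&2
  return 1
}

wait_for_online
```

With the stubbed sequence, the loop logs the `<none> -> provisioning`, `provisioning -> failed`, and `failed -> online` transitions, then returns success; only if the deadline expired without ever seeing `online` would it hard-fail.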