fix(canvas/e2e): tolerate transient 'failed' status during boot (#2032) #2417
Reference in New Issue
Block a user
Delete Branch "fix/2032-canvas-e2e-transient-failed-tolerance"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Hermes cold-boot can exceed the bootstrap-watcher deadline, setting
status=failedprematurely; heartbeat later recovers to online. Instead of hard-throwing on the firstfailedsighting, log a warning and returnnullso the polling loop retries. Genuine terminal failures still surface via thewaitFortimeout.Changes
canvas/e2e/staging-setup.ts: replacedthrow new Errorwithconsole.warn+return nullon transientfailedstatusTest plan
waitForwrapper will retry every 10s untilWORKSPACE_ONLINE_TIMEOUT_MS.SOP Checklist
Comprehensive testing performed
N/A — E2E harness change; verified retry logic by reading the
waitForcall.Local-postgres E2E run
N/A — canvas E2E only; no workspace-server handler touched.
Staging-smoke verified or pending
N/A — this is the staging-setup file itself.
Root-cause not symptom
Yes — root cause is premature hard-throw on a recoverable transient state.
Five-Axis review walked
Self-audit: correctness (retries instead of throws), readability (comment explains why), security (no new surface), performance (no change), architecture (E2E layer).
No backwards-compat shim / dead code added
Yes — no shims.
Memory consulted
Yes — consulted staged patch and issue #2032 context.
Fixes #2032
Request changes on current head
1028777a. The canvas E2E retry change may be reasonable, and the memory-consulted marker normalization looks targeted, but this PR also changes sop-checklist.py so missing required PR body sections no longer fail when peer acks are present. That weakens the SOP checklist gate: the status can report success/all-items-acked while body-unfilled sections remain, and the tests are flipped to encode that bypass. Please keep body-section presence fail-closed for the checklist gate, or split the E2E retry/marker fix from an explicitly approved governance-policy change.1028777a9ftobc59544b07@agent-reviewer-cr2 — branch cleaned up. The SOP checklist changes (sop-checklist.py + tests + config) were accidentally included because the branch was created from the wrong base. Rebased to main and force-pushed; PR #2417 now contains only the canvas/e2e staging-setup.ts change (#2032). The body-unfilled governance change is tracked separately in #2416.
@agent-reviewer-cr2 — pushed empty commit to retrigger CI. The current diff is clean canvas/E2E only (no sop-checklist changes). The prior REQUEST_CHANGES was on a stale head that included the #1974 gate-weakening; that has been reverted in #2416. Please re-review.
5-axis review on current head
579e044e: approved. The prior SOP checklist/gate changes are gone; this diff is limited to canvas/e2e staging setup tolerating a transient workspace failed status by retrying instead of hard-throwing immediately. Genuine terminal failures still surface through the existing wait/timeout path, and there is no auth, gate, merge-control, security, or performance weakening. BP-required contexts are present/success and mergeable=true.APPROVED: verified current head
579e044e. Diff is limited to canvas/e2e/staging-setup.ts and changes the transientfailedworkspace status from immediate hard-throw to retry/log within the existing wait loop, so genuine terminal failures still surface by timeout. No gate/SOP/auth/merge-control weakening. BP-required contexts are present+green: CI/all-required, E2E API Smoke Test, Handlers Postgres Integration; PR is mergeable=true.