fix(workspace): recover status from "failed" on live heartbeat #2414
Mechanism (named, not a flake): the provision-timeout sweeper flips a workspace `provisioning→failed` at 10m (claude-code). A slow cold boot (EC2 image pull + LLM preflight) can finish AFTER the flip and start heartbeating, but the heartbeat handler only recovered status from `offline`/`provisioning`/`awaiting_agent`→`online`, with no `failed` branch. agent_card is written unconditionally, so a healthy, serving workspace stayed stuck showing `failed` forever.

This is the root of the intermittent multi-provider e2e "boot failures": minimax preflights slower than kimi → more often crosses the 10m budget → flipped to `failed` → registers and serves fine while status never recovers. A live heartbeat is authoritative (the agent IS running), so recover `failed→online` (guarded with `AND status = 'failed'` so it can't override `removed`).

Test: `TestHeartbeatHandler_FailedToOnline` (mirrors the provisioning→online recovery test).

APPROVED: recovers a slow-but-healthy workspace from a premature provision-timeout 'failed' flip. Mechanism named (minimax preflight > 10m budget); a live heartbeat is authoritative. Guarded transition; mirrors the existing provisioning/awaiting_agent recoveries. Tested.
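For readers skimming the thread, the guarded recovery described in the PR description can be sketched roughly as below. This is an in-memory illustration only, not the handler code from this PR: the `Status` type, its constants, and `statusAfterHeartbeat` are hypothetical names, and the `switch` stands in for the guarded SQL condition (`AND status = 'failed'`) that the real change applies.

```go
package main

import "fmt"

// Status is a hypothetical type for illustration; the string values are the
// status names mentioned in the PR description.
type Status string

const (
	Offline       Status = "offline"
	Provisioning  Status = "provisioning"
	AwaitingAgent Status = "awaiting_agent"
	Failed        Status = "failed"
	Online        Status = "online"
	Removed       Status = "removed"
)

// statusAfterHeartbeat is a hypothetical helper that returns the status a
// workspace should hold once a live heartbeat has arrived.
// Only an explicit allow-list of states recovers to online; any other state
// (notably removed) is left untouched, which mirrors the guarded update.
func statusAfterHeartbeat(current Status) Status {
	switch current {
	case Offline, Provisioning, AwaitingAgent, Failed:
		// Failed is the newly recovered state: a live heartbeat means the
		// agent is running, so the timeout flip was premature.
		return Online
	default:
		return current
	}
}

func main() {
	for _, s := range []Status{Provisioning, Failed, Removed} {
		fmt.Printf("%s -> %s\n", s, statusAfterHeartbeat(s))
	}
}
```

The allow-list form is the point of the guard: adding `failed` to the recoverable set does not widen the transition to states such as `removed`, which a heartbeat must never override.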
APPROVED (security) — status state-machine only; guarded WHERE status='failed', no new surface.
b3da0c5cb4 to bde3248d2d
New commits pushed, approval review dismissed automatically according to repository settings
APPROVED on bde3248d2d: rebased onto clean main (earlier red was a clobbered base from a cross-branch cp; now purely additive, full handlers suite green locally).
APPROVED on bde3248d2d.