fix(workspace): recover status from "failed" on live heartbeat #2414

Merged
core-devops merged 1 commit from fix/recover-workspace-from-failed into main 2026-06-07 22:48:22 +00:00

**Mechanism (named, not a flake):** the provision-timeout sweeper flips a workspace `provisioning`→`failed` at 10m (claude-code). A slow cold-boot (EC2 image pull + LLM preflight) can finish AFTER the flip and start heartbeating, but the heartbeat handler recovered status from offline/provisioning/awaiting_agent→online, with **no `failed` branch**. agent_card is written unconditionally, so a healthy, serving workspace stayed stuck showing `failed` forever.

This is the root of the **intermittent multi-provider e2e "boot failures"**: minimax preflights slower than kimi → more often crosses the 10m budget → flipped to `failed` → registers+serves fine while status never recovers. A live heartbeat is authoritative (the agent IS running), so recover `failed`→`online` (guarded `AND status = 'failed'` so it can't override `removed`).

Test: `TestHeartbeatHandler_FailedToOnline` (mirrors the provisioning→online recovery test).
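The recovery rule described above can be sketched as a small pure function. This is an illustration only: the real handler applies guarded SQL updates, and the constant names and the `statusAfterHeartbeat` helper are hypothetical, not taken from the repository.

```go
package main

import "fmt"

// Status values named in the PR description. The constant names are illustrative.
const (
	StatusOffline       = "offline"
	StatusProvisioning  = "provisioning"
	StatusAwaitingAgent = "awaiting_agent"
	StatusFailed        = "failed"
	StatusOnline        = "online"
	StatusRemoved       = "removed"
)

// statusAfterHeartbeat returns the status a workspace should hold once a live
// heartbeat arrives. Hypothetical helper, not the handler's real signature.
func statusAfterHeartbeat(current string) string {
	switch current {
	case StatusOffline, StatusProvisioning, StatusAwaitingAgent, StatusFailed:
		// A live heartbeat is treated as authoritative for these states.
		return StatusOnline
	default:
		// "removed" (and any unrecognized status) is never overridden.
		return current
	}
}

func main() {
	for _, s := range []string{StatusProvisioning, StatusFailed, StatusRemoved} {
		fmt.Printf("%s -> %s\n", s, statusAfterHeartbeat(s))
	}
}
```

Listing the recoverable states explicitly, rather than recovering from everything except `removed`, mirrors the per-state guard in the description: a status added later would not be silently promoted to `online`.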
molecule-code-reviewer approved these changes 2026-06-07 22:11:19 +00:00
Dismissed
molecule-code-reviewer left a comment

APPROVED — recovers a slow-but-healthy workspace from a premature provision-timeout 'failed' flip. Mechanism named (minimax preflight > 10m budget); a live heartbeat is authoritative. Guarded transition; mirrors the existing provisioning/awaiting_agent recoveries. Tested.

core-security approved these changes 2026-06-07 22:11:23 +00:00
Dismissed
core-security left a comment

APPROVED (security) — status state-machine only; guarded WHERE status='failed', no new surface.

core-devops added 1 commit 2026-06-07 22:43:39 +00:00
fix(workspace): recover status from 'failed' on live heartbeat
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 15s
E2E Chat / detect-changes (pull_request) Successful in 15s
sop-tier-check / tier-check (pull_request_target) Has been cancelled
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 2s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 15s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
gate-check-v3 / gate-check (pull_request_target) Successful in 6s
qa-review / approved (pull_request_target) Failing after 5s
security-review / approved (pull_request_target) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 27s
E2E Chat / E2E Chat (pull_request) Successful in 2s
qa-review / approved (pull_request_review) Has been skipped
security-review / approved (pull_request_review) Has been skipped
Harness Replays / detect-changes (pull_request) Successful in 33s
sop-tier-check / tier-check (pull_request_review) Failing after 5s
Harness Replays / Harness Replays (pull_request) Successful in 1s
CI / Canvas Deploy Status (pull_request) Successful in 11s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 17s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 1m25s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m1s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m26s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m26s
CI / Platform (Go) (pull_request) Successful in 4m0s
CI / all-required (pull_request) Successful in 9s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Has been cancelled
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Waiting to run
audit-force-merge / audit (pull_request_target) Successful in 6s
bde3248d2d
The provision-timeout sweeper (registry/provisiontimeout.go) flips a workspace
to 'failed' when it sits in 'provisioning' past DefaultProvisioningTimeout
(10m for claude-code). But a slow cold-boot — EC2 image pull + LLM preflight
on a cold worker — can finish AFTER the flip and start heartbeating. The
heartbeat handler already recovers status from offline/provisioning/
awaiting_agent → online, but had NO 'failed' branch, and agent_card is written
unconditionally — so a healthy, serving workspace stayed stuck showing
'failed' forever.

This is the mechanism behind the "intermittent multi-provider e2e boot
failures": minimax preflights slower than kimi, so its workspaces more often
cross the 10m budget, get flipped to 'failed', then register+serve fine while
status never recovers. A live heartbeat is authoritative (the agent IS
running), so recover 'failed' → 'online'. The `AND status = 'failed'` guard
keeps it conditional (won't override 'removed').

Test: TestHeartbeatHandler_FailedToOnline (mirrors the provisioning→online
recovery test). Not a flake — the mechanism is named and fixed.
core-devops force-pushed fix/recover-workspace-from-failed from b3da0c5cb4 to bde3248d2d 2026-06-07 22:43:39 +00:00 Compare
core-devops dismissed molecule-code-reviewer's review 2026-06-07 22:43:39 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

core-devops dismissed core-security's review 2026-06-07 22:43:39 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

molecule-code-reviewer approved these changes 2026-06-07 22:43:57 +00:00
molecule-code-reviewer left a comment

APPROVED on bde3248d2d — rebased onto clean main (earlier red was a clobbered base from a cross-branch cp; now purely additive, full handlers suite green locally).

core-security approved these changes 2026-06-07 22:44:02 +00:00
core-security left a comment

APPROVED on bde3248d2d.

core-devops merged commit 4ddc93ef88 into main 2026-06-07 22:48:22 +00:00
core-devops deleted branch fix/recover-workspace-from-failed 2026-06-07 22:48:23 +00:00
Reference: molecule-ai/molecule-core#2414