fix(workspace): recover status from "failed" on live heartbeat #2414

2026-06-07T22:11:02Z

core-devops commented

2026-06-07 22:11:02 +00:00

**Mechanism (named, not a flake):** the provision-timeout sweeper flips a workspace `provisioning`→`failed` at 10m (claude-code). A slow cold boot (EC2 image pull + LLM preflight) can finish after the flip and start heartbeating, but the heartbeat handler only recovered status from `offline`/`provisioning`/`awaiting_agent`→`online`, with **no `failed` branch**. `agent_card` is written unconditionally, so the workspace registered and served normally while its status stayed stuck at `failed` forever.

This is the root of the **intermittent multi-provider e2e "boot failures"**: minimax preflights slower than kimi, so it crosses the 10m budget more often, gets flipped to `failed`, then registers and serves fine while its status never recovers. A live heartbeat is authoritative (the agent IS running), so recover `failed`→`online`. The update is guarded with `AND status = 'failed'` so it can't override `removed`.

**Test:** `TestHeartbeatHandler_FailedToOnline` (mirrors the provisioning→online recovery test).
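As a reading aid, the transition rule described above can be sketched as a pure function. This is a minimal illustration, not the handler's actual code: the status strings come from the description, while the Go identifiers and the function `statusAfterHeartbeat` are hypothetical.

```go
package main

import "fmt"

// Status strings as spelled in the PR description. The Go constant names
// are hypothetical and exist only for this sketch.
const (
	StatusProvisioning  = "provisioning"
	StatusAwaitingAgent = "awaiting_agent"
	StatusOffline       = "offline"
	StatusFailed        = "failed"
	StatusOnline        = "online"
	StatusRemoved       = "removed"
)

// statusAfterHeartbeat returns the status a workspace should have once a
// live heartbeat arrives. Recoverable states move to online; every other
// state (notably removed) is returned unchanged.
func statusAfterHeartbeat(current string) string {
	switch current {
	case StatusOffline, StatusProvisioning, StatusAwaitingAgent, StatusFailed:
		// StatusFailed is the branch this PR adds.
		return StatusOnline
	default:
		return current
	}
}

func main() {
	for _, s := range []string{StatusProvisioning, StatusFailed, StatusRemoved} {
		fmt.Printf("%s -> %s\n", s, statusAfterHeartbeat(s))
	}
}
```

Before the fix, the equivalent of the `StatusFailed` case was missing, so `failed` fell through to the default branch and never changed.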

molecule-code-reviewer approved these changes 2026-06-07 22:11:19 +00:00

Dismissed

molecule-code-reviewer left a comment

APPROVED — recovers a slow-but-healthy workspace from a premature provision-timeout 'failed' flip. Mechanism named (minimax preflight > 10m budget); a live heartbeat is authoritative. Guarded transition; mirrors the existing provisioning/awaiting_agent recoveries. Tested.

core-security approved these changes 2026-06-07 22:11:23 +00:00

Dismissed

core-security left a comment

APPROVED (security) — status state-machine only; guarded WHERE status='failed', no new surface.
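For readers unfamiliar with guarded updates, the effect of `WHERE status='failed'` can be sketched with an in-memory stand-in for the table. Everything here is an assumption made for illustration: the `store` type, the `updateIfStatus` helper, the workspace IDs, and the table and column names in the SQL string. Only the `status = 'failed'` guard comes from this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// recoverFailedSQL shows a hypothetical statement shape. The table and
// column names are assumptions; only the status guard is from the PR.
const recoverFailedSQL = `UPDATE workspaces SET status = 'online' WHERE id = $1 AND status = 'failed'`

// store is an in-memory stand-in for the workspaces table.
type store struct {
	mu     sync.Mutex
	status map[string]string
}

// updateIfStatus mimics a guarded UPDATE: it writes next only when the
// row's current status equals want, and returns the number of rows
// changed (0 or 1).
func (s *store) updateIfStatus(id, want, next string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.status[id]; !ok || cur != want {
		return 0
	}
	s.status[id] = next
	return 1
}

func main() {
	s := &store{status: map[string]string{"ws-a": "failed", "ws-b": "removed"}}
	fmt.Println(recoverFailedSQL)
	// ws-a is failed, so the guard matches and the row is updated.
	fmt.Println(s.updateIfStatus("ws-a", "failed", "online"), s.status["ws-a"])
	// ws-b is removed, so the guard does not match and nothing changes.
	fmt.Println(s.updateIfStatus("ws-b", "failed", "online"), s.status["ws-b"])
}
```

Because the condition is evaluated as part of the write, a workspace that was moved to `removed` between the heartbeat arriving and the update running is left untouched.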

core-devops added 1 commit 2026-06-07 22:43:39 +00:00

fix(workspace): recover status from 'failed' on live heartbeat

ci-arm64-advisory / fast-checks (pull_request) Waiting to run

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s

CI / Detect changes (pull_request) Successful in 8s

CI / Python Lint & Test (pull_request) Successful in 9s

E2E API Smoke Test / detect-changes (pull_request) Successful in 15s

E2E Chat / detect-changes (pull_request) Successful in 15s

sop-tier-check / tier-check (pull_request_target) Has been cancelled

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s

Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s

Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s

CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s

CI / Canvas (Next.js) (pull_request) Successful in 2s

Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 15s

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s

gate-check-v3 / gate-check (pull_request_target) Successful in 6s

qa-review / approved (pull_request_target) Failing after 5s

security-review / approved (pull_request_target) Successful in 5s

sop-checklist / review-refire (pull_request_target) Has been skipped

sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2

sop-checklist / na-declarations (pull_request) N/A: (none)

sop-checklist / all-items-acked (pull_request_target) Successful in 5s

Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 27s

E2E Chat / E2E Chat (pull_request) Successful in 2s

qa-review / approved (pull_request_review) Has been skipped

security-review / approved (pull_request_review) Has been skipped

Harness Replays / detect-changes (pull_request) Successful in 33s

sop-tier-check / tier-check (pull_request_review) Failing after 5s

Harness Replays / Harness Replays (pull_request) Successful in 1s

CI / Canvas Deploy Status (pull_request) Successful in 11s

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 17s

E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 1m25s

E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 10s

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m1s

lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m26s

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m26s

CI / Platform (Go) (pull_request) Successful in 4m0s

CI / all-required (pull_request) Successful in 9s

E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Has been cancelled

E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been cancelled

E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Has been cancelled

E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Waiting to run

audit-force-merge / audit (pull_request_target) Successful in 6s

bde3248d2d

The provision-timeout sweeper (registry/provisiontimeout.go) flips a workspace
to 'failed' when it sits in 'provisioning' past DefaultProvisioningTimeout
(10m for claude-code). But a slow cold-boot — EC2 image pull + LLM preflight
on a cold worker — can finish AFTER the flip and start heartbeating. The
heartbeat handler already recovers status from offline/provisioning/
awaiting_agent → online, but had NO 'failed' branch, and agent_card is written
unconditionally — so a healthy, serving workspace stayed stuck showing
'failed' forever.

This is the mechanism behind the "intermittent multi-provider e2e boot
failures": minimax preflights slower than kimi, so its workspaces more often
cross the 10m budget, get flipped to 'failed', then register+serve fine while
status never recovers. A live heartbeat is authoritative (the agent IS
running), so recover 'failed' → 'online'. The `AND status = 'failed'` guard
keeps it conditional (won't override 'removed').

Test: TestHeartbeatHandler_FailedToOnline (mirrors the provisioning→online
recovery test). Not a flake — the mechanism is named and fixed.

core-devops force-pushed fix/recover-workspace-from-failed from b3da0c5cb4 to bde3248d2d

2026-06-07 22:43:39 +00:00

core-devops dismissed molecule-code-reviewer's review 2026-06-07 22:43:39 +00:00

Reason:

New commits pushed, approval review dismissed automatically according to repository settings

core-devops dismissed core-security's review 2026-06-07 22:43:39 +00:00

Reason:

New commits pushed, approval review dismissed automatically according to repository settings

molecule-code-reviewer approved these changes 2026-06-07 22:43:57 +00:00

molecule-code-reviewer left a comment

APPROVED on bde3248d2d02ec03fa868d41968cd71a82c994e5 — rebased onto clean main (earlier red was a clobbered base from a cross-branch cp; now purely additive, full handlers suite green locally).

core-security approved these changes 2026-06-07 22:44:02 +00:00

core-security left a comment

APPROVED on bde3248d2d02ec03fa868d41968cd71a82c994e5.

core-devops merged commit 4ddc93ef88 into main

2026-06-07 22:48:22 +00:00

core-devops deleted branch fix/recover-workspace-from-failed

2026-06-07 22:48:23 +00:00

Reference: molecule-ai/molecule-core#2414