feat(registry): reconcile online workspaces against real EC2 state — auto-heal terminated instances (core#2261) #2266
Root cause (core#2247)
Every existing liveness sweep in workspace-server keys off a proxy for "is this workspace alive?":
- `StartLivenessMonitor`: Redis TTL expiry (agent stopped heartbeating)
- `StartHealthSweep` (Docker pass): local Docker daemon, only when `prov != nil`
- `StartHealthSweep` (remote pass): `last_heartbeat_at` freshness for `runtime='external'`
- `StartCPOrphanSweeper`: `status='removed'` rows with a stray `instance_id`

A SaaS `claude-code` workspace whose EC2 was terminated/stopped out from under us (manual AWS action, spot reclaim, CP-side reap) falls through all of them: it isn't `removed`, isn't `external`, and on a pure-SaaS front-door `prov == nil`, so the Docker pass never runs. The registry kept `status=online` pointing at a dead `instance_id` forever. CTO framing: "it shouldn't be pointing at a dead one at all."

The fix
New `StartCPInstanceReconciler` (`workspace-server/internal/registry/cp_instance_reconciler.go`): a 60s sweep that asks the one authoritative question the others lack, `CPProvisioner.IsRunning`, which ultimately asks the control-plane "is this EC2 actually running?" (DescribeInstances-equivalent). On a clean "not running" it feeds the workspace into the existing offline + auto-heal machinery via the same `onWorkspaceOffline` closure the other sweeps use: no new healing path, just real ground truth driving the one we already have. `onWorkspaceOffline` flips the row offline and runs `go wh.RestartByID(...)`, which reprovisions with the existing volume.

Query (online + SaaS EC2 only)
`runtime='external'` rows are owned by the remote-heartbeat pass; paused/hibernated/removed/provisioning/awaiting_agent are excluded by the status filter.

Guardrails
- `IsRunning` returns `(true, err)` on any transient DB/transport error and `(false, nil)` only when CP genuinely reports not-running. The reconciler acts strictly on `(false, nil)`; any error short-circuits to "leave it online" so a CP blip never cascades healthy workspaces into reprovision. Covered by `TestReconcileOnce_TransientError_DoesNotFlip`.
- `LIMIT 200` + per-workspace 10s timeout so one slow CP call can't stall the sweep.

Wiring
Gated identically to `cp-orphan-sweeper` in `cmd/server/main.go`, reusing the same `onWorkspaceOffline` closure.

Tests
`cp_instance_reconciler_test.go` (sqlmock + fake checker), mirroring `cp_orphan_sweeper_test.go`: not-running→flip, running→no-flip, transient-error→no-flip (fail-safe), query-scope excludes external/non-online, mixed batch, query-error, nil-DB, nil-checker disabled, runs-once-and-exits-on-cancel.

`go build ./...`, `go vet ./internal/registry/...`, `go test ./internal/registry/...` all green; gofmt clean on touched files only.

Refs core#2261, core#2247.
DO NOT MERGE — heavy core SOP gate.
🤖 Generated with Claude Code
QA (core#2261 instance reconciler). Independently ran the 9 registry tests — all pass incl. the fail-safe (transient IsRunning err → no flip) and query-scope (online+SaaS only, excludes external/non-online). Logic verified: acts ONLY on (false,nil), per-workspace timeout, LIMIT 200, reuses onWorkspaceOffline auto-heal. Approve.
Security (core#2261). No new external surface or creds. Auto-heal reuses the existing onWorkspaceOffline→RestartByID path (existing-volume reprovision); fail-safe IsRunning prevents flipping a healthy workspace on transient errors; scope strictly online+SaaS (excludes paused/hibernated/removed). DoS-safe: per-cycle LIMIT + per-workspace timeout. Approve.