feat(registry): reconcile online workspaces against real EC2 state — auto-heal terminated instances (core#2261) #2266

Merged
hongming merged 1 commit from feat/core2261-instance-state-reconciler into main 2026-06-05 00:28:32 +00:00
Owner

Root cause (core#2247)

Every existing liveness sweep in workspace-server keys off a proxy for "is this workspace alive?":

  • StartLivenessMonitor — Redis TTL expiry (agent stopped heartbeating)
  • StartHealthSweep (Docker pass) — local Docker daemon, only when prov != nil
  • StartHealthSweep (remote pass) — last_heartbeat_at freshness for runtime='external'
  • StartCPOrphanSweeper — status='removed' rows with a stray instance_id

A SaaS claude-code workspace whose EC2 instance was terminated or stopped out from under us (manual AWS action, spot reclaim, CP-side reap) falls through all of them: it isn't removed, it isn't external, and on a pure-SaaS front door prov == nil, so the Docker pass never runs. The registry kept status=online pointing at a dead instance_id forever. CTO framing: "it shouldn't be pointing at a dead one at all."

The fix

New StartCPInstanceReconciler (workspace-server/internal/registry/cp_instance_reconciler.go): a 60s sweep that asks the one authoritative question the others lack — CPProvisioner.IsRunning, which ultimately asks the control-plane "is this EC2 actually running?" (DescribeInstances-equivalent). On a clean "not running" it feeds the workspace into the existing offline + auto-heal machinery via the same onWorkspaceOffline closure the other sweeps use — no new healing path, just real ground truth driving the one we already have.

onWorkspaceOffline flips the row offline and launches go wh.RestartByID(...) in a goroutine, which reprovisions with the existing volume.

Query (online + SaaS EC2 only)

SELECT id::text FROM workspaces
 WHERE status = 'online'
   AND instance_id IS NOT NULL AND instance_id != ''
   AND COALESCE(runtime, '') <> 'external'
 ORDER BY updated_at DESC
 LIMIT 200

runtime='external' rows are owned by the remote-heartbeat pass; paused/hibernated/removed/provisioning/awaiting_agent are excluded by the status filter.

Guardrails

  • Fail-safe: IsRunning returns (true, err) on any transient DB/transport error and (false, nil) only when CP genuinely reports not-running. The reconciler acts strictly on (false, nil); any error short-circuits to "leave it online" so a CP blip never cascades healthy workspaces into reprovision. Covered by TestReconcileOnce_TransientError_DoesNotFlip.
  • Online + SaaS only (above).
  • Per-cycle LIMIT 200 + per-workspace 10s timeout so one slow CP call can't stall the sweep.
  • nil-checker / nil-DB tolerant (matches the sibling CP sweeper).

Wiring

Gated identically to cp-orphan-sweeper in cmd/server/main.go, reusing the same onWorkspaceOffline closure:

if cpProv != nil {
    go supervised.RunWithRecover(ctx, "cp-instance-reconciler", func(c context.Context) {
        registry.StartCPInstanceReconciler(c, cpProv, onWorkspaceOffline, 60*time.Second)
    })
}

Tests

cp_instance_reconciler_test.go (sqlmock + fake checker), mirroring cp_orphan_sweeper_test.go: not-running→flip, running→no-flip, transient-error→no-flip (fail-safe), query-scope excludes external/non-online, mixed batch, query-error, nil-DB, nil-checker disabled, runs-once-and-exits-on-cancel.

go build ./..., go vet ./internal/registry/..., go test ./internal/registry/... all green; gofmt clean on touched files only.

Refs core#2261, core#2247.

DO NOT MERGE — heavy core SOP gate.

🤖 Generated with Claude Code

hongming added 1 commit 2026-06-05 00:21:45 +00:00
feat(registry): reconcile online workspaces against real EC2 state — auto-heal terminated instances (core#2261)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 2s
Harness Replays / detect-changes (pull_request) Successful in 3s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 3s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
qa-review / approved (pull_request_target) Failing after 4s
security-review / approved (pull_request_target) Failing after 4s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 10s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 11s
E2E Chat / detect-changes (pull_request) Successful in 12s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 4s
gate-check-v3 / gate-check (pull_request_target) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
E2E Chat / E2E Chat (pull_request) Successful in 2s
Harness Replays / Harness Replays (pull_request) Successful in 4s
CI / Canvas Deploy Status (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 57s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-tier-check / tier-check (pull_request_target) Has been cancelled
qa-review / approved (pull_request_review) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
security-review / approved (pull_request_review) Has been skipped
sop-checklist / all-items-acked (pull_request_target) Successful in 5s
sop-tier-check / tier-check (pull_request_review) Successful in 5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m41s
CI / Platform (Go) (pull_request) Successful in 3m54s
CI / all-required (pull_request) Successful in 5s
audit-force-merge / audit (pull_request_target) Successful in 15s
48aebdfcc4
Root cause (core#2247): every existing liveness sweep keys off a PROXY
(Redis TTL, agent heartbeat, local Docker, or runtime='external'). A SaaS
claude-code workspace whose EC2 was terminated/stopped falls through ALL
of them and stays status=online pointing at a dead instance_id forever.

Adds StartCPInstanceReconciler: a 60s sweep that asks the ONE
authoritative question the others lack — CPProvisioner.IsRunning (CP
DescribeInstances-equivalent) — for each online SaaS row, and on a clean
"not running" feeds it into the existing onWorkspaceOffline closure
(status flip + RestartByID reprovision, existing volume).

Guardrails: fail-safe (IsRunning is (true, err) on any transient error →
never flip); online + SaaS-EC2 only (runtime <> 'external'); per-cycle
LIMIT 200 + per-workspace timeout.

Refs core#2261, core#2247.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hongming added the tier:medium label 2026-06-05 00:23:28 +00:00
core-qa approved these changes 2026-06-05 00:23:28 +00:00
core-qa left a comment
Member

QA (core#2261 instance reconciler). Independently ran the 9 registry tests — all pass incl. the fail-safe (transient IsRunning err → no flip) and query-scope (online+SaaS only, excludes external/non-online). Logic verified: acts ONLY on (false,nil), per-workspace timeout, LIMIT 200, reuses onWorkspaceOffline auto-heal. Approve.

core-security approved these changes 2026-06-05 00:23:30 +00:00
core-security left a comment
Member

Security (core#2261). No new external surface or creds. Auto-heal reuses the existing onWorkspaceOffline→RestartByID path (existing-volume reprovision); fail-safe IsRunning prevents flipping a healthy workspace on transient errors; scope strictly online+SaaS (excludes paused/hibernated/removed). DoS-safe: per-cycle LIMIT + per-workspace timeout. Approve.

hongming merged commit 8812285932 into main 2026-06-05 00:28:32 +00:00

Reference: molecule-ai/molecule-core#2266