fix(registry): reconciler — guard TOCTOU false-flip + cover degraded + cycle deadline (core#2261 review) #2283

Merged
hongming merged 1 commits from fix/core2261-reconciler-toctou-degraded-hardening into main 2026-06-05 03:36:59 +00:00
Owner

Hardens the core#2261 instance-state reconciler against three review findings. This path runs against real customer SaaS workspaces — correctness over coverage, fail-safe toward NOT flipping.

[HIGH-1] TOCTOU false-flip (the critical one)

reconcileOnce did SELECT id … WHERE status='online' AND instance_id!='', then checker.IsRunning(id) — which independently re-resolves instance_id via resolveInstanceID. If instance_id is cleared/NULLed or the row deleted/reprovisioned between the two reads (concurrent delete, the CP-orphan-sweeper NULLing it, a reprovision), resolveInstanceID returns ("",nil)IsRunning returns (false,nil) → the reconciler flipped a workspace whose EC2 is not proven dead and fired RestartByID on a possibly-just-deleted row.

Fix: capture instance_id in the SELECT (SELECT id::text, instance_id). After IsRunning returns (false, nil), do a guarded re-confirm before onOffline: re-read the row's CURRENT (status, instance_id) with a short-timeout PK lookup, and flip only if the row still exists and current status IN ('online','degraded') and current instance_id != '' and current instance_id == the originally-SELECTed instance_id. Any divergence (row gone, status moved off online/degraded, instance_id cleared or different) or a re-confirm DB error → log + skip. This guarantees we flip only when the same non-empty instance we asked CP about is still the workspace's recorded instance and still online/degraded. Mirrors healthsweep's re-confirm-before-flip.

[MED-3] degraded scope

A SaaS workspace the heartbeat handler set to degraded that then loses its EC2 fell through every sweep. SELECT widened to status IN ('online','degraded') (keeps instance_id IS NOT NULL AND instance_id != '' AND COALESCE(runtime,'') <> 'external'). Matches healthsweep's status set. The re-confirm already allows both.

[MED-2] per-cycle deadline

Only per-workspace 10s timeouts existed; a degraded-but-not-erroring CP (each IsRunning ~8s, under the 10s cap) × 200 rows ≈ ~33min/cycle. Wrapped row processing in cycleCtx with cpInstanceCycleDeadline = 45s (under the 60s interval); per-workspace timeouts derive from it; the loop breaks and defers the backlog to the next cycle when the deadline is hit (logs processed/skipped). Mirrors cp_orphan_sweeper.

IsRunning itself is unchanged (a2a_proxy + healthsweep also call it). Existing fail-safe-on-error behavior (err != nil → never flip) is preserved exactly.

Tests

  • TOCTOU_InstanceChanged / InstanceCleared / StatusMoved / RowGone(false,nil) but re-confirm shows a different/empty instance, an out-of-scope status, or no row → assert NO onOffline (HIGH-1 regression guards).
  • Degraded_FlipsOffline — a degraded row + not-running + re-confirm same instance → flips.
  • ReconfirmDBError_DoesNotFlip — re-confirm read errors → fail-safe skip.
  • NotRunning_FlipsOffline / MixedBatch — happy path now registers the re-confirm showing the same instance.
  • Updated the scope regex for the new status IN ('online','degraded') + the instance_id SELECT column; transient-error fail-safe test preserved.

go build ./..., go vet ./internal/registry/, go test ./internal/registry/... all green.

Refs core#2261.

DO NOT MERGE — heavy core SOP gate; awaiting review + CTO sign-off.

🤖 Generated with Claude Code

Hardens the core#2261 instance-state reconciler against three review findings. This path runs against **real customer SaaS workspaces** — correctness over coverage, fail-safe toward NOT flipping. ### [HIGH-1] TOCTOU false-flip (the critical one) `reconcileOnce` did `SELECT id … WHERE status='online' AND instance_id!=''`, then `checker.IsRunning(id)` — which **independently re-resolves** `instance_id` via `resolveInstanceID`. If `instance_id` is cleared/NULLed or the row deleted/reprovisioned **between the two reads** (concurrent delete, the CP-orphan-sweeper NULLing it, a reprovision), `resolveInstanceID` returns `("",nil)` → `IsRunning` returns `(false,nil)` → the reconciler **flipped a workspace whose EC2 is not proven dead** and fired `RestartByID` on a possibly-just-deleted row. **Fix:** capture `instance_id` in the SELECT (`SELECT id::text, instance_id`). After `IsRunning` returns `(false, nil)`, do a **guarded re-confirm** before `onOffline`: re-read the row's CURRENT `(status, instance_id)` with a short-timeout PK lookup, and flip **only if** the row still exists **and** current `status IN ('online','degraded')` **and** current `instance_id != ''` **and** current `instance_id == the originally-SELECTed instance_id`. Any divergence (row gone, status moved off online/degraded, instance_id cleared or different) **or a re-confirm DB error** → log + skip. This guarantees we flip only when the **same non-empty instance** we asked CP about is still the workspace's recorded instance and still online/degraded. Mirrors healthsweep's re-confirm-before-flip. ### [MED-3] degraded scope A SaaS workspace the heartbeat handler set to `degraded` that then loses its EC2 fell through every sweep. SELECT widened to `status IN ('online','degraded')` (keeps `instance_id IS NOT NULL AND instance_id != '' AND COALESCE(runtime,'') <> 'external'`). Matches healthsweep's status set. The re-confirm already allows both. ### [MED-2] per-cycle deadline Only per-workspace 10s timeouts existed; a degraded-but-not-erroring CP (each `IsRunning` ~8s, under the 10s cap) × 200 rows ≈ ~33min/cycle. Wrapped row processing in `cycleCtx` with `cpInstanceCycleDeadline = 45s` (under the 60s interval); per-workspace timeouts derive from it; the loop breaks and defers the backlog to the next cycle when the deadline is hit (logs processed/skipped). Mirrors `cp_orphan_sweeper`. `IsRunning` itself is **unchanged** (a2a_proxy + healthsweep also call it). Existing fail-safe-on-error behavior (`err != nil → never flip`) is preserved exactly. ### Tests - `TOCTOU_InstanceChanged / InstanceCleared / StatusMoved / RowGone` — `(false,nil)` but re-confirm shows a different/empty instance, an out-of-scope status, or no row → assert **NO** `onOffline` (HIGH-1 regression guards). - `Degraded_FlipsOffline` — a `degraded` row + not-running + re-confirm same instance → flips. - `ReconfirmDBError_DoesNotFlip` — re-confirm read errors → fail-safe skip. - `NotRunning_FlipsOffline` / `MixedBatch` — happy path now registers the re-confirm showing the same instance. - Updated the scope regex for the new `status IN ('online','degraded')` + the `instance_id` SELECT column; transient-error fail-safe test preserved. `go build ./...`, `go vet ./internal/registry/`, `go test ./internal/registry/...` all green. Refs core#2261. **DO NOT MERGE** — heavy core SOP gate; awaiting review + CTO sign-off. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
hongming added 1 commit 2026-06-05 03:30:39 +00:00
fix(registry): reconciler — guard TOCTOU false-flip + cover degraded + cycle deadline (core#2261 review)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 1s
CI / Python Lint & Test (pull_request) Successful in 4s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 20s
E2E Chat / detect-changes (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s
qa-review / approved (pull_request_target) Failing after 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
gate-check-v3 / gate-check (pull_request_target) Successful in 9s
security-review / approved (pull_request_target) Failing after 15s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
Harness Replays / Harness Replays (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 1s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
E2E Chat / E2E Chat (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m1s
sop-tier-check / tier-check (pull_request_target) Has been cancelled
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 57s
CI / Canvas Deploy Status (pull_request) Has been skipped
sop-checklist / review-refire (pull_request_target) Has been skipped
qa-review / approved (pull_request_review) Has been skipped
security-review / approved (pull_request_review) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 8s
sop-tier-check / tier-check (pull_request_review) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 4m1s
CI / all-required (pull_request) Successful in 13s
audit-force-merge / audit (pull_request_target) Successful in 11s
E2E Staging Reconciler (heals terminated EC2) / pr-validate (pull_request) Waiting to run
E2E Staging Reconciler (heals terminated EC2) / E2E Staging Reconciler (pull_request) Waiting to run
7c6986a96b
Hardens the core#2261 instance-state reconciler against three review
findings on a path that runs against real customer SaaS workspaces.

[HIGH-1] TOCTOU false-flip. reconcileOnce SELECTed ids, then called
IsRunning which INDEPENDENTLY re-resolves instance_id (resolveInstanceID).
If instance_id was cleared/NULLed or the row deleted/reprovisioned between
the two reads, IsRunning returns a STALE (false, nil) that reflects a
missing instance_id — NOT a confirmed-terminated EC2 — and we'd flip a
workspace whose EC2 is not proven dead (and fire RestartByID on a maybe-
just-deleted row). Fix: capture instance_id in the SELECT, and after a
(false, nil) re-confirm the row's CURRENT (status, instance_id) with a
short-timeout primary-key read; flip ONLY when the row still exists, is
still online/degraded, and still records the SAME non-empty instance we
asked CP about. Any divergence (row gone, status moved, instance_id
cleared/changed) or a re-confirm DB error → skip (fail-safe toward NOT
flipping). Mirrors healthsweep's guarded-write re-confirm.

[MED-3] degraded scope. Widen the SELECT to status IN ('online',
'degraded') so a SaaS workspace the heartbeat handler flipped degraded,
then lost its EC2, is reconciled instead of falling through every sweep.
Matches healthsweep's status set.

[MED-2] per-cycle deadline. Wrap row processing in a cycleCtx with a 45s
cpInstanceCycleDeadline (under the 60s interval); per-workspace IsRunning
timeouts derive from it; break and defer the backlog if the cycle blows
its deadline. Mirrors cp_orphan_sweeper. Prevents a degraded-but-not-
erroring CP (slow-but-under-cap IsRunning × 200 rows) from dragging one
cycle to ~33min.

IsRunning is unchanged (a2a_proxy + healthsweep also call it). Existing
fail-safe-on-error behavior (err != nil → never flip) is preserved.

Tests: TOCTOU guards (instance changed / cleared / status moved / row
gone — all assert NO flip), degraded flips, re-confirm DB-error fail-safe,
happy re-confirm; updated the scope regex for the new
status IN (...) + instance_id column.

Refs core#2261. DO NOT MERGE until heavy core SOP gate clears.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hongming added the tier:medium label 2026-06-05 03:32:35 +00:00
core-qa approved these changes 2026-06-05 03:32:35 +00:00
core-qa left a comment
Member

Reconciler hardening (core#2261 review HIGH-1 TOCTOU + MED-3 degraded + MED-2 cycle deadline). Independently verified the re-confirm guard (re-reads status+instance_id, flips only on same-instance still-online/degraded; DB-error/divergence → skip) + ran all 15 tests incl. 4 TOCTOU guards. Approve.

Reconciler hardening (core#2261 review HIGH-1 TOCTOU + MED-3 degraded + MED-2 cycle deadline). Independently verified the re-confirm guard (re-reads status+instance_id, flips only on same-instance still-online/degraded; DB-error/divergence → skip) + ran all 15 tests incl. 4 TOCTOU guards. Approve.
core-security approved these changes 2026-06-05 03:32:37 +00:00
core-security left a comment
Member

Reconciler hardening (core#2261 review HIGH-1 TOCTOU + MED-3 degraded + MED-2 cycle deadline). Independently verified the re-confirm guard (re-reads status+instance_id, flips only on same-instance still-online/degraded; DB-error/divergence → skip) + ran all 15 tests incl. 4 TOCTOU guards. Approve.

Reconciler hardening (core#2261 review HIGH-1 TOCTOU + MED-3 degraded + MED-2 cycle deadline). Independently verified the re-confirm guard (re-reads status+instance_id, flips only on same-instance still-online/degraded; DB-error/divergence → skip) + ran all 15 tests incl. 4 TOCTOU guards. Approve.
hongming merged commit c8efa8f82a into main 2026-06-05 03:36:59 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2283