fix(registry): reconciler — guard TOCTOU false-flip + cover degraded + cycle deadline (core#2261 review) #2283
Reference in New Issue
Block a user
Delete Branch "fix/core2261-reconciler-toctou-degraded-hardening"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Hardens the core#2261 instance-state reconciler against three review findings. This path runs against real customer SaaS workspaces — correctness over coverage, fail-safe toward NOT flipping.
[HIGH-1] TOCTOU false-flip (the critical one)
reconcileOncedidSELECT id … WHERE status='online' AND instance_id!='', thenchecker.IsRunning(id)— which independently re-resolvesinstance_idviaresolveInstanceID. Ifinstance_idis cleared/NULLed or the row deleted/reprovisioned between the two reads (concurrent delete, the CP-orphan-sweeper NULLing it, a reprovision),resolveInstanceIDreturns("",nil)→IsRunningreturns(false,nil)→ the reconciler flipped a workspace whose EC2 is not proven dead and firedRestartByIDon a possibly-just-deleted row.Fix: capture
instance_idin the SELECT (SELECT id::text, instance_id). AfterIsRunningreturns(false, nil), do a guarded re-confirm beforeonOffline: re-read the row's CURRENT(status, instance_id)with a short-timeout PK lookup, and flip only if the row still exists and currentstatus IN ('online','degraded')and currentinstance_id != ''and currentinstance_id == the originally-SELECTed instance_id. Any divergence (row gone, status moved off online/degraded, instance_id cleared or different) or a re-confirm DB error → log + skip. This guarantees we flip only when the same non-empty instance we asked CP about is still the workspace's recorded instance and still online/degraded. Mirrors healthsweep's re-confirm-before-flip.[MED-3] degraded scope
A SaaS workspace the heartbeat handler set to
degradedthat then loses its EC2 fell through every sweep. SELECT widened tostatus IN ('online','degraded')(keepsinstance_id IS NOT NULL AND instance_id != '' AND COALESCE(runtime,'') <> 'external'). Matches healthsweep's status set. The re-confirm already allows both.[MED-2] per-cycle deadline
Only per-workspace 10s timeouts existed; a degraded-but-not-erroring CP (each
IsRunning~8s, under the 10s cap) × 200 rows ≈ ~33min/cycle. Wrapped row processing incycleCtxwithcpInstanceCycleDeadline = 45s(under the 60s interval); per-workspace timeouts derive from it; the loop breaks and defers the backlog to the next cycle when the deadline is hit (logs processed/skipped). Mirrorscp_orphan_sweeper.IsRunningitself is unchanged (a2a_proxy + healthsweep also call it). Existing fail-safe-on-error behavior (err != nil → never flip) is preserved exactly.Tests
TOCTOU_InstanceChanged / InstanceCleared / StatusMoved / RowGone—(false,nil)but re-confirm shows a different/empty instance, an out-of-scope status, or no row → assert NOonOffline(HIGH-1 regression guards).Degraded_FlipsOffline— adegradedrow + not-running + re-confirm same instance → flips.ReconfirmDBError_DoesNotFlip— re-confirm read errors → fail-safe skip.NotRunning_FlipsOffline/MixedBatch— happy path now registers the re-confirm showing the same instance.status IN ('online','degraded')+ theinstance_idSELECT column; transient-error fail-safe test preserved.go build ./...,go vet ./internal/registry/,go test ./internal/registry/...all green.Refs core#2261.
DO NOT MERGE — heavy core SOP gate; awaiting review + CTO sign-off.
🤖 Generated with Claude Code
Hardens the core#2261 instance-state reconciler against three review findings on a path that runs against real customer SaaS workspaces. [HIGH-1] TOCTOU false-flip. reconcileOnce SELECTed ids, then called IsRunning which INDEPENDENTLY re-resolves instance_id (resolveInstanceID). If instance_id was cleared/NULLed or the row deleted/reprovisioned between the two reads, IsRunning returns a STALE (false, nil) that reflects a missing instance_id — NOT a confirmed-terminated EC2 — and we'd flip a workspace whose EC2 is not proven dead (and fire RestartByID on a maybe- just-deleted row). Fix: capture instance_id in the SELECT, and after a (false, nil) re-confirm the row's CURRENT (status, instance_id) with a short-timeout primary-key read; flip ONLY when the row still exists, is still online/degraded, and still records the SAME non-empty instance we asked CP about. Any divergence (row gone, status moved, instance_id cleared/changed) or a re-confirm DB error → skip (fail-safe toward NOT flipping). Mirrors healthsweep's guarded-write re-confirm. [MED-3] degraded scope. Widen the SELECT to status IN ('online', 'degraded') so a SaaS workspace the heartbeat handler flipped degraded, then lost its EC2, is reconciled instead of falling through every sweep. Matches healthsweep's status set. [MED-2] per-cycle deadline. Wrap row processing in a cycleCtx with a 45s cpInstanceCycleDeadline (under the 60s interval); per-workspace IsRunning timeouts derive from it; break and defer the backlog if the cycle blows its deadline. Mirrors cp_orphan_sweeper. Prevents a degraded-but-not- erroring CP (slow-but-under-cap IsRunning × 200 rows) from dragging one cycle to ~33min. IsRunning is unchanged (a2a_proxy + healthsweep also call it). Existing fail-safe-on-error behavior (err != nil → never flip) is preserved. Tests: TOCTOU guards (instance changed / cleared / status moved / row gone — all assert NO flip), degraded flips, re-confirm DB-error fail-safe, happy re-confirm; updated the scope regex for the new status IN (...) + instance_id column. Refs core#2261. DO NOT MERGE until heavy core SOP gate clears. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>Reconciler hardening (core#2261 review HIGH-1 TOCTOU + MED-3 degraded + MED-2 cycle deadline). Independently verified the re-confirm guard (re-reads status+instance_id, flips only on same-instance still-online/degraded; DB-error/divergence → skip) + ran all 15 tests incl. 4 TOCTOU guards. Approve.
Reconciler hardening (core#2261 review HIGH-1 TOCTOU + MED-3 degraded + MED-2 cycle deadline). Independently verified the re-confirm guard (re-reads status+instance_id, flips only on same-instance still-online/degraded; DB-error/divergence → skip) + ran all 15 tests incl. 4 TOCTOU guards. Approve.