Files
hongming-personal 7c6986a96b
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 1s
CI / Python Lint & Test (pull_request) Successful in 4s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 20s
E2E Chat / detect-changes (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s
qa-review / approved (pull_request_target) Failing after 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
gate-check-v3 / gate-check (pull_request_target) Successful in 9s
security-review / approved (pull_request_target) Failing after 15s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
Harness Replays / Harness Replays (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 1s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
E2E Chat / E2E Chat (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m1s
sop-tier-check / tier-check (pull_request_target) Has been cancelled
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 57s
CI / Canvas Deploy Status (pull_request) Has been skipped
sop-checklist / review-refire (pull_request_target) Has been skipped
qa-review / approved (pull_request_review) Has been skipped
security-review / approved (pull_request_review) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 8s
sop-tier-check / tier-check (pull_request_review) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 4m1s
CI / all-required (pull_request) Successful in 13s
audit-force-merge / audit (pull_request_target) Successful in 11s
E2E Staging Reconciler (heals terminated EC2) / pr-validate (pull_request) Waiting to run
E2E Staging Reconciler (heals terminated EC2) / E2E Staging Reconciler (pull_request) Waiting to run
fix(registry): reconciler — guard TOCTOU false-flip + cover degraded + cycle deadline (core#2261 review)
Hardens the core#2261 instance-state reconciler against three review
findings on a path that runs against real customer SaaS workspaces.

[HIGH-1] TOCTOU false-flip. reconcileOnce SELECTed ids, then called
IsRunning which INDEPENDENTLY re-resolves instance_id (resolveInstanceID).
If instance_id was cleared/NULLed or the row deleted/reprovisioned between
the two reads, IsRunning returns a STALE (false, nil) that reflects a
missing instance_id — NOT a confirmed-terminated EC2 — and we'd flip a
workspace whose EC2 is not proven dead (and fire RestartByID on a maybe-
just-deleted row). Fix: capture instance_id in the SELECT, and after a
(false, nil) re-confirm the row's CURRENT (status, instance_id) with a
short-timeout primary-key read; flip ONLY when the row still exists, is
still online/degraded, and still records the SAME non-empty instance we
asked CP about. Any divergence (row gone, status moved, instance_id
cleared/changed) or a re-confirm DB error → skip (fail-safe toward NOT
flipping). Mirrors healthsweep's guarded-write re-confirm.

[MED-3] degraded scope. Widen the SELECT to status IN ('online',
'degraded') so a SaaS workspace the heartbeat handler flipped degraded,
then lost its EC2, is reconciled instead of falling through every sweep.
Matches healthsweep's status set.

[MED-2] per-cycle deadline. Wrap row processing in a cycleCtx with a 45s
cpInstanceCycleDeadline (under the 60s interval); per-workspace IsRunning
timeouts derive from it; break and defer the backlog if the cycle blows
its deadline. Mirrors cp_orphan_sweeper. Prevents a degraded-but-not-
erroring CP (slow-but-under-cap IsRunning × 200 rows) from dragging one
cycle to ~33min.

IsRunning is unchanged (a2a_proxy + healthsweep also call it). Existing
fail-safe-on-error behavior (err != nil → never flip) is preserved.

Tests: TOCTOU guards (instance changed / cleared / status moved / row
gone — all assert NO flip), degraded flips, re-confirm DB-error fail-safe,
happy re-confirm; updated the scope regex for the new
status IN (...) + instance_id column.

Refs core#2261. DO NOT MERGE until heavy core SOP gate clears.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 20:30:08 -07:00
..