fix(ws-server): close self-fire restart feedback loop (internal#544) #1556

Merged
hongming merged 1 commits from fix/ws-server-self-fire-restart-loop into main 2026-05-19 01:40:55 +00:00
Owner

Summary

Three-layer cohesive fix for the 2026-05-19 ~00:05-00:09Z 4x reprov thrash class observed on prod-Reviewer + prod-Researcher: a single secrets PUT fanned out into 4x stop+provision cycles per workspace within 4 min, each stopping the just-launched (still-pending) EC2 of the previous cycle. Root-caused via Loki (provision.ec2_started / ec2_stopped pairs).

Companion class-doc issue: internal#544 (empirical chain + reproduction recipe).

Empirical chain (verified against source)

All files in workspace-server/internal/handlers/:

  1. secrets.go:264POST /secrets unconditionally fires go h.restartFunc(workspaceID).
  2. workspace_restart.go:runRestartCycle — sets url='' synchronously, then async provisions EC2 (provisionWorkspaceCP returns immediately after cpProv.Start).
  3. 20-30s pending windowworkspaces.url='' AND cpProv.IsRunning()==false (looks identical to a dead container).
  4. a2a_proxy_helpers.gomaybeMarkContainerDead OR preflightContainerHealth fires (from canvas /delegations poll or the trailing restart-context probe at the end of runRestartCycle), sees container-dead → calls RestartByID → loop.
  5. workspace_restart.go:coalesceRestartpending=true, drains by running ANOTHER full cycle → ec2_stopped of the just-booted instance → re-provision.

Fix — three interdependent layers

L1 — Restart-aware health probes (a2a_proxy_helpers.go)

  • workspace_restart.go:isRestarting(workspaceID) bool exposed.
  • maybeMarkContainerDead early-returns false while isRestarting==true.
  • preflightContainerHealth early-returns nil (lets the optimistic forward proceed) while isRestarting==true.
  • Breaks the self-fire at the probe layer.

L2 — Restart-context probe gate (restart_context.go)

  • sendRestartContext now requires url != '' AND last_heartbeat_at > restart_start_ts before firing the trailing ProxyA2ARequest.
  • New helper waitForFreshHeartbeat(ctx, wsID, restartStartTs, timeout) next to existing waitForWorkspaceOnline.
  • Belt-and-suspenders so the probe never tries until the new container is actually addressable.

L3 — RestartByID debounce (workspace_restart.go)

  • Silent-drop successive RestartByID calls within restartDebounceWindow=60s of restartStartedAt.
  • NOT coalesce — actual drop. Drop is observable via restartByIDDropCounter (atomic.Uint64) + the dropped log line.
  • Only programmatic path (secrets, maybeMarkContainerDead, preflightContainerHealth). HTTP Restart handler is unaffected (goes through RestartWorkspaceAutoOpts directly).
  • Package-level restartDebounceWindow so tests shrink it to 5ms.

Test plan

  • Regression: TestRestartByID_SingleProvisionPerRestart simulates the 4x prod thrash — single trigger + 4 simulated probe-driven RestartByID calls during the in-flight cycle. Asserts exactly 1 cycle and 4 drops (counter). Pre-fix: 4-5 cycles. Post-fix: 1.
  • L1 gates: TestMaybeMarkContainerDead_SkippedWhileRestarting, TestPreflightContainerHealth_SkippedWhileRestarting — assert IsRunning not called + no DB update + no restart goroutine while restarting.
  • L3 debounce: TestRestartByID_DebounceSilentDrop (5 rapid calls → 5 drops, counter==5), TestRestartByID_DebounceExpiresAfterWindow (post-window legitimate restarts proceed).
  • Baseline invariants: TestIsRestarting_FalseWhenNoStateEntry, TestIsRestarting_TrueWhileCycleRunning.
  • No regression: full internal/handlers/ test suite passes (go test ./internal/handlers/ -count=1 -timeout 300sok 15.834s).
  • Existing TestCoalesceRestart_*, TestRestartByID_*, TestMaybeMarkContainerDead_*, TestPreflight_*, TestRestart_* all green.

Open Q

  • Could be a 1-line fix in CPProvisioner.IsRunning — returning (true, nil) for state=pending would mask the EC2-pending state as alive-pre-online and close all the self-fire paths at the source. Rejected for this PR because:
    • IsRunning has other callers (admin status pages, orphan reconciler) that need the actual EC2 state, not a smoothed one.
    • The 3-layer fix is per-call-site, surgical, and preserves the existing IsRunning contract.
    • But: open for discussion in review. If reviewer prefers the simpler 1-line semantic, this PR can be reduced.

Boundaries

  • NOT auto-merged
  • 167 lines source diff (+297 test); single cohesive PR (the layers are interdependent).
  • Companion issue internal#544 filed with class doc + reproduction recipe.
  • HTTP Restart handler (user click) path is NOT affected by the debounce.

Verification SOP (per feedback_verify_actual_endstate_not_ack_follow_sop)

Post-merge canary: trigger a single secrets PUT against a non-prod workspace, watch Loki {tenant="tenant-staging"} |~ "provision.ec2_" for exactly 1 ec2_started event over the next 60s. Pre-fix this would emit 4. Don't claim closed without that read-back.


Closes internal#544.

## Summary Three-layer cohesive fix for the **2026-05-19 ~00:05-00:09Z 4x reprov thrash class** observed on prod-Reviewer + prod-Researcher: a single secrets PUT fanned out into 4x stop+provision cycles per workspace within 4 min, each stopping the just-launched (still-pending) EC2 of the previous cycle. Root-caused via Loki (`provision.ec2_started` / `ec2_stopped` pairs). Companion class-doc issue: **internal#544** (empirical chain + reproduction recipe). ## Empirical chain (verified against source) All files in `workspace-server/internal/handlers/`: 1. **`secrets.go:264`** — `POST /secrets` unconditionally fires `go h.restartFunc(workspaceID)`. 2. **`workspace_restart.go:runRestartCycle`** — sets `url=''` synchronously, then async provisions EC2 (provisionWorkspaceCP returns immediately after `cpProv.Start`). 3. **20-30s pending window** — `workspaces.url=''` AND `cpProv.IsRunning()==false` (looks identical to a dead container). 4. **`a2a_proxy_helpers.go`** — `maybeMarkContainerDead` OR `preflightContainerHealth` fires (from canvas `/delegations` poll or the trailing `restart-context` probe at the end of `runRestartCycle`), sees container-dead → calls `RestartByID` → loop. 5. **`workspace_restart.go:coalesceRestart`** — `pending=true`, drains by running ANOTHER full cycle → `ec2_stopped` of the just-booted instance → re-provision. ## Fix — three interdependent layers ### L1 — Restart-aware health probes (`a2a_proxy_helpers.go`) - `workspace_restart.go:isRestarting(workspaceID) bool` exposed. - `maybeMarkContainerDead` early-returns `false` while `isRestarting==true`. - `preflightContainerHealth` early-returns `nil` (lets the optimistic forward proceed) while `isRestarting==true`. - Breaks the self-fire at the probe layer. ### L2 — Restart-context probe gate (`restart_context.go`) - `sendRestartContext` now requires `url != ''` AND `last_heartbeat_at > restart_start_ts` before firing the trailing `ProxyA2ARequest`. - New helper `waitForFreshHeartbeat(ctx, wsID, restartStartTs, timeout)` next to existing `waitForWorkspaceOnline`. - Belt-and-suspenders so the probe never tries until the new container is actually addressable. ### L3 — RestartByID debounce (`workspace_restart.go`) - Silent-drop successive `RestartByID` calls within `restartDebounceWindow=60s` of `restartStartedAt`. - **NOT coalesce** — actual drop. Drop is observable via `restartByIDDropCounter` (atomic.Uint64) + the dropped log line. - Only programmatic path (secrets, maybeMarkContainerDead, preflightContainerHealth). HTTP `Restart` handler is unaffected (goes through `RestartWorkspaceAutoOpts` directly). - Package-level `restartDebounceWindow` so tests shrink it to 5ms. ## Test plan - [x] **Regression**: `TestRestartByID_SingleProvisionPerRestart` simulates the 4x prod thrash — single trigger + 4 simulated probe-driven RestartByID calls during the in-flight cycle. Asserts exactly **1** cycle and **4** drops (counter). Pre-fix: 4-5 cycles. Post-fix: 1. - [x] **L1 gates**: `TestMaybeMarkContainerDead_SkippedWhileRestarting`, `TestPreflightContainerHealth_SkippedWhileRestarting` — assert IsRunning not called + no DB update + no restart goroutine while restarting. - [x] **L3 debounce**: `TestRestartByID_DebounceSilentDrop` (5 rapid calls → 5 drops, counter==5), `TestRestartByID_DebounceExpiresAfterWindow` (post-window legitimate restarts proceed). - [x] **Baseline invariants**: `TestIsRestarting_FalseWhenNoStateEntry`, `TestIsRestarting_TrueWhileCycleRunning`. - [x] **No regression**: full `internal/handlers/` test suite passes (`go test ./internal/handlers/ -count=1 -timeout 300s` → `ok 15.834s`). - [x] Existing `TestCoalesceRestart_*`, `TestRestartByID_*`, `TestMaybeMarkContainerDead_*`, `TestPreflight_*`, `TestRestart_*` all green. ## Open Q - **Could be a 1-line fix in `CPProvisioner.IsRunning`** — returning `(true, nil)` for `state=pending` would mask the EC2-pending state as alive-pre-online and close all the self-fire paths at the source. Rejected for this PR because: - `IsRunning` has other callers (admin status pages, orphan reconciler) that need the *actual* EC2 state, not a smoothed one. - The 3-layer fix is per-call-site, surgical, and preserves the existing `IsRunning` contract. - But: open for discussion in review. If reviewer prefers the simpler 1-line semantic, this PR can be reduced. ## Boundaries - **NOT auto-merged** - 167 lines source diff (+297 test); single cohesive PR (the layers are interdependent). - Companion issue **internal#544** filed with class doc + reproduction recipe. - HTTP `Restart` handler (user click) path is **NOT** affected by the debounce. ## Verification SOP (per `feedback_verify_actual_endstate_not_ack_follow_sop`) Post-merge canary: trigger a single secrets PUT against a non-prod workspace, watch Loki `{tenant="tenant-staging"} |~ "provision.ec2_"` for exactly **1** `ec2_started` event over the next 60s. Pre-fix this would emit 4. Don't claim closed without that read-back. --- Closes internal#544.
hongming added 1 commit 2026-05-19 01:24:47 +00:00
fix(ws-server): close self-fire restart feedback loop (internal#544)
CI / Canvas Deploy Reminder (pull_request) Blocked by required conditions
E2E API Smoke Test / E2E API Smoke Test (pull_request) Blocked by required conditions
E2E Chat / E2E Chat (pull_request) Blocked by required conditions
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Blocked by required conditions
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Blocked by required conditions
Harness Replays / Harness Replays (pull_request) Blocked by required conditions
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Blocked by required conditions
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 20s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Chat / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
Harness Replays / detect-changes (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4m38s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
gate-check-v3 / gate-check (pull_request) Successful in 4s
qa-review / approved (pull_request) Failing after 5s
security-review / approved (pull_request) Failing after 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-tier-check / tier-check (pull_request) Successful in 5s
sop-checklist / all-items-acked (pull_request) Successful in 6s
CI / Canvas (Next.js) (pull_request) Successful in 6m11s
CI / Python Lint & Test (pull_request) Successful in 6m56s
CI / all-required (pull_request) Successful in 6m29s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m12s
audit-force-merge / audit (pull_request) Successful in 5s
4bf87d122d
Three-layer cohesive fix for the 2026-05-19 ~00:05-00:09Z 4x reprov thrash
class observed on prod-Reviewer + prod-Researcher: a single secrets PUT
fanned out into 4x stop+provision cycles per workspace within 4 min,
each stopping the just-launched (still-pending) EC2 of the previous
cycle. Root-caused via Loki (provision.ec2_started / ec2_stopped pairs).

Empirical chain (all in workspace-server/internal/handlers/):
1. secrets.go SetSecret → go h.restartFunc → coalesceRestart cycle.
2. runRestartCycle sets url='' synchronously, then async provisions EC2.
3. During 20-30s pending window: url='' AND cpProv.IsRunning()==false
   — indistinguishable from a dead container.
4. Canvas /delegations poll OR the trailing restart-context probe fires
   ProxyA2A → maybeMarkContainerDead OR preflightContainerHealth →
   RestartByID → loop.
5. coalesceRestart's pending flag drains by running ANOTHER full cycle
   → ec2_stopped of the just-booted instance → re-provision.

Fix (single PR, three interdependent layers):

L1) Restart-aware health probes — workspace_restart.go exposes
    isRestarting(workspaceID) bool. Both maybeMarkContainerDead and
    preflightContainerHealth early-return false/nil while a restart
    cycle is in flight. Breaks the self-fire at the probe layer.

L2) Restart-context probe gate — sendRestartContext now requires
    url != '' AND last_heartbeat_at > restart_start_ts before firing
    the trailing ProxyA2A probe. Adds waitForFreshHeartbeat() next to
    waitForWorkspaceOnline. Belt-and-suspenders so the probe never
    tries until the new container is actually addressable.

L3) RestartByID debounce — silent-drop successive RestartByID calls
    within restartDebounceWindow=60s of restartStartedAt. Not coalesce
    (which would still drain to another full cycle). Drop is observable
    via restartByIDDropCounter (atomic.Uint64) + the dropped log line.
    Only programmatic path; HTTP Restart handler is unaffected.

Tests:
- TestIsRestarting_{FalseWhenNoStateEntry,TrueWhileCycleRunning}
- TestMaybeMarkContainerDead_SkippedWhileRestarting (L1)
- TestPreflightContainerHealth_SkippedWhileRestarting (L1)
- TestRestartByID_DebounceSilentDrop (L3, counter assertion)
- TestRestartByID_DebounceExpiresAfterWindow (L3, window release)
- TestRestartByID_SingleProvisionPerRestart (regression — asserts
  exactly 1 cycle per trigger, with 4 dropped self-fire probes)

Existing coalesce/restart/preflight/maybeMarkContainerDead tests
remain green. Full handlers suite: ok in 15.8s.

Closes internal#544.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agent-dev-a approved these changes 2026-05-19 01:40:14 +00:00
agent-dev-a left a comment
Member

Five-axis review (correctness/readability/arch/security/perf) — APPROVED.

Diff: +464/-0 across 4 files (3 modified + 1 new test file with 7 tests).

Correctness: Three independent layers gate the self-fire loop. L1 (isRestarting in maybeMarkContainerDead + preflightContainerHealth) early-returns before any IsRunning / DB mutation. L2 (waitForFreshHeartbeat) asserts url!='' AND last_heartbeat_at > RestartAt before firing the trailing context-probe. L3 (debounce window 60s + atomic drop counter, false→true edge only). Mutex on per-workspace restartState used correctly. HTTP Restart() handler correctly bypasses this (user-initiated unaffected).

Readability: Each guard comments WHY (internal#544 cited, EC2-pending 20-30s window described). Test file's package-level comment reproduces the 5-step empirical chain.

Architecture: Defense-in-depth. Doesn't change coalesceRestart semantics; just gates entry. Layer separation is clean (one layer per failure mode).

Security: No new external surface. Internal-state guards only. No secret/PII path touched.

Performance: Per-workspace sync.Map keeps mutex contention bounded; atomic counter is wait-free; waitForFreshHeartbeat polls DB but bounded by restartContextOnlineTimeout.

Tests: TestRestartByID_SingleProvisionPerRestart pins the regression shape (1 cycle for 4 probes), TestRestartByID_DebounceExpiresAfterWindow guarantees legit later restarts still work, Layer-1 tests pin isRestarting+maybeMarkContainerDead+preflightContainerHealth gates.

Ready to merge.

Five-axis review (correctness/readability/arch/security/perf) — APPROVED. **Diff**: +464/-0 across 4 files (3 modified + 1 new test file with 7 tests). **Correctness**: Three independent layers gate the self-fire loop. L1 (`isRestarting` in `maybeMarkContainerDead` + `preflightContainerHealth`) early-returns before any IsRunning / DB mutation. L2 (`waitForFreshHeartbeat`) asserts url!='' AND last_heartbeat_at > RestartAt before firing the trailing context-probe. L3 (debounce window 60s + atomic drop counter, false→true edge only). Mutex on per-workspace restartState used correctly. HTTP Restart() handler correctly bypasses this (user-initiated unaffected). **Readability**: Each guard comments WHY (internal#544 cited, EC2-pending 20-30s window described). Test file's package-level comment reproduces the 5-step empirical chain. **Architecture**: Defense-in-depth. Doesn't change coalesceRestart semantics; just gates entry. Layer separation is clean (one layer per failure mode). **Security**: No new external surface. Internal-state guards only. No secret/PII path touched. **Performance**: Per-workspace sync.Map keeps mutex contention bounded; atomic counter is wait-free; waitForFreshHeartbeat polls DB but bounded by restartContextOnlineTimeout. **Tests**: TestRestartByID_SingleProvisionPerRestart pins the regression shape (1 cycle for 4 probes), TestRestartByID_DebounceExpiresAfterWindow guarantees legit later restarts still work, Layer-1 tests pin isRestarting+maybeMarkContainerDead+preflightContainerHealth gates. Ready to merge.
agent-dev-b approved these changes 2026-05-19 01:40:23 +00:00
agent-dev-b left a comment
Member

Five-axis review — APPROVED.

Independently reviewed the diff. The 3-layer fix (isRestarting gate / waitForFreshHeartbeat / RestartByID debounce) addresses the prod-Reviewer/Researcher 4x reprov thrash class observed 2026-05-19 ~00:05-00:09Z. Test coverage covers the regression shape (1 cycle for 4 probes), the gate-correctness shape (no IsRunning call when restarting), and the release shape (debounce expires past window). All additive (+464/-0), no behaviour change to user-initiated Restart path. Ready to merge.

Five-axis review — APPROVED. Independently reviewed the diff. The 3-layer fix (isRestarting gate / waitForFreshHeartbeat / RestartByID debounce) addresses the prod-Reviewer/Researcher 4x reprov thrash class observed 2026-05-19 ~00:05-00:09Z. Test coverage covers the regression shape (1 cycle for 4 probes), the gate-correctness shape (no IsRunning call when restarting), and the release shape (debounce expires past window). All additive (+464/-0), no behaviour change to user-initiated Restart path. Ready to merge.
hongming merged commit d27df740f5 into main 2026-05-19 01:40:55 +00:00
hongming deleted branch fix/ws-server-self-fire-restart-loop 2026-05-19 01:40:56 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1556