fix(ws-server): close self-fire restart feedback loop (internal#544) #1556
Reference in New Issue
Block a user
Delete Branch "fix/ws-server-self-fire-restart-loop"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Three-layer cohesive fix for the 2026-05-19 ~00:05-00:09Z 4x reprov thrash class observed on prod-Reviewer + prod-Researcher: a single secrets PUT fanned out into 4x stop+provision cycles per workspace within 4 min, each stopping the just-launched (still-pending) EC2 of the previous cycle. Root-caused via Loki (
provision.ec2_started/ec2_stoppedpairs).Companion class-doc issue: internal#544 (empirical chain + reproduction recipe).
Empirical chain (verified against source)
All files in
workspace-server/internal/handlers/:secrets.go:264—POST /secretsunconditionally firesgo h.restartFunc(workspaceID).workspace_restart.go:runRestartCycle— setsurl=''synchronously, then async provisions EC2 (provisionWorkspaceCP returns immediately aftercpProv.Start).workspaces.url=''ANDcpProv.IsRunning()==false(looks identical to a dead container).a2a_proxy_helpers.go—maybeMarkContainerDeadORpreflightContainerHealthfires (from canvas/delegationspoll or the trailingrestart-contextprobe at the end ofrunRestartCycle), sees container-dead → callsRestartByID→ loop.workspace_restart.go:coalesceRestart—pending=true, drains by running ANOTHER full cycle →ec2_stoppedof the just-booted instance → re-provision.Fix — three interdependent layers
L1 — Restart-aware health probes (
a2a_proxy_helpers.go)workspace_restart.go:isRestarting(workspaceID) boolexposed.maybeMarkContainerDeadearly-returnsfalsewhileisRestarting==true.preflightContainerHealthearly-returnsnil(lets the optimistic forward proceed) whileisRestarting==true.L2 — Restart-context probe gate (
restart_context.go)sendRestartContextnow requiresurl != ''ANDlast_heartbeat_at > restart_start_tsbefore firing the trailingProxyA2ARequest.waitForFreshHeartbeat(ctx, wsID, restartStartTs, timeout)next to existingwaitForWorkspaceOnline.L3 — RestartByID debounce (
workspace_restart.go)RestartByIDcalls withinrestartDebounceWindow=60sofrestartStartedAt.restartByIDDropCounter(atomic.Uint64) + the dropped log line.Restarthandler is unaffected (goes throughRestartWorkspaceAutoOptsdirectly).restartDebounceWindowso tests shrink it to 5ms.Test plan
TestRestartByID_SingleProvisionPerRestartsimulates the 4x prod thrash — single trigger + 4 simulated probe-driven RestartByID calls during the in-flight cycle. Asserts exactly 1 cycle and 4 drops (counter). Pre-fix: 4-5 cycles. Post-fix: 1.TestMaybeMarkContainerDead_SkippedWhileRestarting,TestPreflightContainerHealth_SkippedWhileRestarting— assert IsRunning not called + no DB update + no restart goroutine while restarting.TestRestartByID_DebounceSilentDrop(5 rapid calls → 5 drops, counter==5),TestRestartByID_DebounceExpiresAfterWindow(post-window legitimate restarts proceed).TestIsRestarting_FalseWhenNoStateEntry,TestIsRestarting_TrueWhileCycleRunning.internal/handlers/test suite passes (go test ./internal/handlers/ -count=1 -timeout 300s→ok 15.834s).TestCoalesceRestart_*,TestRestartByID_*,TestMaybeMarkContainerDead_*,TestPreflight_*,TestRestart_*all green.Open Q
CPProvisioner.IsRunning— returning(true, nil)forstate=pendingwould mask the EC2-pending state as alive-pre-online and close all the self-fire paths at the source. Rejected for this PR because:IsRunninghas other callers (admin status pages, orphan reconciler) that need the actual EC2 state, not a smoothed one.IsRunningcontract.Boundaries
Restarthandler (user click) path is NOT affected by the debounce.Verification SOP (per
feedback_verify_actual_endstate_not_ack_follow_sop)Post-merge canary: trigger a single secrets PUT against a non-prod workspace, watch Loki
{tenant="tenant-staging"} |~ "provision.ec2_"for exactly 1ec2_startedevent over the next 60s. Pre-fix this would emit 4. Don't claim closed without that read-back.Closes internal#544.
Three-layer cohesive fix for the 2026-05-19 ~00:05-00:09Z 4x reprov thrash class observed on prod-Reviewer + prod-Researcher: a single secrets PUT fanned out into 4x stop+provision cycles per workspace within 4 min, each stopping the just-launched (still-pending) EC2 of the previous cycle. Root-caused via Loki (provision.ec2_started / ec2_stopped pairs). Empirical chain (all in workspace-server/internal/handlers/): 1. secrets.go SetSecret → go h.restartFunc → coalesceRestart cycle. 2. runRestartCycle sets url='' synchronously, then async provisions EC2. 3. During 20-30s pending window: url='' AND cpProv.IsRunning()==false — indistinguishable from a dead container. 4. Canvas /delegations poll OR the trailing restart-context probe fires ProxyA2A → maybeMarkContainerDead OR preflightContainerHealth → RestartByID → loop. 5. coalesceRestart's pending flag drains by running ANOTHER full cycle → ec2_stopped of the just-booted instance → re-provision. Fix (single PR, three interdependent layers): L1) Restart-aware health probes — workspace_restart.go exposes isRestarting(workspaceID) bool. Both maybeMarkContainerDead and preflightContainerHealth early-return false/nil while a restart cycle is in flight. Breaks the self-fire at the probe layer. L2) Restart-context probe gate — sendRestartContext now requires url != '' AND last_heartbeat_at > restart_start_ts before firing the trailing ProxyA2A probe. Adds waitForFreshHeartbeat() next to waitForWorkspaceOnline. Belt-and-suspenders so the probe never tries until the new container is actually addressable. L3) RestartByID debounce — silent-drop successive RestartByID calls within restartDebounceWindow=60s of restartStartedAt. Not coalesce (which would still drain to another full cycle). Drop is observable via restartByIDDropCounter (atomic.Uint64) + the dropped log line. Only programmatic path; HTTP Restart handler is unaffected. Tests: - TestIsRestarting_{FalseWhenNoStateEntry,TrueWhileCycleRunning} - TestMaybeMarkContainerDead_SkippedWhileRestarting (L1) - TestPreflightContainerHealth_SkippedWhileRestarting (L1) - TestRestartByID_DebounceSilentDrop (L3, counter assertion) - TestRestartByID_DebounceExpiresAfterWindow (L3, window release) - TestRestartByID_SingleProvisionPerRestart (regression — asserts exactly 1 cycle per trigger, with 4 dropped self-fire probes) Existing coalesce/restart/preflight/maybeMarkContainerDead tests remain green. Full handlers suite: ok in 15.8s. Closes internal#544. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Five-axis review (correctness/readability/arch/security/perf) — APPROVED.
Diff: +464/-0 across 4 files (3 modified + 1 new test file with 7 tests).
Correctness: Three independent layers gate the self-fire loop. L1 (
isRestartinginmaybeMarkContainerDead+preflightContainerHealth) early-returns before any IsRunning / DB mutation. L2 (waitForFreshHeartbeat) asserts url!='' AND last_heartbeat_at > RestartAt before firing the trailing context-probe. L3 (debounce window 60s + atomic drop counter, false→true edge only). Mutex on per-workspace restartState used correctly. HTTP Restart() handler correctly bypasses this (user-initiated unaffected).Readability: Each guard comments WHY (internal#544 cited, EC2-pending 20-30s window described). Test file's package-level comment reproduces the 5-step empirical chain.
Architecture: Defense-in-depth. Doesn't change coalesceRestart semantics; just gates entry. Layer separation is clean (one layer per failure mode).
Security: No new external surface. Internal-state guards only. No secret/PII path touched.
Performance: Per-workspace sync.Map keeps mutex contention bounded; atomic counter is wait-free; waitForFreshHeartbeat polls DB but bounded by restartContextOnlineTimeout.
Tests: TestRestartByID_SingleProvisionPerRestart pins the regression shape (1 cycle for 4 probes), TestRestartByID_DebounceExpiresAfterWindow guarantees legit later restarts still work, Layer-1 tests pin isRestarting+maybeMarkContainerDead+preflightContainerHealth gates.
Ready to merge.
Five-axis review — APPROVED.
Independently reviewed the diff. The 3-layer fix (isRestarting gate / waitForFreshHeartbeat / RestartByID debounce) addresses the prod-Reviewer/Researcher 4x reprov thrash class observed 2026-05-19 ~00:05-00:09Z. Test coverage covers the regression shape (1 cycle for 4 probes), the gate-correctness shape (no IsRunning call when restarting), and the release shape (debounce expires past window). All additive (+464/-0), no behaviour change to user-initiated Restart path. Ready to merge.