fix(workspace-server): debounce file-write → RestartByID tight loop (#624) #1623
Reference in New Issue
Block a user
Delete Branch "fix/624-file-write-restart-debounce"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
P0 fix for internal#624 — claude-code workspace tight-loop re-provision triggered by canvas Save bursts.
Empirical chain (Loki on workspace
3fe84b89-eb65-42fc-ad1f-5c93582ca3e72026-05-20 22:00-22:11Z):PUT /filesattemplates.go:575/591/607/662/682/697+template_import.go:239/275/297firesgoAsync(RestartByID).pending=trueon the runningcoalesceRestartcycle, and drain immediately into cycle 2 with no re-debounce.EC2InstanceStateInvalidException.Fix (both paths in same PR)
Path A — call-site debounce: new
maybeRestartAfterFileWritehelper with a 15s per-workspace coalescing window backed bysync.Map[wsID]*atomic.Int64. Replaces the 9 file-writegoAsync(RestartByID)sites intemplates.go(6) +template_import.go(3). CAS-loop on the timestamp so concurrent writers can't both fire.Path B — drain-loop re-stamp:
coalesceRestartnow re-stampsrestartStartedAtat the top of each drained iteration past the first. The original design (stamp only onfalse→trueedge) treated all drained pending as one debounce-event, which lets a write at T+65s pipe cycle 1→cycle 2 with no re-debounce. Re-stamping means anyRestartByIDarriving during cycle N hits a fresh 60s window and is dropped instead of chaining. Secrets-batch coalescing semantic (one cycle picks up everyone who arrived during it) is preserved.Files changed
workspace-server/internal/handlers/workspace_restart.go(+101 LoC: helper + Path B re-stamp + comments)workspace-server/internal/handlers/templates.go(6 call-sites swapped)workspace-server/internal/handlers/template_import.go(3 call-sites swapped)workspace-server/internal/handlers/workspace_restart_file_write_debounce_test.go(new, 316 LoC, 7 tests)Tests
TestMaybeRestartAfterFileWrite_FirstWriteRestarts— baseline first-call firesTestMaybeRestartAfterFileWrite_SecondWriteWithin15sSkipped— core fixTestMaybeRestartAfterFileWrite_ManyWritesInBurstCoalesceToOne— 10 PUTs → 1 fire + 9 drops (bonus case from issue)TestMaybeRestartAfterFileWrite_AfterWindowExpiresFiresAgain— window releaseTestMaybeRestartAfterFileWrite_DifferentWorkspacesIndependent— per-workspace isolationTestMaybeRestartAfterFileWrite_ConcurrentCallsSafelyDebounced— 50-goroutine herd → exactly 1 firesTestCoalesceRestart_DrainRespectsRestartedAtBetweenIterations— Path B regressionAll existing
TestCoalesceRestart_*+TestRestartByID_*+ self-fire +isRestartingtests still pass (no regression).Test plan
go build ./workspace-server/...cleango test ./workspace-server/internal/handlers/ -run 'TestMaybeRestartAfterFileWrite|TestCoalesceRestart|TestRestartByID|TestIsRestarting|TestMaybeMarkContainerDead|TestPreflight'PASSgo vet ./workspace-server/internal/handlers/clean3fe84b89-eb65-42fc-ad1f-5c93582ca3e7— canvas Save burst should produce ≤2 restart cycles (not 4+)Out of scope (follow-up issue)
The diagnostic-improvement work (broadcast
WORKSPACE_AUTO_RESTART_FIRED+WORKSPACE_RESTARTING_MID_WRITEevents so canvas Activity shows WHY a workspace restarted) is filed as a separate issue per CTO request — see internal#625.🤖 Generated with Claude Code
Canvas Save fires N PUT /files in a 30-60s burst (10-17 files observed on claude-code SEO Agent 3fe84b89-eb65-42fc-ad1f-5c93582ca3e7). Each successful write at templates.go:575/591/607/662/682/697 and template_import.go:239/275/297 previously called `goAsync(func() { wh.RestartByID(wsID) })`. The existing 60s self-fire debounce in RestartByID catches calls within the window, but writes at T+65s+ pass the debounce, set pending=true on the running coalesceRestart cycle, and drain immediately into cycle 2 — which DELETEs+recreates EC2 mid-burst → user PUTs 500 with EC2InstanceStateInvalidException. Fix in two layers (both shipped here): Path A (call-site debounce): new `maybeRestartAfterFileWrite` helper with a 15s per-workspace coalescing window, backed by a sync.Map of *atomic.Int64 last-fire timestamps. Replaces the 9 file-write goAsync(RestartByID) sites in templates.go (6) and template_import.go (3). CAS-loop on the timestamp so concurrent writers can't both fire. Path B (drain-loop re-stamp): coalesceRestart now re-stamps restartStartedAt at the top of each drained iteration past the first. The original design (stamp only on false→true edge) treated all drained pending as "one event" for debounce purposes, which lets a write at T+65s pipe through cycle 1 into cycle 2 without re-debounce. Re-stamping means any RestartByID arriving during cycle N hits a fresh 60s window and is dropped by shouldDebounceRestart instead of chaining into N+1. The secrets-batch coalescing semantic (one cycle picks up everyone who arrived during it) is preserved. Tests (workspace_restart_file_write_debounce_test.go, 7 cases): - FirstWriteRestarts (baseline — drop counter stays 0) - SecondWriteWithin15sSkipped (core fix — drop=1, stamp unchanged) - ManyWritesInBurstCoalesceToOne (10-PUT burst → 1 fire + 9 drops) - AfterWindowExpiresFiresAgain (5ms shrink, second fires post-window) - DifferentWorkspacesIndependent (per-workspace isolation) - ConcurrentCallsSafelyDebounced (50-goroutine herd → exactly 1 fires) - TestCoalesceRestart_DrainRespectsRestartedAtBetweenIterations (Path B regression: drained cycle's restartStartedAt is observably newer) All existing TestCoalesceRestart_* + TestRestartByID_* + self-fire + isRestarting tests still pass. Closes #624 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>core-qa five-axis review — APPROVE.
Correctness: 15s window is well-sized. Canvas Save bursts cluster in a 5-10s tight write window then idle; 15s comfortably absorbs the burst, exits the window before legit 'edit / wait / edit again' patterns get swallowed. The Path A+B layering is correct: Path A drops re-fires at the call site; if anything slips through (a2a_proxy/main.go siblings), Path B re-stamps
restartStartedAtper drained iteration so a freshrestartDebounceWindow=60sgate covers the next cycle. Two independent failure modes, two independent defenses — sound.Tests: All 7 tests pass and pin meaningful invariants — first-write fires, second-within-15s drops (with stamp-preservation assert at L116), 50-goroutine CAS contention produces exactly N-1 drops, window-expiry releases, per-workspace isolation, drain re-stamp monotonicity. Path B test (L266) uses a real
time.Sleep(20ms)between cycles so thestamps[1].After(stamps[0])assertion is observable, not a no-op.Bonus 10-PUT burst test (L127): asserts
drops==9— STRONGER than 'at most 2 cycles' (it's exactly 1 cycle). However the test only validates Path A in isolation; it does NOT cross-validate that Path B's 60s gate kicks in if a write LEAKS past Path A's 15s window. The two layers are unit-tested independently. Acceptable since each layer has its own pin and the prod chain requires both to fail — but an integration test that feeds an out-of-window write into a still-draining cycle would be the highest-confidence regression pin. Non-blocking.sync.Map memory:
fileWriteRestartLastFireAtgrows monotonically — no Delete on workspace teardown. Per-workspace overhead ~32 bytes; for 10k lifetime workspaces ~300KB. Same shape as the pre-existingrestartStates sync.Mapthat already ships. Not a regression, but worth a TODO/follow-up issue if the workspace count grows substantially. Non-blocking.Concurrency:
atomic.Int64insidesync.Map.LoadOrStoreis the canonical lock-free pattern; CAS loop bounded at 4 retries (any contention beats out in ≤1 retry given the elapsed-time check post-load). No race risk.Edge cases: First-write zero-init stamp (0) sits ~50+ years in the past relative to
time.Now().UnixNano(), so theelapsed < windowpredicate is correctly false on first call. Verified byTestMaybeRestartAfterFileWrite_FirstWriteRestarts.Ready to merge.
core-devops five-axis review — APPROVE.
Operational fit: The fix maps cleanly to the prod chain (canvas-Save burst → 9 file-write trigger sites → coalesceRestart drain) verified at 22:00-22:11Z on workspace 3fe84b89. Two-layer defense (Path A call-site debounce + Path B drain re-stamp) closes both the dominant source (file-write burst) and any straggler that arrives via the 4 sibling siblings. Sound.
Surgical-swap check: All 9 call-sites in templates.go (6) + template_import.go (3) literally replace
wsID := workspaceID; h.wh.goAsync(func() { h.wh.RestartByID(wsID) })withh.wh.maybeRestartAfterFileWrite(workspaceID). No collateral handler-logic changes in the diff. Confirmed by reading the unified diff line-by-line.4 sibling RestartByID callers verified sparse-trigger (NOT same-bug-class):
a2a_proxy.go:669— single fire per A2A preflight request, not burst.a2a_proxy_helpers.go:217, 279— startup / post-launch single-shot.cmd/server/main.go:224— liveness-monitor offline callback (one trigger per offline event).None live in a burst path. PR body claim holds.
Observability:
maybeRestartAfterFileWriteemits a structured log line on every drop ('%s — coalesced (last fire %s ago < %s window; total dropped=%d)') plus a process-widefileWriteRestartDropCounteratomic. Searchable in Loki via the literal substringmaybeRestartAfterFileWriteto confirm the loop stopped post-deploy. Path B has no explicit drop log line for the re-stamp event butshouldDebounceRestartalready logs at the existing 60s gate — sufficient. /admin/metrics exposure is deferred (called out in code comment as separate RFC) — non-blocking.LoC / tests ratio: +497/-27 looks large but is +316 test (66%) / +181 production. Tests are dense and meaningful (CAS contention, per-workspace isolation, window-expiry, drain re-stamp). Acceptable test-investment ratio for a fix that lives in the workspace-lifecycle hot path.
Rollout risk: NO feature flag. Path A is purely additive (drop branch new, fire branch identical to before). Path B changes drain semantics — but the change is monotone-defensive: re-stamping
restartStartedAtcan only DROP additional cycles, never SCHEDULE additional ones. Worst-case post-merge regression is 'restart fires fewer times than expected'; the existing 60s window already governed the first cycle so behavior for non-burst paths is unchanged. Safe to ship immediately. Recommend a Loki check ~10min post-deploy for thecoalescedlog line to confirm Path A is firing in prod.Ready to merge.