fix(workspace-server): debounce file-write → RestartByID tight loop (#624) #1623

Merged
core-devops merged 1 commits from fix/624-file-write-restart-debounce into main 2026-05-20 22:59:09 +00:00
Member

Summary

P0 fix for internal#624 — claude-code workspace tight-loop re-provision triggered by canvas Save bursts.

Empirical chain (Loki on workspace 3fe84b89-eb65-42fc-ad1f-5c93582ca3e7 2026-05-20 22:00-22:11Z):

  1. Canvas Save writes 10-17 files in a 30-60s window.
  2. Each successful PUT /files at templates.go:575/591/607/662/682/697 + template_import.go:239/275/297 fires goAsync(RestartByID).
  3. Existing 60s self-fire debounce catches calls 1-60s — but writes at T+65s pass the debounce, set pending=true on the running coalesceRestart cycle, and drain immediately into cycle 2 with no re-debounce.
  4. Cycle 2 DELETEs+recreates EC2 mid-burst → user PUTs 500 with EC2InstanceStateInvalidException.

Fix (both paths in same PR)

Path A — call-site debounce: new maybeRestartAfterFileWrite helper with a 15s per-workspace coalescing window backed by sync.Map[wsID]*atomic.Int64. Replaces the 9 file-write goAsync(RestartByID) sites in templates.go (6) + template_import.go (3). CAS-loop on the timestamp so concurrent writers can't both fire.

Path B — drain-loop re-stamp: coalesceRestart now re-stamps restartStartedAt at the top of each drained iteration past the first. The original design (stamp only on false→true edge) treated all drained pending as one debounce-event, which lets a write at T+65s pipe cycle 1→cycle 2 with no re-debounce. Re-stamping means any RestartByID arriving during cycle N hits a fresh 60s window and is dropped instead of chaining. Secrets-batch coalescing semantic (one cycle picks up everyone who arrived during it) is preserved.

Files changed

  • workspace-server/internal/handlers/workspace_restart.go (+101 LoC: helper + Path B re-stamp + comments)
  • workspace-server/internal/handlers/templates.go (6 call-sites swapped)
  • workspace-server/internal/handlers/template_import.go (3 call-sites swapped)
  • workspace-server/internal/handlers/workspace_restart_file_write_debounce_test.go (new, 316 LoC, 7 tests)

Tests

  • TestMaybeRestartAfterFileWrite_FirstWriteRestarts — baseline first-call fires
  • TestMaybeRestartAfterFileWrite_SecondWriteWithin15sSkipped — core fix
  • TestMaybeRestartAfterFileWrite_ManyWritesInBurstCoalesceToOne — 10 PUTs → 1 fire + 9 drops (bonus case from issue)
  • TestMaybeRestartAfterFileWrite_AfterWindowExpiresFiresAgain — window release
  • TestMaybeRestartAfterFileWrite_DifferentWorkspacesIndependent — per-workspace isolation
  • TestMaybeRestartAfterFileWrite_ConcurrentCallsSafelyDebounced — 50-goroutine herd → exactly 1 fires
  • TestCoalesceRestart_DrainRespectsRestartedAtBetweenIterations — Path B regression

All existing TestCoalesceRestart_* + TestRestartByID_* + self-fire + isRestarting tests still pass (no regression).

Test plan

  • go build ./workspace-server/... clean
  • go test ./workspace-server/internal/handlers/ -run 'TestMaybeRestartAfterFileWrite|TestCoalesceRestart|TestRestartByID|TestIsRestarting|TestMaybeMarkContainerDead|TestPreflight' PASS
  • go vet ./workspace-server/internal/handlers/ clean
  • Post-merge: verify on workspace 3fe84b89-eb65-42fc-ad1f-5c93582ca3e7 — canvas Save burst should produce ≤2 restart cycles (not 4+)

Out of scope (follow-up issue)

The diagnostic-improvement work (broadcast WORKSPACE_AUTO_RESTART_FIRED + WORKSPACE_RESTARTING_MID_WRITE events so canvas Activity shows WHY a workspace restarted) is filed as a separate issue per CTO request — see internal#625.

🤖 Generated with Claude Code

## Summary P0 fix for [internal#624](https://git.moleculesai.app/molecule-ai/internal/issues/624) — claude-code workspace tight-loop re-provision triggered by canvas Save bursts. Empirical chain (Loki on workspace `3fe84b89-eb65-42fc-ad1f-5c93582ca3e7` 2026-05-20 22:00-22:11Z): 1. Canvas Save writes 10-17 files in a 30-60s window. 2. Each successful `PUT /files` at `templates.go:575/591/607/662/682/697` + `template_import.go:239/275/297` fires `goAsync(RestartByID)`. 3. Existing 60s self-fire debounce catches calls 1-60s — but writes at T+65s pass the debounce, set `pending=true` on the running `coalesceRestart` cycle, and drain immediately into cycle 2 with no re-debounce. 4. Cycle 2 DELETEs+recreates EC2 mid-burst → user PUTs 500 with `EC2InstanceStateInvalidException`. ## Fix (both paths in same PR) **Path A — call-site debounce**: new `maybeRestartAfterFileWrite` helper with a 15s per-workspace coalescing window backed by `sync.Map[wsID]*atomic.Int64`. Replaces the 9 file-write `goAsync(RestartByID)` sites in `templates.go` (6) + `template_import.go` (3). CAS-loop on the timestamp so concurrent writers can't both fire. **Path B — drain-loop re-stamp**: `coalesceRestart` now re-stamps `restartStartedAt` at the top of each drained iteration past the first. The original design (stamp only on `false→true` edge) treated all drained pending as one debounce-event, which lets a write at T+65s pipe cycle 1→cycle 2 with no re-debounce. Re-stamping means any `RestartByID` arriving during cycle N hits a fresh 60s window and is dropped instead of chaining. Secrets-batch coalescing semantic (one cycle picks up everyone who arrived during it) is preserved. ## Files changed - `workspace-server/internal/handlers/workspace_restart.go` (+101 LoC: helper + Path B re-stamp + comments) - `workspace-server/internal/handlers/templates.go` (6 call-sites swapped) - `workspace-server/internal/handlers/template_import.go` (3 call-sites swapped) - `workspace-server/internal/handlers/workspace_restart_file_write_debounce_test.go` (new, 316 LoC, 7 tests) ## Tests - `TestMaybeRestartAfterFileWrite_FirstWriteRestarts` — baseline first-call fires - `TestMaybeRestartAfterFileWrite_SecondWriteWithin15sSkipped` — core fix - `TestMaybeRestartAfterFileWrite_ManyWritesInBurstCoalesceToOne` — 10 PUTs → 1 fire + 9 drops (bonus case from issue) - `TestMaybeRestartAfterFileWrite_AfterWindowExpiresFiresAgain` — window release - `TestMaybeRestartAfterFileWrite_DifferentWorkspacesIndependent` — per-workspace isolation - `TestMaybeRestartAfterFileWrite_ConcurrentCallsSafelyDebounced` — 50-goroutine herd → exactly 1 fires - `TestCoalesceRestart_DrainRespectsRestartedAtBetweenIterations` — Path B regression All existing `TestCoalesceRestart_*` + `TestRestartByID_*` + self-fire + `isRestarting` tests still pass (no regression). ## Test plan - [x] `go build ./workspace-server/...` clean - [x] `go test ./workspace-server/internal/handlers/ -run 'TestMaybeRestartAfterFileWrite|TestCoalesceRestart|TestRestartByID|TestIsRestarting|TestMaybeMarkContainerDead|TestPreflight'` PASS - [x] `go vet ./workspace-server/internal/handlers/` clean - [ ] Post-merge: verify on workspace `3fe84b89-eb65-42fc-ad1f-5c93582ca3e7` — canvas Save burst should produce ≤2 restart cycles (not 4+) ## Out of scope (follow-up issue) The diagnostic-improvement work (broadcast `WORKSPACE_AUTO_RESTART_FIRED` + `WORKSPACE_RESTARTING_MID_WRITE` events so canvas Activity shows WHY a workspace restarted) is filed as a separate issue per CTO request — see internal#625. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
core-be added 1 commit 2026-05-20 22:33:13 +00:00
fix(workspace-server): debounce file-write → RestartByID tight loop (#624)
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 18s
CI / Detect changes (pull_request) Successful in 15s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 14s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
CI / Platform (Go) (pull_request) Successful in 4m37s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
Harness Replays / detect-changes (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 5m50s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 3s
gate-check-v3 / gate-check (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m15s
qa-review / approved (pull_request) Successful in 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
security-review / approved (pull_request) Failing after 6s
sop-checklist / review-refire (pull_request) Has been skipped
sop-checklist / all-items-acked (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 6m50s
CI / all-required (pull_request) Successful in 4m46s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m19s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
Harness Replays / Harness Replays (pull_request) Successful in 14s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 11s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 2m2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3m6s
E2E Chat / E2E Chat (pull_request) Failing after 7m30s
audit-force-merge / audit (pull_request) Successful in 4s
c5c759227c
Canvas Save fires N PUT /files in a 30-60s burst (10-17 files observed on
claude-code SEO Agent 3fe84b89-eb65-42fc-ad1f-5c93582ca3e7). Each
successful write at templates.go:575/591/607/662/682/697 and
template_import.go:239/275/297 previously called
`goAsync(func() { wh.RestartByID(wsID) })`. The existing 60s self-fire
debounce in RestartByID catches calls within the window, but writes at
T+65s+ pass the debounce, set pending=true on the running coalesceRestart
cycle, and drain immediately into cycle 2 — which DELETEs+recreates EC2
mid-burst → user PUTs 500 with EC2InstanceStateInvalidException.

Fix in two layers (both shipped here):

Path A (call-site debounce): new `maybeRestartAfterFileWrite` helper with
a 15s per-workspace coalescing window, backed by a sync.Map of
*atomic.Int64 last-fire timestamps. Replaces the 9 file-write
goAsync(RestartByID) sites in templates.go (6) and template_import.go
(3). CAS-loop on the timestamp so concurrent writers can't both fire.

Path B (drain-loop re-stamp): coalesceRestart now re-stamps
restartStartedAt at the top of each drained iteration past the first.
The original design (stamp only on false→true edge) treated all drained
pending as "one event" for debounce purposes, which lets a write at
T+65s pipe through cycle 1 into cycle 2 without re-debounce. Re-stamping
means any RestartByID arriving during cycle N hits a fresh 60s window
and is dropped by shouldDebounceRestart instead of chaining into N+1.
The secrets-batch coalescing semantic (one cycle picks up everyone who
arrived during it) is preserved.

Tests (workspace_restart_file_write_debounce_test.go, 7 cases):
- FirstWriteRestarts (baseline — drop counter stays 0)
- SecondWriteWithin15sSkipped (core fix — drop=1, stamp unchanged)
- ManyWritesInBurstCoalesceToOne (10-PUT burst → 1 fire + 9 drops)
- AfterWindowExpiresFiresAgain (5ms shrink, second fires post-window)
- DifferentWorkspacesIndependent (per-workspace isolation)
- ConcurrentCallsSafelyDebounced (50-goroutine herd → exactly 1 fires)
- TestCoalesceRestart_DrainRespectsRestartedAtBetweenIterations (Path B
  regression: drained cycle's restartStartedAt is observably newer)

All existing TestCoalesceRestart_* + TestRestartByID_* + self-fire +
isRestarting tests still pass.

Closes #624

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
core-qa approved these changes 2026-05-20 22:37:06 +00:00
core-qa left a comment
Member

core-qa five-axis review — APPROVE.

Correctness: 15s window is well-sized. Canvas Save bursts cluster in a 5-10s tight write window then idle; 15s comfortably absorbs the burst, exits the window before legit 'edit / wait / edit again' patterns get swallowed. The Path A+B layering is correct: Path A drops re-fires at the call site; if anything slips through (a2a_proxy/main.go siblings), Path B re-stamps restartStartedAt per drained iteration so a fresh restartDebounceWindow=60s gate covers the next cycle. Two independent failure modes, two independent defenses — sound.

Tests: All 7 tests pass and pin meaningful invariants — first-write fires, second-within-15s drops (with stamp-preservation assert at L116), 50-goroutine CAS contention produces exactly N-1 drops, window-expiry releases, per-workspace isolation, drain re-stamp monotonicity. Path B test (L266) uses a real time.Sleep(20ms) between cycles so the stamps[1].After(stamps[0]) assertion is observable, not a no-op.

Bonus 10-PUT burst test (L127): asserts drops==9 — STRONGER than 'at most 2 cycles' (it's exactly 1 cycle). However the test only validates Path A in isolation; it does NOT cross-validate that Path B's 60s gate kicks in if a write LEAKS past Path A's 15s window. The two layers are unit-tested independently. Acceptable since each layer has its own pin and the prod chain requires both to fail — but an integration test that feeds an out-of-window write into a still-draining cycle would be the highest-confidence regression pin. Non-blocking.

sync.Map memory: fileWriteRestartLastFireAt grows monotonically — no Delete on workspace teardown. Per-workspace overhead ~32 bytes; for 10k lifetime workspaces ~300KB. Same shape as the pre-existing restartStates sync.Map that already ships. Not a regression, but worth a TODO/follow-up issue if the workspace count grows substantially. Non-blocking.

Concurrency: atomic.Int64 inside sync.Map.LoadOrStore is the canonical lock-free pattern; CAS loop bounded at 4 retries (any contention beats out in ≤1 retry given the elapsed-time check post-load). No race risk.

Edge cases: First-write zero-init stamp (0) sits ~50+ years in the past relative to time.Now().UnixNano(), so the elapsed < window predicate is correctly false on first call. Verified by TestMaybeRestartAfterFileWrite_FirstWriteRestarts.

Ready to merge.

core-qa five-axis review — APPROVE. **Correctness**: 15s window is well-sized. Canvas Save bursts cluster in a 5-10s tight write window then idle; 15s comfortably absorbs the burst, exits the window before legit 'edit / wait / edit again' patterns get swallowed. The Path A+B layering is correct: Path A drops re-fires at the call site; if anything slips through (a2a_proxy/main.go siblings), Path B re-stamps `restartStartedAt` per drained iteration so a fresh `restartDebounceWindow=60s` gate covers the next cycle. Two independent failure modes, two independent defenses — sound. **Tests**: All 7 tests pass and pin meaningful invariants — first-write fires, second-within-15s drops (with stamp-preservation assert at L116), 50-goroutine CAS contention produces exactly N-1 drops, window-expiry releases, per-workspace isolation, drain re-stamp monotonicity. Path B test (L266) uses a real `time.Sleep(20ms)` between cycles so the `stamps[1].After(stamps[0])` assertion is observable, not a no-op. **Bonus 10-PUT burst test (L127)**: asserts `drops==9` — STRONGER than 'at most 2 cycles' (it's exactly 1 cycle). However the test only validates Path A in isolation; it does NOT cross-validate that Path B's 60s gate kicks in if a write LEAKS past Path A's 15s window. The two layers are unit-tested independently. Acceptable since each layer has its own pin and the prod chain requires both to fail — but an integration test that feeds an out-of-window write into a still-draining cycle would be the highest-confidence regression pin. Non-blocking. **sync.Map memory**: `fileWriteRestartLastFireAt` grows monotonically — no Delete on workspace teardown. Per-workspace overhead ~32 bytes; for 10k lifetime workspaces ~300KB. Same shape as the pre-existing `restartStates sync.Map` that already ships. Not a regression, but worth a TODO/follow-up issue if the workspace count grows substantially. Non-blocking. **Concurrency**: `atomic.Int64` inside `sync.Map.LoadOrStore` is the canonical lock-free pattern; CAS loop bounded at 4 retries (any contention beats out in ≤1 retry given the elapsed-time check post-load). No race risk. **Edge cases**: First-write zero-init stamp (0) sits ~50+ years in the past relative to `time.Now().UnixNano()`, so the `elapsed < window` predicate is correctly false on first call. Verified by `TestMaybeRestartAfterFileWrite_FirstWriteRestarts`. Ready to merge.
core-devops approved these changes 2026-05-20 22:37:45 +00:00
core-devops left a comment
Member

core-devops five-axis review — APPROVE.

Operational fit: The fix maps cleanly to the prod chain (canvas-Save burst → 9 file-write trigger sites → coalesceRestart drain) verified at 22:00-22:11Z on workspace 3fe84b89. Two-layer defense (Path A call-site debounce + Path B drain re-stamp) closes both the dominant source (file-write burst) and any straggler that arrives via the 4 sibling siblings. Sound.

Surgical-swap check: All 9 call-sites in templates.go (6) + template_import.go (3) literally replace wsID := workspaceID; h.wh.goAsync(func() { h.wh.RestartByID(wsID) }) with h.wh.maybeRestartAfterFileWrite(workspaceID). No collateral handler-logic changes in the diff. Confirmed by reading the unified diff line-by-line.

4 sibling RestartByID callers verified sparse-trigger (NOT same-bug-class):

  • a2a_proxy.go:669 — single fire per A2A preflight request, not burst.
  • a2a_proxy_helpers.go:217, 279 — startup / post-launch single-shot.
  • cmd/server/main.go:224 — liveness-monitor offline callback (one trigger per offline event).
    None live in a burst path. PR body claim holds.

Observability: maybeRestartAfterFileWrite emits a structured log line on every drop ('%s — coalesced (last fire %s ago < %s window; total dropped=%d)') plus a process-wide fileWriteRestartDropCounter atomic. Searchable in Loki via the literal substring maybeRestartAfterFileWrite to confirm the loop stopped post-deploy. Path B has no explicit drop log line for the re-stamp event but shouldDebounceRestart already logs at the existing 60s gate — sufficient. /admin/metrics exposure is deferred (called out in code comment as separate RFC) — non-blocking.

LoC / tests ratio: +497/-27 looks large but is +316 test (66%) / +181 production. Tests are dense and meaningful (CAS contention, per-workspace isolation, window-expiry, drain re-stamp). Acceptable test-investment ratio for a fix that lives in the workspace-lifecycle hot path.

Rollout risk: NO feature flag. Path A is purely additive (drop branch new, fire branch identical to before). Path B changes drain semantics — but the change is monotone-defensive: re-stamping restartStartedAt can only DROP additional cycles, never SCHEDULE additional ones. Worst-case post-merge regression is 'restart fires fewer times than expected'; the existing 60s window already governed the first cycle so behavior for non-burst paths is unchanged. Safe to ship immediately. Recommend a Loki check ~10min post-deploy for the coalesced log line to confirm Path A is firing in prod.

Ready to merge.

core-devops five-axis review — APPROVE. **Operational fit**: The fix maps cleanly to the prod chain (canvas-Save burst → 9 file-write trigger sites → coalesceRestart drain) verified at 22:00-22:11Z on workspace 3fe84b89. Two-layer defense (Path A call-site debounce + Path B drain re-stamp) closes both the dominant source (file-write burst) and any straggler that arrives via the 4 sibling siblings. Sound. **Surgical-swap check**: All 9 call-sites in templates.go (6) + template_import.go (3) literally replace `wsID := workspaceID; h.wh.goAsync(func() { h.wh.RestartByID(wsID) })` with `h.wh.maybeRestartAfterFileWrite(workspaceID)`. No collateral handler-logic changes in the diff. Confirmed by reading the unified diff line-by-line. **4 sibling RestartByID callers verified sparse-trigger** (NOT same-bug-class): - `a2a_proxy.go:669` — single fire per A2A preflight request, not burst. - `a2a_proxy_helpers.go:217, 279` — startup / post-launch single-shot. - `cmd/server/main.go:224` — liveness-monitor offline callback (one trigger per offline event). None live in a burst path. PR body claim holds. **Observability**: `maybeRestartAfterFileWrite` emits a structured log line on every drop (`'%s — coalesced (last fire %s ago < %s window; total dropped=%d)'`) plus a process-wide `fileWriteRestartDropCounter` atomic. Searchable in Loki via the literal substring `maybeRestartAfterFileWrite` to confirm the loop stopped post-deploy. Path B has no explicit drop log line for the re-stamp event but `shouldDebounceRestart` already logs at the existing 60s gate — sufficient. /admin/metrics exposure is deferred (called out in code comment as separate RFC) — non-blocking. **LoC / tests ratio**: +497/-27 looks large but is +316 test (66%) / +181 production. Tests are dense and meaningful (CAS contention, per-workspace isolation, window-expiry, drain re-stamp). Acceptable test-investment ratio for a fix that lives in the workspace-lifecycle hot path. **Rollout risk**: NO feature flag. Path A is purely additive (drop branch new, fire branch identical to before). Path B changes drain semantics — but the change is monotone-defensive: re-stamping `restartStartedAt` can only DROP additional cycles, never SCHEDULE additional ones. Worst-case post-merge regression is 'restart fires fewer times than expected'; the existing 60s window already governed the first cycle so behavior for non-burst paths is unchanged. Safe to ship immediately. Recommend a Loki check ~10min post-deploy for the `coalesced` log line to confirm Path A is firing in prod. Ready to merge.
core-devops merged commit dbbd351c70 into main 2026-05-20 22:59:09 +00:00
Sign in to join this conversation.
No Reviewers
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1623