fix(restart): clear running flag on panic in cycle()

Self-review caught a regression I introduced in #2266: if cycle() panics
(e.g. a future provisionWorkspace nil-deref or any runtime error from
the DB / Docker / encryption stacks it touches), the loop never reaches
`state.running = false`. The flag stays true forever, the early-return
guard at the top of coalesceRestart fires for every subsequent call,
and that workspace is permanently locked out of restarts until the
platform process restarts.
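
To illustrate the shape of the bug, a stripped-down sketch with
hypothetical names (the real coalesceRestart also guards with state.mu
and a pending flag, which this toy omits):

    // sticky.go: minimal repro of the sticky-flag lockout
    // (hypothetical names, not the real handler code).
    package main

    import "fmt"

    type restartState struct {
        running bool
    }

    func coalesce(state *restartState, cycle func()) {
        if state.running { // early-return guard
            return
        }
        state.running = true
        cycle()               // if this panics...
        state.running = false // ...the reset after the call never runs
    }

    func main() {
        st := &restartState{}
        func() {
            defer func() { recover() }() // keep this demo process alive
            coalesce(st, func() { panic("boom") })
        }()
        coalesce(st, func() { fmt.Println("cycle ran") }) // never prints: guard is stuck
        fmt.Println("still running?", st.running)         // true: locked out
    }

The deferred state-clear below makes that reset unconditional.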

The pre-fix code was exposed too (it never recovered, so a panic in
cycle escaped the goroutine and crashed the process despite the
deferred wsMu.Unlock()), but my pending-flag version made it worse:
the guard is sticky, not ephemeral.

Fix: defer the state-clear so it always runs on exit, including panic.
Recover (and DON'T re-raise) so the panic doesn't propagate to the
goroutine boundary and crash the whole platform process — RestartByID
is always called via `go h.RestartByID(...)` from HTTP handlers, and
an unrecovered goroutine panic in Go terminates the program. Crashing
the platform for every tenant because one workspace's cycle panicked
is the wrong availability tradeoff. The panic message + full stack
trace via runtime/debug.Stack() are still logged for debuggability.
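
For reference, a generic sketch (hypothetical names, not the handler
code) of why the recover has to sit in a defer inside the panicking
goroutine itself: a panic in one goroutine cannot be recovered from
anywhere upstream.

    // recoverdemo.go: recover in a defer inside the goroutine that
    // runs the risky work; log the panic and stack, then exit cleanly.
    package main

    import (
        "fmt"
        "log"
        "runtime/debug"
        "sync"
    )

    func runCycle(cycle func()) {
        defer func() {
            if r := recover(); r != nil {
                // Log message + stack, then let the goroutine exit cleanly.
                log.Printf("cycle panicked (suppressed): %v\n%s", r, debug.Stack())
            }
        }()
        cycle()
    }

    func main() {
        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Without the recover in runCycle, this panic would escape
            // the goroutine and terminate the entire process.
            runCycle(func() { panic("boom") })
        }()
        wg.Wait()
        fmt.Println("process still alive")
    }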

Regression test in TestCoalesceRestart_PanicInCycleClearsState:

  1. First call's cycle panics. coalesceRestart's defer must swallow
     the panic — assert no panic propagates out (would crash the
     platform process from a goroutine in production).
  2. Second call must run a fresh cycle (proves running was cleared).

All 7 tests pass with -race -count=10.

Surfaced via /code-review-and-quality self-review of #2266; the
re-raise-after-recover anti-pattern (originally argued as "don't
mask bugs") came up in the comprehensive review and was corrected
to log-with-stack-and-suppress for availability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-28 23:56:10 -07:00
parent 480b0576f2
commit bdfa45572e
2 changed files with 60 additions and 2 deletions

@@ -5,6 +5,7 @@ import (
 	"database/sql"
 	"log"
 	"net/http"
+	"runtime/debug"
 	"strings"
 	"sync"
 	"time"
@@ -352,6 +353,27 @@ func coalesceRestart(workspaceID string, cycle func()) {
 	state.running = true
 	state.mu.Unlock()
 
+	// Always clear running on exit — including panic — so a panicking
+	// cycle (e.g. a future provisionWorkspace nil-deref) doesn't leave
+	// the workspace permanently locked out of restarts.
+	//
+	// recover()-and-DON'T-re-raise on purpose: this runs in a goroutine
+	// (callers are `go h.RestartByID(...)`); an unrecovered panic in a
+	// goroutine crashes the whole platform process, taking down every
+	// workspace served by this binary because of one bug in cycle for
+	// one workspace. Log the panic with stack trace for debuggability,
+	// then recover and let the goroutine exit cleanly. The next restart
+	// request for this workspace will see running=false and proceed.
+	defer func() {
+		state.mu.Lock()
+		state.running = false
+		state.mu.Unlock()
+		if r := recover(); r != nil {
+			log.Printf("Auto-restart: %s — cycle panicked, restart-state cleared: %v\n%s",
+				workspaceID, r, debug.Stack())
+		}
+	}()
+
 	// Drain pending requests. Each iteration re-loads workspace_secrets
 	// inside provisionWorkspace, so any writes that committed since the
 	// last cycle are picked up. Continues until no pending request was
@@ -359,9 +381,8 @@ func coalesceRestart(workspaceID string, cycle func()) {
 	for {
 		state.mu.Lock()
 		if !state.pending {
-			state.running = false
 			state.mu.Unlock()
-			return
+			return // defer clears running
 		}
 		state.pending = false
 		state.mu.Unlock()

@@ -183,6 +183,43 @@ func TestCoalesceRestart_StateClearedAfterDrain(t *testing.T) {
 	}
 }
 
+// TestCoalesceRestart_PanicInCycleClearsState defends against a
+// regression of the sticky-running deadlock: if cycle() panics, the
+// running flag MUST be cleared so a follow-up RestartByID for the
+// same workspace can still acquire the gate. Without the deferred
+// state-clear + recover, a single panic would permanently lock a
+// workspace out of all future restarts until process restart.
+//
+// Also asserts the panic is RECOVERED (not re-raised): callers are
+// `go h.RestartByID(...)` from HTTP handlers, and an unrecovered
+// goroutine panic in Go takes down the whole process. Crashing the
+// platform for every tenant because one workspace's cycle panicked
+// is the wrong availability tradeoff. The panic message + stack
+// trace are still logged for debuggability.
+func TestCoalesceRestart_PanicInCycleClearsState(t *testing.T) {
+	const wsID = "test-coalesce-panic-recovery"
+	resetRestartStatesFor(wsID)
+
+	// First call's cycle panics. coalesceRestart's defer must swallow
+	// the panic so this test caller doesn't see it propagate up — that
+	// matches what the real production caller (`go h.RestartByID(...)`)
+	// gets: the goroutine survives, no process crash.
+	defer func() {
+		if r := recover(); r != nil {
+			t.Errorf("panic should NOT propagate out of coalesceRestart (would crash the platform process from a goroutine), got: %v", r)
+		}
+	}()
+	coalesceRestart(wsID, func() { panic("simulated cycle failure") })
+
+	// Second call must run a fresh cycle. If running stayed true after
+	// the panic, this call would early-return without invoking cycle.
+	var ran atomic.Bool
+	coalesceRestart(wsID, func() { ran.Store(true) })
+	if !ran.Load() {
+		t.Error("post-panic restart was blocked — running flag leaked, workspace permanently locked out")
+	}
+}
+
 // TestCoalesceRestart_DifferentWorkspacesDoNotSerialize verifies the
 // per-workspace state map: an in-flight restart for ws A must not
 // block restarts for ws B. Important for performance — without this,