feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1

Closes the third silent-drop-on-SaaS class for the restart verb. Two
of the three dispatchers were already in place (provisionWorkspaceAuto
PR #2811, StopWorkspaceAuto PR #2824); this completes the trio.

PR #2835 was an earlier attempt at this work (delivered by a peer
agent) that I had to send back for four critical bugs: stop-leg
dispatch order inverted, a nil-deref on the no-backend path, an empty
payload (leaving the dispatcher unusable by callers), and
forcing-function tests red from day one. This redo keeps the audit +
classification from that work but rebuilds the implementation against
the existing dispatcher convention.

Phase 1 scope:

  - RestartWorkspaceAuto in workspace.go — symmetric mirror of
    provisionWorkspaceAuto + StopWorkspaceAuto. CP-first dispatch
    order. cpStopWithRetry on the SaaS leg: Restart's "make it alive
    again" contract justifies the retry that StopWorkspaceAuto's
    delete-time contract does not (a sketch of the retry shape
    follows this list). Three-arm shape, including a no-backend
    mark-failed arm as defense in depth.
  - Three new pin tests covering the routing surface:
    TestRestartWorkspaceAuto_RoutesToCPWhenSet,
    TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker,
    TestRestartWorkspaceAuto_NoBackendMarksFailed.

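For reviewers without the earlier dispatcher PRs in cache, the SaaS-leg
retry looks roughly like the sketch below. Illustrative only: the
attempt count, base delay, and log wording are stand-ins, not the
actual cpStopWithRetry in workspace.go.

    // Sketch of the bounded-backoff Stop described above; the
    // constants and log text are assumptions for illustration.
    func stopWithBoundedBackoff(ctx context.Context,
        stop func(context.Context, string) error, workspaceID, caller string) {
        const maxAttempts = 3
        delay := 500 * time.Millisecond
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if err := stop(ctx, workspaceID); err == nil {
                return
            }
            if attempt < maxAttempts {
                time.Sleep(delay)
                delay *= 2 // bounded exponential backoff
            }
        }
        // Exhausted: flag for the orphan reconciler and let the caller
        // reprovision anyway; "make it alive again" beats "stop or bust".
        log.Printf("%s: LEAK-SUSPECT, Stop still failing for %s after %d attempts",
            caller, workspaceID, maxAttempts)
    }
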
Phase 2/3 (deferred; to be filed as a follow-up issue):

  - workspace_restart.go's manual dispatch sites (Restart handler
    goroutine, Resume handler goroutine, runRestartCycle's inline
    Stop, Pause loop). Each site has async-context reasoning beyond
    a fire-and-return dispatcher and needs per-site review.
  - Pause specifically needs a different verb (PauseWorkspaceAuto)
    since Pause doesn't reprovision.
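
To anchor that discussion: a stop-only verb would plausibly take the
shape below. The name comes from this plan, but the body (whether the
CP leg keeps retry semantics, what the no-backend arm should do) is an
open Phase 3 question, not a commitment.

    // Hypothetical Phase 3 shape: stop legs only, no reprovision
    // goroutine. Retry and no-backend semantics deliberately TBD.
    func (h *WorkspaceHandler) PauseWorkspaceAuto(ctx context.Context, workspaceID string) bool {
        if h.cpProv != nil {
            h.cpStopWithRetry(ctx, workspaceID, "PauseWorkspaceAuto")
            return true
        }
        if h.provisioner != nil {
            h.provisioner.Stop(ctx, workspaceID)
            return true
        }
        return false
    }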

Why no callers migrated in this PR: the existing call sites in
workspace_restart.go all build their `payload` from a synchronous
DB read first; rewiring them needs care to preserve that ordering
plus the resetClaudeSession + template path resolution that lives
in the HTTP handler context. Splitting the dispatcher introduction
from the migration keeps each PR small and reviewable.
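
For concreteness, a Phase 2 migration would look something like this
sketch. Every helper name here (loadWorkspaceRow, buildRestartPayload,
resolveTemplatePath, and the exact resetClaudeSession call shape) is
hypothetical, standing in for what the handler currently does inline.

    // Hypothetical caller shape inside an HTTP handler; illustrative
    // names only, not functions in this codebase.
    row, err := loadWorkspaceRow(r.Context(), db, workspaceID) // synchronous DB read first
    if err != nil {
        http.Error(w, "workspace not found", http.StatusNotFound)
        return
    }
    payload := buildRestartPayload(row) // name, runtime, tier, model, workspace_dir, ...
    h.resetClaudeSession(workspaceID)   // stays in the handler context, per above
    templatePath := resolveTemplatePath(row)
    if !h.RestartWorkspaceAuto(r.Context(), workspaceID, templatePath, nil, payload) {
        return // dispatcher already marked the workspace failed
    }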

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@ -207,6 +207,80 @@ func (h *WorkspaceHandler) StopWorkspaceAuto(ctx context.Context, workspaceID st
	return nil
}

// RestartWorkspaceAuto stops the running workload (with retry semantics
// tuned for the restart hot path) then starts provisioning again, in a
// detached goroutine. Returns true when a backend was kicked off, false
// when neither is wired (caller owns the persist + mark-failed surface
// in that case — symmetric with provisionWorkspaceAuto's bool return).
//
// Single source of truth for "restart a workspace" — third in the
// dispatcher trio alongside provisionWorkspaceAuto and StopWorkspaceAuto.
// Phase 1 of #2799 introduces this helper; the workspace_restart.go
// call sites (Restart HTTP handler goroutine, Resume handler, Pause
// loop) follow in Phase 2/3 because they need async-context reasoning
// beyond a fire-and-return dispatcher.
//
// Retry on the Stop leg is intentional and distinguishes this from
// StopWorkspaceAuto:
//
// - StopWorkspaceAuto (Stop-on-delete contract): no retry, no-backend
// is a silent no-op. Different verb, different stakes — a workspace
// nobody is running can't be stopped.
//
// - RestartWorkspaceAuto: bounded exponential backoff on cpProv.Stop
// via cpStopWithRetry. Restart's contract is "make the workspace
// alive again" — refusing to reprovision when Stop fails strands
// the user with a dead workspace and no recovery path other than
// manual canvas intervention. Retry absorbs the transient CP/AWS
// hiccups that cause most EC2-leak-adjacent incidents. On final
// exhaustion, cpStopWithRetry logs LEAK-SUSPECT and proceeds with
// reprovision regardless, bridging to the orphan reconciler.
//
// Docker provisioner.Stop has no retry — a local container that fails
// to stop is a local infrastructure problem (OOM, resource pressure)
// and retries won't help; the subsequent provision attempt will surface
// the underlying daemon failure.
//
// Architectural note: this helper encapsulates the stop+reprovision
// pair. The "which backend for stop" and "which backend for provision"
// decisions live here and stay in sync (CP-stop pairs with CP-provision;
// Docker-stop pairs with Docker-provision). Callers that need only the
// stop half use StopWorkspaceAuto (delete path) or stopForRestart
// (restart-path internal helper) directly.
//
// Payload requirements: caller MUST construct payload from the live
// workspace row (name, runtime, tier, model, workspace_dir, etc.) so
// the reprovision comes up with the workspace's actual configuration.
// runRestartCycle already builds its payload this way (synchronously,
// around line ~538); match that pattern when migrating callers here.
func (h *WorkspaceHandler) RestartWorkspaceAuto(ctx context.Context, workspaceID, templatePath string, configFiles map[string][]byte, payload models.CreateWorkspacePayload) bool {
	// Stop leg first. CP-first ordering matches the other dispatchers
	// (provisionWorkspaceAuto, StopWorkspaceAuto) and the convention
	// documented in docs/architecture/backends.md.
	if h.cpProv != nil {
		h.cpStopWithRetry(ctx, workspaceID, "RestartWorkspaceAuto")
		go h.provisionWorkspaceCP(workspaceID, templatePath, configFiles, payload)
		return true
	}
	if h.provisioner != nil {
		// Docker.Stop has no retry — see docstring rationale.
		h.provisioner.Stop(ctx, workspaceID)
		go h.provisionWorkspace(workspaceID, templatePath, configFiles, payload)
		return true
	}
	// No backend wired — same shape as provisionWorkspaceAuto's no-backend
	// arm. Mark the workspace failed so the user sees a meaningful state
	// rather than a hang. 10s context lets markProvisionFailed broadcast
	// + UPDATE; the original ctx may already be cancelled.
	log.Printf("RestartWorkspaceAuto: no provisioning backend wired for %s — marking failed (cpProv=nil, provisioner=nil)", workspaceID)
	failCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	h.markProvisionFailed(failCtx, workspaceID,
		"no provisioning backend available — workspace requires either a Docker daemon (self-hosted) or control-plane provisioner (SaaS)",
		nil)
	return false
}

// SetEnvMutators wires a provisionhook.Registry into the handler. Plugins
// living in separate repos register on the same Registry instance during
// boot (see cmd/server/main.go) and main.go calls this setter once before

@ -607,6 +607,127 @@ func TestNoCallSiteCallsBareStop(t *testing.T) {
	}
}

// TestRestartWorkspaceAuto_RoutesToCPWhenSet — third dispatcher, same
// drift-class shape as the other two. SaaS path goes through CP with
// retry semantics. The cpStopWithRetry retry loop fires before
// provision spawns; this test asserts cpProv.Stop was invoked at
// least once with the workspace ID (we can't assert exact retry
// count without mocking out the retry helper itself, which would
// invert the test contract — the retry IS the dispatcher's job here).
func TestRestartWorkspaceAuto_RoutesToCPWhenSet(t *testing.T) {
	rec := &trackingCPProv{}
	bcast := &concurrentSafeBroadcaster{}
	h := NewWorkspaceHandler(bcast, nil, "http://localhost:8080", t.TempDir())
	h.SetCPProvisioner(rec)

	// Mock DB so cpStopWithRetry can run without a real Postgres.
	mock := setupTestDB(t)
	mock.MatchExpectationsInOrder(false)
	// provisionWorkspaceCP runs in the goroutine and will hit secrets
	// SELECTs + UPDATE workspace as failed (we make CP Start return
	// an error to short-circuit the post-Start path).
	mock.ExpectQuery(`SELECT key, encrypted_value, encryption_version FROM global_secrets`).
		WillReturnRows(sqlmock.NewRows([]string{"key", "encrypted_value", "encryption_version"}))
	mock.ExpectQuery(`SELECT key, encrypted_value, encryption_version FROM workspace_secrets`).
		WithArgs(sqlmock.AnyArg()).
		WillReturnRows(sqlmock.NewRows([]string{"key", "encrypted_value", "encryption_version"}))
	mock.ExpectExec(`UPDATE workspaces SET status =`).
		WithArgs(sqlmock.AnyArg(), sqlmock.AnyArg(), sqlmock.AnyArg()).
		WillReturnResult(sqlmock.NewResult(0, 1))
	rec.startErr = errors.New("simulated CP rejection")

	wsID := "ws-restart-routes-cp-0123456789ab"
	ok := h.RestartWorkspaceAuto(context.Background(), wsID, "", nil, models.CreateWorkspacePayload{
		Name: "restart-test", Tier: 1, Runtime: "claude-code",
	})
	if !ok {
		t.Fatalf("expected RestartWorkspaceAuto to return true with CP wired")
	}

	// Wait for the goroutine to land. cpStopWithRetry runs synchronously
	// before the provision goroutine fires; both call sites record into
	// the tracking stub, so we expect at least one Stop and (eventually)
	// at least one Start.
	deadline := time.Now().Add(2 * time.Second)
	for {
		if len(rec.stoppedSnapshot()) > 0 && len(rec.startedSnapshot()) > 0 {
			break
		}
		if time.Now().After(deadline) {
			t.Fatalf("timed out waiting for cpProv.Stop + cpProv.Start; stopped=%v started=%v",
				rec.stoppedSnapshot(), rec.startedSnapshot())
		}
		time.Sleep(20 * time.Millisecond)
	}
	stopped := rec.stoppedSnapshot()
	if len(stopped) == 0 || stopped[0] != wsID {
		t.Errorf("expected cpProv.Stop invoked with %q, got %v", wsID, stopped)
	}
	started := rec.startedSnapshot()
	if len(started) == 0 || started[0] != wsID {
		t.Errorf("expected cpProv.Start invoked with %q, got %v", wsID, started)
	}
}

// TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted
// path. Docker provisioner.Stop has no retry; this test only asserts
// the dispatch order (Stop → spawn provision goroutine) without
// stubbing the entire Docker provision pipeline.
func TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker(t *testing.T) {
	bcast := &concurrentSafeBroadcaster{}
	h := NewWorkspaceHandler(bcast, nil, "http://localhost:8080", t.TempDir())
	stub := &stoppingLocalProv{}
	h.provisioner = stub

	// We can't easily exercise the full Docker provision path without
	// a Docker daemon; spawn-and-return is enough to assert the dispatch
	// took the right branch. The provisionWorkspace goroutine that fires
	// will panic in setupContainer (no client), recovered by
	// logProvisionPanic — the test path completes before that.
	wsID := "ws-restart-routes-docker"
	ok := h.RestartWorkspaceAuto(context.Background(), wsID, "", nil, models.CreateWorkspacePayload{
		Name: "restart-test", Tier: 1, Runtime: "claude-code",
	})
	if !ok {
		t.Fatalf("expected RestartWorkspaceAuto to return true with Docker wired")
	}
	// Stop call is synchronous on the Docker leg.
	if len(stub.stopped) == 0 || stub.stopped[0] != wsID {
		t.Errorf("expected provisioner.Stop invoked with %q, got %v", wsID, stub.stopped)
	}
}

// TestRestartWorkspaceAuto_NoBackendMarksFailed — when neither backend
// is wired, the dispatcher returns false AND marks the workspace
// failed (defense in depth, mirroring provisionWorkspaceAuto). Distinct
// from StopWorkspaceAuto's no-op-on-no-backend contract: Restart's
// promise is "the workspace will be alive again" — failing silently
// would strand the user with a stuck workspace and no error path.
func TestRestartWorkspaceAuto_NoBackendMarksFailed(t *testing.T) {
	mock := setupTestDB(t)
	mock.MatchExpectationsInOrder(false)
	mock.ExpectExec(`UPDATE workspaces SET status =`).
		WithArgs(sqlmock.AnyArg(), sqlmock.AnyArg(), sqlmock.AnyArg()).
		WillReturnResult(sqlmock.NewResult(0, 1))

	bcast := &concurrentSafeBroadcaster{}
	h := NewWorkspaceHandler(bcast, nil, "http://localhost:8080", t.TempDir())
	// Neither SetCPProvisioner nor a Docker provisioner — both nil.
	ok := h.RestartWorkspaceAuto(context.Background(), "ws-restart-noback", "", nil, models.CreateWorkspacePayload{
		Name: "restart-test", Tier: 1, Runtime: "claude-code",
	})
	if ok {
		t.Fatalf("expected RestartWorkspaceAuto to return false with no backend wired")
	}
	if err := mock.ExpectationsWereMet(); err != nil {
		t.Errorf("expected markProvisionFailed UPDATE to fire on no-backend path: %v", err)
	}
}

// stripGoComments removes // line comments and /* */ block comments
// from Go source. Imperfect (doesn't handle comments-inside-strings)
// but adequate for the source-level pin tests in this file — none of