feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1
Closes the third silent-drop-on-SaaS class for the restart verb. Two of the three dispatchers were already in place (provisionWorkspaceAuto, PR #2811; StopWorkspaceAuto, PR #2824); this completes the trio. PR #2835 was an earlier attempt at this work (delivered by a peer agent) that I sent back for four critical bugs: stop-leg dispatch order inverted, nil deref on the no-backend path, empty payload (dispatcher unusable by callers), and forcing-function tests red from day one. This redo keeps the audit and classification from that work but rebuilds the implementation against the existing dispatcher convention.

Phase 1 scope:

- RestartWorkspaceAuto in workspace.go — a symmetric mirror of provisionWorkspaceAuto + StopWorkspaceAuto. CP-first dispatch order. cpStopWithRetry on the SaaS leg (Restart's "make it alive again" contract justifies the retry that StopWorkspaceAuto's delete-time contract does not). Three-arm shape, including a no-backend mark-failed arm as defense in depth.
- Three new pin tests covering the routing surface: TestRestartWorkspaceAuto_RoutesToCPWhenSet, TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker, TestRestartWorkspaceAuto_NoBackendMarksFailed.

Phase 2/3 (deferred; file as a follow-up issue):

- workspace_restart.go's manual dispatch sites (Restart handler goroutine, Resume handler goroutine, runRestartCycle's inline Stop, Pause loop). Each site has async-context reasoning beyond a fire-and-return dispatcher and needs per-site review.
- Pause specifically needs a different verb (PauseWorkspaceAuto), since Pause doesn't reprovision.

Why no callers are migrated in this PR: the existing call sites in workspace_restart.go all build their `payload` from a synchronous DB read first; rewiring them needs care to preserve that ordering plus the resetClaudeSession + template-path resolution that lives in the HTTP handler context. Splitting the dispatcher introduction from the migration keeps each PR small and reviewable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent e727b31246
commit 4f67fe59fb
@@ -207,6 +207,80 @@ func (h *WorkspaceHandler) StopWorkspaceAuto(ctx context.Context, workspaceID st
	return nil
}

// RestartWorkspaceAuto stops the running workload (with retry semantics
// tuned for the restart hot path) then starts provisioning again, in a
// detached goroutine. Returns true when a backend was kicked off, false
// when neither is wired (in that case the dispatcher marks the workspace
// failed as defense in depth — symmetric with provisionWorkspaceAuto's
// bool return).
//
// Single source of truth for "restart a workspace" — third in the
// dispatcher trio alongside provisionWorkspaceAuto and StopWorkspaceAuto.
// Phase 1 of #2799 introduces this helper; no callers migrate yet. The
// workspace_restart.go sites (Restart HTTP handler goroutine, Resume
// handler, runRestartCycle's inline Stop, Pause loop) follow in Phase 2/3
// because they need async-context reasoning beyond a fire-and-return
// dispatcher.
//
// Retry on the Stop leg is intentional and distinguishes this from
// StopWorkspaceAuto:
//
//   - StopWorkspaceAuto (Stop-on-delete contract): no retry, no-backend
//     is a silent no-op. Different verb, different stakes — a workspace
//     nobody is running can't be stopped.
//
//   - RestartWorkspaceAuto: bounded exponential backoff on cpProv.Stop
//     via cpStopWithRetry. Restart's contract is "make the workspace
//     alive again" — refusing to reprovision when Stop fails strands
//     the user with a dead workspace and no recovery path other than
//     manual canvas intervention. Retry absorbs the transient CP/AWS
//     hiccups that cause most EC2-leak-adjacent incidents. On final
//     exhaustion, cpStopWithRetry logs LEAK-SUSPECT and proceeds with
//     the reprovision regardless, bridging to the orphan reconciler.
//
// Docker provisioner.Stop has no retry — a local container that fails
// to stop is a local infrastructure problem (OOM, resource pressure)
// and retries won't help; the subsequent provision attempt will surface
// the underlying daemon failure.
//
// Architectural note: this helper encapsulates the stop+reprovision
// pair. The "which backend for stop" and "which backend for provision"
// decisions live here and stay in sync (CP stop pairs with CP provision;
// Docker stop pairs with Docker provision). Callers that need only the
// stop half use StopWorkspaceAuto (delete path) or stopForRestart
// (restart-path internal helper) directly.
//
// Payload requirements: the caller MUST construct payload from the live
// workspace row (name, runtime, tier, model, workspace_dir, etc.) so
// the reprovision comes up with the workspace's actual configuration.
// runRestartCycle does this synchronously (line ~538) before delegating
// — match that pattern in any new caller.
func (h *WorkspaceHandler) RestartWorkspaceAuto(ctx context.Context, workspaceID, templatePath string, configFiles map[string][]byte, payload models.CreateWorkspacePayload) bool {
	// Stop leg first. CP-first ordering matches the other dispatchers
	// (provisionWorkspaceAuto, StopWorkspaceAuto) and the convention
	// documented in docs/architecture/backends.md.
	if h.cpProv != nil {
		h.cpStopWithRetry(ctx, workspaceID, "RestartWorkspaceAuto")
		go h.provisionWorkspaceCP(workspaceID, templatePath, configFiles, payload)
		return true
	}
	if h.provisioner != nil {
		// Docker.Stop has no retry — see docstring rationale.
		h.provisioner.Stop(ctx, workspaceID)
		go h.provisionWorkspace(workspaceID, templatePath, configFiles, payload)
		return true
	}
	// No backend wired — same shape as provisionWorkspaceAuto's no-backend
	// arm. Mark the workspace failed so the user sees a meaningful state
	// rather than a hang. 10s context lets markProvisionFailed broadcast
	// + UPDATE; the original ctx may already be cancelled.
	log.Printf("RestartWorkspaceAuto: no provisioning backend wired for %s — marking failed (cpProv=nil, provisioner=nil)", workspaceID)
	failCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	h.markProvisionFailed(failCtx, workspaceID,
		"no provisioning backend available — workspace requires either a Docker daemon (self-hosted) or control-plane provisioner (SaaS)",
		nil)
	return false
}

// SetEnvMutators wires a provisionhook.Registry into the handler. Plugins
// living in separate repos register on the same Registry instance during
// boot (see cmd/server/main.go) and main.go calls this setter once before
@@ -607,6 +607,127 @@ func TestNoCallSiteCallsBareStop(t *testing.T) {
	}
}

// TestRestartWorkspaceAuto_RoutesToCPWhenSet — third dispatcher, same
// drift-class shape as the other two. SaaS path goes through CP with
// retry semantics. The cpStopWithRetry retry loop fires before
// provision spawns; this test asserts cpProv.Stop was invoked at
// least once with the workspace ID (we can't assert exact retry
// count without mocking out the retry helper itself, which would
// invert the test contract — the retry IS the dispatcher's job here).
func TestRestartWorkspaceAuto_RoutesToCPWhenSet(t *testing.T) {
	rec := &trackingCPProv{}
	bcast := &concurrentSafeBroadcaster{}
	h := NewWorkspaceHandler(bcast, nil, "http://localhost:8080", t.TempDir())
	h.SetCPProvisioner(rec)

	// Mock DB so cpStopWithRetry can run without a real Postgres.
	mock := setupTestDB(t)
	mock.MatchExpectationsInOrder(false)
	// provisionWorkspaceCP runs in the goroutine and will hit secrets
	// SELECTs + UPDATE workspace as failed (we make CP Start return
	// an error to short-circuit the post-Start path).
	mock.ExpectQuery(`SELECT key, encrypted_value, encryption_version FROM global_secrets`).
		WillReturnRows(sqlmock.NewRows([]string{"key", "encrypted_value", "encryption_version"}))
	mock.ExpectQuery(`SELECT key, encrypted_value, encryption_version FROM workspace_secrets`).
		WithArgs(sqlmock.AnyArg()).
		WillReturnRows(sqlmock.NewRows([]string{"key", "encrypted_value", "encryption_version"}))
	mock.ExpectExec(`UPDATE workspaces SET status =`).
		WithArgs(sqlmock.AnyArg(), sqlmock.AnyArg(), sqlmock.AnyArg()).
		WillReturnResult(sqlmock.NewResult(0, 1))

	rec.startErr = errors.New("simulated CP rejection")

	wsID := "ws-restart-routes-cp-0123456789ab"
	ok := h.RestartWorkspaceAuto(context.Background(), wsID, "", nil, models.CreateWorkspacePayload{
		Name: "restart-test", Tier: 1, Runtime: "claude-code",
	})
	if !ok {
		t.Fatalf("expected RestartWorkspaceAuto to return true with CP wired")
	}

	// Wait for the goroutine to land. cpStopWithRetry runs synchronously
	// before the provision goroutine fires; both call sites record into
	// the tracking stub, so we expect at least one Stop and (eventually)
	// at least one Start.
	deadline := time.Now().Add(2 * time.Second)
	for {
		if len(rec.stoppedSnapshot()) > 0 && len(rec.startedSnapshot()) > 0 {
			break
		}
		if time.Now().After(deadline) {
			t.Fatalf("timed out waiting for cpProv.Stop + cpProv.Start; stopped=%v started=%v",
				rec.stoppedSnapshot(), rec.startedSnapshot())
		}
		time.Sleep(20 * time.Millisecond)
	}

	stopped := rec.stoppedSnapshot()
	if len(stopped) == 0 || stopped[0] != wsID {
		t.Errorf("expected cpProv.Stop invoked with %q, got %v", wsID, stopped)
	}
	started := rec.startedSnapshot()
	if len(started) == 0 || started[0] != wsID {
		t.Errorf("expected cpProv.Start invoked with %q, got %v", wsID, started)
	}
}

// TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted
// path. Docker provisioner.Stop has no retry; this test only asserts
// the dispatch order (Stop → spawn provision goroutine) without
// stubbing the entire Docker provision pipeline.
func TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker(t *testing.T) {
	bcast := &concurrentSafeBroadcaster{}
	h := NewWorkspaceHandler(bcast, nil, "http://localhost:8080", t.TempDir())
	stub := &stoppingLocalProv{}
	h.provisioner = stub

	// We can't easily exercise the full Docker provision path without
	// a Docker daemon; spawn-and-return is enough to assert the dispatch
	// took the right branch. The provisionWorkspace goroutine that fires
	// will panic in setupContainer (no client), recovered by
	// logProvisionPanic — the test path completes before that.
	wsID := "ws-restart-routes-docker"
	ok := h.RestartWorkspaceAuto(context.Background(), wsID, "", nil, models.CreateWorkspacePayload{
		Name: "restart-test", Tier: 1, Runtime: "claude-code",
	})
	if !ok {
		t.Fatalf("expected RestartWorkspaceAuto to return true with Docker wired")
	}

	// Stop call is synchronous on the Docker leg.
	if len(stub.stopped) == 0 || stub.stopped[0] != wsID {
		t.Errorf("expected provisioner.Stop invoked with %q, got %v", wsID, stub.stopped)
	}
}

// TestRestartWorkspaceAuto_NoBackendMarksFailed — when neither backend
// is wired, the dispatcher returns false AND marks the workspace
// failed (defense in depth, mirroring provisionWorkspaceAuto). Distinct
// from StopWorkspaceAuto's no-op-on-no-backend contract: Restart's
// promise is "the workspace will be alive again" — failing silently
// would strand the user with a stuck workspace and no error path.
func TestRestartWorkspaceAuto_NoBackendMarksFailed(t *testing.T) {
	mock := setupTestDB(t)
	mock.MatchExpectationsInOrder(false)
	mock.ExpectExec(`UPDATE workspaces SET status =`).
		WithArgs(sqlmock.AnyArg(), sqlmock.AnyArg(), sqlmock.AnyArg()).
		WillReturnResult(sqlmock.NewResult(0, 1))

	bcast := &concurrentSafeBroadcaster{}
	h := NewWorkspaceHandler(bcast, nil, "http://localhost:8080", t.TempDir())
	// Neither SetCPProvisioner nor a Docker provisioner — both nil.

	ok := h.RestartWorkspaceAuto(context.Background(), "ws-restart-noback", "", nil, models.CreateWorkspacePayload{
		Name: "restart-test", Tier: 1, Runtime: "claude-code",
	})
	if ok {
		t.Fatalf("expected RestartWorkspaceAuto to return false with no backend wired")
	}
	if err := mock.ExpectationsWereMet(); err != nil {
		t.Errorf("expected markProvisionFailed UPDATE to fire on no-backend path: %v", err)
	}
}

// stripGoComments removes // line comments and /* */ block comments
// from Go source. Imperfect (doesn't handle comments-inside-strings)
// but adequate for the source-level pin tests in this file — none of