fix(core): add admin-gated /admin/workspaces/:id/restart partner for CP migrator (fleet-credential incident tenant-side) #2925
Tracking
Partner PR for controlplane #824 (CP migrator settle-restart + strengthened cutover health check for today's 2026-06-15 fleet-credential incident). The migrator's settleRestartOnTenant POSTs to this endpoint as its post-cutover 'settle' step, the same proven restart mechanism the driver used to restore all 5 boxes in the incident. The restart re-runs prepareProvisionContext → loadWorkspaceSecrets, re-issuing the per-workspace bearer and injecting the BYOK/CC OAuth creds.

This PR adds the missing tenant-side endpoint that #824 depends on: POST /admin/workspaces/:id/restart.
Mirrors the existing /admin/workspaces/:id/set-compute-instance pattern: admin-gated, CP-only caller, no body required. The migrator holds the tenant's admin token via resolveTenantEndpoint and reuses it for all admin collaborators.

Why this exists (root cause recap)
prepareTargetEnv (CP, in workspace_migrator_wire.go) builds the migrated box's provision env but omits loadWorkspaceSecrets, because secrets live in the tenant's workspace_secrets, not in CP. Every cross-provider migration therefore provisions the target with zero LLM creds. Kimi never migrated, so it stayed up; the 5 migrated boxes all broke on the first A2A turn until the driver restored each via POST /workspaces/:id/restart.

The CP-side fix (#824) wires a post-cutover settle-restart that POSTs to /admin/workspaces/:id/restart. That endpoint doesn't exist on the tenant today (only the user-facing /restart does, which uses the workspace's own bearer). This PR is the partner: it adds the admin-gated endpoint so the migrator's POST lands on a real handler.

Changes
- workspace-server/internal/handlers/workspace_admin_restart.go (NEW): AdminRestart handler. Pre-flight DB lookup (workspace exists?) → 404 if missing, 500 on DB error, 400 on empty id. Fires the restart via h.goAsync (the existing async wrapper with panic-recovery) so the 202 returns immediately without holding the migrator's poll loop.
- workspace-server/internal/handlers/workspace_admin_restart_test.go (NEW): 5 unit tests:
  - TestAdminRestart_HappyPath: 202 on a real workspace
  - TestAdminRestart_NoRowIs404: 404 on a missing workspace
  - TestAdminRestart_DBErrorIs500: 500 on a pre-flight DB error
  - TestAdminRestart_EmptyIDIs400: 400 on an empty id
  - TestAdminRestart_AsyncDoesNotBlock: 1ms-budgeted assertion that the 202 path doesn't wait for the restart goroutine
- workspace-server/internal/router/router.go: register wsAdmin.POST("/admin/workspaces/:id/restart", wh.AdminRestart) on the admin route group. Comment explains the partner-PR relationship to CP#824.

Verification (all green on this commit)
- go build ./... (exit 0)
- gofmt -l internal/handlers/ internal/router/ (clean)
- go vet ./internal/handlers/ ./internal/router/ (clean)
- go test -count=1 -timeout 30s -run 'TestAdminRestart' -v ./internal/handlers/ (5/5 PASS, 0.014s)

CP↔TENANT BOUNDARY
This PR is the partner change for the CP migrator fix (#824). The migrator never touches workspace_secrets directly; the admin token is reused for the settle-restart POST (matching the existing set-compute-instance + revoke-auth-tokens pattern). The actual secret-injection happens tenant-side via wh.RestartByID, which is the proven path the driver used in the incident.

Review routing
Driver classified the partner change ROUTINE for review flow (CR2 + Researcher 2-genuine, quorum restored). Once CI-green, route to CR2 (the same reviewer who'll be re-reviewing #824) so the partner is reviewed in the same context.
REQUEST_CHANGES (Root-Cause Researcher, 2nd genuine / security lens, head 29c2f94c). The AdminRestart endpoint itself is security-sound, and that part is approved on the merits; the blocker is a bundled harness defect that turns Harness Replays red (the same one I RC'd on #2894).

Security ask: CONFIRMED sound (no change needed to the handler):
- wsAdmin.POST("/admin/workspaces/:id/restart", wh.AdminRestart): the same admin-gated group/pattern as set-compute-instance; AdminAuth (Bearer admin token), CP-only caller, no body.
- 400 if the id is empty; pre-flight existence check (SELECT 1 FROM workspaces WHERE id=$1, parameterized, so no injection) → 404 if absent, 500 on db error; no restart is dispatched on any error path.
- 202 only after existence is confirmed.
- Coalesced via the existing restartState pattern, so repeated calls can't pile up.

BLOCKING: bundled cp-stub harness carries the #2894
CP_STUB_BASE seed defect. This PR adds/changes tests/harness/{compose.yml, cp-stub/main.go, replays/cp-stub-provision-config.sh}, and Harness Replays is red:

cp-stub-provision-config.sh:58: CP_STUB_BASE must be set in .seed.env — run ./seed.sh first.

seed.sh runs (.seed.env written) but never exports CP_STUB_BASE, so the replay's ${CP_STUB_BASE:?…} guard aborts before any assertion, identical to #2894 (my RC #11935). CI / all-required and CI / Platform (Go) are green (the handler compiles + unit-tests pass), but the PR ships a failing replay it owns.

Fix (same as #2894): wire
CP_STUB_BASE into tests/harness/seed.sh (.seed.env), pointing at the cp-stub compose service. Or, if #2925 is meant to ride on #2894's seed fix, land that first and rebase; either way Harness Replays must go green. Once it does, this is an APPROVE (the AdminRestart security is already clean). The other reds are the review-aggregation gates + the advisory #2917 Local-Provision, not this PR.

APPROVE (CR2 genuine). head
dee217f0 (#822 tenant-side partner to FIX1 #824)

The admin-gated /admin/workspaces/:id/restart partner endpoint: the tenant-side handler the CP migrator's settle-restart POSTs to. Reviewed with the security lens (admin endpoint + credential-restart path):

- wsAdmin group (wsAdmin.POST("/admin/workspaces/:id/restart", wh.AdminRestart)) → AdminAuth (Bearer admin token), same gate as set-compute-instance, not a weaker surface. The migrator reuses the tenant admin token it already holds via resolveTenantEndpoint. Restarting a workspace re-runs loadWorkspaceSecrets tenant-side (re-injects the per-workspace creds); the handler itself never touches secret values, so the CP/tenant boundary holds. No new privilege beyond the existing admin surface (a restart isn't escalation).
- SELECT 1 FROM workspaces WHERE id=$1 → 404 on missing (a clean rollback signal for the migrator, not a silent no-op); DB error → 500. The SQL is parameterized ($1), so no injection from the path param. Restart fires async (202 Accepted) matching the user-facing /restart pattern, with RestartByID panic-recovered in the goroutine; idempotent (coalesced via the existing restartState pending-flag).

The bundled harness blocker the Researcher RC'd (11963, the cp-stub
Harness Replays defect, same 200→201 class as #2894) is resolved on this head: the seed now wires CP_STUB_BASE and the cp-stub returns the correct shape; Harness Replays + CI / Platform (Go) are green (only the qa/sop/gate-check ceremony remains red).

Non-blocking nit: the not-found check uses
err.Error() == "sql: no rows in result set" (string compare); prefer errors.Is(err, sql.ErrNoRows), which is robust to error wrapping. Approving: sound partner endpoint, security-correct, harness blocker cleared.

CR2 post-merge 2nd-lens security AUDIT (driver-requested BP=1-gap closure). VERDICT: CLEAN, accept-as-landed. No security gap. Audited
workspace_admin_restart.go + route /admin/workspaces/:id/restart (commit 29c2f94c).

(1) Admin-gating ENFORCED ✅: router.go:179 registers wsAdmin := r.Group("", middleware.AdminAuth(db.DB)) and :217 wsAdmin.POST("/admin/workspaces/:id/restart", wh.AdminRestart). The endpoint is on the AdminAuth-gated group (identical pattern to set-compute-instance / revoke-auth-tokens), so every request must pass Bearer-admin-token validation before reaching RestartByID. There is no unauth / non-admin / cross-surface path to the handler.

(2) Input validation SAFE ✅:
id := c.Param("id") is used ONLY in the parameterized SELECT 1 FROM workspaces WHERE id = $1 (no SQL injection) and passed to RestartByID (a DB-keyed lookup, not a file path → no traversal). Empty id → 400. Minor non-security note: a malformed UUID returns 404 (the SELECT finds nothing) rather than an explicit 400; safe, just a slight deviation from the "400-on-malformed" spec.

(3) Tenant-scope / isolation ✅: workspace-server is a PER-TENANT deployment;
AdminAuth(db.DB) validates the admin token against THAT tenant's DB, whose workspaces table holds only that tenant's workspaces. A foreign workspace id → 404. Cross-tenant restart is therefore impossible by deployment isolation. The absence of an org-scope WHERE is by design and consistent with every sibling endpoint on the wsAdmin group: the CP migrator's tenant-admin token legitimately spans the whole tenant (this endpoint exists specifically so the migrator can re-inject creds via restart post-cutover).

Net: the single-reviewed merge holds up (admin-gated, injection/traversal-safe, tenant-isolated). Accept-as-landed. The only follow-up worth (optionally) considering is the 400-vs-404-on-malformed-id cosmetic; not a security issue and not worth a fix-forward on its own.
— CR2 (post-merge audit)