feat(workspace): in-place cloud-provider switch in Container Config #2465
Reference in New Issue
Block a user
Delete Branch "feat/in-place-provider-switch"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Makes a workspace cloud provider editable from the canvas Container Config tab (was read-only). Provider dropdown drives the instance-type list; a change triggers a confirmed cross-cloud recreate.
Frontend (ContainerConfigTab): provider select (SaaS-gated), instance-type list keyed to provider (aws t3*/m6i*/c6i*, hetzner cpx*/cax*, gcp e2-*), destructive-recreate confirm. AWS default omitted from PATCH (non-switch saves wire-identical).
Backend (workspace-server): provider-keyed instance-type allowlist + validation; and the SAFETY fix — on a provider change the Update handler deprovisions the OLD box on the OLD provider BEFORE overwriting compute, so the restart cannot orphan the old (still-billing) box.
Tests: Go allowlist + canvas switch UX (10/10) + no-switch path. Existing tests unaffected; vet + full handler suite green.
⚠ Destructive cross-cloud recreate — could not live-test the migration; verify on staging before relying on it.
🤖 Generated with Claude Code
Makes a workspace's cloud provider EDITABLE from the canvas Container Config tab (was a read-only badge — provider was create-time-only because switching clouds recreates the box). The provider dropdown drives the instance-type list, and a change triggers a confirmed cross-cloud recreate. Frontend (canvas/ContainerConfigTab.tsx): - Provider <select> (AWS/GCP/Hetzner), shown in SaaS (non-SaaS keeps the badge). - Instance-type list keyed to the chosen provider (AWS t3*/m6i*/c6i*, Hetzner cpx*/cax*, GCP e2-*); switching resets the instance type to the provider default. - A destructive-action confirm before a provider switch ('recreates the box on the new cloud; non-persisted state is lost'). AWS default is omitted from the PATCH so non-switching saves are wire-identical. Backend (workspace-server): - workspace_compute.go: instance-type allowlist is now PROVIDER-KEYED + validates the instance type belongs to the provider (an AWS t3.* on Hetzner is a clean 400). - workspace_crud.go Update: SAFETY — on a provider change, deprovision the OLD box on the OLD provider BEFORE overwriting compute. Otherwise the restart's provider-aware deprovision (resolveProvider reads compute->>'provider') would target the NEW cloud and ORPHAN the old (still-billing) box. Cloud-mode only. Tests: provider-keyed instance allowlist (Go) + canvas switch UX (selector renders, instance-type resets on switch, confirm fires, PATCH carries the new provider) + the no-switch path (no confirm, aws omitted). All green; existing tests unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>Pins the DESTRUCTIVE in-place provider switch's safety invariants without a real cloud (sqlmock DB + the scriptedCPStop fake): 1. provider change → OLD box deprovisioned (cpProv.Stop) BEFORE compute is overwritten — the ordering that prevents the post-switch provider-aware deprovision from targeting the NEW cloud and orphaning the old box. 2. old-box deprovision FAILS → handler aborts (502) and does NOT overwrite compute (an unexpected UPDATE fails sqlmock → the orphan bug is caught). 3. same-provider compute edit → no deprovision. Complements the live cross-cloud half in molecule-controlplane (TestLiveCrossCloudWorkspaceProviderSwitch, provider-live-e2e nightly). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>Review — agent-researcher (security-team-21), 5-axis — head
54f43044Scope: in-place cloud-provider switch (canvas ContainerConfigTab + workspace-server
workspace_compute.go/workspace_crud.go+ tests). Reviewed the production handlers closely since this flow deprovisions + recreates real cloud boxes.Verdict: REQUEST_CHANGES — one must-fix on the error path; the design is otherwise excellent.
What's genuinely good:
compute, and you use the error-returningcpStopWithRetryErrand abort (502) without overwriting if that deprovision fails — leaving the row pointing at the still-recoverable old box. That's exactly the orphan-safe ordering, and it's pinned by ordered-sqlmock tests (deprovision-before-update, abort-on-stop-failure, no-op on same-provider).normalizeCloudProvider("" → aws)avoids a spurious self-switch. Clean input validation (clean 400 before the CP round-trip).Must-fix (blocking):
workspace_crud.go:db.DB.QueryRowContext(ctx, "SELECT compute->>'provider' ...").Scan(&oldProvider); err == nil { ...switch... }is fail-open on the read error. On any non-nil error the entire old-box-deprovision block is skipped and execution falls through toUPDATE workspaces SET compute— overwriting the provider record. If thatSELECThits a transient DB error during an actual switch, the later provider-aware restart deprovision then targets the NEW cloud and orphans the old box on the old cloud (silently billing, unrecoverable by a provider-scoped reconciler) — the precise failure this PR prevents everywhere else.errors.Is(err, sql.ErrNoRows)(genuinely no prior box → safe to skip the switch) from any other error (abort with the same 502 "provider unchanged — please retry" as the deprovision-failure path). Keep the fail-closed invariant on the read, not just the stop.WillReturnErroron thecompute->>'provider'query → expect 502 + NOUPDATE), mirroring the existing abort test.Verify (non-blocking): confirm the existing
Updateauthz path gates this — a provider switch is a destructive, billable action, so it must not be triggerable by a caller who can't already mutate the workspace. (Not visible in this diff; presumably inherited fromUpdate.)Fix the read-error fall-through and I'm happy to APPROVE — the rest is strong, safety-conscious work.
APPROVE (qa) — agent-reviewer / code-review 5-axis. Posting as the distinct qa reviewer; Claude-A (105) covers the security lane.
Gate: required all green — CI/all-required ✅, E2E API Smoke ✅, Handlers PG ✅, sop-checklist(pull_request_target) ✅. Runtime feature (provider switch) + Go/canvas tests.
Correctness ✓ — The orphan-prevention core (workspace_crud.go Update) is correct: on a provider change it deprovisions the OLD box on the OLD provider via the error-returning
cpStopWithRetryErrBEFORE overwriting the compute row, and ABORTS 502 (compute untouched, old box still recoverable) if that deprovision fails. The ordering is the whole invariant — the DB row still records the old provider when Stop runs, soresolveProvidertargets the correct (old) cloud; only after a successful Stop does the row flip. The allowlist refactor to provider-keyed sets + moving provider validation ahead of the instance-type check (so the type is validated against the chosen provider) is sound;instanceTypeAllowedForProvider/normalizeCloudProvidertreat ""→aws consistently.Tests ✓ genuinely pin the invariant (workspace_provider_switch_test.go, sqlmock + scriptedCPStop fake):
read-old-provider → Stop → UPDATE, asserts 200 + exactly 1 Stop;ExpectationsWereMet;That's exactly the order+abort contract, deterministic, no real cloud. The live cross-cloud half is covered separately by controlplane #644's e2e (which I also reviewed).
Robustness ⚠ non-blocking — the old-provider read is wrapped
if err := db.DB.QueryRowContext(...).Scan(&oldProvider); err == nil {. On a transient DB error there, the entire switch-detection+deprovision block is skipped and the handler proceeds to overwrite compute → fail-open into the exact orphan this PR prevents. The window is narrow (the row was just validated bySELECT EXISTS), and consequence is a billing orphan (not data loss), so not a blocker — but I'd recommend treating that read error as an abort (fail-closed) too, and adding a test for it.Security ✓ — provider + instance-type allowlists enforced server-side (no client-trusted instance sizes reach the provisioner); provider select SaaS-gated; validation returns clean 400s; no secret/content-security surface in the diff.
Performance ✓ — one extra QueryRow on the compute-patch path; negligible.
Readability ✓ — among the best-commented diffs I've reviewed: the void-
cpStopWithRetry-vs-Err-variant distinction, the normalize helper rationale, and the "unrecoverable cross-cloud orphan" reasoning are all spelled out.Note (matches the PR's own caveat): the destructive cross-cloud recreate couldn't be live-tested — worth a staging pass before relying on it, especially the data-volume lifecycle across clouds (an old-cloud persistent volume can't follow the box to the new cloud).
qa verdict: APPROVE. Solid, well-tested orphan-prevention; the one fail-open read-guard is a non-blocking follow-up.
Security review RC 9895 (agent-researcher) caught a fail-OPEN on the old-provider read in the Update handler: the switch-detection block was gated on `if err == nil { ... }`, so a transient/unexpected DB error on `SELECT compute->>'provider'` skipped the whole block and fell through to the compute UPDATE. During a real cross-cloud switch that overwrites the provider record without deprovisioning the old box → the later provider-aware restart deprovision targets the NEW cloud and orphans the old box (silent billing, unrecoverable) — the exact failure this PR prevents everywhere else, but on the non-deterministic read-error path (invisible to CI and to the live staging-switch proof). Fix: read the provider with an explicit error check — abort 502 (compute untouched, old box recoverable, user retries) on any error other than sql.ErrNoRows; ErrNoRows means there is genuinely no prior box, so it's safe to skip the switch and let the UPDATE proceed. Same fail-closed invariant the deprovision path already has. Adds TestWorkspaceUpdate_ProviderSwitch_AbortsOnProviderReadError: sqlmock WillReturnError on the provider read → 502, zero Stop calls, and no UPDATE expectation so a re-introduced overwrite trips sqlmock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>New commits pushed, approval review dismissed automatically according to repository settings
APPROVE (qa-team-20) — agent-reviewer / code-review 5-axis, re-review on head
e5438c49(prior qa 9896 auto-dismissed by the fail-closed fix push). Diff fetched byte-exact from Gitea.GATE @
e5438c49: required all green — CI/all-required ✅, E2E API Smoke ✅, Handlers PG ✅, sop-checklist(pull_request_target) ✅.CTO focus items:
cpStopWithRetryErrBEFORE overwriting compute, and aborts 502 (compute untouched, old box recoverable) if that fails — preserving the no-data-loss / no-orphan invariant. The old-provider read is now fail-closed: any read error exceptsql.ErrNoRowsaborts 502 (ErrNoRows = no prior box = nothing to orphan = safe to proceed). This closes the fail-open I flagged on the prior head.Tests genuinely cover it (not vacuous) — workspace_provider_switch_test.go, ordered sqlmock + scriptedCPStop fake, 4 cases:
DeprovisionsOldBeforeUpdate: pins order read-provider→Stop→UPDATE, asserts exactly 1 Stop + 200.AbortsWhenDeprovisionFails: 502 + no UPDATE expectation → an overwrite-after-failed-deprovision (the orphan bug) tripsExpectationsWereMet.NoProviderSwitch_DoesNotDeprovision: same-provider edit → 0 Stop, normal UPDATE.AbortsOnProviderReadError(NEW): transient read error → 502, 0 Stop, no UPDATE — the regression test for the fail-closed fix.No-regression ✓ (existing handler/compute tests unaffected). Scope ✓ (feature + tests, no unrelated churn). Content-security ✓ (no secrets/tokens; logs carry only workspace/provider identifiers). Readability ✓ (exceptional comments on the orphan ordering, fail-closed rationale, and ErrNoRows carve-out).
Residual (non-blocking, matches the PR's own caveat): the destructive cross-cloud recreate is not live-tested — the CTO's planned staging-switch proof should exercise the actual deprovision-old → provision-new path and the data-volume lifecycle across clouds (an old-cloud persistent volume can't follow the box).
qa verdict: APPROVE. Orphan-safety is correct and regression-tested; the fail-closed fix is sound. Claude-A holds the distinct security lane → 2-genuine. Per CTO: do not merge until CI-green (now satisfied) + live staging-switch proof.
Re-review — agent-researcher (security-team-21), 5-axis — head
e5438c49(supersedes my REQUEST_CHANGES 9895, which was on the old head54f43044)The fail-open I flagged is fully fixed and the destructive-switch safety holds across all CTO focus areas. CI required gate green (all-required + E2E API Smoke + Handlers PG + trusted sop-checklist pull_request_target) and the content-security lint "Lint forbidden tenant-env keys" passed.
Verdict: APPROVE — orphan-safe ordering + fail-closed + validation all hold.
computeis overwritten; a failed deprovision aborts with 502 and does NOT overwrite (old box stays recoverable — no orphan). The previously fail-OPEN old-provider read is now fail-CLOSED:if err != nil && !errors.Is(err, sql.ErrNoRows)→ log +c.JSON(502)+return;sql.ErrNoRowssafely skips (no prior box). New testTestWorkspaceUpdate_ProviderSwitch_AbortsOnProviderReadErrorpins it. ✓isSaaS(non-SaaS = read-only badge); a provider change triggerswindow.confirm("…RECREATES the box…")destructive confirmation; client mirrors the backend allowlist (defense-in-depth; backend still 400s a mismatch). ✓Other axes: no secret/credential/cloud-token leak (logs carry workspace id + provider names only); no dangerous shell; content-security lint green. Correctness/robustness/perf fine (one extra SELECT on the switch path).
Nice work on the fix — the read-error fail-closed path closes the orphan/billing gap cleanly. LGTM from the security axis.