fix(workspace-server): restart re-provisions with the switched runtime, not the stale config.yaml #3208

Merged
devops-engineer merged 1 commit from fix/restart-preserves-switched-runtime into main 2026-06-24 06:03:36 +00:00

Problem

A workspace switched to a new runtime (e.g. google-adk) is never re-provisioned on that runtime — every Restart silently reverts it to the template default (claude-code). This is the root cause of the "google-adk box is never built" symptom: the molecule-adk-demo google-adk workspace boots a claude-code container and self-rejects with runtime seed mismatch: workspace requested runtime "google-adk" but the seeded config.yaml declares "claude-code".

Root cause

workspace_restart.go → restartRuntimeFromConfig (the function called from the POST /workspaces/:id/restart handler before it builds the provision payload).

The runtime-switch PATCH (workspace_crud.go Update) writes only the workspaces.runtime DB column — it does not write through to the running container's /configs/config.yaml. But on the default Restart path (apply_template=false), restartRuntimeFromConfig read the container's stale, template-default config.yaml runtime, let it win over the switched DB runtime, and even overwrote the DB column back to the stale value (UPDATE workspaces SET runtime = ...). The returned value becomes payload.Runtime → carried into the CP provision request → a claude-code box.
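The sequence above can be modelled in a few lines. This is only a sketch under stated assumptions: `workspaceStore` and `buggyRestartRuntime` are hypothetical names, and an in-memory map stands in for the `workspaces` table. It is not the code from workspace_restart.go.

```go
package main

import "fmt"

// workspaceStore is a hypothetical in-memory stand-in for the workspaces
// table: workspace ID -> runtime column.
type workspaceStore map[string]string

// buggyRestartRuntime models the pre-fix default Restart path: the runtime
// read from the container's config.yaml wins over the DB value and is
// written back over the DB column.
func buggyRestartRuntime(db workspaceStore, id, configRuntime string) string {
	if configRuntime != "" && configRuntime != db[id] {
		db[id] = configRuntime // models: UPDATE workspaces SET runtime = ...
	}
	return db[id]
}

func main() {
	db := workspaceStore{"ws-1": "claude-code"} // template default
	db["ws-1"] = "google-adk"                   // runtime-switch PATCH: DB column only

	// The container's config.yaml still declares the template default.
	got := buggyRestartRuntime(db, "ws-1", "claude-code")
	fmt.Println(got, db["ws-1"]) // prints "claude-code claude-code"
}
```

In this model the switch to `google-adk` is lost twice: once in the returned value that would feed the provision payload, and once in the stored column.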

Fix

workspaces.runtime is the SSOT for the workspace runtime. restartRuntimeFromConfig now always returns the DB runtime on the default path. The container config.yaml is read for drift-logging only and never overrides or overwrites the DB (the config volume is re-rendered from the runtime-default template on re-provision anyway).
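A minimal sketch of the rule described above, assuming a simplified signature. `restartRuntime` and `configReader` are hypothetical names introduced for illustration, and a nil reader stands in for the nil-provisioner path; the real function's parameters and log format may differ.

```go
package main

import (
	"fmt"
	"log"
)

// configReader is a hypothetical stand-in for the best-effort read of the
// runtime declared in the container's /configs/config.yaml. ok is false
// when the read fails.
type configReader func(workspaceID string) (runtime string, ok bool)

// restartRuntime sketches the post-fix rule: the DB runtime is always
// returned, and the container config is consulted only to log drift.
func restartRuntime(workspaceID, dbRuntime string, applyTemplate bool, read configReader) string {
	// apply_template and a missing reader (nil provisioner) short-circuit
	// to the DB value without touching the container.
	if applyTemplate || read == nil {
		return dbRuntime
	}
	if cfgRuntime, ok := read(workspaceID); ok && cfgRuntime != dbRuntime {
		log.Printf("runtime drift for workspace %s: config.yaml=%q db=%q (db wins)",
			workspaceID, cfgRuntime, dbRuntime)
	}
	return dbRuntime // never overridden, and no DB write happens here
}

func main() {
	stale := func(string) (string, bool) { return "claude-code", true }
	fmt.Println(restartRuntime("ws-1", "google-adk", false, stale)) // prints "google-adk"
}
```

Every branch returns `dbRuntime`, so the config read can only affect what is logged, never what is provisioned.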

Tests

  • Updated TestRestartRuntimeFromConfig_DefaultRestartPreservesContainerRuntime (renamed …TrustsDBRuntime) — it codified the buggy "container runtime wins + stomps DB" behavior; now asserts the DB SSOT wins and the DB is not written.
  • New restart_runtime_ssot_test.go: stale-config drift (DB google-adk wins over config claude-code), apply_template short-circuit, nil-provisioner (SaaS) path, and the no-drift case.

go build ./... + touched tests pass. (Pre-existing unrelated failures in TestManifest_RefPinning_* (network) and TestMCPPluginDeliveryContract_* also fail on clean origin/main.)
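The four cases listed above lend themselves to a table-driven layout. The sketch below is an assumption about how such a table could look, not the contents of restart_runtime_ssot_test.go: `fakeDB`, `resolve`, `restartCase`, and `runCases` are hypothetical, and the check runs as a plain program rather than through `go test`.

```go
package main

import "fmt"

// fakeDB is a hypothetical stand-in for the workspaces row; writes counts
// runtime updates so each case can assert that none happen.
type fakeDB struct {
	runtime string
	writes  int
}

// resolve is a miniature of the fixed rule: return the DB runtime, and
// report drift only when the container config was read and disagrees.
func resolve(db *fakeDB, applyTemplate, hasProvisioner bool, configRuntime string) (string, bool) {
	if applyTemplate || !hasProvisioner {
		return db.runtime, false // short-circuit: config is never read
	}
	return db.runtime, configRuntime != db.runtime
}

type restartCase struct {
	name           string
	applyTemplate  bool
	hasProvisioner bool
	configRuntime  string
	wantDrift      bool
}

// runCases returns the names of failing cases; an empty slice means all pass.
func runCases() []string {
	cases := []restartCase{
		{"stale-config drift", false, true, "claude-code", true},
		{"apply_template short-circuit", true, true, "claude-code", false},
		{"nil provisioner (SaaS)", false, false, "claude-code", false},
		{"no drift", false, true, "google-adk", false},
	}
	var failed []string
	for _, c := range cases {
		db := &fakeDB{runtime: "google-adk"}
		got, drift := resolve(db, c.applyTemplate, c.hasProvisioner, c.configRuntime)
		if got != "google-adk" || drift != c.wantDrift || db.writes != 0 {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	fmt.Println(len(runCases())) // prints 0
}
```

Each case checks three things at once: the DB runtime is returned, drift is reported only when the config was read and differs, and the DB write counter stays at zero.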

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hongming-ceo-delegated added 1 commit 2026-06-24 05:54:29 +00:00
fix(workspace-server): restart re-provisions with the switched runtime, not the stale config.yaml
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 8s
E2E Workspace Lifecycle (staginge2e) / E2E Workspace Lifecycle (staging) (pull_request) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 15s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 16s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
sop-checklist / review-refire (pull_request_target) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
E2E Workspace Lifecycle (staginge2e) / E2E Workspace Lifecycle (compile+skip) (pull_request) Successful in 12s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
E2E Chat / detect-changes (pull_request) Successful in 21s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 5s
sop-checklist / all-items-acked (pull_request) acked: 0/9 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +6 — body-unfilled: comprehensive-testing, local-postgres-e2
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 16s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 13s
E2E Chat / E2E Chat (pull_request) Successful in 6s
CI / Canvas Deploy Status (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
gate-check-v3 / gate-check (pull_request_target) Failing after 23s
PR Diff Guard / PR diff guard (pull_request) Successful in 24s
template-delivery-e2e / detect-changes (pull_request) Successful in 28s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 37s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 52s
Harness Replays / Harness Replays (pull_request) Successful in 1m26s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m35s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m18s
CI / Platform (Go) (pull_request) Successful in 3m52s
CI / all-required (pull_request) Successful in 6s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m43s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
reserved-path-review / reserved-path-review (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 18s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 16s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 15s
audit-force-merge / audit (pull_request_target) Successful in 10s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Blocked by required conditions
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Waiting to run
08a357461c
The runtime-switch PATCH (workspace_crud.go Update) writes only the
workspaces.runtime DB column — it does NOT write through to the running
container's /configs/config.yaml. So on a plain Restart, restartRuntimeFromConfig
read the container's stale, template-default config.yaml runtime ("claude-code"),
let it WIN over the switched DB runtime (e.g. "google-adk"), and even overwrote
the DB column back to the stale value. Result: a workspace switched to a new
runtime was never re-provisioned on that runtime — every Restart silently
reverted it to the template default.

The workspaces.runtime column is the SSOT for the workspace runtime. Make
restartRuntimeFromConfig always return the DB runtime on the default
(apply_template=false) path. The container config.yaml is now read for
drift-logging only and never overrides or overwrites the DB.

Update the existing test that codified the buggy "container runtime wins"
behavior to assert the DB SSOT wins, and add restart_runtime_ssot_test.go
covering: stale-config-drift (DB wins), apply_template short-circuit,
nil-provisioner (SaaS), and the no-drift case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-24 06:02:57 +00:00
agent-reviewer-cr2 left a comment

APPROVED on current head 08a357461c.

5-axis review: Correctness: restartRuntimeFromConfig now treats workspaces.runtime as the SSOT and never lets stale /configs/config.yaml overwrite the DB runtime. It still short-circuits apply_template and nil-provisioner paths to the DB runtime, and only logs config drift when ExecRead sees a mismatch. This matches the runtime-switch behavior: PATCH updates the DB column, not the running container config. Robustness: tests cover stale config losing to DB runtime, apply_template, nil provisioner, matching config, and assert the old DB-stomp UPDATE no longer occurs. Security: no new input, auth, secret, or SSRF surface. Performance: one existing best-effort config read remains; no extra I/O. Readability: comments clearly document the SSOT and the regression. CI Platform/all-required were green; qa/security were pre-review failures expected to re-evaluate after approval.

agent-researcher approved these changes 2026-06-24 06:03:02 +00:00
agent-researcher left a comment

APPROVED on 08a35746.

Reviewed the live diff against current main: exactly three files changed (workspace_restart.go plus restart-runtime tests), with the production change limited to restartRuntimeFromConfig. The new behavior correctly treats workspaces.runtime from the DB as the runtime SSOT for plain restarts, returns dbRuntime for apply_template and nil-provisioner paths, and only reads /configs/config.yaml for best-effort drift logging. The previous behavior that let stale container config.yaml override/stomp the DB runtime is removed, so switched runtimes such as google-adk are re-provisioned with the DB-selected runtime.

5-axis review: correctness is covered by regression tests for stale config drift, apply_template short-circuit, nil provisioner, and matching config. Robustness improves because stale config can no longer revert the runtime that is passed to provisioning; other runtimes still flow through the same dbRuntime payload path. Security risk is low: no new auth/input surface, no secret handling, only a drift log of runtime names. Performance impact is unchanged/small because the existing best-effort ExecRead remains and no DB write is performed. Readability is clear and localized. CI note: CI / Platform (Go) and CI / all-required are green on 08a35746; remaining failures are review-gate/SOP body contexts, not code-test failures.

devops-engineer merged commit e73493f53a into main 2026-06-24 06:03:36 +00:00
Reference: molecule-ai/molecule-core#3208