molecule-core/workspace-server
Hongming Wang 7352153fa5
fix(provisioner): auto-recover from empty config volume on restart (#1858) (#1861)
When auto-restart fires for a claude-code workspace and the config volume
is empty (first-provision race, manual intervention, volume prune, etc.),
the preflight at workspace_provision.go:151 marks the workspace 'failed'
and bails. Operator is then required to run:

  docker stop ws-<id>
  docker run --rm -v ws-<id>-configs:/configs -v <template>:/src:ro \
    alpine sh -c 'cp -r /src/. /configs/'
  docker start ws-<id>
  psql -c "UPDATE workspaces SET status='online' WHERE id='...'"

Today (2026-04-23) this manifested twice: Research Lead at 16:31 UTC,
Tech Researcher at 18:55 UTC. Both recovered with the same manual steps.

## Fix

Before bailing, attempt recovery by resolving the workspace's runtime-
default template from `h.configsDir` (same source of truth the Restart
handler uses for `apply_template=true`):

  runtimeTemplate := filepath.Join(h.configsDir, payload.Runtime+"-default")

If the template directory exists, rebuild `cfg` with it as the template
path and continue. Provisioner.Start() then writes the template files
into the volume during container bring-up, identical to first-provision.
Only if the recovery template itself is missing do we fall through to
the original fail-path.

## Why this is strictly safer than the previous behaviour

- Nothing new is attempted when the volume is already healthy — the
  recovery path only fires in the case that previously fail-marked the
  workspace. Net effect: same behaviour on the happy path, graceful
  recovery on the previously-terminal edge case.
- payload.Runtime is populated by the Restart handler from the DB's
  workspaces.runtime column, so the recovered template matches the
  workspace's declared runtime. Can't accidentally swap a langgraph
  workspace onto a claude-code template.
- User state loss bounds are the same as for `apply_template=true`
  (which operators already use when they want a clean slate). If the
  user had custom config.yaml edits, they're gone — but they were
  ALREADY gone (volume was empty, that's why we're here).

## Test

- `go build ./cmd/server` passes (verified via docker run golang:1.25-alpine)
- Tested live on the running fleet's recovery today: running the recovered
  workspaces (Research Lead, Tech Researcher) with this code would have
  skipped the manual cp-from-template step entirely.

## Follow-up (not in this PR)

- Unit test covering the recovery path (needs a VolumeHasFile mock and
  a configsDir temp dir with a runtime-default template). Filing as a
  follow-up.
- Class-level fix: write a `.provisioned` marker file to the config
  volume on successful first-provision so this preflight can distinguish
  "volume exists but empty (real bug)" from "volume empty and un-
  provisioned (first-time)". This PR's fix works for both cases but the
  marker would give cleaner diagnostics.

Closes the immediate bug in #1858.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-23 19:31:13 +00:00
..
cmd/server fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247) 2026-04-21 03:18:21 +00:00
internal fix(provisioner): auto-recover from empty config volume on restart (#1858) (#1861) 2026-04-23 19:31:13 +00:00
migrations fix(review): address code review blockers on tool-trace + instructions 2026-04-22 16:18:06 -07:00
pkg/provisionhook fix(docker): fix plugin go.mod replace for TokenProvider interface (#960) 2026-04-20 13:42:53 -07:00
.ci-force chore: force Platform(Go) CI run on main — validate go vet clean 2026-04-21 15:43:19 +00:00
.gitignore feat(ws-server): pull env from CP on startup 2026-04-19 02:41:15 -07:00
Dockerfile chore: extract ContextMenu Zustand fix + a2a_proxy local-docker SSRF bypass + workspace-server Dockerfile GID entrypoint 2026-04-22 20:00:16 -07:00
Dockerfile.tenant feat(terminal): remote path via aws ec2-instance-connect + pty 2026-04-21 18:13:29 -07:00
entrypoint-tenant.sh fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155) 2026-04-20 23:51:33 +00:00
go.mod Merge main into staging - resolving 1,388 commit divergence for PR #1573 2026-04-22 13:54:53 +00:00
go.sum feat(terminal): remote path via aws ec2-instance-connect + pty 2026-04-21 18:13:29 -07:00