fix(platform): install docker-cli in workspace-server image — unblocks RegistryModeLocal #765

Merged
hongming merged 1 commits from infra/dockerfile-add-docker-cli-for-local-build into main 2026-05-13 04:39:20 +00:00

1 Commits

Author SHA1 Message Date
b8ccd21c8c fix(platform): install docker-cli in workspace-server image — unblocks RegistryModeLocal
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 18s
CI / Detect changes (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
Harness Replays / detect-changes (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 22s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s
qa-review / approved (pull_request) Failing after 13s
security-review / approved (pull_request) Failing after 14s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 25s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m24s
CI / Platform (Go) (pull_request) Has been skipped
CI / Canvas (Next.js) (pull_request) Has been skipped
CI / Shellcheck (E2E scripts) (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Has been skipped
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 1s
sop-checklist-gate / gate (pull_request) Successful in 37s
gate-check-v3 / gate-check (pull_request) Successful in 38s
sop-tier-check / tier-check (pull_request) Successful in 37s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
audit-force-merge / audit (pull_request) Successful in 8s
The platform server's internal/provisioner/localbuild.go (Task #194 /
Issue #63 — the post-2026-05-06 GHCR-suspension fallback) shells out
via exec.Command("docker", "image", "inspect"/"build"/"tag", ...) in
the production dockerHasTagProd / dockerBuildProd / dockerTagProd
functions. The colocated workspace-server/Dockerfile installed
`ca-certificates git tzdata wget` in the alpine runtime layer but NOT
`docker-cli`, so every workspace re-provision in the now-permanent
RegistryModeLocal path fails at step 2 (cache check):

  local-build: image inspect for
    molecule-local/workspace-template-claude-code:<sha> failed
    (exec: "docker": executable file not found in $PATH); will rebuild
  Provisioner: workspace start failed for <id>: local-build mode:
    ensure image for runtime "claude-code": local-build:
    docker build molecule-local/workspace-template-claude-code:<sha>:
    exec: "docker": executable file not found in $PATH

Net: ANY ws-* container that dies (auto-restart on container-dead, the
liveness-monitor RestartByID, plugin auto-restart, secrets-set
auto-restart, manual POST /workspaces/:id/restart) cannot come back
up. Already took down CP-QA (ec6cf05b) and sdk-lead (360d42e4); also
blocks the MiniMax LLM-provider switch for the 6 *-lead workspaces
(which requires postgres UPDATE workspace_secrets + POST /restart to
re-bake the env from the updated secrets).

The Docker SOCKET is already mounted into the platform container —
the entrypoint.sh adds the platform user to the docker group derived
from the socket's gid. Only the CLI binary was missing.

Per `registry_mode.go:Resolve()`, MOLECULE_IMAGE_REGISTRY is the
toggle: set ⇒ RegistryModeSaaS pull from a real registry; unset ⇒
RegistryModeLocal clone+build from Gitea. Since 2026-05-06 the env
var has been unset (GHCR was the only SaaS-mode target and it's
unreachable post-suspension), so RegistryModeLocal is the permanent
mode until internal#231 (GHCR→ECR migration) lands. This Dockerfile
needs to support the mode the code is permanently in.

Diff is +16/-1 (mostly comment explaining why). The single
behavioural change: `docker-cli` added to the apk-add line.

Verification: post-deploy, `POST /workspaces/360d42e4-…/restart` (the
known-failed sdk-lead) should succeed and bring the workspace back
up with its current Claude-Opus secrets — that's the first confirmation
the local-build path is unblocked. Then the MiniMax switch can proceed
(postgres UPDATE on each *-lead's workspace_secrets + POST /restart).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:13:55 -07:00