fix(platform): install docker-cli in workspace-server image — unblocks RegistryModeLocal #765

Merged
hongming merged 1 commits from infra/dockerfile-add-docker-cli-for-local-build into main 2026-05-13 04:39:20 +00:00
Owner

Summary

One-word + 15-line-comment fix to workspace-server/Dockerfile: install docker-cli in the alpine runtime layer alongside the existing ca-certificates git tzdata wget. Without it, the colocated internal/provisioner/localbuild.go code path fails at its first docker call, the image cache check (dockerHasTagProd shells out via exec.Command("docker", "image", "inspect", ...)). This has been the permanent code path since 2026-05-06: GHCR is unreachable and MOLECULE_IMAGE_REGISTRY is unset → registry_mode.go:Resolve() returns RegistryModeLocal → EnsureLocalImage() runs. The failure is logged as:

local-build: image inspect for molecule-local/workspace-template-claude-code:<sha> failed
  (exec: "docker": executable file not found in $PATH); will rebuild
Provisioner: workspace start failed for <id>: local-build mode: ensure image for runtime
  "claude-code": local-build: docker build molecule-local/workspace-template-claude-code:<sha>:
  exec: "docker": executable file not found in $PATH

Workspace stays status: failed. ANY ws-* re-provision is currently broken fleet-wide.
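The error text in the log above is what Go's os/exec reports when a binary cannot be resolved on $PATH. A minimal sketch of a startup preflight that would surface the same condition before any re-provision is attempted; the helper name cliAvailable is hypothetical and is not part of localbuild.go:

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// cliAvailable is a hypothetical preflight helper (an assumption for
// illustration, not code from this repository). It resolves name on
// $PATH the same way exec.Command would, and wraps the error so the
// missing-binary case is easy to spot in logs.
func cliAvailable(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		if errors.Is(err, exec.ErrNotFound) {
			return "", fmt.Errorf("preflight: %q is not installed in this image: %w", name, err)
		}
		return "", fmt.Errorf("preflight: lookup of %q failed: %w", name, err)
	}
	return path, nil
}

func main() {
	if path, err := cliAvailable("docker"); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("docker CLI found at", path)
	}
}
```

Running a check like this at platform start would turn a per-workspace failure into a single, early log line. Whether the platform should refuse to start or only warn is a design choice this sketch does not make.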

Why this is the root, not a patch

The Dockerfile is code. localbuild.go (Task #194 / Issue #63, post-org-suspension addition) was added to the codebase but the colocated Dockerfile was never updated to install the docker-cli package its exec.Command("docker", ...) calls depend on. So the implementation's runtime environment doesn't match what the implementation requires. Adding docker-cli to the apk add line is the actual fix — same shape as if a Go file import-ed a package not in go.mod.

The Docker SOCKET is already mounted (entrypoint.sh adds the platform user to the docker group derived from /var/run/docker.sock's gid). Only the CLI binary was missing.

The deeper fix — GHCR→ECR migration (internal#231) so MOLECULE_IMAGE_REGISTRY can point at a working registry and RegistryModeSaaS becomes a real option again — is the right long-term move but is bigger scope. Until then, RegistryModeLocal is the permanent path, and it needs to actually work.
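As a rough illustration of the toggle described here (not the actual registry_mode.go code), mode selection can be sketched as below. The type, the constant values, the function name resolveMode, and the choice to treat an empty value the same as unset are all assumptions:

```go
package main

import (
	"fmt"
	"os"
)

// RegistryMode and its values are illustrative stand-ins; only the
// names RegistryModeSaaS and RegistryModeLocal come from the PR text.
type RegistryMode string

const (
	RegistryModeSaaS  RegistryMode = "saas"
	RegistryModeLocal RegistryMode = "local"
)

// resolveMode is a hypothetical reduction of Resolve(): a non-empty
// registry selects the pull-from-registry path, anything else falls
// through to the local clone+build path (assumption: empty == unset).
func resolveMode(registry string) RegistryMode {
	if registry != "" {
		return RegistryModeSaaS
	}
	return RegistryModeLocal
}

func main() {
	mode := resolveMode(os.Getenv("MOLECULE_IMAGE_REGISTRY"))
	fmt.Println("registry mode:", mode)
}
```

With the variable unset, every provision lands on the local branch, which is why the image has to carry whatever that branch shells out to.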

Real impact this is currently blocking

  • CP-QA workspace (ec6cf05b-…) DOWN since this morning (~06:08Z) — task #43.
  • sdk-lead workspace (360d42e4-…) DOWN since ~06:08Z when a probe-misadventure triggered re-provision.
  • MiniMax LLM-provider switch for the 6 *-lead workspaces (app-lead, core-lead, cp-lead, dev-lead, infra-lead, sdk-lead) — Hongming-requested, since Claude subscription is down → leads are eating dead LLM calls. The switch requires postgres UPDATE workspace_secrets + POST /workspaces/:id/restart, which goes through the broken local-build path. The 22 other (worker) ws-* are already on MiniMax; the 6 leads can't be switched until re-provision works.
  • Latent fragility: every other auto-restart path (liveness-monitor RestartByID on container-dead, plugin install/uninstall auto-restart, secrets-set auto-restart) hits the same trap. Any workspace that bounces stays bounced.

Diff

+# docker-cli is required by internal/provisioner/localbuild.go which
+# shells out via exec.Command("docker", "image", "inspect"/"build"/"tag", ...)
+# whenever Resolve().Mode == RegistryModeLocal — which is the permanent
+# mode post-2026-05-06 (Molecule-AI GitHub org suspended → GHCR
+# unreachable → MOLECULE_IMAGE_REGISTRY unset → registry_mode.go falls
+# through to RegistryModeLocal). Without docker-cli here the platform
+# fails every workspace re-provision with `local-build: image inspect
+# for molecule-local/workspace-template-<runtime>:<sha> failed
+# (exec: "docker": executable file not found in $PATH)` and the
+# workspace stays status=failed. The Docker SOCKET is already mounted
+# (entrypoint.sh adds the platform user to the docker group) — only
+# the CLI binary was missing. Caught after sdk-lead + CP-QA went down
+# this way during the MiniMax-switch attempt + after-Class-A audit.
+# Related: Task #194 / Issue #63 (local-build path added);
+# `feedback_workspace_image_ghcr_dead`.
-RUN apk add --no-cache ca-certificates git tzdata wget
+RUN apk add --no-cache ca-certificates docker-cli git tzdata wget

docker-cli is a real Alpine community/ package; on Alpine 3.20 it provides just the docker client binary (no daemon). Will be confirmed by CI's actual docker build of this Dockerfile.


SOP Checklist (RFC#351)

Comprehensive testing performed:
Static reasoning + codebase audit:

  • (a) grep'd all in-platform docker CLI consumers: exactly 3, all in internal/provisioner/localbuild.go (dockerHasTagProd, dockerBuildProd, dockerTagProd); no consumers in any other production file (test files only). So no second class of CLI exec is at risk from a wrong package name.
  • (b) The Alpine docker-cli package name is the canonical one (also used by widely-deployed images like docker/build-action); it's in the community/ repo enabled by default on Alpine 3.20.
  • (c) CI's actual docker build of this Dockerfile will fail at the apk add docker-cli step if the name is wrong: full vendor-truth test, no fixture-mirroring-bug risk per feedback_smoke_test_vendor_truth_not_shape_match.

Edge cases reasoned about:

  • (i) image-size impact (+~30MB for docker-cli; negligible relative to the existing ~1.5GB workspace-template images the platform pulls);
  • (ii) no permission change (docker-cli has no setuid bit; the docker socket access is already gated by the entrypoint.sh addgroup platform docker);
  • (iii) no breaking-change to the RegistryModeSaaS path (which doesn't call any CLI; it uses the Go SDK via p.cli.ImageInspect/ImagePull).

Local-postgres E2E run:
N/A — workspace-server Dockerfile change only; no app code, no migration, no DB schema or query change. The colocated Go code (localbuild.go) is unchanged. This PR fixes the runtime environment that existing code requires — it doesn't add new code paths.

Staging-smoke verified or pending:
Scheduled post-merge. Canonical verification, to run once the new platform image is rebuilt and molecule-core-platform-1 is recreated:

  • (i) docker exec molecule-core-platform-1 sh -c 'command -v docker && docker --version' → expect /usr/bin/docker + a version string;
  • (ii) POST /workspaces/360d42e4-8356-441c-80cf-16fcd5d5ce03/restart (sdk-lead, currently status: failed) → expect re-provision to succeed, ws-360d42e4-… container Up;
  • (iii) tail platform logs for local-build: clone start and local-build: docker build start instead of exec: "docker": executable file not found.

No staging-canary needed: the failure mode is binary (CLI present or not) and verifiable on the platform container itself.
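Step (iii) is a manual log tail, but the pass/fail rule it describes is mechanical. A minimal sketch of that rule, assuming the log is available as plain text; the type smokeResult and the function scanPlatformLog are hypothetical names, and only the three marker strings come from this PR:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// smokeResult records which of the markers named in step (iii)
// were seen. Hypothetical type, for illustration only.
type smokeResult struct {
	CloneStart bool
	BuildStart bool
	CLIMissing bool
}

// Passed applies the rule from step (iii): both local-build markers
// present and no missing-CLI error anywhere in the log.
func (r smokeResult) Passed() bool {
	return r.CloneStart && r.BuildStart && !r.CLIMissing
}

// scanPlatformLog walks the log line by line and flags each marker.
func scanPlatformLog(log string) smokeResult {
	var r smokeResult
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.Contains(line, `exec: "docker": executable file not found`):
			r.CLIMissing = true
		case strings.Contains(line, "local-build: clone start"):
			r.CloneStart = true
		case strings.Contains(line, "local-build: docker build start"):
			r.BuildStart = true
		}
	}
	return r
}

func main() {
	sample := "local-build: clone start\nlocal-build: docker build start\n"
	fmt.Println("smoke passed:", scanPlatformLog(sample).Passed())
}
```

The sample input in main contains only the two success markers; real platform log lines will carry more context around them, which the substring match tolerates.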

Root-cause not symptom:
workspace-server/Dockerfile doesn't install the docker-cli package that the colocated internal/provisioner/localbuild.go (Task #194) shells out to via exec.Command("docker", ...). The implementation's runtime environment doesn't match what the implementation requires; this fix makes them match. The deeper "registry is unreachable" root (GHCR org-suspension) is tracked separately in internal#231 (GHCR→ECR migration); this PR makes the local-build fallback work correctly while that's pending.

Five-Axis review walked:

  • Correctness ✓ — single well-known Alpine community package; apk add docker-cli is the canonical install on Alpine; the package name is unchanged across Alpine 3.18/3.19/3.20.
  • Readability ✓ — diff is +16 (15-line comment + 1 word in the apk-add line) / -1 (replacing the old apk-add line). Comment block explains why the change is needed + cites the failure-log line + Task #194 / Issue #63 + the relevant memory file.
  • Architecture ✓ — no API/contract change. The colocated code path is unchanged. This is a build-artifact composition fix.
  • Security ✓ — docker-cli (Alpine package) is just the client binary; no daemon, no setuid. Docker socket access is already gated by the entrypoint group setup. No new permissions, no new secrets, no widened scope.
  • Performance ✓ — image-size delta ~30MB (docker-cli is a small Go binary + a few support files); negligible relative to existing image sizes. No runtime performance impact (CLI invoked only on RegistryModeLocal cold-path).

No backwards-compat shim / dead code added:
No. This PR adds zero compatibility shims and zero dead code. The single substantive line change is adding docker-cli to the apk add argument list. The +15-line comment is documentation (explains why the package is required + cites the relevant memory + Issue/Task numbers); not code. There is no fallback layer, no version pin, no legacy path retained — Alpine's package manager will install the current docker-cli version. Old behavior (CLI absent → CLI exec fails → workspace re-provision fails) is broken; new behavior (CLI present → CLI exec succeeds → re-provision proceeds) is correct. The "deprecated" path is the entire RegistryModeSaaS branch (because GHCR is dead), but that's tracked in internal#231 and not this PR's scope.

Memory/saved-feedback consulted:

  • feedback_workspace_image_ghcr_dead — the root context: GHCR org-suspension made MOLECULE_IMAGE_REGISTRY=ghcr.io/... non-viable, forcing RegistryModeLocal as the permanent mode.
  • feedback_dev_workspace_restart_is_full_reprovision — explains why POST /workspaces/:id/restart failure leaves the workspace down (stop+rm+recreate path; can't restart in place).
  • feedback_local_must_mimic_production: localbuild.go builds linux/amd64 even on Apple Silicon hosts to keep parity with prod (RegistryModeSaaS pull); this PR doesn't change that, but the platform image (which is also a build artifact) needs the toolchain the code uses.
  • feedback_smoke_test_vendor_truth_not_shape_match — applied via the static-reasoning + CI's actual docker build (the build is the vendor-truth probe: if Alpine doesn't have docker-cli under that name, the build fails immediately at the apk add step).
  • feedback_no_such_thing_as_flakes — sdk-lead + CP-QA repeatedly failing to come back was NOT a flake (consistently broken since 06:08Z this morning); this PR addresses the root.

Verification plan (post-merge)

  • Static reasoning: +16/-1, all comment-and-one-word; previously the apk add line worked fine without docker-cli → adding a single well-known community package can't regress existing build paths.
  • Codebase grep: only internal/provisioner/localbuild.go consumes the in-platform docker CLI (dockerHasTagProd/dockerBuildProd/dockerTagProd); no other production code paths.
  • CI on this PR — the platform-publish workflow does an actual docker build -f workspace-server/Dockerfile; if the Alpine docker-cli package name is wrong or the package fails to install, this PR's CI will catch it.
  • Post-merge deploy verify: see the Staging-smoke section above.

Follow-up (not in this PR)

  • internal#231: GHCR→ECR migration (or any working SaaS registry mirror).
  • Refactor localbuild.go to use the Go docker SDK (p.cli.ImageInspect / ImageBuild) instead of CLI exec — proper but bigger change, removes the CLI dependency entirely. Defer; this Dockerfile fix unblocks the immediate failures.

Peer-ack asks (RFC#351 SOP-checklist gate)

To merge this PR, the gate needs /sop-ack <slug> comments from non-author members of these teams:

  • /sop-ack comprehensive-testing — from qa or engineers
  • /sop-ack local-postgres-e2e — from engineers (N/A justification is in the body)
  • /sop-ack staging-smoke — from engineers (post-merge canonical verification on sdk-lead)
  • /sop-ack root-cause — from managers or ceo
  • /sop-ack five-axis-review — from engineers
  • /sop-ack no-backwards-compat — from managers or ceo
  • /sop-ack memory-consulted — from engineers

Suggested ack-paths: core-be / core-devops / core-qa / infra-sre (engineers); claude-ceo-assistant (managers); hongming (ceo) — pick any one per item.

Cross-links

  • Task #194 / Issue #63 (local-build path added)
  • feedback_workspace_image_ghcr_dead (the GHCR-deadness root)
  • Task #43 (CP-QA recovery), mc#576 (publish-image runner — adjacent docker-CLI issue, different container)
  • RFC#351 (SOP-Checklist peer-ack gate)

Tier: tier:high — fleet-wide re-provision is broken; this is the unblocker.

hongming-pc2 added 1 commit 2026-05-12 21:15:20 +00:00
fix(platform): install docker-cli in workspace-server image — unblocks RegistryModeLocal
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 18s
CI / Detect changes (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
Harness Replays / detect-changes (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 22s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s
qa-review / approved (pull_request) Failing after 13s
security-review / approved (pull_request) Failing after 14s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 25s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m24s
CI / Platform (Go) (pull_request) Has been skipped
CI / Canvas (Next.js) (pull_request) Has been skipped
CI / Shellcheck (E2E scripts) (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Has been skipped
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 1s
sop-checklist-gate / gate (pull_request) Successful in 37s
gate-check-v3 / gate-check (pull_request) Successful in 38s
sop-tier-check / tier-check (pull_request) Successful in 37s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
audit-force-merge / audit (pull_request) Successful in 8s
b8ccd21c8c
The platform server's internal/provisioner/localbuild.go (Task #194 /
Issue #63 — the post-2026-05-06 GHCR-suspension fallback) shells out
via exec.Command("docker", "image", "inspect"/"build"/"tag", ...) in
the production dockerHasTagProd / dockerBuildProd / dockerTagProd
functions. The colocated workspace-server/Dockerfile installed
`ca-certificates git tzdata wget` in the alpine runtime layer but NOT
`docker-cli`, so every workspace re-provision in the now-permanent
RegistryModeLocal path fails at step 2 (cache check):

  local-build: image inspect for
    molecule-local/workspace-template-claude-code:<sha> failed
    (exec: "docker": executable file not found in $PATH); will rebuild
  Provisioner: workspace start failed for <id>: local-build mode:
    ensure image for runtime "claude-code": local-build:
    docker build molecule-local/workspace-template-claude-code:<sha>:
    exec: "docker": executable file not found in $PATH

Net: ANY ws-* container that dies (auto-restart on container-dead, the
liveness-monitor RestartByID, plugin auto-restart, secrets-set
auto-restart, manual POST /workspaces/:id/restart) cannot come back
up. Already took down CP-QA (ec6cf05b) and sdk-lead (360d42e4); also
blocks the MiniMax LLM-provider switch for the 6 *-lead workspaces
(which requires postgres UPDATE workspace_secrets + POST /restart to
re-bake the env from the updated secrets).

The Docker SOCKET is already mounted into the platform container —
the entrypoint.sh adds the platform user to the docker group derived
from the socket's gid. Only the CLI binary was missing.

Per `registry_mode.go:Resolve()`, MOLECULE_IMAGE_REGISTRY is the
toggle: set ⇒ RegistryModeSaaS pull from a real registry; unset ⇒
RegistryModeLocal clone+build from Gitea. Since 2026-05-06 the env
var has been unset (GHCR was the only SaaS-mode target and it's
unreachable post-suspension), so RegistryModeLocal is the permanent
mode until internal#231 (GHCR→ECR migration) lands. This Dockerfile
needs to support the mode the code is permanently in.

Diff is +16/-1 (mostly comment explaining why). The single
behavioural change: `docker-cli` added to the apk-add line.

Verification: post-deploy, `POST /workspaces/360d42e4-…/restart` (the
known-failed sdk-lead) should succeed and bring the workspace back
up with its current Claude-Opus secrets — that's the first confirmation
the local-build path is unblocked. Then the MiniMax switch can proceed
(postgres UPDATE on each *-lead's workspace_secrets + POST /restart).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Owner

Peer-ack request for the RFC#351 SOP-checklist gate (acked: 0/7). PR body has all 7 sections filled and sop-checklist-gate / gate is green (body-format verified). Need:

  • @core-qa or @core-devops: /sop-ack comprehensive-testing
  • @core-devops: /sop-ack local-postgres-e2e (N/A justification in body: Dockerfile-only diff, no Go/migration changes)
  • @core-devops: /sop-ack staging-smoke (post-merge canonical verification on sdk-lead 360d42e4-… and CP-QA ec6cf05b-…)
  • @core-lead: /sop-ack root-cause (Dockerfile lacks docker-cli that colocated localbuild.go shells out to via exec.Command("docker", ...))
  • @core-devops: /sop-ack five-axis-review (correctness/readability/architecture/security/performance notes in body)
  • @core-lead: /sop-ack no-backwards-compat (no shim, no dead code; one apk-add token + comment block)
  • @core-devops: /sop-ack memory-consulted (5 feedback files cited in body)

After acks land, /qa-recheck + /security-recheck to re-evaluate the stale qa-review/security-review fails (the same path #772 cleared at 00:04:57Z).

Impact: fleet-wide POST /workspaces/:id/restart currently broken; sdk-lead + CP-QA workspaces down since ~06:08Z; blocks Hongming's MiniMax-switch for the 6 *-lead workspaces. Pattern-mirrors #772's path to merge — same SOP-checklist body shape, same peer-ack quorum.

— hongming-pc2

Author
Owner

Second peer-ack request — gate still acked: 0/7 after 15min. Pattern-matching against the two PRs that just merged via this exact path:

  • #772 (merged 00:15:25Z) — 7 /sop-ack from core-qa+core-devops+core-lead at 23:46-23:47Z → /qa-recheck + /security-recheck → QA+Security APPROVE → merge.
  • #680 (merged 00:28:21Z by @hongming) — 3 review APPROVEs (core-be, core-qa, core-devops) → manual merge.

#765 has the same shape as #772: body sections fully filled (sop-checklist-gate / gate SUCCESS verifies), single-file Dockerfile diff (+16/-1, less risk than #772's 35-file sweep), single root-cause (localbuild.go shells out to docker CLI not in the runtime image).

Concrete blocking impact RIGHT NOW:

  • sdk-lead workspace (360d42e4-8356-441c-80cf-16fcd5d5ce03) — DOWN since ~06:08Z (~18h)
  • CP-QA workspace (ec6cf05b-2637-4b3c-b561-b33914849aa2) — DOWN since ~06:08Z (~18h)
  • MiniMax LLM-provider switch for 6 *-lead workspaces (app-lead, core-lead, cp-lead, dev-lead, infra-lead, sdk-lead) — blocked. Hongming requested this morning because the Claude subscription is rate-limited. Each lead is eating dead LLM calls until #765 deploys + the platform container is recreated with docker-cli.

Same peer-ack slugs as #772:

  • @core-qa: /sop-ack comprehensive-testing
  • @core-devops: /sop-ack local-postgres-e2e + /sop-ack staging-smoke + /sop-ack five-axis-review + /sop-ack memory-consulted
  • @core-lead: /sop-ack root-cause + /sop-ack no-backwards-compat

After: /qa-recheck + /security-recheck to re-eval the stale FAILURE checks. I'll post those slash-commands myself once any 3 of the 7 sop-acks land.

— hongming-pc2

core-devops reviewed 2026-05-13 04:23:49 +00:00
core-devops left a comment
Member

core-devops review — PR #765

Approve. Installing docker-cli alongside ca-certificates git tzdata wget in the runtime layer unblocks RegistryModeLocal — the exec.Command("docker", "image", ...) calls in internal/provisioner/localbuild.go need the docker binary present on PATH. Without this, workspace provisioning fails with exec: "docker": executable file not found in $PATH on any local-build workspace.

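The change is minimal and correct. No new security surface added: the Docker socket is already mounted in the container, and docker-cli only supplies the client binary that localbuild.go invokes for image inspect and docker build. No secrets introduced.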

Member

[core-security-agent] APPROVED — PR #765: docker-cli in workspace-server image. OWASP X/X clean, no new secrets or exec paths. Security review complete.

hongming added the tier:medium label 2026-05-13 04:38:14 +00:00
core-devops approved these changes 2026-05-13 04:38:41 +00:00
core-devops left a comment
Member

Platform Dockerfile: adds docker-cli to the runtime image so the provisioner (localbuild.go, RegistryModeLocal path) can resolve the docker binary at runtime. The change is minimal (+1 package), correctly placed, and the comment explains the post-2026-05-06 context. Five-axis: Correctness: fixes the exec-not-found failure path; Readability: well-commented; Architecture: fits; Security: apk package, no token exposure; Performance: no impact. APPROVE.

hongming merged commit 738e54593c into main 2026-05-13 04:39:20 +00:00