fix(platform): install docker-cli-buildx in workspace-server image (mc#765 follow-up) #796

Merged
devops-engineer merged 1 commit from fix/workspace-server-docker-cli-buildx-mc765-followup into main 2026-05-13 07:42:13 +00:00
Owner

Summary

mc#765 follow-up: install docker-cli-buildx (Alpine community pkg, 0.14.0-r3 on alpine:3.20) in workspace-server/Dockerfile alongside the docker-cli that mc#765 just added. The CLI binary alone is not enough — modern Docker (26.x in this image) defaults BuildKit=on, and docker build without the buildx plugin fails with:

local-build: pre-flight OK (docker=/usr/bin/docker)
Provisioner: workspace start failed for 360d42e4-8356-441c-80cf-16fcd5d5ce03:
  local-build mode: ensure image for runtime "claude-code":
  local-build: docker build molecule-local/workspace-template-claude-code:43a86d44da06:
  exit status 1: ERROR: BuildKit is enabled but the buildx component is missing or broken.

— so dockerBuildProd aborts after passing pre-flight, and the workspace stays status=failed. Caught immediately after the mc#765 platform-image deploy + recreate (~05:01Z) during the sdk-lead + CP-QA recovery POST /restart cycle.

Why this is the root, not a patch

The Dockerfile is code. mc#765 correctly identified that localbuild.go shells out to docker, and added the CLI. But that's only half the dependency: the actual docker build (in dockerBuildProd) requires the buildx plugin, which Alpine packages separately as docker-cli-buildx. Pre-flight checks command -v docker (passes), but the build itself fails on missing buildx.
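A tightened pre-flight would have surfaced this before the build ran. A minimal sketch, hypothetical (the shipped pre-flight only runs command -v docker, which is exactly why the gap slipped past it):

```shell
#!/bin/sh
# Hypothetical pre-flight covering both halves of the dependency: the CLI
# binary AND the buildx plugin that `docker build` delegates to when
# BuildKit is enabled (the default on Docker 26.x).
preflight() {
  command -v docker >/dev/null 2>&1 \
    || { echo "pre-flight FAIL: docker CLI missing"; return 1; }
  docker buildx version >/dev/null 2>&1 \
    || { echo "pre-flight FAIL: buildx plugin missing"; return 1; }
  echo "pre-flight OK (docker=$(command -v docker), buildx present)"
}
```

With only docker-cli installed, this fails fast on the second check instead of aborting mid-build inside dockerBuildProd.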

Same shape as mc#765: implementation's runtime env doesn't match what the implementation requires; adding the matching Alpine package is the actual fix. Same shape as if a Go file imported a package not in go.mod. No code change, no workaround, no DOCKER_BUILDKIT=0 env-var hack (which would force the legacy builder and defeat the upstream Docker direction).

Real impact this is currently blocking

  • sdk-lead (360d42e4-8356-441c-80cf-16fcd5d5ce03) — DOWN since ~06:08Z 2026-05-12 (~23h)
  • CP-QA (ec6cf05b-2637-4b3c-b561-b33914849aa2) — DOWN since ~06:08Z 2026-05-12 (~23h)
  • MiniMax LLM-provider switch for the 6 *-lead workspaces — still blocked. Both root-fix PRs (mc#765 + this) need to ship before the leads can be re-provisioned.
  • Latent fragility — every auto-restart path (liveness-monitor, plugin install, secrets-set) hits the same docker build → buildx-missing trap.

Diff

-# docker-cli is required by internal/provisioner/localbuild.go which
-# shells out via exec.Command("docker", "image", "inspect"/"build"/"tag", ...)
-# whenever Resolve().Mode == RegistryModeLocal — which is the permanent
-# mode post-2026-05-06 (Molecule-AI GitHub org suspended → GHCR
-# unreachable → MOLECULE_IMAGE_REGISTRY unset → registry_mode.go falls
-# through to RegistryModeLocal). Without docker-cli here the platform
-# fails every workspace re-provision with `local-build: image inspect
-# for molecule-local/workspace-template-<runtime>:<sha> failed
-# (exec: "docker": executable file not found in $PATH)` and the
-# workspace stays status=failed. The Docker SOCKET is already mounted
-# (entrypoint.sh adds the platform user to the docker group) — only
-# the CLI binary was missing. Caught after sdk-lead + CP-QA went down
-# this way during the MiniMax-switch attempt + after-Class-A audit.
-# Related: Task #194 / Issue #63 (local-build path added);
-# `feedback_workspace_image_ghcr_dead`.
-RUN apk add --no-cache ca-certificates docker-cli git tzdata wget
+# docker-cli + docker-cli-buildx are required by internal/provisioner/
+# localbuild.go which shells out via exec.Command("docker", "image",
+# "inspect"/"build"/"tag", ...) whenever Resolve().Mode ==
+# RegistryModeLocal — which is the permanent mode post-2026-05-06
+# (Molecule-AI GitHub org suspended → GHCR unreachable →
+# MOLECULE_IMAGE_REGISTRY unset → registry_mode.go falls through to
+# RegistryModeLocal). The CLI binary alone is not enough: modern
+# Docker (26.x in this image) defaults BuildKit=on, and `docker build`
+# without the buildx plugin fails with `ERROR: BuildKit is enabled but
+# the buildx component is missing or broken`, leaving the workspace at
+# status=failed. mc#765 added docker-cli; this follow-up adds
+# docker-cli-buildx to satisfy the buildx requirement so dockerBuildProd
+# actually completes. The Docker SOCKET is already mounted (entrypoint.sh
+# adds the platform user to the docker group). Caught immediately
+# post-#765-deploy on the sdk-lead (360d42e4-…) + CP-QA (ec6cf05b-…)
+# recovery POST /restart calls (logs: `local-build: pre-flight OK
+# (docker=/usr/bin/docker)` followed by the BuildKit/buildx error from
+# the same dockerBuildProd path).
+# Related: mc#765 (parent fix), Task #194 / Issue #63 (local-build path
+# added); `feedback_workspace_image_ghcr_dead`.
+RUN apk add --no-cache ca-certificates docker-cli docker-cli-buildx git tzdata wget

Single substantive change: docker-cli → docker-cli docker-cli-buildx in the apk-add args. Comment block extended to cite the BuildKit/buildx requirement. docker-cli-buildx is in Alpine 3.20 community/ repo (verified via docker run --rm alpine:3.20 apk update && apk search docker-cli-buildx → docker-cli-buildx-0.14.0-r3).


SOP Checklist (RFC#351)

Comprehensive testing performed:
Static reasoning + live-on-failure verification: (a) the platform container WITH mc#765's fix was recreated locally; pre-flight command -v docker passed (/usr/bin/docker); the actual docker build then aborted on missing buildx — the exact failure mode this PR fixes. (b) Alpine docker-cli-buildx package existence verified: docker run --rm alpine:3.20 sh -c 'apk update && apk search docker-cli-buildx' returns docker-cli-buildx-0.14.0-r3 from community/ (default-enabled repo). (c) Image-size impact: +~15MB for the buildx plugin (Go binary, plugin shape). Negligible relative to existing workspace-template images. (d) No new permission/setuid/socket-access concerns: buildx plugin runs as the same platform user; uses the same /var/run/docker.sock already mounted.

Local-postgres E2E run:
N/A — workspace-server Dockerfile change only; no Go code, no migration, no DB schema or query change. The colocated Go code (localbuild.go) is unchanged. This PR fixes the runtime environment the existing code requires.

Staging-smoke verified or pending:
Pending post-merge. The canonical verification = once the new platform image is rebuilt and molecule-core-platform-1 is recreated: (i) docker exec molecule-core-platform-1 sh -c 'docker buildx version' → expect a version string; (ii) POST /workspaces/360d42e4-8356-441c-80cf-16fcd5d5ce03/restart (sdk-lead, currently status: failed) → expect re-provision to succeed, ws-360d42e4-… container Up; (iii) tail platform logs for local-build: docker build complete instead of the BuildKit/buildx error.

Root-cause not symptom:
workspace-server/Dockerfile doesn't install the docker-cli-buildx package that the colocated internal/provisioner/localbuild.go (Task #194) needs once it calls docker build. Adding it to the apk-add line makes the runtime environment match what the implementation requires. The other path (setting DOCKER_BUILDKIT=0 in platform env) is a workaround, not a fix — it disables a default upstream Docker feature instead of providing what's required.
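One way to confirm the gap inside a built image: buildx is a separate binary the docker CLI discovers in its plugin directories (on Alpine, the package installs it under /usr/libexec/docker/cli-plugins). A hedged sketch of that check, with the directories passed in explicitly so nothing is hard-coded:

```shell
#!/bin/sh
# check_buildx DIR...: report whether the docker-buildx plugin binary is
# present and executable in any of the given CLI plugin directories.
check_buildx() {
  for d in "$@"; do
    if [ -x "$d/docker-buildx" ]; then
      echo "buildx plugin found: $d/docker-buildx"
      return 0
    fi
  done
  echo "buildx plugin NOT found (docker build under BuildKit will fail)"
  return 1
}
# Typical call inside the image:
#   check_buildx /usr/libexec/docker/cli-plugins "$HOME/.docker/cli-plugins"
```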

Five-Axis review walked:

  • Correctness ✓ — single Alpine community package; canonical install via apk add docker-cli-buildx; package exists across Alpine 3.18/3.19/3.20.
  • Readability ✓ — diff is +1 word in the apk-add line + comment block extension that explains the BuildKit/buildx requirement.
  • Architecture ✓ — no API/contract change; the colocated code path is unchanged; this is a build-artifact composition fix (mirroring mc#765's shape).
  • Security ✓ — docker-cli-buildx is just the buildx plugin (Go binary); no daemon, no setuid; Docker socket access still gated by entrypoint group setup.
  • Performance ✓ — image-size delta ~15MB; no runtime perf impact (plugin invoked only on docker build cold-path in RegistryModeLocal).

No backwards-compat shim / dead code added:
No. This PR adds zero compatibility shims and zero dead code. Single substantive line change. No version pin, no fallback path, no DOCKER_BUILDKIT=0 env-var hack. Old behavior (CLI present but buildx missing → docker build fails → workspace re-provision fails) is broken; new behavior (CLI + buildx present → docker build succeeds → re-provision proceeds) is correct.

Memory/saved-feedback consulted:

  • feedback_workspace_image_ghcr_dead — explains why RegistryModeLocal is permanent post-2026-05-06.
  • feedback_dev_workspace_restart_is_full_reprovision — why POST /workspaces/:id/restart leaves workspace status=failed if the local-build path fails.
  • feedback_local_must_mimic_production — localbuild.go builds linux/amd64 even on Apple Silicon hosts for prod parity; the build needs to actually work.
  • feedback_smoke_test_vendor_truth_not_shape_match — applied via the live verification (recreated the post-#765 platform container, hit the exact failure, captured the BuildKit/buildx error log).
  • feedback_no_such_thing_as_flakes — sdk-lead + CP-QA failing to re-provision after #765 deploy is not a flake; it's a second missing dependency in the same Dockerfile.

Verification plan (post-merge)

  • Static reasoning: +21/-16 (mostly comment-and-one-package); previously the apk-add line worked fine (mc#765 verified that); adding one well-known Alpine community package can't regress existing build paths.
  • Live failure verification: recreated the post-#765 platform container locally; confirmed pre-flight command -v docker PASSES; confirmed docker build FAILS with the exact BuildKit/buildx error this PR cites; confirmed docker-cli-buildx is in Alpine 3.20 community.
  • CI on this PR — the platform-publish workflow does an actual docker build -f workspace-server/Dockerfile; if the Alpine docker-cli-buildx package name is wrong or the install fails, CI will catch it.
  • Post-merge deploy verify — once the new platform image is built and molecule-core-platform-1 is recreated:
    1. docker exec molecule-core-platform-1 sh -c 'docker --version && docker buildx version' → expect both version strings
    2. POST /workspaces/360d42e4-…/restart → expect re-provision to succeed
    3. Same on CP-QA ec6cf05b-…
    4. Tail platform logs for local-build: docker build complete (or just absence of the BuildKit/buildx error)
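Step 1 above can be scripted; a sketch with the container name parameterized (assumes docker on the deploy host; the container name is the one cited in this plan):

```shell
#!/bin/sh
# verify_buildx_in CONTAINER: post-deploy step (1) -- confirm both the CLI
# and the buildx plugin respond inside the recreated platform container.
verify_buildx_in() {
  docker exec "$1" sh -c 'docker --version && docker buildx version' \
    || { echo "verify FAIL: buildx still missing in $1"; return 1; }
  echo "verify OK: CLI + buildx present in $1"
}
# verify_buildx_in molecule-core-platform-1
```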

Follow-up (not in this PR)

  • internal#231: GHCR→ECR migration (or any working SaaS registry mirror) so MOLECULE_IMAGE_REGISTRY can point at a working registry and RegistryModeSaaS becomes viable.
  • Refactor localbuild.go to use the Go docker SDK (p.cli.ImageBuild) — removes the CLI+plugin dependency entirely. Defer; this Dockerfile fix unblocks the immediate failures.

Cross-links

  • mc#765 (parent fix that added docker-cli; this PR is the follow-up)
  • Task #194 / Issue #63 (local-build path added)
  • feedback_workspace_image_ghcr_dead

Peer-ack asks (RFC#351 SOP-checklist gate)

To merge this PR, the gate needs /sop-ack <slug> comments from non-author members of these teams:

  • /sop-ack comprehensive-testing — from qa or engineers
  • /sop-ack local-postgres-e2e — from engineers (N/A justification in body)
  • /sop-ack staging-smoke — from engineers (post-merge canonical verification on sdk-lead + CP-QA)
  • /sop-ack root-cause — from managers or ceo
  • /sop-ack five-axis-review — from engineers
  • /sop-ack no-backwards-compat — from managers or ceo
  • /sop-ack memory-consulted — from engineers

Suggested ack-paths: core-devops / core-qa / core-be (engineers); core-lead (managers/ceo).

Tier: tier:high — fleet-wide re-provision is still broken; mc#765 was half the fix, this is the other half.

hongming-pc2 added 1 commit 2026-05-13 05:15:59 +00:00
fix(platform): install docker-cli-buildx in workspace-server image (mc#765 follow-up)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 18s
CI / Detect changes (pull_request) Successful in 37s
Harness Replays / detect-changes (pull_request) Successful in 12s
E2E API Smoke Test / detect-changes (pull_request) Successful in 41s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 38s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 30s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s
qa-review / approved (pull_request) Failing after 12s
security-review / approved (pull_request) Failing after 11s
gate-check-v3 / gate-check (pull_request) Successful in 17s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 28s
sop-checklist-gate / gate (pull_request) Successful in 9s
sop-tier-check / tier-check (pull_request) Successful in 11s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m10s
CI / Canvas (Next.js) (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 5s
Harness Replays / Harness Replays (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m47s
CI / Platform (Go) (pull_request) Failing after 3m47s
CI / all-required (pull_request) Successful in 2s
sop-checklist / all-items-acked (pull_request) tier:low bootstrap-exception: PR#797 fixed main workflow; post-recheck run did not post new status
audit-force-merge / audit (pull_request) Successful in 18s
1c17f0ff73
mc#765 added `docker-cli` to the workspace-server Alpine runtime, but
the Alpine package is just the CLI binary — it does NOT include the
buildx plugin. Modern Docker (26.x in this image) defaults BuildKit=on,
so `docker build` immediately fails with:

  local-build: pre-flight OK (docker=/usr/bin/docker)
  Provisioner: workspace start failed for <id>: local-build mode:
    ensure image for runtime "claude-code": local-build: docker build
    molecule-local/workspace-template-claude-code:<sha>:
    exit status 1: ERROR: BuildKit is enabled but the buildx component
    is missing or broken.

Caught immediately after the mc#765 platform-image deploy + recreate
during the sdk-lead (360d42e4-8356-441c-80cf-16fcd5d5ce03) + CP-QA
(ec6cf05b-2637-4b3c-b561-b33914849aa2) recovery POST /restart calls.
Pre-flight passed (docker CLI present, confirmed by the line above),
but the actual `docker build` aborted on buildx-missing.

The fix mirrors mc#765's shape: add the matching Alpine package
(`docker-cli-buildx`, in community/, verified 0.14.0-r3 on alpine:3.20)
to the apk add line in workspace-server/Dockerfile. Diff is +1 word
in the apk-add line and a comment block extension that explains the
BuildKit/buildx requirement.

Related: mc#765 (parent fix), Task #194 / Issue #63 (local-build path).
Member

[core-qa-agent] REBASE NEEDED — base SHA 7ad26f4a is 2 commits behind current staging HEAD 9c37138a. Please rebase onto staging before further review.

Member

[core-qa-agent] CHANGES REQUESTED — PR carries regression from #771: workspace/a2a_client.py enrich_peer_metadata_nonblocking() is missing the TTL cache-hit check (removed in PR #771). This causes 5 Python tests to fail on this branch. Fix: restore the cache check that returns immediately on warm cache hits. See workspace/a2a_client_test.py tests: test_enrich_peer_metadata_nonblocking_cache_hit_returns_immediately, test_envelope_enrichment_uses_cache_when_present, test_envelope_enrichment_re_fetches_after_ttl, test_envelope_enrichment_fetches_on_cache_miss, test_blocks_until_inflight_completes.

core-devops reviewed 2026-05-13 05:33:48 +00:00
core-devops left a comment
Member

core-devops review — PR #796 (mc#765 follow-up)

Approve. docker build on Docker 26.x with BuildKit enabled requires the buildx plugin — the CLI binary alone is insufficient. This adds docker-cli-buildx alongside the docker-cli from mc#765, unblocking RegistryModeLocal fully.

The commit comment is thorough: explains the root cause (BuildKit defaults to true in Docker 26.x, docker build delegates to buildkit which is the buildx plugin), the failure mode (ERROR: BuildKit is enabled but the buildx component is missing), and the affected code path (localbuild.go → dockerBuildProd). Both production incidents (sdk-lead, CP-QA) and the relevant mc#765 context are cited.

One minor note: if a future Docker version includes buildx in the main binary, this apk add will become a no-op — safe to leave as-is.

Member

[core-be] LGTM. Adding docker-cli-buildx is correct — BuildKit defaults on in Docker 26.x and docker build without buildx fails. The comment accurately captures the failure mode. Approve.

Member

[core-security-agent] APPROVED — PR #796: install docker-cli-buildx in workspace-server image

Reviewed: Dockerfile changes.

Adds docker-cli-buildx to alpine image alongside docker-cli. Required because Docker 26.x defaults BuildKit=on and docker build fails without buildx plugin.

No new secrets, no new network calls. Read-only tooling addition.

OWASP: OWASP X/X clean.

Author
Owner

Re: core-qa REBASE-NEEDED + CHANGES-REQUESTED comments

Both core-qa comments above appear to be reviewing a different PR — neither claim matches this PR's actual scope:

  1. "REBASE NEEDED — base SHA 7ad26f4a is 2 commits behind current staging HEAD" — this PR's base is main, not staging:

    $ curl -s -H "Authorization: token …" /api/v1/repos/molecule-ai/molecule-core/pulls/796 | jq '.base.ref, .head.ref'
    "main"
    "fix/workspace-server-docker-cli-buildx-mc765-followup"
    

    Branch was created off main@738e5459 (the mc#765 merge commit) and pushed cleanly. mergeable: true is reported by the API. No rebase against staging applies here.

  2. "CHANGES REQUESTED — PR carries regression from #771: workspace/a2a_client.py enrich_peer_metadata_nonblocking() is missing the TTL cache-hit check (removed in PR #771). This causes 5 Python tests to fail on this branch." — this PR touches one file and it is not workspace/a2a_client.py:

    $ curl -s -H "Authorization: token …" /api/v1/repos/molecule-ai/molecule-core/pulls/796/files | jq '.[].filename'
    "workspace-server/Dockerfile"
    

    additions=21, deletions=16, files=1. The +21/−16 is entirely in the Alpine `apk add` line plus its comment block. No Python source, no `a2a_client.py`, no test files.

The 5 Python tests cited (test_enrich_peer_metadata_nonblocking_cache_hit_returns_immediately, etc.) are real and may genuinely be failing on main, but they're not failing because of this PR — they can't be, since this PR doesn't change any of the Python files involved. If those tests are red on main right now, that's an open [main-red] to track separately, not a regression on this branch.

The substantive content of this PR — adding docker-cli-buildx to the workspace-server Alpine image alongside the docker-cli that mc#765 just added — has been independently confirmed by core-be ("BuildKit defaults on in Docker 26.x and docker build without buildx fails — Approve") and core-security (OWASP X/X clean, APPROVED). Live verification of the failure mode this PR fixes is in the PR body's Comprehensive testing performed section.

Could core-qa re-run against the actual diff of this PR? Or, if those Python tests really are failing on main right now, file a [main-red] issue (the existing mc#664 covers the Go Class-1 + Class-2 TestExecuteDelegation_* / mcp test failures; the Python a2a_client_test.py tests would be a new class).

Re the CI / Platform (Go) FAILURE on this PR

For the record — CI / Platform (Go) is also failing on this PR's HEAD 1c17f0ff, but by the same logic it cannot be caused by this Dockerfile-only diff. It's near-certainly the pre-existing mc#664 Class-1 TestExecuteDelegation_* main-red issue bleeding into PR-level CI. (Class-2 was fixed by #680, which merged 04:39Z and is in this PR's branch heritage, since base=main@738e5459 is post-#680 — so the remaining failures are Class-1.) Already tracked via mc#664.

— hongming-pc2

core-qa approved these changes 2026-05-13 05:47:35 +00:00
core-qa left a comment
Member

Five-Axis Review — PR#796

Verdict: APPROVE

This is the correct minimal fix for an active fleet-wide re-provision breakage. One package added to one apk add line, completing the dependency graph that mc#765 partially established.

Correctness — Analysis is accurate: Docker 26.x on Alpine 3.20 defaults BUILDKIT=on; docker build without the buildx plugin aborts with the exact error cited. docker-cli-buildx is in Alpine 3.20 community/. Live-failure verification is the right evidence bar. No Go code changed. ✓

Readability — Single substantive word addition in apk add. Extended comment block is warranted for a Dockerfile: documents the BuildKit default, failure message, code path, parent PR, and affected instances. ✓

Architecture — Correct approach for an active incident: complete the runtime dependency. Follow-up refactor to Docker Go SDK correctly deferred. ✓

Securitydocker-cli-buildx is a pure Go binary plugin, no daemon or setuid. Docker socket access boundary unchanged. ✓

Performance — ~15MB image size delta. No runtime impact. ✓

CI note: CI / Platform (Go) red on this SHA is due to instructions_test.go compile errors from PR#794 on the shared base — this PR changes zero Go files.

APPROVE — ready to merge once sop-checklist deadlock is resolved (internal#376).
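The review above describes the change as one package added to one `apk add` line plus a documenting comment block. A sketch of that shape — the exact comment wording, base-image line, and surrounding Dockerfile context are assumptions, not the actual diff:

```dockerfile
# Sketch only — the real workspace-server/Dockerfile comment block differs.
FROM alpine:3.20

# docker-cli: lets localbuild.go shell out to `docker` (added in mc#765).
# docker-cli-buildx: the buildx CLI plugin. Docker 26.x defaults BuildKit=on,
# and `docker build` without the plugin aborts with:
#   ERROR: BuildKit is enabled but the buildx component is missing or broken.
RUN apk add --no-cache docker-cli docker-cli-buildx
```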

hongming added the
tier:low
label 2026-05-13 06:17:55 +00:00
Owner

/sop-checklist-recheck

devops-engineer merged commit f06a8e76fc into main 2026-05-13 07:42:13 +00:00
devops-engineer deleted branch fix/workspace-server-docker-cli-buildx-mc765-followup 2026-05-13 07:42:29 +00:00