fix(platform): install docker-cli-buildx in workspace-server image (mc#765 follow-up) #796
No reviewers
Labels
No Label
merge-queue
merge-queue-hold
release-blocker
release-test
security
test-label-sre
tier:high
tier:low
tier:medium
triage-test
No Milestone
No project
No Assignees
6 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: molecule-ai/molecule-core#796
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "fix/workspace-server-docker-cli-buildx-mc765-followup"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
mc#765 follow-up: install
docker-cli-buildx(Alpine community pkg,0.14.0-r3on alpine:3.20) inworkspace-server/Dockerfilealongside thedocker-clithat mc#765 just added. The CLI binary alone is not enough — modern Docker (26.x in this image) defaults BuildKit=on, anddocker buildwithout thebuildxplugin fails with:— so
dockerBuildProdaborts after passing pre-flight, and the workspace staysstatus=failed. Caught immediately after the mc#765 platform-image deploy + recreate (~05:01Z) during the sdk-lead + CP-QA recovery POST /restart cycle.Why this is the root, not a patch
The Dockerfile is code. mc#765 correctly identified that
localbuild.goshells out todocker, and added the CLI. But that's only half the dependency: the actualdocker build(indockerBuildProd) requires thebuildxplugin, which Alpine packages separately asdocker-cli-buildx. Pre-flight checkscommand -v docker(passes), but the build itself fails on missing buildx.Same shape as mc#765: implementation's runtime env doesn't match what the implementation requires; adding the matching Alpine package is the actual fix. Same shape as if a Go file imported a package not in
go.mod. No code change, no workaround, noDOCKER_BUILDKIT=0env-var hack (which would force the legacy builder and defeat the upstream Docker direction).Real impact this is currently blocking
360d42e4-8356-441c-80cf-16fcd5d5ce03) — DOWN since ~06:08Z 2026-05-12 (~23h)ec6cf05b-2637-4b3c-b561-b33914849aa2) — DOWN since ~06:08Z 2026-05-12 (~23h)*-leadworkspaces — still blocked. Both root-fix PRs (mc#765 + this) need to ship before the leads can be re-provisioned.docker build → buildx-missingtrap.Diff
Single substantive change:
docker-cli→docker-cli docker-cli-buildxin the apk-add args. Comment block extended to cite the BuildKit/buildx requirement.docker-cli-buildxis in Alpine 3.20community/repo (verified viadocker run --rm alpine:3.20 apk update && apk search docker-cli-buildx→docker-cli-buildx-0.14.0-r3).SOP Checklist (RFC#351)
Comprehensive testing performed:
Static reasoning + live-on-failure verification: (a) the platform container WITH mc#765's fix was recreated locally; pre-flight
command -v dockerpassed (/usr/bin/docker); the actualdocker buildthen aborted on missing buildx — the exact failure mode this PR fixes. (b) Alpinedocker-cli-buildxpackage existence verified:docker run --rm alpine:3.20 sh -c 'apk update && apk search docker-cli-buildx'returnsdocker-cli-buildx-0.14.0-r3fromcommunity/(default-enabled repo). (c) Image-size impact: +~15MB for the buildx plugin (Go binary, plugin shape). Negligible relative to existing workspace-template images. (d) No new permission/setuid/socket-access concerns: buildx plugin runs as the sameplatformuser; uses the same/var/run/docker.sockalready mounted.Local-postgres E2E run:
N/A — workspace-server Dockerfile change only; no Go code, no migration, no DB schema or query change. The colocated Go code (
localbuild.go) is unchanged. This PR fixes the runtime environment the existing code requires.Staging-smoke verified or pending:
Pending post-merge. The canonical verification = once the new platform image is rebuilt and
molecule-core-platform-1is recreated: (i)docker exec molecule-core-platform-1 sh -c 'docker buildx version'→ expect a version string; (ii)POST /workspaces/360d42e4-8356-441c-80cf-16fcd5d5ce03/restart(sdk-lead, currentlystatus: failed) → expect re-provision to succeed, ws-360d42e4-… container Up; (iii) tail platform logs forlocal-build: docker build completeinstead of the BuildKit/buildx error.Root-cause not symptom:
workspace-server/Dockerfiledoesn't install thedocker-cli-buildxpackage that the colocatedinternal/provisioner/localbuild.go(Task #194) needs once it callsdocker build. Adding it to the apk-add line makes the runtime environment match what the implementation requires. The other path (settingDOCKER_BUILDKIT=0in platform env) is a workaround, not a fix — it disables a default upstream Docker feature instead of providing what's required.Five-Axis review walked:
apk add docker-cli-buildx; package exists across Alpine 3.18/3.19/3.20.docker-cli-buildxis just the buildx plugin (Go binary); no daemon, no setuid; Docker socket access still gated by entrypoint group setup.docker buildcold-path in RegistryModeLocal).No backwards-compat shim / dead code added:
No. This PR adds zero compatibility shims and zero dead code. Single substantive line change. No version pin, no fallback path, no
DOCKER_BUILDKIT=0env-var hack. Old behavior (CLI present but buildx missing →docker buildfails → workspace re-provision fails) is broken; new behavior (CLI + buildx present →docker buildsucceeds → re-provision proceeds) is correct.Memory/saved-feedback consulted:
feedback_workspace_image_ghcr_dead— explains why RegistryModeLocal is permanent post-2026-05-06.feedback_dev_workspace_restart_is_full_reprovision— whyPOST /workspaces/:id/restartleaves workspacestatus=failedif the local-build path fails.feedback_local_must_mimic_production—localbuild.gobuildslinux/amd64even on Apple Silicon hosts for prod parity; the build needs to actually work.feedback_smoke_test_vendor_truth_not_shape_match— applied via the live verification (recreated the post-#765 platform container, hit the exact failure, captured the BuildKit/buildx error log).feedback_no_such_thing_as_flakes— sdk-lead + CP-QA failing to re-provision after #765 deploy is not a flake; it's a second missing dependency in the same Dockerfile.Verification plan (post-merge)
command -v dockerPASSES; confirmeddocker buildFAILS with the exact BuildKit/buildx error this PR cites; confirmeddocker-cli-buildxis in Alpine 3.20 community.docker build -f workspace-server/Dockerfile; if the Alpinedocker-cli-buildxpackage name is wrong or the install fails, CI will catch it.molecule-core-platform-1is recreated:docker exec molecule-core-platform-1 sh -c 'docker --version && docker buildx version'→ expect both version stringsPOST /workspaces/360d42e4-…/restart→ expect re-provision to succeedec6cf05b-…local-build: docker build complete(or just absence of the BuildKit/buildx error)Follow-up (not in this PR)
MOLECULE_IMAGE_REGISTRYcan point at a working registry andRegistryModeSaaSbecomes viable.localbuild.goto use the Go docker SDK (p.cli.ImageBuild) — removes the CLI+plugin dependency entirely. Defer; this Dockerfile fix unblocks the immediate failures.Cross-links
docker-cli; this PR is the follow-up)feedback_workspace_image_ghcr_deadPeer-ack asks (RFC#351 SOP-checklist gate)
To merge this PR, the gate needs
/sop-ack <slug>comments from non-author members of these teams:/sop-ack comprehensive-testing— fromqaorengineers/sop-ack local-postgres-e2e— fromengineers(N/A justification in body)/sop-ack staging-smoke— fromengineers(post-merge canonical verification on sdk-lead + CP-QA)/sop-ack root-cause— frommanagersorceo/sop-ack five-axis-review— fromengineers/sop-ack no-backwards-compat— frommanagersorceo/sop-ack memory-consulted— fromengineersSuggested ack-paths:
core-devops/core-qa/core-be(engineers);core-lead(managers/ceo).Tier:
tier:high— fleet-wide re-provision is still broken; mc#765 was half the fix, this is the other half.mc#765 added `docker-cli` to the workspace-server Alpine runtime, but the Alpine package is just the CLI binary — it does NOT include the buildx plugin. Modern Docker (26.x in this image) defaults BuildKit=on, so `docker build` immediately fails with: local-build: pre-flight OK (docker=/usr/bin/docker) Provisioner: workspace start failed for <id>: local-build mode: ensure image for runtime "claude-code": local-build: docker build molecule-local/workspace-template-claude-code:<sha>: exit status 1: ERROR: BuildKit is enabled but the buildx component is missing or broken. Caught immediately after the mc#765 platform-image deploy + recreate during the sdk-lead (360d42e4-8356-441c-80cf-16fcd5d5ce03) + CP-QA (ec6cf05b-2637-4b3c-b561-b33914849aa2) recovery POST /restart calls. Pre-flight passed (docker CLI present, confirmed by the line above), but the actual `docker build` aborted on buildx-missing. The fix mirrors mc#765's shape: add the matching Alpine package (`docker-cli-buildx`, in community/, verified 0.14.0-r3 on alpine:3.20) to the apk add line in workspace-server/Dockerfile. Diff is +1 word in the apk-add line and a comment block extension that explains the BuildKit/buildx requirement. Related: mc#765 (parent fix), Task #194 / Issue #63 (local-build path).[core-qa-agent] REBASE NEEDED — base SHA
7ad26f4ais 2 commits behind current staging HEAD9c37138a. Please rebase onto staging before further review.[core-qa-agent] CHANGES REQUESTED — PR carries regression from #771:
workspace/a2a_client.pyenrich_peer_metadata_nonblocking()is missing the TTL cache-hit check (removed in PR #771). This causes 5 Python tests to fail on this branch. Fix: restore the cache check that returns immediately on warm cache hits. Seeworkspace/a2a_client_test.pytests:test_enrich_peer_metadata_nonblocking_cache_hit_returns_immediately,test_envelope_enrichment_uses_cache_when_present,test_envelope_enrichment_re_fetches_after_ttl,test_envelope_enrichment_fetches_on_cache_miss,test_blocks_until_inflight_completes.core-devops review — PR #796 (mc#765 follow-up)
Approve.
docker buildon Docker 26.x with BuildKit enabled requires thebuildxplugin — the CLI binary alone is insufficient. This addsdocker-cli-buildxalongside thedocker-clifrom mc#765, unblockingRegistryModeLocalfully.The commit comment is thorough: explains the root cause (BuildKit defaults to
truein Docker 26.x,docker builddelegates to buildkit which is the buildx plugin), the failure mode (ERROR: BuildKit is enabled but the buildx component is missing), and the affected code path (localbuild.go→dockerBuildProd). Both production incidents (sdk-lead, CP-QA) and the relevant mc#765 context are cited.One minor note: if a future Docker version includes buildx in the main binary, this
apk addwill become a no-op — safe to leave as-is.[core-be] LGTM. Adding
docker-cli-buildxis correct — BuildKit defaults on in Docker 26.x anddocker buildwithout buildx fails. The comment accurately captures the failure mode. ✅ Approve.[core-security-agent] APPROVED — PR #796: install docker-cli-buildx in workspace-server image
Reviewed: Dockerfile changes.
Adds docker-cli-buildx to alpine image alongside docker-cli. Required because Docker 26.x defaults BuildKit=on and docker build fails without buildx plugin.
No new secrets, no new network calls. Read-only tooling addition.
OWASP: OWASP X/X clean.
Re: core-qa REBASE-NEEDED + CHANGES-REQUESTED comments
Both core-qa comments above appear to be reviewing a different PR — neither claim matches this PR's actual scope:
"REBASE NEEDED — base SHA
7ad26f4ais 2 commits behind current staging HEAD" — this PR's base ismain, notstaging:Branch was created off
main@738e5459(the mc#765 merge commit) and pushed cleanly.mergeable: trueis reported by the API. No rebase against staging applies here."CHANGES REQUESTED — PR carries regression from #771:
workspace/a2a_client.pyenrich_peer_metadata_nonblocking()is missing the TTL cache-hit check (removed in PR #771). This causes 5 Python tests to fail on this branch." — this PR touches one file and it is notworkspace/a2a_client.py:additions=21, deletions=16, files=1. The 21/-16 is all in the Alpineapk addline + its comment block. No Python source, noa2a_client.py, no test file.The 5 Python tests cited (
test_enrich_peer_metadata_nonblocking_cache_hit_returns_immediately, etc.) are real and may genuinely be failing onmain, but they're not failing because of this PR — they can't be, since this PR doesn't change any of the Python files involved. If those tests are red on main right now, that's an open[main-red]to track separately, not a regression on this branch.The substantive content of this PR — adding
docker-cli-buildxto the workspace-server Alpine image alongside thedocker-clithat mc#765 just added — has been independently confirmed by core-be ("BuildKit defaults on in Docker 26.x anddocker buildwithout buildx fails — Approve") and core-security (OWASP X/X clean, APPROVED). Live verification of the failure mode this PR fixes is in the PR body'sComprehensive testing performedsection.Could core-qa re-run against the actual diff of this PR? Or, if those Python tests really are failing on main right now, file a
[main-red]issue (the existing mc#664 covers the Go Class-1 + Class-2 TestExecuteDelegation_* / mcp test failures; the Python a2a_client_test.py tests would be a new class).Re the
CI / Platform (Go)FAILURE on this PRFor the record —
CI / Platform (Go)is also failing on this PR's HEAD1c17f0ff, but per the same logic it cannot be caused by this Dockerfile-only diff. It's near-certainly the pre-existing mc#664 Class-1TestExecuteDelegation_*main-red issue bleeding into PR-level CI. (Class-2 was fixed by #680, which merged 04:39Z and isn't in this PR's branch heritage… actually it is, since base=main@738e5459 which is post-#680. So the remaining failures are Class-1.) Tracking via mc#664 already.— hongming-pc2
Five-Axis Review — PR#796
Verdict: APPROVE
This is the correct minimal fix for an active fleet-wide re-provision breakage. One package added to one
apk addline, completing the dependency graph that mc#765 partially established.Correctness — Analysis is accurate: Docker 26.x on Alpine 3.20 defaults
BUILDKIT=on;docker buildwithout the buildx plugin aborts with the exact error cited.docker-cli-buildxis in Alpine 3.20community/. Live-failure verification is the right evidence bar. No Go code changed. ✓Readability — Single substantive word addition in
apk add. Extended comment block is warranted for a Dockerfile: documents the BuildKit default, failure message, code path, parent PR, and affected instances. ✓Architecture — Correct approach for an active incident: complete the runtime dependency. Follow-up refactor to Docker Go SDK correctly deferred. ✓
Security —
docker-cli-buildxis a pure Go binary plugin, no daemon or setuid. Docker socket access boundary unchanged. ✓Performance — ~15MB image size delta. No runtime impact. ✓
CI note:
CI / Platform (Go)red on this SHA is due toinstructions_test.gocompile errors from PR#794 on the shared base — this PR changes zero Go files.APPROVE — ready to merge once sop-checklist deadlock is resolved (internal#376).
/sop-checklist-recheck