ci(publish): registry-backed Docker layer cache for build-and-push (slowest CI job class) #2511
What
Registry-backed Docker layer caching for the build-and-push job in publish-workspace-server-image.yml, the slowest job class in our CI (~175 runs/wk).
Adds to both image builds (platform + tenant):
--cache-from type=registry,ref=<repo>:buildcache
--cache-to type=registry,ref=<repo>:buildcache,mode=max,image-manifest=true,oci-mediatypes=true,ignore-error=true
Cache tags: molecule-ai/platform:buildcache and molecule-ai/platform-tenant:buildcache on the primary ECR (153263036946) only. Never a deploy tag: deploys pin staging-<sha> / promote :latest; nothing consumes :buildcache as an image.
Why registry cache (not local builder state)
Every run builds on a fresh ephemeral docker-container builder — setup-buildx-action for the platform image, and the tenant build creates + destroys a builder per retry attempt by design (buildkit-EOF retry loop, internal#2468). So local cache never survives, and the publish lane is host-pinned only as of last night (cp#646) — registry cache works no matter where the job lands and survives runner/host churn.
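The fresh-builder-per-attempt pattern described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: the function name `build_with_retry`, the `DOCKER` override, the builder names, and the default attempt count are all hypothetical. Only the cache flags are taken from this PR.

```shell
# Assumption: DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

# Hypothetical retry loop: every attempt creates and removes its own
# docker-container builder, so no local cache survives between attempts.
# The registry cache (:buildcache) is the only thing carrying layers over.
build_with_retry() {
  repo="$1"            # image repo, e.g. the value of IMAGE_NAME
  tag="$2"             # deploy tag to push, e.g. staging-<sha>
  attempts="${3:-3}"   # assumed default, not taken from the workflow
  i=1
  while [ "$i" -le "$attempts" ]; do
    builder="publish-attempt-${i}"   # hypothetical builder name
    "$DOCKER" buildx create --name "$builder" --driver docker-container >/dev/null || return 1
    rc=0
    "$DOCKER" buildx build --builder "$builder" \
      --cache-from "type=registry,ref=${repo}:buildcache" \
      --cache-to "type=registry,ref=${repo}:buildcache,mode=max,image-manifest=true,oci-mediatypes=true,ignore-error=true" \
      --push -t "${repo}:${tag}" . || rc=$?
    "$DOCKER" buildx rm "$builder" >/dev/null || true
    [ "$rc" -eq 0 ] && return 0
    i=$((i + 1))
  done
  return 1
}
```

Running it with `DOCKER=echo` prints the build command instead of executing it, which is a cheap way to eyeball the flags before touching the registry.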
Baseline (action_run_job, last 7d, successful runs only)
[Baseline table: build-and-push row; values not recovered]
After merge: first main push is the cold run (exports cache), subsequent pushes are warm. Will comment warm-run numbers here.
Safety
Verified a --cache-to export + fresh-builder --cache-from import round-trip against ECR from the publish host (writing cache image manifest ... DONE, warm rebuild showed layers CACHED). image-manifest=true,oci-mediatypes=true is what makes ECR accept it.
ignore-error=true on cache-to: cache export failure can never fail the publish.
--cache-from on a missing :buildcache (first run) is a buildx warning, not an error.
Concurrent publishes are last-writer-wins on :buildcache, same semantics as staging-latest.
Cost / housekeeping
The mode=max cache for the tenant image will be roughly the size of the builder stages (Go module cache + node_modules layers, est. 1–3 GB). The single moving tag self-overwrites; superseded cache manifests become untagged. Follow-up (separate, ops): ECR lifecycle policy to expire untagged manifests.
Security+correctness 5-axis: APPROVE (head d417a7e52d). Registry-backed Docker layer cache (publish-workspace-server-image.yml, +29/-0): adds --cache-from/--cache-to type=registry,ref=${IMAGE_NAME}:buildcache,mode=max,image-manifest=true,oci-mediatypes=true,ignore-error=true to the platform + tenant buildx steps (warms the slowest CI job, p50 228s).
Cache is exported to the :buildcache ECR tag and imported on the next run; mode=max needed (final stage is a tiny copy); image-manifest/oci-mediatypes required for ECR (author verified a real export+import round-trip). Sound.
ignore-error=true on cache-to → an export failure never fails the publish lane (worst case: cold next run); cache-from on a missing tag = warning, not error; concurrent publishes = last-writer-wins (same semantics as :staging-latest); benefits the fresh-builder-per-attempt retry path.
:buildcache is explicitly "never a deploy tag" (cannot be promoted as a deployable image); cache lives on the PRIMARY org ECR (access-controlled same as the image repo, so no NEW exposure surface; staging mirror is push-target, not cache-source). No committed secret values/host coords; ECR ref via env.
⚠️ NON-BLOCKING (pre-existing, not introduced here): mode=max caches INTERMEDIATE layers. IF any build secret is passed via --build-arg (rather than a BuildKit --secret mount), it could persist in the :buildcache layers in ECR. Recommend a follow-up confirming the Dockerfiles use BuildKit secret mounts (not build-args) for any secret. This PR only caches whatever layers already exist; it does not itself add a secret.
Required CI green (all-required/Platform-Go/E2E-API/Handlers-PG/trusted-sop ✓). Author devops-engineer (≠ me). Sound: APPROVE. Needs a 2nd genuine lane → merge.
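One way to start the suggested follow-up is a quick scan of the Dockerfiles for ARG names that look like credentials. The sketch below is only a heuristic: the function name `find_secret_args` and the name pattern are assumptions, and a hit means "inspect by hand", not "confirmed leak".

```shell
# Heuristic scan: list ARG declarations whose names look like credentials.
# Prints "file:line:ARG ..." per hit; exit status is 0 if anything matched,
# 1 if nothing matched (plain grep semantics).
find_secret_args() {
  grep -HnEi '^[[:space:]]*ARG[[:space:]]+[A-Za-z0-9_]*(SECRET|TOKEN|PASSWORD|KEY)' "$@"
}

# Safer alternative for anything flagged: a BuildKit secret mount, which is
# not written into an image or cache layer. "npm_token" is a hypothetical id.
#   Dockerfile:
#     RUN --mount=type=secret,id=npm_token \
#         NPM_TOKEN="$(cat /run/secrets/npm_token)" npm ci
#   Build:
#     docker buildx build --secret id=npm_token,env=NPM_TOKEN ...
```

Usage would be along the lines of `find_secret_args path/to/Dockerfile*`, run from the repository root.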
qa lane (full-SHA d417a7e52d). 5-axis on the registry-backed Docker layer cache for build-and-push (CI perf):
(1) CORRECTNESS: SOUND. Adds --cache-from/--cache-to type=registry,ref=${IMAGE_NAME}:buildcache to both platform + tenant image builds; cache-to carries mode=max,image-manifest=true,oci-mediatypes=true (required for ECR's export/import round-trip) + ignore-error=true (cache EXPORT failure is non-fatal and must never break a publish); cache-from on a missing tag (first run) is a warning, not an error. Dedicated moving :buildcache tag (never a real image tag) so it can't pollute released tags.
(2) ROBUSTNESS: ignore-error=true means a cache miss/export failure degrades to a normal (uncached) build, never a publish failure. Concurrent-publish last-writer-wins on :buildcache is acknowledged + benign (cache, not correctness).
(3) SECURITY / BUILD-SECRET CHECK (explicit): CLEAN. AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY are ENV-passed from secrets (not interpolated into a run: shell line) and consumed via aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin, the secure password-stdin pattern (no secret in argv/process-list/logs). The new cache refs use the ${IMAGE_NAME} ENV var, NOT any secret. NO ${{ secrets.* }}-into-shell injection vector (the cp#532 class) introduced. The ECR account-id/registry coords in context are PRE-EXISTING internal ops-repo CI config with NO credential VALUES (secrets are ${{ }} refs) → soft-class, non-blocking.
(4) PERFORMANCE: this IS the win, a registry layer cache on the slowest CI job class (~175 runs/wk). Net positive.
(5) TEST-COVERAGE: CI-workflow change; self-validating (first-run cache-from-miss = warning, ignore-error on export) and exercised by the publish job itself; no unit test applicable.
Clean CI perf change, build-secret-safe. APPROVED.
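For reference, the password-stdin login pattern checked under axis (3) looks roughly like this. It is a sketch under assumptions: the function names `ecr_registry` and `ecr_login` and the `AWS`/`DOCKER` overrides are hypothetical, and the registry host is built from the standard `<account>.dkr.ecr.<region>.amazonaws.com` form rather than copied from the workflow.

```shell
# Assumption: AWS and DOCKER can be overridden (e.g. with echo) for a dry run.
AWS="${AWS:-aws}"
DOCKER="${DOCKER:-docker}"

# Build the ECR registry hostname from an account id and a region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Log in without ever placing the token in argv: the token travels over the
# pipe into docker login's stdin. AWS credentials are expected in the
# environment (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
ecr_login() {
  registry="$(ecr_registry "$1" "$2")"
  "$AWS" ecr get-login-password --region "$2" \
    | "$DOCKER" login --username AWS --password-stdin "$registry"
}
```

A call such as `ecr_login 153263036946 us-east-2` would target the primary ECR named in the PR description.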
Measured.
[Timing table; surviving row fragment: (251df96, exports cache); values not recovered]
Warm-run log confirms the mechanism:
importing cache manifest from .../molecule-ai/platform:buildcache + .../platform-tenant:buildcache
inferred cache manifest type: application/vnd.oci.image.manifest.v1+json
43 CACHED build steps
platform-tenant:buildcache in ECR is ~933MB.
Caveat: the dispatch re-ran the same SHA (best case: all layers cached). Organic main pushes will land between 84s and baseline depending on which Dockerfile stages a commit invalidates (Go-only changes keep the npm/apk layers warm and vice versa).
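The same warm-run check can be done mechanically on a saved build log. A rough sketch, assuming buildx plain-progress output in which cached steps appear as lines of the form `#N CACHED` with no timestamp prefix; both function names are hypothetical.

```shell
# Count cached build steps in a log read from stdin.
# Always prints a number; prints 0 when nothing was cached.
count_cached_steps() {
  grep -cE '^#[0-9]+ CACHED' || true
}

# Succeeds if the log (stdin) shows the registry cache manifest being imported.
cache_was_imported() {
  grep -q 'importing cache manifest from'
}
```

For example, `count_cached_steps < build.log` on the warm run above would be expected to report the cached steps, and a count that drops sharply on an organic push points at the Dockerfile stage the commit invalidated.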