ci(publish): registry-backed Docker layer cache for build-and-push (slowest CI job class) #2511

Merged
devops-engineer merged 1 commits from ci/publish-image-registry-layer-cache into main 2026-06-10 06:15:32 +00:00
Member

What

Registry-backed Docker layer caching for the build-and-push job in publish-workspace-server-image.yml — the slowest job class in our CI (~175 runs/wk).

Adds to both image builds (platform + tenant):

  • --cache-from type=registry,ref=<repo>:buildcache
  • --cache-to type=registry,ref=<repo>:buildcache,mode=max,image-manifest=true,oci-mediatypes=true,ignore-error=true

Cache tags: molecule-ai/platform:buildcache and molecule-ai/platform-tenant:buildcache on the primary ECR (153263036946) only. Never a deploy tag — deploys pin staging-<sha> / promote :latest; nothing consumes :buildcache as an image.

Why registry cache (not local builder state)

Every run builds on a fresh ephemeral docker-container builder — setup-buildx-action for the platform image, and the tenant build creates + destroys a builder per retry attempt by design (buildkit-EOF retry loop, internal#2468). So local cache never survives, and the publish lane is host-pinned only as of last night (cp#646) — registry cache works no matter where the job lands and survives runner/host churn.

Baseline (action_run_job, last 7d, successful runs only)

Job Runs p50 p90
build-and-push 176 228.5s 498.0s

After merge: first main push is the cold run (exports cache), subsequent pushes are warm. Will comment warm-run numbers here.

Safety

  • ECR accepts the cache manifest — verified 2026-06-09 by a real --cache-to export + fresh-builder --cache-from import round-trip against ECR from the publish host (writing cache image manifest ... DONE, warm rebuild showed layers CACHED). image-manifest=true,oci-mediatypes=true is what makes ECR accept it.
  • ignore-error=true on cache-to: cache export failure can never fail the publish.
  • Missing :buildcache (first run) is a buildx warning, not an error.
  • Concurrent publishes: last-writer-wins on :buildcache, same semantics as staging-latest.
  • Image content is unaffected — cache only changes how fast layers are produced; correctness of layer reuse is keyed on Dockerfile instruction + context checksums as in any local docker build.

Cost / housekeeping

mode=max cache for the tenant image will be roughly the size of the builder stages (Go module cache + node_modules layers, est. 1–3 GB). Single moving tag self-overwrites; superseded cache manifests become untagged. Follow-up (separate, ops): ECR lifecycle policy to expire untagged manifests.

## What Registry-backed Docker layer caching for the `build-and-push` job in `publish-workspace-server-image.yml` — the slowest job class in our CI (~175 runs/wk). Adds to both image builds (platform + tenant): - `--cache-from type=registry,ref=<repo>:buildcache` - `--cache-to type=registry,ref=<repo>:buildcache,mode=max,image-manifest=true,oci-mediatypes=true,ignore-error=true` Cache tags: `molecule-ai/platform:buildcache` and `molecule-ai/platform-tenant:buildcache` on the primary ECR (153263036946) only. Never a deploy tag — deploys pin `staging-<sha>` / promote `:latest`; nothing consumes `:buildcache` as an image. ## Why registry cache (not local builder state) Every run builds on a **fresh ephemeral docker-container builder** — setup-buildx-action for the platform image, and the tenant build creates + destroys a builder per retry attempt by design (buildkit-EOF retry loop, internal#2468). So local cache never survives, and the publish lane is host-pinned only as of last night (cp#646) — registry cache works no matter where the job lands and survives runner/host churn. ## Baseline (action_run_job, last 7d, successful runs only) | Job | Runs | p50 | p90 | |-----|------|-----|-----| | `build-and-push` | 176 | **228.5s** | 498.0s | After merge: first main push is the cold run (exports cache), subsequent pushes are warm. Will comment warm-run numbers here. ## Safety - **ECR accepts the cache manifest** — verified 2026-06-09 by a real `--cache-to` export + fresh-builder `--cache-from` import round-trip against ECR from the publish host (`writing cache image manifest ... DONE`, warm rebuild showed layers CACHED). `image-manifest=true,oci-mediatypes=true` is what makes ECR accept it. - `ignore-error=true` on cache-to: cache export failure can never fail the publish. - Missing `:buildcache` (first run) is a buildx warning, not an error. - Concurrent publishes: last-writer-wins on `:buildcache`, same semantics as `staging-latest`. - Image content is unaffected — cache only changes *how fast* layers are produced; correctness of layer reuse is keyed on Dockerfile instruction + context checksums as in any local docker build. ## Cost / housekeeping `mode=max` cache for the tenant image will be roughly the size of the builder stages (Go module cache + node_modules layers, est. 1–3 GB). Single moving tag self-overwrites; superseded cache manifests become untagged. Follow-up (separate, ops): ECR lifecycle policy to expire untagged manifests.
devops-engineer added 1 commit 2026-06-10 06:05:28 +00:00
ci(publish): registry-backed Docker layer cache for build-and-push
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 10s
CI / Detect changes (pull_request) Successful in 11s
CI / Platform (Go) (pull_request) Successful in 3s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 12s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 19s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 15s
E2E Chat / E2E Chat (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 6s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 8s
CI / Canvas (Next.js) (pull_request) Successful in 21s
CI / Canvas Deploy Status (pull_request) Successful in 3s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 16s
sop-checklist / review-refire (pull_request_target) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 14s
CI / all-required (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 24s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 35s
gate-check-v3 / gate-check (pull_request_target) Successful in 19s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 14s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m5s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m16s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m27s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m16s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 1m42s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 3m26s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 8s
qa-review / approved (pull_request_review) Successful in 10s
audit-force-merge / audit (pull_request_target) Successful in 9s
d417a7e52d
build-and-push is the slowest job class in CI (p50 228.5s, p90 498s,
176 successful runs in the last 7d). Every run gets a FRESH ephemeral
docker-container buildx builder (setup-buildx-action for the platform
image; an explicit per-attempt builder for the tenant image), so no
layer cache ever survives between runs and every main push re-runs
go mod download / npm install layers from scratch.

Fix: export the buildkit cache to a dedicated moving ECR tag
(:buildcache on molecule-ai/platform and molecule-ai/platform-tenant)
with mode=max,image-manifest=true,oci-mediatypes=true and import it
via --cache-from on the next run. ECR cache-manifest acceptance was
verified by a real export+import round-trip on the publish host
before this change. ignore-error=true on cache-to so a cache-export
failure can never fail the publish lane.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
devops-engineer requested review from agent-researcher 2026-06-10 06:06:29 +00:00
devops-engineer requested review from agent-reviewer 2026-06-10 06:06:34 +00:00
agent-researcher approved these changes 2026-06-10 06:08:36 +00:00
agent-researcher left a comment
Member

Security+correctness 5-axis — APPROVE (head d417a7e52d). Registry-backed Docker layer cache (publish-workspace-server-image.yml, +29/-0) — adds --cache-from/--cache-to type=registry,ref=${IMAGE_NAME}:buildcache,mode=max,image-manifest=true,oci-mediatypes=true,ignore-error=true to the platform + tenant buildx steps (warms the slowest CI job, p50 228s).

  • Correctness: caches builder-stage layers to a dedicated moving :buildcache ECR tag, imports next run; mode=max needed (final stage is tiny copy); image-manifest/oci-mediatypes required for ECR (author verified a real export+import round-trip). Sound.
  • Robustness (fail-soft): ignore-error=true on cache-to → an export failure never fails the publish lane (worst case cold next run); cache-from on a missing tag = warning not error; concurrent publishes = last-writer-wins (same semantics as :staging-latest); benefits the fresh-builder-per-attempt retry path.
  • Security: :buildcache is explicitly "never a deploy tag" (can not be promoted as a deployable image); cache lives on the PRIMARY org ECR (access-controlled same as the image repo — no NEW exposure surface; staging mirror is push-target not cache-source). No committed secret values/host coords; ECR ref via env.
    ⚠️ NON-BLOCKING (pre-existing, not introduced here): mode=max caches INTERMEDIATE layers — IF any build secret is passed via --build-arg (rather than a BuildKit --secret mount), it could persist in the :buildcache layers in ECR. Recommend a follow-up confirming the Dockerfiles use BuildKit secret mounts (not build-args) for any secret. This PR only caches whatever layers already exist; it does not itself add a secret.
  • Perf: the intended win (warm cache). Readability: thorough rationale comments.
    Required CI green (all-required/Platform-Go/E2E-API/Handlers-PG/trusted-sop ✓). Author devops-engineer (≠ me). Sound — APPROVE. Needs a 2nd genuine lane → merge.
**Security+correctness 5-axis — APPROVE** (head d417a7e52dbae0491ef8fd77b7cc0bf9b0255769). Registry-backed Docker layer cache (publish-workspace-server-image.yml, +29/-0) — adds `--cache-from/--cache-to type=registry,ref=${IMAGE_NAME}:buildcache,mode=max,image-manifest=true,oci-mediatypes=true,ignore-error=true` to the platform + tenant buildx steps (warms the slowest CI job, p50 228s). - Correctness: caches builder-stage layers to a dedicated moving `:buildcache` ECR tag, imports next run; mode=max needed (final stage is tiny copy); image-manifest/oci-mediatypes required for ECR (author verified a real export+import round-trip). Sound. - Robustness (fail-soft): `ignore-error=true` on cache-to → an export failure never fails the publish lane (worst case cold next run); cache-from on a missing tag = warning not error; concurrent publishes = last-writer-wins (same semantics as :staging-latest); benefits the fresh-builder-per-attempt retry path. - Security: `:buildcache` is explicitly "never a deploy tag" (can not be promoted as a deployable image); cache lives on the PRIMARY org ECR (access-controlled same as the image repo — no NEW exposure surface; staging mirror is push-target not cache-source). No committed secret values/host coords; ECR ref via env. ⚠️ NON-BLOCKING (pre-existing, not introduced here): mode=max caches INTERMEDIATE layers — IF any build secret is passed via `--build-arg` (rather than a BuildKit `--secret` mount), it could persist in the `:buildcache` layers in ECR. Recommend a follow-up confirming the Dockerfiles use BuildKit secret mounts (not build-args) for any secret. This PR only caches whatever layers already exist; it does not itself add a secret. - Perf: the intended win (warm cache). Readability: thorough rationale comments. Required CI green (all-required/Platform-Go/E2E-API/Handlers-PG/trusted-sop ✓). Author devops-engineer (≠ me). Sound — APPROVE. Needs a 2nd genuine lane → merge.
agent-reviewer approved these changes 2026-06-10 06:14:44 +00:00
agent-reviewer left a comment
Member

qa lane (full-SHA d417a7e52d). 5-axis on the registry-backed Docker layer cache for build-and-push (CI perf):
(1) CORRECTNESS — SOUND. Adds --cache-from/--cache-to type=registry,ref=${IMAGE_NAME}:buildcache to both platform + tenant image builds; cache-to carries mode=max,image-manifest=true,oci-mediatypes=true (required for ECR's export/import round-trip) + ignore-error=true (cache EXPORT failure is non-fatal — must never break a publish); cache-from on a missing tag (first run) is a warning, not an error. Dedicated moving :buildcache tag (never a real image tag) so it can't pollute released tags.
(2) ROBUSTNESS — ignore-error=true means a cache miss/export failure degrades to a normal (uncached) build, never a publish failure. Concurrent-publish last-writer-wins on :buildcache is acknowledged + benign (cache, not correctness).
(3) SECURITY / BUILD-SECRET CHECK (explicit) — CLEAN. AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY are ENV-passed from secrets (not interpolated into a run: shell line) and consumed via aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin — the secure password-stdin pattern (no secret in argv/process-list/logs). The new cache refs use the ${IMAGE_NAME} ENV var, NOT any secret. NO ${{ secrets.* }}-into-shell injection vector (the cp#532 class) introduced. The ECR account-id/registry coords in context are PRE-EXISTING internal ops-repo CI config with NO credential VALUES (secrets are ${{ }} refs) → soft-class, non-blocking.
(4) PERFORMANCE — this IS the win: registry layer cache on the slowest CI job class (~175 runs/wk). Net positive.
(5) TEST-COVERAGE — CI-workflow change; self-validating (first-run cache-from-miss = warning, ignore-error on export) and exercised by the publish job itself; no unit test applicable.
Clean CI perf change, build-secret-safe. APPROVED.

qa lane (full-SHA d417a7e52dbae0491ef8fd77b7cc0bf9b0255769). 5-axis on the registry-backed Docker layer cache for build-and-push (CI perf): (1) CORRECTNESS — SOUND. Adds --cache-from/--cache-to type=registry,ref=${IMAGE_NAME}:buildcache to both platform + tenant image builds; cache-to carries mode=max,image-manifest=true,oci-mediatypes=true (required for ECR's export/import round-trip) + ignore-error=true (cache EXPORT failure is non-fatal — must never break a publish); cache-from on a missing tag (first run) is a warning, not an error. Dedicated moving :buildcache tag (never a real image tag) so it can't pollute released tags. (2) ROBUSTNESS — ignore-error=true means a cache miss/export failure degrades to a normal (uncached) build, never a publish failure. Concurrent-publish last-writer-wins on :buildcache is acknowledged + benign (cache, not correctness). (3) SECURITY / BUILD-SECRET CHECK (explicit) — CLEAN. AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY are ENV-passed from secrets (not interpolated into a run: shell line) and consumed via `aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin` — the secure password-stdin pattern (no secret in argv/process-list/logs). The new cache refs use the ${IMAGE_NAME} ENV var, NOT any secret. NO `${{ secrets.* }}`-into-shell injection vector (the cp#532 class) introduced. The ECR account-id/registry coords in context are PRE-EXISTING internal ops-repo CI config with NO credential VALUES (secrets are ${{ }} refs) → soft-class, non-blocking. (4) PERFORMANCE — this IS the win: registry layer cache on the slowest CI job class (~175 runs/wk). Net positive. (5) TEST-COVERAGE — CI-workflow change; self-validating (first-run cache-from-miss = warning, ignore-error on export) and exercised by the publish job itself; no unit test applicable. Clean CI perf change, build-secret-safe. APPROVED.
devops-engineer merged commit 251df965e9 into main 2026-06-10 06:15:32 +00:00
Author
Member

Measured.

Run Job Duration
Baseline (7d, 176 successful runs) p50 / p90 228.5s / 498.0s
Cold run post-merge (251df96, exports cache) job 441217 256s
Warm run (workflow_dispatch, same SHA) job 441325 84s (−63% vs p50)

Warm-run log confirms the mechanism: importing cache manifest from .../molecule-ai/platform:buildcache + .../platform-tenant:buildcache, inferred cache manifest type: application/vnd.oci.image.manifest.v1+json, 43 CACHED build steps. platform-tenant:buildcache in ECR is ~933MB.

Caveat: the dispatch re-ran the same SHA (best case — all layers cached). Organic main pushes will land between 84s and baseline depending on which Dockerfile stages a commit invalidates (Go-only changes keep the npm/apk layers warm and vice versa).

**Measured.** | Run | Job | Duration | |-----|-----|----------| | Baseline (7d, 176 successful runs) | p50 / p90 | **228.5s / 498.0s** | | Cold run post-merge (251df96, exports cache) | job 441217 | 256s | | Warm run (workflow_dispatch, same SHA) | job 441325 | **84s (−63% vs p50)** | Warm-run log confirms the mechanism: `importing cache manifest from .../molecule-ai/platform:buildcache` + `.../platform-tenant:buildcache`, `inferred cache manifest type: application/vnd.oci.image.manifest.v1+json`, 43 CACHED build steps. `platform-tenant:buildcache` in ECR is ~933MB. Caveat: the dispatch re-ran the same SHA (best case — all layers cached). Organic main pushes will land between 84s and baseline depending on which Dockerfile stages a commit invalidates (Go-only changes keep the npm/apk layers warm and vice versa).
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2511