ci(tenant-image): add build-time smoke gate so broken image never becomes :staging-latest (P0 SEV) #3111
P0 SEV hardening — prod tenant onboarding was down
Per PM dispatch d8ae426e (2026-06-21): start_platform docker run exit=127 on tenant boot; container never starts → 502. The build was pushing the broken image to ECR as `:staging-latest` without any local verification, then `:latest` was advanced by `deploy-production` after canary verify (which also missed the defect).
Fix
The tenant-image build now uses buildx `--load` (not `--push`) so the just-built image is loaded into the runner's local daemon. After build:
- `docker run` the image locally (port 18080→8080)
- poll `http://localhost:18080/healthz` every 2s for up to 120s
- on failure, emit `::error::` so the failure is actionable
A broken image can no longer become `:staging-latest`.
Why both this gate AND canary/staging-verify
The post-push `canary/staging-verify` job remains as the cloud-side safety net (catches issues that only manifest in the cloudflared/EC2/staging-org context the local smoke cannot reproduce). The build-time gate catches the exit=127 / won't-boot class of defect ~10x faster (no ECR round-trip, no canary provisioning) and with zero blast radius (no broken image in ECR to roll back).
Diff
+75 / -3 lines in `.gitea/workflows/publish-workspace-server-image.yml` (single file).
Test plan
- `bash -n` on the run block
Rollback
Single-file revert is safe: `git revert 48bb97e2` restores the `--push`-only behavior. The canary/staging-verify remains as the only safety net (regression to pre-fix state, but no worse).
Refs: PM dispatch d8ae426e, internal#2187 (gate-making plan), cp#245 (boot-timeout flake surface; smoke gate is local and unaffected).
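For readers without the diff open, the gate described under Fix could look roughly like the sketch below. It is not the workflow's actual run block: the helper names `wait_until` and `smoke_gate`, the container name `tenant-smoke`, and the exact `curl` flags are assumptions made for illustration. Only the port mapping, the `/healthz` URL, the 2s/120s polling budget, and the `::error::` annotation come from the description above.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; helper and container names are hypothetical.
set -euo pipefail

# Retry a probe command every <interval> seconds until it succeeds or
# <timeout> seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  while (( waited < timeout )); do
    if "$@"; then
      return 0
    fi
    sleep "${interval}"
    waited=$(( waited + interval ))
  done
  return 1
}

# Run the locally loaded image, poll /healthz, and report failure so the
# caller can stop before any push. Always removes the smoke container.
smoke_gate() {
  local image="$1" name="tenant-smoke" rc=0
  if ! docker run -d --name "${name}" -p 18080:8080 "${image}" >/dev/null; then
    echo "::error::tenant image did not start (docker run failed)"
    rc=1
  elif ! wait_until 120 2 curl -fsS -o /dev/null --max-time 2 \
      http://localhost:18080/healthz; then
    docker logs "${name}" || true   # surface boot output for debugging
    echo "::error::tenant image failed local /healthz smoke; refusing to push"
    rc=1
  fi
  docker rm -f "${name}" >/dev/null 2>&1 || true
  return "${rc}"
}
```

A caller would invoke something like `smoke_gate "${IMAGE}" || exit 1` after the `--load` build and before the first `docker push`, where `IMAGE` is whatever ref the build step produced.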
🤖 Generated with Claude Code
APPROVED on current head `48bb97e2`.
5-axis: correctness: the workflow now builds the tenant image with `--load`, runs the just-built image locally, polls `/healthz` for 120s, and exits before any `docker push` if the smoke fails, so a broken image cannot become `:staging-latest`. Robustness: logs are emitted on failure and cleanup runs for the smoke container; pushes happen only after smoke passes. Security: no new secret exposure, existing registry flow preserved. Performance: adds a bounded pre-push smoke cost but avoids ECR/staging round trips on broken images. Readability: comments make the P0 gate intent clear.
New commits pushed, approval review dismissed automatically according to repository settings
REQUEST_CHANGES on current head `248c7f52`.
Blocking finding: `.gitea/workflows/publish-workspace-server-image.yml:518` iterates `for t in "${build_tags[@]}"`, but `build_tags` is an alternating argv array declared at lines 283-288: `--tag`, image ref, `--tag`, image ref, etc. The first loop iteration therefore computes `tag_value="--tag"` and runs `docker push --tag`, so even if both smoke variants pass, the publish step fails before pushing any tenant image. Iterate only over the tag-value elements, e.g. by index over odd positions, or store a separate image-ref list for pushing.
5-axis: correctness: smoke design is directionally right and now covers full-env sidecar plus sidecar-disabled paths, but the post-smoke push loop is mechanically wrong and prevents successful publication. Robustness: the gate is fail-closed before ECR push, but the broken loop makes the release lane unusable. Security: no new secret exposure found. Performance: added pgvector/full-env smoke cost is acceptable for a publish lane. Readability: comments explain the RCA and variants well, but the argv-array reuse is misleading enough to cause this bug.
Status: does NOT meet 2-genuine/green for merge; current combined status is failure and CR2's prior approval is stale/dismissed.
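To make the first suggested remedy concrete, here is a minimal sketch of iterating only the odd positions of an alternating `--tag`/ref array. The image refs and the helper name `tag_values` are hypothetical; the real refs are the ones declared in the workflow.

```bash
# Hypothetical refs standing in for the workflow's real tags.
build_tags=(
  --tag "registry.example/tenant:staging-latest"
  --tag "registry.example/tenant:sha-abc123"
)

# Print only the tag values from an alternating "--tag <ref>" argv list.
# Indices 0, 2, ... hold the "--tag" flags; indices 1, 3, ... hold the refs.
tag_values() {
  local -a argv=("$@")
  local i
  for (( i = 1; i < ${#argv[@]}; i += 2 )); do
    printf '%s\n' "${argv[i]}"
  done
}
```

Calling `tag_values "${build_tags[@]}"` yields one image ref per line, so a push loop fed from it never sees the `--tag` flags. This relies on the array staying strictly paired, which is the fragility the second suggestion (a separate image-ref list) avoids.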
REQUEST_CHANGES on current head `248c7f52`.
Blocking finding: the two smoke variants are directionally correct and fail before publish, but the post-smoke publish loop is broken. `.gitea/workflows/publish-workspace-server-image.yml` iterates `for t in "${build_tags[@]}"`; that array is alternating buildx argv entries (`--tag`, image-ref, `--tag`, image-ref). The first iteration therefore computes `tag_value="--tag"` and runs `docker push --tag`, so a successful smoke still fails before any image reaches `:latest`/`:staging-latest`.
5-axis: correctness: FULL ENV and MEMORY_PLUGIN_DISABLE=1 smoke coverage matches the requested P0 hardening, but publication is mechanically broken. Robustness: the gate is fail-closed before push, but currently also prevents any successful push. Security: no new secret exposure found. Performance: bounded smoke cost is acceptable for this lane. Readability: the argv-array reuse is misleading; use a separate image-ref list or iterate only the odd tag-value positions.
APPROVED on current head `ec2d48a1`.
5-axis: correctness: the prior RC is addressed. `build_tags` remains the buildx argv list for `docker buildx build`, while the new parallel `push_refs` contains only bare image refs and the push loop now iterates `for ref in "${push_refs[@]}"; docker push "${ref}"`, so it cannot run `docker push --tag`. The two smoke variants still gate publication before any push. Robustness: smoke failures and push failures remain fail-closed. Security: no new secret exposure. Performance: bounded pre-push smoke cost remains appropriate for the publish lane. Readability: comments now clearly separate build argv from push refs.
APPROVED on current head `ec2d48a1`.
5-axis: correctness: the RC 12946 push-loop bug is fixed. `build_tags` remains the buildx argv array (`--tag`, ref pairs), while the new `push_refs` array contains only bare image refs and the push loop iterates `for ref in "${push_refs[@]}"; docker push "${ref}"`, so it can no longer run `docker push --tag`. The pre-push smoke gate still runs both full-env sidecar and sidecar-disabled variants before any ECR push. Robustness: fail-closed before publishing; cleanup/traps remain in place; both smoke variants must pass. Security: no new secret exposure. Performance: added smoke cost is acceptable in the publish lane. Readability: comments now clearly distinguish buildx tag argv from push refs and reference the prior RCs.
CI/merge-readiness: latest readback was not green yet: reserved/security checks were waiting on current-head approvals and staging E2E/template-delivery contexts were pending; gate-check reported CI_FAIL due pending contexts. Code review is approved, but merge should still wait for required CI/policy green.
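As a rough illustration of the approved shape, the sketch below keeps the two arrays in step by appending to both from one place. The `build_tags` and `push_refs` names come from the reviews; the `add_tag` and `push_all` helpers, the example refs, and the `PUSH_CMD` dry-run override are assumptions and may not match the merged workflow.

```bash
build_tags=()   # argv for `docker buildx build`: --tag, ref, --tag, ref
push_refs=()    # bare image refs for `docker push`

# Register a ref once so the build argv and the push list cannot drift apart.
add_tag() {
  build_tags+=(--tag "$1")
  push_refs+=("$1")
}

# Hypothetical refs standing in for the workflow's real tags.
add_tag "registry.example/tenant:staging-latest"
add_tag "registry.example/tenant:sha-abc123"

# Push every bare ref. PUSH_CMD is a hypothetical override so the loop can be
# dry-run (for example PUSH_CMD=echo) without a registry or Docker daemon.
push_all() {
  local ref
  for ref in "${push_refs[@]}"; do
    ${PUSH_CMD:-docker push} "${ref}"
  done
}
```

Running `PUSH_CMD=echo push_all` prints the refs that would be pushed, which is a cheap way to confirm that no `--tag` element can reach `docker push`.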