fix(ci): block superseded prod-deploy from rolling the fleet backward + settle /buildinfo (#2213) #2215
Root cause (RCA'd from prod logs — #2213)
The publish-workspace-server-image / Production auto-deploy (push) main-red was an ordering race between two overlapping deploy jobs, NOT a slow-settling tenant.

Two main pushes landed ~2 min apart: 7a72516 (05:30:21Z) then 7f25373 (05:32:28Z). This workflow has no concurrency: key (intentional: Gitea 1.22.6 cancels queued prod deploys), so BOTH deploy-production jobs ran.

Timeline (from job logs 275427 = 7f25373, 275383 = 7a72516), grouped by job:

- Newer job (7f25373): hongming → staging-7f25373 (ok:true, 8/8), soak 60s; verify then fails with ::error::hongming is stale: actual=7a72516, expected=7f25373 → RED.
- Older job (7a72516): hongming → staging-7a72516 (reverts it!), then superseded-guard skips verify, exits green; :latest → staging-7a72516 (older image).

So the OLDER 7a72516 job, superseded before it even started, rolled hongming BACKWARD and re-pointed :latest backward. The #2194 superseded guard only protected the verify step, which runs AFTER the redeploy + promote, so it didn't prevent the backward side-effects.

This is the mirror of #2194 (there: a superseded job false-red'd because the fleet was AHEAD; here: a superseded job actively rolled the fleet BEHIND and the newer job caught it).
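The ordering problem can be replayed in a few lines. This is only an illustrative Python sketch, not the workflow itself (which is CI YAML plus shell): `Prod`, `run_interleaved`, and the helper names are hypothetical, and the assumption that the newer job promotes :latest only after a green verify is mine. It runs the two jobs in the interleaving described above, once with the superseded check only at verify and once with the same check moved ahead of the side effects.

```python
from dataclasses import dataclass


@dataclass
class Prod:
    """Hypothetical stand-in for production state (not the real system)."""
    main_head: str          # commit that currently owns main
    tenant_tag: str = ""    # image tag the tenant (hongming) is running
    latest_tag: str = ""    # image tag that :latest points at


def is_superseded(job_sha: str, prod: Prod) -> bool:
    # A job is superseded when another commit already owns main.
    return prod.main_head != job_sha


def redeploy(job_sha: str, prod: Prod) -> None:
    prod.tenant_tag = f"staging-{job_sha}"


def promote(job_sha: str, prod: Prod) -> None:
    prod.latest_tag = f"staging-{job_sha}"


def verify(job_sha: str, prod: Prod) -> bool:
    return prod.tenant_tag == f"staging-{job_sha}"


def run_interleaved(gate_before_side_effects: bool):
    old, new = "7a72516", "7f25373"
    prod = Prod(main_head=new)  # both pushes landed before either rollout
    results = {}

    # Newer job rolls out first, then soaks.
    redeploy(new, prod)

    # Older job: with the gate, a superseded job skips every side effect.
    old_skipped = gate_before_side_effects and is_superseded(old, prod)
    if not old_skipped:
        redeploy(old, prod)  # rolls the tenant backward

    # Newer job verifies after its soak.
    results[new] = "green" if verify(new, prod) else "red"

    # Older job: the verify-only guard skips verify, so it exits green.
    results[old] = "green (verify skipped)"
    if not old_skipped:
        promote(old, prod)   # re-points :latest backward

    # Assumption: the newer job promotes only when its verify passed.
    if results[new] == "green":
        promote(new, prod)
    return prod, results


if __name__ == "__main__":
    for gated in (False, True):
        prod, results = run_interleaved(gated)
        print(gated, prod.tenant_tag, prod.latest_tag, results)
```

With the check only at verify, the sketch ends with both the tenant and :latest on staging-7a72516 and the newer job red; with the check ahead of the side effects, both end on staging-7f25373.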
Verdict: real per-tenant gap caused by a superseded job, with a latent settle false-red too
- Real gap: hongming is on 7a72516f right now (edge /buildinfo confirms, stable over hours; SHA is ldflags-baked so no cache). It needs a manual redeploy with target_tag=staging-latest (CTO will do this; NOT done here).
- Latent false-red: verify read /buildinfo once with no settle window for a tenant still draining its old container.

Fix (no change to redeploy/rollout logic itself)
- Check superseded before production side effects: runs the existing check-superseded BEFORE the rollout; gates OFF both the redeploy-fleet step and the :latest promote when a newer commit already owns main. Fail-safe: unreadable head ⇒ NOT superseded ⇒ genuine deploys never skip. In-step verify guard kept for "newer job lands DURING rollout".
- Per-tenant /buildinfo settle budget (default 240s / 20s interval, overridable via repo vars): poll until the tenant reports the target SHA or the budget is exhausted, then fail loud. A genuinely stuck tenant is NOT masked.

Validation

- test_prod_auto_deploy.py pass (incl. 2 new regression tests pinning the 7a72516/7f25373 shape)
- lint-workflow-yaml.py + lint-curl-status-capture.py clean
- bash -n clean on every run: block in the deploy job

Refs #2213. Do NOT auto-merge: prod deploy pipeline change, needs CTO sign-off.
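For readers who want the settle budget spelled out, the polling behaviour described under Fix can be sketched as below. This is a hedged illustration in Python, not the workflow's actual shell step: `wait_for_sha` and its parameters are hypothetical names, `fetch_sha` stands in for whatever reads the tenant's /buildinfo, and treating an `OSError` as "not settled yet" is an assumption. Only the 240s / 20s defaults come from this PR.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_sha(
    fetch_sha: Callable[[], str],
    expected: str,
    budget_s: float = 240.0,      # default settle budget from the PR
    interval_s: float = 20.0,     # default poll interval from the PR
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, Optional[str]]:
    """Poll until fetch_sha() returns `expected` or the budget runs out.

    Returns (ok, last_seen). The caller should fail loudly when ok is
    False, so a genuinely stuck tenant is reported rather than masked.
    """
    deadline = clock() + budget_s
    last: Optional[str] = None
    while True:
        try:
            last = fetch_sha()
        except OSError:
            # Assumption: a tenant that is mid-restart may refuse connections.
            last = None
        if last == expected:
            return True, last
        if clock() >= deadline:
            return False, last
        # Never sleep past the deadline; the final poll lands exactly on it.
        sleep(min(interval_s, max(0.0, deadline - clock())))
```

A caller would pass a function that fetches the tenant's reported SHA and exit non-zero when `ok` is false, printing `last_seen` alongside the expected SHA. The injectable `sleep` and `clock` exist only so the loop can be tested without real waiting.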
🤖 Generated with Claude Code
Owner-merged with CTO sign-off (王泓铭). Fixes #2213: superseded deploy-production jobs were rolling the fleet + :latest BACKWARD because the #2194 guard only protected the verify step, not the rollout. Now check-superseded runs BEFORE redeploy-fleet + the :latest promote (fail-safe on unreadable head), plus a per-tenant /buildinfo settle budget so a lagging tenant isn't a false-red. 35 tests. hongming already manually restored to staging-7f25373; this PR's clean deploy re-promotes :latest forward.