fix(ci): block superseded prod-deploy from rolling the fleet backward + settle /buildinfo (#2213) #2215

Merged
claude-ceo-assistant merged 1 commit from fix/prod-deploy-verify-tenant-lag-2213 into main 2026-06-04 06:19:17 +00:00

Root cause (RCA'd from prod logs — #2213)

The publish-workspace-server-image / Production auto-deploy (push) main-red was an ordering race between two overlapping deploy jobs, NOT a slow-settling tenant.

Two main pushes landed ~2 min apart: 7a72516 (05:30:21Z) then 7f25373 (05:32:28Z). This workflow has no `concurrency:` key (intentional: Gitea 1.22.6 cancels queued prod deploys), so BOTH deploy-production jobs ran.

Timeline (from job logs 275427 = 7f25373, 275383 = 7a72516):

| Time | Job | Action |
|------|-----|--------|
| 05:38:55 | 275427 (`7f25373`) | redeploy canary `hongming` → `staging-7f25373` |
| 05:39:13 | 275427 | CP verified hongming on-target (`ok:true`, 8/8), soak 60s |
| 05:39:21 | 275383 (`7a72516`) | job STARTS (main head already `7f25373`) |
| 05:41:08 | 275427 | edge verify: `::error::hongming is stale: actual=7a72516, expected=7f25373` → **RED** |
| 05:42:00 | 275383 | redeploy canary hongming → **`staging-7a72516`** (reverts it!), then superseded-guard skips verify, exits **green** |
| 05:42:05 | 275383 | promotes `:latest` → **`staging-7a72516`** (older image) |

So the OLDER 7a72516 job — superseded before it even started — rolled hongming BACKWARD and re-pointed :latest backward. The #2194 superseded guard only protected the verify step, which runs AFTER the redeploy + promote, so it didn't prevent the backward side-effects.

This is the mirror of #2194 (there: a superseded job false-red'd because the fleet was AHEAD; here: a superseded job actively rolled the fleet BEHIND and the newer job caught it).
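
To make the ordering concrete, the side effects from the timeline can be replayed in a few lines of Python. This is a minimal sketch, not the workflow's code: `EVENTS` and `replay` are hypothetical names, only the redeploy and promote side effects are modelled, and "superseded" is simplified to "the job's SHA differs from main's head".

```python
NEWER, OLDER = "7f25373", "7a72516"  # the two SHAs from the incident

# Side effects in the order they landed (times taken from the job logs).
EVENTS = [
    (NEWER, "redeploy"),  # 05:38:55, job 275427
    (OLDER, "redeploy"),  # 05:42:00, job 275383
    (OLDER, "promote"),   # 05:42:05, job 275383
]


def replay(gate_before_side_effects, main_head=NEWER):
    """Apply EVENTS and return the resulting canary tag and :latest tag."""
    state = {"canary": None, "latest": None}
    for job_sha, action in EVENTS:
        superseded = job_sha != main_head  # simplification, see lead-in
        if gate_before_side_effects and superseded:
            continue  # a gated job performs no redeploy and no promote
        target = "canary" if action == "redeploy" else "latest"
        state[target] = f"staging-{job_sha}"
    return state


if __name__ == "__main__":
    print("guard on verify only:     ", replay(gate_before_side_effects=False))
    print("guard before side effects:", replay(gate_before_side_effects=True))
```

With the guard on verify only, both the canary and `:latest` end on `staging-7a72516`. With the guard placed before the side effects, the canary stays on `staging-7f25373` and the older job never touches `:latest` (it stays `None` in this model).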

Verdict: real per-tenant gap caused by a superseded job, with a latent settle false-red too

  • hongming is genuinely on 7a72516f right now (edge /buildinfo confirms, stable over hours; SHA is ldflags-baked so no cache). It needs a manual redeploy with target_tag=staging-latest (CTO will do this — NOT done here).
  • The verify step also had a latent timing false-red: it polled /buildinfo once with no settle window for a tenant still draining its old container.

Fix (no change to redeploy/rollout logic itself)

  1. Check superseded before production side effects — runs the existing check-superseded BEFORE the rollout; gates OFF both the redeploy-fleet step and the :latest promote when a newer commit already owns main. Fail-safe: unreadable head ⇒ NOT superseded ⇒ genuine deploys never skip. In-step verify guard kept for "newer job lands DURING rollout".
  2. Per-tenant /buildinfo settle budget (default 240s / 20s interval, overridable via repo vars) — poll until the tenant reports the target SHA or the budget is exhausted, then fail loud. A genuinely stuck tenant is NOT masked.
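
The fail-safe rule in item 1 can be sketched as a small decision function. This is an illustration under stated assumptions, not the real `check-superseded` implementation: `is_superseded` and `plan_steps` are hypothetical names, "a newer commit owns main" is simplified to "main's head is a different commit", and the short-versus-full SHA tolerance is an assumption of this sketch.

```python
from typing import Dict, Optional


def is_superseded(job_sha: str, main_head: Optional[str]) -> bool:
    """True only when main's head was read and is a different commit."""
    head = (main_head or "").strip().lower()
    job = job_sha.strip().lower()
    if not head or not job:
        return False  # fail-safe: unreadable head => NOT superseded
    # Tolerate short vs. full SHAs (an assumption of this sketch).
    same_commit = head.startswith(job) or job.startswith(head)
    return not same_commit


def plan_steps(job_sha: str, main_head: Optional[str]) -> Dict[str, bool]:
    """Decide which production side effects this job may perform."""
    run = not is_superseded(job_sha, main_head)
    return {"redeploy_fleet": run, "promote_latest": run}


if __name__ == "__main__":
    print(plan_steps("7a72516", "7f25373"))  # older job, newer head
    print(plan_steps("7f25373", "7f25373"))  # job owns main
    print(plan_steps("7f25373", None))       # head could not be read
```

Only the first call gates both steps off. An unreadable head deliberately behaves like the second call, so a genuine deploy is never skipped because of a failed lookup.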

Validation

  • 35/35 test_prod_auto_deploy.py pass (incl. 2 new regression tests pinning the 7a72516/7f25373 shape)
  • lint-workflow-yaml.py + lint-curl-status-capture.py clean
  • bash -n clean on every run: block in the deploy job
  • YAML parses

Refs #2213. Do NOT auto-merge — prod deploy pipeline change, needs CTO sign-off.

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-04 05:57:54 +00:00
fix(ci): block superseded prod-deploy from rolling the fleet backward + settle /buildinfo (#2213)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 1s
CI / Detect changes (pull_request) Successful in 10s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 10s
E2E Chat / detect-changes (pull_request) Successful in 11s
CI / Python Lint & Test (pull_request) Successful in 14s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 10s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 11s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 3s
qa-review / approved (pull_request_target) Failing after 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
gate-check-v3 / gate-check (pull_request_target) Successful in 7s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 4s
CI / Platform (Go) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 2s
security-review / approved (pull_request_target) Failing after 12s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
CI / all-required (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request_target) Successful in 12s
E2E Chat / E2E Chat (pull_request) Successful in 6s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m0s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m16s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m12s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m11s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 2m12s
audit-force-merge / audit (pull_request_target) Successful in 5s
3b19919a95
Root cause of the #2213 main-red (`publish-workspace-server-image /
Production auto-deploy` failing on hongming "is stale"):

Two main pushes landed ~2 min apart (7a72516 then 7f25373). With no
`concurrency:` on this workflow (intentional — Gitea 1.22.6 cancels queued
prod deploys) BOTH deploy-production jobs run. The OLDER 7a72516 job started
late, after 7f25373 was already main's head. The #2194 superseded guard only
protected the *verify* step — it ran AFTER the redeploy and the :latest
promote. So the older job still:
  1. redeployed the canary (hongming) BACKWARD to staging-7a72516, reverting
     it from the newer SHA the 7f25373 job had just shipped — which is exactly
     what the 7f25373 job's verify then saw ("hongming is stale: actual=7a72516,
     expected=7f25373") -> main red; AND
  2. promoted :latest BACKWARD to the older staging-7a72516 image,
before finally skipping verify and exiting green.

Fix (defense in depth, no change to the redeploy/rollout logic itself):
- Add a "Check superseded before production side effects" step that runs the
  existing check-superseded BEFORE the rollout. When a newer commit already
  owns main, gate OFF both the redeploy-fleet step and the :latest promote so
  an older job never rolls the fleet (or :latest) backward. Fail-safe: an
  unreadable head is treated as NOT superseded, so a genuine deploy never
  silently skips. The in-step verify guard is kept to catch a newer job that
  lands DURING this job's rollout.
- Harden the /buildinfo verify with a bounded per-tenant settle budget
  (default 240s, 20s interval, both overridable via repo vars). `curl --retry`
  only retries connection/5xx failures, not a stale-but-200 body, so a tenant
  whose container the CP just swapped — still serving the draining old image
  at the edge — false-reds "stale" on the first poll. Now we poll until the
  tenant reports the target SHA or the budget is exhausted, then fail loud.
  A genuinely stuck tenant is NOT masked.
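
The settle budget described above amounts to polling on the response body rather than on the HTTP status. A minimal Python sketch, assuming a hypothetical `fetch_sha` callable that stands in for the GET of /buildinfo and returns the reported SHA; the 240s / 20s defaults come from the text, while the prefix match and the injectable `sleep` and `clock` parameters are assumptions made so the loop can be exercised without a network or real waiting:

```python
import time
from typing import Callable, Tuple


def wait_for_sha(
    fetch_sha: Callable[[], str],
    target_sha: str,
    budget_s: float = 240.0,
    interval_s: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, str]:
    """Poll until the tenant reports target_sha or the budget runs out.

    Returns (settled, last_seen_sha). A stale-but-200 body is simply
    another poll, which is what a transport-level retry cannot provide.
    """
    if not target_sha:
        raise ValueError("target_sha must be non-empty")
    deadline = clock() + budget_s
    while True:
        seen = fetch_sha()
        if seen.startswith(target_sha):
            return True, seen
        if clock() + interval_s > deadline:
            return False, seen  # budget exhausted: caller fails loud
        sleep(interval_s)


if __name__ == "__main__":
    now = [0.0]  # simulated clock, advanced by the fake sleep below
    bodies = iter(["7a72516", "7a72516", "7f25373"])
    ok, seen = wait_for_sha(
        lambda: next(bodies),
        "7f25373",
        sleep=lambda s: now.__setitem__(0, now[0] + s),
        clock=lambda: now[0],
    )
    print(ok, seen, now[0])
```

In the demo the tenant serves the old SHA twice and settles on the third poll, after 40 simulated seconds. A tenant that never reports the target returns `False` with the last SHA seen, so the caller can still fail with the stale value in the error.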

Tests: pin the superseded contract for the exact 7a72516/7f25373 incident
shape (older job superseded -> skip; latest job -> still rolls + verifies).
All 35 prod-auto-deploy unit tests pass; lint-workflow-yaml + curl-status
linters clean; every run block bash -n clean.

Refs #2213
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
claude-ceo-assistant merged commit 185ff47fed into main 2026-06-04 06:19:17 +00:00

Owner-merged with CTO sign-off (王泓铭). Fixes #2213: superseded deploy-production jobs were rolling the fleet and :latest BACKWARD because the #2194 guard only protected the verify step, not the rollout. Now check-superseded runs BEFORE redeploy-fleet and the :latest promote (fail-safe on unreadable head), and a per-tenant /buildinfo settle budget keeps a lagging tenant from being a false-red. 35 tests. hongming already manually restored to staging-7f25373; this PR's clean deploy re-promotes :latest forward.

Reference: molecule-ai/molecule-core#2215