fix(ci): retry buildinfo verification with backoff for slow-rolling tenants (closes #2213) #2223

Closed
core-be wants to merge 1 commit from fix/2213-redeploy-verify-stale-tenant-retry into main

Closes #2213

Summary

  • Main-red #2213: production auto-deploy verification fails because the strict /buildinfo SHA check runs immediately after redeploy-fleet returns, before hongming (and potentially other tenants) has finished rolling out.
  • Wrap the per-tenant /buildinfo fetch + SHA comparison in a bounded retry loop (5 attempts with a 10s sleep between attempts, so at most 40s of extra wait per slow tenant) before declaring the tenant stale.
  • Keeps existing curl retry flags for network resilience. Does not change redeploy-fleet call or stragglers report.
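
The retry loop described above could look roughly like the following. This is a minimal sketch, not the code from the workflow: the function names `fetch_buildinfo_sha` and `verify_tenant_sha`, the `MAX_ATTEMPTS` and `BACKOFF_SECONDS` variables, and the assumption that `/buildinfo` returns JSON with a `sha` field are all hypothetical. Only the 5 attempts and the 10s backoff come from the description.

```shell
# Hypothetical sketch of a bounded retry around the /buildinfo SHA check.
# All names below and the "sha" JSON field are assumptions for illustration.

MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"        # 5 attempts, per the PR description
BACKOFF_SECONDS="${BACKOFF_SECONDS:-10}" # 10s between attempts, per the PR description

# Print the SHA a tenant reports, or nothing if the request fails.
# Assumes /buildinfo returns JSON such as {"sha":"<commit>"}.
fetch_buildinfo_sha() {
  curl -fsS --max-time 10 "$1/buildinfo" 2>/dev/null \
    | sed -n 's/.*"sha"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Return 0 as soon as the tenant reports the expected SHA,
# or 1 (stale) after MAX_ATTEMPTS mismatches.
verify_tenant_sha() {
  local url="$1" expected="$2" attempt=1 actual
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    actual="$(fetch_buildinfo_sha "$url" || true)"
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    # Sleep only between attempts: 4 sleeps of 10s gives the 40s upper bound.
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
      sleep "$BACKOFF_SECONDS"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A caller would invoke something like `verify_tenant_sha "https://tenant.example" "$EXPECTED_SHA"` for each reachable tenant and record the tenant as stale only when it returns non-zero. The comparison itself stays an exact string match; only the number of chances a tenant gets changes.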

Test plan

  • bash -n .gitea/workflows/publish-workspace-server-image.yml passes
  • Additive change: the single-shot strict SHA check becomes a bounded retry of the same check; no consumer contracts changed

Scope

  • .gitea/workflows/publish-workspace-server-image.yml — 16 insertions, 3 deletions in the Verify reachable tenants report this SHA step

/sop-ack engineer-ack as core-be

core-be added 1 commit 2026-06-04 06:56:42 +00:00
fix(ci): retry buildinfo verification with backoff for slow-rolling tenants (closes #2213)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 1s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 8s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 9s
CI / Detect changes (pull_request) Successful in 13s
E2E API Smoke Test / detect-changes (pull_request) Successful in 14s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 4s
sop-checklist / review-refire (pull_request_target) Has been skipped
qa-review / approved (pull_request_target) Failing after 6s
gate-check-v3 / gate-check (pull_request_target) Successful in 7s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
sop-tier-check / tier-check (pull_request_target) Successful in 6s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
E2E Chat / E2E Chat (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
CI / Platform (Go) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 1s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
sop-checklist / all-items-acked (pull_request_target) Successful in 13s
security-review / approved (pull_request_target) Failing after 13s
CI / all-required (pull_request) Successful in 10s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m0s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m15s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m21s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m30s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m37s
audit-force-merge / audit (pull_request_target) Has been skipped
20e37f1fad
Main-red #2213: production auto-deploy verification fails for hongming
because the strict /buildinfo SHA comparison runs immediately after
redeploy-fleet returns, before slower-rolling tenants have finished
cycling their containers.

Fix:
- Wrap the per-tenant /buildinfo fetch + SHA comparison in a retry
  loop (5 attempts, 10s backoff) before declaring stale.
- Keeps the existing curl retry flags for network resilience.
- Only affects the verification step; the redeploy-fleet call and
  stragglers report are unchanged.

SOP: /sop-ack engineer-ack as core-be
Tested: bash syntax check; no inline-shell unit tests exist for this
workflow step, but the change is additive (strict → lenient with
bounded retry).
core-be closed this pull request 2026-06-04 07:21:02 +00:00
core-be deleted branch fix/2213-redeploy-verify-stale-tenant-retry 2026-06-04 07:21:02 +00:00
Some checks are pending
Required checks:
qa-review / approved (pull_request_target) Failing after 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
security-review / approved (pull_request_target) Failing after 13s
CI / all-required (pull_request) Successful in 10s
reserved-path-review / reserved-path-review (pull_request_target)


Reference: molecule-ai/molecule-core#2223