fix(deploy): #2859 bounded retry + error surfacing for redeploy-fleet transient 502s #2862
Fixes #2859.

The production auto-deploy helper hard-failed when CP returned HTTP 502 for the `hongming` canary `redeploy-fleet` call. 502/503/504 are typically transient gateway/upstream flakes (SSM, ECS), so the whole fleet rollout should not halt on a single unclassified gateway error.

Changes:

- `redeploy_scoped()` now retries HTTP 502/503/504 up to 3 times with 5s/10s/20s backoff, emitting `::warning::` step annotations.
- `_raise_for_redeploy_result()` surfaces the CP error body (`error`/`message`/truncated JSON) in the `RuntimeError` so the operator sees the tenant-level reason instead of just the status code.

Test plan:

- `python -m pytest .gitea/scripts/tests/test_prod_auto_deploy.py`: 46 passed locally.

REQUEST_CHANGES on head
`227e33c7ac`.

Required core CI is green on the exact head (`CI / all-required`, `E2E API Smoke Test`, `Handlers Postgres Integration`, and `E2E Peer Visibility` are present+success; the local-provision real-image failure is advisory). The prod-auto-deploy tests also ran in Ops Scripts Tests.

Blocking findings:

1. Retry loop sleeps after the final allowed attempt and logs a retry that will never happen. `redeploy_scoped()` iterates over `[5, 10, 20]`; on the third transient response it prints `attempt 3/3; retrying in 20s`, sleeps 20 seconds, then exits the loop and returns the failure without making another request. That is bounded, but the backoff is not sane: it delays surfacing the terminal CP error and the warning is misleading. Make the last transient response return immediately, or define the contract as initial attempt + 3 retries and actually perform the fourth call after the 20s backoff. Update the tests so they would fail on a terminal sleep-without-retry.
2. The transient 502/503/504 diagnostics are still too thin for the operator path this PR is fixing. The warning includes status and `only_slugs`, but not the endpoint URL/path or CP response body. `_raise_for_redeploy_result()` surfaces the body only after the final result is handed to the caller; interim transient warnings still hide the reason for each retry, and the tests do not assert body/endpoint logging. Include the endpoint and a bounded/truncated body detail in the retry warning (without secrets), and pin that in tests.
3. Scope drift: this PR is described/reviewed as the #2859 prod deploy helper change, but the diff also changes `.gitea/workflows/local-provision-e2e.yml`, `tests/e2e/test_local_provision_lifecycle_e2e.sh`, and `workspace-server/internal/provisioner/*` for #2851 advertise-host behavior. Please split or remove those unrelated local-provision changes from this PR so the deploy-helper fix can be reviewed and landed independently.

The retry concept is right, and non-transient 500 no-retry plus CP error-body surfacing are directionally covered, but the terminal backoff/diagnostic behavior and unrelated #2851 changes need cleanup before approval.
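
To make the thin-diagnostics finding concrete, here is a minimal sketch of a retry warning that names the endpoint and carries a bounded body excerpt. It is not the code in `prod-auto-deploy.py`: `summarize_body`, `retry_warning`, the 200-character cap, and the `/example/redeploy-fleet` path are all hypothetical names and values chosen for illustration, and secret redaction is deliberately left out of scope.

```python
import json

MAX_BODY_CHARS = 200  # hypothetical truncation limit, not taken from the PR


def summarize_body(body, limit=MAX_BODY_CHARS):
    """Return a bounded, single-line rendering of a CP response body."""
    if isinstance(body, (dict, list)):
        text = json.dumps(body, sort_keys=True)
    else:
        text = str(body)
    # Collapse whitespace so the step annotation stays on one line.
    text = " ".join(text.split())
    if len(text) > limit:
        text = text[:limit] + "...(truncated)"
    return text


def retry_warning(status, endpoint, body, attempt, total_attempts, delay):
    """Build a ::warning:: annotation with endpoint, status, and CP reason."""
    return (
        f"::warning::redeploy-fleet {endpoint} returned HTTP {status} "
        f"(attempt {attempt}/{total_attempts}); retrying in {delay}s; "
        f"body: {summarize_body(body)}"
    )


# Illustrative call with a made-up endpoint path and body.
print(retry_warning(502, "/example/redeploy-fleet",
                    {"error": "upstream timeout"}, 1, 4, 5))
```

A test can then pin the endpoint and the body excerpt in the emitted warning, so a regression that drops either one fails instead of passing silently.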
REQUEST_CHANGES on `227e33c7ac`.

The bounded retry direction is right, and exact-head required/code CI is green (`CI / all-required`, Platform Go, E2E API Smoke, Handlers Postgres, Peer Visibility, Local Provision stub). The known staging-LLM and real-image advisory reds are not blockers.

Blocker: `redeploy_scoped()` sleeps after the final transient response even though no further CP call will be made. The loop at `.gitea/scripts/prod-auto-deploy.py:234-246` iterates over `[5, 10, 20]`, calls CP three times, and on the third 502/503/504 prints “retrying in 20s”, sleeps 20s, then immediately returns `last_status, last_body`. That wastes 20s on every exhausted transient failure and the log says it is retrying when it is not. The test currently pins this bad behavior (`test_redeploy_scoped_gives_up_after_max_retries` asserts sleeps == `[5, 10, 20]`).

Please make the semantics explicit and load-bearing: either three total attempts with sleeps `[5, 10]`, or initial attempt plus three retries with four total CP calls and sleeps `[5, 10, 20]` before attempts 2-4. In either case, do not sleep after the last attempted request, and update the warning text/test so an exhausted run cannot false-green on wasted terminal backoff.

Secondary scope note: this PR is described as the #2859 prod-deploy helper, but the current diff also changes local-provision/provisioner files for #2851. Those may be valid, but they are outside the stated scope and should either be split or called out explicitly so reviewers do not treat this as prod-auto-deploy.py-only.
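
The two acceptable contracts can be sketched with one loop. This is an illustration under stated assumptions, not the helper's actual implementation: `call_with_bounded_retry` and its `request`, `sleep`, and `log` parameters are hypothetical, and `request` stands in for the CP call as a zero-argument callable returning `(status, body)`.

```python
import time

TRANSIENT_STATUSES = frozenset({502, 503, 504})
BACKOFF_SECONDS = (5, 10, 20)  # one delay per retry, applied before attempts 2-4


def call_with_bounded_retry(request, delays=BACKOFF_SECONDS,
                            sleep=time.sleep, log=print):
    """Call request() once, then retry transient failures len(delays) times.

    Total attempts are 1 + len(delays); no sleep follows the last attempt.
    """
    total_attempts = 1 + len(delays)
    for attempt in range(1, total_attempts + 1):
        status, body = request()
        if status not in TRANSIENT_STATUSES:
            # Success or non-transient error (for example 500): never retried.
            return status, body
        if attempt == total_attempts:
            # Terminal transient failure: report and return without sleeping.
            log(f"::warning::HTTP {status} on attempt "
                f"{attempt}/{total_attempts}; retries exhausted")
            return status, body
        delay = delays[attempt - 1]
        log(f"::warning::HTTP {status} on attempt "
            f"{attempt}/{total_attempts}; retrying in {delay}s")
        sleep(delay)


# Exhausted run with a recording sleep: four calls, three sleeps, none trailing.
slept = []
result = call_with_bounded_retry(lambda: (502, {"error": "bad gateway"}),
                                 sleep=slept.append)
print(result[0], slept)
```

Passing `delays=(5, 10)` gives the other option, three total attempts with two sleeps. Injecting `sleep` lets a test assert the exact sleep sequence alongside the call count, which is what makes a terminal sleep-without-retry detectable.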
`227e33c7ac` to `0deda38a0c`

APPROVED on head `0deda38a0c`.

Re-reviewed the fixed head against RC #11780:

- `total_attempts = 1 + REDEPLOY_MAX_RETRIES`, delays `[5,10,20]` are applied only before attempts 2-4, and exhausted transient failures log `retries exhausted` without a terminal sleep.
- The PR now touches only `.gitea/scripts/prod-auto-deploy.py` and `.gitea/scripts/tests/test_prod_auto_deploy.py`; the unrelated #2851 local-provision/provisioner files are gone.
- `CI / all-required`, `E2E API Smoke Test`, `Handlers Postgres Integration`, and `E2E Peer Visibility` are present+success. `Ops Scripts Tests` is also success on this head.

The retry remains capped, non-transient 500s are not retried, and CP failure bodies are surfaced for operator diagnosis.
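
As a rough illustration of the body surfacing described in this PR (prefer `error`, then `message`, then truncated JSON), the sketch below shows one way it could look. It is an assumption-laden example, not the real `_raise_for_redeploy_result()`: the function names, the 300-character cap, and the sample error text are hypothetical.

```python
import json

MAX_DETAIL_CHARS = 300  # hypothetical cap on the surfaced detail


def error_detail(body, limit=MAX_DETAIL_CHARS):
    """Pick the most useful field from a CP body, else fall back to JSON."""
    if isinstance(body, dict):
        for key in ("error", "message"):
            value = body.get(key)
            if value:
                return str(value)[:limit]
    try:
        text = json.dumps(body, sort_keys=True)
    except TypeError:
        # Body is not JSON-serializable; fall back to its repr.
        text = repr(body)
    return text[:limit]


def raise_for_result(status, body):
    """Raise RuntimeError carrying the status and the CP-provided reason."""
    if 200 <= status < 300:
        return
    raise RuntimeError(
        f"redeploy-fleet failed: HTTP {status}: {error_detail(body)}"
    )


# Illustrative failure with a made-up error body.
try:
    raise_for_result(502, {"error": "tenant upstream unavailable"})
except RuntimeError as exc:
    print(exc)
```

Keeping the detail bounded means a large or unexpected CP payload cannot flood the job log, while the operator still sees the reason next to the status code.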
APPROVE on `0deda38a0c`.

Re-verified the #11781 fixes. Retry semantics are now explicit and correct: `redeploy_scoped()` performs initial + 3 retries (4 total attempts), sleeps `[5, 10, 20]` only before real follow-up attempts, and the final transient response logs `retries exhausted` instead of a misleading `retrying` message. The tests drive the real function via monkeypatched `cp_api_json`/`time.sleep`, assert the no-trailing-sleep shape, and cover retry success, exhausted retries, non-transient no-retry, and CP error body surfacing.

Scope drift is also fixed: PR files are only `.gitea/scripts/prod-auto-deploy.py` and `.gitea/scripts/tests/test_prod_auto_deploy.py`; the #2851 local-provision/provisioner changes are no longer in this PR. Exact-head required/code CI is green (CI / all-required, Platform Go, E2E API Smoke, Handlers Postgres, Peer Visibility, Local Provision stub). Remaining reds are ceremony gates and the known real-image advisory, not blockers for this helper-only fix.