fix(deploy): rollout POST read budget must exceed worst-case batch (#41) #3020
RFC#2843 #41: one dead tenant fails prod auto-deploy + blocks `:latest` promotion

Symptom (observed on the #38 deploy, run 378524). The whole healthy prod fleet (`hongming`, `reno-stars`, `molecule-adk-demo`) shipped to the new build, but the deploy reported `ok=false` / `result_count=0` / `The read operation timed out`, the verify and `:latest` promote steps were SKIPPED, and the job went red. Net effect: existing tenants get the build, but brand-new provisions keep pulling the stale `:latest`.

Root cause. The CP redeploys a batch's tenants concurrently (`runBatch`), so the `redeploy-fleet` POST only returns after the slowest tenant finishes: up to the CP `PerTenantTimeout` (5m SSM) + `/healthz` settle (90s), about 390s in total, for a stuck/dead box (`philbrew-erton`, a CF-525 box whose SSM agent never answers). The client read timeout was hardcoded at 120s, shorter than that worst case, so the client abandoned the call with an empty response before the CP could return the per-tenant results that the `max_stragglers=1` quarantine acts on. The designed quarantine path could therefore never run.

Fix.

The rollout POST gets its own read budget, `ROLLOUT_HTTP_TIMEOUT_DEFAULT_SECONDS=600` (env `PROD_AUTO_DEPLOY_ROLLOUT_HTTP_TIMEOUT_SECONDS`, floored at the fast-call default), which comfortably exceeds the concurrent-batch worst case. Dry-run / CI-status calls keep the fast 120s default.

`cp_api_json` → synthetic retryable 504 on a socket timeout, instead of crashing the run with a bare `read operation timed out`.

With this, a dead tenant returns as 1 straggler, `max_stragglers=1` quarantines it, the healthy fleet ships, `ok=true`, and `:latest` promotes.

SOP

`/buildinfo` (healthy fleet on `8ddce85`, philbrew CF-525).

🤖 Generated with Claude Code
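For readers checking the arithmetic, here is a minimal sketch of how the rollout read budget might be resolved. The constant name `ROLLOUT_HTTP_TIMEOUT_DEFAULT_SECONDS`, the env variable name, and the 120s / 5m / 90s figures come from the description above. The helper `rollout_http_timeout`, the other constant names, and the fallback on an unparsable override are assumptions made for illustration, not the code in this PR.

```python
import os

# Figures quoted in the PR description.
FAST_HTTP_TIMEOUT_SECONDS = 120  # dry-run / CI-status calls (constant name is hypothetical)
ROLLOUT_HTTP_TIMEOUT_DEFAULT_SECONDS = 600
ROLLOUT_TIMEOUT_ENV = "PROD_AUTO_DEPLOY_ROLLOUT_HTTP_TIMEOUT_SECONDS"

# Worst case for one concurrent batch: PerTenantTimeout (5m SSM) + /healthz settle (90s).
PER_TENANT_TIMEOUT_SECONDS = 5 * 60
HEALTHZ_SETTLE_SECONDS = 90
WORST_CASE_BATCH_SECONDS = PER_TENANT_TIMEOUT_SECONDS + HEALTHZ_SETTLE_SECONDS


def rollout_http_timeout(env=None):
    """Hypothetical helper: resolve the rollout POST read budget in seconds."""
    env = os.environ if env is None else env
    raw = env.get(ROLLOUT_TIMEOUT_ENV, "").strip()
    try:
        value = int(raw) if raw else ROLLOUT_HTTP_TIMEOUT_DEFAULT_SECONDS
    except ValueError:
        # Assumption: an unparsable override falls back to the default.
        value = ROLLOUT_HTTP_TIMEOUT_DEFAULT_SECONDS
    # Floor at the fast-call default so an override can never shrink below it.
    return max(value, FAST_HTTP_TIMEOUT_SECONDS)
```

With these numbers the default budget (600s) sits above the 390s worst case, while the old 120s value sat below it. The floor described in the PR is the fast-call default, so in this sketch an override between 120 and 390 would still undercut the worst case. The description does not say whether the real code guards against that.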
QA: rollout read budget now exceeds concurrent-batch worst case; quarantine path reachable; socket-timeout→504; 49 tests pass. APPROVE.
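For reference, the socket-timeout→504 behaviour could look roughly like the sketch below. Only the name `cp_api_json` and the "synthetic retryable 504" outcome come from this PR. The signature, the `(status, body)` return shape, the body keys, and the injectable `opener` parameter are assumptions chosen to keep the example self-contained.

```python
import json
import socket
import urllib.error
import urllib.request


def _synthetic_504(exc):
    # Assumed body shape: a retryable failure instead of a crash.
    return 504, {"ok": False, "retryable": True, "error": f"timeout: {exc}"}


def cp_api_json(url, payload, timeout, opener=urllib.request.urlopen):
    """Hypothetical sketch: POST JSON to the CP and return (status, body)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except (socket.timeout, TimeoutError) as exc:
        # Read-phase timeout: surface it as a synthetic 504.
        return _synthetic_504(exc)
    except urllib.error.URLError as exc:
        # urllib wraps connect-phase timeouts in URLError; treat them the same way.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return _synthetic_504(exc.reason)
        raise
```

Returning a status/body pair lets the caller apply its normal retry handling to a timeout, which is the point of the change: the run no longer dies on a bare `read operation timed out`.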
/sop-ack comprehensive-testing verified — #41 rollout HTTP budget.
/sop-ack local-postgres-e2e verified — #41 rollout HTTP budget.
/sop-ack staging-smoke verified — #41 rollout HTTP budget.
/sop-ack root-cause verified — #41 rollout HTTP budget.
/sop-ack five-axis-review verified — #41 rollout HTTP budget.
/sop-ack no-backwards-compat verified — #41 rollout HTTP budget.
/sop-ack memory-consulted verified — #41 rollout HTTP budget.
Security: timeout-only change; no new surface; dry-run/status budgets unchanged. APPROVE.