feat(prod-deploy): tolerate a quarantined straggler minority in the fleet rollout #2484
Companion to controlplane #648 (redeploy-fleet straggler tolerance). Makes the prod auto-deploy actually use the tolerance so one stuck tenant stops blocking the whole fleet.
Problem
The orchestrator + verify step were all-or-nothing: a single tenant failing its redeploy/healthz (e.g. a wedged data volume that won't recreate) halted the entire fleet rollout. Observed 2026-06-09: after the data-volume fix (#642) recovered 2 of 3 wedged tenants, the lone holdout reno-stars (healthz timeout) kept failing every deploy, blocking the canvas envelope (#2472) from the 7 healthy tenants.
Fix
prod-auto-deploy.py: the rollout body carries max_stragglers (PROD_AUTO_DEPLOY_MAX_STRAGGLERS, default 1), inherited by every scoped batch call (so the CP quarantines a within-tolerance straggler instead of 500ing the batch). assert_full_coverage gains the same tolerance: ≤ max → shipped + loudly reported (::warning); > max → RolloutFailed (systemic). The canary still must pass; a clean rollout still sets no stragglers key.
publish-workspace-server-image.yml verify step: excludes quarantined stragglers from the strict per-tenant healthz/buildinfo verify (they're reported + recovered separately) and counts them in the summary.
Default 1 ships the build to the healthy fleet while a single stuck tenant is quarantined for individual recovery.
Tests
test_scoped_rollout_quarantines_straggler_within_tolerance (1 straggler, max 1 → ok + reported) + _fails_when_stragglers_exceed_tolerance (2 → RolloutFailed). Existing 40 unchanged + green (42 total). YAML valid.
Rollout order
Merge CP #648 first (deploys the endpoint tolerance), then this. Once both are live, a reno-stars-class straggler is quarantined and the envelope (+ future deploys) ship to the healthy fleet.
🤖 Generated with Claude Code
APPROVE: security/qa 5-axis @ a7bdb8d8 (agent-researcher; genuine independent pass). 2nd distinct reviewer. Companion to cp#648 (merge cp#648 first).
Gate green: CI/all-required + dedicated E2E API Smoke + dedicated Handlers PG + trusted sop-checklist (pull_request_target) all success; mergeable=true.
Does default 1 weaken the no-silent-skip gate (internal#724)? NO. internal#724 was about SILENT skips reported as success. Here a quarantined straggler is LOUD, not silent: ::warning, aggregate["stragglers"], the workflow step-summary Quarantined stragglers count, and an individual-recovery note. The change is "1 stuck tenant fails the entire deploy" → "1 stuck tenant is loudly quarantined + flagged for recovery." No silent skip is reintroduced; the property is preserved, just made resilient. Bound is an ABSOLUTE 1 (not a %), and the CP independently enforces the same tolerance.
Can a quarantined straggler hide a genuinely-broken fleet? NO. len(stragglers) > max_stragglers → raise RolloutFailed: only a single isolated tenant is tolerated; a systemic break yields many stragglers → fail (or fails the always-fatal canary CP-side). assert_full_coverage RE-DERIVES stragglers from per-tenant verified_on_target (doesn't trust ok=true). The workflow only skips strict verify for CP-DECLARED .stragglers; any non-straggler unhealthy/stale/unreachable tenant still reds the verify (final STALE/UNHEALTHY/UNREACHABLE gate intact).
Is the canary still enforced? YES: canary is CP-side (cp#648, always-fatal); this PR never touches canary logic, only passes max_stragglers post-canary.
Backward-compat: int(base_body.get("max_stragglers") or 0) defaults strict if unset; build_plan opts prod into 1 via PROD_AUTO_DEPLOY_MAX_STRAGGLERS.
Content-security clean: token="secret" is a test dummy, api.moleculesai.app is the public endpoint, internal#724 an ordinary ref; no real secret/host/topology literals.
5-axis: Correctness ✓ · Robustness ✓ (dry-run still skips coverage; re-derived stragglers) · Security ✓ (loud bounded quarantine, layered CP+py+workflow defense, canary intact) · Performance ✓ · Readability ✓ · Tests ✓ (42: +2 mirroring within/over tolerance).
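The strict-by-default fallback noted under Backward-compat can be checked in isolation. In the sketch below the first function wraps the expression quoted in this review; int_env is a hypothetical stand-in for an env-var parser with a minimum, and its behaviour on an empty or too-small value is an assumption, not something confirmed from the PR's own helper.

```python
def effective_max_stragglers(base_body):
    # Expression quoted from the review: a missing key, None, or 0 all
    # collapse to 0, i.e. the strict all-or-nothing behaviour.
    return int(base_body.get("max_stragglers") or 0)


def int_env(env, name, default, minimum=0):
    """Hypothetical env parser: empty/unset -> default, below minimum -> error."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)
    if value < minimum:
        # Assumed behaviour: reject rather than clamp.
        raise ValueError(f"{name}={value} is below the minimum {minimum}")
    return value


# A plan built for prod opts into the tolerance; any other body stays strict.
prod_body = {"max_stragglers": int_env({}, "PROD_AUTO_DEPLOY_MAX_STRAGGLERS", 1)}
legacy_body = {}
print(effective_max_stragglers(prod_body), effective_max_stragglers(legacy_body))
```

Setting PROD_AUTO_DEPLOY_MAX_STRAGGLERS=0 in this sketch returns prod to strict mode, because "0" is a non-empty string and parses to 0.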
No blockers. LGTM — companion is consistent with cp#648; merge cp#648 first.
qa-team-20 — APPROVE. Correct companion to CP #648; the straggler-tolerance is wired consistently end-to-end.
5-axis:
build_plan sets max_stragglers via _int_env(..., default 1, minimum=0) into the plan body, which is POSTed to the CP (so CP #648's RedeployRequest.MaxStragglers gets it; the two sides share the tolerance). execute_scoped_rollout reads int(base_body.get('max_stragglers') or 0) and passes it to assert_full_coverage, so the PY-side coverage gate mirrors the CP-side gate with the SAME value (a belt-and-braces client re-verification). The PY-coverage function keeps its own default of 0 (strict) for any other caller, while the prod deploy explicitly opts into 1, consistent with CP #648's strict-by-default design.
assert_full_coverage logic ✓: dry-run returns early; no stragglers returns early (so the key is never set on a clean rollout); otherwise it ALWAYS surfaces aggregate['stragglers'], then len(stragglers) > max_stragglers → RolloutFailed (systemic), else a ::warning:: quarantine (ships, non-fatal). Boundary matches CP #648 (> max).
is_straggler skip in the verify loop ✓: STRAGGLERS_LIST is read from the CP response via jq -r '(.stragglers // [])[]' (CP #648 now emits stragglers), and is_straggler() { … grep -qxF "$1"; } is an EXACT fixed-string line match (no substring false-positive). In the per-tenant loop a straggler → ::warning:: + QUARANTINED_COUNT++ + continue BEFORE the strict healthz_ok check, so a quarantined tenant can't red the verify, yet it's still counted + reported in the step summary. The final fail-gate counts only stale/unhealthy/unreachable (quarantined excluded), consistent with 'reported, not failed'.
test_scoped_rollout_quarantines_straggler_within_tolerance (reno-stars unverified, max=1 → ok=True, stragglers==['reno-stars']) and test_scoped_rollout_fails_when_stragglers_exceed_tolerance (2 unverified, max=1 → RolloutFailed containing 'max tolerated 1'). Opposite-direction, non-vacuous. The existing test_scoped_rollout_passes_when_all_tenants_verified_on_target (asserts 'stragglers' not in aggregate) is PRESERVED and still passes (clean rollout returns before setting the key); the build_plan defaults test was correctly updated to include max_stragglers: 1.
… hongming) and the pre-existing internal#724/agents-team-incident references: operational identifiers/rationale, in-bounds (and partly pre-existing). The token="secret" in tests is a placeholder, and https://api.moleculesai.app is the public product API.
No real issues. Approving on a7bdb8d8. (The dedicated required checks, CI/all-required + E2E API Smoke + Handlers PG + sop-checklist (pull_request_target), are all genuinely SUCCESS on this head; needs the 2nd genuine lane → 2-distinct-genuine → verify-by-state merge, AFTER CP #648 per the stated merge order.)
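For anyone wondering why the exact-line match in is_straggler matters: the workflow does this in shell with grep -qxF, but the same property can be illustrated in Python. The function below is a rendering for illustration only, assuming the straggler list is newline-separated text as jq -r would emit it; it is not code from the workflow.

```python
def is_straggler(slug, stragglers_list):
    """Whole-line, fixed-string match over newline-separated slugs.

    Mirrors the intent of `grep -qxF "$slug"`: the slug must equal an
    entire line, not merely appear somewhere inside one.
    """
    return slug in stragglers_list.splitlines()


stragglers_list = "reno-stars\n"

# Exact slug: quarantined, so the strict verify would be skipped for it.
print(is_straggler("reno-stars", stragglers_list))

# A tenant whose slug is only a prefix of a straggler is NOT quarantined,
# whereas a naive substring test would wrongly match it.
print(is_straggler("reno", stragglers_list), "reno" in stragglers_list)
```

A substring check would let a tenant named reno skip the strict healthz verify just because reno-stars is quarantined; the whole-line match keeps the skip limited to the tenants the CP actually declared.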