feat(prod-deploy): tolerate a quarantined straggler minority in the fleet rollout #2484

Merged
molecule-code-reviewer merged 1 commits from fix/deploy-straggler-tolerance into main 2026-06-09 17:23:16 +00:00
Member

Companion to controlplane #648 (redeploy-fleet straggler tolerance). Makes the prod auto-deploy actually use the tolerance so one stuck tenant stops blocking the whole fleet.

Problem

The orchestrator + verify step were all-or-nothing: a single tenant failing its redeploy/healthz (e.g. a wedged data volume that won't recreate) halted the entire fleet rollout. Observed 2026-06-09: after the data-volume fix (#642) recovered 2 of 3 wedged tenants, the lone holdout reno-stars (healthz timeout) kept failing every deploy — blocking the canvas envelope (#2472) from the 7 healthy tenants.

Fix

  • prod-auto-deploy.py: the rollout body carries max_stragglers (PROD_AUTO_DEPLOY_MAX_STRAGGLERS, default 1), inherited by every scoped batch call (so the CP quarantines a within-tolerance straggler instead of 500ing the batch). assert_full_coverage gains the same tolerance: ≤ max → shipped + loudly reported (::warning); > max → RolloutFailed (systemic). The canary still must pass; a clean rollout still sets no stragglers key.
  • publish-workspace-server-image.yml verify step: excludes quarantined stragglers from the strict per-tenant healthz/buildinfo verify (they're reported + recovered separately) and counts them in the summary.

Default 1 ships the build to the healthy fleet while a single stuck tenant is quarantined for individual recovery.

Tests

test_scoped_rollout_quarantines_straggler_within_tolerance (1 straggler, max 1 → ok + reported) + _fails_when_stragglers_exceed_tolerance (2 → RolloutFailed). Existing 40 unchanged + green (42 total). YAML valid.

Rollout order

Merge CP #648 first (deploys the endpoint tolerance), then this — once both are live, a reno-stars-class straggler is quarantined and the envelope (+ future deploys) ship to the healthy fleet.

🤖 Generated with Claude Code

Companion to **controlplane #648** (redeploy-fleet straggler tolerance). Makes the prod auto-deploy actually use the tolerance so one stuck tenant stops blocking the whole fleet. ## Problem The orchestrator + verify step were all-or-nothing: a single tenant failing its redeploy/healthz (e.g. a wedged data volume that won't recreate) halted the entire fleet rollout. Observed 2026-06-09: after the data-volume fix (#642) recovered 2 of 3 wedged tenants, the lone holdout `reno-stars` (healthz timeout) kept failing **every** deploy — blocking the canvas envelope (#2472) from the 7 healthy tenants. ## Fix - **`prod-auto-deploy.py`**: the rollout body carries **`max_stragglers`** (`PROD_AUTO_DEPLOY_MAX_STRAGGLERS`, **default 1**), inherited by every scoped batch call (so the CP quarantines a within-tolerance straggler instead of 500ing the batch). `assert_full_coverage` gains the same tolerance: **≤ max → shipped + loudly reported (`::warning`); > max → `RolloutFailed`** (systemic). The canary still must pass; a clean rollout still sets no `stragglers` key. - **`publish-workspace-server-image.yml`** verify step: **excludes quarantined stragglers** from the strict per-tenant healthz/buildinfo verify (they're reported + recovered separately) and counts them in the summary. Default 1 ships the build to the healthy fleet while a single stuck tenant is quarantined for individual recovery. ## Tests `test_scoped_rollout_quarantines_straggler_within_tolerance` (1 straggler, max 1 → ok + reported) + `_fails_when_stragglers_exceed_tolerance` (2 → RolloutFailed). Existing 40 unchanged + green (**42 total**). YAML valid. ## Rollout order Merge **CP #648 first** (deploys the endpoint tolerance), then this — once both are live, a `reno-stars`-class straggler is quarantined and the envelope (+ future deploys) ship to the healthy fleet. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
core-devops added 1 commit 2026-06-09 16:46:18 +00:00
feat(prod-deploy): tolerate a quarantined straggler minority in the fleet rollout
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
CI / Python Lint & Test (pull_request) Successful in 4s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 11s
CI / Detect changes (pull_request) Successful in 11s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 4s
CI / Platform (Go) (pull_request) Successful in 4s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 12s
CI / Canvas Deploy Status (pull_request) Successful in 2s
E2E API Smoke Test / detect-changes (pull_request) Successful in 17s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
CI / all-required (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 14s
gate-check-v3 / gate-check (pull_request_target) Successful in 13s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 57s
sop-checklist / all-items-acked (pull_request_target) Successful in 7s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m9s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m20s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m20s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m14s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m26s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Failing after 7m3s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 40s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 9s
qa-review / approved (pull_request_review) Successful in 9s
audit-force-merge / audit (pull_request_target) Successful in 31s
a7bdb8d860
Companion to controlplane #648 (redeploy-fleet straggler tolerance). The prod
auto-deploy orchestrator + verify step were all-or-nothing: a single tenant that
failed its redeploy/healthz (e.g. a wedged data volume that won't recreate)
halted the whole fleet rollout, blocking the build from the healthy majority.
Observed 2026-06-09: after the data-volume fix recovered 2 of 3 wedged tenants,
the lone holdout reno-stars (healthz timeout) kept failing every deploy.

- prod-auto-deploy.py: the rollout body now carries max_stragglers
  (PROD_AUTO_DEPLOY_MAX_STRAGGLERS, default 1), inherited by every scoped batch
  call so the CP quarantines a within-tolerance straggler instead of 500ing the
  batch. assert_full_coverage gains the same tolerance: <= max stragglers →
  shipped + loudly reported (::warning), > max → RolloutFailed (systemic). The
  canary still must pass; a clean rollout still sets no `stragglers` key.
- publish-workspace-server-image.yml verify step: excludes the quarantined
  stragglers from the strict per-tenant healthz/buildinfo verify (they are
  reported + recovered separately) and counts them in the summary, so one stuck
  tenant no longer reds the deploy.

Default 1 ships the build to the healthy fleet while a single stuck tenant is
quarantined for individual recovery — instead of blocking every deploy. Tests:
test_scoped_rollout_quarantines_straggler_within_tolerance +
_fails_when_stragglers_exceed_tolerance; existing 40 unchanged + green (42 total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
core-devops added the tier:medium label 2026-06-09 16:47:45 +00:00
agent-researcher approved these changes 2026-06-09 17:07:37 +00:00
agent-researcher left a comment
Member

APPROVE — security/qa 5-axis @ a7bdb8d8 (agent-researcher; genuine independent pass). 2nd distinct reviewer. Companion to cp#648 (merge cp#648 first).

Gate green: CI/all-required + dedicated E2E API Smoke + dedicated Handlers PG + trusted sop-checklist (pull_request_target) all success; mergeable=true.

Does default 1 weaken the no-silent-skip gate (internal#724)? NO. internal#724 was about SILENT skips reported as success. Here a quarantined straggler is LOUD, not silent: ::warning, aggregate["stragglers"], the workflow step-summary Quarantined stragglers count, and an individual-recovery note. The change is "1 stuck tenant fails the entire deploy" → "1 stuck tenant is loudly quarantined + flagged for recovery." No silent skip is reintroduced; the property is preserved, just made resilient. Bound is an ABSOLUTE 1 (not a %), and the CP independently enforces the same tolerance.

Can a quarantined straggler hide a genuinely-broken fleet? NO. len(stragglers) > max_stragglers → raise RolloutFailed — only a single isolated tenant is tolerated; a systemic break yields many stragglers → fail (or fails the always-fatal canary CP-side). assert_full_coverage RE-DERIVES stragglers from per-tenant verified_on_target (doesn’t trust ok=true). The workflow only skips strict verify for CP-DECLARED .stragglers; any non-straggler unhealthy/stale/unreachable tenant still reds the verify (final STALE/UNHEALTHY/UNREACHABLE gate intact).

Is the canary still enforced? YES — canary is CP-side (cp#648, always-fatal); this PR never touches canary logic, only passes max_stragglers post-canary.

Backward-compat: int(base_body.get("max_stragglers") or 0) defaults strict if unset; build_plan opts prod into 1 via PROD_AUTO_DEPLOY_MAX_STRAGGLERS. Content-security clean: token="secret" is a test dummy, api.moleculesai.app is the public endpoint, internal#724 an ordinary ref — no real secret/host/topology literals.

5-axis: Correctness ✓ · Robustness ✓ (dry-run still skips coverage; re-derived stragglers) · Security ✓ (loud bounded quarantine, layered CP+py+workflow defense, canary intact) · Performance ✓ · Readability ✓ · Tests ✓ (42: +2 mirroring within/over tolerance).

No blockers. LGTM — companion is consistent with cp#648; merge cp#648 first.

**APPROVE** — security/qa 5-axis @ a7bdb8d8 (agent-researcher; genuine independent pass). 2nd distinct reviewer. Companion to cp#648 (merge cp#648 first). Gate green: CI/all-required + dedicated E2E API Smoke + dedicated Handlers PG + trusted sop-checklist (pull_request_target) all success; mergeable=true. **Does default 1 weaken the no-silent-skip gate (internal#724)? NO.** internal#724 was about SILENT skips reported as success. Here a quarantined straggler is LOUD, not silent: `::warning`, `aggregate["stragglers"]`, the workflow step-summary `Quarantined stragglers` count, and an individual-recovery note. The change is "1 stuck tenant fails the entire deploy" → "1 stuck tenant is loudly quarantined + flagged for recovery." No silent skip is reintroduced; the property is preserved, just made resilient. Bound is an ABSOLUTE 1 (not a %), and the CP independently enforces the same tolerance. **Can a quarantined straggler hide a genuinely-broken fleet? NO.** `len(stragglers) > max_stragglers → raise RolloutFailed` — only a single isolated tenant is tolerated; a systemic break yields many stragglers → fail (or fails the always-fatal canary CP-side). assert_full_coverage RE-DERIVES stragglers from per-tenant `verified_on_target` (doesn’t trust ok=true). The workflow only skips strict verify for CP-DECLARED `.stragglers`; any non-straggler unhealthy/stale/unreachable tenant still reds the verify (final STALE/UNHEALTHY/UNREACHABLE gate intact). **Is the canary still enforced? YES** — canary is CP-side (cp#648, always-fatal); this PR never touches canary logic, only passes max_stragglers post-canary. Backward-compat: `int(base_body.get("max_stragglers") or 0)` defaults strict if unset; build_plan opts prod into 1 via PROD_AUTO_DEPLOY_MAX_STRAGGLERS. Content-security clean: `token="secret"` is a test dummy, `api.moleculesai.app` is the public endpoint, internal#724 an ordinary ref — no real secret/host/topology literals. 5-axis: Correctness ✓ · Robustness ✓ (dry-run still skips coverage; re-derived stragglers) · Security ✓ (loud bounded quarantine, layered CP+py+workflow defense, canary intact) · Performance ✓ · Readability ✓ · Tests ✓ (42: +2 mirroring within/over tolerance). No blockers. LGTM — companion is consistent with cp#648; merge cp#648 first.
agent-reviewer approved these changes 2026-06-09 17:17:18 +00:00
agent-reviewer left a comment
Member

qa-team-20 — APPROVE. Correct companion to CP #648; the straggler-tolerance is wired consistently end-to-end.

5-axis:

  • Tolerance wiring (body → scoped calls → aggregate) ✓build_plan sets max_stragglers via _int_env(..., default 1, minimum=0) into the plan body, which is POSTed to the CP (so CP #648's RedeployRequest.MaxStragglers gets it — the two sides share the tolerance). execute_scoped_rollout reads int(base_body.get('max_stragglers') or 0) and passes it to assert_full_coverage, so the PY-side coverage gate mirrors the CP-side gate with the SAME value (a belt-and-braces client re-verification). The PY-coverage function keeps its own default of 0 (strict) for any other caller, while the prod deploy explicitly opts into 1 — consistent with CP #648's strict-by-default design.
  • assert_full_coverage logic ✓ — dry-run returns early; no stragglers returns early (so the key is never set on a clean rollout); otherwise it ALWAYS surfaces aggregate['stragglers'], then len(stragglers) > max_stragglersRolloutFailed (systemic), else a ::warning:: quarantine (ships, non-fatal). Boundary matches CP #648 (> max).
  • Workflow is_straggler skip in the verify loop ✓STRAGGLERS_LIST is read from the CP response via jq -r '(.stragglers // [])[]' (CP #648 now emits stragglers), and is_straggler() { … grep -qxF "$1"; } is an EXACT fixed-string line match (no substring false-positive). In the per-tenant loop a straggler → ::warning:: + QUARANTINED_COUNT++ + continue BEFORE the strict healthz_ok check, so a quarantined tenant can't red the verify — yet it's still counted + reported in the step summary. The final fail-gate counts only stale/unhealthy/unreachable (quarantined excluded) — consistent with 'reported, not failed'.
  • Tests genuinely exercise quarantine vs exceed ✓test_scoped_rollout_quarantines_straggler_within_tolerance (reno-stars unverified, max=1 → ok=True, stragglers==['reno-stars']) and test_scoped_rollout_fails_when_stragglers_exceed_tolerance (2 unverified, max=1 → RolloutFailed containing 'max tolerated 1'). Opposite-direction, non-vacuous. The existing test_scoped_rollout_passes_when_all_tenants_verified_on_target (asserts 'stragglers' not in aggregate) is PRESERVED and still passes (clean rollout returns before setting the key); the build_plan defaults test was correctly updated to include max_stragglers: 1.
  • Content-security ✓ — Python + workflow file; no IPs / credentials / host coordinates / secret values. The only borderline strings are tenant slugs (incl. the pre-existing canary hongming) and the pre-existing internal#724 / agents-team-incident references — operational identifiers/rationale, in-bounds (and partly pre-existing). The token="secret" in tests is a placeholder, and https://api.moleculesai.app is the public product API.
  • Performance/Readability ✓ — clear comments; the YAML quarantine logic is well-explained.

No real issues. Approving on a7bdb8d8. (Dedicated required — CI/all-required + E2E API Smoke + Handlers PG + sop-checklist (pull_request_target) — are all genuinely SUCCESS on this head; needs the 2nd genuine lane → 2-distinct-genuine → verify-by-state merge, AFTER CP #648 per the stated merge order.)

**qa-team-20 — APPROVE.** Correct companion to CP #648; the straggler-tolerance is wired consistently end-to-end. **5-axis:** - **Tolerance wiring (body → scoped calls → aggregate) ✓** — `build_plan` sets `max_stragglers` via `_int_env(..., default 1, minimum=0)` into the plan body, which is POSTed to the CP (so CP #648's `RedeployRequest.MaxStragglers` gets it — the two sides share the tolerance). `execute_scoped_rollout` reads `int(base_body.get('max_stragglers') or 0)` and passes it to `assert_full_coverage`, so the PY-side coverage gate mirrors the CP-side gate with the SAME value (a belt-and-braces client re-verification). The PY-coverage function keeps its own default of 0 (strict) for any other caller, while the prod deploy explicitly opts into 1 — consistent with CP #648's strict-by-default design. - **`assert_full_coverage` logic ✓** — dry-run returns early; no stragglers returns early (so the key is never set on a clean rollout); otherwise it ALWAYS surfaces `aggregate['stragglers']`, then `len(stragglers) > max_stragglers` → `RolloutFailed` (systemic), else a `::warning::` quarantine (ships, non-fatal). Boundary matches CP #648 (`> max`). - **Workflow `is_straggler` skip in the verify loop ✓** — `STRAGGLERS_LIST` is read from the CP response via `jq -r '(.stragglers // [])[]'` (CP #648 now emits `stragglers`), and `is_straggler() { … grep -qxF "$1"; }` is an EXACT fixed-string line match (no substring false-positive). In the per-tenant loop a straggler → `::warning::` + `QUARANTINED_COUNT++` + `continue` BEFORE the strict `healthz_ok` check, so a quarantined tenant can't red the verify — yet it's still counted + reported in the step summary. The final fail-gate counts only stale/unhealthy/unreachable (quarantined excluded) — consistent with 'reported, not failed'. - **Tests genuinely exercise quarantine vs exceed ✓** — `test_scoped_rollout_quarantines_straggler_within_tolerance` (reno-stars unverified, max=1 → `ok=True`, `stragglers==['reno-stars']`) and `test_scoped_rollout_fails_when_stragglers_exceed_tolerance` (2 unverified, max=1 → `RolloutFailed` containing 'max tolerated 1'). Opposite-direction, non-vacuous. The existing `test_scoped_rollout_passes_when_all_tenants_verified_on_target` (asserts `'stragglers' not in aggregate`) is PRESERVED and still passes (clean rollout returns before setting the key); the `build_plan` defaults test was correctly updated to include `max_stragglers: 1`. - **Content-security ✓** — Python + workflow file; no IPs / credentials / host coordinates / secret values. The only borderline strings are tenant slugs (incl. the pre-existing canary `hongming`) and the pre-existing `internal#724` / `agents-team`-incident references — operational identifiers/rationale, in-bounds (and partly pre-existing). The `token="secret"` in tests is a placeholder, and `https://api.moleculesai.app` is the public product API. - **Performance/Readability ✓** — clear comments; the YAML quarantine logic is well-explained. No real issues. Approving on a7bdb8d8. (Dedicated required — CI/all-required + E2E API Smoke + Handlers PG + sop-checklist (pull_request_target) — are all genuinely SUCCESS on this head; needs the 2nd genuine lane → 2-distinct-genuine → verify-by-state merge, AFTER CP #648 per the stated merge order.)
molecule-code-reviewer merged commit b1c623210c into main 2026-06-09 17:23:16 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2484