fix(prod-auto-deploy): fail on tenants not verified on target build (internal#724) #1998
Summary
Paired with molecule-controlplane PR #394 for internal#724. The production auto-deploy (prod-auto-deploy.py → CP redeploy-fleet) aggregated per-tenant results but never asserted fleet coverage: a tenant that was enumerated but skipped, or one whose SSM step succeeded but left it on the old image, passed as a clean deploy. That is how agents-team stayed 46h behind the fleet with no straggler reported.
Changes
rollout_stragglers(): every enumerated tenant NOT proven on the target build is a straggler: errored, skipped (no result row, the agents-team class), or verified_on_target=false. Backward-compatible: a missing verified_on_target key (pre-fix CP) is treated as verified, so the gate degrades to the old ok-based behavior against an un-upgraded CP rather than failing spuriously. Once CP #394 deploys, the key is always present and real stragglers are caught.
assert_full_coverage(): raises RolloutFailed (→ non-zero exit; response JSON written with ok=false + stragglers + error) when any straggler remains after a non-dry-run rollout. A dry run asserts nothing.
publish-workspace-server-image.yml: the per-tenant summary gains an "On target" column and a loud ⚠ Stragglers section; the step emits a ::error:: naming the off-target tenants before failing.
Test evidence + mutation results
New tests in test_prod_auto_deploy.py:
test_rollout_stragglers_flags_tenant_not_on_target, …_enumerated_tenant_with_no_result, …_missing_key_is_backward_compatible, …_ignores_dry_run_rows
test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag (every per-tenant call returns ok=True, one tenant not on target → rollout still fails loudly with stragglers==["agents-team"])
test_scoped_rollout_passes_when_all_tenants_verified_on_target, test_scoped_rollout_dry_run_does_not_assert_coverage
Mutation: removing the assert_full_coverage call → test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag goes RED. Restored → GREEN.
All 24 prod-auto-deploy tests pass; ruff check clean; py_compile clean; workflow YAML validates.
Five-axis self-review
⚠ NOT MERGED. Behavior-affecting deploy path → CTO merge-go required. Sequencing: CP #394 should deploy first (it emits verified_on_target); this change is backward-compatible, so the order is not strictly required, but the gate only becomes load-bearing once CP #394 is live.
🤖 Generated with Claude Code
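For readers without the diff open, the coverage gate described under Changes can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the row shape (slug, ok, dry_run), the function signatures, and the stragglers attribute on the exception are assumptions made for the example. Only the names rollout_stragglers, assert_full_coverage, RolloutFailed and verified_on_target come from the description. Letting a dry-run row pass without being flagged is likewise one possible reading of "ignores dry-run rows".

```python
from __future__ import annotations

from typing import Any


class RolloutFailed(Exception):
    """Raised when a non-dry-run rollout leaves tenants off the target build."""

    def __init__(self, stragglers: list[str]) -> None:
        super().__init__(
            "tenants not verified on target build: " + ", ".join(stragglers)
        )
        # Hypothetical attribute so a caller could write it into the response JSON.
        self.stragglers = stragglers


def rollout_stragglers(
    all_slugs: list[str], results: list[dict[str, Any]]
) -> list[str]:
    """Return every enumerated tenant that is not proven on the target build."""
    # Assumed row shape: {"slug": str, "ok": bool, "dry_run": bool,
    # "verified_on_target": bool}; all keys except "slug" are optional here.
    by_slug = {row["slug"]: row for row in results}
    stragglers: list[str] = []
    for slug in all_slugs:
        row = by_slug.get(slug)
        if row is None:
            # Enumerated but never deployed: no result row at all.
            stragglers.append(slug)
        elif row.get("dry_run", False):
            # Assumption: dry-run rows prove nothing and are not flagged.
            continue
        elif not row.get("ok", False):
            # The per-tenant call itself reported an error.
            stragglers.append(slug)
        elif not row.get("verified_on_target", True):
            # A missing key defaults to True, so a control plane that does not
            # emit the key yet degrades to the old ok-based behavior.
            stragglers.append(slug)
    return stragglers


def assert_full_coverage(
    all_slugs: list[str], results: list[dict[str, Any]], dry_run: bool
) -> None:
    """Raise RolloutFailed if any tenant is off target after a real rollout."""
    if dry_run:
        return  # a dry run asserts nothing
    stragglers = rollout_stragglers(all_slugs, results)
    if stragglers:
        raise RolloutFailed(stragglers)
```

Under these assumptions, a fleet where every per-tenant call returns ok=True but agents-team reports verified_on_target=False would raise RolloutFailed naming agents-team, while the same rows with the key absent would pass, matching the backward-compatibility note above.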
Independent Five-Axis review (built + ran on op-host).
Correctness — rollout_stragglers flags every enumerated tenant not proven on target (errored / no-result / verified_on_target=false), excludes DryRun rows; missing verified_on_target key treated as verified (backward-compat with un-upgraded CP); assert_full_coverage raises RolloutFailed (non-zero exit) only for non-dry-run. No finding.
Non-regression — backward-compatible against pre-fix CP (degrades to ok-based behavior, no spurious failure); dry-run asserts nothing; canary+batch flow unchanged. all_slugs is derived from CP plan_rollout_slugs, so the gate only becomes load-bearing once CP#394 is live — consistent with the sequencing note. No finding.
Tests — all 24 prod-auto-deploy tests GREEN; ruff clean; py_compile clean. Mutation-verified: removing the assert_full_coverage call -> test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag RED while passes_when_all_verified stays GREEN. Load-bearing.
Security — read-only over CP response JSON; no new inputs/secrets.
Tier/merge-gate — companion of cp#394. Backward-compatible so order not strictly required, but gate only bites once CP#394 deploys. Land together.
CI note: combined state shows failure, but the 3 REQUIRED contexts (CI/all-required, E2E API Smoke, Handlers Postgres) are satisfied; the failing jobs (lint-continue-on-error-tracking, Staging SaaS smoke, Synthetic E2E) are pre-existing on main / environmental and NOT touched by this PR (diff is 3 files), and are not in the required set. The two approved review-gates resolve with these approvals.
Verdict: APPROVE. (Do not merge — behavior-affecting deploy path, CTO merge-go.)