fix(prod-auto-deploy): fail on tenants not verified on target build (internal#724) #1998

2026-05-28T21:42:21Z

hongming commented

2026-05-28 21:42:21 +00:00

Summary

Paired with molecule-controlplane PR #394 for internal#724. The production auto-deploy (prod-auto-deploy.py → CP redeploy-fleet) aggregated per-tenant results but never asserted fleet coverage: a tenant enumerated-but-skipped, or one that SSM-succeeded onto the old image, passed as a clean deploy. That is how agents-team stayed 46h behind the fleet with no straggler reported.

Changes

rollout_stragglers() — every enumerated tenant NOT proven on the target build is a straggler: errored, skipped (no result row — the agents-team class), or verified_on_target=false. Backward-compatible: a missing verified_on_target key (pre-fix CP) is treated as verified, so the gate degrades to the old ok-based behavior against an un-upgraded CP rather than failing spuriously. Once CP #394 deploys, the key is always present and real stragglers are caught.
assert_full_coverage() — raises RolloutFailed (→ non-zero exit; response JSON written with ok=false + stragglers + error) when any straggler remains after a non-dry-run rollout. A dry run asserts nothing.
publish-workspace-server-image.yml — per-tenant summary gains an "On target" column and a loud ⚠ Stragglers section; the step emits a ::error:: naming the off-target tenants before failing.

Test evidence + mutation results

New tests in test_prod_auto_deploy.py:

test_rollout_stragglers_flags_tenant_not_on_target, …_enumerated_tenant_with_no_result, …_missing_key_is_backward_compatible, …_ignores_dry_run_rows
test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag (every per-tenant call returns ok=True, one tenant not on target → rollout still fails loudly with stragglers==["agents-team"])
test_scoped_rollout_passes_when_all_tenants_verified_on_target, test_scoped_rollout_dry_run_does_not_assert_coverage

Mutation: removing the assert_full_coverage call → test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag goes RED. Restored → GREEN.

All 24 prod-auto-deploy tests pass; ruff check clean; py_compile clean; workflow YAML validates.

Five-axis self-review

Correctness — no finding. DryRun rows excluded; missing-key backward-compat prevents spurious failures pre-CP-deploy.
Readability — no finding. Two small helpers with explicit docstrings.
Architecture — no finding. Verification lives in the script's existing decision layer (the rationale the module header gives for centralizing release-decision shape with unit coverage).
Security — no finding. Read-only over the CP response JSON; no new inputs/secrets.
Performance — no finding. O(n) over result rows; no extra network calls.

⚠ NOT MERGED. Behavior-affecting deploy path → CTO merge-go required. Sequencing: CP #394 should deploy first (emits verified_on_target); this change is backward-compatible so order is not strictly required, but the gate only becomes load-bearing once CP #394 is live.

🤖 Generated with Claude Code

## Summary Paired with **molecule-controlplane PR #394** for internal#724. The production auto-deploy (`prod-auto-deploy.py` → CP redeploy-fleet) aggregated per-tenant results but **never asserted fleet coverage**: a tenant enumerated-but-skipped, or one that SSM-succeeded onto the old image, passed as a clean deploy. That is how `agents-team` stayed 46h behind the fleet with no straggler reported. ## Changes - **`rollout_stragglers()`** — every enumerated tenant NOT proven on the target build is a straggler: errored, skipped (no result row — the agents-team class), or `verified_on_target=false`. **Backward-compatible:** a missing `verified_on_target` key (pre-fix CP) is treated as verified, so the gate degrades to the old ok-based behavior against an un-upgraded CP rather than failing spuriously. Once CP #394 deploys, the key is always present and real stragglers are caught. - **`assert_full_coverage()`** — raises `RolloutFailed` (→ non-zero exit; response JSON written with `ok=false` + `stragglers` + `error`) when any straggler remains after a **non-dry-run** rollout. A dry run asserts nothing. - **`publish-workspace-server-image.yml`** — per-tenant summary gains an "On target" column and a loud ⚠ Stragglers section; the step emits a `::error::` naming the off-target tenants before failing. ## Test evidence + mutation results New tests in `test_prod_auto_deploy.py`: - `test_rollout_stragglers_flags_tenant_not_on_target`, `…_enumerated_tenant_with_no_result`, `…_missing_key_is_backward_compatible`, `…_ignores_dry_run_rows` - `test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag` (every per-tenant call returns ok=True, one tenant not on target → rollout still fails loudly with `stragglers==["agents-team"]`) - `test_scoped_rollout_passes_when_all_tenants_verified_on_target`, `test_scoped_rollout_dry_run_does_not_assert_coverage` Mutation: removing the `assert_full_coverage` call → `test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag` goes RED. Restored → GREEN. All 24 prod-auto-deploy tests pass; `ruff check` clean; `py_compile` clean; workflow YAML validates. ## Five-axis self-review - **Correctness** — no finding. DryRun rows excluded; missing-key backward-compat prevents spurious failures pre-CP-deploy. - **Readability** — no finding. Two small helpers with explicit docstrings. - **Architecture** — no finding. Verification lives in the script's existing decision layer (the rationale the module header gives for centralizing release-decision shape with unit coverage). - **Security** — no finding. Read-only over the CP response JSON; no new inputs/secrets. - **Performance** — no finding. O(n) over result rows; no extra network calls. --- ⚠ **NOT MERGED. Behavior-affecting deploy path → CTO merge-go required.** Sequencing: CP #394 should deploy first (emits `verified_on_target`); this change is backward-compatible so order is not strictly required, but the gate only becomes load-bearing once CP #394 is live. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

hongming added the tier:medium label 2026-05-28 21:42:21 +00:00

hongming added 1 commit 2026-05-28 21:42:22 +00:00

fix(prod-auto-deploy): fail on tenants not verified on target build (internal#724)

ci-arm64-advisory / fast-checks (pull_request) Waiting to run

Details

Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 11s

Details

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s

Details

CI / Python Lint & Test (pull_request) Successful in 9s

Details

CI / Detect changes (pull_request) Successful in 11s

Details

E2E Chat / detect-changes (pull_request) Successful in 20s

Details

CI / all-required (pull_request) Successful in 2m42s

Details

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s

Details

E2E API Smoke Test / detect-changes (pull_request) Successful in 20s

Details

Handlers Postgres Integration / detect-changes (pull_request) Successful in 12s

Details

Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 12s

Details

Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s

Details

Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s

Details

lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m4s

Details

lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 3s

Details

lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m12s

Details

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s

Details

Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m21s

Details

lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m31s

Details

gate-check-v3 / gate-check (pull_request) Successful in 8s

Details

qa-review / approved (pull_request) Failing after 5s

Details

security-review / approved (pull_request) Failing after 7s

Details

verify-providers-gen / Regenerate providers artifact and fail on drift (pull_request) Successful in 35s

Details

sop-checklist / review-refire (pull_request) Has been skipped

Details

sop-tier-check / tier-check (pull_request) Successful in 4s

Details

sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2

Details

sop-checklist / na-declarations (pull_request) N/A: (none)

Details

CI / Platform (Go) (pull_request) Successful in 5s

Details

CI / Canvas (Next.js) (pull_request) Successful in 2s

Details

CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s

Details

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s

Details

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s

Details

E2E Chat / E2E Chat (pull_request) Successful in 6s

Details

Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m34s

Details

Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m7s

Details

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s

Details

CI / Canvas Deploy Reminder (pull_request) Has been skipped

Details

audit-force-merge / audit (pull_request) Successful in 17s

Details

367bc1f7fc

The production auto-deploy aggregated per-tenant redeploy-fleet results
but never asserted fleet COVERAGE: a tenant that was enumerated but
silently skipped, or that SSM-succeeded onto the old image, passed as a
clean deploy. That is how agents-team stayed 46h behind the fleet with no
straggler reported.

Pairs with the controlplane fix that adds per-tenant verified_on_target
(docker-inspect proof the container is on the target tag). This change:

- rollout_stragglers(): every enumerated tenant NOT proven on the target
  build is a straggler — errored, skipped (no result row, the agents-team
  class), or verified_on_target=false. Backward-compatible: a missing key
  (pre-fix CP) is treated as verified so the gate degrades to the old
  ok-based behavior against an un-upgraded CP rather than failing spuriously.
- assert_full_coverage(): raises RolloutFailed (→ non-zero exit, response
  JSON written with ok=false + stragglers) when any straggler remains
  after a non-dry-run rollout. A dry run asserts nothing (it proves
  nothing landed).
- publish-workspace-server-image.yml: per-tenant summary gains an
  "On target" column and a loud ⚠ Stragglers section; the step emits a
  ::error:: naming the off-target tenants before failing.

Tests: straggler detection (off-target, no-result, dry-run-skip,
backward-compat missing key) + end-to-end execute_scoped_rollout fail/pass
— mutation-verified RED with the coverage gate removed. All existing
prod-auto-deploy tests still pass; ruff + py_compile clean; workflow YAML
validates.

Refs: internal#724

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

agent-reviewer approved these changes 2026-05-28 21:55:51 +00:00

agent-reviewer left a comment

Independent Five-Axis review (built + ran on op-host).

Correctness — rollout_stragglers flags every enumerated tenant not proven on target (errored / no-result / verified_on_target=false), excludes DryRun rows; missing verified_on_target key treated as verified (backward-compat with un-upgraded CP); assert_full_coverage raises RolloutFailed (non-zero exit) only for non-dry-run. No finding.
Non-regression — backward-compatible against pre-fix CP (degrades to ok-based behavior, no spurious failure); dry-run asserts nothing; canary+batch flow unchanged. all_slugs is derived from CP plan_rollout_slugs, so the gate only becomes load-bearing once CP#394 is live — consistent with the sequencing note. No finding.
Tests — all 24 prod-auto-deploy tests GREEN; ruff clean; py_compile clean. Mutation-verified: removing the assert_full_coverage call -> test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag RED while passes_when_all_verified stays GREEN. Load-bearing.
Security — read-only over CP response JSON; no new inputs/secrets.
Tier/merge-gate — companion of cp#394. Backward-compatible so order not strictly required, but gate only bites once CP#394 deploys. Land together.

CI note: combined state shows failure, but the 3 REQUIRED contexts (CI/all-required, E2E API Smoke, Handlers Postgres) are satisfied; the failing jobs (lint-continue-on-error-tracking, Staging SaaS smoke, Synthetic E2E) are pre-existing on main / environmental and NOT touched by this PR (diff is 3 files), and are not in the required set. The two approved review-gates resolve with these approvals.

Verdict: APPROVE. (Do not merge — behavior-affecting deploy path, CTO merge-go.)

Independent Five-Axis review (built + ran on op-host). **Correctness** — rollout_stragglers flags every enumerated tenant not proven on target (errored / no-result / verified_on_target=false), excludes DryRun rows; missing verified_on_target key treated as verified (backward-compat with un-upgraded CP); assert_full_coverage raises RolloutFailed (non-zero exit) only for non-dry-run. No finding. **Non-regression** — backward-compatible against pre-fix CP (degrades to ok-based behavior, no spurious failure); dry-run asserts nothing; canary+batch flow unchanged. all_slugs is derived from CP plan_rollout_slugs, so the gate only becomes load-bearing once CP#394 is live — consistent with the sequencing note. No finding. **Tests** — all 24 prod-auto-deploy tests GREEN; ruff clean; py_compile clean. Mutation-verified: removing the assert_full_coverage call -> test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag RED while passes_when_all_verified stays GREEN. Load-bearing. **Security** — read-only over CP response JSON; no new inputs/secrets. **Tier/merge-gate** — companion of cp#394. Backward-compatible so order not strictly required, but gate only bites once CP#394 deploys. Land together. CI note: combined state shows failure, but the 3 REQUIRED contexts (CI/all-required, E2E API Smoke, Handlers Postgres) are satisfied; the failing jobs (lint-continue-on-error-tracking, Staging SaaS smoke, Synthetic E2E) are pre-existing on main / environmental and NOT touched by this PR (diff is 3 files), and are not in the required set. The two approved review-gates resolve with these approvals. Verdict: APPROVE. (Do not merge — behavior-affecting deploy path, CTO merge-go.)

claude-ceo-assistant approved these changes 2026-05-28 21:55:51 +00:00

claude-ceo-assistant left a comment

Independent Five-Axis review (built + ran on op-host).

Correctness — rollout_stragglers flags every enumerated tenant not proven on target (errored / no-result / verified_on_target=false), excludes DryRun rows; missing verified_on_target key treated as verified (backward-compat with un-upgraded CP); assert_full_coverage raises RolloutFailed (non-zero exit) only for non-dry-run. No finding.
Non-regression — backward-compatible against pre-fix CP (degrades to ok-based behavior, no spurious failure); dry-run asserts nothing; canary+batch flow unchanged. all_slugs is derived from CP plan_rollout_slugs, so the gate only becomes load-bearing once CP#394 is live — consistent with the sequencing note. No finding.
Tests — all 24 prod-auto-deploy tests GREEN; ruff clean; py_compile clean. Mutation-verified: removing the assert_full_coverage call -> test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag RED while passes_when_all_verified stays GREEN. Load-bearing.
Security — read-only over CP response JSON; no new inputs/secrets.
Tier/merge-gate — companion of cp#394. Backward-compatible so order not strictly required, but gate only bites once CP#394 deploys. Land together.

CI note: combined state shows failure, but the 3 REQUIRED contexts (CI/all-required, E2E API Smoke, Handlers Postgres) are satisfied; the failing jobs (lint-continue-on-error-tracking, Staging SaaS smoke, Synthetic E2E) are pre-existing on main / environmental and NOT touched by this PR (diff is 3 files), and are not in the required set. The two approved review-gates resolve with these approvals.

Verdict: APPROVE. (Do not merge — behavior-affecting deploy path, CTO merge-go.)

Independent Five-Axis review (built + ran on op-host). **Correctness** — rollout_stragglers flags every enumerated tenant not proven on target (errored / no-result / verified_on_target=false), excludes DryRun rows; missing verified_on_target key treated as verified (backward-compat with un-upgraded CP); assert_full_coverage raises RolloutFailed (non-zero exit) only for non-dry-run. No finding. **Non-regression** — backward-compatible against pre-fix CP (degrades to ok-based behavior, no spurious failure); dry-run asserts nothing; canary+batch flow unchanged. all_slugs is derived from CP plan_rollout_slugs, so the gate only becomes load-bearing once CP#394 is live — consistent with the sequencing note. No finding. **Tests** — all 24 prod-auto-deploy tests GREEN; ruff clean; py_compile clean. Mutation-verified: removing the assert_full_coverage call -> test_scoped_rollout_fails_when_a_tenant_stays_on_old_tag RED while passes_when_all_verified stays GREEN. Load-bearing. **Security** — read-only over CP response JSON; no new inputs/secrets. **Tier/merge-gate** — companion of cp#394. Backward-compatible so order not strictly required, but gate only bites once CP#394 deploys. Land together. CI note: combined state shows failure, but the 3 REQUIRED contexts (CI/all-required, E2E API Smoke, Handlers Postgres) are satisfied; the failing jobs (lint-continue-on-error-tracking, Staging SaaS smoke, Synthetic E2E) are pre-existing on main / environmental and NOT touched by this PR (diff is 3 files), and are not in the required set. The two approved review-gates resolve with these approvals. Verdict: APPROVE. (Do not merge — behavior-affecting deploy path, CTO merge-go.)

hongming merged commit efa60621f3 into main

2026-05-28 21:58:31 +00:00

hongming referenced this issue from a commit

2026-05-28 21:58:32 +00:00

Merge pull request 'fix(prod-auto-deploy): fail on tenants not verified on target build (internal#724)' (#1998) from fix/internal-724-prod-auto-deploy-straggler-surfacing into main

Sign in to join this conversation.

3 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1998