fix(ci): add confirm:true to redeploy-fleet callers (cp#228 contract) #1595
Summary
Follow-up to molecule-controlplane#228 (merge SHA `44317b0`, task #308). CP `/cp/admin/tenants/redeploy-fleet` now requires explicit ack of fleet-wide intent: an empty body / `{confirm:false}` / `{only_slugs:[]}` → 400 instead of silently rolling every live tenant.

All 4 molecule-core callers of `redeploy-fleet` are fleet-wide intent (canary + fan-out, no slug scoping), so this PR adds `confirm: true` to each request body. Without this PR, the next invocation of any of these workflows after the new CP image becomes live will 400 and break prod auto-deploy.

Callers updated
| Workflow | Trigger | Body built by | Change |
| --- | --- | --- | --- |
| `.gitea/workflows/publish-workspace-server-image.yml` | push:main (prod auto-deploy) | `.gitea/scripts/prod-auto-deploy.py plan` | `confirm: True` in `prod-auto-deploy.py` body dict |
| `.gitea/workflows/redeploy-tenants-on-main.yml` | | `jq -n` | `confirm: true` in inline body |
| `.gitea/workflows/redeploy-tenants-on-staging.yml` | `workflow_dispatch` | `jq -n` | `confirm: true` in inline body |
| `.gitea/workflows/staging-verify.yml` | `:latest` promotion to prod | `jq -n` | `confirm: true` in inline body |

Survey was exhaustive: `grep -rn 'redeploy-fleet' .gitea/workflows/ .github/workflows/ scripts/ tools/` shows the 4 callers above plus comment-only references in `sweep-stale-e2e-orgs.yml` (both `.gitea/` and `.github/` copies) and a doc-comment in `scripts/staging-smoke.sh`. No other callers need updating. No caller in molecule-core scopes by slug; per-tenant ops happen via the manual CP admin curl, not these workflows.

Test plan
- `python3 -m pytest .gitea/scripts/tests/test_prod_auto_deploy.py -q` → 10 passed
- `python3 -m pytest .gitea/scripts/tests/ -q` → 136 passed
- `python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows` → 59 files, 0 fatal
- `python3 .gitea/scripts/lint-curl-status-capture.py` → clean
- `test_build_plan_always_sets_confirm_true_for_fleet_intent` pins that operator-overridable knobs (soak, batch, dry_run, canary slug) do NOT drop the ack.
- `TestRedeployFleet_ConfirmTrueProceeds` (`target_tag` + `confirm:true`). Pairs with `TestRedeployFleet_EmptyBodyReturns400`.
- `dry_run=true`).

Risk
- `only_slugs` field). Wrong classification would redeploy unintended tenants; correct classification per workflow comment-headers and body shape (`canary_slug` + `batch_size` shape == fleet-wide).
- `feedback_image_promote_is_not_user_live`, contract is NOT live until Railway pulls the new `:latest` CP image and restarts the CP container. The 4 callers in this PR don't fire automatically until the next push:main / dispatch / staging-smoke run, so there's headroom. This PR lands before the next caller invocation.

References
- `44317b0ecf0f0143dac0964deb76797e6b383044`): contract enforcement.
- `feedback_image_promote_is_not_user_live`: image-push ≠ live (CP container restart on new pin required).

APPROVED - devops/CI five-axis review of mc#1595
Scope: Adds `confirm: true` to all 4 fleet-wide callers of CP `/cp/admin/tenants/redeploy-fleet` to satisfy the new contract landed in cp#228 (`44317b0`, task #308). Without this PR, the next invocation of any of these workflows after the new CP image becomes user-live will 400 and break the prod auto-deploy path.

Axis 1 - Correctness vs cp#228 contract:
Verified `internal/handlers/admin_redeploy.go` (cp@44317b0) line 188: `if len(req.OnlySlugs) == 0 && !req.Confirm` returns 400. With this PR every caller now sends `confirm: true` in its body (or `prod-auto-deploy.py` body dict), so the gate evaluates as `(0 == 0) && !true` = false → proceeds.

Axis 2 - Intent classification (fleet-wide vs slug-scoped):
Walked all 4 callers. None scope by `only_slugs`. Each has fleet semantics: `publish-workspace-server-image` rolls every live tenant on push:main; `redeploy-tenants-on-main` rolls the entire prod fleet with canary + fan-out; `redeploy-tenants-on-staging` rolls the entire staging fleet (staging IS the canary); `staging-verify` promotes `:latest` across the prod fleet after staging-green. Classification is correct: `confirm: true` is the right shape for all four.

Axis 3 - Survey completeness:
Ran `grep -rn 'redeploy-fleet' .gitea/workflows/ .github/workflows/ scripts/ tools/` against the branch. Only the 4 named callers actually POST to the endpoint. `sweep-stale-e2e-orgs.yml` (both `.gitea/` and `.github/` copies) and `scripts/staging-smoke.sh` reference `redeploy-fleet` in doc-comments only. Nothing else to update.

Axis 4 - Test posture:
- `test_build_plan_always_sets_confirm_true_for_fleet_intent` regression-pins the body shape against operator-knob overrides (soak/batch/dry_run/canary slug).
- `test_build_plan_defaults_to_staging_sha_target_and_prod_cp` updated to assert `confirm: True` in the default body.
- `.gitea/scripts/tests/` 136/136 pass.
- `lint-workflow-yaml.py` 59 files, 0 fatal.
- `lint-curl-status-capture.py` clean.
- `TestRedeployFleet_ConfirmTrueProceeds` (`target_tag` + `confirm:true` → 200).

Axis 5 - Risk + sequencing:
Per `feedback_image_promote_is_not_user_live`: cp#228 merged 07:48Z and publish-cp-image (run 86729) completed Success at 07:50Z, but the contract isn't user-live until Railway pulls the new `:latest` and the CP container restarts on the new pin. None of the 4 callers in this PR fire automatically until the next push:main / staging-smoke green / workflow_dispatch, so there's headroom. Worst-case blast if classification is wrong: redeploy 400s and fails fast (loud, reversible), NOT a silent fleet-wide mutation.

Devops scope clear, infra ack from core-devops (engineers team). Independent of cp#228 author (devops-engineer) and this PR's author (infra-sre).
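To make the gate arithmetic from Axis 1 easier to check by hand, here is a minimal Python sketch of the same predicate. Only the condition itself (`len(req.OnlySlugs) == 0 && !req.Confirm`) comes from this review; the helper name `rejects_without_ack` and the dict-shaped request body are illustrative assumptions, not CP code.

```python
from typing import Any, Mapping


def rejects_without_ack(body: Mapping[str, Any]) -> bool:
    """Return True when the gate would answer 400 for this request body.

    Mirrors the Go condition quoted in the review:
        len(req.OnlySlugs) == 0 && !req.Confirm
    Hypothetical helper for illustration; not part of the CP codebase.
    """
    only_slugs = body.get("only_slugs") or []
    confirm = bool(body.get("confirm", False))
    return len(only_slugs) == 0 and not confirm


# Shapes the PR summary lists as rejected.
assert rejects_without_ack({})
assert rejects_without_ack({"confirm": False})
assert rejects_without_ack({"only_slugs": []})

# Shape every caller sends after this PR: fleet-wide intent acknowledged.
# "example-tag" is a placeholder value.
assert not rejects_without_ack({"confirm": True, "target_tag": "example-tag"})

# A slug-scoped request passes the gate without an ack.
# "example-tenant" is a placeholder slug.
assert not rejects_without_ack({"only_slugs": ["example-tenant"]})
```

The sketch treats a missing `only_slugs` the same as an empty list, which matches how a zero-length Go slice behaves under `len(...) == 0`.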
APPROVED - QA five-axis review of mc#1595
Scope: Adds `confirm: true` to molecule-core's 4 fleet-wide callers of CP `/cp/admin/tenants/redeploy-fleet`, satisfying the new contract from cp#228 (`44317b0`, task #308). Same blast surface as the aa375b1f forensic (2026-05-20 empty `{}` body redeployed agents-team + chloe-dong + hongming).

Axis 1 - Contract pairing:
Walked cp@44317b0's 4 contract tests:

- `TestRedeployFleet_EmptyBodyReturns400`: bare `{}` → 400, no SSM (matches pre-PR shape, this PR removes that path).
- `TestRedeployFleet_ConfirmTrueProceeds`: `{confirm:true, target_tag}` → 200, request forwarded with Confirm=true (matches post-PR shape).
- `TestRedeployFleet_OnlySlugsProceeds`: N/A for this PR (no caller scopes by slug).
- `TestRedeployFleet_EmptyOnlySlugsWithoutConfirmReturns400`: N/A.

This PR's body shapes mirror `TestRedeployFleet_ConfirmTrueProceeds` exactly. Contract pairing is sound.

Axis 2 - Test additions:
- `test_build_plan_always_sets_confirm_true_for_fleet_intent`: regression-pin against operator-knob overrides (soak/batch/dry_run/canary slug). Asserts the ack does NOT drop when knobs are tuned. This is the right level of guard: it pins the contract pair at the python build_plan boundary.
- `test_build_plan_defaults_to_staging_sha_target_and_prod_cp` updated to expect `confirm: True` in default body (existing test, additive assertion, no other behavior change).
- `pytest .gitea/scripts/tests/`.

Axis 3 - Survey:
Independently ran `grep -rn 'redeploy-fleet' .gitea/workflows/ .github/workflows/ scripts/ tools/`. Confirmed the 4 named callers are exhaustive: `publish-workspace-server-image.yml` (via prod-auto-deploy.py), `redeploy-tenants-on-main.yml`, `redeploy-tenants-on-staging.yml`, `staging-verify.yml`. Other matches are doc-comments only (`sweep-stale-e2e-orgs.yml` x2, `scripts/staging-smoke.sh`). No missed callers.

Axis 4 - Intent classification spot-check:
Walked each workflow's body shape: `canary_slug` + `soak_seconds` + `batch_size` (fleet-wide canary + fan-out pattern). None set `only_slugs`. Classification as `confirm: true` (fleet-wide) is correct.

Axis 5 - Lint posture:
`confirm: true` literal (no `--argjson` needed; jq evaluates the literal).

Sequencing note:
Per `feedback_image_promote_is_not_user_live`: cp#228 image-push at 07:50Z is NOT the same as user-live. The 4 callers only fire on next push:main / staging-smoke / dispatch, so there's headroom for this PR to land before the next caller invocation.

QA scope clear from core-qa (qa + engineers team).
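The regression pin discussed under Axis 2 can be pictured with a small Python sketch. The field names (`target_tag`, `canary_slug`, `soak_seconds`, `batch_size`, `dry_run`, `confirm`) come from this thread, but the function `build_redeploy_body`, its signature, its defaults, and the placement of `dry_run` in the body are hypothetical stand-ins, since the actual `build_plan` code in `prod-auto-deploy.py` is not shown here.

```python
from itertools import product
from typing import Any, Dict, Optional


def build_redeploy_body(
    target_tag: str,
    canary_slug: Optional[str] = None,
    soak_seconds: int = 60,  # assumed default, for illustration only
    batch_size: int = 3,     # assumed default, for illustration only
    dry_run: bool = False,
) -> Dict[str, Any]:
    """Build a fleet-wide redeploy body; the ack is not operator-tunable."""
    body: Dict[str, Any] = {
        "target_tag": target_tag,
        "soak_seconds": soak_seconds,
        "batch_size": batch_size,
        "dry_run": dry_run,
        # Fleet-wide intent is set unconditionally so no knob can drop it.
        "confirm": True,
    }
    if canary_slug is not None:
        body["canary_slug"] = canary_slug
    return body


def check_confirm_survives_knob_overrides() -> int:
    """Assert confirm stays True across a grid of knob values.

    Returns the number of knob combinations checked.
    """
    checked = 0
    for canary, soak, batch, dry in product(
        [None, "example-canary"], [0, 300], [1, 10], [False, True]
    ):
        body = build_redeploy_body(
            "example-tag",
            canary_slug=canary,
            soak_seconds=soak,
            batch_size=batch,
            dry_run=dry,
        )
        assert body["confirm"] is True
        checked += 1
    return checked
```

Setting `confirm` inside the builder, rather than accepting it as a parameter, is one way to get the property the test is described as pinning: tuning soak, batch, dry-run, or canary values cannot remove the ack.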