infra-sre 356b0f5cac fix(ci): add confirm:true to redeploy-fleet callers (cp#228 contract)
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 10s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 10s
E2E Chat / detect-changes (pull_request) Successful in 10s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 29s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 10s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 42s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 45s
qa-review / approved (pull_request) Failing after 6s
sop-checklist / review-refire (pull_request) Has been skipped
gate-check-v3 / gate-check (pull_request) Successful in 9s
security-review / approved (pull_request) Failing after 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3s
E2E Chat / E2E Chat (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m31s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m35s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m42s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m38s
sop-checklist / all-items-acked (pull_request) [info tier:low] acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, l
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Platform (Go) (pull_request) Successful in 4m55s
CI / Canvas (Next.js) (pull_request) Successful in 6m7s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 6m24s
CI / all-required (pull_request) Successful in 6m35s
audit-force-merge / audit (pull_request) Successful in 5s
CP /cp/admin/tenants/redeploy-fleet now requires explicit
acknowledgement of fleet-wide intent (cp#228 / task #308): empty body
/ {confirm:false} / {only_slugs:[]} return 400 instead of silently
mutating every live tenant. Surfaced 2026-05-20 by aa375b1f, which
sent {} and rolled agents-team + chloe-dong + hongming all at once.
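
The contract described above can be sketched as a small decision function. This is an illustration only: the real check lives in the CP's Go handler (internal/handlers/admin_redeploy.go), the name check_redeploy_fleet_body is hypothetical, and the treatment of a non-empty only_slugs list as an accepted scoped request is an assumption not stated in this commit.

```python
from http import HTTPStatus
from typing import Optional


def check_redeploy_fleet_body(body: Optional[dict]) -> HTTPStatus:
    """Hypothetical stand-in for the CP-side check; not the real Go handler."""
    body = body or {}

    # Explicit fleet-wide acknowledgement: only the literal boolean True counts.
    if body.get("confirm") is True:
        return HTTPStatus.OK

    # ASSUMPTION: a non-empty only_slugs list is a scoped request and may proceed.
    slugs = body.get("only_slugs")
    if isinstance(slugs, list) and len(slugs) > 0:
        return HTTPStatus.OK

    # Empty body, {confirm:false}, and {only_slugs:[]} all land here.
    return HTTPStatus.BAD_REQUEST
```

Under these assumptions, the {} body sent by aa375b1f would be rejected with 400 rather than rolling every tenant.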

All 4 molecule-core callers of redeploy-fleet have fleet-wide intent
(canary + fan-out, no slug scoping), so this PR adds confirm:true to
each request body:

  1. publish-workspace-server-image.yml (production auto-deploy on
     push:main) — body built by .gitea/scripts/prod-auto-deploy.py;
     confirm:true added there.
  2. redeploy-tenants-on-main.yml (manual/scheduled prod fleet
     redeploy) — inline jq -n body, confirm:true added.
  3. redeploy-tenants-on-staging.yml (staging fleet redeploy on
     workspace-server push:main + workflow_dispatch) — inline jq -n
     body, confirm:true added.
  4. staging-verify.yml (post-staging-green :latest promotion to
     prod) — inline jq -n body, confirm:true added.
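
The request body shared by the four callers above can be sketched in Python. The field names come from the diff in this commit; the helper name fleet_redeploy_body and its default values are illustrative assumptions (the workflows build the body with jq, and staging-verify.yml defaults soak to 120).

```python
import json


def fleet_redeploy_body(target_tag, soak_seconds=60, batch_size=3,
                        dry_run=False, canary_slug=""):
    """Hypothetical helper mirroring the JSON shape the callers send."""
    body = {
        "target_tag": target_tag,
        "soak_seconds": soak_seconds,
        "batch_size": batch_size,
        "dry_run": dry_run,
        # Fleet-wide acknowledgement required by the cp#228 contract.
        "confirm": True,
    }
    # canary_slug is optional; omit it when no canary is configured.
    if canary_slug:
        body["canary_slug"] = canary_slug
    return body


if __name__ == "__main__":
    # Compact output, comparable to what `jq -nc` emits.
    print(json.dumps(fleet_redeploy_body("staging-abcdef1"), separators=(",", ":")))
```

Hard-coding confirm rather than exposing it as a parameter matches the intent stated here: these callers always mean the whole fleet, so no operator knob should be able to drop the ack.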

No caller in molecule-core scopes by slug (only_slugs); per-tenant
ops happen via the manual CP admin curl, not via these workflows.
Comment-only references in sweep-stale-e2e-orgs.yml (both .gitea/
and .github/ copies) describe redeploy-fleet's auto-rollout — no
caller change needed.
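
One way to separate real references from comment-only ones when auditing workflow files is sketched below. The function name and the heuristic (a line counts as a comment when its first non-blank character is #) are assumptions for illustration, not the method used for this PR.

```python
def classify_redeploy_fleet_refs(workflow_text, needle="redeploy-fleet"):
    """Return (non_comment_lines, comment_only_lines) as 1-based line numbers."""
    non_comment, comment_only = [], []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        if needle not in line:
            continue
        # Heuristic: a line whose first non-blank character is '#' is a
        # YAML or shell comment and cannot issue a request.
        if line.lstrip().startswith("#"):
            comment_only.append(lineno)
        else:
            non_comment.append(lineno)
    return non_comment, comment_only
```

A file that yields only comment-only hits, as described for sweep-stale-e2e-orgs.yml, needs no body change; any non-comment hit still deserves a manual look, since an echo line would also match.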

Tests:
- .gitea/scripts/tests/test_prod_auto_deploy.py:
    * Existing test_build_plan_defaults_to_staging_sha_target_and_prod_cp
      updated to assert confirm=True in the default plan body.
    * New test_build_plan_always_sets_confirm_true_for_fleet_intent
      regression-pins that operator-overridable knobs (soak, batch,
      dry_run, canary slug) do NOT drop the ack.
    * All 10 prod-auto-deploy tests pass (136 pass across .gitea/scripts/tests/).
- Workflow-level body shape mirrored 1:1 with cp#228's
  TestRedeployFleet_ConfirmTrueProceeds (target_tag + confirm:true).
  Pairs with cp#228's TestRedeployFleet_EmptyBodyReturns400.
- lint-workflow-yaml.py clean (59 files, 0 fatal); curl-status-capture
  lint clean.

Sequencing context:
- cp#228 (44317b0) merged 07:48Z; publish-cp-image (run 86729)
  completed Success at 07:50Z, pushing CP image to ECR with the new
  contract.
- Per feedback_image_promote_is_not_user_live, the contract is NOT live
  until Railway pulls the new :latest CP image and restarts the CP
  container; image-push ≠ live. The 4 callers in this PR don't fire
  automatically until the next push:main (publish-workspace-server,
  redeploy-tenants-on-main) or staging-smoke pipeline (staging-verify,
  redeploy-tenants-on-staging), so there's headroom. This PR lands
  before the next caller invocation to avoid a 400.

Author independence:
- This PR is in molecule-core; cp#228 author was devops-engineer.
  Using infra-sre persona here keeps molecule-core-side review math
  clean (no same-author-as-cp#228 entanglement in this repo's review
  pool).

References:
- cp#228 (44317b0ecf0f0143dac0964deb76797e6b383044): contract
  enforcement (internal/handlers/admin_redeploy.go +25, provisioner
  +9, 4 contract tests +154).
- task #308: surfaced empty-body fleet-wide mutation.
- feedback_image_promote_is_not_user_live: image-push ≠ live.
2026-05-20 01:06:17 -07:00
5 changed files with 55 additions and 3 deletions
+6
@@ -71,6 +71,12 @@ def build_plan(env: dict[str, str]) -> dict:
        "soak_seconds": _int_env(env, "PROD_AUTO_DEPLOY_SOAK_SECONDS", 60, minimum=0),
        "batch_size": _int_env(env, "PROD_AUTO_DEPLOY_BATCH_SIZE", 3),
        "dry_run": truthy_flag(env.get("PROD_AUTO_DEPLOY_DRY_RUN", "")),
        # confirm:true ack required by CP /cp/admin/tenants/redeploy-fleet
        # contract (cp#228 / task #308) for fleet-wide intent. Empty body
        # / {confirm:false} / {only_slugs:[]} → 400. This caller is the
        # production auto-deploy step that rolls every live tenant (canary
        # + fan-out), no slug scoping, so confirm:true is correct.
        "confirm": True,
    }
    if canary_slug:
        body["canary_slug"] = canary_slug
@@ -36,9 +36,37 @@ def test_build_plan_defaults_to_staging_sha_target_and_prod_cp():
"soak_seconds": 60,
"batch_size": 3,
"dry_run": False,
# cp#228 / task #308: fleet-wide intent must carry confirm:true.
"confirm": True,
}
def test_build_plan_always_sets_confirm_true_for_fleet_intent():
    """Regression guard: every plan body MUST carry confirm:true.

    CP /cp/admin/tenants/redeploy-fleet (cp#228) returns 400 on empty
    body / {confirm:false} / {only_slugs:[]} to prevent accidental
    fleet-wide mutation. This caller is fleet-wide intent (canary +
    fan-out, no slug scoping), so the plan MUST carry confirm:true.
    Pairs with cp#228's TestRedeployFleet_EmptyBodyReturns400 +
    TestRedeployFleet_ConfirmTrueProceeds.
    """
    plan = prod.build_plan({"GITHUB_SHA": "abcdef1234567890"})
    assert plan["body"]["confirm"] is True

    # Operator-overridable knobs do NOT drop the ack.
    plan = prod.build_plan(
        {
            "GITHUB_SHA": "abcdef1234567890",
            "PROD_AUTO_DEPLOY_SOAK_SECONDS": "0",
            "PROD_AUTO_DEPLOY_BATCH_SIZE": "10",
            "PROD_AUTO_DEPLOY_DRY_RUN": "true",
            "PROD_AUTO_DEPLOY_CANARY_SLUG": "",
        }
    )
    assert plan["body"]["confirm"] is True


def test_build_plan_rejects_non_prod_cp_without_explicit_override():
try:
prod.build_plan(
@@ -151,6 +151,11 @@ jobs:
exit 1
fi
# confirm:true ack required by CP /cp/admin/tenants/redeploy-fleet
# contract (cp#228 / task #308) for fleet-wide intent. Empty body
# / {confirm:false} / {only_slugs:[]} → 400. This caller redeploys
# the entire prod fleet (canary + fan-out), no slug scoping, so
# confirm:true is correct.
BODY=$(jq -nc \
--arg tag "$TARGET_TAG" \
--arg canary "$CANARY_SLUG" \
@@ -162,7 +167,8 @@ jobs:
canary_slug: $canary,
soak_seconds: $soak,
batch_size: $batch,
-dry_run: $dry
+dry_run: $dry,
+confirm: true
}')
echo "POST $CP_URL/cp/admin/tenants/redeploy-fleet"
@@ -123,6 +123,11 @@ jobs:
exit 1
fi
# confirm:true ack required by CP /cp/admin/tenants/redeploy-fleet
# contract (cp#228 / task #308) for fleet-wide intent. Empty body
# / {confirm:false} / {only_slugs:[]} → 400. Staging IS the
# canary, no slug scoping; this rolls the entire staging fleet,
# so confirm:true is correct.
BODY=$(jq -nc \
--arg tag "$TARGET_TAG" \
--arg canary "$CANARY_SLUG" \
@@ -134,7 +139,8 @@ jobs:
canary_slug: $canary,
soak_seconds: $soak,
batch_size: $batch,
-dry_run: $dry
+dry_run: $dry,
+confirm: true
}')
echo "POST $CP_URL/cp/admin/tenants/redeploy-fleet"
+7 -1
@@ -235,6 +235,11 @@ jobs:
set -euo pipefail
TARGET_TAG="staging-${SHA}"
# confirm:true ack required by CP /cp/admin/tenants/redeploy-fleet
# contract (cp#228 / task #308) for fleet-wide intent. Empty body
# / {confirm:false} / {only_slugs:[]} → 400. This caller promotes
# the verified staging image across the entire prod fleet (canary
# + fan-out), no slug scoping, so confirm:true is correct.
BODY=$(jq -nc \
--arg tag "$TARGET_TAG" \
--argjson soak "${SOAK_SECONDS:-120}" \
@@ -244,7 +249,8 @@ jobs:
target_tag: $tag,
soak_seconds: $soak,
batch_size: $batch,
-dry_run: $dry
+dry_run: $dry,
+confirm: true
}')
if [ -n "${CANARY_SLUG:-}" ]; then