fix(prod-deploy): fail-closed kill-switch + required-contexts; un-mask redeploy job (#3210 tail) #3225
#3210 deploy-gate hardening tail (HIGH + HIGH + MEDIUM)
Fixes the prod-deploy-side fail-opens from the #3210 audit family (the merge-gate ones are in #3222 merged / #3224). All changes are strictly tightening.
🟠 FIX C (HIGH): prod kill-switch fails OPEN
live_disable_flag() returned "" (= not-disabled) on ANY read failure (rotated-token 401 / 500 / timeout) → a prod rollout PROCEEDED despite an armed PROD_AUTO_DEPLOY_DISABLED. Fix: HTTP 404 (unset) is the ONLY legitimate not-disabled signal; missing token / non-404 HTTP / network error now RAISE → deploy HOLDS.
🟠 FIX D (HIGH): empty required-contexts → deploy with no CI
A non-blank PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS parsing to zero tokens (e.g. ",") made wait_for_ci_context()'s all([]) vacuously True → deploy with NO CI verified. Fix: required_contexts() raises on non-blank→empty; wait_for_ci_context() refuses an empty set (defence-in-depth). Blank/unset still uses the defaults.
🟡 FIX E (MEDIUM): redeploy job un-masked
redeploy-tenants-on-main.yml jobs.redeploy ran under continue-on-error: true, masking the redeploy POST + stale-verify gates → a failed prod redeploy/rollback reported success. Fix: removed it from the side-effecting job (kill-switch skip still exits 0 via its own step if:). workflow_dispatch-only + bp-exempt.
Tests: +21 in test_prod_auto_deploy.py (70 passed): read-failure→HOLD, 404→OK, armed→raise, empty-required→raise/refuse; proven to fail against pre-fix. Lints (coe-tracking / no-coe-on-required / workflow-yaml) exit 0. Addresses the deploy-side #3210 tail.
APPROVED on 6006cf632c. 5-axis + adversarial security review:
CI notes: Platform(Go) and CI/all-required are green. Ops Scripts is red in unrelated test_sop_checklist.py tuple/list expectations. Lint pre-flip continue-on-error is red because it requires owner/run-log proof for the redeploy COE true→false flip; that is a merge-readiness/process proof gate, not a code-safety objection to this fail-closed fix.
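For readers following FIX C, the fail-closed read can be sketched roughly as below. This is not the code in prod-auto-deploy.py: the function name read_disable_flag_sketch, the exception KillSwitchUnreadable, the "token" Authorization scheme, the plain-text response body, and the injectable opener parameter are all assumptions made for illustration. The one thing taken from the PR is the rule itself: only HTTP 404 means "not disabled"; everything else raises so the deploy holds.

```python
import urllib.error
import urllib.request


class KillSwitchUnreadable(RuntimeError):
    """Hypothetical error: kill-switch state is unknown, so the deploy must hold."""


def read_disable_flag_sketch(url, token, opener=urllib.request.urlopen):
    """Return the flag value, or "" only when the API answers 404 (variable unset).

    Assumption: the endpoint returns the raw flag value as its body.
    """
    if not token:
        # A missing token is a read failure, not a "not disabled" signal.
        raise KillSwitchUnreadable("no API token; cannot read kill-switch")
    req = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
    try:
        with opener(req, timeout=10) as resp:
            return resp.read().decode("utf-8").strip()
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return ""  # unset variable: the only legitimate not-disabled answer
        raise KillSwitchUnreadable(f"HTTP {exc.code} reading kill-switch") from exc
    except OSError as exc:
        # URLError and timeouts are OSError subclasses: treat all as "unknown".
        raise KillSwitchUnreadable(f"network error reading kill-switch: {exc}") from exc
```

A caller would treat any non-empty return value as "armed" and let KillSwitchUnreadable propagate, so a rotated token or a 500 stops the rollout instead of silently allowing it.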
APPROVED on 6006cf63. 5-axis/security review:
Correctness: the prod auto-deploy kill-switch live re-check now fails closed unless the API definitively returns 404 for an unset variable; missing token, network errors, and non-404 read failures raise and hold the deploy. PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS blank/unset still uses defaults, but non-blank values that parse to zero contexts raise, and wait_for_ci_context has a second empty-context guard so all([]) cannot green a deploy. Removing continue-on-error from the redeploy job is correct for a side-effecting production redeploy/rollback path; the documented kill-switch skip remains explicit rather than masked.
Robustness: tests cover 404, 200 value, HTTP failures, missing token, network error, empty contexts, and defense-in-depth.
Security: strictly tightens prod deploy gates; no new secret exposure.
Performance: no meaningful impact.
Readability: workflow comment and Python errors make operator behavior clear.
Reviewed files: .gitea/scripts/prod-auto-deploy.py, .gitea/scripts/tests/test_prod_auto_deploy.py, .gitea/workflows/redeploy-tenants-on-main.yml. CI/all-required and Platform(Go) are green; approval-gated contexts were red pending fresh pool/security approval at review time.
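The empty-required-contexts problem from FIX D comes down to all([]) being True in Python. A minimal sketch of the two guards follows, under stated assumptions: the names parse_required_contexts and contexts_green, the comma separator, the placeholder default context, and the "success" status string are hypothetical and are not taken from prod-auto-deploy.py.

```python
# Hypothetical placeholder; the real defaults are not shown in this PR page.
DEFAULT_CONTEXTS = ("ci/default-context",)


def parse_required_contexts(raw, defaults=DEFAULT_CONTEXTS):
    """Parse a comma-separated context list (separator is an assumption).

    Blank or unset input falls back to the defaults; a non-blank value that
    yields zero tokens (for example ",") is a misconfiguration and raises.
    """
    if raw is None or not raw.strip():
        return list(defaults)
    tokens = [part.strip() for part in raw.split(",") if part.strip()]
    if not tokens:
        raise ValueError(f"required contexts {raw!r} parsed to zero tokens")
    return tokens


def contexts_green(required, statuses):
    """Return True only if every required context reports success.

    Second guard (defence-in-depth): refuse an empty set outright, because
    all() over an empty iterable is vacuously True.
    """
    if not required:
        raise ValueError("refusing to evaluate an empty required-context set")
    return all(statuses.get(ctx) == "success" for ctx in required)
```

Keeping the second check inside the evaluation step means a future caller that bypasses the parser still cannot turn an empty list into a green deploy.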