fix(prod-deploy): fail-closed kill-switch + required-contexts; un-mask redeploy job (#3210 tail) #3225

Merged
core-devops merged 1 commit from fix/deploy-gate-hardening-3210c into main 2026-06-24 09:18:33 +00:00
Member

#3210 deploy-gate hardening tail (HIGH + HIGH + MEDIUM)

This PR covers the prod-deploy-side fail-opens from the #3210 audit family; the merge-gate ones are in #3222 (merged) and #3224. All changes are strictly tightening.

🟠 FIX C (HIGH) — prod kill-switch fails OPEN

live_disable_flag() returned "" (=not-disabled) on ANY read failure (rotated-token 401 / 500 / timeout) → a prod rollout PROCEEDED despite an armed PROD_AUTO_DEPLOY_DISABLED. Fix: HTTP 404 (unset) is the ONLY legitimate not-disabled signal; missing token / non-404 HTTP / network error now RAISE → deploy HOLDS.
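
A minimal sketch of the fail-closed read described above, not the code in this PR. Only the name `live_disable_flag` comes from the description; the injected `fetch` callable, its `(status, body)` return shape, and the `KillSwitchUnreadable` exception are assumptions made so the example runs without a network.

```python
class KillSwitchUnreadable(RuntimeError):
    """Hypothetical error: the kill-switch state is unknown, so the deploy must HOLD."""


def live_disable_flag(fetch, token):
    """Return the flag value, or "" only when the API answers HTTP 404 (unset).

    `fetch(token)` is an assumed helper returning (http_status, body_text)
    and raising OSError on network failure or timeout.
    """
    if not token:
        # A missing token used to look like "not disabled"; now it holds.
        raise KillSwitchUnreadable("missing token; cannot read kill-switch")
    try:
        status, body = fetch(token)
    except OSError as exc:  # TimeoutError and connection errors are OSError subclasses
        raise KillSwitchUnreadable(f"network error reading kill-switch: {exc}") from exc
    if status == 404:
        return ""  # the only legitimate not-disabled signal
    if status != 200:
        raise KillSwitchUnreadable(f"HTTP {status} reading kill-switch")
    return body.strip()
```

The point of the shape is that every branch other than 404 and 200 raises, so a caller cannot mistake "could not read" for "not disabled".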

🟠 FIX D (HIGH) — empty required-contexts → deploy with no CI

A non-blank PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS parsing to zero tokens (e.g. ",") made wait_for_ci_context()'s all([]) vacuously True → deploy with NO CI verified. Fix: required_contexts() raises on non-blank→empty; wait_for_ci_context() refuses an empty set (defence-in-depth). Blank/unset still uses the defaults.
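
The vacuous-truth trap and the two guards can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the comma delimiter is inferred from the `","` example, the default value is a placeholder, and `all_contexts_green` is a hypothetical stand-in for the check inside `wait_for_ci_context()`.

```python
# Placeholder default; the real DEFAULT_REQUIRED_CONTEXTS values are not shown in this PR.
DEFAULT_REQUIRED_CONTEXTS = ("example/default-context",)


def required_contexts(raw):
    """Parse the override; blank/unset falls back to defaults, non-blank→empty raises."""
    if raw is None or not raw.strip():
        return list(DEFAULT_REQUIRED_CONTEXTS)
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    if not tokens:
        # e.g. "," is non-blank but yields zero contexts
        raise ValueError("required-contexts override parsed to zero contexts")
    return tokens


def all_contexts_green(contexts, statuses):
    """Defence-in-depth: refuse an empty set before evaluating all()."""
    if not contexts:
        raise ValueError("refusing to gate a deploy on an empty context set")
    return all(statuses.get(c) == "success" for c in contexts)


# The pre-fix failure mode: all() over an empty iterable is True.
assert all([]) is True
```

Both guards are needed only in the sense of layering: the parser rejects the bad override, and the gate still refuses if an empty list reaches it by some other path.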

🟡 FIX E (MEDIUM) — redeploy job un-masked

redeploy-tenants-on-main.yml jobs.redeploy ran under continue-on-error: true, masking the redeploy POST + stale-verify gates → a failed prod redeploy/rollback reported success. Fix: removed it from the side-effecting job (kill-switch skip still exits 0 via its own step if:). workflow_dispatch-only + bp-exempt.
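
As a deliberately simplified model of the masking described above (it is not the runner's actual logic, and `reported_conclusion` is a hypothetical name), the effect of `continue-on-error` on what gets reported can be sketched as:

```python
def reported_conclusion(step_exit_codes, continue_on_error):
    """Simplified model: what the job reports, given its steps' exit codes."""
    failed = any(code != 0 for code in step_exit_codes)
    if failed and not continue_on_error:
        return "failure"
    # With continue_on_error=True a failed step no longer surfaces as a failure.
    return "success"
```

Under this model a failed redeploy POST (non-zero exit) reports "success" when the flag is true and "failure" once it is removed, while a skipped or clean run reports "success" either way.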

Tests: +21 in test_prod_auto_deploy.py (70 passed) — read-failure→HOLD, 404→OK, armed→raise, empty-required→raise/refuse; proven to fail against pre-fix. Lints (coe-tracking / no-coe-on-required / workflow-yaml) exit 0. Addresses the deploy-side #3210 tail.

hongming-ceo-delegated added 1 commit 2026-06-24 09:14:52 +00:00
fix(prod-deploy): fail closed on unreadable kill-switch + empty required-contexts; un-mask redeploy job (#3210 tail)
CI / Python Lint & Test (pull_request) Successful in 5s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 7s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 6s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Failing after 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
E2E Chat / detect-changes (pull_request) Successful in 14s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 10s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 18s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s
sop-checklist / review-refire (pull_request_target) Has been skipped
lint-no-coe-on-required / lint-no-coe-on-required (pull_request) Successful in 18s
E2E Chat / E2E Chat (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 23s
CI / Platform (Go) (pull_request) Successful in 5s
lint-setup-go-cache / lint-setup-go-cache (pull_request) Successful in 18s
Lint publish-runner timeout-minutes / Lint publish-runner timeout-minutes (pull_request) Successful in 20s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 14s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 20s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3s
sop-checklist / all-items-acked (pull_request) acked: 0/9 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +6 — body-unfilled: comprehensive-testing, local-postgres-e2
CI / Canvas Deploy Status (pull_request) Successful in 1s
sop-checklist / na-declarations (pull_request) N/A: (none)
PR Diff Guard / PR diff guard (pull_request) Successful in 19s
gate-check-v3 / gate-check (pull_request_target) Failing after 16s
sop-checklist / all-items-acked (pull_request_target) Successful in 14s
template-delivery-e2e / detect-changes (pull_request) Successful in 19s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Failing after 19s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 2s
CI / all-required (pull_request) Successful in 4s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Failing after 32s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 43s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 43s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 52s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 11s
reserved-path-review / reserved-path-review (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m6s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 11s
security-review / approved (pull_request_review) Successful in 12s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Has been cancelled
audit-force-merge / audit (pull_request_target) Successful in 7s
6006cf632c
FIX C (HIGH): live_disable_flag() returned "" (= not-disabled) on ANY read
failure (rotated-token 401, 500, timeout) → assert_not_disabled saw not-disabled
→ a prod rollout PROCEEDED despite an armed PROD_AUTO_DEPLOY_DISABLED kill switch.
Fix: HTTP 404 (variable unset) is the ONLY legitimate not-disabled signal; missing
token, any non-404 HTTP error, and network errors now RAISE → the deploy HOLDS.

FIX D (HIGH): a non-blank PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS that parsed to zero
tokens (e.g. ",") → wait_for_ci_context()'s all([]) was vacuously True → deploy
proceeded with NO CI verified. Fix: required_contexts() raises on a non-blank value
that parses empty; wait_for_ci_context() also refuses an empty context set
(defence-in-depth). Blank/unset still uses DEFAULT_REQUIRED_CONTEXTS.

FIX E (MEDIUM): redeploy-tenants-on-main.yml jobs.redeploy ran under
continue-on-error:true, masking the redeploy POST (HTTP!=200/ok!=true) and the
stale/unreachable verify gates → a failed prod redeploy/rollback reported success.
Fix: removed continue-on-error from the side-effecting redeploy job (kill-switch
skip still exits 0 via its own step if:). workflow_dispatch-only + bp-exempt.

All strictly tightening. Tests: +21 in test_prod_auto_deploy.py (70 passed) covering
read-failure→HOLD, 404→OK, armed→raise, empty-required→raise/refuse; proven to fail
against pre-fix behavior. Lints (continue-on-error-tracking / no-coe-on-required /
workflow-yaml) exit 0.

Addresses the deploy-side #3210 hardening tail (merge-gate ones in #3222/#3224).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-researcher approved these changes 2026-06-24 09:17:44 +00:00
agent-researcher left a comment
Member

APPROVED on 6006cf632c.

5-axis + adversarial security review:

  • Correctness: live_disable_flag now fails closed. Only HTTP 404 means the PROD_AUTO_DEPLOY_DISABLED variable is unset; missing token, network/timeout, non-404 HTTP errors, or malformed response raise and hold before prod side effects. Empty PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS overrides now raise, and wait_for_ci_context has a second empty-set guard before status polling.
  • Robustness: blank/unset required-context override still uses defaults, real overrides still parse, and the kill-switch happy paths (404 unset, 200 value) are preserved. Removing continue-on-error from redeploy surfaces real prod redeploy/rollback failures instead of masking them; the disabled-skip path remains a clean skip.
  • Security: I do not see a non-404 read-error bypass or an all([]) deploy path. The unmask change is a strict gate hardening: it changes reporting of failed side effects, not the deployment side-effect commands themselves.
  • Performance: no meaningful runtime impact; checks are pre-deploy API/status reads already in the flow.
  • Readability: the deploy gate now documents the 404-only not-disabled contract and the empty-context fail-closed contract clearly.

CI notes: Platform(Go) and CI/all-required are green. Ops Scripts is red in unrelated test_sop_checklist.py tuple/list expectations. Lint pre-flip continue-on-error is red because it requires owner/run-log proof for the redeploy COE true→false flip; that is a merge-readiness/process proof gate, not a code-safety objection to this fail-closed fix.

agent-reviewer-cr2 approved these changes 2026-06-24 09:17:52 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED on 6006cf63.

5-axis/security review:

  • Correctness: the prod auto-deploy kill-switch live re-check now fails closed unless the API definitively returns 404 for an unset variable; missing token, network errors, and non-404 read failures raise and hold the deploy. PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS blank/unset still uses defaults, but non-blank values that parse to zero contexts raise, and wait_for_ci_context has a second empty-context guard so all([]) cannot green a deploy. Removing continue-on-error from the redeploy job is correct for a side-effecting production redeploy/rollback path; the documented kill-switch skip remains explicit rather than masked.
  • Robustness: tests cover 404, 200 value, HTTP failures, missing token, network error, empty contexts, and defense-in-depth.
  • Security: strictly tightens prod deploy gates; no new secret exposure.
  • Performance: no meaningful impact.
  • Readability: workflow comment and Python errors make operator behavior clear.

Reviewed files: .gitea/scripts/prod-auto-deploy.py, .gitea/scripts/tests/test_prod_auto_deploy.py, .gitea/workflows/redeploy-tenants-on-main.yml. CI/all-required and Platform(Go) are green; approval-gated contexts were red pending fresh pool/security approval at review time.

core-devops merged commit 42684c7167 into main 2026-06-24 09:18:33 +00:00
Reference: molecule-ai/molecule-core#3225