Merge/deploy gate fail-open family (audit follow-up to RC 13618 / #3192) #3210

Open
opened 2026-06-24 06:41:31 +00:00 by hongming-ceo-delegated · 0 comments
Member

Merge/deploy gate fail-open family — audit follow-up to RC 13618 (#3192)

While fixing RC 13618 (load_enforced_file_contexts() returning [] on an unreadable SSOT → silently-disabled #3181 enforcement, fixed in #3192), an adversarial audit of the merge/deploy gate surface found a family of the same "absence/empty/read-error → silently permissive" class in OTHER gates. The #3192 fix was adversarially verified to have no bypass; these are separate gates and are filed here rather than bundled into #3192.

Each is verified-real (independent skeptic confirmed reachable + proceeds-on-failure).

🔴 CRITICAL — zero-approval merge

gitea-merge-queue.py parse_branch_protection L676-679 → evaluate_merge_readiness step 3 L1300-1307. If the branch_protections API returns required_approvals: 0 (admin lowers it, a Gitea migration/default resets it, or a restored/blanked BP record), negative, or bool True (isinstance(True,int) → coerced to 1, halving the 2-genuine bar), the genuine-approval gate is silently skipped → a PR can auto-merge with ZERO approvals.
Fix: hard floor at derivation: required_approvals = max(int(approvals), REQUIRED_APPROVALS_DEFAULT); reject bool explicitly (isinstance(approvals,int) and not isinstance(approvals,bool)); treat 0/negative as fail-closed config error. Defensive floor in evaluate_merge_readiness (non-exempt path) too. Unit tests for 0, -1, bool True.

🟠 HIGH — out-of-roster REQUEST_CHANGES silently dropped

_approval_validator.py classify_reviews L215-235. The reviewer_set filter (L222) applies to BOTH approvals AND request-changes, so a genuine current-head REQUEST_CHANGES from a login NOT in REVIEWER_SET (e.g. the CTO/founder account, or any reviewer not in the roster) does NOT block the merge while two roster approvals stand.
Fix: apply reviewer_set only to the approver count; honor an official current-head REQUEST_CHANGES from ANY login. (Or two calls: reviewer_set=REVIEWER_SET for approvals, reviewer_set=None for the block set.) Unit test: out-of-roster RC still blocks.

🟠 HIGH — empty PUSH_REQUIRED_CONTEXTS → main-green backstop passes vacuously

gitea-merge-queue.py push_required_contexts L516-518 → evaluate_merge_readiness step 1 / process_once. PUSH_REQUIRED_CONTEXTS='' (or names the push CI never emits) → []required_contexts_green(main_latest, []) returns green for ANY main state incl. all-red → queue keeps merging onto a red main.
Fix: fail-closed on empty parsed set (assert at least CI / all-required (push); pause the tick with ::error:: if empty), mirroring the BP enable_status_check && contexts empty guard.

🟠 HIGH — prod-auto-deploy.py empty required-contexts → deploy with no CI verified

required_contexts() L182-186 + wait_for_ci_context() L670-717. A non-blank PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS with no extractable tokens (",") → []all([]) is True → rollout proceeds with zero CI verified.
Fix: raise if the raw value was non-blank but parsed to []; wait_for_ci_context() must refuse to deploy with no contexts.

🟠 HIGH — prod-auto-deploy.py kill-switch fails OPEN

live_disable_flag() L641-658 + assert_not_disabled() L661-667. During an incident an operator sets PROD_AUTO_DEPLOY_DISABLED=true, but if the live read returns non-200 (401 on a rotated token, 500) the fn returns "" (= not-disabled) → rollout proceeds despite the active kill-switch.
Fix: distinguish "read says not-disabled" from "read failed"; on a non-404 error or missing token, raise (HOLD). Only a 404 (variable unset) is a legitimate not-disabled signal.

🟡 MEDIUM — conductor snapshot staleness can merge over a freshly-red context

gitea-merge-queue.py load_conductor_snapshot L323-366 + get_combined_status. A snapshot <600s old captured green; the context goes red on the same head before the next snapshot → queue reads stale-green and merges (incl. force_merge over a now-red non-BP enforced context).
Fix: for the FINAL pre-merge gate, re-fetch live /commits/{sha}/status and re-run required_contexts_green + enforced_file_contexts_green immediately before merge_pull; use the snapshot only for cheap enumeration.

🟡 MEDIUM — malformed/absent snapshot ts treated as FRESH

gitea-merge-queue.py load_conductor_snapshot L348-364. if ts_str: skips the age check when ts is empty/absent; except ValueError: pass swallows a parse failure and falls through to return snapshot — comment says "conservative" but it's the opposite (an arbitrarily-old snapshot is trusted).
Fix: require a parseable ts to trust the snapshot; except ValueError: return None (discard + self-fetch); fix the misleading comment.

🟡 MEDIUM — redeploy-tenants-on-main.yml job under continue-on-error: true

jobs.redeploy L77 masks the gates at L228-236 (HTTP!=200, ok!=true) and L370-388 (stale/unreachable verify) — a failed prod redeploy/rollback reports success.
Fix: remove continue-on-error: true from the side-effecting redeploy job; scope it to advisory steps only, never the redeploy POST or the ok/stale gates.


Method: ultracode adversarial audit (5 finders + 3 fix-bypass skeptics + per-finding refutation). The lint_no_coe_on_required "external defense" cited in the original RC 13618 code was also validated: it hard-fails on a missing SSOT but only on the proposing PR branch (paths-filtered) + push:main — NOT a backstop for the merge queue's runtime checkout, confirming RC 13618 had no real safety net.

Recommend the CRITICAL (zero-approval) + the out-of-roster-REQUEST_CHANGES (CTO block dropped) as fast-follow; the rest as a hardening sweep.

## Merge/deploy gate fail-open family — audit follow-up to RC 13618 (#3192) While fixing RC 13618 (`load_enforced_file_contexts()` returning `[]` on an unreadable SSOT → silently-disabled #3181 enforcement, fixed in #3192), an adversarial audit of the merge/deploy gate surface found a **family** of the same "absence/empty/read-error → silently permissive" class in OTHER gates. The #3192 fix was adversarially verified to have **no bypass**; these are separate gates and are filed here rather than bundled into #3192. Each is verified-real (independent skeptic confirmed reachable + proceeds-on-failure). ### 🔴 CRITICAL — zero-approval merge **`gitea-merge-queue.py` `parse_branch_protection` L676-679 → `evaluate_merge_readiness` step 3 L1300-1307.** If the branch_protections API returns `required_approvals: 0` (admin lowers it, a Gitea migration/default resets it, or a restored/blanked BP record), negative, or bool `True` (`isinstance(True,int)` → coerced to 1, halving the 2-genuine bar), the genuine-approval gate is silently skipped → a PR can auto-merge with ZERO approvals. **Fix:** hard floor at derivation: `required_approvals = max(int(approvals), REQUIRED_APPROVALS_DEFAULT)`; reject bool explicitly (`isinstance(approvals,int) and not isinstance(approvals,bool)`); treat 0/negative as fail-closed config error. Defensive floor in `evaluate_merge_readiness` (non-exempt path) too. Unit tests for 0, -1, bool True. ### 🟠 HIGH — out-of-roster REQUEST_CHANGES silently dropped **`_approval_validator.py` `classify_reviews` L215-235.** The `reviewer_set` filter (L222) applies to BOTH approvals AND request-changes, so a genuine current-head REQUEST_CHANGES from a login NOT in `REVIEWER_SET` (e.g. the **CTO/founder account**, or any reviewer not in the roster) does NOT block the merge while two roster approvals stand. **Fix:** apply `reviewer_set` only to the approver count; honor an official current-head REQUEST_CHANGES from ANY login. (Or two calls: `reviewer_set=REVIEWER_SET` for approvals, `reviewer_set=None` for the block set.) Unit test: out-of-roster RC still blocks. ### 🟠 HIGH — empty `PUSH_REQUIRED_CONTEXTS` → main-green backstop passes vacuously **`gitea-merge-queue.py` `push_required_contexts` L516-518 → `evaluate_merge_readiness` step 1 / `process_once`.** `PUSH_REQUIRED_CONTEXTS=''` (or names the push CI never emits) → `[]` → `required_contexts_green(main_latest, [])` returns green for ANY main state incl. all-red → queue keeps merging onto a red main. **Fix:** fail-closed on empty parsed set (assert at least `CI / all-required (push)`; pause the tick with `::error::` if empty), mirroring the BP `enable_status_check && contexts empty` guard. ### 🟠 HIGH — `prod-auto-deploy.py` empty required-contexts → deploy with no CI verified **`required_contexts()` L182-186 + `wait_for_ci_context()` L670-717.** A non-blank `PROD_AUTO_DEPLOY_REQUIRED_CONTEXTS` with no extractable tokens (`","`) → `[]` → `all([])` is True → rollout proceeds with zero CI verified. **Fix:** raise if the raw value was non-blank but parsed to `[]`; `wait_for_ci_context()` must refuse to deploy with no contexts. ### 🟠 HIGH — `prod-auto-deploy.py` kill-switch fails OPEN **`live_disable_flag()` L641-658 + `assert_not_disabled()` L661-667.** During an incident an operator sets `PROD_AUTO_DEPLOY_DISABLED=true`, but if the live read returns non-200 (401 on a rotated token, 500) the fn returns `""` (= not-disabled) → rollout proceeds despite the active kill-switch. **Fix:** distinguish "read says not-disabled" from "read failed"; on a non-404 error or missing token, raise (HOLD). Only a 404 (variable unset) is a legitimate not-disabled signal. ### 🟡 MEDIUM — conductor snapshot staleness can merge over a freshly-red context **`gitea-merge-queue.py` `load_conductor_snapshot` L323-366 + `get_combined_status`.** A snapshot <600s old captured green; the context goes red on the same head before the next snapshot → queue reads stale-green and merges (incl. force_merge over a now-red non-BP enforced context). **Fix:** for the FINAL pre-merge gate, re-fetch live `/commits/{sha}/status` and re-run `required_contexts_green` + `enforced_file_contexts_green` immediately before `merge_pull`; use the snapshot only for cheap enumeration. ### 🟡 MEDIUM — malformed/absent snapshot `ts` treated as FRESH **`gitea-merge-queue.py` `load_conductor_snapshot` L348-364.** `if ts_str:` skips the age check when `ts` is empty/absent; `except ValueError: pass` swallows a parse failure and falls through to `return snapshot` — comment says "conservative" but it's the opposite (an arbitrarily-old snapshot is trusted). **Fix:** require a parseable `ts` to trust the snapshot; `except ValueError: return None` (discard + self-fetch); fix the misleading comment. ### 🟡 MEDIUM — `redeploy-tenants-on-main.yml` job under `continue-on-error: true` **`jobs.redeploy` L77** masks the gates at L228-236 (HTTP!=200, ok!=true) and L370-388 (stale/unreachable verify) — a failed prod redeploy/rollback reports success. **Fix:** remove `continue-on-error: true` from the side-effecting redeploy job; scope it to advisory steps only, never the redeploy POST or the ok/stale gates. --- **Method:** ultracode adversarial audit (5 finders + 3 fix-bypass skeptics + per-finding refutation). The `lint_no_coe_on_required` "external defense" cited in the original RC 13618 code was also validated: it hard-fails on a missing SSOT but only on the proposing PR branch (paths-filtered) + push:main — NOT a backstop for the merge queue's runtime checkout, confirming RC 13618 had no real safety net. Recommend the CRITICAL (zero-approval) + the out-of-roster-REQUEST_CHANGES (CTO block dropped) as fast-follow; the rest as a hardening sweep.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#3210