2026-06-06 03:16:41 +00:00
2 changed files with 87 additions and 0 deletions
@@ -121,6 +121,92 @@ python -m pytest .gitea/scripts/tests/test_gate_auto_fire_live.py -v

 ---

+## 6. Fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING)
+
+**Rule:** No CI workflow, CI script, or test check may **FAIL OPEN** — i.e. it
+must never report GREEN (exit 0, skip, warn-and-continue, `|| true`, or any
+"return success") when it could **not actually verify its invariant**. A check
+that cannot verify MUST **fail loud** (`::error::` annotation **and** a nonzero
+exit) and **fail closed** (treat inability-to-verify as **FAILURE**, never as a
+pass). An unverifiable check is a red check, full stop.
+
+This is the same family of bug as the no-flakes rule (§ *No flakes*): a green
+that isn't real. A flake is a green/red that flips for an unnamed reason; a
+fail-open gate is a green that was never earned. Both let unverified code reach
+`main`, and both are merge-blocking.
+
+### Applies to
+
+Required / hard gates on **protected contexts**: pushes to `main`, internal
+protected branches, and **same-repo** PRs (`pull_request_target`). On these
+contexts the *cause* of an unverifiable run is **irrelevant** — every one of the
+following MUST fail closed:
+
+- auth failure (401 / 403),
+- missing token or identity,
+- under-scoped credential,
+- unreachable dependency (network, Infisical, control-plane, registry),
+- a required test file that is absent or collects zero tests,
+- any transient error the check cannot prove was benign.
+
+"I couldn't check" is reported and scored exactly like "the check failed." A
+gate that can be silently defanged by removing a secret is not a gate.
+
+### The one allowed exception — explicit trust-boundary split
+
+Legitimate degradation is permitted **only** where the secret genuinely cannot
+exist — e.g. **fork PRs**, which by design have no access to repo secrets. Such
+degradation is allowed **only** when it is:
+
+1. gated behind an **explicit** fork / advisory branch in the workflow logic
+   (an intentional trust-boundary split, not an incidental `if: secrets...`),
+2. **clearly marked advisory** in its name and output, and
+3. **NOT counted as a passing REQUIRED context** — it may inform, it may not
+   satisfy the gate.
+
+Silent degradation that satisfies a required gate is **forbidden**. If a fork PR
+needs the real check, it must run via a maintainer-triggered same-repo path
+(where the secret exists and the check therefore fails closed), not by quietly
+passing the required context with no verification.
+
+### Auth-failure vs. genuine-absence — do not conflate
+
+Distinguish the two so a real finding is never masked and a masked finding is
+never mistaken for real:
+
+- **`403` (or 401) on a protected context → fail closed.** You could not verify;
+  that is a check failure, not a finding about the resource.
+- **A real `404` from a read made *with a valid, sufficiently-scoped token* →
+  the real finding.** The resource is genuinely absent; report it as such.
+
+A `403` reported as "resource not found" is itself a fail-open bug.
+
+### Required practice
+
+Every gate that depends on a token, an identity, or an external read MUST ship
+with a test or workflow-lint covering the **absent-identity / unauthorized /
+missing-file path** that asserts the gate **FAILS** (not skips, not passes).
+Add or update that coverage in the **same PR** that adds or changes the gate.
+A gate without a proven failure path is not yet a gate.
+
+### Violations seen in this codebase (all merge-blocking if reintroduced)
+
+- **serving-e2e** reporting vacuously GREEN when the Infisical identity is
+  absent (no per-(provider × auth) completion was actually exercised).
+- **branch-protection / BP-drift lints** returning `0` on a `403` instead of
+  failing closed on the unverifiable response.
+- **verify-template-models** run without `-strict`, so a drift it could not
+  confirm passed silently.
+- A **referenced-but-absent pytest file** that collects zero tests and reports
+  green — silent pass with no assertions executed.
+
+Each of these is a fail-open gate and is a merge blocker until it fails loud and
+closed on protected contexts. See also the production fail-closed defaults in
+`runbooks/sop-production-cicd.md` (*Production Defaults*), which apply the same
+principle to deploy-time gates.
+
+---
+
 ## References

 - #2159 — gate auto-trigger not firing (root cause: stale PR heads lacking
@@ -35,6 +35,7 @@ Every production CI/CD PR must include concrete answers for:
 - Verification: how production state is proven after deployment.
 - Logging: proof that CI logs do not contain raw production runtime, SSM, or secret-adjacent output.
 - Rollback: the exact command, variable, or workflow to return to a known-good tag/digest.
+- No fail-open gates: required checks fail loud + closed on protected contexts (no skip/`|| true`/`403`-as-pass). See `runbooks/dev-sop.md` § *Fail-closed CI integrity*.

 ## Human Review