From ba8ecdeb484165211410270b683c767a9d28659e Mon Sep 17 00:00:00 2001 From: core-devops Date: Fri, 5 Jun 2026 17:12:14 -0700 Subject: [PATCH] docs(sop): add merge-blocking fail-closed CI integrity rule Add a new merge-blocking section to the dev SOP forbidding fail-open CI/test gates: no check may report GREEN when it could not actually verify its invariant. Inability-to-verify (auth 401/403, missing token, under-scoped credential, unreachable dependency, missing/zero-collecting test file, transient error) MUST fail loud (::error:: + nonzero) and fail closed on protected contexts (push to main, internal protected branches, same-repo PRs). The single allowed exception is an explicit trust-boundary split (fork PRs) behind an advisory branch that is NOT a passing required context. Distinguishes 403 (can't verify -> fail closed) from a real 404 read with a valid token (the real finding). Requires every token/identity/ external-read gate to ship a test or workflow-lint for the absent- identity / unauthorized / missing-file path asserting it FAILS. Cites real codebase violations (vacuously-green serving-e2e, BP-drift lints returning 0 on 403, verify-template-models without -strict, referenced-but-absent pytest collecting zero tests) and cross-links the no-flakes rule and the production fail-closed defaults in sop-production-cicd.md, where a one-line PR-evidence checklist item is also added. Co-Authored-By: Claude Opus 4.8 (1M context) --- runbooks/dev-sop.md | 86 +++++++++++++++++++++++++++++++++ runbooks/sop-production-cicd.md | 1 + 2 files changed, 87 insertions(+) diff --git a/runbooks/dev-sop.md b/runbooks/dev-sop.md index caf2a3a45..c0a47cec2 100644 --- a/runbooks/dev-sop.md +++ b/runbooks/dev-sop.md @@ -121,6 +121,92 @@ python -m pytest .gitea/scripts/tests/test_gate_auto_fire_live.py -v --- +## 6. Fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING) + +**Rule:** No CI workflow, CI script, or test check may **FAIL OPEN** — i.e. it +must never report GREEN (exit 0, skip, warn-and-continue, `|| true`, or any +"return success") when it could **not actually verify its invariant**. A check +that cannot verify MUST **fail loud** (`::error::` annotation **and** a nonzero +exit) and **fail closed** (treat inability-to-verify as **FAILURE**, never as a +pass). An unverifiable check is a red check, full stop. + +This is the same family of bug as the no-flakes rule (§ *No flakes*): a green +that isn't real. A flake is a green/red that flips for an unnamed reason; a +fail-open gate is a green that was never earned. Both let unverified code reach +`main`, and both are merge-blocking. + +### Applies to + +Required / hard gates on **protected contexts**: pushes to `main`, internal +protected branches, and **same-repo** PRs (`pull_request_target`). On these +contexts the *cause* of an unverifiable run is **irrelevant** — every one of the +following MUST fail closed: + +- auth failure (401 / 403), +- missing token or identity, +- under-scoped credential, +- unreachable dependency (network, Infisical, control-plane, registry), +- a required test file that is absent or collects zero tests, +- any transient error the check cannot prove was benign. + +"I couldn't check" is reported and scored exactly like "the check failed." A +gate that can be silently defanged by removing a secret is not a gate. + +### The one allowed exception — explicit trust-boundary split + +Legitimate degradation is permitted **only** where the secret genuinely cannot +exist — e.g. **fork PRs**, which by design have no access to repo secrets. Such +degradation is allowed **only** when it is: + +1. gated behind an **explicit** fork / advisory branch in the workflow logic + (an intentional trust-boundary split, not an incidental `if: secrets...`), +2. **clearly marked advisory** in its name and output, and +3. **NOT counted as a passing REQUIRED context** — it may inform, it may not + satisfy the gate. + +Silent degradation that satisfies a required gate is **forbidden**. If a fork PR +needs the real check, it must run via a maintainer-triggered same-repo path +(where the secret exists and the check therefore fails closed), not by quietly +passing the required context with no verification. + +### Auth-failure vs. genuine-absence — do not conflate + +Distinguish the two so a real finding is never masked and a masked finding is +never mistaken for real: + +- **`403` (or 401) on a protected context → fail closed.** You could not verify; + that is a check failure, not a finding about the resource. +- **A real `404` from a read made *with a valid, sufficiently-scoped token* → + the real finding.** The resource is genuinely absent; report it as such. + +A `403` reported as "resource not found" is itself a fail-open bug. + +### Required practice + +Every gate that depends on a token, an identity, or an external read MUST ship +with a test or workflow-lint covering the **absent-identity / unauthorized / +missing-file path** that asserts the gate **FAILS** (not skips, not passes). +Add or update that coverage in the **same PR** that adds or changes the gate. +A gate without a proven failure path is not yet a gate. + +### Violations seen in this codebase (all merge-blocking if reintroduced) + +- **serving-e2e** reporting vacuously GREEN when the Infisical identity is + absent (no per-(provider × auth) completion was actually exercised). +- **branch-protection / BP-drift lints** returning `0` on a `403` instead of + failing closed on the unverifiable response. +- **verify-template-models** run without `-strict`, so a drift it could not + confirm passed silently. +- A **referenced-but-absent pytest file** that collects zero tests and reports + green — silent pass with no assertions executed. + +Each of these is a fail-open gate and is a merge blocker until it fails loud and +closed on protected contexts. See also the production fail-closed defaults in +`runbooks/sop-production-cicd.md` (*Production Defaults*), which apply the same +principle to deploy-time gates. + +--- + ## References - #2159 — gate auto-trigger not firing (root cause: stale PR heads lacking diff --git a/runbooks/sop-production-cicd.md b/runbooks/sop-production-cicd.md index 93d3a1db6..5593361a3 100644 --- a/runbooks/sop-production-cicd.md +++ b/runbooks/sop-production-cicd.md @@ -35,6 +35,7 @@ Every production CI/CD PR must include concrete answers for: - Verification: how production state is proven after deployment. - Logging: proof that CI logs do not contain raw production runtime, SSM, or secret-adjacent output. - Rollback: the exact command, variable, or workflow to return to a known-good tag/digest. +- No fail-open gates: required checks fail loud + closed on protected contexts (no skip/`|| true`/`403`-as-pass). See `runbooks/dev-sop.md` § *Fail-closed CI integrity*. ## Human Review -- 2.52.0