docs(sop): fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING) #2325

Merged
claude-ceo-assistant merged 1 commit from docs/sop-fail-closed-ci into main 2026-06-06 03:16:41 +00:00
2 changed files with 87 additions and 0 deletions
@@ -121,6 +121,92 @@ python -m pytest .gitea/scripts/tests/test_gate_auto_fire_live.py -v
---
## 6. Fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING)
**Rule:** No CI workflow, CI script, or test check may **FAIL OPEN** — i.e. it
must never report GREEN (exit 0, skip, warn-and-continue, `|| true`, or any
"return success") when it could **not actually verify its invariant**. A check
that cannot verify MUST **fail loud** (`::error::` annotation **and** a nonzero
exit) and **fail closed** (treat inability-to-verify as **FAILURE**, never as a
pass). An unverifiable check is a red check, full stop.
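A minimal Python sketch of the rule, assuming a gate is a callable that returns `True` only when its invariant was actually verified. The names `run_gate` and `Unverifiable` are hypothetical and not taken from this repository; the `::error::` line follows the annotation form named above.

```python
import sys
from typing import Callable, TextIO


class Unverifiable(Exception):
    """Hypothetical: raised when a check cannot verify its invariant."""


def run_gate(name: str, check: Callable[[], bool], out: TextIO = sys.stdout) -> int:
    """Return a process exit code; 0 only when `check` positively verified."""
    try:
        verified = check()
    except Exception as exc:
        # Inability to verify is scored as failure, whatever the cause.
        print(f"::error::{name}: could not verify ({type(exc).__name__}: {exc})", file=out)
        return 1
    if verified is not True:
        # None, False, or any other non-True result is not an earned green.
        print(f"::error::{name}: invariant not verified", file=out)
        return 1
    print(f"{name}: verified", file=out)
    return 0
```

A script built this way would end with `sys.exit(run_gate(...))` and would not be wrapped in `|| true` by the calling workflow.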
This is the same family of bug as the no-flakes rule (§ *No flakes*): a green
that isn't real. A flake is a green/red that flips for an unnamed reason; a
fail-open gate is a green that was never earned. Both let unverified code reach
`main`, and both are merge-blocking.
### Applies to
Required / hard gates on **protected contexts**: pushes to `main`, internal
protected branches, and **same-repo** PRs (`pull_request_target`). On these
contexts the *cause* of an unverifiable run is **irrelevant** — every one of the
following MUST fail closed:
- auth failure (401 / 403),
- missing token or identity,
- under-scoped credential,
- unreachable dependency (network, Infisical, control-plane, registry),
- a required test file that is absent or collects zero tests,
- any transient error the check cannot prove was benign.
"I couldn't check" is reported and scored exactly like "the check failed." A
gate that can be silently defanged by removing a secret is not a gate.
### The one allowed exception — explicit trust-boundary split
Legitimate degradation is permitted **only** where the secret genuinely cannot
exist — e.g. **fork PRs**, which by design have no access to repo secrets. Such
degradation is allowed **only** when it is:
1. gated behind an **explicit** fork / advisory branch in the workflow logic
(an intentional trust-boundary split, not an incidental `if: secrets...`),
2. **clearly marked advisory** in its name and output, and
3. **NOT counted as a passing REQUIRED context** — it may inform, it may not
satisfy the gate.
Silent degradation that satisfies a required gate is **forbidden**. If a fork PR
needs the real check, it must run via a maintainer-triggered same-repo path
(where the secret exists and the check therefore fails closed), not by quietly
passing the required context with no verification.
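The trust-boundary split can be sketched as a single explicit branch. This is an illustration under stated assumptions: the context names `ci/gate` and `ci/gate (advisory)` and the `GateReport` type are hypothetical, and how a result is published to the forge is left out.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_CONTEXT = "ci/gate"             # hypothetical required context name
ADVISORY_CONTEXT = "ci/gate (advisory)"  # hypothetical; never listed as required


@dataclass(frozen=True)
class GateReport:
    context: str
    counts_as_required: bool
    passed: bool


def report_gate(is_fork_pr: bool, verified: Optional[bool]) -> GateReport:
    """`verified` is True/False when the check ran, None when it could not verify."""
    if is_fork_pr:
        # Explicit trust-boundary branch: advisory name, never satisfies the gate.
        return GateReport(ADVISORY_CONTEXT, counts_as_required=False, passed=verified is True)
    # Protected context: anything other than a verified True is a failure.
    return GateReport(REQUIRED_CONTEXT, counts_as_required=True, passed=verified is True)
```

The design point is that the fork branch cannot emit the required context name at all, so removing a secret can never turn the required context green.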
### Auth-failure vs. genuine-absence — do not conflate
Distinguish the two so a real finding is never masked and a masked finding is
never mistaken for real:
- **`403` (or 401) on a protected context → fail closed.** You could not verify;
that is a check failure, not a finding about the resource.
- **A real `404` from a read made *with a valid, sufficiently-scoped token* →
the real finding.** The resource is genuinely absent; report it as such.
A `403` reported as "resource not found" is itself a fail-open bug.
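One way to keep the two cases apart is to classify every read into three outcomes instead of two. The sketch below assumes the caller already knows whether its token is valid and sufficiently scoped (for example from a prior authenticated probe); `Outcome` and `classify_read` are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    PRESENT = "present"
    ABSENT = "absent"              # a real finding about the resource
    UNVERIFIABLE = "unverifiable"  # fail closed on protected contexts


def classify_read(status: int, token_valid_and_scoped: bool) -> Outcome:
    """Classify an HTTP read without turning an auth failure into a finding."""
    if status in (401, 403):
        return Outcome.UNVERIFIABLE
    if status == 404:
        # Only a 404 seen with a valid, sufficiently-scoped token is trusted.
        return Outcome.ABSENT if token_valid_and_scoped else Outcome.UNVERIFIABLE
    if 200 <= status < 300:
        return Outcome.PRESENT
    # Timeouts, 5xx, and anything else cannot be proven benign.
    return Outcome.UNVERIFIABLE
```

A caller on a protected context would exit nonzero for `UNVERIFIABLE` and report `ABSENT` as the finding it is.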
### Required practice
Every gate that depends on a token, an identity, or an external read MUST ship
with a test or workflow-lint covering the **absent-identity / unauthorized /
missing-file path** that asserts the gate **FAILS** (not skips, not passes).
Add or update that coverage in the **same PR** that adds or changes the gate.
A gate without a proven failure path is not yet a gate.
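A sketch of what that coverage might look like, written as plain pytest-style test functions. The gate `gate_exit_code` and the variable name `GATE_TOKEN` are hypothetical stand-ins for a real gate and its credential; the point is that the failure path is asserted directly.

```python
from typing import Mapping


def gate_exit_code(env: Mapping[str, str], read_status: int) -> int:
    """Hypothetical gate: 0 only with an identity present and a 200 read."""
    if not env.get("GATE_TOKEN", "").strip():
        print("::error::GATE_TOKEN is missing; cannot verify")
        return 1
    if read_status != 200:
        print(f"::error::read returned {read_status}; cannot verify")
        return 1
    return 0


def test_absent_identity_fails():
    # Not a skip and not a pass: the gate must exit nonzero.
    assert gate_exit_code({}, read_status=200) != 0


def test_unauthorized_read_fails():
    assert gate_exit_code({"GATE_TOKEN": "t"}, read_status=403) != 0


def test_verified_read_passes():
    assert gate_exit_code({"GATE_TOKEN": "t"}, read_status=200) == 0
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the gate, is what makes the absent-identity path cheap to test.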
### Violations seen in this codebase (all merge-blocking if reintroduced)
- **serving-e2e** reporting vacuously GREEN when the Infisical identity is
absent (no per-(provider × auth) completion was actually exercised).
- **branch-protection / BP-drift lints** returning `0` on a `403` instead of
failing closed on the unverifiable response.
- **verify-template-models** run without `-strict`, so a drift it could not
confirm passed silently.
- A **referenced-but-absent pytest file** that collects zero tests and reports
green — silent pass with no assertions executed.
Each of these is a fail-open gate and is a merge blocker until it fails loud and
closed on protected contexts. See also the production fail-closed defaults in
`runbooks/sop-production-cicd.md` (*Production Defaults*), which apply the same
principle to deploy-time gates.
---
## References
- #2159 — gate auto-trigger not firing (root cause: stale PR heads lacking
@@ -35,6 +35,7 @@ Every production CI/CD PR must include concrete answers for:
- Verification: how production state is proven after deployment.
- Logging: proof that CI logs do not contain raw production runtime, SSM, or secret-adjacent output.
- Rollback: the exact command, variable, or workflow to return to a known-good tag/digest.
- No fail-open gates: required checks fail loud + closed on protected contexts (no skip/`|| true`/`403`-as-pass). See `runbooks/dev-sop.md` § *Fail-closed CI integrity*.
## Human Review