docs(sop): fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING) #2325
@@ -121,6 +121,92 @@ python -m pytest .gitea/scripts/tests/test_gate_auto_fire_live.py -v

---

## 6. Fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING)

**Rule:** No CI workflow, CI script, or test check may **FAIL OPEN** — i.e. it
must never report GREEN (exit 0, skip, warn-and-continue, `|| true`, or any
"return success") when it could **not actually verify its invariant**. A check
that cannot verify MUST **fail loud** (`::error::` annotation **and** a nonzero
exit) and **fail closed** (treat inability-to-verify as **FAILURE**, never as a
pass). An unverifiable check is a red check, full stop.

This is the same family of bug as the no-flakes rule (§ *No flakes*): a green
that isn't real. A flake is a green/red that flips for an unnamed reason; a
fail-open gate is a green that was never earned. Both let unverified code reach
`main`, and both are merge-blocking.

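
The rule above can be sketched in Python as a small wrapper, assuming a gate is written as a function that returns `True` only after it has actually verified its invariant. The names `Unverifiable`, `run_gate`, and `check_invariant` are hypothetical and are not taken from this repository; this is an illustration of the principle, not the project's implementation.

```python
from typing import Callable


class Unverifiable(Exception):
    """Raised when a check could not actually verify its invariant."""


def run_gate(check_invariant: Callable[[], bool]) -> int:
    """Return a process exit code: 0 only when the invariant was verified true."""
    try:
        verified = check_invariant()
    except Exception as exc:
        # Fail closed: any error, whatever its cause, counts as a failure.
        # Fail loud: emit an ::error:: annotation as well as a nonzero code.
        print(f"::error::gate could not verify its invariant: {exc!r}")
        return 1
    if verified is not True:
        # None, False, or any other value is not an earned green.
        print("::error::gate did not confirm that the invariant holds")
        return 1
    return 0
```

A script using this sketch would end with `sys.exit(run_gate(my_check))`, so there is no code path that reaches exit 0 without `my_check` having returned `True`.
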
### Applies to

Required / hard gates on **protected contexts**: pushes to `main`, internal
protected branches, and **same-repo** PRs (`pull_request_target`). On these
contexts the *cause* of an unverifiable run is **irrelevant** — every one of the
following MUST fail closed:

- auth failure (401 / 403),
- missing token or identity,
- under-scoped credential,
- unreachable dependency (network, Infisical, control-plane, registry),
- a required test file that is absent or collects zero tests,
- any transient error the check cannot prove was benign.

"I couldn't check" is reported and scored exactly like "the check failed." A
gate that can be silently defanged by removing a secret is not a gate.

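
One item in the list above, a required test file that is absent or collects zero tests, can be sketched as follows. pytest exits with code 5 when no tests were collected; the fail-open risk is a wrapper that skips the step when the file is missing or masks the exit code. The function names `gate_exit_code` and `run_required_tests` are hypothetical, and the sketch assumes pytest is installed in the running interpreter.

```python
import os
import subprocess
import sys
from typing import Optional

# pytest's documented exit code for "no tests were collected".
PYTEST_NO_TESTS_COLLECTED = 5


def gate_exit_code(path_exists: bool, pytest_returncode: Optional[int]) -> int:
    """Map the observed facts to a gate exit code; only a real pass yields 0."""
    if not path_exists:
        print("::error::required test file is absent")
        return 1
    if pytest_returncode == 0:
        return 0
    if pytest_returncode == PYTEST_NO_TESTS_COLLECTED:
        print("::error::required test file collected zero tests")
    else:
        print(f"::error::pytest exited with code {pytest_returncode}")
    return 1


def run_required_tests(test_path: str) -> int:
    """Run pytest on a required file, failing closed if the file is missing."""
    if not os.path.isfile(test_path):
        return gate_exit_code(False, None)
    result = subprocess.run([sys.executable, "-m", "pytest", test_path, "-v"])
    return gate_exit_code(True, result.returncode)
```

Keeping the decision in `gate_exit_code` separate from the subprocess call makes the absent-file and zero-tests paths easy to cover with unit tests.
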
### The one allowed exception — explicit trust-boundary split

Legitimate degradation is permitted **only** where the secret genuinely cannot
exist — e.g. **fork PRs**, which by design have no access to repo secrets. Such
degradation is allowed **only** when it is:

1. gated behind an **explicit** fork / advisory branch in the workflow logic
   (an intentional trust-boundary split, not an incidental `if: secrets...`),
2. **clearly marked advisory** in its name and output, and
3. **NOT counted as a passing REQUIRED context** — it may inform, it may not
   satisfy the gate.

Silent degradation that satisfies a required gate is **forbidden**. If a fork PR
needs the real check, it must run via a maintainer-triggered same-repo path
(where the secret exists and the check therefore fails closed), not by quietly
passing the required context with no verification.

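
The three conditions above could be expressed as an explicit decision function. This is a sketch under assumptions: `GatePlan`, `plan_gate`, and the context name `example-gate` are hypothetical, and how a workflow detects a fork PR is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePlan:
    context_name: str         # status context the result is reported under
    counts_as_required: bool  # whether the result may satisfy the required gate
    run_real_check: bool      # whether the secret-backed check runs at all


def plan_gate(is_fork_pr: bool, secret_present: bool) -> GatePlan:
    """Decide how the gate runs; degradation exists only on the fork branch."""
    if is_fork_pr:
        # Explicit trust-boundary split: advisory by name, never required.
        return GatePlan(
            context_name="example-gate (advisory, fork PR)",
            counts_as_required=False,
            run_real_check=False,
        )
    if not secret_present:
        # Protected context with no secret: unverifiable, so fail closed.
        raise RuntimeError("secret missing on a protected context")
    return GatePlan(
        context_name="example-gate",
        counts_as_required=True,
        run_real_check=True,
    )
```

The point of the shape is that the missing-secret case on a protected context has no advisory fallback: the only way to get a non-required result is to be on the fork branch.
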
### Auth-failure vs. genuine-absence — do not conflate

Distinguish the two so a real finding is never masked and a masked finding is
never mistaken for real:

- **`403` (or 401) on a protected context → fail closed.** You could not verify;
  that is a check failure, not a finding about the resource.
- **A real `404` from a read made *with a valid, sufficiently-scoped token* →
  the real finding.** The resource is genuinely absent; report it as such.

A `403` reported as "resource not found" is itself a fail-open bug.

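
A minimal sketch of that distinction, assuming the caller already knows whether the token was validated (for example by an earlier authenticated read that succeeded). `Outcome`, `classify_read`, and the `token_verified` parameter are hypothetical names introduced for illustration.

```python
from enum import Enum


class Outcome(Enum):
    PRESENT = "present"            # verified: the resource exists
    ABSENT = "absent"              # verified: the resource is genuinely missing
    UNVERIFIABLE = "unverifiable"  # could not verify; the gate must fail closed


def classify_read(status: int, token_verified: bool) -> Outcome:
    """Classify an HTTP read without conflating auth failure and absence."""
    if status in (401, 403):
        # Auth failure says nothing about the resource itself.
        return Outcome.UNVERIFIABLE
    if status == 200:
        return Outcome.PRESENT
    if status == 404 and token_verified:
        # Only a 404 seen through a known-good token is a real finding.
        return Outcome.ABSENT
    # Everything else (5xx, timeouts mapped to codes, unverified 404) fails closed.
    return Outcome.UNVERIFIABLE
```

Only `PRESENT` and `ABSENT` are findings about the resource. `UNVERIFIABLE` should map to an `::error::` annotation and a nonzero exit.
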
### Required practice

Every gate that depends on a token, an identity, or an external read MUST ship
with a test or workflow-lint covering the **absent-identity / unauthorized /
missing-file path** that asserts the gate **FAILS** (not skips, not passes).
Add or update that coverage in the **same PR** that adds or changes the gate.
A gate without a proven failure path is not yet a gate.

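
What such coverage might look like, as pytest-style tests against a toy gate. Everything here is hypothetical: `gate_main`, the `GATE_TOKEN` variable, and the injected `read_status` callable stand in for a real gate and its external read.

```python
from typing import Callable, Mapping


def gate_main(env: Mapping[str, str], read_status: Callable[[str], int]) -> int:
    """Toy gate: exit 0 only after an authenticated read returned 200."""
    token = env.get("GATE_TOKEN", "").strip()
    if not token:
        print("::error::GATE_TOKEN is missing; cannot verify, failing closed")
        return 1
    status = read_status(token)
    if status != 200:
        print(f"::error::read returned {status}; cannot verify, failing closed")
        return 1
    return 0


def test_absent_identity_fails():
    def must_not_be_called(token: str) -> int:
        raise AssertionError("read attempted without a token")

    assert gate_main({}, must_not_be_called) == 1


def test_unauthorized_fails():
    assert gate_main({"GATE_TOKEN": "t"}, lambda token: 403) == 1


def test_verified_read_passes():
    assert gate_main({"GATE_TOKEN": "t"}, lambda token: 200) == 0
```

The first two tests assert a nonzero exit rather than a skip, which is the property the rule asks every gate to prove.
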
### Violations seen in this codebase (all merge-blocking if reintroduced)

- **serving-e2e** reporting vacuously GREEN when the Infisical identity is
  absent (no per-(provider × auth) completion was actually exercised).
- **branch-protection / BP-drift lints** returning `0` on a `403` instead of
  failing closed on the unverifiable response.
- **verify-template-models** run without `-strict`, so a drift it could not
  confirm passed silently.
- A **referenced-but-absent pytest file** that collects zero tests and reports
  green — silent pass with no assertions executed.

Each of these is a fail-open gate and is a merge blocker until it fails loud and
closed on protected contexts. See also the production fail-closed defaults in
`runbooks/sop-production-cicd.md` (*Production Defaults*), which apply the same
principle to deploy-time gates.

---

## References

- #2159 — gate auto-trigger not firing (root cause: stale PR heads lacking

@@ -35,6 +35,7 @@ Every production CI/CD PR must include concrete answers for:
- Verification: how production state is proven after deployment.
- Logging: proof that CI logs do not contain raw production runtime, SSM, or secret-adjacent output.
- Rollback: the exact command, variable, or workflow to return to a known-good tag/digest.
- No fail-open gates: required checks fail loud + closed on protected contexts (no skip/`|| true`/`403`-as-pass). See `runbooks/dev-sop.md` § *Fail-closed CI integrity*.

## Human Review