security: remove operational runbooks from PUBLIC molecule-core #3183

Merged
devops-engineer merged 6 commits from security/remove-public-runbooks-20260623 into main 2026-06-23 21:38:24 +00:00
6 changed files with 0 additions and 980 deletions
@@ -1,217 +0,0 @@
# Developer SOP — PR review gate auto-fire and stale-head handling
> Last updated: 2026-06-03 (cp#2159 follow-up)
>
> Applies to: all core-PR authors and reviewers on `molecule-core` and sibling
> repos using the `qa-review` + `security-review` branch-protection gates.
---
## 1. Gitea PR-head workflow-selection rule
**Rule:** For `pull_request_target` and `pull_request_review` events, Gitea
loads the workflow definition from the **PR's HEAD branch**, not from the
base (`main`) branch.
This is different from GitHub Actions, where `pull_request_target` always
loads workflows from the base branch. Gitea's behaviour means:
- A PR that was opened **before** the `pull_request_review` trigger was added
to `qa-review.yml` / `security-review.yml` will **NOT** auto-fire on review,
because its HEAD still contains the old workflow YAML (no trigger).
- A PR that was opened **after** the trigger was added (or that has been
rebased onto a commit containing the trigger) **WILL** auto-fire, because its
HEAD contains the new workflow YAML.
### Ops implication
| PR head contains `pull_request_review` trigger? | Behaviour on APPROVED review |
|---|---|
| **Yes** (cut from current main, or rebased) | Workflows auto-queue, evaluate, and POST the `(pull_request_target)` context automatically. No slash-command needed. |
| **No** (stale head, opened before #2157) | Nothing fires. Use `/qa-recheck` + `/security-recheck` slash-commands in a PR comment, OR rebase onto current main. |
---
## 2. Standard core-PR flow (post-#2157)
```
1. Author opens PR from a branch based on current main
   → qa-review + security-review workflows run on pull_request_target
   → status contexts post (initial eval, usually red until reviews land)
2. Reviewers submit real APPROVED reviews
   → If PR head has the trigger: workflows AUTO-FIRE on pull_request_review
   → Contexts flip green (or stay red if reviewer is not in team)
3. [Optional] If contexts did not flip (stale head, event lost, etc.):
   → Anyone can comment `/qa-recheck` or `/security-recheck`
   → sop-checklist.yml refires the evaluator (read-only, idempotent)
4. Both qa-review + security-review contexts are green
   → Plain Do:merge (no force-merge needed)
```
### Key point
The `/qa-recheck` and `/security-recheck` commands are a **backstop**, not the
primary path. PRs cut from current main should auto-fire without manual
intervention.
---
## 3. Diagnosing a stale head
If a PR has real team-member APPROVED reviews but the qa/security contexts
remain red and no workflow run appears on the PR's "Actions" tab for the
review event, the PR head is likely stale.
### Quick check
```bash
# From the PR page, look at the head commit SHA, then:
curl -sS "https://git.moleculesai.app/api/v1/repos/molecule-ai/molecule-core/contents/.gitea/workflows/qa-review.yml?ref=<HEAD_SHA>" \
| jq -r '.content' | base64 -d | grep -c 'pull_request_review'
# 0 → stale head (no trigger in that version of the workflow)
# >0 → trigger present; auto-fire SHOULD work (if it didn't, file a tracker)
```
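The same check can be sketched in Python for use inside a script. This is a minimal sketch, assuming the JSON body comes from the contents endpoint queried above and carries the file base64-encoded in a `content` field; the function name `head_has_review_trigger` is hypothetical and not part of the repo.

```python
import base64
import json


def head_has_review_trigger(contents_api_body: str) -> bool:
    """Return True if the workflow file at the PR head mentions pull_request_review.

    `contents_api_body` is the raw JSON body returned by the contents
    endpoint. This is a plain substring check, like the grep in the shell
    version, so a commented-out trigger would also match.
    """
    payload = json.loads(contents_api_body)
    # A missing "content" key raises KeyError on purpose: an unreadable
    # response should fail loudly rather than be reported as "stale".
    workflow_yaml = base64.b64decode(payload["content"]).decode("utf-8")
    return "pull_request_review" in workflow_yaml
```

A `False` result corresponds to the `0` case above (stale head); `True` corresponds to `>0`.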
### Automated diagnostic
The test suite includes `test_gate_stale_head_diagnostic.py`, which reports
"auto-fire impossible for this PR" when the head lacks the trigger. Run it
in CI or locally with:
```bash
PR_NUMBER=123 python -m pytest .gitea/scripts/tests/test_gate_stale_head_diagnostic.py -v
```
---
## 4. Rebasing vs. slash-refire
| Approach | When to use | Trade-off |
|---|---|---|
| **Rebase onto current main** | PR is genuinely stale (head lacks trigger OR head is far behind main) | Clean history, gets all recent fixes, but requires force-push and re-approval if the branch was protected |
| **`/qa-recheck` + `/security-recheck`** | PR head is recent but the review event was missed, or you want to avoid rebase churn | Quick, no force-push, but does NOT fix a missing trigger in the head |
**Do not** use slash-refire as a substitute for rebasing a stale head. If the
workflow YAML in the PR head does not contain `pull_request_review`, no amount
of rechecking will make auto-fire work.
---
## 5. Live-fire verification
The `test_gate_auto_fire_live.py` regression test exercises the full runtime
path: it submits an APPROVED review to a test PR and polls for the
`(pull_request_target)` status contexts. It is skipped when no API token is
available, and is intended to catch runtime non-fire that static structural
tests (e.g. `test_gate_review_auto_fire.py`) cannot detect.
Run manually with:
```bash
export GITEA_HOST=git.moleculesai.app
export GITEA_TOKEN=<your-token>
export REPO=molecule-ai/molecule-core
export LIVEFIRE_PR_NUMBER=<test-pr-number>
python -m pytest .gitea/scripts/tests/test_gate_auto_fire_live.py -v
```
---
## 6. Fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING)
**Rule:** No CI workflow, CI script, or test check may **FAIL OPEN** — i.e. it
must never report GREEN (exit 0, skip, warn-and-continue, `|| true`, or any
"return success") when it could **not actually verify its invariant**. A check
that cannot verify MUST **fail loud** (`::error::` annotation **and** a nonzero
exit) and **fail closed** (treat inability-to-verify as **FAILURE**, never as a
pass). An unverifiable check is a red check, full stop.
This is the same family of bug as the no-flakes rule (§ *No flakes*): a green
that isn't real. A flake is a green/red that flips for an unnamed reason; a
fail-open gate is a green that was never earned. Both let unverified code reach
`main`, and both are merge-blocking.
### Applies to
Required / hard gates on **protected contexts**: pushes to `main`, internal
protected branches, and **same-repo** PRs (`pull_request_target`). On these
contexts the *cause* of an unverifiable run is **irrelevant** — every one of the
following MUST fail closed:
- auth failure (401 / 403),
- missing token or identity,
- under-scoped credential,
- unreachable dependency (network, Infisical, control-plane, registry),
- a required test file that is absent or collects zero tests,
- any transient error the check cannot prove was benign.
"I couldn't check" is reported and scored exactly like "the check failed." A
gate that can be silently defanged by removing a secret is not a gate.
### The one allowed exception — explicit trust-boundary split
Legitimate degradation is permitted **only** where the secret genuinely cannot
exist — e.g. **fork PRs**, which by design have no access to repo secrets. Such
degradation is allowed **only** when it is:
1. gated behind an **explicit** fork / advisory branch in the workflow logic
(an intentional trust-boundary split, not an incidental `if: secrets...`),
2. **clearly marked advisory** in its name and output, and
3. **NOT counted as a passing REQUIRED context** — it may inform, it may not
satisfy the gate.
Silent degradation that satisfies a required gate is **forbidden**. If a fork PR
needs the real check, it must run via a maintainer-triggered same-repo path
(where the secret exists and the check therefore fails closed), not by quietly
passing the required context with no verification.
### Auth-failure vs. genuine-absence — do not conflate
Distinguish the two so a real finding is never masked and a masked finding is
never mistaken for real:
- **`403` (or 401) on a protected context → fail closed.** You could not verify;
that is a check failure, not a finding about the resource.
- **A real `404` from a read made *with a valid, sufficiently-scoped token* →
the real finding.** The resource is genuinely absent; report it as such.
A `403` reported as "resource not found" is itself a fail-open bug.
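The distinction above can be sketched as a small classifier. This is an illustrative sketch only: the names `Outcome`, `classify_read`, and `gate_exit_code` are hypothetical, and `token_verified` assumes the caller has already established separately that the token is valid and sufficiently scoped.

```python
from enum import Enum


class Outcome(Enum):
    VERIFIED = "verified"          # the read succeeded; the invariant can be checked
    FINDING_ABSENT = "absent"      # resource genuinely missing (a real finding)
    UNVERIFIABLE = "unverifiable"  # could not check; scored exactly like a failure


def classify_read(status_code: int, token_verified: bool) -> Outcome:
    """Classify an external read made by a gate on a protected context."""
    if not token_verified:
        # Without a known-good token, even a 404 proves nothing.
        return Outcome.UNVERIFIABLE
    if status_code == 200:
        return Outcome.VERIFIED
    if status_code == 404:
        return Outcome.FINDING_ABSENT
    # 401, 403, 5xx, and anything else: inability to verify.
    return Outcome.UNVERIFIABLE


def gate_exit_code(outcome: Outcome, invariant_holds: bool = False) -> int:
    """Exit 0 only when the read was verified AND the invariant held."""
    if outcome is Outcome.VERIFIED and invariant_holds:
        return 0
    return 1
```

In this sketch an absent resource also exits nonzero; a gate for which absence is an acceptable state would need to handle `FINDING_ABSENT` explicitly rather than fold it into success by default.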
### Required practice
Every gate that depends on a token, an identity, or an external read MUST ship
with a test or workflow-lint covering the **absent-identity / unauthorized /
missing-file path** that asserts the gate **FAILS** (not skips, not passes).
Add or update that coverage in the **same PR** that adds or changes the gate.
A gate without a proven failure path is not yet a gate.
### Violations seen in this codebase (all merge-blocking if reintroduced)
- **serving-e2e** reporting vacuously GREEN when the Infisical identity is
absent (no per-(provider × auth) completion was actually exercised).
- **branch-protection / BP-drift lints** returning `0` on a `403` instead of
failing closed on the unverifiable response.
- **verify-template-models** run without `-strict`, so a drift it could not
confirm passed silently.
- A **referenced-but-absent pytest file** that collects zero tests and reports
green — silent pass with no assertions executed.
Each of these is a fail-open gate and is a merge blocker until it fails loud and
closed on protected contexts. See also the production fail-closed defaults in
`runbooks/sop-production-cicd.md` (*Production Defaults*), which apply the same
principle to deploy-time gates.
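As one illustration of the last violation, a wrapper can refuse to report green when a referenced test file is missing or collects nothing. This is a hedged sketch, not the repo's actual guard: `check_collection` is a hypothetical helper, and the caller is assumed to supply the number of tests the runner collected. (pytest itself exits with code 5 when it collects no tests, which a wrapper must not swallow.)

```python
import os


def check_collection(test_path: str, collected: int) -> int:
    """Return a process exit code: 0 only if the file exists and ran >= 1 test."""
    if not os.path.isfile(test_path):
        # Fail loud: annotate, then fail closed with a nonzero code.
        print(f"::error::required test file missing: {test_path}")
        return 1
    if collected < 1:
        print(f"::error::{test_path} collected zero tests")
        return 1
    return 0
```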
---
## References
- #2159 — gate auto-trigger not firing (root cause: stale PR heads lacking
the `pull_request_review` trigger, NOT a workflow code defect)
- #765 — static structural regression test for gate configuration
- #2157 — merged trigger addition (`pull_request_review` types: [submitted])
- #2020 — milestone confirming gate infrastructure is stable
- RFC#324 — qa-review + security-review design
@@ -1,112 +0,0 @@
# Gitea Actions migration checklist (molecule-core)
Created 2026-05-11 as part of **RFC `molecule-ai/internal#219` §1** — the
sweep of `.github/workflows/*.yml` files in `molecule-core` after the
2026-05-06 GitHub → Gitea migration. Documents which workflows were
retired, which were ported, and the reasoning for each.
The sweep used the four-surface audit pattern from saved memory
`feedback_gitea_actions_migration_audit_pattern`:
1. **YAML** — drop `workflow_dispatch.inputs`, `merge_group`,
`environment:`. Adjust `runs-on:`. Set `env.GITHUB_SERVER_URL`
per `feedback_act_runner_github_server_url`.
2. **Cache** — verify `actions/cache@v4` / `upload-artifact` pin
compatibility with Gitea 1.22.x runner.
3. **Token** — auto-injected `GITHUB_TOKEN` works for same-repo
operations; cross-repo dispatch needs explicit secret.
4. **Docs** — top-of-file "Ported from .github/workflows/X.yml on
YYYY-MM-DD per RFC internal#219 §1 sweep" comment.
Per RFC §1 contract, all ports land with `continue-on-error: true` on
every job to surface bugs without blocking; a follow-up PR flips
`continue-on-error: false` after triage.
## Category A — already mirrored (deleted .github/ copy)
These workflows had a working `.gitea/workflows/X.yml` twin at the time
of the sweep. The `.github/` copies were silently dead (Gitea Actions
in molecule-core only registers `.gitea/workflows/`) and have been
removed.
| File | .gitea/ twin |
|---|---|
| `publish-runtime.yml` | `.gitea/workflows/publish-runtime.yml` (ported via issue #206) |
| `secret-scan.yml` | `.gitea/workflows/secret-scan.yml` |
## Category B — GitHub-only, retired
These workflows depend on GitHub-specific surface (merge queue, GitHub
auto-merge primitive, github.com REST API, GHCR registry, CodeQL action
that hits api.github.com bundle endpoints) that Gitea does not provide.
No equivalent Gitea-side workflow is needed; the underlying mechanism
either doesn't exist on Gitea or has been replaced by a different
pipeline.
| File | Why retired |
|---|---|
| `auto-tag-runtime.yml` | Superseded by `.gitea/workflows/publish-runtime-autobump.yml` (auto-bump-on-workspace-edit). The autobump only does patch bumps; the deleted workflow supported `release:minor` / `release:major` PR-label-driven bumps. Follow-up issue should track restoring label-driven minor/major if anyone uses it. |
| `branch-protection-drift.yml` | Targets `Molecule-AI/molecule-core` on GitHub via `gh api /repos/.../branch-protection` — entirely GitHub-API specific. `tools/branch-protection/drift_check.sh` and `apply.sh` reference the GitHub schema (status_check_contexts, dismiss_stale_reviews, etc.) which differs from Gitea's `branch_protections` shape. Rebuilding for Gitea is out of scope for the RFC #219 sweep; follow-up issue needed for Gitea-compatible branch-protection drift detection. |
| `check-merge-group-trigger.yml` | The workflow's own header (lines 18-23) documents that it's vacuously satisfied on Gitea — Gitea has no merge queue, no `merge_group:` event type, no `gh-readonly-queue/...` refs. Nothing to lint. |
| `codeql.yml` | The workflow's own header (lines 3-67) documents that `github/codeql-action/init@v4` hits api.github.com bundle endpoints not implemented by Gitea (observed: `::error::404 page not found` in Initialize CodeQL step). Per Hongming decision 2026-05-07 (task #156): CodeQL is ADVISORY/non-blocking until a Gitea-compatible SAST pipeline lands. Replacement options (Semgrep self-host, Sonatype, GitHub-mirror-for-SAST) tracked in #156. |
| `pr-guards.yml` | The workflow's own header documents that Gitea has no `gh pr merge --auto` primitive — the guard is a structural no-op on Gitea. Branch protection on `main` does NOT reference any `pr-guards` check name; deletion is safe. |
| `promote-latest.yml` | Uses `imjasonh/setup-crane` against `ghcr.io/molecule-ai/platform` — the GHCR registry was retired during the 2026-05-06 Gitea migration (per `staging-verify.yml` header notes — file was renamed from `canary-verify.yml` on 2026-05-11; the canonical tenant image moved to ECR `153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant`). The workflow can no longer find any image to retag. Follow-up issue suggested if an ECR-based retag promote is desired. |
## Category C — ported to .gitea/
These workflows had real ongoing CI value but no Gitea-side equivalent.
Each was ported to `.gitea/workflows/X.yml` with:
- `workflow_dispatch.inputs` removed (Gitea 1.22.6 parser rejects them —
per `feedback_gitea_workflow_dispatch_inputs_unsupported`)
- `merge_group:` trigger removed (no merge queue)
- `environment:` blocks removed (Gitea has no environments)
- `dorny/paths-filter@v4` replaced with inline `git diff` (per the
pattern established in PR#372 ci.yml port)
- `env.GITHUB_SERVER_URL: https://git.moleculesai.app` set at workflow
level (belt-and-suspenders for `actions/checkout` etc.)
- `continue-on-error: true` on every job (RFC §1 contract — surface
defects without blocking; follow-up PR flips after triage)
- Top-of-file header: "Ported from .github/workflows/X.yml on
YYYY-MM-DD per RFC internal#219 §1 sweep."
See the C-1 / C-2 / C-3 sweep PRs for the file lists and per-file
adjustments.
## Category D — parser-rejected (none for molecule-core)
The RFC #219 §1 brief lists 7 workflows as parser-rejected (`audit-orphan-instances`,
`bake-thin-ami`, `bench-provision-time`, `cache-probe`, `deploy-pipeline`,
`e2e-tunnel-reboot`, `persona-author-check`). Verification against
molecule-core's tree (and the `docker logs molecule-gitea-1` parser-rejection
log) shows these workflows belong to other repos:
- `audit-orphan-instances`, `bake-thin-ami`, `bench-provision-time`,
`deploy-pipeline`, `e2e-tunnel-reboot` live in `molecule-ai/molecule-controlplane`
- `cache-probe`, `persona-author-check` live in `molecule-ai/internal`
For molecule-core, **Category D is empty**.
## Verification
After all sweep PRs land:
```bash
# Should produce nothing.
ls .github/workflows/*.yml | grep -vF ci.yml
# Should list 6 working workflows from the .gitea/ port directory + the
# C-1/C-2/C-3 ports.
ls .gitea/workflows/*.yml
```
Gitea Actions server should produce NO `[W] ignore invalid workflow`
lines for any `.gitea/workflows/X.yml` in molecule-core when commits
land on `main`:
```bash
ssh root@5.78.80.188 'docker logs molecule-gitea-1 --since 10m 2>&1 \
| grep "ignore invalid workflow" \
| grep -i molecule-core'
# Expected: empty.
```
@@ -1,104 +0,0 @@
# Gitea Merge Queue
Gitea 1.22.6 does not provide a real merge queue. Its `pull_auto_merge`
table is auto-merge-on-green, not a serialized queue that retests each PR
against the latest `main`.
`gitea-merge-queue` is the external queue for `molecule-core`.
## Queue Contract
**Auto-discovery (opt-OUT, default).** You do NOT need to label a PR. The bot
auto-discovers every open same-repo PR and merges any that meets the bar. The
`merge-queue` label is now optional metadata, not a gate. This removed the
historical autonomy gap: agent Gitea tokens lack `write:issue` (labels are
issue-scoped), so agents could never self-label and ready PRs stalled.
To keep a PR OUT of autonomous merging, add an opt-OUT label:
`merge-queue-hold`, `do-not-auto-merge`, or `wip`. Draft PRs are also skipped.
The bot processes one PR per tick:
1. Confirms `main`'s branch-protection-required push contexts are green.
2. Selects the oldest open same-repo PR that is NOT opt-out-labeled and NOT a
draft (auto-discovery). With `AUTO_DISCOVER=0` it falls back to legacy
opt-IN: only PRs carrying `merge-queue` are considered.
3. Rejects fork PRs because the queue may only update same-repo branches.
4. If the PR head does not contain current `main`, calls Gitea's
`/pulls/{n}/update?style=merge` endpoint and waits for CI on the new head.
5. Merges only when, on the PR's CURRENT head sha:
- `>= required_approvals` distinct genuine official `APPROVED` reviews from
the recognised reviewer set (read from branch protection; default 2),
- no open official `REQUEST_CHANGES`,
- every branch-protection-required status context is green, and
- the PR is `mergeable` (Gitea returns `True`; `None`/`False` = wait).
The merge bar is unchanged by auto-discovery — only WHICH PRs are considered
changes. The workflow is serialized with `concurrency`, so two PRs cannot be
merged against the same observed `main`.
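The selection and merge-bar rules above can be sketched as two pure functions. This is a simplified illustration, not the code in `gitea-merge-queue.py`: the dict shapes (`labels`, `draft`, `is_fork`, `head_sha`, `official`, `commit_id`) are hypothetical, PR number is used as a stand-in for age, and `reviews` is assumed to contain only current, non-dismissed reviews.

```python
from typing import Optional

# Opt-out labels named in the queue contract above.
OPT_OUT_LABELS = {"merge-queue-hold", "do-not-auto-merge", "wip"}


def select_next(prs: list, auto_discover: bool = True) -> Optional[dict]:
    """Pick the oldest eligible PR, or None if nothing qualifies."""
    eligible = []
    for pr in prs:
        labels = set(pr.get("labels", []))
        if pr.get("draft") or pr.get("is_fork"):
            continue  # drafts and fork PRs are never merged by the queue
        if labels & OPT_OUT_LABELS:
            continue
        if not auto_discover and "merge-queue" not in labels:
            continue  # legacy opt-in mode (AUTO_DISCOVER=0)
        eligible.append(pr)
    # Assumption: a lower PR number means an older PR.
    return min(eligible, key=lambda pr: pr["number"], default=None)


def meets_merge_bar(pr, reviews, statuses, required_contexts,
                    reviewer_set, required_approvals=2) -> bool:
    """Evaluate the merge bar against the PR's current head SHA."""
    head = pr["head_sha"]
    official = [r for r in reviews if r.get("official")]
    # Distinct approvers from the recognised set, on the current head only.
    approvers = {
        r["user"] for r in official
        if r["state"] == "APPROVED"
        and r["commit_id"] == head
        and r["user"] in reviewer_set
    }
    changes_requested = any(r["state"] == "REQUEST_CHANGES" for r in official)
    # Fail closed: an empty required-context list is treated as unverifiable.
    contexts_green = bool(required_contexts) and all(
        statuses.get(ctx) == "success" for ctx in required_contexts
    )
    return (
        len(approvers) >= required_approvals
        and not changes_requested
        and contexts_green
        and pr.get("mergeable") is True  # None/False means wait
    )
```

Treating an empty required-context list as a failure is a choice made in this sketch to stay consistent with the fail-closed rule; the real bot may handle that case differently.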
## Operator Commands
Queue a PR (optional — auto-discovery already considers every ready PR; the
label is just visible metadata):
```bash
curl -fsS -X POST \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
"https://git.moleculesai.app/api/v1/repos/molecule-ai/molecule-core/issues/<PR>/labels" \
-d '{"labels":["merge-queue"]}'
```
Keep a PR OUT of autonomous merging (opt-OUT — use `merge-queue-hold`,
`do-not-auto-merge`, or `wip`):
```bash
curl -fsS -X POST \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
"https://git.moleculesai.app/api/v1/repos/molecule-ai/molecule-core/issues/<PR>/labels" \
-d '{"labels":["merge-queue-hold"]}'
```
Run the bot manually from a trusted checkout:
```bash
GITEA_TOKEN="$DEVOPS_ENGINEER_TOKEN" \
GITEA_HOST=git.moleculesai.app \
REPO=molecule-ai/molecule-core \
WATCH_BRANCH=main \
QUEUE_LABEL=merge-queue \
HOLD_LABEL=merge-queue-hold \
AUTO_DISCOVER=1 \
OPT_OUT_LABELS=do-not-auto-merge,wip \
REVIEWER_SET=agent-reviewer,agent-researcher,agent-reviewer-cr2 \
UPDATE_STYLE=merge \
python3 .gitea/scripts/gitea-merge-queue.py --dry-run
```
Dry run:
```bash
python3 .gitea/scripts/gitea-merge-queue.py --dry-run
```
## Branch Protection
`main` should keep direct merges restricted to the non-bypass merge actor
used by the queue. Normal humans and agents should not merge directly.
`block_on_outdated_branch` should be enabled as a defense in depth, but it
does not replace the queue. The queue still performs its own current-main
check immediately before merge because branch protection alone cannot
serialize two already-green PRs.
## Failure Handling
If `main` is not green, the queue pauses and does not merge anything.
If a queued PR is stale, the queue updates the PR branch and comments on the
PR. It does not merge until CI runs on the updated head.
If the queue workflow fails, treat it as a CI/CD incident. Do not bypass by
manually merging unless the human operator explicitly accepts the risk.
@@ -1,406 +0,0 @@
# Gitea Actions operational quirks (molecule-core)
Documents persistent operational findings about Gitea Actions runner behaviour
that differ from GitHub Actions and require workarounds in workflow YAML or
runbooks.
> Last updated: 2026-05-12 (infra-runtime-be-agent)
---
## Quirk #1 — Large repo causes fetch timeout on Gitea Actions runner
### Finding
The Gitea Actions runner (container on host `5.78.80.188`) can reach the git
remote (`https://git.moleculesai.app`) over HTTPS — a single-commit shallow
fetch (`--depth=1`) succeeds in ~16 s. However, fetching the **full compressed
repo history** (~75+ MB) exceeds the runner's network timeout window (~15 s).
This is **not a Gitea Actions bug** and **not a network isolation policy**;
it is a repo-size constraint. The runner can reach external hosts (GitHub,
Docker Hub, PyPI) without issue.
### Impact
Workflows that rely on `actions/checkout` with `fetch-depth: 0` (full history)
or `git clone` will time out.
Specifically:
- `actions/checkout@v*` with `fetch-depth: 0` hangs (fetching full repo
history takes >15 s before hitting the timeout).
- `git clone <url>` hangs for the same reason.
- `git fetch origin <ref> --depth=1` **succeeds** in ~16 s — this is the
working pattern.
### Affected workflows
| Workflow | Issue | Workaround |
|---|---|---|
| `harness-replays.yml` detect-changes job | `fetch-depth: 0` + `git clone` time out | Added `timeout 20 git fetch origin base.ref --depth=1` + `continue-on-error: true` + fallback to `run=true` per PR #441 |
| `publish-workspace-server-image.yml` | In-image `git clone` of workspace templates | Pre-clone manifest deps before compose build (Task #173 pattern) |
| Any workflow using `fetch-depth: 0` | Full history fetch times out | Use `fetch-depth: 1` + explicit `git fetch` for needed refs |
### How to diagnose
```bash
# From inside the runner (add as a debug step):
timeout 20 git fetch origin main --depth=1
# If this SUCCEEDS (~16s): runner can reach the git remote — the repo is
# too large for full-history fetch.
# If this times out: true network isolation (unlikely; check firewall rules).
```
### Verification
Confirmed 2026-05-11 by running `timeout 20 git fetch origin base.ref --depth=1`
in the `detect-changes` job of `harness-replays.yml`: it **succeeds in ~16 s**.
Runner can reach `https://api.github.com` and `https://pypi.org` without issue,
confirming this is a repo-size constraint, not network isolation.
### References
- PR #441: fix for `harness-replays.yml` detect-changes
- Task #173: pre-clone manifest deps pattern for compose build
- internal#102: tracking customer-private + marketplace third-party repos
- `feedback_oss_first_repo_visibility_default`: 5 workspace-template repos
flipped public to allow pre-clone without auth
---
## Quirk #2 — `continue-on-error` only works at step level, not job level
### Finding
Gitea Actions (1.22.6) does not honour `continue-on-error: true` at the **job**
level the way GitHub Actions does. A job with `continue-on-error: true` that
fails still reports `status: failure` in the commit status API.
Only `continue-on-error: true` at the **step** level works as expected.
### Impact
If you want a job to always "pass" in the status API (so dependent jobs can
run and the overall CI does not show `failure`), you must add
`continue-on-error: true` to every step that can fail, AND ensure each step
exits with code 0 (e.g., append `|| true` to commands that might fail).
### Affected workflows
| Workflow | Fix |
|---|---|
| `harness-replays.yml` detect-changes | Added `continue-on-error: true` to fetch step + decide step; added `|| true` to `DIFF=$(git diff ...)` per PR #441 |
### How to diagnose
```yaml
# WRONG: job reports as failure despite flag
jobs:
  my-job:
    continue-on-error: true   # ← ignored by Gitea
    steps:
      - run: git diff ...     # ← if this fails, job = failure
                              #   job-level flag does not help
# RIGHT: step-level flag prevents step from failing
jobs:
  my-job:
    steps:
      - run: git diff ... || true   # ← step exits 0
        continue-on-error: true     # ← belt and suspenders
```
### References
- Quirk #10 (this document): Gitea does NOT auto-populate `secrets.GITHUB_TOKEN`
- PR #441: fix applied to `harness-replays.yml`
---
## Quirk #3 — `workflow_dispatch.inputs` not supported
Gitea 1.22.6 parser rejects `workflow_dispatch.inputs`. Drop from all workflow
YAML files ported from GitHub Actions. Manual triggers should use
`workflow_dispatch` without `inputs:`.
**Reference**: `feedback_gitea_workflow_dispatch_inputs_unsupported`
---
## Quirk #4 — `merge_group` not supported
Gitea has no native merge queue concept. Drop `merge_group:` triggers from
all workflow YAML files.
For `molecule-core`, use the external serialized queue documented in
`runbooks/gitea-merge-queue.md`. Gitea's `pull_auto_merge` table is
auto-merge-on-green, not a queue that retests each PR against latest `main`.
---
## Quirk #5 — `environment:` blocks not supported
Gitea has no environments concept. Drop `environment:` from all workflow YAML
files. Secrets and variables are repo-level.
---
## Quirk #6 — Gitea combined status reports `failure` when all contexts are `null`
### Finding
When ALL individual status contexts for a commit have `state: null` (no runner
has reported yet), Gitea reports the combined commit status as `failure`. This
is a Gitea Actions bug — it conflates "no status reported yet" with "failed".
### Impact
- The `main-red-watchdog` workflow opens a `[main-red]` issue for every
scheduled workflow run where the combined state is `failure` — even when
the failure is entirely due to Gitea's combined-status bug.
- This causes spurious `[main-red]` issues that waste SRE time investigating
non-existent failures.
- **This is especially confusing for `schedule:`-only workflows** (canary,
sweep jobs, synth-E2E): Gitea attributes their scheduled runs to `main`'s
HEAD commit, so if a scheduled run fires while all contexts are still
`state: null`, the watchdog opens a `[main-red]` issue on the latest main
commit even though that commit itself is perfectly fine.
### How to diagnose
Always check the **individual context `state` fields**, not the combined
`state`/`combined_state`. In the `/repos/{org}/{repo}/commits/{sha}/statuses`
API response, look for `"state": null` on every entry — if all are null, the
combined `failure` is Gitea's bug, not a real CI failure.
```json
{
  "combined_state": "failure", // ← Gitea bug when all are null
  "contexts": [
    { "context": "CI / Lint", "state": null }, // still running
    { "context": "CI / Test", "state": null }  // still running
  ]
}
```
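A consumer such as the watchdog could apply that rule before trusting the combined state. The following is a minimal sketch, assuming the response has the `combined_state` / `contexts` shape shown in the sample above; `interpret_combined` is a hypothetical helper, not the watchdog's actual code.

```python
def interpret_combined(combined_state, contexts):
    """Return the state a consumer should act on.

    If every individual context is still null (nothing has reported yet),
    report "pending" instead of passing Gitea's combined "failure" through.
    Any other situation returns the combined state unchanged.
    """
    states = [ctx.get("state") for ctx in contexts]
    if states and all(state is None for state in states):
        return "pending"
    return combined_state
```

Note that the all-null case maps to `pending`, never to `success`, so the override cannot turn an unverified commit green.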
### Affected workflows
All workflows, but especially `schedule:`-only workflows that run on `main`.
The main-red-watchdog (`.gitea/workflows/main-red-watchdog.yml`) is the
primary consumer of combined status and is affected.
### References
- Issue #481: first real-world case of this bug (2026-05-11)
- `feedback_no_such_thing_as_flakes`: watchdog directive
---
## Quirk #7 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
*[Which workflows or operations are affected.]*
### Workaround
*[How to work around this quirk.]*
### References
- internal#[N]: first observation
---
## Quirk #8 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
*[Which workflows or operations are affected.]*
### Workaround
*[How to work around this quirk.]*
### References
- internal#[N]: first observation
---
## Quirk #9 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
*[Which workflows or operations are affected.]*
### Workaround
*[How to work around this quirk.]*
### References
- internal#[N]: first observation
---
## Quirk #10 — Gitea does NOT auto-populate `secrets.GITHUB_TOKEN`
### Finding
Gitea Actions (1.22.6) does **not** auto-populate `secrets.GITHUB_TOKEN`
the way GitHub Actions does. A workflow that references `secrets.GITHUB_TOKEN`
without explicitly provisioning a named secret gets an empty string — not a
read-only token scoped to the repo.
### Impact
Workflows that call the Gitea REST API using `secrets.GITHUB_TOKEN` as auth
receive **HTTP 401** on every API call. Affected workflows in molecule-core:
| Workflow | Symptom | Workaround |
|---|---|---|
| `gate-check-v3.yml` | Reports BLOCKED on every PR | Provision `SOP_CHECKLIST_GATE_TOKEN`; update workflow to use it |
| `qa-review.yml` | Fails immediately on PR open | Same — needs named secret |
| `security-review.yml` | Fails immediately on PR open | Same — needs named secret |
### How to diagnose
Add a debug step to the failing workflow:
```yaml
- name: Diagnose token
  run: |
    echo "Token present: ${{ secrets.GITHUB_TOKEN != '' }}"
    curl -sS --fail -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
      "$GITHUB_SERVER_URL/api/v1/user" | jq -r '.login'
    # Expected (GitHub): prints your username.
    # Actual (Gitea): HTTP 401 or empty string.
```
### References
- internal#325: root-cause analysis and token provisioning
- `feedback_gitea_no_auto_supplied_github_token`
---
## Quirk #11 — PR-create event dispatcher races — only 1 of N workflows fires on `pull_request opened`
### Finding
When a PR is created via the Gitea web UI or API, the Gitea Actions event
dispatcher may fire **only 1 of N eligible workflows** on the initial
`pull_request opened` event. All other eligible workflows are silently dropped.
This was observed on molecule-core PR #558 (created 2026-05-11T19:54:10Z):
12+ workflows had no `paths:` filter and should have fired, but only
`sop-checklist.yml` dispatched.
Concurrent PRs created within the same minute received 12 to 30 dispatches each,
confirming this is specific to the PR-create event dispatch, not a general
runner capacity issue.
### Impact
- PRs may not run the full CI suite on first open.
- `gate-check-v3`, `secret-scan`, `qa-review`, and `security-review` can be
silently absent from the PR's status checks.
- Branch protection may block merge even though CI is effectively green.
### How to diagnose
```bash
# List workflow runs for the PR:
gh run list --event pull_request --repo molecule-ai/molecule-core \
| grep "$(gh pr view $PR --json number --jq '.number')"
# Expected: 12+ runs on PR open.
# Actual (when race fires): only 1 run.
```
### Workaround
Force a second dispatch by pushing a no-op synchronize commit:
```bash
# No [skip ci] marker in the message: it could suppress the runs this commit is meant to trigger.
git commit --allow-empty -m "chore: trigger workflows"
git push
```
The synchronize event fires a second `pull_request` event, which reliably
triggers all eligible workflows.
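Whether the race hit a given PR can also be judged from the status contexts reported on its head commit. This is a rough sketch under assumptions: each status entry is a dict with a `context` field (as in the Quirk #6 sample), and the list of expected context names is something the caller maintains; `missing_contexts` is a hypothetical helper.

```python
def missing_contexts(expected, reported):
    """Return the expected context names that never reported on the head SHA."""
    seen = {status["context"] for status in reported}
    return sorted(set(expected) - seen)
```

A non-empty result shortly after PR open suggests dropped dispatches, in which case the empty-commit workaround applies.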
### References
- internal#329: first observation on PR #558
- `feedback_gitea_pr_create_dispatcher_race`
---
## When you find a new quirk
Copy the template below, increment the quirk number, and fill in the finding,
impact, workaround, and references. Place the new section in the **correct
numerical position** (before the next higher-numbered quirk). Update this
section's final paragraph to remove the next slot's number.
### Template
```markdown
## Quirk #N — <short title>
### Finding
<What Gitea Actions does differently from GitHub Actions.>
### Impact
<Which workflows or operations are affected. Include an affected workflows
table if more than one is affected.>
### How to diagnose
<Shell commands or API calls that confirm this is the quirk, not a real failure.>
### Workaround
<How to work around this quirk in workflow YAML or operations.>
### References
- internal#[N]: first observation
- <Any Gitea issue, feedback label, or upstream bug tracker reference>
```
---
## Open questions for Gitea 1.23
- [ ] **act_runner concurrent-job cap**: issue #305 — runner saturation under
merge burst; needs `max_concurrent_jobs` cap configured on act_runner
- [ ] **Infisical→Gitea secret-sync**: issue #307 — eliminate manual secret
PUTs by wiring an Infisical cron to the Gitea API
- [ ] **PR-create dispatcher race resolution**: internal #329 — is there a
Gitea fix or config knob to disable the race? File upstream bug if not
- [ ] **GITHUB_TOKEN auto-population**: internal #325 — is this on the
Gitea 1.23 roadmap? If not, the workaround (named secret) is the permanent
answer
-64
@@ -1,64 +0,0 @@
# Production Auto-Deploy
`molecule-core` deploys production tenant code automatically from Gitea Actions.
This runbook is an implementation-specific companion to `runbooks/sop-production-cicd.md`.
## Default Flow
On a push to `main` that touches deployable code, `.gitea/workflows/publish-workspace-server-image.yml`:
1. Builds and pushes platform and tenant ECR images tagged `staging-<sha>` and `staging-latest`.
2. Self-tests the production deploy helper and workflow-YAML linter.
3. Waits for strict required push contexts on the same commit to become `success`.
4. Calls production control-plane `POST /cp/admin/tenants/redeploy-fleet` with `target_tag=staging-<sha>`.
5. Verifies every redeploy result is healthy and every tenant returns the same Git SHA from `/buildinfo`.
The deploy workflow intentionally does not use Gitea `concurrency`, because Gitea 1.22.6 can cancel queued runs even when `cancel-in-progress: false` is set.
## Kill Switch
Set either a repository variable or a secret:
```text
PROD_AUTO_DEPLOY_DISABLED=true
```
The image publish still runs, but the production redeploy step exits successfully without touching tenants.
Immediately before the production POST, the workflow re-checks the live Gitea repo variable when `PROD_AUTO_DEPLOY_CONTROL_TOKEN` can read Actions variables. If that token is not configured, the job-start value is still honored.
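The kill-switch check can be sketched as a small pure function that is easy to unit-test. This is a minimal sketch, not the actual deploy helper: the function names are hypothetical, and accepting `1`, `yes`, and `on` in addition to the documented `true` is an assumption.

```python
# Hypothetical helper; only the value "true" is documented in this runbook.
# The extra spellings and case-insensitivity are assumptions for illustration.
DISABLED_VALUES = {"true", "1", "yes", "on"}


def is_disabled_value(raw_value):
    """Return True when a single flag value asks to skip the redeploy."""
    if raw_value is None:
        return False
    return raw_value.strip().lower() in DISABLED_VALUES


def kill_switch_active(variable_value, secret_value):
    """The switch is active if either the repo variable or the secret sets it."""
    return is_disabled_value(variable_value) or is_disabled_value(secret_value)
```

A caller would evaluate this once at job start and, when the control token is available, again with the freshly read variable immediately before the POST.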
## Tunables
Repository variables:
```text
PROD_CP_URL=https://api.moleculesai.app
PROD_AUTO_DEPLOY_CANARY_SLUG=hongming
PROD_AUTO_DEPLOY_SOAK_SECONDS=60
PROD_AUTO_DEPLOY_BATCH_SIZE=3
PROD_AUTO_DEPLOY_DRY_RUN=false
PROD_MANUAL_REDEPLOY_TARGET_TAG=staging-<known-good-sha>
```
Secrets required:
```text
CP_ADMIN_API_TOKEN
AUTO_SYNC_TOKEN
PROD_AUTO_DEPLOY_CONTROL_TOKEN
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
```
`AUTO_SYNC_TOKEN` is only used to read Gitea commit statuses while waiting for required push contexts.
`PROD_AUTO_DEPLOY_CONTROL_TOKEN` is optional but recommended so the pre-POST kill-switch check can read the live `PROD_AUTO_DEPLOY_DISABLED` Actions variable.
## Manual Fallback
Use `.gitea/workflows/redeploy-tenants-on-main.yml` when the automatic path needs to be rerun or rolled back. Gitea 1.22.6 does not support reliable `workflow_dispatch` inputs, so rollback uses a repo variable:
1. Set `PROD_MANUAL_REDEPLOY_TARGET_TAG=staging-<known-good-sha>`.
2. Dispatch `manual-redeploy-tenants-on-main`.
3. Clear `PROD_MANUAL_REDEPLOY_TARGET_TAG` after the rollback finishes.
With no variable set, the fallback redeploys `staging-<current-main-sha>`.
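The target-tag choice described above can be sketched as follows. The function name is hypothetical, and rejecting tags that lack the `staging-` prefix is an assumption based on the tag format used in this runbook, not a documented behaviour of the workflow.

```python
def resolve_target_tag(manual_tag, current_main_sha):
    """Pick the tag to redeploy: the pinned rollback tag if set, else current main."""
    tag = (manual_tag or "").strip()
    if not tag:
        # No pin configured: redeploy the image built from the current main commit.
        return f"staging-{current_main_sha}"
    # Assumption: every deployable tag follows the staging-<sha> format.
    if not tag.startswith("staging-") or tag == "staging-":
        raise ValueError("manual redeploy tag must look like staging-<sha>")
    return tag
```

Raising on a malformed pin keeps the fallback fail-closed: a typo in the variable stops the run instead of silently deploying something else.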
-77
@@ -1,77 +0,0 @@
# SOP: Production CI/CD Changes
Production CI/CD changes are higher risk than ordinary CI edits. They can publish images, deploy tenants, promote tags, mutate branch protection, or change merge behavior. This SOP separates rules that must be enforced by code from rules that require human judgment.
## Programmatic Gates
The workflow YAML linter is the first line of enforcement:
```bash
python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows
```
It must reject:
- Gitea-hostile syntax such as `workflow_dispatch.inputs`, `workflow_run`, workflow name collisions, slash-containing workflow names, and unsupported cross-repo action references.
- Production deploy workflows that rely on `concurrency.cancel-in-progress: false` for serialization.
- Production deploy workflows that print raw control-plane responses or raw `.error` fields into CI logs.
- Production redeploy workflows with no kill switch or rollback/pin control.
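A few of these rejections can be illustrated on already-parsed workflow documents. This is a sketch under stated assumptions and not the contents of `lint-workflow-yaml.py`: the function name, the input shape (file name mapped to parsed dict), the fallback to the file name for unnamed workflows, and the error wording are all hypothetical. It covers only name collisions, slash-containing names, `workflow_run`, and `workflow_dispatch.inputs`.

```python
def lint_workflows(workflows):
    """Return a list of error strings for a mapping of file name -> parsed workflow."""
    errors = []
    first_file_for_name = {}
    for filename in sorted(workflows):
        doc = workflows[filename]
        # Assumption: an unnamed workflow is identified by its file name.
        name = str(doc.get("name", filename))
        if "/" in name:
            errors.append(f"{filename}: workflow name contains '/'")
        if name in first_file_for_name:
            errors.append(f"{filename}: name collides with {first_file_for_name[name]}")
        else:
            first_file_for_name[name] = filename
        # PyYAML (YAML 1.1) loads a bare `on:` key as boolean True, so check both.
        raw = doc.get("on", doc.get(True))
        if raw is None:
            triggers = {}
        elif isinstance(raw, str):
            triggers = {raw: None}
        elif isinstance(raw, list):
            triggers = {event: None for event in raw}
        else:
            triggers = raw
        if "workflow_run" in triggers:
            errors.append(f"{filename}: workflow_run is not supported")
        dispatch = triggers.get("workflow_dispatch") or {}
        if isinstance(dispatch, dict) and dispatch.get("inputs"):
            errors.append(f"{filename}: workflow_dispatch.inputs is not reliable")
    return errors
```

A CI wrapper would exit non-zero whenever the returned list is non-empty.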
Production deploy helpers must also unit-test:
- Disable-flag parsing.
- Required status context selection.
- Terminal status handling for `failure`, `error`, `cancelled`, `canceled`, and `skipped`.
- Production control-plane URL guards.
- Rollback target/pin handling when applicable.
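Terminal status handling lends itself to a small, testable function. The sketch below assumes the helper receives the latest state per context as a plain mapping; the function names, the three-way verdict, and the injectable `sleep` and `now` parameters are illustrative choices, not the real helper's interface.

```python
import time

# States that end the wait immediately, in both spellings of "cancelled".
TERMINAL_FAILURES = {"failure", "error", "cancelled", "canceled", "skipped"}


def evaluate_required_contexts(required, statuses):
    """Return 'success', 'pending', or 'failed' for the required contexts."""
    if not required:
        # An empty gate would be fail-open, so refuse to evaluate it.
        raise ValueError("at least one required context is needed")
    pending = False
    for context in required:
        state = (statuses.get(context) or "").lower()
        if state in TERMINAL_FAILURES:
            return "failed"
        if state != "success":
            pending = True  # missing or still running
    return "pending" if pending else "success"


def wait_for_contexts(fetch_statuses, required, timeout_s, interval_s=10,
                      sleep=time.sleep, now=time.monotonic):
    """Poll until all required contexts succeed; fail closed on timeout."""
    deadline = now() + timeout_s
    while True:
        verdict = evaluate_required_contexts(required, fetch_statuses())
        if verdict != "pending":
            return verdict
        if now() >= deadline:
            return "failed"
        sleep(interval_s)
```

Passing `sleep` and `now` as parameters lets unit tests drive the timeout path without real waiting.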
## Required PR Evidence
Every production CI/CD PR must include concrete answers for:
- Root cause: what production failure mode or process gap is being closed.
- Deploy gate: which exact contexts must be green before production side effects.
- Kill switch: how to stop deployment without reverting the PR.
- Verification: how production state is proven after deployment.
- Logging: proof that CI logs do not contain raw production runtime, SSM, or secret-adjacent output.
- Rollback: the exact command, variable, or workflow to return to a known-good tag/digest.
- No fail-open gates: required checks fail loudly and closed on protected contexts (no skips, no `|| true`, no `403`-as-pass). See `runbooks/dev-sop.md` § *Fail-closed CI integrity*.
## Human Review
Production CI/CD PRs need non-author review across these roles:
- DevOps: Gitea Actions semantics, branch protection, merge queue, and runner behavior.
- SRE: rollout order, tenant health checks, observability, and partial-deploy recovery.
- Security: secrets, token scopes, log redaction, and production endpoint targeting.
Critical or Required review findings must be closed with one of:
- A code change plus verification.
- An evidence-backed rejection.
- A follow-up issue only if the finding is explicitly not merge-blocking.
Acknowledgement alone is not closure.
## Production Defaults
Production deploys should fail closed:
- Missing tenant result: fail.
- Tenant unhealthy: fail.
- `/buildinfo` unreachable: fail.
- SHA mismatch: fail.
- Required status cancelled, skipped, or still missing after the timeout: fail.
Staging may tolerate warnings during rollout development; production should not.
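The tenant-level defaults above can be expressed as a single verification pass. This is a sketch only: the function name and the shape of `results` (tenant slug mapped to a dict with `healthy` and `buildinfo_sha`, where `None` means `/buildinfo` could not be reached) are assumptions, not the control plane's actual response format.

```python
def verify_rollout(expected_tenants, results, expected_sha):
    """Return a list of failure reasons; an empty list means the rollout passed."""
    failures = []
    for slug in expected_tenants:
        result = results.get(slug)
        if result is None:
            failures.append(f"{slug}: missing redeploy result")
            continue
        if not result.get("healthy", False):
            failures.append(f"{slug}: tenant unhealthy")
        sha = result.get("buildinfo_sha")
        if sha is None:
            failures.append(f"{slug}: /buildinfo unreachable")
        elif sha != expected_sha:
            failures.append(f"{slug}: SHA mismatch")
    return failures
```

The failure messages carry only the tenant slug and a fixed reason, so nothing from a raw control-plane response ends up in CI logs.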
## Gitea 1.22.6 Constraints
Do not design production CI/CD around unsupported or unreliable features:
- No `workflow_run`.
- No reliable `workflow_dispatch.inputs`.
- Do not assume `concurrency.cancel-in-progress: false` serializes queued runs.
- Do not rely on a masked aggregate status as the only production deploy gate.
If these constraints change after a Gitea upgrade, update this SOP and the workflow linter in the same PR.