security: remove operational runbooks from PUBLIC molecule-core #3183
@@ -1,217 +0,0 @@
# Developer SOP — PR review gate auto-fire and stale-head handling

> Last updated: 2026-06-03 (cp#2159 follow-up)
>
> Applies to: all core-PR authors and reviewers on `molecule-core` and sibling
> repos using the `qa-review` + `security-review` branch-protection gates.

---

## 1. Gitea PR-head workflow-selection rule

**Rule:** For `pull_request_target` and `pull_request_review` events, Gitea
loads the workflow definition from the **PR's HEAD branch**, not from the
base (`main`) branch.

This is different from GitHub Actions, where `pull_request_target` always
loads workflows from the base branch. Gitea's behaviour means:

- A PR that was opened **before** the `pull_request_review` trigger was added
  to `qa-review.yml` / `security-review.yml` will **NOT** auto-fire on review,
  because its HEAD still contains the old workflow YAML (no trigger).

- A PR that was opened **after** the trigger was added (or that has been
  rebased onto a commit containing the trigger) **WILL** auto-fire, because its
  HEAD contains the new workflow YAML.

### Ops implication

| PR head contains `pull_request_review` trigger? | Behaviour on APPROVED review |
|---|---|
| **Yes** (cut from current main, or rebased) | Workflows auto-queue, evaluate, and POST the `(pull_request_target)` context automatically. No slash-command needed. |
| **No** (stale head, opened before #2157) | Nothing fires. Use `/qa-recheck` + `/security-recheck` slash-commands in a PR comment, OR rebase onto current main. |

---

## 2. Standard core-PR flow (post-#2157)

```
1. Author opens PR from a branch based on current main
   → qa-review + security-review workflows run on pull_request_target
   → status contexts post (initial eval, usually red until reviews land)

2. Reviewers submit real APPROVED reviews
   → If PR head has the trigger: workflows AUTO-FIRE on pull_request_review
   → Contexts flip green (or stay red if reviewer is not in team)

3. [Optional] If contexts did not flip (stale head, event lost, etc.):
   → Anyone can comment `/qa-recheck` or `/security-recheck`
   → sop-checklist.yml refires the evaluator (read-only, idempotent)

4. Both qa-review + security-review contexts are green
   → Plain Do:merge (no force-merge needed)
```

### Key point

The `/qa-recheck` and `/security-recheck` commands are a **backstop**, not the
primary path. PRs cut from current main should auto-fire without manual
intervention.

---

## 3. Diagnosing a stale head

If a PR has real team-member APPROVED reviews but the qa/security contexts
remain red and no workflow run appears on the PR's "Actions" tab for the
review event, the PR head is likely stale.

### Quick check

```bash
# From the PR page, look at the head commit SHA, then:
curl -sS "https://git.moleculesai.app/api/v1/repos/molecule-ai/molecule-core/contents/.gitea/workflows/qa-review.yml?ref=<HEAD_SHA>" \
  | jq -r '.content' | base64 -d | grep -c 'pull_request_review'
# 0 → stale head (no trigger in that version of the workflow)
# >0 → trigger present; auto-fire SHOULD work (if it didn't, file a tracker)
```
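
The same check can be expressed in Python for use inside a test or script. This is a minimal sketch, assuming the contents API payload carries the file as a base64 string under `content` (as the `jq`/`base64` pipeline above implies). The function names and the sample payloads are hypothetical, and no network call is made.

```python
import base64


def count_review_triggers(contents_response: dict) -> int:
    """Count workflow lines that mention pull_request_review (like `grep -c`)."""
    workflow_yaml = base64.b64decode(contents_response["content"]).decode("utf-8")
    return sum("pull_request_review" in line for line in workflow_yaml.splitlines())


def is_stale_head(contents_response: dict) -> bool:
    """A head is stale when its copy of the workflow has no review trigger."""
    return count_review_triggers(contents_response) == 0


def fake_payload(yaml_text: str) -> dict:
    """Hypothetical stand-in for the JSON returned by the contents endpoint."""
    return {"content": base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")}


stale = fake_payload("on:\n  pull_request_target:\n")
fresh = fake_payload(
    "on:\n  pull_request_target:\n  pull_request_review:\n    types: [submitted]\n"
)
print("stale head:", is_stale_head(stale))
print("fresh head:", is_stale_head(fresh))
```
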

### Automated diagnostic

The test suite includes `test_gate_stale_head_diagnostic.py`, which reports
"auto-fire impossible for this PR" when the head lacks the trigger. Run it
in CI or locally with:

```bash
PR_NUMBER=123 python -m pytest .gitea/scripts/tests/test_gate_stale_head_diagnostic.py -v
```

---

## 4. Rebasing vs. slash-refire

| Approach | When to use | Trade-off |
|---|---|---|
| **Rebase onto current main** | PR is genuinely stale (head lacks trigger OR head is far behind main) | Clean history, gets all recent fixes, but requires force-push and re-approval if the branch was protected |
| **`/qa-recheck` + `/security-recheck`** | PR head is recent but the review event was missed, or you want to avoid rebase churn | Quick, no force-push, but does NOT fix a missing trigger in the head |

**Do not** use slash-refire as a substitute for rebasing a stale head. If the
workflow YAML in the PR head does not contain `pull_request_review`, no amount
of rechecking will make auto-fire work.

---

## 5. Live-fire verification

The `test_gate_auto_fire_live.py` regression test exercises the full runtime
path: it submits an APPROVED review to a test PR and polls for the
`(pull_request_target)` status contexts. It is skipped when no API token is
available, and is intended to catch runtime non-fire that static structural
tests (e.g. `test_gate_review_auto_fire.py`) cannot detect.

Run manually with:

```bash
export GITEA_HOST=git.moleculesai.app
export GITEA_TOKEN=<your-token>
export REPO=molecule-ai/molecule-core
export LIVEFIRE_PR_NUMBER=<test-pr-number>
python -m pytest .gitea/scripts/tests/test_gate_auto_fire_live.py -v
```

---

## 6. Fail-closed CI integrity — no fail-open gates (MERGE-BLOCKING)

**Rule:** No CI workflow, CI script, or test check may **FAIL OPEN** — i.e. it
must never report GREEN (exit 0, skip, warn-and-continue, `|| true`, or any
"return success") when it could **not actually verify its invariant**. A check
that cannot verify MUST **fail loud** (`::error::` annotation **and** a nonzero
exit) and **fail closed** (treat inability-to-verify as **FAILURE**, never as a
pass). An unverifiable check is a red check, full stop.

This is the same family of bug as the no-flakes rule (§ *No flakes*): a green
that isn't real. A flake is a green/red that flips for an unnamed reason; a
fail-open gate is a green that was never earned. Both let unverified code reach
`main`, and both are merge-blocking.

### Applies to

Required / hard gates on **protected contexts**: pushes to `main`, internal
protected branches, and **same-repo** PRs (`pull_request_target`). On these
contexts the *cause* of an unverifiable run is **irrelevant** — every one of the
following MUST fail closed:

- auth failure (401 / 403),
- missing token or identity,
- under-scoped credential,
- unreachable dependency (network, Infisical, control-plane, registry),
- a required test file that is absent or collects zero tests,
- any transient error the check cannot prove was benign.

"I couldn't check" is reported and scored exactly like "the check failed." A
gate that can be silently defanged by removing a secret is not a gate.
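
The fail-closed rule can be sketched as a small wrapper around a check. This is a minimal illustration under stated assumptions, not the repository's actual gate code: `evaluate_gate`, the `verify` callback, and the helper functions are hypothetical names, and only the `::error::` annotation and nonzero exit come from the rule above.

```python
def evaluate_gate(token, verify):
    """Return (exit_code, message); anything short of a proven pass is a failure."""
    if not token:
        # Missing token or identity: unverifiable, so the gate is red.
        return 1, "::error::gate cannot verify: missing token"
    try:
        verified = verify(token)
    except Exception as exc:
        # Auth failure, unreachable dependency, transient error: all fail closed.
        return 1, f"::error::gate cannot verify: {exc}"
    if verified is not True:
        # Only an explicit True counts; None or a truthy string does not.
        return 1, "::error::gate invariant not satisfied"
    return 0, "gate verified"


def unreachable_dependency(_token):
    """Hypothetical check whose dependency is down."""
    raise ConnectionError("registry unreachable")


def passing_check(_token):
    """Hypothetical check that really verified its invariant."""
    return True


for label, token, check in [
    ("no token", "", passing_check),
    ("dependency down", "t0ken", unreachable_dependency),
    ("verified", "t0ken", passing_check),
]:
    code, message = evaluate_gate(token, check)
    print(f"{label}: exit {code} ({message})")
```
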

### The one allowed exception — explicit trust-boundary split

Legitimate degradation is permitted **only** where the secret genuinely cannot
exist — e.g. **fork PRs**, which by design have no access to repo secrets. Such
degradation is allowed **only** when it is:

1. gated behind an **explicit** fork / advisory branch in the workflow logic
   (an intentional trust-boundary split, not an incidental `if: secrets...`),
2. **clearly marked advisory** in its name and output, and
3. **NOT counted as a passing REQUIRED context** — it may inform, it may not
   satisfy the gate.

Silent degradation that satisfies a required gate is **forbidden**. If a fork PR
needs the real check, it must run via a maintainer-triggered same-repo path
(where the secret exists and the check therefore fails closed), not by quietly
passing the required context with no verification.

### Auth-failure vs. genuine-absence — do not conflate

Distinguish the two so a real finding is never masked and a masked finding is
never mistaken for real:

- **`403` (or 401) on a protected context → fail closed.** You could not verify;
  that is a check failure, not a finding about the resource.
- **A real `404` from a read made *with a valid, sufficiently-scoped token* →
  the real finding.** The resource is genuinely absent; report it as such.

A `403` reported as "resource not found" is itself a fail-open bug.
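
One way to keep the two cases apart is to classify every read into three outcomes instead of two. The sketch below is illustrative only: the `Outcome` enum and `classify_read` are hypothetical names, and how a caller establishes that its token is valid and sufficiently scoped is assumed to happen elsewhere.

```python
from enum import Enum


class Outcome(Enum):
    VERIFIED_PRESENT = "present"
    VERIFIED_ABSENT = "absent"      # the real finding
    UNVERIFIABLE = "unverifiable"   # must be scored as a check failure


def classify_read(status_code: int, token_valid_and_scoped: bool) -> Outcome:
    """Map an HTTP status to a gate outcome without conflating 403 and 404."""
    if status_code in (401, 403):
        return Outcome.UNVERIFIABLE
    if status_code == 404:
        # A 404 is only trustworthy when the credential could have seen the resource.
        return Outcome.VERIFIED_ABSENT if token_valid_and_scoped else Outcome.UNVERIFIABLE
    if 200 <= status_code < 300:
        return Outcome.VERIFIED_PRESENT
    # 5xx, redirects, and anything else: cannot prove the error was benign.
    return Outcome.UNVERIFIABLE


for status, scoped in [(403, True), (404, True), (404, False), (200, True), (502, True)]:
    print(status, scoped, classify_read(status, scoped).value)
```
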

### Required practice

Every gate that depends on a token, an identity, or an external read MUST ship
with a test or workflow-lint covering the **absent-identity / unauthorized /
missing-file path** that asserts the gate **FAILS** (not skips, not passes).
Add or update that coverage in the **same PR** that adds or changes the gate.
A gate without a proven failure path is not yet a gate.

### Violations seen in this codebase (all merge-blocking if reintroduced)

- **serving-e2e** reporting vacuously GREEN when the Infisical identity is
  absent (no per-(provider × auth) completion was actually exercised).
- **branch-protection / BP-drift lints** returning `0` on a `403` instead of
  failing closed on the unverifiable response.
- **verify-template-models** run without `-strict`, so a drift it could not
  confirm passed silently.
- A **referenced-but-absent pytest file** that collects zero tests and reports
  green — silent pass with no assertions executed.

Each of these is a fail-open gate and is a merge blocker until it fails loud and
closed on protected contexts. See also the production fail-closed defaults in
`runbooks/sop-production-cicd.md` (*Production Defaults*), which apply the same
principle to deploy-time gates.

---

## References

- #2159 — gate auto-trigger not firing (root cause: stale PR heads lacking
  the `pull_request_review` trigger, NOT a workflow code defect)
- #765 — static structural regression test for gate configuration
- #2157 — merged trigger addition (`pull_request_review` types: [submitted])
- #2020 — milestone confirming gate infrastructure is stable
- RFC#324 — qa-review + security-review design

@@ -1,112 +0,0 @@
# Gitea Actions migration checklist (molecule-core)

Created 2026-05-11 as part of **RFC `molecule-ai/internal#219` §1** — the
sweep of `.github/workflows/*.yml` files in `molecule-core` after the
2026-05-06 GitHub → Gitea migration. Documents which workflows were
retired, which were ported, and the reasoning for each.

The sweep used the four-surface audit pattern from saved memory
`feedback_gitea_actions_migration_audit_pattern`:

1. **YAML** — drop `workflow_dispatch.inputs`, `merge_group`,
   `environment:`. Adjust `runs-on:`. Set `env.GITHUB_SERVER_URL`
   per `feedback_act_runner_github_server_url`.
2. **Cache** — verify `actions/cache@v4` / `upload-artifact` pin
   compatibility with Gitea 1.22.x runner.
3. **Token** — auto-injected `GITHUB_TOKEN` works for same-repo
   operations; cross-repo dispatch needs explicit secret.
4. **Docs** — top-of-file "Ported from .github/workflows/X.yml on
   YYYY-MM-DD per RFC internal#219 §1 sweep" comment.

Per RFC §1 contract, all ports land with `continue-on-error: true` on
every job to surface bugs without blocking; a follow-up PR flips
`continue-on-error: false` after triage.

## Category A — already mirrored (deleted .github/ copy)

These workflows had a working `.gitea/workflows/X.yml` twin at the time
of the sweep. The `.github/` copies were silently dead (Gitea Actions
in molecule-core only registers `.gitea/workflows/`) and have been
removed.

| File | .gitea/ twin |
|---|---|
| `publish-runtime.yml` | `.gitea/workflows/publish-runtime.yml` (ported via issue #206) |
| `secret-scan.yml` | `.gitea/workflows/secret-scan.yml` |

## Category B — GitHub-only, retired

These workflows depend on GitHub-specific surface (merge queue, GitHub
auto-merge primitive, github.com REST API, GHCR registry, CodeQL action
that hits api.github.com bundle endpoints) that Gitea does not provide.
No equivalent Gitea-side workflow is needed; the underlying mechanism
either doesn't exist on Gitea or has been replaced by a different
pipeline.

| File | Why retired |
|---|---|
| `auto-tag-runtime.yml` | Superseded by `.gitea/workflows/publish-runtime-autobump.yml` (auto-bump-on-workspace-edit). The autobump only does patch bumps; the deleted workflow supported `release:minor` / `release:major` PR-label-driven bumps. Follow-up issue should track restoring label-driven minor/major if anyone uses it. |
| `branch-protection-drift.yml` | Targets `Molecule-AI/molecule-core` on GitHub via `gh api /repos/.../branch-protection` — entirely GitHub-API specific. `tools/branch-protection/drift_check.sh` and `apply.sh` reference the GitHub schema (status_check_contexts, dismiss_stale_reviews, etc.) which differs from Gitea's `branch_protections` shape. Rebuilding for Gitea is out of scope for the RFC #219 sweep; follow-up issue needed for Gitea-compatible branch-protection drift detection. |
| `check-merge-group-trigger.yml` | The workflow's own header (lines 18-23) documents that it's vacuously satisfied on Gitea — Gitea has no merge queue, no `merge_group:` event type, no `gh-readonly-queue/...` refs. Nothing to lint. |
| `codeql.yml` | The workflow's own header (lines 3-67) documents that `github/codeql-action/init@v4` hits api.github.com bundle endpoints not implemented by Gitea (observed: `::error::404 page not found` in Initialize CodeQL step). Per Hongming decision 2026-05-07 (task #156): CodeQL is ADVISORY/non-blocking until a Gitea-compatible SAST pipeline lands. Replacement options (Semgrep self-host, Sonatype, GitHub-mirror-for-SAST) tracked in #156. |
| `pr-guards.yml` | The workflow's own header documents that Gitea has no `gh pr merge --auto` primitive — the guard is a structural no-op on Gitea. Branch protection on `main` does NOT reference any `pr-guards` check name; deletion is safe. |
| `promote-latest.yml` | Uses `imjasonh/setup-crane` against `ghcr.io/molecule-ai/platform` — the GHCR registry was retired during the 2026-05-06 Gitea migration (per `staging-verify.yml` header notes — file was renamed from `canary-verify.yml` on 2026-05-11; the canonical tenant image moved to ECR `153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant`). The workflow can no longer find any image to retag. Follow-up issue suggested if an ECR-based retag promote is desired. |

## Category C — ported to .gitea/

These workflows had real ongoing CI value but no Gitea-side equivalent.
Each was ported to `.gitea/workflows/X.yml` with:

- `workflow_dispatch.inputs` removed (Gitea 1.22.6 parser rejects them —
  per `feedback_gitea_workflow_dispatch_inputs_unsupported`)
- `merge_group:` trigger removed (no merge queue)
- `environment:` blocks removed (Gitea has no environments)
- `dorny/paths-filter@v4` replaced with inline `git diff` (per the
  pattern established in PR#372 ci.yml port)
- `env.GITHUB_SERVER_URL: https://git.moleculesai.app` set at workflow
  level (belt-and-suspenders for `actions/checkout` etc.)
- `continue-on-error: true` on every job (RFC §1 contract — surface
  defects without blocking; follow-up PR flips after triage)
- Top-of-file header: "Ported from .github/workflows/X.yml on
  YYYY-MM-DD per RFC internal#219 §1 sweep."

See the C-1 / C-2 / C-3 sweep PRs for the file lists and per-file
adjustments.
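
The mechanical part of the port adjustments listed above can be sketched as a transformation over an already-parsed workflow. This is an illustration only, assuming the workflow is held as a plain `dict` with an `"on"` key (some YAML loaders parse a bare `on` as a boolean, so real tooling would need to handle that). `port_workflow` is a hypothetical helper, and it does not cover the `paths-filter` replacement or the header comment.

```python
import copy

GITEA_SERVER_URL = "https://git.moleculesai.app"


def port_workflow(workflow: dict) -> dict:
    """Return a copy of a parsed workflow with the Gitea port adjustments applied."""
    ported = copy.deepcopy(workflow)

    triggers = ported.get("on", {})
    if isinstance(triggers, dict):
        triggers.pop("merge_group", None)            # no merge queue on Gitea
        dispatch = triggers.get("workflow_dispatch")
        if isinstance(dispatch, dict):
            dispatch.pop("inputs", None)             # rejected by the 1.22.6 parser

    # Workflow-level server URL, as in the list above.
    ported.setdefault("env", {})["GITHUB_SERVER_URL"] = GITEA_SERVER_URL

    for job in ported.get("jobs", {}).values():
        job.pop("environment", None)                 # Gitea has no environments
        job["continue-on-error"] = True              # RFC §1 contract, flipped later
    return ported


# Hypothetical GitHub-era workflow used only to exercise the sketch.
original = {
    "on": {
        "push": {"branches": ["main"]},
        "merge_group": {},
        "workflow_dispatch": {"inputs": {"dry_run": {"type": "boolean"}}},
    },
    "jobs": {"build": {"runs-on": "ubuntu-latest", "environment": "production", "steps": []}},
}
ported = port_workflow(original)
print(sorted(ported["on"]))
```
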

## Category D — parser-rejected (none for molecule-core)

The RFC #219 §1 brief lists 7 workflows as parser-rejected (`audit-orphan-instances`,
`bake-thin-ami`, `bench-provision-time`, `cache-probe`, `deploy-pipeline`,
`e2e-tunnel-reboot`, `persona-author-check`). Verification against
molecule-core's tree (and the `docker logs molecule-gitea-1` parser-rejection
log) shows these workflows belong to other repos:

- `audit-orphan-instances`, `bake-thin-ami`, `bench-provision-time`,
  `deploy-pipeline`, `e2e-tunnel-reboot` live in `molecule-ai/molecule-controlplane`
- `cache-probe`, `persona-author-check` live in `molecule-ai/internal`

For molecule-core, **Category D is empty**.

## Verification

After all sweep PRs land:

```bash
# Should produce nothing.
ls .github/workflows/*.yml | grep -vF ci.yml

# Should list 6 working workflows from the .gitea/ port directory + the
# C-1/C-2/C-3 ports.
ls .gitea/workflows/*.yml
```

Gitea Actions server should produce NO `[W] ignore invalid workflow`
lines for any `.gitea/workflows/X.yml` in molecule-core when commits
land on `main`:

```bash
ssh root@5.78.80.188 'docker logs molecule-gitea-1 --since 10m 2>&1 \
    | grep "ignore invalid workflow" \
    | grep -i molecule-core'
# Expected: empty.
```
@@ -1,104 +0,0 @@
# Gitea Merge Queue

Gitea 1.22.6 does not provide a real merge queue. Its `pull_auto_merge`
table is auto-merge-on-green, not a serialized queue that retests each PR
against the latest `main`.

`gitea-merge-queue` is the external queue for `molecule-core`.

## Queue Contract

**Auto-discovery (opt-OUT, default).** You do NOT need to label a PR. The bot
auto-discovers every open same-repo PR and merges any that meets the bar. The
`merge-queue` label is now optional metadata, not a gate. This removed the
historical autonomy gap: agent Gitea tokens lack `write:issue` (labels are
issue-scoped), so agents could never self-label and ready PRs stalled.

To keep a PR OUT of autonomous merging, add an opt-OUT label:
`merge-queue-hold`, `do-not-auto-merge`, or `wip`. Draft PRs are also skipped.

The bot processes one PR per tick:

1. Confirms `main`'s branch-protection-required push contexts are green.
2. Selects the oldest open same-repo PR that is NOT opt-out-labeled and NOT a
   draft (auto-discovery). With `AUTO_DISCOVER=0` it falls back to legacy
   opt-IN: only PRs carrying `merge-queue` are considered.
3. Rejects fork PRs because the queue may only update same-repo branches.
4. If the PR head does not contain current `main`, calls Gitea's
   `/pulls/{n}/update?style=merge` endpoint and waits for CI on the new head.
5. Merges only when, on the PR's CURRENT head sha:
   - `>= required_approvals` distinct genuine official `APPROVED` reviews from
     the recognised reviewer set (read from branch protection; default 2),
   - no open official `REQUEST_CHANGES`,
   - every branch-protection-required status context is green, and
   - the PR is `mergeable` (Gitea returns `True`; `None`/`False` = wait).

The merge bar is unchanged by auto-discovery — only WHICH PRs are considered
changes. The workflow is serialized with `concurrency`, so two PRs cannot be
merged against the same observed `main`.
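
The selection rule and the merge bar above can be read as two predicates. The following is a rough sketch under stated assumptions, not the logic of `gitea-merge-queue.py`: the PR dictionary shape, the field names, the context names, and the use of the lowest PR number as a proxy for "oldest" are all hypothetical.

```python
OPT_OUT_LABELS = {"merge-queue-hold", "do-not-auto-merge", "wip"}


def select_candidate(open_prs):
    """Pick the oldest same-repo PR that is neither a draft nor opted out."""
    eligible = [
        pr for pr in open_prs
        if not pr["draft"]
        and not pr["is_fork"]
        and not OPT_OUT_LABELS.intersection(pr["labels"])
    ]
    # Assumption: the lowest PR number stands in for "oldest".
    return min(eligible, key=lambda pr: pr["number"], default=None)


def meets_merge_bar(pr, reviewer_set, required_contexts, required_approvals=2):
    """Apply the four merge conditions to the PR's current head sha."""
    head = pr["head_sha"]
    approvers = {
        r["user"] for r in pr["reviews"]
        if r["state"] == "APPROVED" and r["official"]
        and r["commit_id"] == head and r["user"] in reviewer_set
    }
    changes_requested = any(
        r["state"] == "REQUEST_CHANGES" and r["official"] and not r.get("dismissed", False)
        for r in pr["reviews"]
    )
    # A required context that has not reported is treated as not green.
    contexts_green = all(pr["statuses"].get(ctx) == "success" for ctx in required_contexts)
    return (
        len(approvers) >= required_approvals
        and not changes_requested
        and contexts_green
        and pr["mergeable"] is True  # None and False both mean "wait"
    )


reviewers = {"agent-reviewer", "agent-researcher", "agent-reviewer-cr2"}
required = ["qa-review", "security-review"]  # hypothetical context names
ready = {
    "number": 7, "draft": False, "is_fork": False, "labels": [],
    "head_sha": "abc", "mergeable": True,
    "reviews": [
        {"user": "agent-reviewer", "state": "APPROVED", "official": True, "commit_id": "abc"},
        {"user": "agent-researcher", "state": "APPROVED", "official": True, "commit_id": "abc"},
    ],
    "statuses": {"qa-review": "success", "security-review": "success"},
}
held = dict(ready, number=3, labels=["wip"])
print(select_candidate([held, ready])["number"], meets_merge_bar(ready, reviewers, required))
```
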

## Operator Commands

Queue a PR (optional — auto-discovery already considers every ready PR; the
label is just visible metadata):

```bash
curl -fsS -X POST \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  "https://git.moleculesai.app/api/v1/repos/molecule-ai/molecule-core/issues/<PR>/labels" \
  -d '{"labels":["merge-queue"]}'
```

Keep a PR OUT of autonomous merging (opt-OUT — use `merge-queue-hold`,
`do-not-auto-merge`, or `wip`):

```bash
curl -fsS -X POST \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  "https://git.moleculesai.app/api/v1/repos/molecule-ai/molecule-core/issues/<PR>/labels" \
  -d '{"labels":["merge-queue-hold"]}'
```

Run the bot manually from a trusted checkout:

```bash
GITEA_TOKEN="$DEVOPS_ENGINEER_TOKEN" \
GITEA_HOST=git.moleculesai.app \
REPO=molecule-ai/molecule-core \
WATCH_BRANCH=main \
QUEUE_LABEL=merge-queue \
HOLD_LABEL=merge-queue-hold \
AUTO_DISCOVER=1 \
OPT_OUT_LABELS=do-not-auto-merge,wip \
REVIEWER_SET=agent-reviewer,agent-researcher,agent-reviewer-cr2 \
UPDATE_STYLE=merge \
python3 .gitea/scripts/gitea-merge-queue.py --dry-run
```

Dry run:

```bash
python3 .gitea/scripts/gitea-merge-queue.py --dry-run
```

## Branch Protection

`main` should keep direct merges restricted to the non-bypass merge actor
used by the queue. Normal humans and agents should not merge directly.

`block_on_outdated_branch` should be enabled as a defense in depth, but it
does not replace the queue. The queue still performs its own current-main
check immediately before merge because branch protection alone cannot
serialize two already-green PRs.

## Failure Handling

If `main` is not green, the queue pauses and does not merge anything.

If a queued PR is stale, the queue updates the PR branch and comments on the
PR. It does not merge until CI runs on the updated head.

If the queue workflow fails, treat it as a CI/CD incident. Do not bypass by
manually merging unless the human operator explicitly accepts the risk.
@@ -1,406 +0,0 @@
# Gitea Actions operational quirks (molecule-core)

Documents persistent operational findings about Gitea Actions runner behaviour
that differ from GitHub Actions and require workarounds in workflow YAML or
runbooks.

> Last updated: 2026-05-12 (infra-runtime-be-agent)

---

## Quirk #1 — Large repo causes fetch timeout on Gitea Actions runner

### Finding

The Gitea Actions runner (container on host `5.78.80.188`) can reach the git
remote (`https://git.moleculesai.app`) over HTTPS — a single-commit shallow
fetch (`--depth=1`) succeeds in ~16 s. However, fetching the **full compressed
repo history** (~75+ MB) exceeds the runner's network timeout window (~15 s).

This is **not a Gitea Actions bug** and **not a network isolation policy** —
it is a repo-size constraint. The runner can reach external hosts (GitHub,
Docker Hub, PyPI) without issue.

### Impact

Workflows that rely on `actions/checkout` with `fetch-depth: 0` (full history)
or `git clone` will time out.

Specifically:
- `actions/checkout@v*` with `fetch-depth: 0` hangs (fetching full repo
  history takes >15 s before hitting the timeout).
- `git clone <url>` hangs for the same reason.
- `git fetch origin <ref> --depth=1` **succeeds** in ~16 s — this is the
  working pattern.

### Affected workflows

| Workflow | Issue | Workaround |
|---|---|---|
| `harness-replays.yml` detect-changes job | `fetch-depth: 0` + `git clone` time out | Added `timeout 20 git fetch origin base.ref --depth=1` + `continue-on-error: true` + fallback to `run=true` per PR #441 |
| `publish-workspace-server-image.yml` | In-image `git clone` of workspace templates | Pre-clone manifest deps before compose build (Task #173 pattern) |
| Any workflow using `fetch-depth: 0` | Full history fetch times out | Use `fetch-depth: 1` + explicit `git fetch` for needed refs |

### How to diagnose

```bash
# From inside the runner (add as a debug step):
timeout 20 git fetch origin main --depth=1
# If this SUCCEEDS (~16s): runner can reach the git remote — the repo is
# too large for full-history fetch.
# If this times out: true network isolation (unlikely; check firewall rules).
```

### Verification

Confirmed 2026-05-11 by running `timeout 20 git fetch origin base.ref --depth=1`
in the `detect-changes` job of `harness-replays.yml` — **succeeds in ~16 s**.
Runner can reach `https://api.github.com` and `https://pypi.org` without issue,
confirming this is a repo-size constraint, not network isolation.

### References

- PR #441: fix for `harness-replays.yml` detect-changes
- Task #173: pre-clone manifest deps pattern for compose build
- internal#102: tracking customer-private + marketplace third-party repos
- `feedback_oss_first_repo_visibility_default`: 5 workspace-template repos
  flipped public to allow pre-clone without auth

---

## Quirk #2 — `continue-on-error` only works at step level, not job level

### Finding

Gitea Actions (1.22.6) does not honour `continue-on-error: true` at the **job**
level the way GitHub Actions does. A job with `continue-on-error: true` that
fails still reports `status: failure` in the commit status API.

Only `continue-on-error: true` at the **step** level works as expected.

### Impact

If you want a job to always "pass" in the status API (so dependent jobs can
run and the overall CI does not show `failure`), you must add
`continue-on-error: true` to every step that can fail, AND ensure each step
exits with code 0 (e.g., append `|| true` to commands that might fail).

### Affected workflows

| Workflow | Fix |
|---|---|
| `harness-replays.yml` detect-changes | Added `continue-on-error: true` to fetch step + decide step; added `|| true` to `DIFF=$(git diff ...)` per PR #441 |

### How to diagnose

```yaml
# WRONG — job reports as failure despite flag
jobs:
  my-job:
    continue-on-error: true  # ← ignored by Gitea
    steps:
      - run: git diff ...  # ← if this fails, job = failure
        # job-level flag does not help

# RIGHT — step-level flag prevents step from failing
jobs:
  my-job:
    steps:
      - run: git diff ... || true  # ← step exits 0
        continue-on-error: true  # ← belt and suspenders
```

### References

- Quirk #10 (this document): Gitea does NOT auto-populate `secrets.GITHUB_TOKEN`
- PR #441: fix applied to `harness-replays.yml`

---

## Quirk #3 — `workflow_dispatch.inputs` not supported

Gitea 1.22.6 parser rejects `workflow_dispatch.inputs`. Drop from all workflow
YAML files ported from GitHub Actions. Manual triggers should use
`workflow_dispatch` without `inputs:`.

**Reference**: `feedback_gitea_workflow_dispatch_inputs_unsupported`

---

## Quirk #4 — `merge_group` not supported

Gitea has no native merge queue concept. Drop `merge_group:` triggers from
all workflow YAML files.

For `molecule-core`, use the external serialized queue documented in
`runbooks/gitea-merge-queue.md`. Gitea's `pull_auto_merge` table is
auto-merge-on-green, not a queue that retests each PR against latest `main`.

---

## Quirk #5 — `environment:` blocks not supported

Gitea has no environments concept. Drop `environment:` from all workflow YAML
files. Secrets and variables are repo-level.

---

## Quirk #6 — Gitea combined status reports `failure` when all contexts are `null`

### Finding

When ALL individual status contexts for a commit have `state: null` (no runner
has reported yet), Gitea reports the combined commit status as `failure`. This
is a Gitea Actions bug — it conflates "no status reported yet" with "failed".

### Impact

- The `main-red-watchdog` workflow opens a `[main-red]` issue for every
  scheduled workflow run where the combined state is `failure` — even when
  the failure is entirely due to Gitea's combined-status bug.
- This causes spurious `[main-red]` issues that waste SRE time investigating
  non-existent failures.
- **This is especially confusing for `schedule:`-only workflows** (canary,
  sweep jobs, synth-E2E): Gitea attributes their scheduled runs to `main`'s
  HEAD commit, so if a scheduled run fires while all contexts are still
  `state: null`, the watchdog opens a `[main-red]` issue on the latest main
  commit even though that commit itself is perfectly fine.

### How to diagnose

Always check the **individual context `state` fields**, not the combined
`state`/`combined_state`. In the `/repos/{org}/{repo}/commits/{sha}/statuses`
API response, look for `"state": null` on every entry — if all are null, the
combined `failure` is Gitea's bug, not a real CI failure.

```json
{
  "combined_state": "failure",  // ← Gitea bug when all are null
  "contexts": [
    { "context": "CI / Lint", "state": null },  // still running
    { "context": "CI / Test", "state": null }   // still running
  ]
}
```
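
A consumer such as a watchdog could derive its own verdict from the individual contexts instead of trusting the combined field. The sketch below is one possible reading of that advice, not the watchdog's actual code: `effective_state` is a hypothetical helper, the payload shape follows the example above, and treating an empty or all-null list as `pending` is an assumption.

```python
def effective_state(contexts):
    """Derive a commit verdict from per-context states, ignoring the combined field."""
    states = [ctx.get("state") for ctx in contexts]
    if not states or all(state is None for state in states):
        # Nothing has reported yet: not a failure, just not known.
        return "pending"
    if any(state in ("failure", "error") for state in states):
        return "failure"
    if all(state == "success" for state in states):
        return "success"
    # A mix of success and unreported or pending contexts.
    return "pending"


payload = {
    "combined_state": "failure",
    "contexts": [
        {"context": "CI / Lint", "state": None},
        {"context": "CI / Test", "state": None},
    ],
}
print(payload["combined_state"], "vs", effective_state(payload["contexts"]))
```
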

### Affected workflows

All workflows, but especially `schedule:`-only workflows that run on `main`.
The main-red-watchdog (`.gitea/workflows/main-red-watchdog.yml`) is the
primary consumer of combined status and is affected.

### References

- Issue #481: first real-world case of this bug (2026-05-11)
- `feedback_no_such_thing_as_flakes`: watchdog directive

---

## Quirk #7 — TBD

*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*

### Finding

*[What Gitea Actions does differently from GitHub Actions.]*

### Impact

*[Which workflows or operations are affected.]*

### Workaround

*[How to work around this quirk.]*

### References

- internal#[N]: first observation

---

## Quirk #8 — TBD

*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*

### Finding

*[What Gitea Actions does differently from GitHub Actions.]*

### Impact

*[Which workflows or operations are affected.]*

### Workaround

*[How to work around this quirk.]*

### References

- internal#[N]: first observation

---

## Quirk #9 — TBD

*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*

### Finding

*[What Gitea Actions does differently from GitHub Actions.]*

### Impact

*[Which workflows or operations are affected.]*

### Workaround

*[How to work around this quirk.]*

### References

- internal#[N]: first observation

---

## Quirk #10 — Gitea does NOT auto-populate `secrets.GITHUB_TOKEN`

### Finding

Gitea Actions (1.22.6) does **not** auto-populate `secrets.GITHUB_TOKEN`
the way GitHub Actions does. A workflow that references `secrets.GITHUB_TOKEN`
without explicitly provisioning a named secret gets an empty string, not a
read-only token scoped to the repo.

### Impact

Workflows that call the Gitea REST API using `secrets.GITHUB_TOKEN` as auth
receive **HTTP 401** on every API call. Affected workflows in molecule-core:

| Workflow | Symptom | Workaround |
|---|---|---|
| `gate-check-v3.yml` | Reports BLOCKED on every PR | Provision `SOP_CHECKLIST_GATE_TOKEN`; update workflow to use it |
| `qa-review.yml` | Fails immediately on PR open | Same: needs a named secret |
| `security-review.yml` | Fails immediately on PR open | Same: needs a named secret |

### How to diagnose

Add a debug step to the failing workflow:

```yaml
- name: Diagnose token
  run: |
    echo "Token present: ${{ secrets.GITHUB_TOKEN != '' }}"
    curl -sS --fail -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
      "$GITHUB_SERVER_URL/api/v1/user" | jq -r '.login'
    # Expected (GitHub): prints your username.
    # Actual (Gitea): HTTP 401 or empty string.
```
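A helper script can fail fast on the empty-token case instead of surfacing it later as a 401. The sketch below assumes the workflow exports the named secret into the job environment; `require_token` and `auth_header` are hypothetical helper names, and `SOP_CHECKLIST_GATE_TOKEN` is the secret name from the table above.

```python
import os
from collections.abc import Mapping


def require_token(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the token stored in `env[name]`, or raise if it is unset or blank."""
    token = env.get(name, "").strip()
    if not token:
        # An unprovisioned secret expands to an empty string on Gitea 1.22.6.
        raise RuntimeError(f"{name} is empty; provision it as a named secret")
    return token


def auth_header(token: str) -> dict[str, str]:
    """Build the same Authorization header the curl diagnostic sends."""
    return {"Authorization": f"token {token}"}


if __name__ == "__main__":
    demo_env = {"SOP_CHECKLIST_GATE_TOKEN": "example-value"}  # placeholder, not a real token
    print(auth_header(require_token("SOP_CHECKLIST_GATE_TOKEN", demo_env)))
```

Raising before the first API call keeps the failure message specific to the missing secret rather than to whichever endpoint happened to be called first.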

### References

- internal#325: root-cause analysis and token provisioning
- `feedback_gitea_no_auto_supplied_github_token`

---

## Quirk #11 — PR-create event dispatcher races — only 1 of N workflows fires on `pull_request opened`

### Finding

When a PR is created via the Gitea web UI or API, the Gitea Actions event
dispatcher may fire **only 1 of N eligible workflows** on the initial
`pull_request opened` event. All other eligible workflows are silently dropped.

This was observed on molecule-core PR #558 (created 2026-05-11T19:54:10Z):
12+ workflows had no `paths:` filter and should have fired, but only
`sop-checklist.yml` dispatched.

Concurrent PRs created within the same minute received 12–30 dispatches each,
confirming this is specific to the PR-create event dispatch, not a general
runner capacity issue.

### Impact

- PRs may not run the full CI suite on first open.
- `gate-check-v3`, `secret-scan`, `qa-review`, and `security-review` can be
  silently absent from the PR's status checks.
- Branch protection may block the merge because required contexts were never
  reported, even though no check actually failed.

### How to diagnose

```bash
# List workflow runs for the PR:
gh run list --event pull_request --repo molecule-ai/molecule-core \
  | grep "$(gh pr view "$PR" --json number --jq '.number')"

# Expected: 12+ runs on PR open.
# Actual (when race fires): only 1 run.
```
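The same diagnosis can be made from commit statuses: compare the contexts that reported against the contexts you expected. This is a minimal sketch, assuming the status entries have already been fetched and each carries a `context` field; `missing_contexts` is a hypothetical helper, and the bare workflow names used as context strings are illustrative (real context strings may include job names).

```python
from collections.abc import Iterable, Mapping


def missing_contexts(
    expected: Iterable[str],
    statuses: Iterable[Mapping[str, object]],
) -> list[str]:
    """Return the expected contexts that never reported a status, sorted."""
    reported = {str(s["context"]) for s in statuses if s.get("context")}
    return sorted(set(expected) - reported)


# Illustrative context names taken from the workflows listed above.
EXPECTED = {"gate-check-v3", "secret-scan", "qa-review", "security-review", "sop-checklist"}
statuses = [{"context": "sop-checklist", "state": "success"}]
print(missing_contexts(EXPECTED, statuses))
```

A non-empty result on a freshly opened PR, with only one context reported, matches the race described in this quirk.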

### Workaround

Force a second dispatch by pushing a no-op synchronize commit:

```bash
git commit --allow-empty -m "chore: trigger workflows [skip ci]"
git push
```

The synchronize event fires a second `pull_request` event, which reliably
triggers all eligible workflows.

### References

- internal#329: first observation on PR #558
- `feedback_gitea_pr_create_dispatcher_race`

---

## When you find a new quirk

Copy the template below, increment the quirk number, and fill in the finding,
impact, workaround, and references. Place the new section in the **correct
numerical position** (before the next higher-numbered quirk). Update this
section's final paragraph to remove the next slot's number.

### Template

```markdown
## Quirk #N — <short title>

### Finding

<What Gitea Actions does differently from GitHub Actions.>

### Impact

<Which workflows or operations are affected. Include an affected workflows
table if more than one is affected.>

### How to diagnose

<Shell commands or API calls that confirm this is the quirk, not a real failure.>

### Workaround

<How to work around this quirk in workflow YAML or operations.>

### References

- internal#[N]: first observation
- <Any Gitea issue, feedback label, or upstream bug tracker reference>
```

---

## Open questions for Gitea 1.23

- [ ] **act_runner concurrent-job cap** (issue #305): runner saturation under
  merge burst; needs `max_concurrent_jobs` cap configured on act_runner
- [ ] **Infisical→Gitea secret-sync** (issue #307): eliminate manual secret
  PUTs by wiring an Infisical cron to the Gitea API
- [ ] **PR-create dispatcher race resolution** (internal #329): is there a
  Gitea fix or config knob to disable the race? File an upstream bug if not
- [ ] **GITHUB_TOKEN auto-population** (internal #325): is this on the
  Gitea 1.23 roadmap? If not, the workaround (named secret) is the permanent
  answer
@@ -1,64 +0,0 @@
# Production Auto-Deploy

`molecule-core` deploys production tenant code automatically from Gitea Actions.

This runbook is an implementation-specific companion to `runbooks/sop-production-cicd.md`.

## Default Flow

On a push to `main` that touches deployable code, `.gitea/workflows/publish-workspace-server-image.yml`:

1. Builds and pushes platform and tenant ECR images tagged `staging-<sha>` and `staging-latest`.
2. Self-tests the production deploy helper and workflow-YAML linter.
3. Waits for strict required push contexts on the same commit to become `success`.
4. Calls production control-plane `POST /cp/admin/tenants/redeploy-fleet` with `target_tag=staging-<sha>`.
5. Verifies every redeploy result is healthy and every tenant returns the same Git SHA from `/buildinfo`.

The deploy workflow intentionally does not use Gitea `concurrency` because Gitea 1.22.6 can cancel queued runs even when `cancel-in-progress: false`.
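Step 3 of the default flow can be sketched as a polling loop that only returns once every required context is `success`. This is an illustration under stated assumptions, not the workflow's actual helper: `wait_for_contexts`, the `fetch` callable (which would wrap the commit-status API call), and the timeout and interval defaults are all hypothetical. The set of terminal states mirrors the list the production CI/CD SOP asks helpers to unit-test.

```python
import time
from typing import Callable, Optional

# Terminal states that must stop the deploy instead of being waited on.
TERMINAL_BAD = {"failure", "error", "cancelled", "canceled", "skipped"}


def wait_for_contexts(
    fetch: Callable[[], dict[str, Optional[str]]],
    required: set[str],
    timeout_s: float = 1800.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Return once every required context is `success`; raise otherwise."""
    deadline = clock() + timeout_s
    while True:
        states = fetch()  # maps context name -> state (None while pending)
        bad = sorted(c for c in required if states.get(c) in TERMINAL_BAD)
        if bad:
            raise RuntimeError(f"required contexts ended badly: {bad}")
        if all(states.get(c) == "success" for c in required):
            return
        if clock() >= deadline:
            pending = sorted(c for c in required if states.get(c) != "success")
            raise TimeoutError(f"required contexts still pending: {pending}")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop unit-testable without real delays. A context that is missing from the response is treated as pending, so it fails closed at the timeout rather than passing silently.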

## Kill Switch

Set either a repository variable or a secret:

```text
PROD_AUTO_DEPLOY_DISABLED=true
```

The image publish still runs, but the production redeploy step exits successfully without touching tenants.
Immediately before the production POST, the workflow re-checks the live Gitea repo variable when `PROD_AUTO_DEPLOY_CONTROL_TOKEN` can read Actions variables. If that token is not configured, the job-start value is still honored.
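Disable-flag parsing is small enough to sketch directly. The source only documents the value `true`; everything else here is an assumption: `is_deploy_disabled` is a hypothetical name, the set of values treated as "enabled" is illustrative, and treating any unrecognized non-empty value as disabled is a fail-closed choice made for this sketch.

```python
from typing import Optional

# Values that leave auto-deploy enabled (assumed for illustration).
_ENABLED_VALUES = {"", "0", "false", "no", "off"}


def is_deploy_disabled(raw: Optional[str]) -> bool:
    """Interpret PROD_AUTO_DEPLOY_DISABLED; unknown values disable the deploy."""
    value = (raw or "").strip().lower()
    return value not in _ENABLED_VALUES


for sample in ("true", " TRUE ", "", None, "false", "maybe"):
    print(repr(sample), is_deploy_disabled(sample))
```

With this shape, a typo in the variable stops the deploy instead of silently letting it proceed, which matches the fail-closed posture the production SOP asks for.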

## Tunables

Repository variables:

```text
PROD_CP_URL=https://api.moleculesai.app
PROD_AUTO_DEPLOY_CANARY_SLUG=hongming
PROD_AUTO_DEPLOY_SOAK_SECONDS=60
PROD_AUTO_DEPLOY_BATCH_SIZE=3
PROD_AUTO_DEPLOY_DRY_RUN=false
PROD_MANUAL_REDEPLOY_TARGET_TAG=staging-<known-good-sha>
```

Secrets required:

```text
CP_ADMIN_API_TOKEN
AUTO_SYNC_TOKEN
PROD_AUTO_DEPLOY_CONTROL_TOKEN
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
```

`AUTO_SYNC_TOKEN` is only used to read Gitea commit statuses while waiting for required push contexts.
`PROD_AUTO_DEPLOY_CONTROL_TOKEN` is optional but recommended so the pre-POST kill-switch check can read the live `PROD_AUTO_DEPLOY_DISABLED` Actions variable.

## Manual Fallback

Use `.gitea/workflows/redeploy-tenants-on-main.yml` when the automatic path needs to be rerun or rolled back. Gitea 1.22.6 does not support reliable `workflow_dispatch` inputs, so rollback uses a repo variable:

1. Set `PROD_MANUAL_REDEPLOY_TARGET_TAG=staging-<known-good-sha>`.
2. Dispatch `manual-redeploy-tenants-on-main`.
3. Clear `PROD_MANUAL_REDEPLOY_TARGET_TAG` after the rollback finishes.

With no variable set, the fallback redeploys `staging-<current-main-sha>`.
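The pin-or-default rule above can be sketched as a single function. `resolve_target_tag` is a hypothetical name, and rejecting pins that do not start with `staging-` is an assumption added for illustration; the source only states the two outcomes (pinned tag, or `staging-<current-main-sha>`).

```python
from typing import Optional


def resolve_target_tag(pinned: Optional[str], main_sha: str) -> str:
    """Pick the redeploy tag: the pinned rollback tag if set, else current main."""
    tag = (pinned or "").strip()
    if not tag:
        return f"staging-{main_sha}"
    if not tag.startswith("staging-"):
        # Assumed guard: refuse tags outside the documented naming scheme.
        raise ValueError(f"unexpected rollback tag: {tag!r}")
    return tag


print(resolve_target_tag(None, "abc123"))
print(resolve_target_tag("staging-deadbeef", "abc123"))
```

Forgetting step 3 (clearing the variable) would keep every later manual redeploy pinned to the old tag, so the helper's caller may want to log which branch was taken.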
@@ -1,77 +0,0 @@
# SOP: Production CI/CD Changes

Production CI/CD changes are higher risk than ordinary CI edits. They can publish images, deploy tenants, promote tags, mutate branch protection, or change merge behavior. This SOP separates rules that must be enforced by code from rules that require human judgment.

## Programmatic Gates

The workflow YAML linter is the first line of enforcement:

```bash
python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows
```

It must reject:

- Gitea-hostile syntax such as `workflow_dispatch.inputs`, `workflow_run`, workflow name collisions, slash-containing workflow names, and unsupported cross-repo action references.
- Production deploy workflows that rely on `concurrency.cancel-in-progress: false` for serialization.
- Production deploy workflows that print raw control-plane responses or raw `.error` fields into CI logs.
- Production redeploy workflows with no kill switch or rollback/pin control.

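A few of the rejection rules above can be sketched against an already-parsed workflow document. This is not the contents of `lint-workflow-yaml.py`; `lint_workflow` and its messages are hypothetical, it covers only four of the listed rules, and it applies the `concurrency` rule to every workflow rather than only production deploy workflows. It takes a plain dict so the sketch needs no YAML dependency.

```python
from typing import Any


def lint_workflow(doc: dict[Any, Any]) -> list[str]:
    """Return human-readable violations for one parsed workflow document."""
    errors: list[str] = []
    if "/" in str(doc.get("name", "")):
        errors.append("workflow name contains '/'")

    # PyYAML (YAML 1.1) parses a bare `on:` key as the boolean True, so check both.
    triggers = doc.get("on", doc.get(True)) or {}
    if isinstance(triggers, str):
        triggers = {triggers: None}
    elif isinstance(triggers, list):
        triggers = {name: None for name in triggers}

    if "workflow_run" in triggers:
        errors.append("workflow_run is not supported")
    dispatch = triggers.get("workflow_dispatch") or {}
    if isinstance(dispatch, dict) and dispatch.get("inputs"):
        errors.append("workflow_dispatch.inputs is not reliable")

    concurrency = doc.get("concurrency")
    if isinstance(concurrency, dict) and concurrency.get("cancel-in-progress") is False:
        errors.append("concurrency.cancel-in-progress: false does not serialize runs")
    return errors


bad = {
    "name": "deploy/prod",
    "on": {"workflow_dispatch": {"inputs": {"tag": {"required": True}}}},
    "concurrency": {"group": "prod", "cancel-in-progress": False},
}
print(lint_workflow(bad))
```

Returning a list of violations rather than raising on the first one lets a CI run report every problem in a workflow at once.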

Production deploy helpers must also unit-test:

- Disable-flag parsing.
- Required status context selection.
- Terminal status handling for `failure`, `error`, `cancelled`, `canceled`, and `skipped`.
- Production control-plane URL guards.
- Rollback target/pin handling when applicable.
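As one example from the list above, a control-plane URL guard and its checks might look like the following. This is a sketch under assumptions: `check_prod_cp_url` is a hypothetical name, and the allow-list contains only the host from the `PROD_CP_URL` value shown in the auto-deploy runbook.

```python
from urllib.parse import urlsplit

# Assumed allow-list: the host of the documented PROD_CP_URL.
ALLOWED_PROD_HOSTS = {"api.moleculesai.app"}


def check_prod_cp_url(url: str) -> str:
    """Return the normalized URL, or raise if it is not an allowed HTTPS host."""
    parts = urlsplit(url.strip())
    if parts.scheme != "https":
        raise ValueError("control-plane URL must use https")
    if parts.hostname not in ALLOWED_PROD_HOSTS:
        raise ValueError(f"unexpected control-plane host: {parts.hostname!r}")
    if parts.username or parts.password:
        raise ValueError("control-plane URL must not embed credentials")
    return url.strip().rstrip("/")


print(check_prod_cp_url("https://api.moleculesai.app/"))
```

Matching on the parsed hostname rather than on a string prefix avoids accepting look-alike hosts such as `api.moleculesai.app.example.com`.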

## Required PR Evidence

Every production CI/CD PR must include concrete answers for:

- Root cause: what production failure mode or process gap is being closed.
- Deploy gate: which exact contexts must be green before production side effects.
- Kill switch: how to stop deployment without reverting the PR.
- Verification: how production state is proven after deployment.
- Logging: proof that CI logs do not contain raw production runtime, SSM, or secret-adjacent output.
- Rollback: the exact command, variable, or workflow to return to a known-good tag/digest.
- No fail-open gates: required checks fail loud + closed on protected contexts (no skip/`|| true`/`403`-as-pass). See `runbooks/dev-sop.md` § *Fail-closed CI integrity*.

## Human Review

Production CI/CD PRs need non-author review across these roles:

- DevOps: Gitea Actions semantics, branch protection, merge queue, and runner behavior.
- SRE: rollout order, tenant health checks, observability, and partial-deploy recovery.
- Security: secrets, token scopes, log redaction, and production endpoint targeting.

Critical or Required review findings must be closed with one of:

- A code change plus verification.
- An evidence-backed rejection.
- A follow-up issue only if the finding is explicitly not merge-blocking.

Acknowledgement alone is not closure.

## Production Defaults

Production deploys should fail closed:

- Missing tenant result: fail.
- Tenant unhealthy: fail.
- `/buildinfo` unreachable: fail.
- SHA mismatch: fail.
- Required status cancelled/skipped/missing past timeout: fail.

Staging may tolerate warnings during rollout development; production should not.
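The tenant-level defaults above can be sketched as one evaluation pass over the redeploy results. The result shape is an assumption made for illustration: `fleet_failures` is a hypothetical helper, and the `healthy` and `buildinfo_sha` keys are invented field names, not the control plane's actual response schema.

```python
from collections.abc import Iterable, Mapping
from typing import Any


def fleet_failures(
    expected_slugs: Iterable[str],
    results: Mapping[str, Mapping[str, Any]],
    expected_sha: str,
) -> list[str]:
    """Return one failure message per tenant that is not provably deployed."""
    failures: list[str] = []
    for slug in sorted(expected_slugs):
        result = results.get(slug)
        if result is None:
            failures.append(f"{slug}: missing tenant result")
        elif result.get("healthy") is not True:
            failures.append(f"{slug}: tenant unhealthy")
        elif not result.get("buildinfo_sha"):
            failures.append(f"{slug}: /buildinfo unreachable")
        elif result["buildinfo_sha"] != expected_sha:
            failures.append(f"{slug}: SHA mismatch")
    return failures


results = {
    "tenant-a": {"healthy": True, "buildinfo_sha": "abc123"},
    "tenant-b": {"healthy": True, "buildinfo_sha": "old999"},
}
print(fleet_failures(["tenant-a", "tenant-b", "tenant-c"], results, "abc123"))
```

Iterating over the expected tenant list, not over the returned results, is what makes a missing result count as a failure instead of being skipped.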

## Gitea 1.22.6 Constraints

Do not design production CI/CD around unsupported or unreliable features:

- No `workflow_run`.
- No reliable `workflow_dispatch.inputs`.
- Do not assume `concurrency.cancel-in-progress: false` serializes queued runs.
- Do not rely on a masked aggregate status as the only production deploy gate.

If these constraints change after a Gitea upgrade, update this SOP and the workflow linter in the same PR.