From 43f00ddaece5e03ca1d790128fea1fe629746762 Mon Sep 17 00:00:00 2001
From: Molecule AI Core-DevOps
Date: Sat, 16 May 2026 10:44:54 +0000
Subject: [PATCH] docs(runbooks): add Gitea Actions operational quirks reference

Documents four persistent Gitea 1.22.6 Actions quirks discovered during the 2026-05-11 CI noise investigation (PR #441):

- Runner network isolation: git remote unreachable from container
- continue-on-error only at step level: job-level flag ignored
- workflow_dispatch.inputs not supported: parser rejects at load time
- fetch-depth:0 times out: use fetch-depth:1 + Compare API

Closes #457.

Co-Authored-By: Claude Opus 4.7
---
 docs/runbooks/gitea-actions-quirks.md | 139 ++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100644 docs/runbooks/gitea-actions-quirks.md

diff --git a/docs/runbooks/gitea-actions-quirks.md b/docs/runbooks/gitea-actions-quirks.md
new file mode 100644
index 00000000..fbd204d2
--- /dev/null
+++ b/docs/runbooks/gitea-actions-quirks.md
@@ -0,0 +1,139 @@
+# Gitea Actions Operational Quirks
+
+Four persistent Gitea 1.22.6 Actions quirks discovered during the 2026-05-11 CI noise investigation (PR #441). These are environment-level facts, not bugs to fix — write and review workflows with them in mind.
+
+---
+
+## 1. Runner Network Isolation
+
+**Symptom**: `git fetch`, `git clone`, and other outbound TCP connections from within act_runner job containers silently time out. The git remote (`git.moleculesai.app`) is reachable from the act_runner host process but not from inside the ephemeral job containers.
+
+**Confirmed scope**: all `molecule-runner-*` act_runner containers, which run jobs with their own network namespace (via Docker `--network: host` but with iptables isolation inside the container).
+
+**Impact**: any workflow step that calls `git fetch` or `git clone` inside the job container will hang and eventually time out. This was the root cause of the 2026-05-11 CI noise (PR #441).
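+
+To confirm the symptom on a particular runner without letting the job hang, a bounded probe can help. The step below is a hypothetical diagnostic, not taken from an existing workflow: it wraps `git ls-remote` in coreutils `timeout` (assumed present in the runner image) so a blocked connection fails after 30 seconds instead of stalling until the job timeout. It is marked `continue-on-error` at step level, where Gitea honors the flag (quirk #2).
+
+```yaml
+# Hypothetical diagnostic step: remove it once the runner's behavior is confirmed.
+- name: Probe git remote reachability (diagnostic only)
+  continue-on-error: true  # step level, so Gitea 1.22.6 honors it
+  env:
+    GIT_TERMINAL_PROMPT: "0"  # fail instead of prompting for credentials
+  run: |
+    if timeout 30 git ls-remote "$GITHUB_SERVER_URL/$GITHUB_REPOSITORY.git" HEAD; then
+      echo "git remote reachable from the job container"
+    else
+      rc=$?
+      # 124: timeout(1) killed the probe, consistent with network isolation.
+      # Other codes point at a different failure, such as DNS or authentication.
+      echo "git ls-remote failed with exit code $rc"
+      exit 1
+    fi
+```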
+
+### Workarounds
+
+**Prefer API calls over git**:
+- Use the Gitea Compare API (`/api/v1/repos/{owner}/{repo}/compare/{base}...{head}`) instead of `git diff`. The Compare API returns the list of changed files directly, without needing git history in the container.
+- Example from the `detect-changes` step in `harness-replays.yml`: `curl -sS "$GITHUB_SERVER_URL/api/v1/repos/$GITHUB_REPOSITORY/compare/$BASE...$HEAD"`
+- For push events, where a SHA-to-branch comparison is rejected (`BaseNotExist`), use the `github.event.commits` array instead; each commit object includes its added/removed/modified file lists.
+- See `.gitea/scripts/compare-api-diff-files.py` and `.gitea/scripts/push-commits-diff-files.py` for existing helpers.
+
+**If git inside the container is unavoidable**:
+- Use `actions/checkout` with `fetch-depth: 1` (shallow clone). The checkout action runs on the host side and mounts the repo into the container via `git clone --shared`, so it does not hit the container's outbound git limitation.
+- The cloned files are available inside the container at the usual path.
+- **Do not** run `git fetch` inside a `run:` shell step; it will hang. Use the `actions/checkout` step instead.
+
+**Anti-pattern (do not use)**:
+```yaml
+# WRONG — hangs in Gitea Actions runner containers
+- name: Fetch base ref
+  run: git fetch origin ${{ github.event.pull_request.base.sha }}
+```
+
+---
+
+## 2. `continue-on-error` Only at Step Level
+
+**Symptom**: `continue-on-error: true` set at the **job level** is silently ignored by Gitea 1.22.6. A failing job still fails the overall workflow run, exactly as if the flag were absent; only steps that carry their own step-level `continue-on-error: true` are tolerated.
+
+**Impact**: a job-level `continue-on-error` used as an escape hatch for flaky steps will NOT work. The escape hatch must be per-step.
+
+**Correct pattern**:
+```yaml
+jobs:
+  my-job:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Flaky step
+        continue-on-error: true  # ← must be here, on the step
+        run: ./might-fail.sh
+      - name: Deterministic step
+        run: ./always-works.sh
+```
+
+**Wrong pattern (ignored)**:
+```yaml
+jobs:
+  my-job:
+    runs-on: ubuntu-latest
+    continue-on-error: true  # ← ignored by Gitea 1.22.6; do not rely on this
+    steps:
+      - name: Flaky step
+        run: ./might-fail.sh
+```
+
+**Historical context**: this was the root cause of mc#774-style "pre-existing continue-on-error mask" escapes. Before the quirk was identified, jobs used job-level `continue-on-error: true` as an escape hatch; because that flag stopped working (or had never worked on Gitea), the flaky steps leaked failures through. The correct fix is step-level `continue-on-error: true` plus an `mc#314`-tagged comment with a removal date or commit reference, so the escape hatch is not permanent.
+
+---
+
+## 3. `workflow_dispatch.inputs` Not Supported
+
+**Symptom**: `workflow_dispatch.inputs` blocks in workflow YAML are rejected at parse time by the Gitea 1.22.6 workflow parser, so the workflow never registers.
+
+**Impact**: all workflows ported from GitHub Actions (per the RFC internal#219 §1 sweep) dropped their `workflow_dispatch.inputs` blocks. Any future workflow that declares `workflow_dispatch` inputs will fail to register.
+
+**Workaround**: use environment variables or secrets as configuration channels instead. For path-filtered manual runs, use `workflow_dispatch` without inputs and gate logic inside the job with `if:` conditions.
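+
+A minimal sketch of that gating follows. The job, step, and script names are hypothetical placeholders. It uses step-level `if:` because GitHub Actions does not make the `env` context available to job-level `if:`; the sketch assumes Gitea 1.22.6 behaves the same way.
+
+```yaml
+on:
+  workflow_dispatch:  # no inputs block (quirk #3)
+
+env:
+  TARGET: all  # hypothetical configuration channel; edit to narrow a manual run
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          fetch-depth: 1
+      - name: Build platform
+        if: env.TARGET == 'platform' || env.TARGET == 'all'
+        run: ./build-platform.sh  # hypothetical script
+      - name: Build canvas
+        if: env.TARGET == 'canvas' || env.TARGET == 'all'
+        run: ./build-canvas.sh  # hypothetical script
+```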
+
+**Example — replace inputs with env**:
+```yaml
+# GitHub Actions (what we used to write):
+on:
+  workflow_dispatch:
+    inputs:
+      target:
+        type: choice
+        options: [platform, canvas, all]
+
+# Gitea Actions (what we write now):
+on:
+  workflow_dispatch:
+    # no inputs block — not supported
+env:
+  TARGET: ${{ github.event.inputs.target || 'all' }}  # ← no inputs exist, so this always resolves to 'all'
+```
+
+If a choice is needed, document it in a workflow comment and use a separate job or a step-level `if:` condition.
+
+---
+
+## 4. `fetch-depth: 0` Times Out in Container
+
+**Symptom**: `actions/checkout` with `fetch-depth: 0` (full-history clone) hangs and times out in Gitea Actions runner containers. The act_runner host can clone fine, but the container's network isolation (see quirk #1) prevents the underlying full-history `git fetch` from completing.
+
+**Impact**: any workflow that needs both base and head SHAs locally for `git diff` must not rely on `fetch-depth: 0`.
+
+**Workaround**: use `fetch-depth: 1` (shallow clone) combined with the Gitea Compare API or the `github.event.commits` array (see quirk #1). The Compare API returns the same file-diff information without any git history in the container.
+
+**Correct pattern**:
+```yaml
+- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+  with:
+    fetch-depth: 1  # ← shallow clone only
+# Then use the Compare API or the commits array for changed-file detection
+```
+
+**Wrong pattern (hangs)**:
+```yaml
+- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+  with:
+    fetch-depth: 0  # ← times out in Gitea Actions runner containers
+```
+
+**Note**: `actions/checkout` itself runs on the host side (the act_runner process) and is not subject to container network isolation; the shallow clone via `fetch-depth: 1` succeeds because the checkout action performs it on the host. The restriction applies only to `run:` shell steps that independently call git.
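+
+Combining the two workarounds, a changed-file detection job might look like the sketch below. It is hypothetical rather than a copy of `harness-replays.yml`: the job name and `changed-files.txt` are illustrative, and it assumes `jq` is available in the runner image, that the automatic `GITHUB_TOKEN` can read the repository, and that the Compare API response lists files under `commits[].files[].filename` (the existing `.gitea/scripts/compare-api-diff-files.py` helper is the better reference for the exact shape).
+
+```yaml
+jobs:
+  detect-changes:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          fetch-depth: 1  # shallow clone only (quirk #4)
+      - name: Changed files (pull_request, via Compare API)
+        if: github.event_name == 'pull_request'
+        env:
+          BASE: ${{ github.event.pull_request.base.sha }}
+          HEAD: ${{ github.event.pull_request.head.sha }}
+          TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          set -o pipefail  # make a failed curl fail the step
+          curl -sS --fail -H "Authorization: token $TOKEN" \
+            "$GITHUB_SERVER_URL/api/v1/repos/$GITHUB_REPOSITORY/compare/$BASE...$HEAD" \
+            | jq -r '.commits[]?.files[]?.filename' | sort -u > changed-files.txt
+          cat changed-files.txt
+      - name: Changed files (push, via event payload)
+        if: github.event_name == 'push'
+        env:
+          COMMITS_JSON: ${{ toJSON(github.event.commits) }}
+        run: |
+          set -o pipefail
+          # Union of added/removed/modified across all pushed commits
+          printf '%s' "$COMMITS_JSON" \
+            | jq -r '.[]? | ((.added // []) + (.removed // []) + (.modified // []))[]' \
+            | sort -u > changed-files.txt
+          cat changed-files.txt
+```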
+
+---
+
+## Enforcement in CI
+
+Some of these quirks, along with other Gitea 1.22.6 incompatibilities, are enforced as lint rules in `lint-workflow-yaml.yml`, which runs `.gitea/scripts/lint-workflow-yaml.py` against all `.gitea/workflows/*.yml` files. The script currently covers:
+
+1. `workflow_dispatch.inputs` blocks (rule-1)
+2. `on: workflow_run` triggers (rule-2 — Gitea 1.22.6 lacks the event)
+3. Job names containing `/` (rule-3 — breaks status-context tokenization)
+4. Cross-file job-name collisions (rule-4)
+5. `uses: org/repo@sha` pointing at non-molecule repos (rule-5)
+6. `api.github.com` URL references without `GITHUB_SERVER_URL` set (rule-6 — warning)
+
+Of the four quirks above, only quirk #3 is covered (rule-1). Job-level `continue-on-error` (quirk #2), `fetch-depth: 0` (quirk #4), and `git fetch` inside `run:` steps (quirk #1) are not yet covered by automated lint; until they are, review workflow changes for these shapes manually. Do not add `continue-on-error: true` to the lint job as an escape hatch: if a lint fires for a legitimate reason, fix the workflow rather than suppressing the lint.
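+
+Until those rules exist, a stopgap step along the following lines could flag the two line-oriented shapes mechanically. It is a hypothetical sketch, not part of `lint-workflow-yaml.py`: the step name is illustrative, the `grep` patterns also match occurrences inside YAML comments, and quirk #2 is left out because detecting a job-level key reliably needs a YAML parser rather than line matching. In line with the guidance above, it deliberately carries no `continue-on-error`.
+
+```yaml
+# Hypothetical stopgap step; not part of .gitea/scripts/lint-workflow-yaml.py.
+# The patterns are written so this step does not match its own text.
+- name: Flag unlinted Gitea quirk shapes
+  run: |
+    status=0
+    # Quirk #4: full-history checkout
+    if grep -nE 'fetch-depth:[[:space:]]*0([^0-9]|$)' .gitea/workflows/*.yml; then
+      echo "ERROR: full-history checkout times out on Gitea runners; use fetch-depth: 1"
+      status=1
+    fi
+    # Quirk #1: git network commands inside run: steps
+    if grep -nE '(^|[^[:alnum:]_-])git[[:space:]]+(fetch|clone)([[:space:]]|$)' .gitea/workflows/*.yml; then
+      echo "ERROR: git network commands hang inside job containers; use the Compare API"
+      status=1
+    fi
+    exit "$status"
+```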