# Gitea Actions operational quirks (molecule-core)
Documents persistent operational findings about Gitea Actions runner behaviour
that differs from GitHub Actions and requires workarounds in workflow YAML or
runbooks.
> Last updated: 2026-05-12 (infra-runtime-be-agent)
---
## Quirk #1 — Large repo causes fetch timeout on Gitea Actions runner
### Finding
The Gitea Actions runner (container on host `5.78.80.188`) can reach the git
remote (`https://git.moleculesai.app`) over HTTPS — a single-commit shallow
fetch (`--depth=1`) succeeds in ~16 s. However, fetching the **full compressed
repo history** (~75+ MB) exceeds the runner's network timeout window (~15 s).
This is **not a Gitea Actions bug** and **not a network isolation policy**;
it is a repo-size constraint. The runner can reach external hosts (GitHub,
Docker Hub, PyPI) without issue.
### Impact
Workflows that rely on `actions/checkout` with `fetch-depth: 0` (full history)
or `git clone` will time out.
Specifically:
- `actions/checkout@v*` with `fetch-depth: 0` hangs (fetching full repo
history takes >15 s before hitting the timeout).
- `git clone <url>` hangs for the same reason.
- `git fetch origin <ref> --depth=1` **succeeds** in ~16 s — this is the
working pattern.
### Affected workflows
| Workflow | Issue | Workaround |
|---|---|---|
| `harness-replays.yml` detect-changes job | `fetch-depth: 0` + `git clone` time out | Added `timeout 20 git fetch origin base.ref --depth=1` + `continue-on-error: true` + fallback to `run=true` per PR #441 |
| `publish-workspace-server-image.yml` | In-image `git clone` of workspace templates | Pre-clone manifest deps before compose build (Task #173 pattern) |
| Any workflow using `fetch-depth: 0` | Full history fetch times out | Use `fetch-depth: 1` + explicit `git fetch` for needed refs |
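
The `harness-replays.yml` workaround in the table can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not the code from PR #441: the function name `decide_run`, its arguments, the `BASE_REF` variable, and the `run=true`/`run=false` output format are hypothetical and chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide whether the expensive downstream job should run.
#   $1 = exit status of the shallow fetch (0 = success; GNU `timeout` returns
#        124 when the time limit expires)
#   $2 = newline-separated list of changed paths (may be empty)
# Prints run=true or run=false.
decide_run() {
  local fetch_status="$1" changed="$2"
  if [ "$fetch_status" -ne 0 ]; then
    echo "run=true"    # fetch failed or timed out: no diff available, so run
  elif [ -n "$changed" ]; then
    echo "run=true"    # fetch worked and something changed
  else
    echo "run=false"   # fetch worked and nothing changed
  fi
}

# Example wiring inside a workflow step (BASE_REF is an assumed variable name):
#   fetch_status=0
#   timeout 20 git fetch origin "$BASE_REF" --depth=1 || fetch_status=$?
#   changed="$(git diff --name-only FETCH_HEAD HEAD || true)"
#   decide_run "$fetch_status" "$changed" >> "$GITHUB_OUTPUT"
```

The helper fails open: a fetch that times out yields `run=true`, so a slow remote costs CI time instead of silently skipping the job.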
### How to diagnose
```bash
# From inside the runner (add as a debug step):
timeout 20 git fetch origin main --depth=1
# If this SUCCEEDS (~16s): runner can reach the git remote — the repo is
# too large for full-history fetch.
# If this times out: true network isolation (unlikely; check firewall rules).
```
### Verification
Confirmed 2026-05-11 by running `timeout 20 git fetch origin base.ref --depth=1`
in the `detect-changes` job of `harness-replays.yml`: the fetch **succeeds in ~16 s**.
The runner can reach `https://api.github.com` and `https://pypi.org` without issue,
confirming this is a repo-size constraint, not network isolation.
### References
- PR #441: fix for `harness-replays.yml` detect-changes
- Task #173: pre-clone manifest deps pattern for compose build
- internal#102: tracking customer-private + marketplace third-party repos
- `feedback_oss_first_repo_visibility_default`: 5 workspace-template repos
flipped public to allow pre-clone without auth
---
## Quirk #2 — `continue-on-error` only works at step level, not job level
### Finding
Gitea Actions (1.22.6) does not honour `continue-on-error: true` at the **job**
level the way GitHub Actions does. A job with `continue-on-error: true` that
fails still reports `status: failure` in the commit status API.
Only `continue-on-error: true` at the **step** level works as expected.
### Impact
If you want a job to always "pass" in the status API (so dependent jobs can
run and the overall CI does not show `failure`), you must add
`continue-on-error: true` to every step that can fail, AND ensure each step
exits with code 0 (e.g., append `|| true` to commands that might fail).
### Affected workflows
| Workflow | Fix |
|---|---|
| `harness-replays.yml` detect-changes | Added `continue-on-error: true` to fetch step + decide step; added `\|\| true` to `DIFF=$(git diff ...)` per PR #441 |
### How to diagnose
```yaml
# WRONG: job reports as failure despite flag
jobs:
  my-job:
    continue-on-error: true          # ← ignored by Gitea
    steps:
      - run: git diff ...            # ← if this fails, job = failure
                                     # job-level flag does not help
# RIGHT: step-level flag prevents step from failing
jobs:
  my-job:
    steps:
      - run: git diff ... || true    # ← step exits 0
        continue-on-error: true      # ← belt and suspenders
```
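The `|| true` pattern is easy to get wrong when the command's output is also captured into a variable. A minimal sketch of a wrapper that keeps the output and always exits 0; `capture_tolerant` is a hypothetical helper name, not something taken from PR #441.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command, print whatever it wrote to stdout, and
# always return 0 so the enclosing step cannot fail because of this command.
# stderr is left untouched so the real error still shows up in the job log.
capture_tolerant() {
  local out
  # Declaring `out` separately keeps `|| true` attached to the command's own
  # exit status rather than to the `local` builtin.
  out="$("$@")" || true
  printf '%s\n' "$out"
}

# Example:
#   DIFF="$(capture_tolerant git diff --name-only FETCH_HEAD HEAD)"
```

Because the failure is masked, downstream logic should treat an empty result as "unknown" rather than "nothing changed".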
### References
- Quirk #10 (this document): Gitea does NOT auto-populate `secrets.GITHUB_TOKEN`
- PR #441: fix applied to `harness-replays.yml`
---
## Quirk #3 — `workflow_dispatch.inputs` not supported
The Gitea 1.22.6 workflow parser rejects `workflow_dispatch.inputs`. Drop the
`inputs:` block from all workflow YAML files ported from GitHub Actions. Manual
triggers should use `workflow_dispatch` without `inputs:`.
**Reference**: `feedback_gitea_workflow_dispatch_inputs_unsupported`

---
## Quirk #4 — `merge_group` not supported
Gitea has no native merge queue concept. Drop `merge_group:` triggers from
all workflow YAML files.
For `molecule-core`, use the external serialized queue documented in
`runbooks/gitea-merge-queue.md`. Gitea's `pull_auto_merge` table is
auto-merge-on-green, not a queue that retests each PR against latest `main`.

---
## Quirk #5 — `environment:` blocks not supported
Gitea has no environments concept. Drop `environment:` from all workflow YAML
files. Secrets and variables are repo-level.

---
## Quirk #6 — Gitea combined status reports `failure` when all contexts are `null`
### Finding
When ALL individual status contexts for a commit have `state: null` (no runner
has reported yet), Gitea reports the combined commit status as `failure`. This
is a Gitea Actions bug — it conflates "no status reported yet" with "failed".
### Impact
- The `main-red-watchdog` workflow opens a `[main-red]` issue for every
scheduled workflow run where the combined state is `failure` — even when
the failure is entirely due to Gitea's combined-status bug.
- This causes spurious `[main-red]` issues that waste SRE time investigating
non-existent failures.
- **This is especially confusing for `schedule:`-only workflows** (canary,
sweep jobs, synth-E2E): Gitea attributes their scheduled runs to `main`'s
HEAD commit, so if a scheduled run fires while all contexts are still
`state: null`, the watchdog opens a `[main-red]` issue on the latest main
commit even though that commit itself is perfectly fine.
### How to diagnose
Always check the **individual context `state` fields**, not the combined
`state`/`combined_state`. In the `/repos/{org}/{repo}/commits/{sha}/statuses`
API response, look for `"state": null` on every entry — if all are null, the
combined `failure` is Gitea's bug, not a real CI failure.
```json
{
  "combined_state": "failure",                   // ← Gitea bug when all are null
  "contexts": [
    { "context": "CI / Lint", "state": null },   // still running
    { "context": "CI / Test", "state": null }    // still running
  ]
}
```
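The per-context check can be scripted so a watchdog does not have to trust the combined state. The sketch below assumes the individual states have already been extracted from the API response and are passed as arguments; the function name `classify_contexts` and its output labels are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a commit from its per-context states.
# Arguments: one state string per context, e.g. success, failure, pending, null.
# Prints one of: real-failure, not-reported-yet, no-failure, no-contexts.
classify_contexts() {
  local state saw_null=0 saw_other=0
  if [ "$#" -eq 0 ]; then
    echo "no-contexts"
    return 0
  fi
  for state in "$@"; do
    case "$state" in
      failure|error) echo "real-failure"; return 0 ;;  # a context really failed
      null|"")       saw_null=1 ;;                     # nothing reported yet
      *)             saw_other=1 ;;                    # success, pending, ...
    esac
  done
  if [ "$saw_null" -eq 1 ] && [ "$saw_other" -eq 0 ]; then
    echo "not-reported-yet"   # a combined `failure` here is the quirk
  else
    echo "no-failure"
  fi
}

# Example, assuming the illustrative response shape shown above is saved
# (without the comments) as response.json:
#   classify_contexts $(jq -r '.contexts[].state' response.json)
```

Only `real-failure` would justify opening a `[main-red]` issue; `not-reported-yet` means the check should be retried later.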
### Affected workflows
All workflows, but especially `schedule:`-only workflows that run on `main`.
The main-red-watchdog (`.gitea/workflows/main-red-watchdog.yml`) is the
primary consumer of combined status and is affected.
### References
- Issue #481: first real-world case of this bug (2026-05-11)
- `feedback_no_such_thing_as_flakes`: watchdog directive
---
## Quirk #7 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
*[Which workflows or operations are affected.]*
### Workaround
*[How to work around this quirk.]*
### References
- internal#[N]: first observation
---
## Quirk #8 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
*[Which workflows or operations are affected.]*
### Workaround
*[How to work around this quirk.]*
### References
- internal#[N]: first observation
---
## Quirk #9 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
*[Which workflows or operations are affected.]*
### Workaround
*[How to work around this quirk.]*
### References
- internal#[N]: first observation
---
## Quirk #10 — Gitea does NOT auto-populate `secrets.GITHUB_TOKEN`
### Finding
Gitea Actions (1.22.6) does **not** auto-populate `secrets.GITHUB_TOKEN`
the way GitHub Actions does. A workflow that references `secrets.GITHUB_TOKEN`
without explicitly provisioning a named secret gets an empty string — not a
read-only token scoped to the repo.
### Impact
Workflows that call the Gitea REST API using `secrets.GITHUB_TOKEN` as auth
receive **HTTP 401** on every API call. Affected workflows in molecule-core:

| Workflow | Symptom | Workaround |
|---|---|---|
| `gate-check-v3.yml` | Reports BLOCKED on every PR | Provision `SOP_TIER_CHECK_TOKEN`; update workflow to use it |
| `qa-review.yml` | Fails immediately on PR open | Same — needs named secret |
| `security-review.yml` | Fails immediately on PR open | Same — needs named secret |
### How to diagnose
Add a debug step to the failing workflow:
```yaml
- name: Diagnose token
  run: |
    echo "Token present: ${{ secrets.GITHUB_TOKEN != '' }}"
    curl -sS --fail -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
      "$GITHUB_SERVER_URL/api/v1/user" | jq -r '.login'
    # Expected (GitHub): prints your username.
    # Actual (Gitea): HTTP 401 or empty string.
```
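Once a named secret is provisioned, a guard at the top of the step keeps an empty token from surfacing later as an opaque HTTP 401. This is a minimal sketch: `require_token` is a hypothetical helper, and it assumes `SOP_TIER_CHECK_TOKEN` is exposed to the step as an environment variable.

```bash
#!/usr/bin/env bash
# Hypothetical guard: fail fast with a readable message when a token is empty.
#   $1 = name of the secret (used only in the message)
#   $2 = value of the secret
# Returns 0 when the value is non-empty, 1 otherwise. Never prints the value.
require_token() {
  local name="$1" value="$2"
  if [ -z "$value" ]; then
    echo "ERROR: $name is empty. Gitea does not auto-populate tokens;" \
         "provision it as a named secret." >&2
    return 1
  fi
}

# Example, at the top of a step that calls the Gitea API:
#   require_token SOP_TIER_CHECK_TOKEN "${SOP_TIER_CHECK_TOKEN:-}" || exit 1
```

Checking for emptiness only proves the secret is wired into the step; it does not prove the token is valid or has the right scope.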
### References
- internal#325: root-cause analysis and token provisioning
- `feedback_gitea_no_auto_supplied_github_token`
---
## Quirk #11 — PR-create event dispatcher races — only 1 of N workflows fires on `pull_request opened`
### Finding
When a PR is created via the Gitea web UI or API, the Gitea Actions event
dispatcher may fire **only 1 of N eligible workflows** on the initial
`pull_request opened` event. All other eligible workflows are silently dropped.
This was observed on molecule-core PR #558 (created 2026-05-11T19:54:10Z):
12+ workflows had no `paths:` filter and should have fired, but only
`sop-tier-check.yml` dispatched.
Concurrent PRs created within the same minute received 12–30 dispatches each,
confirming this is specific to the PR-create event dispatch, not a general
runner capacity issue.
### Impact
- PRs may not run the full CI suite on first open.
- `gate-check-v3`, `secret-scan`, `qa-review`, and `security-review` can be
silently absent from the PR's status checks.
- Branch protection may block merge even though CI is effectively green.
### How to diagnose
```bash
# List workflow runs for the PR:
gh run list --event pull_request --repo molecule-ai/molecule-core \
  | grep "$(gh pr view "$PR" --json number --jq '.number')"
# Expected: 12+ runs on PR open.
# Actual (when race fires): only 1 run.
```
### Workaround
Force a second dispatch by pushing a no-op synchronize commit:
```bash
git commit --allow-empty -m "chore: trigger workflows [skip ci]"
git push
```
The synchronize event fires a second `pull_request` event, which reliably
triggers all eligible workflows.
### References
- internal#329: first observation on PR #558
- `feedback_gitea_pr_create_dispatcher_race`
---
## When you find a new quirk
Copy the template below, increment the quirk number, and fill in the finding,
impact, workaround, and references. Place the new section in the **correct
numerical position** (before the next higher-numbered quirk). Update this
section's final paragraph to remove the next slot's number.
### Template
```markdown
## Quirk #N — <short title>
### Finding
<What Gitea Actions does differently from GitHub Actions.>
### Impact
<Which workflows or operations are affected. Include an affected workflows
table if more than one is affected.>
### How to diagnose
<Shell commands or API calls that confirm this is the quirk, not a real failure.>
### Workaround
<How to work around this quirk in workflow YAML or operations.>
### References
- internal#[N]: first observation
- <Any Gitea issue, feedback label, or upstream bug tracker reference>
```
---
## Open questions for Gitea 1.23
- [ ] **act_runner concurrent-job cap**: issue #305 — runner saturation under
merge burst; needs `max_concurrent_jobs` cap configured on act_runner
- [ ] **Infisical→Gitea secret-sync**: issue #307 — eliminate manual secret
PUTs by wiring an Infisical cron to the Gitea API
- [ ] **PR-create dispatcher race resolution**: internal #329 — is there a
Gitea fix or config knob to disable the race? File upstream bug if not
- [ ] **GITHUB_TOKEN auto-population**: internal #325 — is this on the
Gitea 1.23 roadmap? If not, the workaround (named secret) is the permanent
answer