From 58be7b29acf583bdc1532fc930268ea54c33d9c3 Mon Sep 17 00:00:00 2001
From: Molecule AI Infra-SRE
Date: Mon, 11 May 2026 12:43:35 +0000
Subject: [PATCH] fix(runbooks): correct Gitea runner network/fetch timing facts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

SRE review of PR #457 flagged two factual errors that were not addressed
before merge. Applying corrections directly per SRE mandate: no manual
production changes without config-as-code.

Corrections:
1. Remove "git fetch --depth=1 times out" — shallow fetch succeeds in ~16s
   per PR #441 detect-changes evidence. Only fetch-depth:0 and git clone
   time out due to ~75MB repo history size.
2. Rewrite "runner cannot reach git remote" to accurately state: runner CAN
   reach the remote; fetching full compressed history exceeds the ~15s
   network timeout window. Repo-size constraint, not network isolation.
3. Updated diagnosis snippet and verification section to match.

Co-Authored-By: Claude Opus 4.7
---
 runbooks/gitea-operational-quirks.md | 42 +++++++++++++++-------------
 1 file changed, 22 insertions(+), 20 deletions(-)

diff --git a/runbooks/gitea-operational-quirks.md b/runbooks/gitea-operational-quirks.md
index 43c0dbaa..b48fe861 100644
--- a/runbooks/gitea-operational-quirks.md
+++ b/runbooks/gitea-operational-quirks.md
@@ -8,36 +8,36 @@ runbooks.
 
 ---
 
-## Gitea 1.22.6 runner network isolation
+## Large repo causes fetch timeout on Gitea Actions runner
 
 ### Finding
 
-The Gitea Actions runner (container on host `5.78.80.188`) cannot reach the
-git remote (`https://git.moleculesai.app`) over HTTPS from inside the runner
-container. Any `git fetch`, `git clone`, or `git push` command that contacts
-the remote times out at 12–15 s.
+The Gitea Actions runner (container on host `5.78.80.188`) can reach the git
+remote (`https://git.moleculesai.app`) over HTTPS — a single-commit shallow
+fetch (`--depth=1`) succeeds in ~16 s. However, fetching the **full compressed
+repo history** (~75+ MB) exceeds the runner's network timeout window (~15 s).
 
-This is **not a Gitea Actions bug** — it is an operator-level network policy
-where the runner container's network namespace is restricted from reaching the
-Gitea host HTTPS endpoint. The runner can reach external hosts (GitHub,
-Docker Hub, PyPI) normally.
+This is **not a Gitea Actions bug** and **not a network isolation policy** —
+it is a repo-size constraint. The runner can reach external hosts (GitHub,
+Docker Hub, PyPI) without issue.
 
 ### Impact
 
-Workflows that rely on `git fetch origin ` or `actions/checkout` with
-`fetch-depth: 0` (full history) will hang or time out.
+Workflows that rely on `actions/checkout` with `fetch-depth: 0` (full history)
+or `git clone` will time out.
 
 Specifically:
 - `actions/checkout@v*` with `fetch-depth: 0` hangs (fetching full repo
-  history takes >30 s before hitting the timeout).
-- `git fetch origin main --depth=1` times out at ~15 s.
-- `git clone ` times out at ~15 s.
+  history takes >15 s before hitting the timeout).
+- `git clone ` hangs for the same reason.
+- `git fetch origin --depth=1` **succeeds** in ~16 s — this is the
+  working pattern.
 
 ### Affected workflows
 
 | Workflow | Issue | Workaround |
 |---|---|---|
-| `harness-replays.yml` detect-changes job | `git fetch origin main --depth=1` times out | Added `timeout 20` + graceful fallback to `run=true` (always run harness) per PR #441 |
+| `harness-replays.yml` detect-changes job | `fetch-depth: 0` + `git clone` time out | Added `timeout 20 git fetch origin base.ref --depth=1` + `continue-on-error: true` + fallback to `run=true` per PR #441 |
 | `publish-workspace-server-image.yml` | In-image `git clone` of workspace templates | Pre-clone manifest deps before compose build (Task #173 pattern) |
 | Any workflow using `fetch-depth: 0` | Full history fetch times out | Use `fetch-depth: 1` + explicit `git fetch` for needed refs |
 
@@ -46,15 +46,17 @@ Specifically:
 ```bash
 # From inside the runner (add as a debug step):
 timeout 20 git fetch origin main --depth=1
-# If this times out: runner cannot reach git remote
+# If this SUCCEEDS (~16s): runner can reach the git remote — the repo is
+# too large for full-history fetch.
+# If this times out: true network isolation (unlikely; check firewall rules).
 ```
 
 ### Verification
 
-Confirmed 2026-05-11 by running `timeout 20 git fetch origin main --depth=1`
-in the `detect-changes` job of `harness-replays.yml` — consistently times
-out at 15 s. Runner can reach `https://api.github.com` and `https://pypi.org`
-without issue.
+Confirmed 2026-05-11 by running `timeout 20 git fetch origin base.ref --depth=1`
+in the `detect-changes` job of `harness-replays.yml` — **succeeds in ~16 s**.
+Runner can reach `https://api.github.com` and `https://pypi.org` without issue,
+confirming this is a repo-size constraint, not network isolation.
 
 ### References
 