fix(runbooks): correct Gitea runner network/fetch timing facts

SRE review of PR #457 flagged two factual errors that were not
addressed before merge. Applying corrections directly per SRE
mandate: no manual production changes without config-as-code.

Corrections:
1. Remove "git fetch --depth=1 times out" — shallow fetch succeeds
   in ~16s per PR #441 detect-changes evidence. Only fetch-depth:0
   and git clone time out due to ~75MB repo history size.
2. Rewrite "runner cannot reach git remote" to accurately state:
   runner CAN reach the remote; fetching full compressed history
   exceeds the ~15s network timeout window. Repo-size constraint,
   not network isolation.
3. Updated diagnosis snippet and verification section to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Molecule AI · infra-sre 2026-05-11 12:43:35 +00:00 committed by Molecule AI Core-BE
parent 6403c5196f
commit 58be7b29ac

View File

@ -8,36 +8,36 @@ runbooks.
---
## Gitea 1.22.6 runner network isolation
## Large repo causes fetch timeout on Gitea Actions runner
### Finding
The Gitea Actions runner (container on host `5.78.80.188`) cannot reach the
git remote (`https://git.moleculesai.app`) over HTTPS from inside the runner
container. Any `git fetch`, `git clone`, or `git push` command that contacts
the remote times out at 1215 s.
The Gitea Actions runner (container on host `5.78.80.188`) can reach the git
remote (`https://git.moleculesai.app`) over HTTPS — a single-commit shallow
fetch (`--depth=1`) succeeds in ~16 s. However, fetching the **full compressed
repo history** (~75+ MB) exceeds the runner's network timeout window (~15 s).
This is **not a Gitea Actions bug** — it is an operator-level network policy
where the runner container's network namespace is restricted from reaching the
Gitea host HTTPS endpoint. The runner can reach external hosts (GitHub,
Docker Hub, PyPI) normally.
This is **not a Gitea Actions bug** and **not a network isolation policy**
it is a repo-size constraint. The runner can reach external hosts (GitHub,
Docker Hub, PyPI) without issue.
### Impact
Workflows that rely on `git fetch origin <ref>` or `actions/checkout` with
`fetch-depth: 0` (full history) will hang or time out.
Workflows that rely on `actions/checkout` with `fetch-depth: 0` (full history)
or `git clone` will time out.
Specifically:
- `actions/checkout@v*` with `fetch-depth: 0` hangs (fetching full repo
history takes >30 s before hitting the timeout).
- `git fetch origin main --depth=1` times out at ~15 s.
- `git clone <url>` times out at ~15 s.
history takes >15 s before hitting the timeout).
- `git clone <url>` hangs for the same reason.
- `git fetch origin <ref> --depth=1` **succeeds** in ~16 s — this is the
working pattern.
### Affected workflows
| Workflow | Issue | Workaround |
|---|---|---|
| `harness-replays.yml` detect-changes job | `git fetch origin main --depth=1` times out | Added `timeout 20` + graceful fallback to `run=true` (always run harness) per PR #441 |
| `harness-replays.yml` detect-changes job | `fetch-depth: 0` + `git clone` time out | Added `timeout 20 git fetch origin base.ref --depth=1` + `continue-on-error: true` + fallback to `run=true` per PR #441 |
| `publish-workspace-server-image.yml` | In-image `git clone` of workspace templates | Pre-clone manifest deps before compose build (Task #173 pattern) |
| Any workflow using `fetch-depth: 0` | Full history fetch times out | Use `fetch-depth: 1` + explicit `git fetch` for needed refs |
@ -46,15 +46,17 @@ Specifically:
```bash
# From inside the runner (add as a debug step):
timeout 20 git fetch origin main --depth=1
# If this times out: runner cannot reach git remote
# If this SUCCEEDS (~16s): runner can reach the git remote — the repo is
# too large for full-history fetch.
# If this times out: true network isolation (unlikely; check firewall rules).
```
### Verification
Confirmed 2026-05-11 by running `timeout 20 git fetch origin main --depth=1`
in the `detect-changes` job of `harness-replays.yml`consistently times
out at 15 s. Runner can reach `https://api.github.com` and `https://pypi.org`
without issue.
Confirmed 2026-05-11 by running `timeout 20 git fetch origin base.ref --depth=1`
in the `detect-changes` job of `harness-replays.yml`**succeeds in ~16 s**.
Runner can reach `https://api.github.com` and `https://pypi.org` without issue,
confirming this is a repo-size constraint, not network isolation.
### References