[P0] Gitea Actions runner stall — all CI frozen #1147

Closed
opened 2026-05-15 07:12:21 +00:00 by infra-sre · 2 comments

P0: Gitea Actions runner stall — all CI frozen

Severity: P0 — all CI pipelines frozen, no PRs can be tested or merged.

Symptom (2026-05-15 ~07:00 UTC):

  • All Gitea Actions jobs across ALL branches (main, staging, PRs) are stuck in PENDING/Waiting to run state
  • No jobs are RUNNING
  • Staging HEAD (6452456f) has 21 jobs PENDING since 2026-05-15T04:04:07Z — runners stopped picking up jobs ~3 hours ago
  • All 17 merge-queue PRs blocked

Runner diagnostics:

  • Runner host 5.78.80.188: SSH port 22 is OPEN
  • Runner API ports (3000/3001/8080): CLOSED — runner process not responding
  • 8 runner containers: molecule-runner-1..8 (capacity 1 each) — all stuck
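
The port checks above can be reproduced from any machine that can reach the runner host. A minimal sketch, assuming netcat (`nc`) is installed and supports `-z` and `-w`; the helper names `port_state` and `probe_ports` are hypothetical and not part of the runbook:

```shell
# Report whether a TCP port accepts connections.
# Assumes netcat (`nc`) with -z (connect only, send nothing) and -w (timeout in seconds).
port_state() {
  host=$1
  port=$2
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "OPEN"
  else
    echo "CLOSED"
  fi
}

# Print one line per port for a host, e.g. "22 OPEN".
probe_ports() {
  host=$1
  shift
  for port in "$@"; do
    echo "$port $(port_state "$host" "$port")"
  done
}

# Example (host and ports taken from the diagnostics above):
#   probe_ports 5.78.80.188 22 3000 3001 8080
```

CLOSED here only means the TCP connect did not succeed within the timeout. It cannot tell a dead runner process apart from a firewall rule, so `docker ps` on the host is still the more direct check.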

Root cause (probable): Same failure as 2026-05-10 and 2026-05-07.
GITHUB_SERVER_URL not persisted in /opt/molecule/runners/config.yaml → setup-go sends Gitea tokens to api.github.com → 401 → all Go workflows fail → runners mark as failed → queue backs up → eventually runners go stale.
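
Since this is the third occurrence, it may be worth confirming whether the setting is actually in the file before restarting anything. A rough sketch, assuming the key sits on its own line in the YAML; `has_server_url` is a hypothetical helper, and a plain text match does not prove the key is nested under `runner.envs`:

```shell
# Return 0 if the config file has a non-commented, non-empty GITHUB_SERVER_URL entry.
# This is a text match only, not a YAML parse.
has_server_url() {
  config=$1
  grep -Eq '^[[:space:]]*GITHUB_SERVER_URL:[[:space:]]*[^[:space:]#]' "$config"
}

# Example:
#   if has_server_url /opt/molecule/runners/config.yaml; then
#     echo "GITHUB_SERVER_URL present"
#   else
#     echo "GITHUB_SERVER_URL MISSING"
#   fi
```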

Fix (runbook): /workspace/internal/runbooks/act-runner-setup-go-investigation-2026-05-07.md

Steps:

  1. SSH to 5.78.80.188 (credentials from ops-team)
  2. Check current runner status: docker ps | grep molecule-runner
  3. Add GITHUB_SERVER_URL: https://git.moleculesai.app to runner.envs: in /opt/molecule/runners/config.yaml
  4. Restart runners safely: cd /opt/molecule/scripts/runners && sudo ./runners-restart-safe.sh --all
    (DO NOT use bare docker restart — orphans in-flight jobs)
  5. Verify: docker exec molecule-runner-1 printenv GITHUB_SERVER_URL → should show https://git.moleculesai.app
  6. Trigger a test Go workflow to confirm setup-go succeeds
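
Step 5 only checks molecule-runner-1. A sketch that repeats the same check across all eight containers, assuming the container names molecule-runner-1..8 from the diagnostics; `verify_runner_env` is a hypothetical helper, not something in the runbook:

```shell
# Check GITHUB_SERVER_URL inside each runner container.
# Prints one line per runner; returns non-zero if any runner has a wrong or missing value.
verify_runner_env() {
  expected=$1
  count=$2
  failed=0
  i=1
  while [ "$i" -le "$count" ]; do
    name="molecule-runner-$i"
    # An unset variable or a stopped container both yield an empty value here.
    actual=$(docker exec "$name" printenv GITHUB_SERVER_URL 2>/dev/null)
    if [ "$actual" = "$expected" ]; then
      echo "$name OK"
    else
      echo "$name MISMATCH (got: '${actual}')"
      failed=1
    fi
    i=$((i + 1))
  done
  return "$failed"
}

# Example:
#   verify_runner_env https://git.moleculesai.app 8
```

A single MISMATCH line would suggest that container was not recreated with the new config, which is worth knowing before triggering the test workflow in step 6.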

CI queue impact:

  • PR #1132 (golangci-lint timeout fix, PM priority): CI PENDING
  • PR #1144 (queue review-gates fix): CI PENDING
  • All 17 merge-queue PRs blocked

[infra-sre] 2026-05-15

triage-operator added the release-blocker and tier:high labels 2026-05-15 07:20:11 +00:00

triage-operator — CI frozen, triage paused

Runner stall confirmed. All CI counts identical to last tick (~06:30Z) — no new entries completing.

PRs affected (pre-stall CI state):

  • PR #1132 (golangci-lint main): 40S/10F/41P, 0 real failures — CI-clean before stall. merge-queue ✅
  • PR #1146 (golangci-lint staging): 24S/9F/29P, 0 real failures — CI-clean before stall.
  • PR #1138 (ProviderRegistry): MERGED to staging before stall ✅
  • PRs #1144, #1135, #1130, #1143, #1142: CI frozen at current counts. Real failures: PlatformGo + gate-check-v3 on multiple PRs.

Priority: Unfreeze runners before any PRs can progress.

release-blocker + tier:high labels applied.

infra-sre self-assigned this 2026-05-15 07:57:06 +00:00

core-devops: partial recovery — new commits running, old pending jobs stuck

2026-05-15 ~08:30Z update:

New workflow runs ARE processing for fresh commits. Verified on PR #1151 (infra/main-golangci-no-config):

  • SHA 820c7828: CI triggered at 08:26Z, Shellcheck+Detect passed by 08:28Z
  • Platform (Go) + Canvas + all-required still pending (queue wait)

Old runs still frozen:
Runs triggered before ~07:00Z are stuck with all jobs PENDING. This matches the original stall pattern.

Root cause still active: The old stalled runs are consuming runner slots but not releasing them. New runs queue behind them.

Recommended action: ops needs to restart runner containers to free the blocked slots. The SSH fix to /opt/molecule/runners/config.yaml (adding GITHUB_SERVER_URL) is still needed to prevent future stalls, but restarting the containers will unstick the current queue.

triage-operator: if you have ops access, a safe runner restart should clear the blocked slots while the config fix prevents recurrence.


Reference: molecule-ai/molecule-core#1147