ci(workflow): all-required umbrella deadlocks the runner pool — move to ci-meta #1779

Closed
opened 2026-05-24 05:25:51 +00:00 by hongming · 1 comment
Owner

Summary

ci.yml's all-required umbrella job is configured with runs-on: ubuntu-latest, the same pool as its dependencies (Platform Go / Canvas Next.js / Shellcheck E2E). With ~10 main-pool runners and multiple PRs active concurrently, umbrella jobs occupy the slots their own sub-jobs need. Result: sub-jobs never dispatch within the umbrella's 40-min internal deadline → umbrella fails → sub-jobs cancelled retroactively.
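
The starvation mechanism described above can be illustrated with a toy model. This is only a sketch: the pool size, job counts, and durations are hypothetical, every umbrella is assumed to be dispatched before any sub-job (the worst-case ordering), and the `simulate` function is illustrative rather than anything in the repository.

```python
from collections import deque


def simulate(runners, prs, subjobs_per_pr, job_minutes, deadline, umbrella_on_main_pool):
    """Return how many PRs finish all sub-jobs before the umbrella deadline.

    Toy model: one umbrella per PR starts at t=0. When umbrellas share the
    main pool, each one pins a runner slot until its PR's sub-jobs finish.
    """
    if umbrella_on_main_pool and prs > runners:
        raise ValueError("toy model assumes every umbrella gets a slot at t=0")

    free = runners - (prs if umbrella_on_main_pool else 0)
    queue = deque(pr for pr in range(prs) for _ in range(subjobs_per_pr))
    remaining = [subjobs_per_pr] * prs
    running = []  # (finish_minute, pr) pairs
    green = 0

    for t in range(deadline + 1):
        # Retire sub-jobs that have finished by minute t.
        still_running = []
        for finish, pr in running:
            if finish > t:
                still_running.append((finish, pr))
                continue
            free += 1
            remaining[pr] -= 1
            if remaining[pr] == 0:
                green += 1
                if umbrella_on_main_pool:
                    free += 1  # the umbrella exits and releases its slot
        running = still_running

        # Dispatch queued sub-jobs onto any free runner.
        while free > 0 and queue:
            running.append((t + job_minutes, queue.popleft()))
            free -= 1

    return green


# Hypothetical numbers: 4 runners, 4 PRs, 3 sub-jobs of 10 min each, 40-min deadline.
print(simulate(4, 4, 3, 10, 40, umbrella_on_main_pool=True))   # 0
print(simulate(4, 4, 3, 10, 40, umbrella_on_main_pool=False))  # 4
```

In this model, once umbrellas hold every slot no sub-job can start, so every umbrella times out. With the umbrellas on a separate pool, the same workload completes within the deadline.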

Observed 2026-05-24

During a session merging 4 PRs (#1737/#1742/#1758/#1759), at one point:

  • 4 umbrella jobs spinning on main-pool runners (11min, 20min, 20min ages)
  • Waiting queue for ubuntu-latest jobs: 8+ molecule-core jobs aged 27-45min
  • Result: Platform Go / Canvas / Shellcheck for two of those PRs never started in their 40-min window
  • One PR (molecule-controlplane) had a job waiting 98 minutes

Killing the zombie umbrella containers immediately unblocked dispatch and 3 of 4 PRs went green within ~5min.

Fix

Move all-required to the ["ci-meta"] runner pool. The meta pool is sized for poller jobs (currently 2 runners: molecule-runner-ci-meta-1, molecule-runner-ci-meta-2). The umbrella does almost no work — it polls commit status — so the meta pool is the right home.
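
To show why a status poller needs so little from a runner, here is a minimal sketch of such a loop. It is not the actual umbrella implementation. The required context names, the `context`/`status` field names, the polling interval, and the injected `fetch` callable are all assumptions made for illustration.

```python
import time

# Hypothetical context names; the real required checks may be named differently.
REQUIRED = ("Platform Go", "Canvas Next.js", "Shellcheck E2E")


def evaluate(statuses, required=REQUIRED):
    """Reduce a list of commit statuses to 'success', 'failure', or 'pending'.

    Each status is assumed to be a dict with 'context' and 'status' keys,
    listed oldest-first so later entries overwrite earlier ones.
    """
    latest = {}
    for s in statuses:
        latest[s["context"]] = s["status"]
    states = [latest.get(name, "pending") for name in required]
    if any(st in ("failure", "error") for st in states):
        return "failure"
    if all(st == "success" for st in states):
        return "success"
    return "pending"


def poll(fetch, deadline_s=40 * 60, interval_s=30,
         clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until the verdict is final or the deadline passes."""
    start = clock()
    while True:
        verdict = evaluate(fetch())
        if verdict != "pending":
            return verdict
        if clock() - start >= deadline_s:
            return "failure"  # sub-jobs never reported in time
        sleep(interval_s)
```

Between polls the loop spends its time sleeping, which is why a small dedicated pool can host many such jobs without competing with build jobs for slots.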

Change

.gitea/workflows/ci.yml:

   all-required:
     ...
     continue-on-error: false
-    runs-on: ubuntu-latest
+    runs-on: ["ci-meta"]
     timeout-minutes: 45

Acceptance

  • all-required jobs run on ci-meta runners (verified via docker ps on operator)
  • When multiple PRs are active, Platform/Canvas/Shellcheck jobs dispatch promptly (no 30+ min waits)
  • CI completes within typical workspace-server build time (~20-30 min)

Related

Surfaced as load-bearing during CTO-bypass merge of #1737/#1742/#1758/#1759 on 2026-05-24. Two PRs (#1737, #1759) required compensating-status merge because of this deadlock.

Author
Owner

Closing: already fixed on main by commit 7da843f2 ("fix(ci): move all-required to meta runner lane"), which landed 2026-05-23 20:09 PDT, before this issue was filed. git blame on line 491 of .gitea/workflows/ci.yml confirms this.

The deadlock observed during the 2026-05-24 merge session was caused by PRs whose workflow definitions predate 7da843f2 — Gitea evaluates runs-on: from the commit-under-test's workflow file. So PRs branched off pre-7da843f2 main still ran their umbrella on the main pool. As those PRs rebase onto post-fix main, the deadlock self-resolves.
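
Because the pool is decided by the workflow file at the PR's own head commit, one way to spot PRs that still need a rebase is to inspect that file directly. The helper below is a hypothetical sketch, not existing tooling. It does a simple indentation-based scan rather than full YAML parsing, and it assumes the job is declared as a plain `all-required:` key with space indentation.

```python
import re


def runs_on_for_job(workflow_text, job="all-required"):
    """Return the raw runs-on value for `job`, or None if it is not found."""
    job_indent = None
    for line in workflow_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip(" "))
        if job_indent is None:
            # Still looking for the job's own key.
            if stripped == f"{job}:":
                job_indent = indent
            continue
        if indent <= job_indent:
            return None  # left the job block without seeing runs-on
        match = re.match(r"runs-on:\s*(.+)$", stripped)
        if match:
            return match.group(1).strip()
    return None


# Illustrative fragments modelled on the diff in this issue.
OLD = """
jobs:
  all-required:
    continue-on-error: false
    runs-on: ubuntu-latest
    timeout-minutes: 45
"""
NEW = OLD.replace("ubuntu-latest", '["ci-meta"]')

print(runs_on_for_job(OLD))  # ubuntu-latest
print(runs_on_for_job(NEW))  # ["ci-meta"]
```

In practice the text could come from something like `git show <pr-head>:.gitea/workflows/ci.yml`. A result of `ubuntu-latest` would indicate a branch that predates the fix.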

No new code needed; flagging only to document the discovery.

Reference: molecule-ai/molecule-core#1779