fix(ci): cut scheduler fan-out + stop all-required poll-gate squatting a slot #2094

Merged
devops-engineer merged 1 commits from fix/ci-scheduler-fanout into main 2026-06-01 16:05:25 +00:00
Member

Summary

Durable fix for the CI-scheduler-overload root cause (live RCA): the Gitea Actions run-scheduler is throughput-starved by workflow fan-out. Two changes, both branch-protection-preserving:

  1. all-required poll-gate → needs: aggregator — stops the sentinel squatting a ci-meta executor slot for up to 40 min/PR.
  2. Cut fan-out — consolidate two sub-second sibling lints into one workflow run; add a paths: filter to one heavy non-required advisory workflow.

CTO review: this changes workflow STRUCTURE and one advisory workflow's trigger surface. Verify branch protection before merge. The required-status-check contexts are deliberately unchanged (proof in the mapping table below), so no branch_protections PATCH is needed — but please confirm against the live BP before merging, since I could not read branch_protections directly (core-be persona is non-admin; I sourced the live contexts from the authoritative ci-required-drift issues #1738/#1739 + the checked-in audit-force-merge.yml REQUIRED_CHECKS mirror, which ci-required-drift keeps set-equal to BP).


Phase 1 — Inventory

Runner-lane reality (operator 5.78.80.188, live)

Lane Runners Slots Serves labels
docker-host molecule-runner-1..6 6 (cap 1 ea) ubuntu-latest, ubuntu-22.04, self-hosted, docker-host
ci-meta molecule-runner-ci-meta-1..2 2 ci-meta (+config.light also serves lint/review/molecule-light)
publish molecule-runner-publish-1..2 2 publish, release

The old all-required poller ran on ci-meta and polled for ≤40 min, squatting 1 of the 2 ci-meta slots on every PR. Two concurrent JOB-all-required containers were observed pinning the lane during the RCA.

Live branch-protection required status contexts

Sourced from ci-required-drift issues #1738 (main) / #1739 (staging) Debug blocks:

main:

  • CI / all-required (pull_request)
  • E2E API Smoke Test / E2E API Smoke Test (pull_request)
  • Handlers Postgres Integration / Handlers Postgres Integration (pull_request)

staging:

  • CI / all-required (pull_request)
  • sop-checklist / all-items-acked (pull_request)

→ Only 4 workflows/jobs emit required contexts: ci.yml (all-required), e2e-api.yml, handlers-postgres-integration.yml, sop-checklist.yml. None of them is touched in a context-affecting way by this PR, and lint-required-no-paths forbids paths filters only on these — so none get one.

Workflow trigger inventory (57 files → 56 after consolidation)

  • 19 fire on pull_request/pull_request_target with no paths: filter (the fan-out set). Of these, most are pull_request_target review/security/meta jobs that should always run (audit-force-merge, qa-review, security-review, sop-checklist, sop-tier-check, gate-check-v3, secret-scan, block-internal-paths) or are required emitters (ci, e2e-api, handlers-postgres, …).
  • 15 already carry paths: filters (the # bp-exempt meta-lints).
  • 23 are schedule/dispatch/push-only/comment (not PR fan-out).

The genuinely-safe fan-out reductions (cheap + non-required + non-security): the two RFC#523 Go-source sibling lints, and the Go-toolchain verify-providers-gen advisory.


Fix #2all-required poll-gate → needs: aggregator (ci.yml)

Was: runs-on: ci-meta, ran detect-changes.py, then a while True loop polling GET /repos/.../commits/{sha}/statuses every 15 s for up to 40 min, holding the slot the whole time. The poller existed only to dodge the Gitea needs: + if: always() bug.

Now: plain needs: [changes, platform-build, canvas-build, shellcheck, python-lint] + a sub-second inline needs.*.result check (no API, no poll, no checkout). The slot is freed immediately.

Why this is safe now (and wasn't when the poller was written): every aggregated CI job already gates its real work per-step (if: needs.changes.outputs.* != 'true'), not at the job level — so each job always reaches a terminal SUCCESS (never skipped). Plain needs: WITHOUT if: always() works on Gitea 1.22.6 / act_runner v0.6.1; only needs: + if: always() is broken (feedback_gitea_needs_works_only_ifalways_broken). This job uses plain needs: + an explicit result-check, never if: always(). A failed/errored need short-circuits the job → red propagates to CI / all-required. canvas-deploy-reminder is event-gated (if: github.ref…) so it is excluded (it skips on PRs).

Drift-safety: the needs: set is exactly ci-required-drift.py's ci_job_names() (all jobs − sentinel − event-gated). I simulated F1/F1b/F2/F3 against the modified ci.yml + live BP: F1, F1b, F2 clean; F3 main clean; F3 staging divergence is pre-existing (issue #1739, single global REQUIRED_CHECKS can't match two branch BP sets) and is not touched by this PR.


Fix #4 — Cut fan-out

(a) Consolidate two RFC#523 sibling lints (2 runs → 1).
lint-no-tenant-gitea-token.yml is folded into lint-forbidden-env-keys.yml as a second job scan-tenant-token-write. Both are sub-second Go-source greps that previously fired as two separate workflow runs + two checkouts per PR. Now one workflow, one checkout, both scans still fire unconditionally on every PR (no paths filter — RFC#523 threat model preserved). The moved job keeps its exact name: and # bp-exempt: directive (Tier 2g). The old Lint no tenant GITEA or GITHUB token write / … context is retired (a disappearing context needs no directive).

(b) paths: filter on verify-providers-gen.yml.
This advisory gate runs the Go toolchain (go run ./cmd/gen-providers -check + go generate, ~8 min) on every PR, but its verdict can only change when the codegen surface changes. Scoped to workspace-server/internal/providers/**, workspace-server/cmd/gen-providers/**, and its own file (mirrors sibling sync-providers-yaml.yml). SAFE because it is NOT a branch-protection required context (see its header §ENFORCEMENT GATING) — lint-required-no-paths only forbids paths filters on required workflows.

I deliberately did NOT add paths filters to secret-scan / block-internal-paths (security, must run every PR) or to any required emitter.


Old → new required-context mapping (proves the gate is preserved)

Required context (live BP) Before After Change
CI / all-required (pull_request) emitted by ci.yml job all-required (poller) emitted by ci.yml job all-required (needs-aggregator) name UNCHANGED; only internal mechanism
CI / all-required (push) same same UNCHANGED (consumed by prod-auto-deploy.py, gitea-merge-queue.yml)
E2E API Smoke Test / E2E API Smoke Test (pull_request) e2e-api.yml e2e-api.yml untouched
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) handlers-postgres-integration.yml same untouched
sop-checklist / all-items-acked (pull_request) sop-checklist.yml same untouched

No required context is added, removed, or renamed. No branch_protections PATCH required.

Retired (non-required) context, for completeness: Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) → its job moves into Lint forbidden tenant-env keys with the same job name:, emitting Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request). Neither is in BP.


Validation

  • YAML: repo's own lint-workflow-yaml.py56 files checked, 0 fatal Gitea-1.22.6-hostile shapes. PyYAML parses all 56. actionlint exit 0 (only pre-existing ci-meta/docker-host self-hosted-label notices + pre-existing shellcheck infos in untouched code).
  • needs: graph: every needs: references a real job; no job combines needs: + if: always() anywhere in the workflow dir.
  • Drift simulation: ci-required-drift F1/F1b/F2 clean; F3 main clean (staging divergence pre-existing, untouched).
  • Bash: both consolidated job scripts + the new gate script pass bash -n.
  • Tests: test_ci_workflow_bookkeeping.py updated to pin the new aggregator shape, the no-if:always() hazard, and the F1-lockstep invariant (watched old assertions fail → pass on new shape). Full .gitea/scripts/tests (192) + affected tests/ lints (required-no-paths, required-context-exists-in-bp, bp-context-emit-match, ci-required-drift, mask-pr-atomicity, workflow-yaml) all green.

Skip-conditions (dev-SOP Stage A/B/C)

Pure CI-workflow change — no workspace-server/ binary, migration, or runtime path. Stage A/B/C (platform-boot / tenant-probe / runtime-smoke) N/A; verification is the YAML+lint+drift-simulation+unit-test surface above. The real end-to-end proof is the next PR's CI run on this branch (this PR itself exercises the new all-required aggregator).


SOP checklist

Comprehensive testing performed: repo's own lint-workflow-yaml.py → 56 files checked, 0 fatal Gitea-1.22.6-hostile shapes; actionlint exit 0; needs: graph verified (every needs: references a real job, no needs:+if: always() anywhere); full .gitea/scripts/tests (192) green; test_ci_workflow_bookkeeping.py updated to pin the new aggregator shape (watched old assertions fail → pass on the new shape); affected tests/ lints (required-no-paths, required-context-exists-in-bp, bp-context-emit-match, ci-required-drift, mask-pr-atomicity, workflow-yaml) all green.

Local-postgres E2E run: N/A — pure CI-workflow change; no workspace-server/ binary, migration, or DB/query path. Verification surface is the YAML + lint + drift-simulation + unit-test set above; the real end-to-end proof is this PR's own CI run exercising the new all-required aggregator.

Staging-smoke verified or pending: scheduled post-merge; Stage A/B/C (platform-boot / tenant-probe / runtime-smoke) N/A for a CI-workflow-only change. The next PR's CI run on main exercises the new aggregator end-to-end.

Root-cause not symptom: the CI-scheduler-overload root cause is throughput-starvation from workflow fan-out — the all-required poll-gate squatted a ci-meta executor slot for up to 40 min/PR. Replacing the poller with a plain needs: aggregator (no if: always(), safe on Gitea 1.22.6) frees the slot immediately, addressing the cause rather than masking the lane contention.

Five-Axis review walked: correctness (required-context names unchanged — proven by the mapping table; drift sim F1/F1b/F2 clean, F3 main clean), readability (consolidated lints keep exact name: + # bp-exempt: directives), architecture (poller→needs: aggregator + paths-filter only on non-required advisory), security (no paths filter on secret-scan/block-internal-paths/any required emitter; RFC#523 token-write scans still fire unconditionally), performance (cuts ~40-min slot squat + 2 redundant workflow runs + an ~8-min Go-toolchain advisory on no-op PRs).

No backwards-compat shim / dead code added: confirmed — the poll-gate is replaced outright (no dual-path shim); the retired Lint no tenant GITEA or GITHUB token write context disappears cleanly as its job moves (same name:) into Lint forbidden tenant-env keys. No dead code, no compat shim.

Memory/saved-feedback consulted: yes — feedback_gitea_needs_works_only_ifalways_broken (plain needs: WORKS on 1.22.6/v0.6.1; only needs:+if: always() is broken — this PR uses plain needs: + an explicit result-check, never if: always()); feedback_no_silent_checklist_trim; feedback_watch_latest_main_head_not_merge_commit for post-merge verification.

🤖 Generated with Claude Code

## Summary Durable fix for the **CI-scheduler-overload** root cause (live RCA): the Gitea Actions run-scheduler is throughput-starved by workflow fan-out. Two changes, both **branch-protection-preserving**: 1. **`all-required` poll-gate → `needs:` aggregator** — stops the sentinel squatting a `ci-meta` executor slot for up to 40 min/PR. 2. **Cut fan-out** — consolidate two sub-second sibling lints into one workflow run; add a `paths:` filter to one heavy *non-required* advisory workflow. > ⚠ **CTO review:** this changes workflow STRUCTURE and one advisory workflow's trigger surface. **Verify branch protection before merge.** The required-status-check *contexts* are deliberately **unchanged** (proof in the mapping table below), so no `branch_protections` PATCH is needed — but please confirm against the live BP before merging, since I could not read `branch_protections` directly (core-be persona is non-admin; I sourced the live contexts from the authoritative `ci-required-drift` issues #1738/#1739 + the checked-in `audit-force-merge.yml` REQUIRED_CHECKS mirror, which `ci-required-drift` keeps set-equal to BP). --- ## Phase 1 — Inventory ### Runner-lane reality (operator `5.78.80.188`, live) | Lane | Runners | Slots | Serves labels | |---|---|---|---| | docker-host | `molecule-runner-1..6` | **6** (cap 1 ea) | `ubuntu-latest`, `ubuntu-22.04`, `self-hosted`, `docker-host` | | ci-meta | `molecule-runner-ci-meta-1..2` | **2** | `ci-meta` (+`config.light` also serves `lint`/`review`/`molecule-light`) | | publish | `molecule-runner-publish-1..2` | **2** | `publish`, `release` | The old `all-required` poller ran on `ci-meta` and **polled for ≤40 min**, squatting 1 of the 2 ci-meta slots on every PR. Two concurrent `JOB-all-required` containers were observed pinning the lane during the RCA. ### Live branch-protection required status contexts Sourced from `ci-required-drift` issues #1738 (main) / #1739 (staging) Debug blocks: **`main`:** - `CI / all-required (pull_request)` - `E2E API Smoke Test / E2E API Smoke Test (pull_request)` - `Handlers Postgres Integration / Handlers Postgres Integration (pull_request)` **`staging`:** - `CI / all-required (pull_request)` - `sop-checklist / all-items-acked (pull_request)` → Only **4 workflows/jobs** emit required contexts: `ci.yml` (all-required), `e2e-api.yml`, `handlers-postgres-integration.yml`, `sop-checklist.yml`. **None of them is touched in a context-affecting way by this PR**, and `lint-required-no-paths` forbids paths filters only on these — so none get one. ### Workflow trigger inventory (57 files → 56 after consolidation) - **19** fire on `pull_request`/`pull_request_target` with **no `paths:` filter** (the fan-out set). Of these, most are `pull_request_target` review/security/meta jobs that *should* always run (audit-force-merge, qa-review, security-review, sop-checklist, sop-tier-check, gate-check-v3, secret-scan, block-internal-paths) or are required emitters (ci, e2e-api, handlers-postgres, …). - **15** already carry `paths:` filters (the `# bp-exempt` meta-lints). - **23** are schedule/dispatch/push-only/comment (not PR fan-out). The genuinely-safe fan-out reductions (cheap + non-required + non-security): the two RFC#523 Go-source sibling lints, and the Go-toolchain `verify-providers-gen` advisory. --- ## Fix #2 — `all-required` poll-gate → `needs:` aggregator (`ci.yml`) **Was:** `runs-on: ci-meta`, ran `detect-changes.py`, then a `while True` loop polling `GET /repos/.../commits/{sha}/statuses` every 15 s for up to 40 min, holding the slot the whole time. The poller existed only to dodge the Gitea `needs:` + `if: always()` bug. **Now:** plain `needs: [changes, platform-build, canvas-build, shellcheck, python-lint]` + a sub-second inline `needs.*.result` check (no API, no poll, no checkout). The slot is freed immediately. **Why this is safe now (and wasn't when the poller was written):** every aggregated CI job already gates its real work **per-step** (`if: needs.changes.outputs.* != 'true'`), not at the job level — so each job always reaches a terminal **SUCCESS** (never `skipped`). Plain `needs:` **WITHOUT** `if: always()` works on Gitea 1.22.6 / act_runner v0.6.1; only `needs:` + `if: always()` is broken (`feedback_gitea_needs_works_only_ifalways_broken`). This job uses plain `needs:` + an explicit result-check, **never `if: always()`**. A failed/errored need short-circuits the job → red propagates to `CI / all-required`. `canvas-deploy-reminder` is event-gated (`if: github.ref…`) so it is **excluded** (it skips on PRs). **Drift-safety:** the `needs:` set is exactly `ci-required-drift.py`'s `ci_job_names()` (all jobs − sentinel − event-gated). I simulated F1/F1b/F2/F3 against the modified `ci.yml` + live BP: **F1, F1b, F2 clean; F3 main clean**; F3 staging divergence is **pre-existing** (issue #1739, single global REQUIRED_CHECKS can't match two branch BP sets) and is **not touched** by this PR. --- ## Fix #4 — Cut fan-out **(a) Consolidate two RFC#523 sibling lints (2 runs → 1).** `lint-no-tenant-gitea-token.yml` is folded into `lint-forbidden-env-keys.yml` as a second job `scan-tenant-token-write`. Both are sub-second Go-source greps that previously fired as two separate workflow runs + two checkouts per PR. Now one workflow, one checkout, **both scans still fire unconditionally on every PR** (no paths filter — RFC#523 threat model preserved). The moved job keeps its exact `name:` and `# bp-exempt:` directive (Tier 2g). The old `Lint no tenant GITEA or GITHUB token write / …` context is retired (a disappearing context needs no directive). **(b) `paths:` filter on `verify-providers-gen.yml`.** This advisory gate runs the Go toolchain (`go run ./cmd/gen-providers -check` + `go generate`, ~8 min) on **every** PR, but its verdict can only change when the codegen surface changes. Scoped to `workspace-server/internal/providers/**`, `workspace-server/cmd/gen-providers/**`, and its own file (mirrors sibling `sync-providers-yaml.yml`). **SAFE** because it is **NOT** a branch-protection required context (see its header §ENFORCEMENT GATING) — `lint-required-no-paths` only forbids paths filters on *required* workflows. I deliberately did **NOT** add paths filters to `secret-scan` / `block-internal-paths` (security, must run every PR) or to any required emitter. --- ## Old → new required-context mapping (proves the gate is preserved) | Required context (live BP) | Before | After | Change | |---|---|---|---| | `CI / all-required (pull_request)` | emitted by `ci.yml` job `all-required` (poller) | emitted by `ci.yml` job `all-required` (needs-aggregator) | **name UNCHANGED**; only internal mechanism | | `CI / all-required (push)` | same | same | UNCHANGED (consumed by `prod-auto-deploy.py`, `gitea-merge-queue.yml`) | | `E2E API Smoke Test / E2E API Smoke Test (pull_request)` | `e2e-api.yml` | `e2e-api.yml` | **untouched** | | `Handlers Postgres Integration / Handlers Postgres Integration (pull_request)` | `handlers-postgres-integration.yml` | same | **untouched** | | `sop-checklist / all-items-acked (pull_request)` | `sop-checklist.yml` | same | **untouched** | **No required context is added, removed, or renamed. No `branch_protections` PATCH required.** Retired (non-required) context, for completeness: `Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request)` → its job moves into `Lint forbidden tenant-env keys` with the same job `name:`, emitting `Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request)`. Neither is in BP. --- ## Validation - **YAML:** repo's own `lint-workflow-yaml.py` → *56 files checked, 0 fatal Gitea-1.22.6-hostile shapes*. PyYAML parses all 56. `actionlint` exit 0 (only pre-existing `ci-meta`/`docker-host` self-hosted-label notices + pre-existing shellcheck infos in untouched code). - **`needs:` graph:** every `needs:` references a real job; **no job combines `needs:` + `if: always()`** anywhere in the workflow dir. - **Drift simulation:** `ci-required-drift` F1/F1b/F2 clean; F3 main clean (staging divergence pre-existing, untouched). - **Bash:** both consolidated job scripts + the new gate script pass `bash -n`. - **Tests:** `test_ci_workflow_bookkeeping.py` updated to pin the new aggregator shape, the no-`if:always()` hazard, and the F1-lockstep invariant (watched old assertions fail → pass on new shape). Full `.gitea/scripts/tests` (192) + affected `tests/` lints (`required-no-paths`, `required-context-exists-in-bp`, `bp-context-emit-match`, `ci-required-drift`, `mask-pr-atomicity`, `workflow-yaml`) all green. ## Skip-conditions (dev-SOP Stage A/B/C) Pure CI-workflow change — no `workspace-server/` binary, migration, or runtime path. Stage A/B/C (platform-boot / tenant-probe / runtime-smoke) **N/A**; verification is the YAML+lint+drift-simulation+unit-test surface above. The real end-to-end proof is the next PR's CI run on this branch (this PR itself exercises the new `all-required` aggregator). --- ## SOP checklist **Comprehensive testing performed:** repo's own `lint-workflow-yaml.py` → 56 files checked, 0 fatal Gitea-1.22.6-hostile shapes; `actionlint` exit 0; `needs:` graph verified (every `needs:` references a real job, no `needs:`+`if: always()` anywhere); full `.gitea/scripts/tests` (192) green; `test_ci_workflow_bookkeeping.py` updated to pin the new aggregator shape (watched old assertions fail → pass on the new shape); affected `tests/` lints (`required-no-paths`, `required-context-exists-in-bp`, `bp-context-emit-match`, `ci-required-drift`, `mask-pr-atomicity`, `workflow-yaml`) all green. **Local-postgres E2E run:** N/A — pure CI-workflow change; no `workspace-server/` binary, migration, or DB/query path. Verification surface is the YAML + lint + drift-simulation + unit-test set above; the real end-to-end proof is this PR's own CI run exercising the new `all-required` aggregator. **Staging-smoke verified or pending:** scheduled post-merge; Stage A/B/C (platform-boot / tenant-probe / runtime-smoke) N/A for a CI-workflow-only change. The next PR's CI run on `main` exercises the new aggregator end-to-end. **Root-cause not symptom:** the CI-scheduler-overload root cause is throughput-starvation from workflow fan-out — the `all-required` poll-gate squatted a `ci-meta` executor slot for up to 40 min/PR. Replacing the poller with a plain `needs:` aggregator (no `if: always()`, safe on Gitea 1.22.6) frees the slot immediately, addressing the cause rather than masking the lane contention. **Five-Axis review walked:** correctness (required-context names unchanged — proven by the mapping table; drift sim F1/F1b/F2 clean, F3 main clean), readability (consolidated lints keep exact `name:` + `# bp-exempt:` directives), architecture (poller→`needs:` aggregator + paths-filter only on non-required advisory), security (no paths filter on `secret-scan`/`block-internal-paths`/any required emitter; RFC#523 token-write scans still fire unconditionally), performance (cuts ~40-min slot squat + 2 redundant workflow runs + an ~8-min Go-toolchain advisory on no-op PRs). **No backwards-compat shim / dead code added:** confirmed — the poll-gate is replaced outright (no dual-path shim); the retired `Lint no tenant GITEA or GITHUB token write` context disappears cleanly as its job moves (same `name:`) into `Lint forbidden tenant-env keys`. No dead code, no compat shim. **Memory/saved-feedback consulted:** yes — `feedback_gitea_needs_works_only_ifalways_broken` (plain `needs:` WORKS on 1.22.6/v0.6.1; only `needs:`+`if: always()` is broken — this PR uses plain `needs:` + an explicit result-check, never `if: always()`); `feedback_no_silent_checklist_trim`; `feedback_watch_latest_main_head_not_merge_commit` for post-merge verification. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- sop re-trigger 1780328822 -->
core-be added 1 commit 2026-06-01 07:36:06 +00:00
fix(ci): cut scheduler fan-out + stop all-required poll-gate squatting a slot
sop-tier-check / tier-check (pull_request_review) Successful in 9s
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 13s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
E2E Chat / detect-changes (pull_request) Failing after 1s
E2E Chat / E2E Chat (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 7s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 3s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 7s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 22s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 54s
gate-check-v3 / gate-check (pull_request_target) Successful in 3s
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Failing after 56s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 57s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 57s
sop-checklist / review-refire (pull_request_target) Has been skipped
qa-review / approved (pull_request_target) Successful in 5s
security-review / approved (pull_request_target) Successful in 5s
sop-checklist / all-items-acked (pull_request) acked: 7/7
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 6s
sop-tier-check / tier-check (pull_request_target) Successful in 6s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Platform (Go) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m15s
CI / all-required (pull_request) Successful in 13s
verify-providers-gen / Regenerate providers artifact and fail on drift (pull_request) Successful in 49s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m32s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m29s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / detect-changes (pull_request) Successful in 24s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
audit-force-merge / audit (pull_request_target) Successful in 7s
6a1189ee9d
Root cause (live RCA): the Gitea Actions run-scheduler is throughput-
starved by workflow fan-out. A single PR-head commit triggers ~65 runs;
the `all-required` sentinel was a status-POLLING loop that held a
`ci-meta` executor slot (only 2 in the lane) for up to 40 min per PR;
and several cheap meta-lints fired as separate runs on every commit.

Two fixes, both branch-protection-preserving:

1. all-required: poll-gate → plain `needs:` aggregator (ci.yml).
   Was: detect-changes + a 40-min `GET /commits/{sha}/statuses` poll
   loop on the ci-meta lane (confirmed slot-squat in the RCA — two
   concurrent JOB-all-required containers pinning the 2-slot lane).
   Now: `needs: [changes, platform-build, canvas-build, shellcheck,
   python-lint]` + a sub-second inline result-check (no API, no poll,
   no checkout). Frees the slot immediately.
   Safe because every aggregated job now gates real work PER-STEP
   (`if: needs.changes.outputs.* != 'true'`), so it always reaches a
   terminal SUCCESS and is never `skipped`. Plain `needs:` (WITHOUT
   `if: always()`) works on Gitea 1.22.6 / act_runner v0.6.1 — only
   `needs:` + `if: always()` is broken
   (feedback_gitea_needs_works_only_ifalways_broken). canvas-deploy-
   reminder is event-gated (`if: github.ref...`) so it is intentionally
   excluded. The needs: set equals ci-required-drift.py's ci_job_names()
   so F1 stays clean (verified + now unit-pinned).
   The required context name `CI / all-required (<event>)` is UNCHANGED.

2. Cut fan-out:
   - Consolidated lint-no-tenant-gitea-token.yml INTO
     lint-forbidden-env-keys.yml as a second job (scan-tenant-token-
     write). Two sub-second Go-source greps that fired as two separate
     workflow runs per PR → one run, one checkout. Both still fire on
     every PR (no paths filter; RFC#523 threat model preserved). The
     moved job keeps its exact `name:` + `# bp-exempt:` directive
     (Tier 2g); the old `Lint no tenant GITEA…` context is retired.
   - Added a `paths:` filter to verify-providers-gen.yml (Go toolchain,
     ~8min) scoped to the codegen surface. SAFE: it is NOT a branch-
     protection required context, so lint-required-no-paths permits it.

Branch-protection required contexts are unchanged (CI / all-required,
E2E API Smoke Test, Handlers Postgres Integration, sop-checklist /
all-items-acked). No paths filter was added to any required emitter.

Tests: updated test_ci_workflow_bookkeeping.py to pin the new needs:
aggregator shape + the no-if:always() hazard + the F1-lockstep
invariant (watched the old assertions fail, then pass on the new shape).
Full .gitea/scripts/tests suite (192) + affected tests/ lints green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
core-be added the tier:high label 2026-06-01 07:36:31 +00:00
core-be force-pushed fix/ci-scheduler-fanout from 0b143610ba to 6a1189ee9d 2026-06-01 14:35:00 +00:00 Compare
Member

/sop-ack comprehensive-testing — verified against the comprehensive review; ceremony ack.

/sop-ack comprehensive-testing — verified against the comprehensive review; ceremony ack.
Member

/sop-ack local-postgres-e2e — verified against the comprehensive review; ceremony ack.

/sop-ack local-postgres-e2e — verified against the comprehensive review; ceremony ack.
Member

/sop-ack staging-smoke — verified against the comprehensive review; ceremony ack.

/sop-ack staging-smoke — verified against the comprehensive review; ceremony ack.
Member

/sop-ack five-axis-review — verified against the comprehensive review; ceremony ack.

/sop-ack five-axis-review — verified against the comprehensive review; ceremony ack.
Member

/sop-ack memory-consulted — verified against the comprehensive review; ceremony ack.

/sop-ack memory-consulted — verified against the comprehensive review; ceremony ack.
Member

/sop-ack root-cause — ceo-team ack (agent-pm); tier:high item requires a non-author ceo ack per RFC#450. Verified against the comprehensive approved review.

/sop-ack root-cause — ceo-team ack (agent-pm); tier:high item requires a non-author ceo ack per RFC#450. Verified against the comprehensive approved review.
Member

/sop-ack no-backwards-compat — ceo-team ack (agent-pm); tier:high item requires a non-author ceo ack per RFC#450. Verified against the comprehensive approved review.

/sop-ack no-backwards-compat — ceo-team ack (agent-pm); tier:high item requires a non-author ceo ack per RFC#450. Verified against the comprehensive approved review.
core-qa approved these changes 2026-06-01 15:50:21 +00:00
core-qa left a comment
Member

QA review APPROVED (core-qa, qa-team). Verified the CI-workflow change preserves all required-context names per the mapping table; YAML/lint/drift-sim/unit-test surface is comprehensive. Satisfies qa-review gate.

QA review APPROVED (core-qa, qa-team). Verified the CI-workflow change preserves all required-context names per the mapping table; YAML/lint/drift-sim/unit-test surface is comprehensive. Satisfies qa-review gate.
core-security approved these changes 2026-06-01 15:50:26 +00:00
core-security left a comment
Member

Security review APPROVED (core-security, security-team). No paths filter on secret-scan/block-internal-paths/required emitters; RFC#523 token-write scans still fire unconditionally; trust boundary (base-ref checkout, no PR-head exec) preserved. Satisfies security-review gate.

Security review APPROVED (core-security, security-team). No paths filter on secret-scan/block-internal-paths/required emitters; RFC#523 token-write scans still fire unconditionally; trust boundary (base-ref checkout, no PR-head exec) preserved. Satisfies security-review gate.
hongming-ceo-delegated approved these changes 2026-06-01 15:51:11 +00:00
hongming-ceo-delegated left a comment
Member

APPROVED (CEO-delegated, ceo-team). Comprehensive review complete: CI-scheduler fan-out fix preserves all branch-protection required-context names (mapping table verified against live BP), plain needs: aggregator is Gitea-1.22.6-safe (no if: always()), drift sim clean. tier:high ceo sign-off.

APPROVED (CEO-delegated, ceo-team). Comprehensive review complete: CI-scheduler fan-out fix preserves all branch-protection required-context names (mapping table verified against live BP), plain needs: aggregator is Gitea-1.22.6-safe (no if: always()), drift sim clean. tier:high ceo sign-off.
devops-engineer approved these changes 2026-06-01 15:51:14 +00:00
devops-engineer left a comment
Member

APPROVED (devops-engineer). Verified the 3 BP-required contexts (CI / all-required, E2E API Smoke Test, Handlers Postgres Integration) are green on HEAD 6a1189ee; required-context set unchanged; merge-safe.

APPROVED (devops-engineer). Verified the 3 BP-required contexts (CI / all-required, E2E API Smoke Test, Handlers Postgres Integration) are green on HEAD 6a1189ee; required-context set unchanged; merge-safe.
devops-engineer closed this pull request 2026-06-01 15:53:09 +00:00
devops-engineer reopened this pull request 2026-06-01 15:53:13 +00:00
devops-engineer merged commit ee4d0d4ccb into main 2026-06-01 16:05:25 +00:00
Sign in to join this conversation.
6 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2094