runtime-prbuild-compat: mostly cancel-cascade, with 2 real drift-catches (mc#1529 §2) #1545

Open
opened 2026-05-19 00:17:00 +00:00 by core-devops · 0 comments

Sub-issue of #1529. Root-caused 2026-05-18.

Pattern

6/20 main pushes (30%) failed runtime-prbuild-compat.yml. Status-breakdown of last 15 failed runs on main:

  • 12/15 were status=3 (Cancelled) — task_id=0 or task_id set but job never finished
  • 3/15 were status=2 (true Failure)

Root cause #1 (dominant): concurrency cancel-cascade

The workflow has concurrency: cancel-in-progress: true keyed by ${{ github.event_name }}-${{ github.event.pull_request.head.sha || github.sha }}. When multiple commits land on main in quick succession (e.g. merge bursts of #1531/#1532/...), the older runs get cancelled by Gitea Actions in favor of the latest SHA. Cancelled = status=3 — sweep #1529 counted these as red. They are not actually broken.

Example: 4 main pushes within 47 seconds (commits 06b0ec8f → bcc66ecd → ebf88a46 → 1b0e947b on 2026-05-18T23:00Z): the first 3 were cancelled and only the 4th ran to completion. This is exactly the designed behavior of cancel-in-progress: true on a busy main.
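The cascade can be reproduced with a small simulation. This is only a sketch of the observed behavior, not Gitea's scheduler: it assumes every main push competes with the previous one, and the per-push offsets (0/15/30/47 s) and the 300 s job duration are hypothetical values chosen to fit the 47-second burst.

```python
from dataclasses import dataclass


@dataclass
class Run:
    sha: str
    started_at: float  # seconds after the first push (hypothetical offsets)
    duration: float    # seconds the job would need if left alone (hypothetical)
    status: str = "pending"


def simulate_cancel_in_progress(runs):
    """Mark a run 'cancelled' when a newer run starts before it would finish."""
    ordered = sorted(runs, key=lambda r: r.started_at)
    for i, run in enumerate(ordered):
        newer = ordered[i + 1] if i + 1 < len(ordered) else None
        if newer is not None and newer.started_at < run.started_at + run.duration:
            run.status = "cancelled"
        else:
            run.status = "completed"
    return ordered


# Four pushes inside 47 seconds; only the last one has no newer run to cancel it.
burst = [
    Run("06b0ec8f", 0, 300),
    Run("bcc66ecd", 15, 300),
    Run("ebf88a46", 30, 300),
    Run("1b0e947b", 47, 300),
]

for run in simulate_cancel_in_progress(burst):
    print(run.sha, run.status)
```

Under these assumptions the first three runs end up cancelled and only 1b0e947b completes, which matches the pattern seen on main.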

Root cause #2 (minority): drift gate working as designed

The 3 status=2 failures all hit error: TOP_LEVEL_MODULES drifted from workspace/*.py contents: ['a2a_tools_identity']. This is the drift gate in scripts/build_runtime_package.py firing correctly: a2a_tools_identity.py was added to workspace/ (PR #17 of the template-runtime mirror) before TOP_LEVEL_MODULES was updated to include it. Fixed at HEAD by commit 309276e3 fix(runtime-pkg): add a2a_tools_identity to TOP_LEVEL_MODULES. The gate caught the bug, which is a win, not a flake.
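For readers unfamiliar with this kind of gate, here is a minimal sketch of how such a drift check might work. It is not the actual code in scripts/build_runtime_package.py: the helper names (find_drift, check_drift), the symmetric-difference comparison, and the module names a2a_tools and runtime_config are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical declared set; the real one lives in scripts/build_runtime_package.py.
TOP_LEVEL_MODULES = {"a2a_tools", "runtime_config"}


def find_drift(workspace_dir, declared):
    """Return module names that are on disk or declared, but not both."""
    on_disk = {p.stem for p in Path(workspace_dir).glob("*.py")}
    return sorted(on_disk ^ set(declared))


def check_drift(workspace_dir, declared=TOP_LEVEL_MODULES):
    """Abort the build when the declared set and the directory disagree."""
    drifted = find_drift(workspace_dir, declared)
    if drifted:
        raise SystemExit(
            f"error: TOP_LEVEL_MODULES drifted from workspace/*.py contents: {drifted}"
        )


# Reproduce the situation from this issue in a throwaway directory:
# a new module appears on disk before the declared set is updated.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a2a_tools", "runtime_config", "a2a_tools_identity"):
        (Path(tmp) / f"{name}.py").touch()
    print(find_drift(tmp, TOP_LEVEL_MODULES))  # ['a2a_tools_identity']
```

Once the new module is added to the declared set, find_drift returns an empty list and the gate passes, which is what the fix-commit achieved.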

Class

  • (b) cancel-cascade noise being mis-counted as failure — dominant
  • (c) real drift-catch working as designed — minority, self-resolving as the matching fix-commit lands

Action

None on the workflow itself. The 30% red rate is misleading — most are cancelled, the rest are real catches.

If #1529 wants to filter cancel-cascade out of chronic-red sweeps, change the sweep's red-counter to exclude status=3 (Cancelled) — only count status=2 (Failure). That would bring this workflow's measured fail-rate to ~10% (3/30), all of which were genuine drift-gate catches.
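A sketch of what the adjusted red-counter could look like follows. The function name red_rate and its count_cancelled flag are hypothetical; only the two status codes (2 = Failure, 3 = Cancelled) come from this issue, and the sample data is the 15-run breakdown quoted under Pattern.

```python
STATUS_FAILURE = 2    # true failure, per the status codes quoted in this issue
STATUS_CANCELLED = 3  # cancelled run, per the status codes quoted in this issue


def red_rate(statuses, count_cancelled=False):
    """Fraction of runs counted as red; cancelled runs are excluded by default."""
    if not statuses:
        return 0.0
    red = {STATUS_FAILURE}
    if count_cancelled:
        red.add(STATUS_CANCELLED)
    return sum(1 for s in statuses if s in red) / len(statuses)


# The 15 runs the sweep flagged as failed: 12 cancelled, 3 true failures.
flagged = [STATUS_CANCELLED] * 12 + [STATUS_FAILURE] * 3

print(red_rate(flagged, count_cancelled=True))  # 1.0: every flagged run counts as red
print(red_rate(flagged))                        # 0.2: only the 3 true failures count
```

Keeping the old behavior behind a flag would let the sweep report both numbers side by side while the change is being evaluated.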

Boundary

Do NOT remove cancel-in-progress: true. Removing it would let stale jobs from N-2 commits keep burning runner slots while N is in flight; preventing that waste is exactly the design goal of the directive.


Reference: molecule-ai/molecule-core#1545