main-red-watchdog should exclude action_run.status=3 (Cancelled) — currently mis-flagged as failure #1564

Closed
opened 2026-05-19 03:30:18 +00:00 by core-devops · 0 comments
Member

Problem

.gitea/workflows/main-red-watchdog.yml (and the script at .gitea/scripts/main-red-watchdog.py) currently treats Gitea's per-context combined-commit-status state == "failure" as "main red". But Gitea maps both action_run.status=2 (Failure) and action_run.status=3 (Cancelled) to commit-status failure. On a busy main branch with concurrency: cancel-in-progress: true, every merge burst produces a wave of status=3 cancellations on prior in-flight runs — those bubble to failure and the watchdog files a phantom [main-red] issue.

Evidence

Last 7 days on molecule-ai/molecule-core refs/heads/main:

workflow                          | run_S=1 | run_S=2 | run_S=3 | cancel-cascade-confirmed
ci.yml                            |     80  |     47  |     55  | 55/55
handlers-postgres-integration.yml |     94  |     49  |     53  | 52/53
runtime-prbuild-compat.yml        |    120  |     24  |     51  | 51/51

"Cancel-cascade confirmed" = the cancelled run has a newer run on the same ref within 2h (i.e. an in-progress run that got cancelled by the next push).

Phantom issues already filed by the current watchdog: #1562, #1552, #1540, #1532, #1527, #1526, #1522, #1503, #1487, #1484 (selection — most contain "Has been cancelled" per-context strings). Tagged tier:high and contributing to the "main red" alarm fatigue that motivated mc#1529.

Canonical enum (Gitea 1.22.6, source-verified in models/actions/status.go)

0 = Unknown
1 = Success
2 = Failure       <-- real defect, file
3 = Cancelled     <-- concurrency artifact, ignore
4 = Skipped
5 = Waiting
6 = Running
7 = Blocked

Per the memory note reference_chronic_red_sweep_cancelled_vs_failed_filter (saved during mc#1529 triage).

Proposed fix (do NOT implement here — owner picks up)

Two options, owner picks:

A. Filter at the script.gitea/scripts/main-red-watchdog.py:is_red(): when reading each per-context entry, look up the underlying action_run/action_task status via /api/v1/repos/{owner}/{repo}/actions/tasks/{id} (or parse the target_url for the run id) and exclude entries whose terminal status is 3 (Cancelled). Keep status=2 + status=error.

B. Filter at the description-string level (cheap) — the current entry body strings already say "Has been cancelled" for status=3 vs "Failing after Ns" for status=2. Excluding entries whose description matches ^Has been cancelled$ would catch the bulk with no extra API call. Caveat: relies on Gitea description-string stability.

Either way: add a unit test that synthesises a combined-status response with one "Has been cancelled" entry and asserts is_red() == (False, []).

Out of scope

  • This issue does NOT close phantom issues retroactively — that's separate cleanup.
  • The Platform (Go) sqlmock failures on ci.yml (18× in 7d, real status=2) are a separate bug to file once root-caused (out of scope here).

Refs

  • mc#1529 (closed) — triage that surfaced this
  • reference_chronic_red_sweep_cancelled_vs_failed_filter (operator memory)
  • reference_gitea_1_22_6_action_status_enum (operator memory)
  • feedback_loop_fix_dont_just_report — owner: please implement, this is the durable fix not the symptom

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

## Problem `.gitea/workflows/main-red-watchdog.yml` (and the script at `.gitea/scripts/main-red-watchdog.py`) currently treats Gitea's per-context combined-commit-status `state == "failure"` as "main red". But Gitea maps **both** `action_run.status=2` (Failure) **and** `action_run.status=3` (Cancelled) to commit-status `failure`. On a busy main branch with `concurrency: cancel-in-progress: true`, every merge burst produces a wave of status=3 cancellations on prior in-flight runs — those bubble to `failure` and the watchdog files a phantom `[main-red]` issue. ## Evidence Last 7 days on `molecule-ai/molecule-core` `refs/heads/main`: ``` workflow | run_S=1 | run_S=2 | run_S=3 | cancel-cascade-confirmed ci.yml | 80 | 47 | 55 | 55/55 handlers-postgres-integration.yml | 94 | 49 | 53 | 52/53 runtime-prbuild-compat.yml | 120 | 24 | 51 | 51/51 ``` "Cancel-cascade confirmed" = the cancelled run has a newer run on the same ref within 2h (i.e. an in-progress run that got cancelled by the next push). Phantom issues already filed by the current watchdog: #1562, #1552, #1540, #1532, #1527, #1526, #1522, #1503, #1487, #1484 (selection — most contain `"Has been cancelled"` per-context strings). Tagged `tier:high` and contributing to the "main red" alarm fatigue that motivated mc#1529. ## Canonical enum (Gitea 1.22.6, source-verified in `models/actions/status.go`) ``` 0 = Unknown 1 = Success 2 = Failure <-- real defect, file 3 = Cancelled <-- concurrency artifact, ignore 4 = Skipped 5 = Waiting 6 = Running 7 = Blocked ``` Per the memory note `reference_chronic_red_sweep_cancelled_vs_failed_filter` (saved during mc#1529 triage). ## Proposed fix (do NOT implement here — owner picks up) Two options, owner picks: **A. Filter at the script** — `.gitea/scripts/main-red-watchdog.py:is_red()`: when reading each per-context entry, look up the underlying `action_run`/`action_task` status via `/api/v1/repos/{owner}/{repo}/actions/tasks/{id}` (or parse the `target_url` for the run id) and exclude entries whose terminal status is 3 (Cancelled). Keep status=2 + status=`error`. **B. Filter at the description-string level (cheap)** — the current entry body strings already say `"Has been cancelled"` for status=3 vs `"Failing after Ns"` for status=2. Excluding entries whose `description` matches `^Has been cancelled$` would catch the bulk with no extra API call. Caveat: relies on Gitea description-string stability. Either way: add a unit test that synthesises a combined-status response with one `"Has been cancelled"` entry and asserts `is_red() == (False, [])`. ## Out of scope - This issue does NOT close phantom issues retroactively — that's separate cleanup. - The Platform (Go) sqlmock failures on `ci.yml` (18× in 7d, real status=2) are a separate bug to file once root-caused (out of scope here). ## Refs - mc#1529 (closed) — triage that surfaced this - `reference_chronic_red_sweep_cancelled_vs_failed_filter` (operator memory) - `reference_gitea_1_22_6_action_status_enum` (operator memory) - `feedback_loop_fix_dont_just_report` — owner: please implement, this is the durable fix not the symptom Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1564