fix(watchdog): add HEAD-recheck + settling delay to suppress cancel-cascade false-positives #1635
Merged
core-devops merged 1 commit from fix/main-red-watchdog-action-run-status-filter into main 2026-05-21 06:08:42 +00:00
1 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ec53cac4a1 |
fix(watchdog): add HEAD-recheck + settling delay to suppress cancel-cascade false-positives
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 25s
CI / Python Lint & Test (pull_request) Successful in 13s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
E2E Chat / detect-changes (pull_request) Successful in 12s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 1m38s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4m41s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m8s
gate-check-v3 / gate-check (pull_request) Successful in 7s
qa-review / approved (pull_request) Successful in 4s
security-review / approved (pull_request) Successful in 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 3s
sop-checklist / review-refire (pull_request) Has been skipped
sop-tier-check / tier-check (pull_request) Successful in 5s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m13s
CI / Canvas (Next.js) (pull_request) Successful in 5m54s
CI / all-required (pull_request) Successful in 5m32s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m38s
E2E Chat / E2E Chat (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
Harness Replays / Harness Replays (pull_request) Successful in 2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m39s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
audit-force-merge / audit (pull_request) Successful in 6s
Adds a 90s settling window + HEAD-recheck before filing `[main-red]`
issues. After initial red detection, the watchdog now:
1. Re-fetches HEAD SHA — if main moved on (new commit landed
mid-tick), skip-file and let the next cron tick re-evaluate.
2. Re-fetches combined status on the same SHA — if it recovered
(transient cancel-cascade rolled forward to success on retry),
skip-file.
Both skip paths emit distinct Loki events
(`main_red_skipped_head_drift`, `main_red_skipped_recovered`) so obs
queries can track filter activity vs the genuine `main_red_detected`
path.
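
The two recheck steps above can be sketched roughly as follows. This is an illustration, not the merged code: `recheck_before_filing` and its callable parameters are hypothetical names, and treating any non-`failure` state as "recovered" is an assumption. Only the 90s window and the three event names come from this message.

```python
import time
from typing import Callable

SETTLE_SECONDS = 90   # settling window stated in this commit message
RED_STATE = "failure"  # assumption: the only state treated as red here


def recheck_before_filing(
    initial_sha: str,
    fetch_head_sha: Callable[[], str],
    fetch_combined_state: Callable[[str], str],
    emit_event: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the [main-red] issue should still be filed."""
    # Give in-flight retries and cancel-cascades time to settle.
    sleep(SETTLE_SECONDS)

    # 1. HEAD recheck: if main moved on, defer to the next cron tick.
    if fetch_head_sha() != initial_sha:
        emit_event("main_red_skipped_head_drift")
        return False

    # 2. Status recheck on the same SHA: skip if it is no longer red.
    if fetch_combined_state(initial_sha) != RED_STATE:
        emit_event("main_red_skipped_recovered")
        return False

    # Still red on an unchanged HEAD: this is the genuine path.
    emit_event("main_red_detected")
    return True
```

Passing `sleep` as a parameter is a design choice of the sketch only; it lets a caller or test substitute a no-op without patching `time.sleep`.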
Background — 7 false-positive `[main-red]` issues filed in 24h
(mc#1597, #1605, #1609, #1613, #1626, #1627, #1630), all closed in
triage 2026-05-21 04:55 as 6 cancel-cascade + 1 emission artifact,
zero real regressions.
Empirical 7-day DB sweep on 2026-05-20 showed that of 702
`action_run.status=3` (Cancelled) entries that wrote a
`commit_status.state='failure'` row, only 76 (~11%) carried
description=`'Has been cancelled'` (the existing mc#1564 filter's
match string). The remaining ~89% used `'Failing after Ns'`, which is
indistinguishable from a real `status=2` (Failure) at the commit-status layer.
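
To make the coverage gap concrete, here is a small sketch of a description-based filter together with the arithmetic from the sweep. `looks_cancelled` is a hypothetical stand-in; whether the real mc#1564 filter uses an exact or a substring match is not stated here, so the exact match is an assumption. The counts 702 and 76 are the ones quoted above.

```python
CANCEL_DESCRIPTION = "Has been cancelled"  # match string quoted above


def looks_cancelled(description: str) -> bool:
    """Hypothetical description filter (exact match is an assumption)."""
    return description.strip() == CANCEL_DESCRIPTION


# Counts from the 7-day sweep quoted in this message.
total_cancelled_runs = 702
matched_by_filter = 76

coverage = matched_by_filter / total_cancelled_runs
missed = 1 - coverage
print(f"{coverage:.1%} caught, {missed:.1%} missed")  # 10.8% caught, 89.2% missed

# A cancelled run that reports a failure-style description slips through.
print(looks_cancelled("Failing after 12s"))  # False
```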
The canonical filter (only file when `action_run.status=2`) is not
reachable from a Gitea Actions runner — Gitea 1.22.6 exposes no REST
endpoint for `action_run.status` (probed empirically:
`/api/v1/.../actions/runs/{id}`, `/jobs/{id}`, `/tasks/{id}` all
return HTTP 404; swagger.v1.json contains no read endpoints for
action runs). The SPA backend requires a session CSRF token, and DB
access (`mol_action_status`, `docker exec ... psql`) lives only on
the operator host. The HEAD-recheck is the strongest signal a runner
can produce without that endpoint, and it caught all 7 of the
mc#1597..1630 false-positives in offline replay (HEAD had moved past
each of the 7 SHAs by the time the issue was filed).
The PR also adds `_resolve_action_run_status(target_url)` as an
extensibility hook returning None today; when Gitea >=1.23 or an
op-host proxy exposes the status endpoint, the function body can be
filled in without changing callers.
Tests (4 new + 1 hook test):
- `test_head_recheck_skips_file_when_head_moved` —
SHA_A initial → SHA_B on recheck → no POST.
- `test_head_recheck_skips_file_when_recheck_status_recovered` —
failure → success on same SHA → no POST.
- `test_head_recheck_files_when_still_red_after_settling` —
over-filter regression guard: persistent red MUST file.
- `test_head_recheck_skips_when_initial_was_only_cancel_cascade` —
cancel-cascade filter ordering guard.
- `test_resolve_action_run_status_returns_none_on_no_endpoint` —
pins the extensibility-hook return contract.
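
The scenarios listed above can be simulated with sequenced mock responses. The sketch below is not the suite's code: `tick`, `api`, and `post_issue` are hypothetical stand-ins, the settling delay is omitted, and the test names are shortened. It uses `unittest.mock` so it runs without a test runner.

```python
from unittest.mock import Mock


def tick(api, post_issue) -> None:
    """Hypothetical one-tick watchdog flow (settling delay omitted)."""
    sha = api.head_sha()
    if api.combined_state(sha) != "failure":
        return  # main is not red
    if api.head_sha() != sha:
        return  # HEAD moved: leave it to the next tick
    if api.combined_state(sha) != "failure":
        return  # recovered on the same SHA
    post_issue(sha)


def make_api(heads, states) -> Mock:
    """Build a fake API whose calls return the given values in order."""
    api = Mock()
    api.head_sha.side_effect = heads
    api.combined_state.side_effect = states
    return api


def test_skips_when_head_moved() -> None:
    post_issue = Mock()
    tick(make_api(["SHA_A", "SHA_B"], ["failure"]), post_issue)
    post_issue.assert_not_called()


def test_skips_when_recovered() -> None:
    post_issue = Mock()
    tick(make_api(["SHA_A", "SHA_A"], ["failure", "success"]), post_issue)
    post_issue.assert_not_called()


def test_files_when_still_red() -> None:
    post_issue = Mock()
    tick(make_api(["SHA_A", "SHA_A"], ["failure", "failure"]), post_issue)
    post_issue.assert_called_once_with("SHA_A")


test_skips_when_head_moved()
test_skips_when_recovered()
test_files_when_still_red()
```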
An autouse `_stub_time_sleep` fixture is added so the existing
integration-style tests (`test_red_detected_opens_issue` et al.) don't
each block 90s on the new sleep call. The pre-fix suite ran in ~0.1s;
with the bare implementation it took >4 minutes. The stub keeps the
suite fast and deterministic without per-test patching.
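
The fixture body is not shown in this message. The sketch below illustrates the same stubbing idea using stdlib `unittest.mock` instead of a pytest fixture, so that it runs standalone; `settle` and `stubbed_sleep` are hypothetical names. In the real suite an autouse fixture would presumably apply an equivalent patch around every test.

```python
import time
from contextlib import contextmanager
from unittest import mock

SETTLE_SECONDS = 90  # settling window stated in this commit message


def settle() -> None:
    """Stand-in for the watchdog's settling delay."""
    time.sleep(SETTLE_SECONDS)


@contextmanager
def stubbed_sleep():
    """Replace time.sleep with a recording mock for the duration of the block."""
    with mock.patch.object(time, "sleep") as fake_sleep:
        yield fake_sleep


with stubbed_sleep() as fake_sleep:
    started = time.monotonic()
    settle()  # returns immediately; the mock only records the call
    elapsed = time.monotonic() - started

# The delay was requested once with the expected duration, but never waited on.
fake_sleep.assert_called_once_with(SETTLE_SECONDS)
```

Patching the module attribute works here because `settle` looks up `time.sleep` at call time; code that did `from time import sleep` would need the patch applied to its own module instead.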
41 tests pass in 0.15s. `--dry-run` against live Gitea returns the
correct PENDING/no-action result on the current `main` head.
References:
- reference_chronic_red_sweep_cancelled_vs_failed_filter
- feedback_gitea_status_enum_use_helper_not_raw_int
- reference_gitea_action_status_enum_corrected_2026_05_19
- feedback_dispatch_investigation_and_fix_default_dont_ask
- feedback_decide_routine_prod_ops_no_go_ask
Task: #394
Closes: mc#1597, #1605, #1609, #1613, #1626, #1627, #1630 (already
manually closed in triage; this PR is the structural fix that
prevents the recurrence class)
|