fix(watchdog): add HEAD-recheck + settling delay to suppress cancel-cascade false-positives #1635
Reference in New Issue
Block a user
Delete Branch "fix/main-red-watchdog-action-run-status-filter"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Adds a 90s settling window + HEAD-recheck before filing
[main-red]issues, suppressing the false-positive cluster mc#1597..1630.After initial red detection:
Both skip paths emit distinct Loki events so obs queries can track filter activity vs the genuine red signal.
Evidence
7 false-positive
[main-red]issues filed in 24h (mc#1597, #1605, #1609, #1613, #1626, #1627, #1630), all closed in triage 2026-05-21 04:55 as 6 cancel-cascade + 1 emission artifact, zero real regressions.Empirical 7-day DB sweep on 2026-05-20 (operator-host
psql) shows the existing description-string filter ('Has been cancelled') catches only 76/702 ≈ 11% ofaction_run.status=3(Cancelled) entries — 89% are written as'Failing after Ns', indistinguishable from realstatus=2(Failure) at the commit-status layer:For all 7 false-positive SHAs, the underlying
action_run.statuswas confirmed as either 3 (Cancelled) or 1 (Success — emission artifact) via the canonicalmol_action_statushelper on op-host.Why not query
action_run.statusdirectly (spec Part 1)The canonical filter is not reachable from a Gitea Actions runner. Empirically probed against Gitea 1.22.6:
/api/v1/.../actions/runs/{id}→ HTTP 404/api/v1/.../actions/jobs/{id}→ HTTP 404/api/v1/.../actions/tasks/{id}→ HTTP 404swagger.v1.jsonpaths matching*action*→ secrets + variables + runners only; no run endpointsThe SPA backend (
/{repo}/actions/runs/{id}/jobs/{idx}POST) requires a session CSRF token, unreachable from a runner. DB access (mol_action_status,docker exec molecule-postgres-1 psql ...) lives only on the operator host.The HEAD-recheck is the strongest signal a runner can produce without that endpoint. In offline replay against the 7 false-positive SHAs, HEAD had moved past each one by the time the issue filed — the new filter catches all 7.
The PR also adds
_resolve_action_run_status(target_url)as an extensibility hook returningNonetoday; when Gitea ≥1.23 (or an op-host proxy) exposes the status endpoint, only that function body changes — caller contract is stable.Test plan
test_head_recheck_skips_file_when_head_moved— SHA_A initial → SHA_B on recheck → no POSTtest_head_recheck_skips_file_when_recheck_status_recovered— failure → success on same SHA → no POSTtest_head_recheck_files_when_still_red_after_settling— over-filter regression guard (persistent red MUST file)test_head_recheck_skips_when_initial_was_only_cancel_cascade— cancel-cascade filter ordering guardtest_resolve_action_run_status_returns_none_on_no_endpoint— pins extensibility-hook contract--dry-runagainst live Gitea returns correct PENDING/no-action on currentmainHEAD (660fc20124)_stub_time_sleepfixture keeps suite fast)Sibling-watchdog sweep (per task #394 follow-up)
Swept all 74
molecule-ai/*repos for sibling watchdogs that file issues fromcommit_statusdata:molecule-core/.gitea/workflows/main-red-watchdog.yml— this PRmolecule-core/.gitea/workflows/status-reaper.yml— does NOT file issues (reaps stale rows); not affectedmolecule-core/.gitea/workflows/ci-required-drift.yml— files issues from config drift (workflow YAML vs branch-protection), not commit_status; not affectedmolecule-controlplane/.gitea/workflows/ci-required-drift.yml— same config-drift pattern as core; not affectedNo sibling watchdogs share the status-enum-blind defect. No follow-up task needed.
References
reference_chronic_red_sweep_cancelled_vs_failed_filterfeedback_gitea_status_enum_use_helper_not_raw_intreference_gitea_action_status_enum_corrected_2026_05_19feedback_dispatch_investigation_and_fix_default_dont_askfeedback_decide_routine_prod_ops_no_go_askTask: #394
Closes: mc#1597, #1605, #1609, #1613, #1626, #1627, #1630 (manually closed in triage; this PR is the structural fix that prevents the recurrence class)
Adds a 90s settling window + HEAD-recheck before filing `[main-red]` issues. After initial red detection, the watchdog now: 1. Re-fetches HEAD SHA — if main moved on (new commit landed mid-tick), skip-file and let the next cron tick re-evaluate. 2. Re-fetches combined status on the same SHA — if it recovered (transient cancel-cascade rolled forward to success on retry), skip-file. Both skip paths emit distinct Loki events (`main_red_skipped_head_drift`, `main_red_skipped_recovered`) so obs queries can track filter activity vs the genuine `main_red_detected` path. Background — 7 false-positive `[main-red]` issues filed in 24h (mc#1597, #1605, #1609, #1613, #1626, #1627, #1630), all closed in triage 2026-05-21 04:55 as 6 cancel-cascade + 1 emission artifact, zero real regressions. Empirical 7-day DB sweep on 2026-05-20 showed that of 702 `action_run.status=3` (Cancelled) entries that wrote a `commit_status.state='failure'` row, only 76 (~11%) carried description=`'Has been cancelled'` (the existing mc#1564 filter's match string). 89% used `'Failing after Ns'`, indistinguishable from real `status=2` (Failure) at the commit-status layer. The canonical filter (only file when `action_run.status=2`) is not reachable from a Gitea Actions runner — Gitea 1.22.6 exposes no REST endpoint for `action_run.status` (probed empirically: `/api/v1/.../actions/runs/{id}`, `/jobs/{id}`, `/tasks/{id}` all return HTTP 404; swagger.v1.json contains no read endpoints for action runs). The SPA backend requires a session CSRF token, and DB access (`mol_action_status`, `docker exec ... psql`) lives only on the operator host. The HEAD-recheck is the strongest signal a runner can produce without that endpoint, and it caught all 7 of the mc#1597..1630 false-positives in offline replay (HEAD had moved past each of the 7 SHAs by the time the issue filed). The PR also adds `_resolve_action_run_status(target_url)` as an extensibility hook returning None today; when Gitea >=1.23 or an op-host proxy exposes the status endpoint, the function body can be filled in without changing callers. Tests (4 new + 1 hook test): - `test_head_recheck_skips_file_when_head_moved` — SHA_A initial → SHA_B on recheck → no POST. - `test_head_recheck_skips_file_when_recheck_status_recovered` — failure → success on same SHA → no POST. - `test_head_recheck_files_when_still_red_after_settling` — over-filter regression guard: persistent red MUST file. - `test_head_recheck_skips_when_initial_was_only_cancel_cascade` — cancel-cascade filter ordering guard. - `test_resolve_action_run_status_returns_none_on_no_endpoint` — pins the extensibility-hook return contract. An autouse `_stub_time_sleep` fixture is added so the existing integration-style tests (`test_red_detected_opens_issue` et al.) don't each block 90s on the new sleep call — pre-fix suite ran ~0.1s; with the bare implementation it took >4 minutes (the stub keeps the suite fast and deterministic without per-test patching). 41 tests pass in 0.15s. `--dry-run` against live Gitea returns the correct PENDING/no-action result on the current `main` head. References: - reference_chronic_red_sweep_cancelled_vs_failed_filter - feedback_gitea_status_enum_use_helper_not_raw_int - reference_gitea_action_status_enum_corrected_2026_05_19 - feedback_dispatch_investigation_and_fix_default_dont_ask - feedback_decide_routine_prod_ops_no_go_ask Task: #394 Closes: mc#1597, #1605, #1609, #1613, #1626, #1627, #1630 (already manually closed in triage; this PR is the structural fix that prevents the recurrence class)Five-Axis review — APPROVED.
1. Correctness — HEAD-recheck logic is correct. Branch-SHA fetched twice (line 671→690), recheck uses fresh
get_combined_status(sha)on same SHA, both skip paths early-return 0. Race window when HEAD moves DURING the 90s sleep is handled benignly: if HEAD changes, skip-file and next cron tick re-evaluates against new SHA (worst case: 1-tick latency, never a false-negative on a genuinely red SHA — the new SHA gets its own evaluation). 90s settling delay well below hourly cron cadence. Over-filter guard (test_head_recheck_files_when_still_red_after_settling) locks the genuine-failure path. No finding.2. Readability —
_resolve_action_run_statusdocstring is exemplary: cites enum, probe results with HTTP codes, swap-in contract for future Gitea ≥1.23. Constants block at line 90-104 clearly explains override + test-stub seam. No finding.3. Architecture — 90s sleep on hourly cron = 2.5% runner-slot occupancy; acceptable for the 89% false-positive suppression (per 7d sweep 626/702 = 89% cancels slipping through). Extensibility hook lets future endpoint swap-in without caller churn. No finding.
4. Security — no new auth surface, no new secrets, no new exfil paths. Loki events emit SHA + context only (existing pattern). No finding.
5. Performance — +1 branch GET + +1 combined-status GET per red detection. Red detections are rare (≤1/hour). Empirical evidence in PR body cites
action_run.statusenum bucketing with absolute counts (76/702, 287/702, etc.). Closing issue numbers (mc#1597, #1605, #1609, #1613, #1626, #1627, #1630) reproducible via Gitea issue history.Minor (non-blocking): PR body does not cite per-SHA commit hashes for the 7 false-positives — only issue numbers. Future audits can derive SHAs from those issues' bodies. Recommend a follow-up only if a recurrence-class audit reopens.
Top findings: NONE blocking. APPROVED.
Security review (core-security): scope confirmed via GET /pulls/1635/files — only
.gitea/scripts/main-red-watchdog.py(+107 LoC) andtests/test_main_red_watchdog.py(+226 LoC) touched. HEAD-recheck logic reuses the existingapi()helper with the sameGITEA_TOKENAuthorization header against pre-existing read endpoints (/repos/{o}/{r}/branches/{b}and/repos/{o}/{r}/commits/{sha}/status) — no auth-scope widening, no new HTTP surfaces, no secret/token references introduced. The only new env var isWATCHDOG_RECHECK_DELAY_SECS(numeric tuning knob,int()-parsed → no injection vector, non-secret). The 90stime.sleepruns in-process on the Gitea-Actions Ubuntu runner inside a single workflow job: no external resource is held open, no requests are issued during the wait, and the cron cadence (:05hourly) makes 90s a strict subset of one tick — zero DoS surface (no CWE-400/CWE-770 concern). Tests stubtime.sleepto a no-op so CI runtime is unaffected. No CWE flags. APPROVED.