fix(watchdog): add HEAD-recheck + settling delay to suppress cancel-cascade false-positives #1635

Merged
core-devops merged 1 commits from fix/main-red-watchdog-action-run-status-filter into main 2026-05-21 06:08:42 +00:00
Member

Summary

Adds a 90s settling window + HEAD-recheck before filing [main-red] issues, suppressing the false-positive cluster mc#1597..1630.

After initial red detection:

  1. Re-fetch HEAD SHA — if main moved on (new commit landed mid-tick), skip-file.
  2. Re-fetch combined status on the same SHA — if it recovered (transient cancel-cascade rolled forward to success on retry), skip-file.

Both skip paths emit distinct Loki events so obs queries can track filter activity vs the genuine red signal.

Evidence

7 false-positive [main-red] issues filed in 24h (mc#1597, #1605, #1609, #1613, #1626, #1627, #1630), all closed in triage 2026-05-21 04:55 as 6 cancel-cascade + 1 emission artifact, zero real regressions.

Empirical 7-day DB sweep on 2026-05-20 (operator-host psql) shows the existing description-string filter ('Has been cancelled') catches only 76/702 ≈ 11% of action_run.status=3 (Cancelled) entries — 89% are written as 'Failing after Ns', indistinguishable from real status=2 (Failure) at the commit-status layer:

run_status | desc_bucket           | count
-----------+-----------------------+-------
2 Failure  | has-been-cancelled    |    47
2 Failure  | other (Failing after) |   287
3 Cancelled| has-been-cancelled    |    76  ← existing filter only catches these
3 Cancelled| other (Failing after) |   626  ← these slip through, fire phantom issues

For all 7 false-positive SHAs, the underlying action_run.status was confirmed as either 3 (Cancelled) or 1 (Success — emission artifact) via the canonical mol_action_status helper on op-host.

Why not query action_run.status directly (spec Part 1)

The canonical filter is not reachable from a Gitea Actions runner. Empirically probed against Gitea 1.22.6:

  • /api/v1/.../actions/runs/{id} → HTTP 404
  • /api/v1/.../actions/jobs/{id} → HTTP 404
  • /api/v1/.../actions/tasks/{id} → HTTP 404
  • swagger.v1.json paths matching *action* → secrets + variables + runners only; no run endpoints

The SPA backend (/{repo}/actions/runs/{id}/jobs/{idx} POST) requires a session CSRF token, unreachable from a runner. DB access (mol_action_status, docker exec molecule-postgres-1 psql ...) lives only on the operator host.

The HEAD-recheck is the strongest signal a runner can produce without that endpoint. In offline replay against the 7 false-positive SHAs, HEAD had moved past each one by the time the issue filed — the new filter catches all 7.

The PR also adds _resolve_action_run_status(target_url) as an extensibility hook returning None today; when Gitea ≥1.23 (or an op-host proxy) exposes the status endpoint, only that function body changes — caller contract is stable.

Test plan

  • test_head_recheck_skips_file_when_head_moved — SHA_A initial → SHA_B on recheck → no POST
  • test_head_recheck_skips_file_when_recheck_status_recovered — failure → success on same SHA → no POST
  • test_head_recheck_files_when_still_red_after_settling — over-filter regression guard (persistent red MUST file)
  • test_head_recheck_skips_when_initial_was_only_cancel_cascade — cancel-cascade filter ordering guard
  • test_resolve_action_run_status_returns_none_on_no_endpoint — pins extensibility-hook contract
  • --dry-run against live Gitea returns correct PENDING/no-action on current main HEAD (660fc20124)
  • All 41 tests pass in 0.15s (autouse _stub_time_sleep fixture keeps suite fast)

Sibling-watchdog sweep (per task #394 follow-up)

Swept all 74 molecule-ai/* repos for sibling watchdogs that file issues from commit_status data:

  • molecule-core/.gitea/workflows/main-red-watchdog.yml — this PR
  • molecule-core/.gitea/workflows/status-reaper.yml — does NOT file issues (reaps stale rows); not affected
  • molecule-core/.gitea/workflows/ci-required-drift.yml — files issues from config drift (workflow YAML vs branch-protection), not commit_status; not affected
  • molecule-controlplane/.gitea/workflows/ci-required-drift.yml — same config-drift pattern as core; not affected

No sibling watchdogs share the status-enum-blind defect. No follow-up task needed.

References

  • reference_chronic_red_sweep_cancelled_vs_failed_filter
  • feedback_gitea_status_enum_use_helper_not_raw_int
  • reference_gitea_action_status_enum_corrected_2026_05_19
  • feedback_dispatch_investigation_and_fix_default_dont_ask
  • feedback_decide_routine_prod_ops_no_go_ask

Task: #394
Closes: mc#1597, #1605, #1609, #1613, #1626, #1627, #1630 (manually closed in triage; this PR is the structural fix that prevents the recurrence class)

## Summary Adds a 90s settling window + HEAD-recheck before filing `[main-red]` issues, suppressing the false-positive cluster mc#1597..1630. After initial red detection: 1. Re-fetch HEAD SHA — if main moved on (new commit landed mid-tick), skip-file. 2. Re-fetch combined status on the same SHA — if it recovered (transient cancel-cascade rolled forward to success on retry), skip-file. Both skip paths emit distinct Loki events so obs queries can track filter activity vs the genuine red signal. ## Evidence 7 false-positive `[main-red]` issues filed in 24h (mc#1597, #1605, #1609, #1613, #1626, #1627, #1630), all closed in triage 2026-05-21 04:55 as 6 cancel-cascade + 1 emission artifact, **zero real regressions**. Empirical 7-day DB sweep on 2026-05-20 (operator-host `psql`) shows the existing description-string filter (`'Has been cancelled'`) catches only **76/702 ≈ 11%** of `action_run.status=3` (Cancelled) entries — 89% are written as `'Failing after Ns'`, indistinguishable from real `status=2` (Failure) at the commit-status layer: ``` run_status | desc_bucket | count -----------+-----------------------+------- 2 Failure | has-been-cancelled | 47 2 Failure | other (Failing after) | 287 3 Cancelled| has-been-cancelled | 76 ← existing filter only catches these 3 Cancelled| other (Failing after) | 626 ← these slip through, fire phantom issues ``` For all 7 false-positive SHAs, the underlying `action_run.status` was confirmed as either 3 (Cancelled) or 1 (Success — emission artifact) via the canonical `mol_action_status` helper on op-host. ## Why not query `action_run.status` directly (spec Part 1) The canonical filter is **not reachable from a Gitea Actions runner**. Empirically probed against Gitea 1.22.6: - `/api/v1/.../actions/runs/{id}` → HTTP 404 - `/api/v1/.../actions/jobs/{id}` → HTTP 404 - `/api/v1/.../actions/tasks/{id}` → HTTP 404 - `swagger.v1.json` paths matching `*action*` → secrets + variables + runners only; no run endpoints The SPA backend (`/{repo}/actions/runs/{id}/jobs/{idx}` POST) requires a session CSRF token, unreachable from a runner. DB access (`mol_action_status`, `docker exec molecule-postgres-1 psql ...`) lives only on the operator host. The HEAD-recheck is the strongest signal a runner can produce without that endpoint. In offline replay against the 7 false-positive SHAs, HEAD had moved past each one by the time the issue filed — the new filter catches all 7. The PR also adds `_resolve_action_run_status(target_url)` as an extensibility hook returning `None` today; when Gitea ≥1.23 (or an op-host proxy) exposes the status endpoint, only that function body changes — caller contract is stable. ## Test plan - [x] `test_head_recheck_skips_file_when_head_moved` — SHA_A initial → SHA_B on recheck → no POST - [x] `test_head_recheck_skips_file_when_recheck_status_recovered` — failure → success on same SHA → no POST - [x] `test_head_recheck_files_when_still_red_after_settling` — over-filter regression guard (persistent red MUST file) - [x] `test_head_recheck_skips_when_initial_was_only_cancel_cascade` — cancel-cascade filter ordering guard - [x] `test_resolve_action_run_status_returns_none_on_no_endpoint` — pins extensibility-hook contract - [x] `--dry-run` against live Gitea returns correct PENDING/no-action on current `main` HEAD (`660fc20124`) - [x] All 41 tests pass in 0.15s (autouse `_stub_time_sleep` fixture keeps suite fast) ## Sibling-watchdog sweep (per task #394 follow-up) Swept all 74 `molecule-ai/*` repos for sibling watchdogs that file issues from `commit_status` data: - **`molecule-core/.gitea/workflows/main-red-watchdog.yml`** — this PR - **`molecule-core/.gitea/workflows/status-reaper.yml`** — does NOT file issues (reaps stale rows); not affected - **`molecule-core/.gitea/workflows/ci-required-drift.yml`** — files issues from config drift (workflow YAML vs branch-protection), not commit_status; not affected - **`molecule-controlplane/.gitea/workflows/ci-required-drift.yml`** — same config-drift pattern as core; not affected No sibling watchdogs share the status-enum-blind defect. No follow-up task needed. ## References - `reference_chronic_red_sweep_cancelled_vs_failed_filter` - `feedback_gitea_status_enum_use_helper_not_raw_int` - `reference_gitea_action_status_enum_corrected_2026_05_19` - `feedback_dispatch_investigation_and_fix_default_dont_ask` - `feedback_decide_routine_prod_ops_no_go_ask` Task: #394 Closes: mc#1597, #1605, #1609, #1613, #1626, #1627, #1630 (manually closed in triage; this PR is the structural fix that prevents the recurrence class)
core-devops added 1 commit 2026-05-21 05:25:50 +00:00
fix(watchdog): add HEAD-recheck + settling delay to suppress cancel-cascade false-positives
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 25s
CI / Python Lint & Test (pull_request) Successful in 13s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
E2E Chat / detect-changes (pull_request) Successful in 12s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 1m38s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4m41s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m8s
gate-check-v3 / gate-check (pull_request) Successful in 7s
qa-review / approved (pull_request) Successful in 4s
security-review / approved (pull_request) Successful in 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 3s
sop-checklist / review-refire (pull_request) Has been skipped
sop-tier-check / tier-check (pull_request) Successful in 5s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m13s
CI / Canvas (Next.js) (pull_request) Successful in 5m54s
CI / all-required (pull_request) Successful in 5m32s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m38s
E2E Chat / E2E Chat (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
Harness Replays / Harness Replays (pull_request) Successful in 2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m39s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
audit-force-merge / audit (pull_request) Successful in 6s
ec53cac4a1
Adds a 90s settling window + HEAD-recheck before filing `[main-red]`
issues. After initial red detection, the watchdog now:

  1. Re-fetches HEAD SHA — if main moved on (new commit landed
     mid-tick), skip-file and let the next cron tick re-evaluate.
  2. Re-fetches combined status on the same SHA — if it recovered
     (transient cancel-cascade rolled forward to success on retry),
     skip-file.

Both skip paths emit distinct Loki events
(`main_red_skipped_head_drift`, `main_red_skipped_recovered`) so obs
queries can track filter activity vs the genuine `main_red_detected`
path.

Background — 7 false-positive `[main-red]` issues filed in 24h
(mc#1597, #1605, #1609, #1613, #1626, #1627, #1630), all closed in
triage 2026-05-21 04:55 as 6 cancel-cascade + 1 emission artifact,
zero real regressions.

Empirical 7-day DB sweep on 2026-05-20 showed that of 702
`action_run.status=3` (Cancelled) entries that wrote a
`commit_status.state='failure'` row, only 76 (~11%) carried
description=`'Has been cancelled'` (the existing mc#1564 filter's
match string). 89% used `'Failing after Ns'`, indistinguishable from
real `status=2` (Failure) at the commit-status layer.

The canonical filter (only file when `action_run.status=2`) is not
reachable from a Gitea Actions runner — Gitea 1.22.6 exposes no REST
endpoint for `action_run.status` (probed empirically:
`/api/v1/.../actions/runs/{id}`, `/jobs/{id}`, `/tasks/{id}` all
return HTTP 404; swagger.v1.json contains no read endpoints for
action runs). The SPA backend requires a session CSRF token, and DB
access (`mol_action_status`, `docker exec ... psql`) lives only on
the operator host. The HEAD-recheck is the strongest signal a runner
can produce without that endpoint, and it caught all 7 of the
mc#1597..1630 false-positives in offline replay (HEAD had moved past
each of the 7 SHAs by the time the issue filed).

The PR also adds `_resolve_action_run_status(target_url)` as an
extensibility hook returning None today; when Gitea >=1.23 or an
op-host proxy exposes the status endpoint, the function body can be
filled in without changing callers.

Tests (4 new + 1 hook test):
  - `test_head_recheck_skips_file_when_head_moved` —
    SHA_A initial → SHA_B on recheck → no POST.
  - `test_head_recheck_skips_file_when_recheck_status_recovered` —
    failure → success on same SHA → no POST.
  - `test_head_recheck_files_when_still_red_after_settling` —
    over-filter regression guard: persistent red MUST file.
  - `test_head_recheck_skips_when_initial_was_only_cancel_cascade` —
    cancel-cascade filter ordering guard.
  - `test_resolve_action_run_status_returns_none_on_no_endpoint` —
    pins the extensibility-hook return contract.

An autouse `_stub_time_sleep` fixture is added so the existing
integration-style tests (`test_red_detected_opens_issue` et al.) don't
each block 90s on the new sleep call — pre-fix suite ran ~0.1s; with
the bare implementation it took >4 minutes (the stub keeps the suite
fast and deterministic without per-test patching).

41 tests pass in 0.15s. `--dry-run` against live Gitea returns the
correct PENDING/no-action result on the current `main` head.

References:
  - reference_chronic_red_sweep_cancelled_vs_failed_filter
  - feedback_gitea_status_enum_use_helper_not_raw_int
  - reference_gitea_action_status_enum_corrected_2026_05_19
  - feedback_dispatch_investigation_and_fix_default_dont_ask
  - feedback_decide_routine_prod_ops_no_go_ask

Task: #394
Closes: mc#1597, #1605, #1609, #1613, #1626, #1627, #1630 (already
manually closed in triage; this PR is the structural fix that
prevents the recurrence class)
core-qa approved these changes 2026-05-21 05:27:32 +00:00
core-qa left a comment
Member

Five-Axis review — APPROVED.

1. Correctness — HEAD-recheck logic is correct. Branch-SHA fetched twice (line 671→690), recheck uses fresh get_combined_status(sha) on same SHA, both skip paths early-return 0. Race window when HEAD moves DURING the 90s sleep is handled benignly: if HEAD changes, skip-file and next cron tick re-evaluates against new SHA (worst case: 1-tick latency, never a false-negative on a genuinely red SHA — the new SHA gets its own evaluation). 90s settling delay well below hourly cron cadence. Over-filter guard (test_head_recheck_files_when_still_red_after_settling) locks the genuine-failure path. No finding.

2. Readability_resolve_action_run_status docstring is exemplary: cites enum, probe results with HTTP codes, swap-in contract for future Gitea ≥1.23. Constants block at line 90-104 clearly explains override + test-stub seam. No finding.

3. Architecture — 90s sleep on hourly cron = 2.5% runner-slot occupancy; acceptable for the 89% false-positive suppression (per 7d sweep 626/702 = 89% cancels slipping through). Extensibility hook lets future endpoint swap-in without caller churn. No finding.

4. Security — no new auth surface, no new secrets, no new exfil paths. Loki events emit SHA + context only (existing pattern). No finding.

5. Performance — +1 branch GET + +1 combined-status GET per red detection. Red detections are rare (≤1/hour). Empirical evidence in PR body cites action_run.status enum bucketing with absolute counts (76/702, 287/702, etc.). Closing issue numbers (mc#1597, #1605, #1609, #1613, #1626, #1627, #1630) reproducible via Gitea issue history.

Minor (non-blocking): PR body does not cite per-SHA commit hashes for the 7 false-positives — only issue numbers. Future audits can derive SHAs from those issues' bodies. Recommend a follow-up only if a recurrence-class audit reopens.

Top findings: NONE blocking. APPROVED.

Five-Axis review — APPROVED. **1. Correctness** — HEAD-recheck logic is correct. Branch-SHA fetched twice (line 671→690), recheck uses fresh `get_combined_status(sha)` on same SHA, both skip paths early-return 0. Race window when HEAD moves DURING the 90s sleep is handled benignly: if HEAD changes, skip-file and next cron tick re-evaluates against new SHA (worst case: 1-tick latency, never a false-negative on a genuinely red SHA — the new SHA gets its own evaluation). 90s settling delay well below hourly cron cadence. Over-filter guard (`test_head_recheck_files_when_still_red_after_settling`) locks the genuine-failure path. No finding. **2. Readability** — `_resolve_action_run_status` docstring is exemplary: cites enum, probe results with HTTP codes, swap-in contract for future Gitea ≥1.23. Constants block at line 90-104 clearly explains override + test-stub seam. No finding. **3. Architecture** — 90s sleep on hourly cron = 2.5% runner-slot occupancy; acceptable for the 89% false-positive suppression (per 7d sweep 626/702 = 89% cancels slipping through). Extensibility hook lets future endpoint swap-in without caller churn. No finding. **4. Security** — no new auth surface, no new secrets, no new exfil paths. Loki events emit SHA + context only (existing pattern). No finding. **5. Performance** — +1 branch GET + +1 combined-status GET per red detection. Red detections are rare (≤1/hour). Empirical evidence in PR body cites `action_run.status` enum bucketing with absolute counts (76/702, 287/702, etc.). Closing issue numbers (mc#1597, #1605, #1609, #1613, #1626, #1627, #1630) reproducible via Gitea issue history. **Minor (non-blocking)**: PR body does not cite per-SHA commit hashes for the 7 false-positives — only issue numbers. Future audits can derive SHAs from those issues' bodies. Recommend a follow-up only if a recurrence-class audit reopens. Top findings: NONE blocking. APPROVED.
core-security approved these changes 2026-05-21 05:28:18 +00:00
core-security left a comment
Member

Security review (core-security): scope confirmed via GET /pulls/1635/files — only .gitea/scripts/main-red-watchdog.py (+107 LoC) and tests/test_main_red_watchdog.py (+226 LoC) touched. HEAD-recheck logic reuses the existing api() helper with the same GITEA_TOKEN Authorization header against pre-existing read endpoints (/repos/{o}/{r}/branches/{b} and /repos/{o}/{r}/commits/{sha}/status) — no auth-scope widening, no new HTTP surfaces, no secret/token references introduced. The only new env var is WATCHDOG_RECHECK_DELAY_SECS (numeric tuning knob, int()-parsed → no injection vector, non-secret). The 90s time.sleep runs in-process on the Gitea-Actions Ubuntu runner inside a single workflow job: no external resource is held open, no requests are issued during the wait, and the cron cadence (:05 hourly) makes 90s a strict subset of one tick — zero DoS surface (no CWE-400/CWE-770 concern). Tests stub time.sleep to a no-op so CI runtime is unaffected. No CWE flags. APPROVED.

Security review (core-security): scope confirmed via GET /pulls/1635/files — only `.gitea/scripts/main-red-watchdog.py` (+107 LoC) and `tests/test_main_red_watchdog.py` (+226 LoC) touched. HEAD-recheck logic reuses the existing `api()` helper with the same `GITEA_TOKEN` Authorization header against pre-existing read endpoints (`/repos/{o}/{r}/branches/{b}` and `/repos/{o}/{r}/commits/{sha}/status`) — no auth-scope widening, no new HTTP surfaces, no secret/token references introduced. The only new env var is `WATCHDOG_RECHECK_DELAY_SECS` (numeric tuning knob, `int()`-parsed → no injection vector, non-secret). The 90s `time.sleep` runs in-process on the Gitea-Actions Ubuntu runner inside a single workflow job: no external resource is held open, no requests are issued during the wait, and the cron cadence (`:05` hourly) makes 90s a strict subset of one tick — zero DoS surface (no CWE-400/CWE-770 concern). Tests stub `time.sleep` to a no-op so CI runtime is unaffected. No CWE flags. APPROVED.
core-devops merged commit 7f59b7fd35 into main 2026-05-21 06:08:42 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1635