RCA: main-red watchdog leaves stale open issues after main returns green #1789

Closed
opened 2026-05-24 08:23:28 +00:00 by agent-researcher · 3 comments
Member

MECHANISM: `main` is currently green at `272cb8b7d6be24035ea2557e21c1d5356f59a7d4`, but stale `[main-red] molecule-ai/molecule-core: ...` issues remain open. The watchdog promises to close prior red issues when the combined status returns to success (`.gitea/workflows/main-red-watchdog.yml:14-15`), and `run_once()` does call `close_open_red_issues_for_other_shas()` on success (`.gitea/scripts/main-red-watchdog.py:735-743`). However, the cleanup path depends on `list_open_red_issues()`, which fetches only one 50-item issue page and documents an invariant that open `[main-red]` issues are "by design <= 1" (`main-red-watchdog.py:365-367`). That invariant is false in production, so stale main-red issues survive cleanup and keep resolved incidents visible in the PM queue.

EVIDENCE: A direct status check for current main `272cb8b7d6be` returned combined `success`; `CI / Platform (Go)`, `Handlers Postgres Integration`, `CI / all-required`, and production auto-deploy were green. A direct issue search still returned open main-red issues, including #1776 (`50720fb84a`), #1757 (`4d32736e25`), #1730 (`e05fc4daa`), #1729 (`6c7f66fa31`), #1681 (`01087ddbe7`), and others. Spot checks of issue details showed that #1776, #1757, and #1730 have zero comments, so they never received the watchdog's `main returned to green` close comment (`main-red-watchdog.py:601-607`).

RECOMMENDED FIX SHAPE: The responsible repo/file is `molecule-core/.gitea/scripts/main-red-watchdog.py`, with workflow context in `.gitea/workflows/main-red-watchdog.yml`. Make the green-state cleanup paginate until exhaustion and close every matching stale `[main-red] {repo}: <sha>` issue, independent of the former <=1 invariant. Add a one-shot backfill path or a `workflow_dispatch` run to clear the current backlog. Add a regression test that seeds more than 50 stale main-red issues and verifies that all stale matches close when the current main status is success.
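The paginate-until-exhaustion shape recommended above could look roughly like the following. This is a minimal sketch, not the watchdog's actual code: `fetch_page` is a hypothetical callable standing in for whatever HTTP helper the script uses, `PAGE_SIZE` is assumed from the "50-item issue page" described above, and the fake issue list exists only for illustration.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 50  # assumed from the single 50-item page described in the issue


def list_open_red_issues(
    fetch_page: Callable[[int, int], List[Dict]], repo: str
) -> List[Dict]:
    """Collect every open issue titled '[main-red] {repo}: ...' across all pages.

    fetch_page(page, limit) is a hypothetical stand-in for the real API call;
    it must return the open issues on that 1-indexed page, or [] past the end.
    """
    prefix = f"[main-red] {repo}:"
    matches: List[Dict] = []
    page = 1
    while True:
        batch = fetch_page(page, PAGE_SIZE)
        if not batch:  # an empty page means the listing is exhausted
            break
        matches.extend(i for i in batch if i.get("title", "").startswith(prefix))
        page += 1
    return matches


def make_fake_fetch(issues: List[Dict]) -> Callable[[int, int], List[Dict]]:
    """Build an in-memory fetch_page that slices a list (illustration only)."""
    def fetch_page(page: int, limit: int) -> List[Dict]:
        start = (page - 1) * limit
        return issues[start:start + limit]
    return fetch_page


repo = "molecule-ai/molecule-core"
# 120 fake open issues; every even-numbered one is a main-red issue.
issues = [
    {
        "number": n,
        "title": f"[main-red] {repo}: sha{n}" if n % 2 == 0 else f"Unrelated issue {n}",
    }
    for n in range(1, 121)
]

found = list_open_red_issues(make_fake_fetch(issues), repo)
print(len(found))  # 60; a single-page fetch would have seen only 25
```

Stopping on an empty page costs one extra request compared with a short-page check, but it still terminates correctly if the server returns fewer items per page than the requested limit.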

Author
Member

MECHANISM: A follow-up audit confirms that the stale-issue failure lies in the watchdog's open-issue enumeration, plus a timing nuance. `.gitea/workflows/main-red-watchdog.yml:14-15` promises that stale red issues close when the combined status is success, and `.gitea/scripts/main-red-watchdog.py:735-743` invokes cleanup only in that exact success branch. Today `main` at `272cb8b7d6be24035ea2557e21c1d5356f59a7d4` is `pending` because `Staging SaaS smoke (every 30 min) / Staging SaaS smoke (push)` is still running, so no cleanup should fire this tick. But when a later tick is green, `list_open_red_issues()` still fetches `/issues?state=open&type=issues&limit=50` only once at `.gitea/scripts/main-red-watchdog.py:365-373`, then filters that single page at `.gitea/scripts/main-red-watchdog.py:378-381`.

EVIDENCE: Direct API pagination found 42 open `[main-red] molecule-ai/molecule-core:` issues: 22 on page 1, 19 on page 2, and 1 on page 3. A green cleanup can therefore close at most the 22 page-1 issues and will never see the 20 older stale issues on pages 2 and 3. The close loop itself is otherwise broad once it receives issues: `.gitea/scripts/main-red-watchdog.py:590-608` iterates over every returned stale issue and emits the `main returned to green` comment before closing. Log/status excerpt: `combined pending total 30`; the only non-success context observed was `Staging SaaS smoke ... Has started running`.

RECOMMENDED FIX SHAPE: Keep ownership in `molecule-core/.gitea/scripts/main-red-watchdog.py`. Replace the single-page assumption with exhaustive pagination over open issues, preferably stopping only after an empty/short page, then close every stale issue whose title matches `[main-red] {repo}:`. Preserve the success-only cleanup gate in `run_once()` so that pending CI does not prematurely close an active incident. Add a regression test covering >50 total open issues with stale main-red issues on pages 2 and 3. The backfill can be a manual `workflow_dispatch` after the script lands.
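The regression described above can be sketched against an in-memory fake that reproduces the observed 22/19/1 distribution. Everything here is illustrative: `FakeIssueApi`, `build_page`, and `cleanup_if_green` are hypothetical names, not the script's real helpers; the sketch assumes pages 1 and 2 are full 50-item pages and gives page 3 an arbitrary 10 issues; and it treats every open main-red issue as stale, skipping the current-SHA filter the real cleanup applies.

```python
REPO = "molecule-ai/molecule-core"
PREFIX = f"[main-red] {REPO}:"
LIMIT = 50


def build_page(start, size, red_count):
    """Return `size` open issues; the first `red_count` are titled as main-red."""
    page = []
    for offset in range(size):
        number = start + offset
        title = f"{PREFIX} sha{number}" if offset < red_count else f"Other issue {number}"
        page.append({"number": number, "title": title, "state": "open"})
    return page


class FakeIssueApi:
    """Hypothetical in-memory stand-in for the issue API (illustration only)."""

    def __init__(self, pages):
        self.pages = pages
        self.closed = []

    def list_open(self, page, limit):
        # Pages are 1-indexed; past the last page the fake returns [].
        return self.pages[page - 1] if page <= len(self.pages) else []

    def close(self, number):
        self.closed.append(number)


def red_issues_first_page_only(api):
    """The single-page behaviour described in the audit."""
    return [i for i in api.list_open(1, LIMIT) if i["title"].startswith(PREFIX)]


def red_issues_all_pages(api):
    """The proposed behaviour: keep fetching until an empty page."""
    found, page = [], 1
    while True:
        batch = api.list_open(page, LIMIT)
        if not batch:
            return found
        found.extend(i for i in batch if i["title"].startswith(PREFIX))
        page += 1


def cleanup_if_green(api, combined_state, lister):
    """Close listed issues only when the combined state is exactly 'success'."""
    if combined_state != "success":
        return 0
    stale = lister(api)
    for issue in stale:
        api.close(issue["number"])
    return len(stale)


pages = [build_page(1, 50, 22), build_page(51, 50, 19), build_page(101, 10, 1)]

single = cleanup_if_green(FakeIssueApi(pages), "success", red_issues_first_page_only)
paged = cleanup_if_green(FakeIssueApi(pages), "success", red_issues_all_pages)
gated = cleanup_if_green(FakeIssueApi(pages), "pending", red_issues_all_pages)
print(single, paged, gated)  # 22 42 0
```

The three results map onto the three claims in this comment: the single-page lister closes only the 22 page-1 issues, the paginated lister drains all 42, and a `pending` tick closes nothing because the success-only gate is preserved.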

Author
Member

MECHANISM: A fresh audit confirms a second closeout blocker for the stale main-red queue. The watchdog closes stale `[main-red]` issues only when the branch's combined status is exactly `success`, but current `molecule-core/main` can sit at combined `pending` after required CI has recovered, because scheduled/non-required contexts are still running. At head `9843a970d370`, required/core contexts including `CI / all-required (push)`, `Handlers Postgres Integration`, `CI / Canvas`, `Platform (Go)`, and production auto-deploy are success, while `Staging SaaS smoke`, `Continuous synthetic E2E`, and `main-red-watchdog` keep the combined state pending. Stale red issues therefore remain open even though the actionable failure has cleared.

EVIDENCE: `.gitea/workflows/main-red-watchdog.yml:14-15` promises stale issue closeout when main returns green. `.gitea/scripts/main-red-watchdog.py:747-764` treats pending as not-red but explicitly performs no close action unless `status.get("state") == "success"`. `.gitea/scripts/main-red-watchdog.py:356-380` still enumerates open `[main-red]` issues using one issue page and the false invariant that there is at most one open red issue. The direct API status for `9843a970d370fc5b883f009362a0a4f56fe9427a` showed `state=pending`, with the pending description `Has started running` on the scheduled staging, synthetic, and watchdog contexts, while the recovered CI contexts were success.

RECOMMENDED FIX SHAPE: Keep #1789 scoped to watchdog closeout semantics. In `molecule-core/.gitea/scripts/main-red-watchdog.py`, make stale closeout depend on the absence of failed required/main-red-relevant contexts rather than on raw combined `success`, or filter scheduled/non-required contexts out before deciding whether stale red issues should remain open. Pair that with paginated issue enumeration so that the cleanup loop can drain all historical `[main-red]` issues once the current head is non-red. Confidence: high; the current API state and the code path line up exactly.
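One way to read the second option (filter non-required contexts out before deciding) is sketched below. It is an assumption-laden illustration, not the script's logic: `should_close_stale_issues` and `REQUIRED_CONTEXTS` are hypothetical, the context names are copied from this comment rather than from the API, and the plain `{context: state}` dict is a simplified shape, not the status API's response format.

```python
from typing import Dict, Iterable

# Hypothetical required set, using the context names quoted in this comment.
REQUIRED_CONTEXTS = [
    "CI / all-required (push)",
    "Handlers Postgres Integration",
    "CI / Canvas",
    "Platform (Go)",
]


def should_close_stale_issues(
    context_states: Dict[str, str], required: Iterable[str]
) -> bool:
    """Return True only when every required context reports 'success'.

    Contexts outside `required` (scheduled smoke tests, the watchdog itself)
    are ignored. A required context that is missing, pending, or failed
    blocks the closeout, so an active incident is never closed early.
    """
    return all(context_states.get(name) == "success" for name in required)


# Simplified snapshot of the head described above: required CI has recovered,
# scheduled contexts are still running.
head = {
    "CI / all-required (push)": "success",
    "Handlers Postgres Integration": "success",
    "CI / Canvas": "success",
    "Platform (Go)": "success",
    "Staging SaaS smoke": "pending",
    "Continuous synthetic E2E": "pending",
    "main-red-watchdog": "pending",
}

print(should_close_stale_issues(head, REQUIRED_CONTEXTS))  # True

# If a required context is red again, the closeout stays blocked.
still_red = dict(head)
still_red["CI / Canvas"] = "failure"
print(should_close_stale_issues(still_red, REQUIRED_CONTEXTS))  # False
```

Requiring explicit success on each required context is the conservative choice: a required context that has not reported yet blocks the closeout instead of being mistaken for a recovery.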

Member

Closing: the pagination fix for `list_open_red_issues()` shipped in PR #1897 (commit 8c2f9a06 / merge 62d53130). The while-loop with page exhaustion is now in main.


Reference: molecule-ai/molecule-core#1789