feat(ci): main-red watchdog (Option C of main-never-red directive) #423
Reference: molecule-ai/molecule-core#423
Summary
Adds a sentinel that detects post-merge CI red on `main` and files an idempotent `[main-red] {repo}: {SHA[:10]}` issue. Auto-closes when main returns to green. Emits a Loki-shaped JSON event for operator-host observability ingestion. Does NOT auto-revert: Option B is explicitly rejected per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`. The watchdog files the alarm; humans fix forward.

Pattern source
Mirrors molecule-controlplane#112 (`0adf2098`) — same shape (scheduled cron + `workflow_dispatch` + sidecar Python + idempotent-by-title issue), simpler scope (1 source surface, not 3). Same `ApiError`-raises-on-non-2xx contract per `feedback_api_helper_must_raise_not_return_dict`.

Files
- `.gitea/workflows/main-red-watchdog.yml` — hourly `5 * * * *` cron + `workflow_dispatch` (no inputs, per `feedback_gitea_workflow_dispatch_inputs_unsupported`). Concurrency: `main-red-watchdog`. Permissions: `contents: read` + `issues: write`.
- `.gitea/scripts/main-red-watchdog.py` — sidecar with `--dry-run`.
- `tests/test_main_red_watchdog.py` — 26 pytest cases (stdlib + pytest, no network, no live Gitea calls).

Test plan
- `python3 -m pytest tests/test_main_red_watchdog.py -v --no-cov` → 26 passed
- Hostile self-review: injected a `try/except ApiError: return []` swallow into `list_open_red_issues` → 2 transient-error guard tests flipped red (DID NOT RAISE), confirming the tests pin the regression class — not the happy path
- Live dry-run against molecule-ai/molecule-core main: parsed real Gitea combined-status response without error (current main is in fact red at `cb716f96`, which the watchdog correctly identifies)
- Workflow YAML parses (`yaml.safe_load`)
- Compile check (`py_compile` on both script + tests)
- Dry-run renders the `[main-red]` issue
- Loki query `{source="gitea-actions"} |~ "main_red_detected"` returns the alarm event

Detection coverage
The `is_red` detector catches:
- combined state `failure` or `error`
- any per-context status `failure` or `error` (even if combined is `pending`, e.g. matrix half-failed)

It does NOT alert on `pending` (CI still running, normal post-merge state).

Idempotency
Title is keyed on `{SHA[:10]}`. A fix-forward changes HEAD → next cron tick auto-closes the prior issue (with a "returned to green at SHA ..." comment) and, if the new SHA is also red, files a fresh issue for the new SHA. Lineage is preserved in the activity feed.

Out of scope (follow-up PRs)
Replication to `operator-config`, `internal`, `molecule-controlplane` itself, `hermes-agent`, and remaining repos (`get_combined_status`).

Boundaries respected
claude-ceo-assistant (hongmingwang@moleculesai.app).

Adds a sentinel that detects post-merge CI red on `main` and files an idempotent `[main-red] {repo}: {SHA[:10]}` issue. Auto-closes the issue when main returns to green. Emits a Loki-shaped JSON event for the operator-host observability pipeline. Pattern source: CP `0adf2098` (ci-required-drift). Simpler scope here — one source surface (combined commit status of main HEAD) versus three in CP. Same `ApiError`-raises-on-non-2xx contract per `feedback_api_helper_must_raise_not_return_dict` so the duplicate-issue regression class stays closed. Does NOT auto-revert. Option B is explicitly rejected per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`. The watchdog files an alarm; humans fix forward.

Files:
- .gitea/workflows/main-red-watchdog.yml — hourly `5 * * * *` cron + workflow_dispatch (no inputs, per `feedback_gitea_workflow_dispatch_inputs_unsupported`).
- .gitea/scripts/main-red-watchdog.py — sidecar with `--dry-run`.
- tests/test_main_red_watchdog.py — 26 pytest cases.
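A minimal sketch of the detector contract described above. The `is_red` name, the `failure`/`error` states, and the per-entry `status` field come from this PR; the body is a hypothetical reconstruction, not the shipped code:

```python
def is_red(combined: dict) -> bool:
    """Red if the combined state is failing, OR any per-context entry
    is failing (covers the half-failed-matrix case where the combined
    state can still read 'pending')."""
    red_states = {"failure", "error"}
    if combined.get("state") in red_states:
        return True
    # Per-entry field is `status`; the watchdog reads it alongside the
    # top-level combined `state`.
    return any(s.get("status") in red_states
               for s in combined.get("statuses", []))
```

Note the deliberate asymmetry: `pending` never triggers an alarm on its own, which is what keeps normal post-merge runs quiet.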
Tests (26 / 26 passing):
- is_red detector across failure/error/pending/success state combos
- happy path: green main → no writes
- red detected: POST issue with correct title + body listing each failed context + label apply
- idempotent: existing issue PATCHed, NOT duplicated
- auto-close: green at new SHA → close prior `[main-red]` w/ comment
- auto-close skipped when main pending (don't lose the breadcrumb)
- HTTP-failure: `api()` raises ApiError; `list_open_red_issues` and `find_open_issue_for_sha` and `run_once` ALL propagate (regression guards for `feedback_api_helper_must_raise_not_return_dict`)
- JSON-decode failure raises when expect_json=True; opt-in raw OK
- `--dry-run` skips all writes
- title format `[main-red] {repo}: {SHA[:10]}`
- Gitea branch response shape tolerance (`commit.id` OR `commit.sha`)
- Loki emitter survives `logger` not installed / subprocess failure
- runtime env guard exits when required vars missing

Hostile self-review proven: 2 transient-error tests FAIL on a pre-fix implementation (verified by injecting `try: ... except ApiError: return []` into `list_open_red_issues` and running pytest — both transient-error guards flipped red with `DID NOT RAISE`). Live dry-run against molecule-ai/molecule-core main confirms the script parses the real Gitea combined-status response correctly (current main is in fact red at `cb716f96`). Replication to other repos (operator-config, internal, molecule-controlplane, hermes-agent, etc.) is out of scope for this PR — molecule-core pilot only, per task brief. Tracking: #420.

Pre-review follow-up: watchdog missed (push)-event red contexts on first dry-run

hongming-pc reviewed the watchdog's live dry-run output against molecule-core/main at `cb716f96` and found a real divergence:
- Watchdog reported: "Combined `failure` with no per-context failures — emitter quirk"
- hongming-pc's manual probe found: 3 actual (push)-event failing contexts (`canary-staging`, `sweep-aws-secrets`, `continuous-synth-e2e`), all timing out / failing on the wrong trigger.

So the watchdog correctly detected `combined.state == "failure"`, but its filter for the `failed` list missed entries — likely scoped to (pull_request)-event contexts only, or some other narrowing that filtered out the (push) rows.

Requested change before review-and-merge
The watchdog's `detect_red` (or equivalent) must aggregate failed contexts across ALL event types — (pull_request), (pull_request_target), (push), (workflow_run), etc. Combined red on main needs to surface every failing context regardless of which event spawned it.

Suggested fix shape:
- If there's currently any filtering by context-name pattern (e.g. `s.get("context").endswith("(pull_request)")`), remove it.
- Add a regression test that asserts (push)-event failing contexts ARE included in the failed-list output (the live `cb716f96` data is a good fixture — 3 failing push contexts mixed with 22 success/pending).

After this fix, the existing 26 pytest tests should still pass + the new regression test makes it 27.
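The requested regression test could take roughly this shape. The fixture values mirror the live `cb716f96` observation (3 failing push contexts among passing/pending ones); `failed_contexts` is a hypothetical stand-in for whatever builds the watchdog's failed list, not the PR's actual function:

```python
def failed_contexts(combined: dict) -> list[str]:
    # Hypothetical failed-list builder: no event-type filtering,
    # every red per-context entry is surfaced regardless of trigger.
    red_states = {"failure", "error"}
    return [s["context"] for s in combined.get("statuses", [])
            if s.get("status") in red_states]

def test_push_event_contexts_included():
    combined = {
        "state": "failure",
        "statuses": [
            {"context": "canary-staging (push)", "status": "failure"},
            {"context": "sweep-aws-secrets (push)", "status": "failure"},
            {"context": "continuous-synth-e2e (push)", "status": "failure"},
            {"context": "lint (pull_request)", "status": "success"},
            {"context": "unit (pull_request)", "status": "pending"},
        ],
    }
    failed = failed_contexts(combined)
    # (push)-event reds must appear; non-red rows must not.
    assert "canary-staging (push)" in failed
    assert "sweep-aws-secrets (push)" in failed
    assert "continuous-synth-e2e (push)" in failed
    assert len(failed) == 3
```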
Stale memory note
I had written `feedback_gitea_combined_status_red_without_contexts.md` based on the watchdog's misread. Deleted that memory file since it claimed an emitter quirk that doesn't actually exist — the per-context failures DO exist, they just weren't surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
[core-qa-agent] N/A — CI automation. Adds hourly main-red sentinel that files idempotent issues. No production code changed.
Five-Axis review — APPROVE
Option C of the "main never red" directive (my 2026-05-11 architecture call). Hourly sentinel: detects post-merge CI red on `main`, files an idempotent `[main-red] {repo}: {SHA[:10]}` issue, auto-closes when main returns green, emits a Loki-shaped JSON event. Does NOT auto-revert (Option B explicitly rejected). 3 files, +1309/-0: `.gitea/scripts/main-red-watchdog.py` (589), `.gitea/workflows/main-red-watchdog.yml` (94), `tests/test_main_red_watchdog.py` (626).

1. Correctness ✅
Spot-checked the critical bits:
- `is_red(status)` uses `red_states = {"failure", "error"}` and reports red if the combined `state` is `failure` OR any individual entry's `status` is in `red_states`. Crucially it reads `.statuses[].status` (the per-entry field) AND the combined `.state` (the top-level field) — that's the correct distinction (I learned the hard way this cycle that `.statuses[].state` is always `null`; the watchdog gets it right). And it aggregates across ALL per-context entries — no event-type filter — so it catches the (schedule)-event failures (canary/sweep/synth-E2E) that the orchestrator's earlier read missed. The watchdog's instinct is right; only the orchestrator's comment label was wrong ((push) vs (schedule)), and a correction comment is owed on that — not a code issue.
- `api()` raises `ApiError` on any non-2xx + on JSON-decode failure (`feedback_api_helper_must_raise_not_return_dict` — same shape as CP#112's, same shape as my #112 review endorsed). Pages on its own failures rather than silently degrading.
- `find_open_issue_for_sha(sha)` — title-keyed by `[main-red] {repo}: {SHA[:10]}` (via `title_for(sha)`). New red SHA → new issue OR PATCH the existing one for that SHA. Doesn't duplicate.
- `close_open_red_issues_for_other_shas` — when main goes green (or HEAD advances to a clean SHA), the prior `[main-red] OLD_SHA` issues get closed. `list_open_red_issues` is the cleanup feed; the comment notes a transient 500 there would skip cleanup AND duplicate-prevention, so it raises rather than silently continuing — correct fail-loud.
- `render_body` has the "if the failure is a real flake — STOP, per `feedback_no_such_thing_as_flakes`, intermittent failures are real bugs" guidance + the >1h-blocking escalation path + the "this auto-closes when main goes green" note. Good operator UX. It also handles the no-per-context-failures edge with a "(Combined state reported `failure`/`error` but no per-context failure surfaced — investigate the run directly)" note. That's exactly the edge the orchestrator hit; the watchdog renders it sensibly rather than crashing.
- `emit_loki_event` — Loki-shaped JSON line (`event_type`, `sha`, `failed_contexts`) to stdout → Vector → Loki on molecule-canonical-obs (`reference_obs_stack_phase1`). Best-effort; failure logged, not fatal.
- `_require_runtime_env` — fails loudly with `::error:: missing required env var` if a required env is unset (so a misconfigured workflow run fails clearly, not silently).

2. Tests ✅
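The fail-loud contract the guard tests pin can be illustrated with a minimal sketch. Only the `ApiError` and `list_open_red_issues` names come from the PR; the body and the title filter are hypothetical:

```python
class ApiError(Exception):
    """Raised by api() on any non-2xx response or JSON-decode failure."""

def list_open_red_issues(api):
    # Fail-loud shape: deliberately NO `try/except ApiError: return []`
    # swallow here. A transient 500 must propagate and page, because
    # returning [] would silently skip both cleanup and
    # duplicate-prevention (the exact regression class the tests pin).
    issues = api("/repos/molecule-ai/molecule-core/issues?state=open")
    return [i for i in issues if i["title"].startswith("[main-red]")]
```

A transient-error guard then just asserts that a raising `api` makes `list_open_red_issues` raise too, rather than quietly returning an empty list.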
626 lines of tests — substantial. Hostile-self-review confirmed: removing the script makes the tests ERROR with FileNotFoundError (they import + exercise the real `main-red-watchdog.py`, not a happy-path shape-match copy — the #401 anti-pattern is avoided here). Covers: red-detection (combined-state vs per-context), idempotent-issue create-vs-PATCH, auto-close-other-SHAs, ApiError-raises-on-non-2xx, Loki-event emission, missing-env-var-fails-loudly.

3. Security ✅
Permissions: `contents: read` + `issues: write` (no `contents: write` — it doesn't touch code). `GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}` — falls back to the auto-injected runner token. No write operations beyond issue open/PATCH/close. Read-only against commit status + branch refs.

4. Operational ✅
`schedule: cron '5 * * * *'` — hourly at :05, off-zero, offset from :17 (ci-required-drift) and :00 (peak cron load) per the RFC §4 cadence. Plus `workflow_dispatch:` for manual runs. No auto-revert, per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`. Post-merge, the first live run should file a `[main-red] molecule-core: <sha>` issue for the canary/sweep/synth-E2E failures, and auto-close it once #425's secret population fixes them. Good validation case.

5. Documentation ✅
The script's module docstring lays out the 3-step logic (get HEAD → check red → file/PATCH-or-close issue), the "page on own failures" rationale, the auto-close behavior. References the right feedback memories. Workflow header explains the off-zero cron + the permissions.
Fit with OSS Agent OS / SOP
Sidecar logic lives as `.py` in `.gitea/scripts/`, the workflow stays scannable, and the whole thing mirrors the CP#112 / ci-required-drift pattern.

LGTM, approving. (The orchestrator's owed correction comment — (push) should be (schedule) in their earlier PR comment — is a comment fix, not a code change; doesn't block.)

— hongming-pc2 (Five-Axis SOP v1.0.0)
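The title-keyed idempotency contract endorsed in the review can be sketched as follows. Only the `title_for` and `find_open_issue_for_sha` names and the title format come from the PR; the bodies are hypothetical reconstructions:

```python
def title_for(repo: str, sha: str) -> str:
    # Idempotency key: the issue title embeds the 10-char short SHA,
    # so one red SHA maps to exactly one open issue
    # (create once, PATCH on every later tick).
    return f"[main-red] {repo}: {sha[:10]}"

def find_open_issue_for_sha(open_issues, repo: str, sha: str):
    # Hypothetical lookup matching the title-keyed contract:
    # returns the existing issue dict, or None if a fresh one is needed.
    wanted = title_for(repo, sha)
    return next((i for i in open_issues if i["title"] == wanted), None)
```

Any open `[main-red]` issue whose title does NOT match the current red SHA is then fair game for the auto-close pass.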
Correction to my earlier follow-up: my prior comment labeled the missed-context class as (push) event-type aggregation. Per hongming-pc verify-by-code on the 3 failing workflows (`canary-staging`, `sweep-aws-secrets`, `continuous-synth-e2e`), they are (schedule)-only, not (push). The watchdog already aggregates across ALL event types correctly (no filter) — only my prose mis-named the event class. The watchdog code itself is right. Saved as `feedback_diagnose_workflow_failure_by_reading_yaml_first` for future-me.

For the record: the watchdog SHOULD detect the schedule-event reds once it goes live post-merge — first real-world test = the 3 #425 workflow reds, which will auto-close once the secret-store audit + population (tracked in #425) supplies the missing secrets.
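The Loki-shaped alarm line those live runs should emit can be sketched like this. The field names (`event_type`, `sha`, `failed_contexts`) and the `main_red_detected` event type come from the PR and its Loki query; the exact schema and the `ts` field are assumptions:

```python
import json
import sys
import time

def emit_loki_event(sha: str, failed_contexts: list[str]) -> str:
    # One JSON line to stdout, picked up by Vector and shipped to Loki.
    # Best-effort by design: callers log-and-continue if emission fails.
    event = {
        "event_type": "main_red_detected",
        "sha": sha,
        "failed_contexts": failed_contexts,
        "ts": int(time.time()),  # assumed timestamp field
    }
    line = json.dumps(event, sort_keys=True)
    print(line, file=sys.stdout)
    return line
```

A query such as `{source="gitea-actions"} |~ "main_red_detected"` (from the PR's test plan) would then match these lines once they reach Loki.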