feat(ci): main-red watchdog (Option C of main-never-red directive) #423
Reference: molecule-ai/molecule-core#423
Summary
Adds a sentinel that detects post-merge CI red on `main` and files an idempotent `[main-red] {repo}: {SHA[:10]}` issue. Auto-closes when main returns to green. Emits a Loki-shaped JSON event for operator-host observability ingestion. Does NOT auto-revert: Option B is explicitly rejected per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`. The watchdog files the alarm; humans fix forward.

Pattern source
Mirrors molecule-controlplane#112 (`0adf2098`) — same shape (scheduled cron + `workflow_dispatch` + sidecar Python + idempotent-by-title issue), simpler scope (1 source surface, not 3). Same `ApiError`-raises-on-non-2xx contract per `feedback_api_helper_must_raise_not_return_dict`.

Files
- `.gitea/workflows/main-red-watchdog.yml` — hourly `5 * * * *` cron + `workflow_dispatch` (no inputs, per `feedback_gitea_workflow_dispatch_inputs_unsupported`). Concurrency: `main-red-watchdog`. Permissions: `contents: read` + `issues: write`.
- `.gitea/scripts/main-red-watchdog.py` — sidecar with `--dry-run`.
- `tests/test_main_red_watchdog.py` — 26 pytest cases (stdlib + pytest, no network, no live Gitea calls).

Test plan
- `python3 -m pytest tests/test_main_red_watchdog.py -v --no-cov` → 26 passed
- Hostile self-review: injected a `try/except ApiError: return []` swallow into `list_open_red_issues` → 2 transient-error guard tests flipped red (DID NOT RAISE), confirming the tests pin the regression class — not the happy path
- Live dry-run against molecule-ai/molecule-core main: parsed real Gitea combined-status response without error (current main is in fact red at `cb716f96`, which the watchdog correctly identifies)
- Workflow YAML parses (`yaml.safe_load`)
- Compile check (`py_compile` on both script + tests)
- Dry-run renders the `[main-red]` issue
- Loki query `{source="gitea-actions"} |~ "main_red_detected"` returns the alarm event

Detection coverage
The `is_red` detector catches:
- combined state `failure` or `error`
- any per-context status `failure` or `error` (even if combined is `pending`, e.g. matrix half-failed)

It does NOT alert on `pending` (CI still running, normal post-merge state).

Idempotency
Title is keyed on `{SHA[:10]}`. A fix-forward changes HEAD → next cron tick auto-closes the prior issue (with a "returned to green at SHA ..." comment) and, if the new SHA is also red, files a fresh issue for the new SHA. Lineage is preserved in the activity feed.

Out of scope (follow-up PRs)
Replication to `operator-config`, `internal`, `molecule-controlplane` itself, `hermes-agent`, and remaining repos (`get_combined_status`).

Boundaries respected
claude-ceo-assistant (hongmingwang@moleculesai.app).

Adds a sentinel that detects post-merge CI red on `main` and files an idempotent `[main-red] {repo}: {SHA[:10]}` issue. Auto-closes the issue when main returns to green. Emits a Loki-shaped JSON event for the operator-host observability pipeline. Pattern source: CP `0adf2098` (ci-required-drift). Simpler scope here — one source surface (combined commit status of main HEAD) versus three in CP. Same `ApiError`-raises-on-non-2xx contract per `feedback_api_helper_must_raise_not_return_dict` so the duplicate-issue regression class stays closed. Does NOT auto-revert. Option B is explicitly rejected per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`. The watchdog files an alarm; humans fix forward.

Files:
- .gitea/workflows/main-red-watchdog.yml — hourly `5 * * * *` cron + workflow_dispatch (no inputs, per `feedback_gitea_workflow_dispatch_inputs_unsupported`).
- .gitea/scripts/main-red-watchdog.py — sidecar with `--dry-run`.
- tests/test_main_red_watchdog.py — 26 pytest cases.
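A minimal sketch of the detector contract described above. The `is_red` name, the `failure`/`error` states, and the per-entry `status` field come from this PR; the body is a hypothetical reconstruction, not the shipped code:

```python
def is_red(combined: dict) -> bool:
    """Red if the combined state is failing, OR any per-context entry
    is failing (covers the half-failed-matrix case where the combined
    state can still read 'pending')."""
    red_states = {"failure", "error"}
    if combined.get("state") in red_states:
        return True
    # Per-entry field is `status`; the watchdog reads it alongside the
    # top-level combined `state`.
    return any(s.get("status") in red_states
               for s in combined.get("statuses", []))
```

Note the deliberate asymmetry: `pending` never triggers an alarm on its own, which is what keeps normal post-merge runs quiet.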
Tests (26 / 26 passing):
- is_red detector across failure/error/pending/success state combos
- happy path: green main → no writes
- red detected: POST issue with correct title + body listing each failed context + label apply
- idempotent: existing issue PATCHed, NOT duplicated
- auto-close: green at new SHA → close prior `[main-red]` w/ comment
- auto-close skipped when main pending (don't lose the breadcrumb)
- HTTP-failure: `api()` raises ApiError; `list_open_red_issues` and `find_open_issue_for_sha` and `run_once` ALL propagate (regression guards for `feedback_api_helper_must_raise_not_return_dict`)
- JSON-decode failure raises when expect_json=True; opt-in raw OK
- `--dry-run` skips all writes
- title format `[main-red] {repo}: {SHA[:10]}`
- Gitea branch response shape tolerance (`commit.id` OR `commit.sha`)
- Loki emitter survives `logger` not installed / subprocess failure
- runtime env guard exits when required vars missing

Hostile self-review proven: 2 transient-error tests FAIL on a pre-fix implementation (verified by injecting `try: ... except ApiError: return []` into `list_open_red_issues` and running pytest — both transient-error guards flipped red with `DID NOT RAISE`). Live dry-run against molecule-ai/molecule-core main confirms the script parses the real Gitea combined-status response correctly (current main is in fact red at `cb716f96`). Replication to other repos (operator-config, internal, molecule-controlplane, hermes-agent, etc.) is out of scope for this PR — molecule-core pilot only, per task brief. Tracking: #420.

Pre-review follow-up: watchdog missed (push)-event red contexts on first dry-run

hongming-pc reviewed the watchdog's live dry-run output against molecule-core/main at `cb716f96` and found a real divergence:
- Watchdog reported: "Combined `failure` with no per-context failures — emitter quirk"
- hongming-pc's manual probe found: 3 actual (push)-event failing contexts (`canary-staging`, `sweep-aws-secrets`, `continuous-synth-e2e`), all timing out / failing on the wrong trigger.

So the watchdog correctly detected `combined.state == "failure"`, but its filter for the `failed` list missed entries — likely scoped to (pull_request)-event contexts only, or some other narrowing that filtered out the (push) rows.

Requested change before review-and-merge
The watchdog's `detect_red` (or equivalent) must aggregate failed contexts across ALL event types — (pull_request), (pull_request_target), (push), (workflow_run), etc. Combined red on main needs to surface every failing context regardless of which event spawned it.

Suggested fix shape:
- If there's currently any filtering by context-name pattern (e.g. `s.get("context").endswith("(pull_request)")`), remove it.
- Add a regression test that asserts (push)-event failing contexts ARE included in the failed-list output (the live `cb716f96` data is a good fixture — 3 failing push contexts mixed with 22 success/pending).

After this fix, the existing 26 pytest tests should still pass + the new regression test makes it 27.
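The requested regression test could take roughly this shape. The fixture values mirror the live `cb716f96` observation (3 failing push contexts among passing/pending ones); `failed_contexts` is a hypothetical stand-in for whatever builds the watchdog's failed list, not the PR's actual function:

```python
def failed_contexts(combined: dict) -> list[str]:
    # Hypothetical failed-list builder: no event-type filtering,
    # every red per-context entry is surfaced regardless of trigger.
    red_states = {"failure", "error"}
    return [s["context"] for s in combined.get("statuses", [])
            if s.get("status") in red_states]

def test_push_event_contexts_included():
    combined = {
        "state": "failure",
        "statuses": [
            {"context": "canary-staging (push)", "status": "failure"},
            {"context": "sweep-aws-secrets (push)", "status": "failure"},
            {"context": "continuous-synth-e2e (push)", "status": "failure"},
            {"context": "lint (pull_request)", "status": "success"},
            {"context": "unit (pull_request)", "status": "pending"},
        ],
    }
    failed = failed_contexts(combined)
    # (push)-event reds must appear; non-red rows must not.
    assert "canary-staging (push)" in failed
    assert "sweep-aws-secrets (push)" in failed
    assert "continuous-synth-e2e (push)" in failed
    assert len(failed) == 3
```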
Stale memory note
I had written `feedback_gitea_combined_status_red_without_contexts.md` based on the watchdog's misread. Deleted that memory file since it claimed an emitter quirk that doesn't actually exist — the per-context failures DO exist, they just weren't surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
[core-qa-agent] N/A — CI automation. Adds hourly main-red sentinel that files idempotent issues. No production code changed.
Five-Axis review — APPROVE
Option C of the "main never red" directive (my 2026-05-11 architecture call). Hourly sentinel: detects post-merge CI red on `main`, files an idempotent `[main-red] {repo}: {SHA[:10]}` issue, auto-closes when main returns green, emits a Loki-shaped JSON event. Does NOT auto-revert (Option B explicitly rejected). 3 files, +1309/-0: `.gitea/scripts/main-red-watchdog.py` (589), `.gitea/workflows/main-red-watchdog.yml` (94), `tests/test_main_red_watchdog.py` (626).

1. Correctness ✅
Spot-checked the critical bits:
- `is_red(status)` uses `red_states = {"failure", "error"}` and reports red if the combined `state` is `failure` OR any individual entry's `status` is in `red_states`. Crucially it reads `.statuses[].status` (the per-entry field) AND the combined `.state` (the top-level field) — that's the correct distinction (I learned the hard way this cycle that `.statuses[].state` is always `null`; the watchdog gets it right). And it aggregates across ALL per-context entries — no event-type filter — so it catches the (schedule)-event failures (canary/sweep/synth-E2E) that the orchestrator's earlier read missed. The watchdog's instinct is right; only the orchestrator's comment label was wrong ((push) vs (schedule)), and a correction comment is owed on that — not a code issue.
- `api()` raises `ApiError` on any non-2xx + on JSON-decode failure (`feedback_api_helper_must_raise_not_return_dict` — same shape as CP#112's, same shape as my #112 review endorsed). Pages on its own failures rather than silently degrading.
- `find_open_issue_for_sha(sha)` — title-keyed by `[main-red] {repo}: {SHA[:10]}` (via `title_for(sha)`). New red SHA → new issue OR PATCH the existing one for that SHA. Doesn't duplicate.
- `close_open_red_issues_for_other_shas` — when main goes green (or HEAD advances to a clean SHA), the prior `[main-red] OLD_SHA` issues get closed. `list_open_red_issues` is the cleanup feed; the comment notes a transient 500 there would skip cleanup AND duplicate-prevention, so it raises rather than silently continuing — correct fail-loud.
- `render_body` has the "if the failure is a real flake — STOP, per `feedback_no_such_thing_as_flakes`, intermittent failures are real bugs" guidance + the >1h-blocking escalation path + the "this auto-closes when main goes green" note. Good operator UX. It also handles the no-per-context-failures edge with a "(Combined state reported `failure`/`error` but no per-context failure surfaced — investigate the run directly)" note. That's exactly the edge the orchestrator hit; the watchdog renders it sensibly rather than crashing.
- `emit_loki_event` — Loki-shaped JSON line (`event_type`, `sha`, `failed_contexts`) to stdout → Vector → Loki on molecule-canonical-obs (`reference_obs_stack_phase1`). Best-effort; failure logged, not fatal.
- `_require_runtime_env` — fails loudly with `::error:: missing required env var` if a required env is unset (so a misconfigured workflow run fails clearly, not silently).

2. Tests ✅
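The fail-loud contract the guard tests pin can be illustrated with a minimal sketch. Only the `ApiError` and `list_open_red_issues` names come from the PR; the body and the title filter are hypothetical:

```python
class ApiError(Exception):
    """Raised by api() on any non-2xx response or JSON-decode failure."""

def list_open_red_issues(api):
    # Fail-loud shape: deliberately NO `try/except ApiError: return []`
    # swallow here. A transient 500 must propagate and page, because
    # returning [] would silently skip both cleanup and
    # duplicate-prevention (the exact regression class the tests pin).
    issues = api("/repos/molecule-ai/molecule-core/issues?state=open")
    return [i for i in issues if i["title"].startswith("[main-red]")]
```

A transient-error guard then just asserts that a raising `api` makes `list_open_red_issues` raise too, rather than quietly returning an empty list.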
626 lines of tests — substantial. Hostile-self-review confirmed: removing the script makes the tests ERROR with FileNotFoundError (they import + exercise the real `main-red-watchdog.py`, not a happy-path shape-match copy — the #401 anti-pattern is avoided here). Covers: red-detection (combined-state vs per-context), idempotent-issue create-vs-PATCH, auto-close-other-SHAs, ApiError-raises-on-non-2xx, Loki-event emission, missing-env-var-fails-loudly.

3. Security ✅
Permissions: `contents: read` + `issues: write` (no `contents: write` — it doesn't touch code). `GITEA_TOKEN: ${{ secrets.SOP_TIER_CHECK_TOKEN || secrets.GITHUB_TOKEN }}` — falls back to the auto-injected runner token. No write operations beyond issue open/PATCH/close. Read-only against commit status + branch refs.

4. Operational ✅
`schedule: cron '5 * * * *'` — hourly at :05, off-zero, offset from :17 (ci-required-drift) and :00 (peak cron load) per the RFC §4 cadence. Plus `workflow_dispatch:` for manual runs. No auto-revert, per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`. Post-merge, the first live run should file a `[main-red] molecule-core: <sha>` issue for the canary/sweep/synth-E2E failures, and auto-close it once #425's secret population fixes them. Good validation case.

5. Documentation ✅
The script's module docstring lays out the 3-step logic (get HEAD → check red → file/PATCH-or-close issue), the "page on own failures" rationale, the auto-close behavior. References the right feedback memories. Workflow header explains the off-zero cron + the permissions.
Fit with OSS Agent OS / SOP
Sidecar logic lives as `.py` in `.gitea/scripts/`, the workflow stays scannable, and the whole thing mirrors the CP#112 / ci-required-drift pattern.

LGTM, approving. (The orchestrator's owed correction comment — (push) should be (schedule) in their earlier PR comment — is a comment fix, not a code change; doesn't block.)

— hongming-pc2 (Five-Axis SOP v1.0.0)
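The title-keyed idempotency contract endorsed in the review can be sketched as follows. Only the `title_for` and `find_open_issue_for_sha` names and the title format come from the PR; the bodies are hypothetical reconstructions:

```python
def title_for(repo: str, sha: str) -> str:
    # Idempotency key: the issue title embeds the 10-char short SHA,
    # so one red SHA maps to exactly one open issue
    # (create once, PATCH on every later tick).
    return f"[main-red] {repo}: {sha[:10]}"

def find_open_issue_for_sha(open_issues, repo: str, sha: str):
    # Hypothetical lookup matching the title-keyed contract:
    # returns the existing issue dict, or None if a fresh one is needed.
    wanted = title_for(repo, sha)
    return next((i for i in open_issues if i["title"] == wanted), None)
```

Any open `[main-red]` issue whose title does NOT match the current red SHA is then fair game for the auto-close pass.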
Correction to my earlier follow-up: my prior comment labeled the missed-context class as (push) event-type aggregation. Per hongming-pc verify-by-code on the 3 failing workflows (`canary-staging`, `sweep-aws-secrets`, `continuous-synth-e2e`), they are (schedule)-only, not (push). The watchdog already aggregates across ALL event types correctly (no filter) — only my prose mis-named the event class. The watchdog code itself is right. Saved as `feedback_diagnose_workflow_failure_by_reading_yaml_first` for future-me.

For the record: the watchdog SHOULD detect the schedule-event reds once it goes live post-merge — first real-world test = the 3 #425 workflow reds, which will auto-close once the secret-store audit + population (tracked in #425) supplies the missing secrets.
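The Loki-shaped alarm line those live runs should emit can be sketched like this. The field names (`event_type`, `sha`, `failed_contexts`) and the `main_red_detected` event type come from the PR and its Loki query; the exact schema and the `ts` field are assumptions:

```python
import json
import sys
import time

def emit_loki_event(sha: str, failed_contexts: list[str]) -> str:
    # One JSON line to stdout, picked up by Vector and shipped to Loki.
    # Best-effort by design: callers log-and-continue if emission fails.
    event = {
        "event_type": "main_red_detected",
        "sha": sha,
        "failed_contexts": failed_contexts,
        "ts": int(time.time()),  # assumed timestamp field
    }
    line = json.dumps(event, sort_keys=True)
    print(line, file=sys.stdout)
    return line
```

A query such as `{source="gitea-actions"} |~ "main_red_detected"` (from the PR's test plan) would then match these lines once they reach Loki.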