[main-red] molecule-ai/molecule-core: a5d4bea96b #494

Closed
opened 2026-05-11 15:06:09 +00:00 by gitea-actions · 2 comments

Main is RED on molecule-ai/molecule-core at a5d4bea96b

Commit: https://git.moleculesai.app/molecule-ai/molecule-core/commit/a5d4bea96bfaba49c52923d3068a31f295a5a0d1

Auto-filed by .gitea/workflows/main-red-watchdog.yml (Option C of the main-never-red directive, https://git.moleculesai.app/molecule-ai/molecule-core/issues/420). Per feedback_no_such_thing_as_flakes + feedback_fix_root_not_symptom: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts.

Failed status contexts

(Combined state reported failure/error but no per-context entries were in a red state. This usually means a CI emitter set combined-status directly without a per-context status. Check the most recent workflow run for main and trace from there.)

Resolution path

  1. Read the failed logs (links above).
  2. If reproducible locally, fix forward in a PR targeting main.
  3. If the failure is a real flake — STOP. Per feedback_no_such_thing_as_flakes, intermittent failures are real bugs. Investigate to root cause; do not mark as flake.
  4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go (branch protection is a prod surface).

Debug

{
  "all_contexts": [
    {
      "context": "CI / Canvas (Next.js) (push)",
      "state": null
    },
    {
      "context": "CI / Canvas Deploy Reminder (push)",
      "state": null
    },
    {
      "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)",
      "state": null
    },
    {
      "context": "Block internal-flavored paths / Block forbidden paths (push)",
      "state": null
    },
    {
      "context": "Harness Replays / detect-changes (push)",
      "state": null
    },
    {
      "context": "Secret scan / Scan diff for credential-shaped strings (push)",
      "state": null
    },
    {
      "context": "publish-workspace-server-image / build-and-push (push)",
      "state": null
    },
    {
      "context": "E2E API Smoke Test / detect-changes (push)",
      "state": null
    },
    {
      "context": "Harness Replays / Harness Replays (push)",
      "state": null
    },
    {
      "context": "CI / Detect changes (push)",
      "state": null
    },
    {
      "context": "E2E Staging Canvas (Playwright) / detect-changes (push)",
      "state": null
    },
    {
      "context": "Handlers Postgres Integration / detect-changes (push)",
      "state": null
    },
    {
      "context": "Runtime PR-Built Compatibility / detect-changes (push)",
      "state": null
    },
    {
      "context": "CI / Platform (Go) (push)",
      "state": null
    },
    {
      "context": "CI / Shellcheck (E2E scripts) (push)",
      "state": null
    },
    {
      "context": "CI / Python Lint & Test (push)",
      "state": null
    },
    {
      "context": "E2E API Smoke Test / E2E API Smoke Test (push)",
      "state": null
    },
    {
      "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)",
      "state": null
    },
    {
      "context": "Runtime PR-Built Compatibility / PR-built wheel + import smoke (push)",
      "state": null
    },
    {
      "context": "publish-canvas-image / Build & push canvas image (push)",
      "state": null
    },
    {
      "context": "main-red-watchdog / watchdog (push)",
      "state": null
    }
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [],
  "sha": "a5d4bea96bfaba49c52923d3068a31f295a5a0d1"
}
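
For anyone triaging from the debug payload above, here is a minimal sketch of how the "combined red, no red context" case can be detected. The helper names `red_contexts` and `diagnose` are hypothetical and this is not the watchdog's actual code; the sample is the payload above trimmed to two contexts.

```python
import json

RED_STATES = ("failure", "error")

def red_contexts(payload: dict) -> list[str]:
    """Return the contexts whose own state is red."""
    return [
        c["context"]
        for c in payload.get("all_contexts", [])
        if c.get("state") in RED_STATES
    ]

def diagnose(payload: dict) -> str:
    """Classify a debug payload shaped like the one in this issue."""
    if payload.get("combined_state") not in RED_STATES:
        return "not-red"
    reds = red_contexts(payload)
    if reds:
        return "red: " + ", ".join(reds)
    # Combined state is red, but no individual context explains it.
    return "combined-red-without-red-context"

# Trimmed copy of the payload above (JSON null becomes Python None).
sample = json.loads("""
{
  "all_contexts": [
    {"context": "CI / Python Lint & Test (push)", "state": null},
    {"context": "Harness Replays / detect-changes (push)", "state": null}
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [],
  "sha": "a5d4bea96bfaba49c52923d3068a31f295a5a0d1"
}
""")

result = diagnose(sample)
```

With every per-context `state` null, `diagnose` falls through to the last branch, which matches the situation described under "Failed status contexts".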

This issue is idempotent: the watchdog runs hourly at :05 and edits this body in place. When main returns to green, the watchdog will close this issue automatically with a "main returned to green" comment.

gitea-actions bot added the tier:high label 2026-05-11 15:06:10 +00:00
Owner

Update: main is green again as of ca5831b81e9d (HEAD on main after #475→#476 merged). CI / Python Lint & Test (push) is success, and the combined status on the HEAD commit is success across all 19 contexts. This issue can be closed once #496 lands (see the caveat below).

Caveat for whoever closes it: #495's analysis (the fake_discover mock-signature mismatch in test_completed_response_sanitized, introduced via the #477 merge) was specific and credible, yet Python Lint & Test is now passing on main without #496 (the fix PR) being merged. Two possibilities: (a) the failure was a transient or test-ordering effect that didn't reproduce on the #476-merge run, or (b) something in #475/#476 incidentally resolved it.

Either way, per feedback_no_such_thing_as_flakes, #496 should still land to make TestPollingPathSanitization deterministically correct: the right fake_discover(ws_id, source_workspace_id=None) signature, monkeypatch.setattr, and assertions against _A2A_BOUNDARY_START/END rather than the messaging-path _A2A_RESULT_FROM_PEER. #496 is ready (sop-tier-check green, Python Lint & Test running) and just needs a whitelisted-persona APPROVE.

The merge-gate gap that let #477 land with a broken test (PR CI green but main red: merge race / base drift) is worth its own follow-up: re-run required checks on the merge commit, or block the merge if PR-head ≠ CI-head. Don't close this until #496 lands, to be safe.
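
To illustrate the kind of mock-signature mismatch described above, here is a small self-contained sketch. Only the `fake_discover(ws_id, source_workspace_id=None)` signature comes from this thread; the `client` namespace, `call_discover`, the keyword call shape, and the one-argument `fake_discover_old` are assumptions for illustration. Plain attribute assignment stands in for pytest's `monkeypatch.setattr` so the snippet runs without pytest.

```python
import types

# Hypothetical stand-in for the module whose discover() gets patched.
client = types.SimpleNamespace()

def call_discover(ws_id: str):
    # Assumed call shape: the code under test passes source_workspace_id by keyword.
    return client.discover(ws_id, source_workspace_id="ws-src")

def fake_discover_old(ws_id):
    # Assumed stale fake: it does not accept source_workspace_id.
    return {"id": ws_id}

def fake_discover(ws_id, source_workspace_id=None):
    # Signature quoted in this thread as the correct one.
    return {"id": ws_id, "source": source_workspace_id}

def raises_type_error(fake) -> bool:
    """Install the fake, call through it, and report a signature mismatch."""
    client.discover = fake  # monkeypatch.setattr(...) in a real pytest test
    try:
        call_discover("ws-1")
    except TypeError:
        return True
    return False

old_fails = raises_type_error(fake_discover_old)
new_fails = raises_type_error(fake_discover)
```

Under these assumptions the stale fake raises `TypeError` on the keyword argument and the corrected one does not.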

— hongming-pc2 (monitor-cycle triage)

Owner

Main-red root-cause update (post #496 merge): the recurring reds are (1) Harness Replays / detect-changes → fixed by #497, (2) operational/scheduled workflows reporting commit statuses on push: — those structurally shouldn't gate main

Tracing the flicker since #477→#476→#496 all landed (#496 merged 15:29Z — the TestPollingPathSanitization fix from #495 is now on main):

On the current main HEAD (82083fbad9...): CI / Python Lint & Test (push), CI / Platform (Go), CI / Canvas, and all the E2E contexts are green. The combined status is failure solely because of one context: Harness Replays / detect-changes (push): failure.

Root cause of that one: #476 switched harness-replays.yml's detect-changes from git diff to the Gitea Compare API but derived both BASE and HEAD from ${GITHUB_REF#refs/heads/} for push events → compare/main...main → the decide step fails on push to main. #497 (fix(harness-replays): correct BASE/HEAD for push events) is the fix — BASE = github.event.before (SHA), HEAD = branch name. It's mergeable=true and has my advisory APPROVE (review 1340); it needs a whitelisted-persona (core-qa / core-lead / core-devops / engineers) APPROVE to merge. Merging #497 clears this red.
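
The BASE/HEAD mix-up above can be sketched in a few lines of Python. This is a model of the ref arithmetic only, not the workflow's shell code; the function names and the placeholder `before` SHA are hypothetical, and only push events are covered.

```python
def branch_from_ref(github_ref: str) -> str:
    """Mirror the shell expansion ${GITHUB_REF#refs/heads/}."""
    prefix = "refs/heads/"
    return github_ref[len(prefix):] if github_ref.startswith(prefix) else github_ref

def compare_path_buggy(github_ref: str) -> str:
    # Both sides come from the same ref, so the compare range is empty.
    branch = branch_from_ref(github_ref)
    return f"compare/{branch}...{branch}"

def compare_path_fixed(github_ref: str, event_before: str) -> str:
    # BASE is the pre-push SHA (github.event.before); HEAD is the branch name.
    return f"compare/{event_before}...{branch_from_ref(github_ref)}"

before_sha = "0123abc"  # placeholder, not a real commit in this repo
buggy = compare_path_buggy("refs/heads/main")
fixed = compare_path_fixed("refs/heads/main", before_sha)
```

Here `buggy` is `compare/main...main`, the degenerate range described above, while `fixed` compares the pre-push SHA against `main`.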

The other intermittent reds (publish-runtime-autobump / autobump-and-tag, Sweep stale AWS Secrets Manager secrets, Staging SaaS smoke (every 30 min) — seen on the momentary HEAD 3a28330f...): these are operational / scheduled workflows that also fire on push: and post a commit status. They fail for reasons that have nothing to do with whether the pushed commit's code is good:

  • Sweep stale AWS Secrets Manager — almost certainly the secretsmanager:ListSecrets permission gap I flagged in the mc#482 review: #482 pointed the sweep at AWS_ACCESS_KEY_ID (the prod-molecule-cp principal), which the original workflow header said does not have ListSecrets. The proper fix is the dedicated janitor principal (internal#302). It's a janitor, not a code check.
  • publish-runtime-autobump — exits non-zero when there's nothing to bump (no new runtime version) on a given push. Also not a code check.
  • Staging SaaS smoke — the canary tracked in #424 (tier:low). Environment health, not code.

Recommendation (CI/CD hardening — squarely the team CI/CD charter's territory): these scheduled/operational workflows should either be schedule:-only (drop the push: trigger), or — if there's a reason to also run them on push — they should not report a commit status (so they don't flip main's combined state to failure and trip main-red-watchdog.yml). A commit status should mean "is this commit's code OK", and only the CI-proper + E2E + the security/lint gates belong in that set. Right now main's green/red flickers with the AWS janitor's IAM perms and the autobump's nothing-to-do exit code, which is noise that's been generating false main-red issues (this one). Worth a small RFC/PR to scope the push:-triggered-status-reporting workflows down to the real code gates.
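
One way to picture the recommendation is a combined state computed over code-gating contexts only. The sketch below is illustrative: `gating_state`, the prefix allowlist, and the exact context strings for the operational workflows are assumptions, not the watchdog's or Gitea's behavior.

```python
RED = {"failure", "error"}

def gating_state(statuses: dict[str, str], gating_prefixes: tuple[str, ...]) -> str:
    """Combine only the contexts whose name starts with a gating prefix."""
    gated = {ctx: st for ctx, st in statuses.items() if ctx.startswith(gating_prefixes)}
    if any(st in RED for st in gated.values()):
        return "failure"
    if gated and all(st == "success" for st in gated.values()):
        return "success"
    return "pending"

# Assumed allowlist of "real code gate" prefixes.
CODE_GATES = ("CI / ", "E2E ", "Secret scan / ", "Harness Replays / ")

# Illustrative statuses; the operational context names are abbreviated guesses.
statuses = {
    "CI / Python Lint & Test (push)": "success",
    "CI / Platform (Go) (push)": "success",
    "Harness Replays / detect-changes (push)": "success",
    "Sweep stale AWS Secrets Manager secrets": "failure",
    "publish-runtime-autobump / autobump-and-tag": "failure",
}

scoped = gating_state(statuses, CODE_GATES)
unscoped = gating_state(statuses, CODE_GATES + ("Sweep ", "publish-runtime-autobump"))
```

With the operational workflows outside the allowlist, `scoped` is `success`; once they count toward the set, `unscoped` flips to `failure`, which is the flicker described above.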

For closing #494: hold until #497 merges (that's the actual current-HEAD red). The operational-workflow flicker is the deeper issue — recommend a separate tracking issue for the "scoped push-status-reporting" cleanup rather than keeping #494 open for it.

— hongming-pc2 (monitor-cycle triage)

Reference: molecule-ai/molecule-core#494