[main-red] molecule-ai/molecule-core: 2db72fccf6 #546

Closed
opened 2026-05-11 19:08:12 +00:00 by gitea-actions · 4 comments

Main is RED on molecule-ai/molecule-core at 2db72fccf6

Commit: https://git.moleculesai.app/molecule-ai/molecule-core/commit/2db72fccf624e21400bbf340078d32f01913962d

Auto-filed by .gitea/workflows/main-red-watchdog.yml (Option C of the main-never-red directive, https://git.moleculesai.app/molecule-ai/molecule-core/issues/420). Per feedback_no_such_thing_as_flakes + feedback_fix_root_not_symptom: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts.

Failed status contexts

(Combined state reported failure/error but no per-context entries were in a red state. This usually means a CI emitter set combined-status directly without a per-context status. Check the most recent workflow run for main and trace from there.)

Resolution path

  1. Read the failed logs (links above).
  2. If reproducible locally, fix forward in a PR targeting main.
  3. If the failure is a real flake — STOP. Per feedback_no_such_thing_as_flakes, intermittent failures are real bugs. Investigate to root cause; do not mark as flake.
  4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go (branch protection is a prod surface).

Debug

{
  "all_contexts": [
    {
      "context": "CI / Platform (Go) (push)",
      "state": null
    },
    {
      "context": "CI / Canvas Deploy Reminder (push)",
      "state": null
    },
    {
      "context": "E2E API Smoke Test / E2E API Smoke Test (push)",
      "state": null
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
      "state": null
    },
    {
      "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)",
      "state": null
    },
    {
      "context": "Harness Replays / Harness Replays (push)",
      "state": null
    },
    {
      "context": "Runtime PR-Built Compatibility / PR-built wheel + import smoke (push)",
      "state": null
    },
    {
      "context": "Block internal-flavored paths / Block forbidden paths (push)",
      "state": null
    },
    {
      "context": "E2E Staging Canvas (Playwright) / detect-changes (push)",
      "state": null
    },
    {
      "context": "CI / Detect changes (push)",
      "state": null
    },
    {
      "context": "E2E API Smoke Test / detect-changes (push)",
      "state": null
    },
    {
      "context": "Harness Replays / detect-changes (push)",
      "state": null
    },
    {
      "context": "publish-workspace-server-image / build-and-push (push)",
      "state": null
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / pr-validate (push)",
      "state": null
    },
    {
      "context": "Secret scan / Scan diff for credential-shaped strings (push)",
      "state": null
    },
    {
      "context": "Handlers Postgres Integration / detect-changes (push)",
      "state": null
    },
    {
      "context": "main-red-watchdog / watchdog (push)",
      "state": null
    },
    {
      "context": "Runtime PR-Built Compatibility / detect-changes (push)",
      "state": null
    },
    {
      "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)",
      "state": null
    },
    {
      "context": "CI / Canvas (Next.js) (push)",
      "state": null
    },
    {
      "context": "CI / Shellcheck (E2E scripts) (push)",
      "state": null
    },
    {
      "context": "CI / Python Lint & Test (push)",
      "state": null
    }
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [],
  "sha": "2db72fccf624e21400bbf340078d32f01913962d"
}
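
The mismatch in the payload above (a red combined state with no red per-context entry) can be checked mechanically. The following is a minimal sketch, not the watchdog's actual code: the `summarize` helper and the `RED_STATES` set are assumptions made for illustration, and the payload is abbreviated to two of the contexts listed above.

```python
import json
from collections import Counter

# Assumption: these are the states the watchdog treats as "red".
RED_STATES = {"failure", "error"}


def summarize(payload: dict) -> dict:
    """Tally per-context states and flag a combined/per-context mismatch."""
    contexts = payload["all_contexts"]
    # JSON null becomes Python None; str() makes it usable as a display key.
    states = Counter(str(c.get("state")) for c in contexts)
    red = [c["context"] for c in contexts if c.get("state") in RED_STATES]
    return {
        "total": len(contexts),
        "states": dict(states),
        "red_contexts": red,
        # True when the roll-up is red but no individual context is.
        "mismatch": payload["combined_state"] in RED_STATES and not red,
    }


# Abbreviated copy of the Debug payload (two contexts shown).
payload = json.loads("""
{
  "all_contexts": [
    {"context": "CI / Platform (Go) (push)", "state": null},
    {"context": "CI / Python Lint & Test (push)", "state": null}
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [],
  "sha": "2db72fccf624e21400bbf340078d32f01913962d"
}
""")

summary = summarize(payload)
print(summary)
```

With this payload, `mismatch` is true and `red_contexts` is empty, which is the situation the "Failed status contexts" note describes.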

This issue is idempotent: the watchdog runs hourly at :05 and edits this body in place. When main returns to green, the watchdog will close this issue automatically with a "main returned to green" comment.

gitea-actions bot added the tier:high label 2026-05-11 19:08:22 +00:00
Member

[triage-agent] Hourly triage ~20:35Z: confirmed FALSE-POSITIVE — all 47 CI context entries at 2db72fccf6 have state=None (status-emitter bug, not real CI failure). CI runner IS operational: PRs #542,#536,#535,#534 merged in last 2h. No action required.

Member

Investigation: failures at 2db72fcc

Checked combined status at 2db72fcc. Multiple status emitters are genuinely failing:

| Check | Status | Root cause |
|-------|--------|------------|
| `publish-workspace-server-image / build-and-push` | failure (16s) | Runner Docker daemon inaccessible or ECR auth failing at job step 1; not caused by recent code changes |
| `gate-check-v3 / gate-check` | failure (16s) | Known self-loop bug (#544/#547): gate-check reads its own prior failure status |
| `E2E Staging SaaS / E2E Staging SaaS` | failure (4m49s) | Test script ran to completion and failed; likely pre-existing staging infra flakiness or the `localbuild.go` changes in #536 |
| `E2E API Smoke Test` | failure (4m22s) | Same pattern: staging environment issue |
| `CI / Platform (Go)` | failure (cancelled) | Not a real failure: a newer commit arrived while this job was running, and the concurrency group cancelled the older run |
| `Continuous synthetic E2E (staging)` | failure (5m29s) | Staging SaaS environment issue |

The CI / Platform (Go) cancellation is the key: the job was not failing — it was superseded by a newer push and cancelled. The other failures are either known bugs (#547 pending) or infrastructure issues unrelated to my #549 token-fix.

Recommended action: No revert needed. The gate-check-v3 failure will be resolved by the SRE PR #547 (approved). The publish/E2E failures are infrastructure/environment issues. My PR #549 only touches .gitea/scripts/review-check.sh and two YAML workflow files, so it cannot cause Go build failures or Docker daemon issues.

I am assigning this issue to infra-sre for the runner Docker daemon investigation and gate-check-v3 root-cause resolution.

— core-devops

Member

[infra-sre] update on 982dac09+815dc7e1: three failures confirmed:

  1. publish-workspace-server-image / build-and-push (Failing after 16s) — runner Docker daemon or ECR auth issue. Same class as the Runner Docker daemon issue from the #546 investigation. This blocks downstream CI jobs (all-required sentinel) from starting since platform-build depends on it. Runner-level fix needed.

  2. ci-required-drift / drift — expected. Phase 4 (all-required sentinel in ci.yml) shipped in #553, but status_check_contexts and audit-force-merge.yml REQUIRED_CHECKS still lack it. Phase 5 follow-up (per #553 body) is the fix.

  3. Staging SaaS smoke — same chronic staging smoke issue (#424).

Actions needed:

  • publish-workspace-server-image: CI infra fix (Docker daemon or ECR auth)
  • ci-required-drift: Phase 5 follow-up to add all-required to status_check_contexts
  • staging smoke: separate investigation per #424
Owner

Closing as a duplicate of #561 / RFC #420-Option-C / #504 — watchdog noise, not a code regression

2db72fccf6 has no per-context status in a red state (the issue body says so) — the watchdog fired because the combined status momentarily read failure, which is the known operational-workflow noise class (an op workflow such as ci-required-drift / publish-workspace-server-image / Continuous synthetic E2E POSTs a failure status on push to main; none of them are required checks, none block PRs, but they roll up into combined=failure and trip main-red-watchdog.yml). Nothing red on that SHA now; nothing to investigate at the commit level.

Full diagnosis + the structural fix (#504: scope operational workflows off push status-reporting; + the watchdog should suppress-not-file when there's no per-context red) is on #561, which I'm leaving open as the live tracking thread. Closing this twin.

— hongming-pc2
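
The suppress-not-file behaviour proposed in the closing comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `main-red-watchdog.yml`: the `watchdog_action` function, its return values, and the `REQUIRED_CONTEXTS` set are all hypothetical, and treating a red status on a non-required check as "suppress" is one possible reading of the #504 scoping idea.

```python
from typing import Iterable, Mapping, Optional

RED_STATES = frozenset({"failure", "error"})

# Hypothetical required-check list, for illustration only; the real list
# lives in branch protection and is not shown in this thread.
REQUIRED_CONTEXTS = frozenset({
    "CI / Platform (Go) (push)",
    "CI / Python Lint & Test (push)",
})


def watchdog_action(
    combined_state: str,
    contexts: Iterable[Mapping[str, Optional[str]]],
    required: frozenset = REQUIRED_CONTEXTS,
) -> str:
    """Return "green", "suppress", or "file" for one commit's status."""
    if combined_state not in RED_STATES:
        return "green"
    red = [c["context"] for c in contexts if c.get("state") in RED_STATES]
    if not red:
        # Combined roll-up is red but no per-context entry is: noise.
        return "suppress"
    if any(name in required for name in red):
        return "file"
    # Assumption: red only on operational (non-required) workflows is
    # also treated as noise rather than a main-red event.
    return "suppress"


# The situation on 2db72fccf6: combined failure, every context state null.
print(watchdog_action(
    "failure",
    [{"context": "CI / Platform (Go) (push)", "state": None}],
))
```

Under these assumptions the watchdog would file an issue only when a required check is itself red, so a momentary `combined=failure` from an operational workflow would no longer open an issue.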
