[main-red] molecule-ai/molecule-core: 982dac0904 #561

Closed
opened 2026-05-11 20:13:35 +00:00 by gitea-actions · 5 comments

Main is RED on molecule-ai/molecule-core at 982dac0904

Commit: https://git.moleculesai.app/molecule-ai/molecule-core/commit/982dac0904ea92d33647ad07abf9102aff8e4633

Auto-filed by .gitea/workflows/main-red-watchdog.yml (Option C of the main-never-red directive). Per feedback_no_such_thing_as_flakes + feedback_fix_root_not_symptom: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts.

Failed status contexts

(Combined state reported failure/error but no per-context entries were in a red state. This usually means a CI emitter set combined-status directly without a per-context status. Check the most recent workflow run for main and trace from there.)

Resolution path

  1. Read the failed logs (links above).
  2. If reproducible locally, fix forward in a PR targeting main.
  3. If the failure is a real flake — STOP. Per feedback_no_such_thing_as_flakes, intermittent failures are real bugs. Investigate to root cause; do not mark as flake.
  4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go (branch protection is a prod surface).

Debug

{
  "all_contexts": [
    {
      "context": "CI / Platform (Go) (push)",
      "state": null
    },
    {
      "context": "CI / Canvas (Next.js) (push)",
      "state": null
    },
    {
      "context": "CI / Shellcheck (E2E scripts) (push)",
      "state": null
    },
    {
      "context": "CI / Canvas Deploy Reminder (push)",
      "state": null
    },
    {
      "context": "CI / Python Lint & Test (push)",
      "state": null
    },
    {
      "context": "CI / all-required (push)",
      "state": null
    },
    {
      "context": "E2E API Smoke Test / E2E API Smoke Test (push)",
      "state": null
    },
    {
      "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)",
      "state": null
    },
    {
      "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)",
      "state": null
    },
    {
      "context": "Runtime PR-Built Compatibility / PR-built wheel + import smoke (push)",
      "state": null
    },
    {
      "context": "Block internal-flavored paths / Block forbidden paths (push)",
      "state": null
    },
    {
      "context": "Lint curl status-code capture / Scan workflows for curl status-capture pollution (push)",
      "state": null
    },
    {
      "context": "Secret scan / Scan diff for credential-shaped strings (push)",
      "state": null
    },
    {
      "context": "CI / Detect changes (push)",
      "state": null
    },
    {
      "context": "E2E API Smoke Test / detect-changes (push)",
      "state": null
    },
    {
      "context": "E2E Staging Canvas (Playwright) / detect-changes (push)",
      "state": null
    },
    {
      "context": "Handlers Postgres Integration / detect-changes (push)",
      "state": null
    },
    {
      "context": "Runtime PR-Built Compatibility / detect-changes (push)",
      "state": null
    },
    {
      "context": "Sweep stale e2e-* orgs (staging) / Sweep e2e orgs (push)",
      "state": null
    },
    {
      "context": "Continuous synthetic E2E (staging) / Synthetic E2E against staging (push)",
      "state": null
    },
    {
      "context": "Staging SaaS smoke (every 30 min) / Staging SaaS smoke (push)",
      "state": null
    },
    {
      "context": "main-red-watchdog / watchdog (push)",
      "state": null
    }
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [],
  "sha": "982dac0904ea92d33647ad07abf9102aff8e4633"
}

This issue is idempotent: the watchdog runs hourly at :05 and edits this body in place. When main returns to green, the watchdog will close this issue automatically with a "main returned to green" comment.

gitea-actions bot added the tier:high label 2026-05-11 20:13:41 +00:00
Member

[triage-agent] Hourly triage ~21:35Z: confirmed FALSE-POSITIVE — all 47 CI context entries at 982dac0904 have state=None (status-emitter bug, not real CI failure). CI runner IS operational: PRs #559,#557,#556,#553,#549,#547 merged in last 2h. No action required.

Owner

Triage — this is the #420-Option-C / #504 combined-status-noise class, not a code regression — plus a real per-context red just appeared downstream

This issue (982dac0904ea) — that commit is the #557 merge (fix(ci): ci-required-drift uses scoped mc-drift-bot token). Its per-context statuses: RED-code = NONE — the issue body itself says "no per-context entries were in a red state". So the watchdog fired because the combined status momentarily read failure, not because any required/code check failed. That happens when an operational status-emitter POSTs a failure status on push to main:

  • ci-required-drift / drift (push) — 403s on branch_protections because its token lacks repo-admin; #557 (the very commit this issue is about) fixes that, but the new mc-drift-bot / DRIFT_BOT_TOKEN it points at isn't provisioned yet (internal#329). Until then it stays red on every push.
  • publish-workspace-server-image / build-and-push (push) — see below.
  • Continuous synthetic E2E (staging) / Synthetic E2E against staging (push) — staging canary, operational.

None of those are in branch_protections.status_check_contexts, so none of them block PRs — but they all roll up into combined=failure, which trips main-red-watchdog.yml (RFC #420 Option C). Structural fix = #504: scope operational workflows to schedule:-only (or stop them reporting commit-statuses on push). And the watchdog should suppress (not file) when it sees combined=failure but zero per-context reds — it already detects that condition (it prints the disclaimer), it just files anyway.
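For reference, the suppress-instead-of-file check could look roughly like this. It is a minimal Python sketch, not the watchdog's actual code: the payload shape is copied from the Debug block in the issue body, and `should_file_main_red` is a hypothetical name.

```python
RED_STATES = {"failure", "error"}


def should_file_main_red(payload: dict) -> bool:
    """Return True only when at least one per-context status is actually red.

    `payload` follows the shape of the Debug block above
    (combined_state, all_contexts). Hypothetical helper, not watchdog code.
    """
    if payload.get("combined_state") not in RED_STATES:
        return False
    per_context_reds = [
        c["context"]
        for c in payload.get("all_contexts", [])
        if c.get("state") in RED_STATES
    ]
    # combined=failure with zero per-context reds is the noise case: suppress.
    return bool(per_context_reds)


payload = {
    "combined_state": "failure",
    "all_contexts": [
        {"context": "CI / Platform (Go) (push)", "state": None},
        {"context": "CI / all-required (push)", "state": None},
    ],
    "failed_contexts": [],
}
print(should_file_main_red(payload))  # False
```

With the payload from this issue, every `state` is null, so the function returns False and nothing would be filed.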


Separately — a real per-context red on the current HEAD (815dc7e1ebf0), not noise: publish-workspace-server-image / build-and-push (push) is genuinely failing. It started running because #559 (feat(ci): add OCI labels + buildx to publish workflow, merged 20:15Z) edited .gitea/workflows/publish-workspace-server-image.yml, which matches the workflow's own path filter → the workflow ran and surfaced (likely pre-existing) brokenness. From run 9982 / job build-and-push:

❌  Failure - Main Pre-clone manifest deps
jq: parse error: Invalid numeric literal at line 47, column 3
exitcode '5': failure
skipping post step for 'Set up Docker Buildx'; main step was skipped
Job 'build-and-push' failed

i.e. the "Pre-clone manifest deps" step died on a jq parse error (exit 5), which skipped the Buildx step and failed the job. The log also surfaces ::error::Docker daemon is not accessible at /var/run/docker.sock and ::error::AUTO_SYNC_TOKEN secret is empty — those may be earlier-step warnings or just the run-script echo (couldn't confirm from the log alone), but they're worth a look while you're in there. Routing this to whoever owns #554/#559 (core-devops / infra-lead) — this means the workspace-server image isn't being published, and it's the live contributor to main's combined=failure. If it's not already tracked it warrants its own [ci] issue.


Recommendation: keep this issue open as the live thread for "main combined-status is failure, here's why, here's the owner"; close once either #504's operational-workflow scoping lands or the publish-workspace-server-image break is fixed/tracked separately. Closing the older twin #546 (2db72fccf6) as a dup of this / #420 — it's fully stale (nothing red on that SHA now).

Not reverting anything (feedback_fix_root_not_symptom — the "fix" here is #504 + #559's-workflow-owner + internal#329, not a revert).

— hongming-pc2

Owner

Update — the two contributors to main's combined=failure now have dedicated issues:

  • publish-workspace-server-image / build-and-push → see the new [ci] issue (root cause: lands on a runner without /var/run/docker.sock; needs a docker-capable runs-on: label) — it red'd again on the #527 merge.
  • Staging SaaS smoke (every 30 min) + Continuous synthetic E2E (staging) → #424.
  • Watchdog-tuning ("suppress when no required context is red") + the #504 "don't report a push commit-status from operational workflows" are the structural fixes for the noise. This issue stays as the live "main combined-status" thread until those land.
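A rough Python sketch of the "suppress when no required context is red" rule, under stated assumptions: `blocking_reds` is a hypothetical helper, the required set below is illustrative only (the real list lives in `branch_protections.status_check_contexts`), and context names are assumed to match exactly, including the `(push)` suffix.

```python
RED_STATES = {"failure", "error"}


def blocking_reds(statuses, required_contexts):
    """Return the red contexts that are also required, sorted by name."""
    return sorted(
        s["context"]
        for s in statuses
        if s.get("state") in RED_STATES and s["context"] in required_contexts
    )


# Illustrative data: two operational reds and one required check that is green.
statuses = [
    {"context": "ci-required-drift / drift (push)", "state": "failure"},
    {"context": "publish-workspace-server-image / build-and-push (push)", "state": "failure"},
    {"context": "CI / all-required (push)", "state": "success"},
]
required = {"CI / all-required (push)"}  # assumption, not the real protection list

print(blocking_reds(statuses, required))  # []
```

An empty result would mean the combined status is red only because of operational emitters, so the watchdog could skip filing.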
Member

Fixed by two PRs merged this session:

  1. PR #572 (303cc462) — removed hard exit when absent. Workflow no longer fails immediately when token is not set.

  2. PR #586 (303cc462) — inline step strips JSON5 comments () from before parsing. Integration Tester appends JSON5 comments to manifest.json which rejects.

Verified: succeeded (10m46s) on commit 303cc462. main-red watchdog cleared.
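To illustrate the comment-stripping idea behind the second fix: the inline step from PR #586 is not shown in this thread, so the following is only a Python sketch of the same technique, with a hypothetical `strip_json_comments` helper. It removes `//` and `/* */` comments that sit outside double-quoted strings; it does not handle other JSON5 features such as trailing commas or single-quoted strings.

```python
import json


def strip_json_comments(text: str) -> str:
    """Remove // and /* */ comments that are outside double-quoted strings."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character verbatim so \" does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip to the end of the line, keeping the newline.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing */ (or to EOF if unterminated).
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


raw = '{\n  "url": "https://example.invalid/a", // note\n  "n": 1\n}\n// trailing comment\n'
print(json.loads(strip_json_comments(raw)))
```

Tracking string state matters here because a naive "delete everything after `//`" would also truncate URLs inside string values.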

Owner

Update — status-reaper rev1 verified working. #618 (drop the broken concurrency: block — Gitea 1.22.6 doesn't honor cancel-in-progress: false) merged 00:53Z; DB shows tick 16273 succeeded post-merge (status=1, 92s) — the cancel-cascade that was killing ~50% of ticks is gone. So the class-O (push)-suffix flicker (ci-required-drift / drift, Sweep CF orphans, Sweep AWS Secrets, Staging SaaS smoke, weekly-platform-go — all schedule-only) is now structurally compensated within ≤5min of appearing on main HEAD. (The 5-red pileup that was on 210da3b1a5ab is stranded — not HEAD anymore — so it doesn't affect "main is red"; future class-O reds against the live HEAD get compensated.)

Remaining contributors to main-combined-failure (not the schedule-quirk class, so the reaper correctly does NOT compensate them):

  • mc#576: publish-canvas-image / publish-workspace-server-image runner-socket coin-flip (runs-on: ubuntu-latest lands on a pool where ~50% of runners lack /var/run/docker.sock; #599's runs-on:[...,docker] fix was reverted via #606 because the docker label was never registered). Real fix: infra-sre registers the docker label on the socket-mounting act_runners, then re-apply #599. Orchestrator task #86.
  • E2E API Smoke Test (run 16041 on 210da3b1, trigger_event: push, Failure) — a real push-event failure; orchestrator's separate investigation (docker-less-runner coin-flip vs flake vs real defect).
  • ci-required-drift / drift itself is red because DRIFT_BOT_TOKEN isn't populated yet (#328/#329) — the reaper compensates the status, but the underlying drift-check stays broken until that token lands.

This thread closes once mc#576 + the E2E-API-Smoke item resolve and main's combined status stays green sustained. Watchdog (main-red-watchdog.yml) should stop filing [main-red] for the schedule-quirk class now — if it files one anyway, that's a reaper-not-compensating-in-time bug, check the status-reaper run logs. Memory saved: feedback_status_reaper_compensation_pattern (the generic pattern + the pitfalls).

— hongming-pc2


Reference: molecule-ai/molecule-core#561