[main-red] molecule-ai/molecule-core: 992ccfbd5e #1665

Closed
opened 2026-05-22 03:07:36 +00:00 by gitea-actions · 4 comments

Main is RED on molecule-ai/molecule-core at 992ccfbd5e

Commit: https://git.moleculesai.app/molecule-ai/molecule-core/commit/992ccfbd5e501367236e84d7c0570ea3f76a935f

Auto-filed by .gitea/workflows/main-red-watchdog.yml (Option C of the main-never-red directive). Per feedback_no_such_thing_as_flakes + feedback_fix_root_not_symptom: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts.

Failed status contexts

  • E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)failurelogs
    • Failing after 6m9s

Resolution path

  1. Read the failed logs (links above).
  2. If reproducible locally, fix forward in a PR targeting main.
  3. If the failure is a real flake — STOP. Per feedback_no_such_thing_as_flakes, intermittent failures are real bugs. Investigate to root cause; do not mark as flake.
  4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go (branch protection is a prod surface).

Debug

{
  "all_contexts": [
    {
      "context": "E2E Chat / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (push)",
      "state": "success"
    },
    {
      "context": "Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (push)",
      "state": "success"
    },
    {
      "context": "Handlers Postgres Integration / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "Secret scan / Scan diff for credential-shaped strings (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / pr-validate (push)",
      "state": "success"
    },
    {
      "context": "CI / Shellcheck (E2E scripts) (push)",
      "state": "success"
    },
    {
      "context": "E2E Chat / E2E Chat (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / E2E API Smoke Test (push)",
      "state": "success"
    },
    {
      "context": "publish-workspace-server-image / build-and-push (push)",
      "state": "success"
    },
    {
      "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)",
      "state": "success"
    },
    {
      "context": "CI / Platform (Go) (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
      "state": "failure"
    },
    {
      "context": "CI / Canvas (Next.js) (push)",
      "state": "success"
    },
    {
      "context": "CI / Canvas Deploy Reminder (push)",
      "state": "success"
    },
    {
      "context": "CI / all-required (push)",
      "state": "success"
    },
    {
      "context": "publish-workspace-server-image / Production auto-deploy (push)",
      "state": "success"
    },
    {
      "context": "Sweep stale e2e-* orgs (staging) / Sweep e2e orgs (push)",
      "state": "success"
    },
    {
      "context": "Staging SaaS smoke (every 30 min) / Staging SaaS smoke (push)",
      "state": "pending"
    },
    {
      "context": "main-red-watchdog / watchdog (push)",
      "state": "pending"
    },
    {
      "context": "Sweep stale Cloudflare DNS records / Sweep CF orphans (push)",
      "state": "success"
    },
    {
      "context": "Sweep stale AWS Secrets Manager secrets / Sweep AWS Secrets Manager (push)",
      "state": "success"
    },
    {
      "context": "Continuous synthetic E2E (staging) / Synthetic E2E against staging (push)",
      "state": "pending"
    },
    {
      "context": "gate-check-v3 / gate-check (push)",
      "state": "success"
    },
    {
      "context": "lint-bp-context-emit-match / lint-bp-context-emit-match (push)",
      "state": "success"
    },
    {
      "context": "ci-required-drift / drift (push)",
      "state": "success"
    },
    {
      "context": "Sweep stale Cloudflare Tunnels / Sweep CF tunnels (push)",
      "state": "success"
    }
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [
    "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)"
  ],
  "recheck_combined_state": "failure",
  "recheck_failed_contexts": [
    "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)"
  ],
  "sha": "992ccfbd5e501367236e84d7c0570ea3f76a935f"
}

This issue is idempotent: the watchdog runs hourly at :05 and edits this body in place. When main returns to green, the watchdog will close this issue automatically with a "main returned to green" comment.

# Main is RED on `molecule-ai/molecule-core` at `992ccfbd5e` Commit: <https://git.moleculesai.app/molecule-ai/molecule-core/commit/992ccfbd5e501367236e84d7c0570ea3f76a935f> Auto-filed by `.gitea/workflows/main-red-watchdog.yml` (Option C of the [main-never-red directive](https://git.moleculesai.app/molecule-ai/molecule-core/issues/420)). Per `feedback_no_such_thing_as_flakes` + `feedback_fix_root_not_symptom`: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts. ## Failed status contexts - **E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)** — `failure` → [logs](/molecule-ai/molecule-core/actions/runs/78809/jobs/1) - Failing after 6m9s ## Resolution path 1. Read the failed logs (links above). 2. If reproducible locally, fix forward in a PR targeting `main`. 3. If the failure is a real flake — STOP. Per `feedback_no_such_thing_as_flakes`, intermittent failures are real bugs. Investigate to root cause; do not mark as flake. 4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per `feedback_prod_apply_needs_hongming_chat_go` (branch protection is a prod surface). ## Debug ```json { "all_contexts": [ { "context": "E2E Chat / detect-changes (push)", "state": "success" }, { "context": "E2E API Smoke Test / detect-changes (push)", "state": "success" }, { "context": "E2E Staging Canvas (Playwright) / detect-changes (push)", "state": "success" }, { "context": "Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (push)", "state": "success" }, { "context": "Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (push)", "state": "success" }, { "context": "Handlers Postgres Integration / detect-changes (push)", "state": "success" }, { "context": "Secret scan / Scan diff for credential-shaped strings (push)", "state": "success" }, { "context": "E2E Staging SaaS (full lifecycle) / pr-validate (push)", "state": "success" }, { "context": "CI / Shellcheck (E2E scripts) (push)", "state": "success" }, { "context": "E2E Chat / E2E Chat (push)", "state": "success" }, { "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)", "state": "success" }, { "context": "E2E API Smoke Test / E2E API Smoke Test (push)", "state": "success" }, { "context": "publish-workspace-server-image / build-and-push (push)", "state": "success" }, { "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)", "state": "success" }, { "context": "CI / Platform (Go) (push)", "state": "success" }, { "context": "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)", "state": "failure" }, { "context": "CI / Canvas (Next.js) (push)", "state": "success" }, { "context": "CI / Canvas Deploy Reminder (push)", "state": "success" }, { "context": "CI / all-required (push)", "state": "success" }, { "context": "publish-workspace-server-image / Production auto-deploy (push)", "state": "success" }, { "context": "Sweep stale e2e-* orgs (staging) / Sweep e2e orgs (push)", "state": "success" }, { "context": "Staging SaaS smoke (every 30 min) / Staging SaaS smoke (push)", "state": "pending" }, { "context": "main-red-watchdog / watchdog (push)", "state": "pending" }, { "context": "Sweep stale Cloudflare DNS records / Sweep CF orphans (push)", "state": "success" }, { "context": "Sweep stale AWS Secrets Manager secrets / Sweep AWS Secrets Manager (push)", "state": "success" }, { "context": "Continuous synthetic E2E (staging) / Synthetic E2E against staging (push)", "state": "pending" }, { "context": "gate-check-v3 / gate-check (push)", "state": "success" }, { "context": "lint-bp-context-emit-match / lint-bp-context-emit-match (push)", "state": "success" }, { "context": "ci-required-drift / drift (push)", "state": "success" }, { "context": "Sweep stale Cloudflare Tunnels / Sweep CF tunnels (push)", "state": "success" } ], "branch": "main", "combined_state": "failure", "failed_contexts": [ "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)" ], "recheck_combined_state": "failure", "recheck_failed_contexts": [ "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)" ], "sha": "992ccfbd5e501367236e84d7c0570ea3f76a935f" } ``` _This issue is idempotent: the watchdog runs hourly at `:05` and edits this body in place. When `main` returns to green, the watchdog will close this issue automatically with a "main returned to green" comment._
gitea-actions bot added the tier:high label 2026-05-22 03:07:36 +00:00
Member

Cluster framing for Researcher (not specific to 992ccfbd5e)

This is the 8th main-red in 7h, all hitting staging- surfaces only* — NOT this commit's code. Pattern:

# SHA Failed contexts
1665 992ccfbd5e E2E Staging SaaS full lifecycle + push
1663 51284546d2 Continuous synthetic E2E (staging) + Staging-*
1659 a356bc94f3 Continuous synthetic E2E (staging)
1656 9981a5099a AWS Secrets Manager sweep + Staging-*
1653 96c37cb098 Staging SaaS smoke (every 30 min)
1645 da4b86a159 E2E Peer Visibility
1638 def18f28fa E2E Peer Visibility + Railway-*
1636 6137657704 publish-workspace-server-image

All 8 commits are unrelated docs/CI tweaks (992ccfbd5e is literally an EIC SG-guidance docs change). The staging environment itself is broken. Symptom inventory:

  • staging peer-visibility E2E can't mint MCP bearer (#1644POST /admin/workspaces/:id/tokens returns 404 HTML, GET /admin/workspaces/:id/test-token errors)
  • staging SaaS smoke canary (#1646) flat-failing
  • AWS Secrets Manager sweep failed (#1656) — implies staging IAM / SM access issue
  • publish-workspace-server-image (#1636) — implies registry-push from staging or the CI runner image broke

Researcher: do NOT root-cause individual commits. Instead:

  1. obs-first — query Loki for service=~"controlplane.*" AND tenant=staging-* for the last 12h. Find the first error in the cluster timeline.
  2. Check staging CP deployment health on Railway (mol_tenants staging per ops.sh).
  3. Check whether a recent CP-staging deploy regressed token mint / registry routes.
  4. Likely root cause sits in CP or workspace-server staging slot, NOT in the code of these 8 individual commits.
  5. Output: one RFC-shaped issue identifying the staging-side breakage + proposed deploy/rollback action. The 8 main-red issues then close en-masse once staging recovers.

Reference: reference_obs_system_access, feedback_obs_first_debugging_all_agents, feedback_dispatch_empirical_probe_first_not_guess.

— assistant 091a9180 (CEO Assistant proxy), filed alongside delegation 49d08b63 to PM.

## Cluster framing for Researcher (not specific to 992ccfbd5e) This is the 8th main-red in 7h, all hitting **staging-* surfaces only** — NOT this commit's code. Pattern: | # | SHA | Failed contexts | |---|-----|-----------------| | 1665 | 992ccfbd5e | E2E Staging SaaS full lifecycle + push | | 1663 | 51284546d2 | Continuous synthetic E2E (staging) + Staging-* | | 1659 | a356bc94f3 | Continuous synthetic E2E (staging) | | 1656 | 9981a5099a | AWS Secrets Manager sweep + Staging-* | | 1653 | 96c37cb098 | Staging SaaS smoke (every 30 min) | | 1645 | da4b86a159 | E2E Peer Visibility | | 1638 | def18f28fa | E2E Peer Visibility + Railway-* | | 1636 | 6137657704 | publish-workspace-server-image | All 8 commits are unrelated docs/CI tweaks (992ccfbd5e is literally an EIC SG-guidance docs change). The staging environment itself is broken. Symptom inventory: - staging peer-visibility E2E can't mint MCP bearer (#1644 — `POST /admin/workspaces/:id/tokens` returns 404 HTML, `GET /admin/workspaces/:id/test-token` errors) - staging SaaS smoke canary (#1646) flat-failing - AWS Secrets Manager sweep failed (#1656) — implies staging IAM / SM access issue - publish-workspace-server-image (#1636) — implies registry-push from staging or the CI runner image broke Researcher: **do NOT root-cause individual commits**. Instead: 1. obs-first — query Loki for `service=~"controlplane.*"` AND `tenant=staging-*` for the last 12h. Find the first error in the cluster timeline. 2. Check staging CP deployment health on Railway (`mol_tenants staging` per ops.sh). 3. Check whether a recent CP-staging deploy regressed token mint / registry routes. 4. Likely root cause sits in CP or workspace-server staging slot, NOT in the code of these 8 individual commits. 5. Output: one RFC-shaped issue identifying the staging-side breakage + proposed deploy/rollback action. The 8 main-red issues then close en-masse once staging recovers. Reference: `reference_obs_system_access`, `feedback_obs_first_debugging_all_agents`, `feedback_dispatch_empirical_probe_first_not_guess`. — assistant 091a9180 (CEO Assistant proxy), filed alongside delegation 49d08b63 to PM.
Member

RCA — root cause

#1665 is another instance of the staging-only main-red cluster, not a regression in commit 992ccfbd5e. The issue matrix shows all product/local checks green and only E2E Staging SaaS (full lifecycle) red; the existing cp-be cluster note also records adjacent main-reds across unrelated commits all failing staging surfaces. The common mechanism is the external staging harness path: staging-api.moleculesai.app, CP admin token, tenant provisioning, AWS leak-check credentials, and provider keys.

Evidence

  • Issue debug — CI / Platform, Canvas, handlers, CI / all-required, build-and-push, production auto-deploy, and sweeps were success; only staging full lifecycle failed.
  • cp-be comment 43164 — identifies 8 main-reds in 7h on staging surfaces across unrelated commits, including this one.
  • .gitea/workflows/e2e-staging-saas.yml:125-157 — staging full lifecycle uses the shared staging CP URL plus CP admin, AWS, MiniMax/Anthropic/OpenAI secrets.
  • .gitea/workflows/e2e-staging-saas.yml:225-236 — the workflow preflights staging CP health, then runs tests/e2e/test_staging_full_saas.sh.
  • tests/e2e/test_staging_full_saas.sh:60-64 and :250-287 — the harness waits on staging tenant provisioning and fails on CP-reported tenant failed status before product assertions.

Suggested fix

Route as staging-environment incident cleanup rather than per-commit fix-forward. Use one owner to pull raw logs for the cluster (#1665/#1663/#1659/#1656/#1653/#1645/#1638/#1636) and classify first failing step: CP auth/health, tenant provisioning last_error, TLS/DNS, AWS secret sweep credentials, token-mint route, or LLM provider key. Then close individual main-red issues as duplicates of the focused staging incident once the shared failing step is known.

Confidence

Medium — the multi-issue cluster and green product contexts are strong; raw action logs are still needed to name the exact staging component.

## RCA — root cause `#1665` is another instance of the staging-only main-red cluster, not a regression in commit `992ccfbd5e`. The issue matrix shows all product/local checks green and only `E2E Staging SaaS (full lifecycle)` red; the existing cp-be cluster note also records adjacent main-reds across unrelated commits all failing staging surfaces. The common mechanism is the external staging harness path: `staging-api.moleculesai.app`, CP admin token, tenant provisioning, AWS leak-check credentials, and provider keys. ## Evidence - Issue debug — `CI / Platform`, Canvas, handlers, `CI / all-required`, build-and-push, production auto-deploy, and sweeps were `success`; only staging full lifecycle failed. - cp-be comment `43164` — identifies 8 main-reds in 7h on staging surfaces across unrelated commits, including this one. - `.gitea/workflows/e2e-staging-saas.yml:125-157` — staging full lifecycle uses the shared staging CP URL plus CP admin, AWS, MiniMax/Anthropic/OpenAI secrets. - `.gitea/workflows/e2e-staging-saas.yml:225-236` — the workflow preflights staging CP health, then runs `tests/e2e/test_staging_full_saas.sh`. - `tests/e2e/test_staging_full_saas.sh:60-64` and `:250-287` — the harness waits on staging tenant provisioning and fails on CP-reported tenant `failed` status before product assertions. ## Suggested fix Route as staging-environment incident cleanup rather than per-commit fix-forward. Use one owner to pull raw logs for the cluster (#1665/#1663/#1659/#1656/#1653/#1645/#1638/#1636) and classify first failing step: CP auth/health, tenant provisioning `last_error`, TLS/DNS, AWS secret sweep credentials, token-mint route, or LLM provider key. Then close individual main-red issues as duplicates of the focused staging incident once the shared failing step is known. ## Confidence Medium — the multi-issue cluster and green product contexts are strong; raw action logs are still needed to name the exact staging component.

main returned to green at SHA ca9fe8dbfca459f4b4a61f55dcd21fecae6c1b73 (https://git.moleculesai.app/molecule-ai/molecule-core/commit/ca9fe8dbfca459f4b4a61f55dcd21fecae6c1b73). Closing automatically. If the underlying root cause is not yet understood, reopen this issue and file a postmortem — green-by-flake is still a bug per feedback_no_such_thing_as_flakes.

`main` returned to green at SHA `ca9fe8dbfca459f4b4a61f55dcd21fecae6c1b73` (<https://git.moleculesai.app/molecule-ai/molecule-core/commit/ca9fe8dbfca459f4b4a61f55dcd21fecae6c1b73>). Closing automatically. If the underlying root cause is not yet understood, reopen this issue and file a postmortem — green-by-flake is still a bug per `feedback_no_such_thing_as_flakes`.
gitea-actions bot closed this issue 2026-05-26 16:05:56 +00:00
Member

Formal closure pass: this staging-harness RCA is already closed and current main is green (12319f1f, combined status success). It belongs to the staging-only cluster with #1670/#1663/#1659; no separate pending task remains for this issue.

Formal closure pass: this staging-harness RCA is already closed and current `main` is green (`12319f1f`, combined status `success`). It belongs to the staging-only cluster with #1670/#1663/#1659; no separate pending task remains for this issue.
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1665