[main-red] molecule-ai/molecule-core: 0dd269e80e #2747

Closed
opened 2026-06-13 11:07:15 +00:00 by gitea-actions · 5 comments

Main is RED on molecule-ai/molecule-core at 0dd269e80e

Commit: https://git.moleculesai.app/molecule-ai/molecule-core/commit/0dd269e80edc0dee69571255d8d78ccccb7cabb6

Auto-filed by .gitea/workflows/main-red-watchdog.yml (Option C of the main-never-red directive, molecule-core issue #420). Per feedback_no_such_thing_as_flakes + feedback_fix_root_not_symptom: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts.

Failed status contexts

  • E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push): failure (logs: run 358888, job 488230)
    • Failing after 7m37s
  • E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push): failure (logs: run 358888, job 488231)
    • Failing after 7m57s

Resolution path

  1. Read the failed logs (links above).
  2. If reproducible locally, fix forward in a PR targeting main.
  3. If the failure is a real flake — STOP. Per feedback_no_such_thing_as_flakes, intermittent failures are real bugs. Investigate to root cause; do not mark as flake.
  4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go (branch protection is a prod surface).

Debug

{
  "all_contexts": [
    {
      "context": "Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (push)",
      "state": "success"
    },
    {
      "context": "Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (push)",
      "state": "success"
    },
    {
      "context": "E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "CI / Detect changes (push)",
      "state": "success"
    },
    {
      "context": "Secret scan / Scan diff for credential-shaped strings (push)",
      "state": "success"
    },
    {
      "context": "lint-no-coe-on-required / lint-no-coe-on-required (push)",
      "state": "success"
    },
    {
      "context": "CI / Platform (Go) (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "CI / Canvas (Next.js) (push)",
      "state": "success"
    },
    {
      "context": "CI / Canvas Deploy Status (push)",
      "state": "success"
    },
    {
      "context": "E2E Chat / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)",
      "state": "success"
    },
    {
      "context": "E2E Chat / E2E Chat (push)",
      "state": "success"
    },
    {
      "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (push)",
      "state": "success"
    },
    {
      "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)",
      "state": "success"
    },
    {
      "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push)",
      "state": "success"
    },
    {
      "context": "CI / Shellcheck (E2E scripts) (push)",
      "state": "success"
    },
    {
      "context": "CI / all-required (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / E2E API Smoke Test (push)",
      "state": "success"
    },
    {
      "context": "publish-workspace-server-image / build-and-push (push)",
      "state": "success"
    },
    {
      "context": "publish-workspace-server-image / Production auto-deploy (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)",
      "state": "pending"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / pr-validate (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
      "state": "failure"
    },
    {
      "context": "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)",
      "state": "failure"
    }
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [
    "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)"
  ],
  "recheck_combined_state": "failure",
  "recheck_failed_contexts": [
    "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)",
    "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)"
  ],
  "sha": "0dd269e80edc0dee69571255d8d78ccccb7cabb6"
}

This issue is idempotent: the watchdog runs hourly at :05 and edits this body in place. When main returns to green, the watchdog will close this issue automatically with a "main returned to green" comment.
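As a reading aid, the Debug payload can be summarized with a few lines of Python. This is a minimal sketch, not the watchdog's own code: the helper name `group_by_state` is hypothetical, and the payload below is a trimmed copy that keeps only four of the contexts listed above.

```python
import json

# Trimmed copy of the Debug payload above (four contexts kept for brevity).
payload = json.loads("""
{
  "all_contexts": [
    {"context": "CI / all-required (push)", "state": "success"},
    {"context": "E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push)", "state": "pending"},
    {"context": "E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push)", "state": "failure"},
    {"context": "E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push)", "state": "failure"}
  ],
  "combined_state": "failure"
}
""")


def group_by_state(contexts):
    """Return {state: [context names]}, preserving payload order (hypothetical helper)."""
    grouped = {}
    for entry in contexts:
        grouped.setdefault(entry["state"], []).append(entry["context"])
    return grouped


grouped = group_by_state(payload["all_contexts"])
failed = grouped.get("failure", [])
pending = grouped.get("pending", [])

print(f"combined_state={payload['combined_state']} failed={len(failed)} pending={len(pending)}")
for name in failed:
    print("FAIL", name)
```

Grouping by state also surfaces the one `pending` context (Concierge Creates Workspace), which the `failed_contexts` list alone does not show.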

Member

MECHANISM: This main-red is a three-lane staging SaaS cluster on run 358888, not one shared code path. Platform Boot reaches a routable parent and direct A2A returns PONG, then the known-answer path enters a2a_send_or_poll_queue and times out after a durable queue insert never drains to a terminal state (tests/e2e/test_staging_full_saas.sh:1097-1170). Full SaaS is a separate executor-time failure: the parent A2A call returns a terminal JSON-RPC result.message with Agent error (_ResultError), so the queue already drained and the runtime executor caught an SDK/LLM exception (molecule_runtime/a2a_executor.py:739-758, executor_helpers.py:685-691). Concierge Creates Workspace is a third path: the tenant provisions, finds the kind=platform root, then that concierge stays provisioning and flips failed, triggering the fail-closed live guard in tests/e2e/test_staging_concierge_creates_workspace_e2e.sh:101-109,315-330.

EVIDENCE: Issue #2747 points to commit 0dd269e80edc0dee69571255d8d78ccccb7cabb6. Job 488231 logs PONG, then queue id 205f85c4-4330-4653-8789-6ec4cd6b3d53, then 30 polls all status=queued, then queue poll timed out. Job 488230 logs send at 10:54:17 and full response at 10:54:19 with Agent error (_ResultError) plus an activity JSON parse failure, matching #2748's execution-time classification rather than #2737's queue-stuck mode. Job 488234 logs concierge c1233ab9-8c10-5dc6-bf7f-a0e97d3ad483 as provisioning at 10:49:43, failed at 11:02:13, then E2E_REQUIRE_LIVE=1 — a skip is a false-green guard breach here.

RECOMMENDED FIX SHAPE: Split routing, do not collapse these into a single revert/fix. Route Platform Boot through the #2737 queue-drain fix: add a deterministic queue drainer/sweeper or stronger heartbeat-triggered drain in workspace-server/internal/handlers/a2a_queue.go / registry.go, plus queue-state diagnostics. Route the terminal _ResultError through #2748: owner/CTO-gated workspace logs are needed to expose the live executor exception before fixing provider credentials/runtime unwrapping. Route Concierge Creates Workspace to the platform-agent provisioning lane: inspect why the staging org's platform root fails before online (platform-agent image/provision env/register path), not the create_workspace tool assertion itself.
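To make the "queue drainer/sweeper" idea concrete, here is a minimal in-memory sketch of the detection half only. It is not the code in `a2a_queue.go`: the `QueueRow` fields, the `find_stale` helper, the 60-second threshold, and the timestamps are all hypothetical, chosen for illustration. Only the `queued` status name and the first queue id come from the job logs quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class QueueRow:
    # Hypothetical shape; the real queue schema is not shown in this thread.
    id: str
    status: str
    enqueued_at: datetime


def find_stale(rows, now, max_age=timedelta(seconds=60)):
    """Return rows still 'queued' after max_age; a sweeper would re-dispatch these."""
    return [r for r in rows if r.status == "queued" and now - r.enqueued_at > max_age]


# Illustrative timestamps only; they are not taken from the job logs.
now = datetime(2026, 6, 13, 11, 0, 0, tzinfo=timezone.utc)
rows = [
    QueueRow("205f85c4-4330-4653-8789-6ec4cd6b3d53", "queued", now - timedelta(minutes=6)),
    QueueRow("row-just-enqueued", "queued", now - timedelta(seconds=30)),
    QueueRow("row-already-done", "completed", now - timedelta(minutes=10)),
]

for row in find_stale(rows, now):
    print("stale:", row.id)
```

A periodic sweep of this kind would also double as the queue-state diagnostic mentioned above, since it reports exactly which rows never left `queued`.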

Member

Autonomous RCA tick / CI-health audit update (no new incident dispatched): current molecule-core main 0dd269e80edc0dee69571255d8d78ccccb7cabb6 still has required CI green (CI / all-required, run 358883/job 488219), but the advisory full-SaaS lane is red on run 358888: Platform Boot job 488231, aggregate SaaS job 488230, and Concierge Creates Workspace job 488234. This is therefore still an advisory-lane red, not a required-CI regression.

MECHANISM: the Platform Boot half remains the queue-drain/A2A poll path, not the runtime _ResultError execution path. The owning script is tests/e2e/test_staging_full_saas.sh: a2a_send_or_poll_queue polls /workspaces/$ws_id/a2a/queue/$qid at lines ~1097-1170 and is used by the initial parent A2A send at ~1278 and known-answer send at ~1404. A task that stays queued|dispatched|in_progress through all 30 polls fails the boot check; that maps to #2737, while the executor-time _ResultError track remains separate (#2748).
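The bounded poll described above can be sketched as follows. This is a Python illustration of the control flow only, not the shell function in `test_staging_full_saas.sh`: `poll_queue` and its parameters are hypothetical, the status source is injected so the sketch runs without a network, and `completed` is an assumed name for a terminal status.

```python
from typing import Callable, Tuple

# Non-terminal statuses named in the comment above; any other value is
# treated as terminal in this sketch (an assumption).
NON_TERMINAL = {"queued", "dispatched", "in_progress"}


def poll_queue(
    fetch_status: Callable[[], str],
    max_polls: int = 30,
    sleep: Callable[[float], None] = lambda seconds: None,
    interval_s: float = 0.0,
) -> Tuple[bool, str, int]:
    """Poll until a terminal status or max_polls; return (completed, last_status, attempts)."""
    last = ""
    for attempt in range(1, max_polls + 1):
        last = fetch_status()
        if last not in NON_TERMINAL:
            return True, last, attempt
        sleep(interval_s)
    return False, last, max_polls


# Stuck task: every poll reports "queued", so the boot check times out.
print(poll_queue(lambda: "queued"))  # (False, 'queued', 30)

# Healthy task: the row drains to a terminal status on the third poll.
statuses = iter(["queued", "dispatched", "completed"])
print(poll_queue(lambda: next(statuses)))  # (True, 'completed', 3)
```

The first call reproduces the failure signature in job 488231: 30 attempts, all `queued`, then a timeout.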

EVIDENCE: status API for main shows E2E Staging Platform Boot = Failing after 6m35s, E2E Staging SaaS = Failing after 7m7s, and E2E Staging Concierge Creates Workspace = Failing after 17m40s; user_tasks and Platform Agent subchecks are green in the same run. RECOMMENDED FIX SHAPE: do not reroute this as a new issue. Keep #2737 owned by core-be around the queue dispatch/drain path exercised by test_staging_full_saas.sh; keep #2748 owner-gated for executor exceptions; investigate Concierge Creates Workspace separately if it persists after the queue-drain fix.

Member

Autonomous RCA tick update on run 358888 (attempt 3): no new issue opened; this remains the already-tracked staging SaaS advisory cluster.

MECHANISM: Platform Boot is exactly the #2737 queue-drain failure: the known-answer task is accepted into the A2A queue but never leaves queued during the bounded poll window in tests/e2e/test_staging_full_saas.sh (a2a_send_or_poll_queue, ~1097-1170; known-answer call ~1404). This is separate from #2748, where the initial parent A2A did execute and returned a legible model/access _ResultError.

EVIDENCE: job 488231 lines show A2A known-answer queue poll attempt 1/30 status=queued through 30/30 status=queued, then queue poll timed out waiting for 49187194-ccee-4e65-83e9-91a03b821b45 to complete. Job 488234 is a separate concierge-image/provisioning dependency: concierge → failed, then concierge ... never reached online+routable ... no /opt/molecule-mcp-server, no model, and E2E_REQUIRE_LIVE=1 correctly fails the skip.

RECOMMENDED FIX SHAPE: keep the Platform Boot fix in the core A2A queue dispatch/drain lane (#2737). Track the Concierge Creates Workspace failure as platform-agent/concierge image provisioning readiness, not as the same queue bug and not as #2748 model access. Required CI is green; these are advisory full-SaaS gates.

Member

Fresh full-SaaS run 359126 confirms the same split; no new root-cause class.

MECHANISM: Platform Boot is still the #2737 queue-drain failure. The known-answer A2A queue row is accepted but remains queued for every poll in tests/e2e/test_staging_full_saas.sh (a2a_send_or_poll_queue), so the boot gate times out waiting for completion.

EVIDENCE: job 488652 lines 215-245 show A2A known-answer queue poll attempt 1/30 status=queued through 30/30 status=queued, then queue poll timed out waiting for 723978a3-529d-4e6c-83bc-36feb8848767 to complete. In the same run, user_tasks, workspace-requests, and concierge platform-agent jobs are green; SaaS job 488651 fails earlier on the separate #2748 MiniMax-M2.7 model/access error.

RECOMMENDED FIX SHAPE: keep Platform Boot routed to the A2A queue dispatch/drain owner (#2737) and keep #2748 routed to model/provider entitlement/config. Do not merge these lanes.
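The lane split used throughout this thread can be written down as a small log-signature router. This is a sketch under stated assumptions: the `LANES` table and `route` helper are hypothetical, and the only grounded parts are the three log substrings and the issue numbers quoted in the comments above.

```python
# (log substring, lane) pairs; substrings are quoted from the job logs above.
LANES = [
    ("queue poll timed out", "#2737 (A2A queue dispatch/drain)"),
    ("Agent error (_ResultError)", "#2748 (executor-time model/provider error)"),
    ("never reached online+routable", "concierge/platform-agent provisioning"),
]


def route(log_text):
    """Return every lane whose signature appears in log_text (hypothetical helper)."""
    hits = [lane for needle, lane in LANES if needle in log_text]
    return hits or ["unclassified"]


samples = {
    "Platform Boot": "queue poll timed out waiting for 723978a3-529d-4e6c-83bc-36feb8848767 to complete",
    "SaaS": "full response: Agent error (_ResultError)",
    "Concierge Creates Workspace": "concierge ... never reached online+routable ...",
}

for job, text in samples.items():
    print(job, "->", route(text))
```

Returning a list rather than a single lane keeps the lanes separate even if one log happens to contain more than one signature.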


The failing contexts from this SHA (0dd269e80e) have recovered on current HEAD c9e3480b04: E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (push), E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (push), E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (push). Main is still red for other reasons; see the current [main-red] issue for c9e3480b04.

gitea-actions bot closed this issue 2026-06-13 13:07:22 +00:00
Reference: molecule-ai/molecule-core#2747