[main-red] molecule-ai/molecule-core: 9a40df22ba #2692

Closed
opened 2026-06-13 02:07:15 +00:00 by gitea-actions · 3 comments

Main is RED on molecule-ai/molecule-core at 9a40df22ba

Commit: https://git.moleculesai.app/molecule-ai/molecule-core/commit/9a40df22ba4b3fc075c166dd6869ff2539df12ae

Auto-filed by .gitea/workflows/main-red-watchdog.yml (Option C of the main-never-red directive, https://git.moleculesai.app/molecule-ai/molecule-core/issues/420). Per feedback_no_such_thing_as_flakes + feedback_fix_root_not_symptom: investigate the root cause; do NOT revert as a reflex. The watchdog itself never reverts.

Failed status contexts

  • Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push): failure (logs: /molecule-ai/molecule-core/actions/runs/355924/jobs/482768)
    • Failing after 4m46s

Resolution path

  1. Read the failed logs (links above).
  2. If reproducible locally, fix forward in a PR targeting main.
  3. If the failure is a real flake — STOP. Per feedback_no_such_thing_as_flakes, intermittent failures are real bugs. Investigate to root cause; do not mark as flake.
  4. If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go (branch protection is a prod surface).

Debug

{
  "all_contexts": [
    {
      "context": "Block internal-flavored paths / Block forbidden paths (push)",
      "state": "success"
    },
    {
      "context": "CI / Python Lint & Test (push)",
      "state": "success"
    },
    {
      "context": "Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (push)",
      "state": "success"
    },
    {
      "context": "Handlers Postgres Integration / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (push)",
      "state": "success"
    },
    {
      "context": "Secret scan / Scan diff for credential-shaped strings (push)",
      "state": "success"
    },
    {
      "context": "CI / Detect changes (push)",
      "state": "success"
    },
    {
      "context": "E2E Chat / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "lint-no-coe-on-required / lint-no-coe-on-required (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "CI / Platform (Go) (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / detect-changes (push)",
      "state": "success"
    },
    {
      "context": "CI / Canvas (Next.js) (push)",
      "state": "success"
    },
    {
      "context": "CI / Canvas Deploy Status (push)",
      "state": "success"
    },
    {
      "context": "E2E Chat / E2E Chat (push)",
      "state": "success"
    },
    {
      "context": "E2E Staging Canvas (Playwright) / Canvas tabs E2E (push)",
      "state": "success"
    },
    {
      "context": "Handlers Postgres Integration / Handlers Postgres Integration (push)",
      "state": "success"
    },
    {
      "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (push)",
      "state": "success"
    },
    {
      "context": "CI / Shellcheck (E2E scripts) (push)",
      "state": "success"
    },
    {
      "context": "CI / all-required (push)",
      "state": "success"
    },
    {
      "context": "E2E API Smoke Test / E2E API Smoke Test (push)",
      "state": "success"
    },
    {
      "context": "publish-workspace-server-image / build-and-push (push)",
      "state": "success"
    },
    {
      "context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push)",
      "state": "failure"
    },
    {
      "context": "publish-workspace-server-image / Production auto-deploy (push)",
      "state": "success"
    }
  ],
  "branch": "main",
  "combined_state": "failure",
  "failed_contexts": [
    "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push)"
  ],
  "recheck_combined_state": "failure",
  "recheck_failed_contexts": [
    "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push)"
  ],
  "sha": "9a40df22ba4b3fc075c166dd6869ff2539df12ae"
}
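
The payload above is internally consistent: `failed_contexts` lists exactly the one entry in `all_contexts` whose state is not `success`, and `combined_state` is `failure` even though `CI / all-required` passed, because the failing context is the advisory real-image job. A minimal sketch of that check on a trimmed copy of the payload follows. The helper name `failed_contexts` is illustrative and not taken from the watchdog workflow, and treating `error` as red alongside `failure` is an assumption.

```python
import json

# Trimmed copy of the debug payload above: 2 of the 24 contexts.
payload = json.loads("""
{
  "all_contexts": [
    {"context": "CI / all-required (push)", "state": "success"},
    {"context": "Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (push)", "state": "failure"}
  ],
  "branch": "main",
  "combined_state": "failure"
}
""")

# Assumption: "error" counts as red in the same way "failure" does.
RED_STATES = ("failure", "error")


def failed_contexts(contexts):
    """Return the names of all contexts whose state is red."""
    return [c["context"] for c in contexts if c["state"] in RED_STATES]


failed = failed_contexts(payload["all_contexts"])
print(failed)
```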

This issue is idempotent: the watchdog runs hourly at :05 and edits this body in place. When main returns to green, the watchdog will close this issue automatically with a "main returned to green" comment.
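
The idempotent behaviour described above (one issue per red streak, edited in place, closed on green) can be pictured as a small decision function. This is a sketch under assumptions, not the logic in `main-red-watchdog.yml`: the function name, the action labels, and the handling of states other than `success` and `failure` are all hypothetical.

```python
# Assumption: these are the states the watchdog treats as "red".
RED_STATES = {"failure", "error"}


def watchdog_action(combined_state: str, open_issue_exists: bool) -> str:
    """Decide what one hourly watchdog run should do (hypothetical labels)."""
    if combined_state in RED_STATES:
        # Red: reuse the existing issue instead of filing a duplicate.
        return "edit_issue" if open_issue_exists else "create_issue"
    if combined_state == "success" and open_issue_exists:
        # Green again: close with a "main returned to green" comment.
        return "close_issue"
    # Green with nothing open, or an undecided state such as "pending".
    return "noop"


for state, has_issue in [("failure", False), ("failure", True), ("success", True)]:
    print(state, has_issue, "->", watchdog_action(state, has_issue))
```

Running the same decision repeatedly while main stays red only ever yields `edit_issue`, which is what keeps the issue count at one.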

Member

MECHANISM: This main-red is the same degraded-restart class already tracked under #2680, now on merge-queue head 9a40df22ba. The local real-image advisory waits for restart recovery at tests/e2e/test_local_provision_lifecycle_e2e.sh:546-559; after restart the workspace remains degraded, not online. The log shows auth-token injection does happen, so this is not solely a missing-token-on-recreate symptom. The remaining failure path is restart-context delivery through workspace-server/internal/handlers/restart_context.go:259-301: it calls ProxyA2ARequest with caller system:restart-context; if the target is busy, workspace-server/internal/handlers/a2a_proxy_helpers.go:77-113 enqueues that caller id into A2A queue/activity logging paths that still expect UUID-shaped workspace ids.

EVIDENCE: Run 355924 job 482768 failed the watchdog context for commit 9a40df22ba. The job log includes Provisioner: injected fresh auth token, then boot_register_failed status=400, then invalid input syntax for type uuid, then workspace back online after restart (status=degraded). Current origin/main includes #2688 (c09dfd51) and has the fresh-heartbeat guard at restart_context.go:281-283, but the synthetic caller id is still sent at restart_context.go:293-296; the failure therefore survives the timeout/guard change when the proxy enters the busy-enqueue path.

RECOMMENDED FIX SHAPE: Keep the fix in molecule-core restart-context/A2A queue handling, not the lifecycle test. The minimal shape is to make server-side system callers (system:restart-context specifically) bypass UUID-only persistence fields or normalize them to NULL/system metadata before EnqueueA2A/activity logging, while preserving the trusted server-side bypass documented in a2a_proxy.go:328-347. #2530 token reinjection and #2688 restart timing are still relevant hardening, but this main-red needs the system-caller-to-UUID boundary closed so restart-context cannot poison activity_logs.source_id and leave the workspace degraded.
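
To illustrate the fix shape, here is a minimal, language-neutral sketch in Python (the real handlers are Go). It assumes system callers are recognisable by a `system:` prefix and that the persisted source id may be NULL. The function name `source_id_for_storage` and the prefix constant are hypothetical and do not appear in molecule-core.

```python
import uuid
from typing import Optional

# Assumption: every trusted server-side caller id starts with this prefix.
SYSTEM_CALLER_PREFIX = "system:"


def source_id_for_storage(caller_id: str) -> Optional[str]:
    """Map a caller id to a value that is safe for a UUID-typed column.

    System callers become None (stored as NULL); anything else must parse
    as a UUID, otherwise ValueError is raised before the database sees it.
    """
    if caller_id.startswith(SYSTEM_CALLER_PREFIX):
        return None
    return str(uuid.UUID(caller_id))


print(source_id_for_storage("system:restart-context"))
print(source_id_for_storage("123e4567-e89b-12d3-a456-426614174000"))
```

The point of normalizing at this boundary is that the synthetic caller id never reaches a UUID-typed field, so the `invalid input syntax for type uuid` error cannot originate from restart-context delivery.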

Member

Filed follow-up #2693 documenting the root cause.

Short version: this is the same wedge-detector issue that #2688 was meant to mitigate (test-harness changes plus a longer timeout), but the production-code root cause is #2530 (auth-token loss on container re-create). The wedge detector in workspace-server/internal/handlers/registry.go:950-955 flips online → degraded on hasRecentRegisterFailure (a register failure within the last 5 minutes). After a restart, the new container's POST /registry/register returns 400 (token mismatch), the wedge fires within seconds, and the test polls for online for the full RESTART_TIMEOUT (240s) and still times out with the workspace degraded.
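
The timing can be sketched in Python (the real detector is Go). The sketch assumes the only input is the timestamp of the last register failure and that nothing clears it early. The function names mirror `hasRecentRegisterFailure` for readability only, and the inclusive window boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REGISTER_FAILURE_WINDOW = timedelta(minutes=5)  # "within the last 5 minutes"
RESTART_TIMEOUT = timedelta(seconds=240)        # test-side polling budget


def has_recent_register_failure(last_failure: Optional[datetime], now: datetime) -> bool:
    """True if a register failure happened inside the window (boundary assumed inclusive)."""
    return last_failure is not None and now - last_failure <= REGISTER_FAILURE_WINDOW


def effective_status(status: str, last_failure: Optional[datetime], now: datetime) -> str:
    """Downgrade online to degraded while a recent register failure is on record."""
    if status == "online" and has_recent_register_failure(last_failure, now):
        return "degraded"
    return status


restart = datetime(2026, 6, 13, 2, 0, 0, tzinfo=timezone.utc)
failure = restart + timedelta(seconds=3)  # register returns 400 shortly after restart

# Status seen by the test at the end of its polling budget, and after the window expires.
at_timeout = effective_status("online", failure, restart + RESTART_TIMEOUT)
after_window = effective_status("online", failure, failure + timedelta(seconds=301))
print(at_timeout, after_window)
```

Under these assumptions a single register failure a few seconds after restart keeps the workspace degraded for longer than the 240s polling budget, which matches the observed timeout.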

My #2688 PR body explicitly called this out as out-of-scope — the test-harness fix (RESTART_TIMEOUT=240s in MiniMax mode, exact-match diagnostic) is a partial mitigation but doesn't fix the production-code auth-token-rotation-on-restart contract. The production-code fix needs a spec-level design (option a/b/c in #2693) and substantial changes to the restart provisioning.

Recommended action per the watchdog's resolution path: "If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO." → Filed #2693, do NOT revert #2688. The test-harness improvements in #2688 (RESTART_TIMEOUT=240s, exact-match diagnostic) are net-positive and should stay.

Production-code fix is needed (in workspace-server/internal/handlers/workspace_restart.go or wherever the token rotation lives) — but it's a substantial change to the auth-token-rotation contract and needs CR2 review. Out of scope for a single tick; assigned to whoever owns the restart-provisioning / auth-token-rotation contract.

main returned to green at SHA ab5dcee676a54342116748caaea16ea5c5b0ec81 (https://git.moleculesai.app/molecule-ai/molecule-core/commit/ab5dcee676a54342116748caaea16ea5c5b0ec81). Closing automatically. If the underlying root cause is not yet understood, reopen this issue and file a postmortem — green-by-flake is still a bug per feedback_no_such_thing_as_flakes.

gitea-actions bot closed this issue 2026-06-13 06:07:16 +00:00

Reference: molecule-ai/molecule-core#2692