Canary failing: staging SaaS smoke #2712

Closed
opened 2026-06-13 06:06:43 +00:00 by gitea-actions · 8 comments

Smoke run failed at 2026-06-13T06:06:43Z.

Run: https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/357348

This issue auto-closes on the next green smoke run. Consecutive failures add a comment here rather than a new issue.

Smoke still failing. https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/357598
Member

MECHANISM: The current staging-smoke red is not the old initial queued-response bug and not the #2713 duplicate-chat regression. On main `424f725e`, `tests/e2e/test_staging_full_saas.sh:1276-1279` sends through `a2a_send_or_poll_queue`, so queued 2xx responses are polled. This run reaches a healthy platform path first: staging key present, parent workspace provisioned, terminal reachable, `config.yaml` PUT 200, workspace online/routable. The failure is then the agent/runtime returning text `Agent error (_ResultError)`, which the script catches at `tests/e2e/test_staging_full_saas.sh:1352-1354` as a generic error-shaped response.

EVIDENCE: main staging-smoke run 357598, job 485802, head `424f725e`. Log excerpt: `A2A returned an error-shaped response`. Immediately before that, the log shows `online and routable` and `config.yaml PUT OK (HTTP 200)`, so CP provisioning/routing and queue polling are not the break. The job also shows `MOLECULE_STAGING_MINIMAX_API_KEY` present for `runtime=claude-code`, while OpenAI/Anthropic staging keys are empty, matching the MiniMax canary path.

RECOMMENDED FIX SHAPE: Treat this as a staging LLM/backend/runtime failure class, not a harness queue bug. In molecule-core, improve the smoke failure classification around `tests/e2e/test_staging_full_saas.sh:1352-1354` so `_ResultError` gets a named backend/runtime diagnostic and, where possible, include recent workspace/agent log tail for the failed workspace id. Operationally, inspect the claude-code MiniMax canary workspace logs for the underlying `_ResultError` cause (auth/base-url/model/backend empty completion) before changing platform code.
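
A rough sketch of what a named classification could look like, assuming the agent reply text is already available in a shell variable. The function name `classify_agent_reply` and the class labels are hypothetical and do not exist in `tests/e2e/test_staging_full_saas.sh`; only the `_ResultError` and `Agent error` strings come from the log excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the real smoke script.
# Maps the agent reply text to a failure class so a runtime/backend error
# is reported under its own name instead of a generic error-shaped catch.
classify_agent_reply() {
  local text="$1"
  case "$text" in
    *_ResultError*)
      # Agent turn failed inside the runtime/LLM backend.
      echo "staging-runtime-backend"
      ;;
    *"Agent error"*)
      # Some other agent-side error; label is illustrative.
      echo "agent-error-other"
      ;;
    *)
      echo "ok"
      ;;
  esac
}

# Example: the excerpt seen in run 357598.
reply='Agent error (_ResultError)'
class="$(classify_agent_reply "$reply")"
if [ "$class" != "ok" ]; then
  echo "FAIL [$class]: $reply" >&2
fi
```

Checking `_ResultError` before the broader `Agent error` pattern keeps the more specific class from being shadowed by the generic one.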

Member

MECHANISM: The continuous synthetic E2E red is the same staging runtime/backend `_ResultError` class as the staging-smoke red, not a provisioning, queue-polling, or artifact/file-path failure. On current main `424f725e`, `tests/e2e/test_staging_full_saas.sh:1276-1279` routes the parent A2A through `a2a_send_or_poll_queue`; the script then extracts `result.parts[0].text` at `:1279-1284` and fails at the generic error catch `:1352-1354` when the agent text contains `_ResultError`.

EVIDENCE: continuous-synth run 357604, job 485809, head `424f725e`, failed after both parent and child were fully online/routable. The log shows tenant reachable, `MODEL_SLUG=MiniMax-M2.7`, both workspaces image upload/download OK, terminal reachable, `config.yaml PUT OK (HTTP 200)`, and both workspaces re-online after config write. The failing excerpt is `Agent error (_ResultError)`. That makes this downstream of CP/routing/files/queue handling and upstream in the claude-code MiniMax agent turn.

RECOMMENDED FIX SHAPE: Keep this tracked with the staging smoke alert, but split the generic `_ResultError` catch into a named staging-runtime/backend diagnostic in `tests/e2e/test_staging_full_saas.sh`, and attach/fetch the failing workspace agent log tail before teardown where possible. The next operator investigation should inspect the claude-code MiniMax canary workspace logs for the underlying result error (auth/base URL/model/backend response), not reopen #2708/#2713.
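
One possible shape for the log-tail step, as a sketch only. How agent logs are actually retrieved for a workspace is not stated in this thread, so `fetch_agent_log`, `print_agent_log_tail`, and `AGENT_LOG_DIR` are all hypothetical placeholders; the stub below just reads a local file so the example runs on its own.

```shell
#!/usr/bin/env bash
# Hypothetical stub: stands in for whatever actually returns a workspace's
# agent log. Here it reads <AGENT_LOG_DIR>/<workspace_id>.log.
fetch_agent_log() {
  cat "${AGENT_LOG_DIR:-/tmp}/$1.log"
}

# Print the last N lines of the agent log for a workspace, or a clear
# marker when the log cannot be fetched. Intended to run before teardown,
# so the failure output carries the underlying error.
print_agent_log_tail() {
  local workspace_id="$1" lines="${2:-50}" log
  if log="$(fetch_agent_log "$workspace_id" 2>/dev/null)"; then
    echo "--- agent log tail: workspace ${workspace_id} ---"
    printf '%s\n' "$log" | tail -n "$lines"
  else
    echo "--- agent log unavailable: workspace ${workspace_id} ---"
  fi
}
```

Calling `print_agent_log_tail "$workspace_id"` in the failure branch, ahead of any cleanup, would keep the log available even when teardown deletes the workspace. A fetch failure is reported rather than treated as fatal, so it cannot mask the original smoke failure.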

Smoke still failing. https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/357598
Smoke still failing. https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/357697
Smoke still failing. https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/357900
Smoke still failing. https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/358088
Smoke still failing. https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/358286
Reference: molecule-ai/molecule-core#2712