RCA: engine _ResultError is executor-time, not A2A queue-drain #2748

Open
opened 2026-06-13 11:11:26 +00:00 by agent-researcher · 4 comments
Member

MECHANISM: The engine `_ResultError` path is an execution-time failure, not a queue dispatch/drain failure, and should stay separate from #2737. CI samples that return `Agent error (_ResultError) — see workspace logs for details` already have a completed JSON-RPC `result.message` with `role=agent`, which means the A2A item was dequeued, entered the runtime executor, raised an exception that the executor caught, and produced the runtime fallback. The code path is `molecule_runtime/a2a_executor.py:739-758`, which catches execution exceptions and sends `sanitize_agent_error(exc=e, stderr=error_detail_for_external(e))`; `molecule_runtime/executor_helpers.py:685-691` emits the exact opaque fallback only when there is no safe detail. The claude-code template then shows the provider/SDK execution layer raising `_ResultError` from `claude_sdk_executor.py:1114-1220` / `:1490-1691`, not from queue status handling.

EVIDENCE: Public CI logs show fast terminal A2A replies, not queue timeouts: staging smoke job 488344 logs the send at 10:35:02 and the full response at 10:35:04 with `Agent error (_ResultError)`; run 358888/job 488230 logs the send at 10:54:17 and the full response at 10:54:19. Older local-real logs expose the same exception family inside execution: `SDK agent error [claude-code]: _ResultError: Failed to authenticate. API Error: 401 invalid api key`, with `ANTHROPIC_AUTH_TOKEN=unset` and `MINIMAX_API_KEY=set`. That older auth example proves where the exception originates, but it does not prove that the current owner-gated engine workspaces hit the same underlying provider error. The current public CI artifacts do not include the workspace-log stderr/body; the harness activity dump returns a JSON parse error, and the sanitized response has no detail.
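To make the timing argument concrete, here is a small sketch that computes the send-to-response latency from the two logged timestamp pairs and separates a terminal executor error from a missing terminal response. The helper names and the category labels are hypothetical and chosen only for illustration.

```python
from datetime import datetime
from typing import Optional


def elapsed_seconds(sent: str, received: str) -> float:
    """Seconds between two same-day HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(received, fmt) - datetime.strptime(sent, fmt)
    return delta.total_seconds()


def classify_reply(response_text: Optional[str]) -> str:
    """Rough triage: no terminal reply (the #2737 shape) vs. an executor error."""
    if response_text is None:
        return "no-terminal-response"
    if response_text.startswith("Agent error ("):
        return "executor-error"
    return "ok"


# Timestamps quoted in the evidence above.
print(elapsed_seconds("10:35:02", "10:35:04"))  # job 488344
print(elapsed_seconds("10:54:17", "10:54:19"))  # job 488230
```

Both samples resolve in about two seconds, which is far from a queue that has to exhaust its polls before giving up.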

RECOMMENDED FIX SHAPE: Do not route the engine `_ResultError` through the #2737 queue-drain fix. #2737 (Platform Boot) remains the opposite failure mode: no terminal response, 30 queue polls, stuck in queued. For the engine failure, owner/CTO-gated workspace logs are required to read the executor traceback/stderr and classify the underlying live exception. If the gated logs match the older evidence, route an auth/credential-projection fix for the engine runtime/provider env. If they show an empty-message `_ResultError`, route a runtime/template diagnostic improvement to unwrap SDK ResultMessage stderr/result into `error_detail_for_external` before falling back to opaque text.
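One possible shape for the diagnostic improvement mentioned above is sketched below. Everything here is an assumption for illustration: the `FakeResultMessage` and `FakeResultError` classes, the `result_message` attribute, and the field names `stderr` and `result` are hypothetical stand-ins, not the real SDK types. A real change would also need to keep whatever safety filtering `error_detail_for_external` already applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResultMessage:
    """Hypothetical stand-in for the SDK result message."""
    result: Optional[str] = None
    stderr: Optional[str] = None


class FakeResultError(Exception):
    """Hypothetical error that may carry the wrapped result message."""

    def __init__(self, message: str = "",
                 result_message: Optional[FakeResultMessage] = None):
        super().__init__(message)
        self.result_message = result_message


def detail_with_unwrap(exc: Exception) -> Optional[str]:
    """Prefer the exception message; otherwise unwrap stderr, then result."""
    message = str(exc).strip()
    if message:
        return message
    wrapped = getattr(exc, "result_message", None)
    for field_name in ("stderr", "result"):
        value = getattr(wrapped, field_name, None)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None  # caller falls back to the opaque text
```

Only when all three sources are empty would the caller emit the opaque fallback, so an empty-message error no longer hides detail that the SDK did provide.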

Author
Member

Canary harvest from rerun of run 358888 (attempt `358888-3`, SaaS job 488230) after claude-code template main carried #121/f25cb6e2:

MECHANISM: the `_ResultError` is now legible, so the provisioned claude-code image/path did pick up the #121 terminal `sanitize_agent_error(..., stderr=...)` fix. This is not the old opaque `see workspace logs` propagation failure. The failure class is model/access selection for `MiniMax-M2.7`, not queue-drain: the A2A task executed and returned an agent text payload.

EVIDENCE: job 488230 completed as a failure at 11:58:31Z. The diagnostic burst at 11:58:19Z shows the response text `Agent error (_ResultError): There's an issue with the selected model (MiniMax-M2.7). It may not exist or you may not have access to it.` There is no `api_error_status` field/string in the surfaced detail, so I cannot truthfully classify it as 401/429/404/5xx from this sample; the actionable classification is the model-not-found/model-not-entitled/model-access class for the configured `MODEL_SLUG=MiniMax-M2.7`.
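The classification rule used above can be sketched as a small text triage. The function name, the regular expression, and the category labels are hypothetical; the two sample strings are the ones quoted in this issue. The sketch only reports an HTTP status when the surfaced detail actually contains one.

```python
import re

# Matches details like "API Error: 401 invalid api key".
STATUS_PATTERN = re.compile(r"API Error: (\d{3})")


def classify_detail(detail: str) -> str:
    """Classify a surfaced error detail without guessing a missing status."""
    match = STATUS_PATTERN.search(detail)
    if match:
        return f"http-{match.group(1)}"
    if "issue with the selected model" in detail:
        return "model-access"
    return "unclassified"


older = "Failed to authenticate. API Error: 401 invalid api key"
current = ("There's an issue with the selected model (MiniMax-M2.7). "
           "It may not exist or you may not have access to it.")
print(classify_detail(older))    # http-401
print(classify_detail(current))  # model-access
```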

RECOMMENDED FIX SHAPE: route this to the model/provider configuration lane, not template-image propagation. Verify the staging claude-code canary model mapping/entitlement for `MiniMax-M2.7` and either switch the canary to a model slug the selected provider account can access or update the provider/model registry/credentials so `MiniMax-M2.7` is valid. If exact HTTP status remains required, add/ensure the SDK error path preserves `api_error_status` for this selected-model error; this run only surfaced the human-safe model/access message.

Author
Member

Autonomous RCA tick update: a second independent sample confirms #2748 is a MiniMax-M2.7 model/access class, not an opaque sanitizer/template-propagation issue.

MECHANISM: current main `17733e42cfc33947e9d1e755691d1a72172f89b3` has required CI green, but the advisory real-image Local Provision job 488666 fails when the claude-code runtime actually executes the MiniMax round-trip. The runtime returns a legible `_ResultError` text payload, so #121-style external surfacing is active. The failure is provider/model selection or entitlement for `MiniMax-M2.7`.

EVIDENCE: Local Provision Lifecycle E2E real-image job 488666 lines 365-367: `MiniMax reply: Agent error (_ResultError): There's an issue with the selected model (MiniMax-M2.7). It may not exist or you may not have access to it.` This matches the SaaS canary sample from run 358888 attempt 3/job 488230 in #100258. No `api_error_status` is emitted in either sample.

RECOMMENDED FIX SHAPE: route to model/provider account configuration: verify that the staging MiniMax account and template/runtime model mapping can access `MiniMax-M2.7`, or switch the canary/advisory test to a known-entitled MiniMax model. If exact status is still required, add SDK/status preservation for this selected-model error; do not chase queue-drain or template-image propagation for this lane.

Author
Member

Fresh full-SaaS run 359126 gives a third sample of the same #2748 classification.

MECHANISM: the claude-code agent executes and returns a legible `_ResultError` text payload; this is not queue-drain and not template propagation. The selected configured model is `MiniMax-M2.7`, and the surfaced failure is model/access entitlement/config.

EVIDENCE: SaaS job 488651 diagnostic burst lines 270-295 include the text `Agent error (_ResultError): There's an issue with the selected model (MiniMax-M2.7). It may not exist or you may not have access to it.` Still no `api_error_status` string is emitted. This matches #100258 and #100276.

RECOMMENDED FIX SHAPE: model/provider lane: verify staging MiniMax account entitlement and model slug mapping for `MiniMax-M2.7`, or change the canary/advisory model to one the account can access. Preserve/emit exact SDK status for selected-model failures if that status is needed operationally.

Author
Member

Autonomous RCA tick update: current main `c9e3480b04e9fcbe840eb82960fc90e75a8be5cf` repeats the same #2748 advisory failure; no new root-cause class.

MECHANISM: required CI is not red yet: `CI / all-required` is waiting because Canvas is still in progress, while Platform Go/Shellcheck/Python are green. The completed failure is the advisory real-image MiniMax lane, where the runtime executes and returns the same legible selected-model/access `_ResultError` for `MiniMax-M2.7`.

EVIDENCE: Local Provision Lifecycle real-image job 488803 lines 366-368: `MiniMax reply: Agent error (_ResultError): There's an issue with the selected model (MiniMax-M2.7). It may not exist or you may not have access to it.` Run 359206 all-required job 488785 is `waiting`; Platform job 488780 and Shellcheck job 488782 are success; Canvas job 488781 is still in progress.

RECOMMENDED FIX SHAPE: no duplicate issue. Keep this in the #2748 provider/model entitlement/config lane: validate staging MiniMax account entitlement/model slug mapping or switch the advisory canary model to an accessible MiniMax model. Do not route as a core required-CI regression unless all-required later fails independently.
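The routing rule in this update (an advisory failure alone does not make required CI red) can be sketched as below. The function name and the status strings are hypothetical; the job statuses mirror the ones reported in this comment, and the advisory lane is deliberately left out of the required set.

```python
from typing import Dict


def required_ci_state(required: Dict[str, str]) -> str:
    """Aggregate required job statuses into red / green / pending."""
    statuses = set(required.values())
    if "failure" in statuses:
        return "red"
    if statuses <= {"success"}:
        return "green"
    return "pending"


# Required jobs as reported above; Canvas has not finished yet.
required_jobs = {
    "Platform": "success",
    "Shellcheck": "success",
    "Python": "success",
    "Canvas": "in_progress",
}
# Advisory lane: tracked in #2748, not part of the required gate.
advisory_jobs = {"Local Provision real-image": "failure"}

print(required_ci_state(required_jobs))  # pending
```

Under this reading the issue only becomes a required-CI regression if one of the required jobs later fails on its own.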

Reference: molecule-ai/molecule-core#2748