[a2a] successful delegation responses rendered as error + Restart-workspace suggestion → retry storms + data-loss risk #159

Closed
opened 2026-05-09 20:28:41 +00:00 by claude-ceo-assistant · 0 comments
Owner

Symptom (real customer-impacting bug)

When workspace A delegates to workspace B and B successfully produces a response, the canvas often renders the result as:

⚠  Core-DevOps returned an error
TASK: <full successful diagnosis text — agent did the work>
Underlying error
(no detail returned)
The remote agent returned no error detail (the underlying httpx exception had an empty message — typically a connection-reset or silent timeout). A workspace restart is the safe first move.
[Restart Core-DevOps]  [Open Core-DevOps]

Real example from this morning: Core-DevOps successfully produced "PR #149: tier-check fails NO REVIEWS (author needs engineers/managers/ceo approval); PR #140: pm approved but pm not in managers/ceo team; Harness Replays #141 known-broken." That's a correct, complete diagnosis. But the canvas marked the response as an error and suggested restarting the workspace.

Why this matters

  1. Restarting a workspace as suggested = data loss (kills the agent's in-progress reasoning + any unsaved tool-call state)
  2. The PM sees "error" and retries the delegation 4–5 times; each retry fires a fresh delegation that lands on the busy adapter and bounces, producing a retry storm
  3. The actual diagnosis is sitting in plain sight inside the error banner but the framing trains PMs to ignore it

This is the root cause of the retry-storm pattern observed across Core-Lead → Core-DevOps comms in the last hour (4+ retries of an already-completed diagnosis).

Likely root cause

A2A response delivery uses sync HTTP. When the target adapter is busy, the platform receives the agent's response body but the delivery transport returns a connection-reset / empty-body error to the canvas event consumer. Canvas then renders activity_type=delegation with status=error even though the response body was received.
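
The suspected failure mode can be sketched as follows. This is an illustration only, not the platform's code: `ActivityRow`, `record_delegation_buggy`, and `busy_adapter` are hypothetical names, and a built-in `ConnectionError` with an empty message stands in for the httpx exception mentioned in the banner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActivityRow:
    """Hypothetical shape of the activity row the canvas renders."""
    activity_type: str
    status: str
    response_body: str
    error_detail: str


def record_delegation_buggy(response_body: str,
                            deliver: Callable[[str], None]) -> ActivityRow:
    """Assumed buggy behavior: status is derived from the delivery
    transport outcome, not from whether the agent produced a body."""
    try:
        deliver(response_body)
    except Exception as exc:
        # An exception with an empty message yields no usable detail.
        detail = str(exc) or "(no detail returned)"
        return ActivityRow("delegation", "error", response_body, detail)
    return ActivityRow("delegation", "completed", response_body, "")


def busy_adapter(_body: str) -> None:
    """Stand-in for a sync HTTP delivery that hits a busy adapter."""
    raise ConnectionError()  # empty message, like a connection reset


row = record_delegation_buggy("PR #149: tier-check fails NO REVIEWS",
                              busy_adapter)
print(row.status, "|", row.error_detail, "|", row.response_body)
```

If the real code path resembles this, the row carries both a complete response body and `status=error`, which matches the banner shown in the symptom.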

Fix shape (to confirm during diagnosis)

  • Decouple delegation completed from delivery transport — the response body should be persisted as soon as the agent produces it, regardless of whether the sync HTTP delivery to the requester succeeded
  • Canvas event renderer: if the activity row has a non-empty response body, render as completed with the body; only mark error when body is genuinely empty
  • Drop the "Restart workspace" suggestion when the body is non-empty — it's actively harmful
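
A minimal sketch of the three points above, under the same caveat that the fix shape is still to be confirmed. All names (`complete_delegation`, `render_delegation`, `RenderedActivity`, the dict used as a store) are hypothetical and chosen for illustration; the real persistence layer and renderer will differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RenderedActivity:
    """Hypothetical view model for a delegation activity on the canvas."""
    status: str
    body: str
    show_restart_suggestion: bool


def complete_delegation(store: Dict[str, str], delegation_id: str,
                        response_body: str,
                        deliver: Callable[[str], None]) -> bool:
    """Persist the body first, then attempt delivery.

    Returns True if delivery succeeded. A delivery failure no longer
    affects what was persisted.
    """
    store[delegation_id] = response_body
    try:
        deliver(response_body)
    except Exception:
        return False
    return True


def render_delegation(response_body: str) -> RenderedActivity:
    """Render from the persisted body, not from the transport outcome."""
    has_body = bool(response_body and response_body.strip())
    if has_body:
        # Non-empty body: completed, and no restart suggestion.
        return RenderedActivity("completed", response_body, False)
    # Genuinely empty body: keep the existing error rendering.
    return RenderedActivity("error", "", True)


def failing_delivery(_body: str) -> None:
    """Stand-in for a sync HTTP delivery that resets the connection."""
    raise ConnectionError()


store: Dict[str, str] = {}
delivered = complete_delegation(store, "d-1",
                                "PR #149: tier-check fails NO REVIEWS",
                                failing_delivery)
view = render_delegation(store["d-1"])
print(delivered, view.status, view.show_restart_suggestion)
```

In this sketch a failed delivery still leaves the diagnosis persisted and rendered as `completed`, so the requester has no reason to retry or restart the target workspace.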

Tier

tier:high — destroys A2A reliability; trains operators to retry-storm; suggests destructive recovery action

Reporter

Hongming (CTO), 2026-05-09 ~20:30 UTC, after seeing the Core-DevOps retry-storm pattern in a screenshot.

core-be referenced this issue from a commit 2026-05-09 21:52:16 +00:00
core-be referenced this issue from a commit 2026-05-09 22:10:58 +00:00
core-be referenced this issue from a commit 2026-05-09 22:12:04 +00:00

Reference: molecule-ai/molecule-core#159