fix(chat): client timeout is not "unreachable" — keep thinking state for long agent turns #2515
Bug (CTO-reported on jrs-auto, screenshot-verified)
Chat shows "Failed to send message — agent may be unreachable" after 120s while the agent is visibly running tools in the activity feed.
Mechanism
The A2A proxy holds the send POST open for the agent's whole turn; a long tool-calling turn outlives timeoutMs: 120_000, AbortSignal.timeout fires (DOMException name="TimeoutError"), and the catch-all mapped every rejection to "unreachable" + released the send guards. False alarm: the message was delivered; the server accepted it and held the connection.
Fix
Classify the rejection:
TimeoutError → delivered + still working: no banner, thinking state persists; the reply (and guard release) arrives via the AGENT_MESSAGE WS event, the same documented contract poll-mode already uses. Genuine unreachability fails fast (connection-refused/4xx/5xx) and never takes this branch; a truly dead agent is caught by the reactive-health path.
Tests
TimeoutError → no error, sending stays true; ECONNREFUSED → "unreachable" + guards released. Full chat-hooks suite 296 passed. tsc/eslint clean.
🤖 Generated with Claude Code
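For readers without the diff open, the classification described above can be sketched roughly as follows. This is a minimal illustration, not the code in useChatSend.ts: only the isClientTimeout predicate (name === "TimeoutError") comes from the PR; the SendOutcome type and classifySendRejection function are hypothetical names introduced here.

```typescript
// Hypothetical outcome type, for illustration only (not from the PR).
type SendOutcome =
  | { kind: "still-working" } // delivered; keep thinking state, wait for AGENT_MESSAGE
  | { kind: "unreachable" };  // show the error banner and release the send guards

// AbortSignal.timeout() aborts with a DOMException whose name is "TimeoutError".
// The check reads only the `name` property, so it also works on plain Error objects.
function isClientTimeout(e: unknown): boolean {
  return (
    typeof e === "object" &&
    e !== null &&
    (e as { name?: unknown }).name === "TimeoutError"
  );
}

// Hypothetical helper: map a rejected send to what the UI should do next.
function classifySendRejection(e: unknown): SendOutcome {
  if (isClientTimeout(e)) {
    // The client budget expired while the server was still holding the POST open.
    return { kind: "still-working" };
  }
  // Everything else (connection refused, 4xx, 5xx, ...) keeps the old behaviour.
  return { kind: "unreachable" };
}
```

Keeping the predicate this narrow is what lets every other rejection fall through to the existing error branch unchanged.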
Security+correctness 5-axis: APPROVE (head bcf7022d92). fix(chat): client timeout ≠ unreachable (+105/-1, useChatSend.ts + test).
.catch now distinguishes a CLIENT read-timeout from genuine failure: isClientTimeout = e.name === "TimeoutError" (AbortSignal.timeout) → return (keep thinking-state; reply + guard-release arrive via the AGENT_MESSAGE WS event, per the poll-mode contract). The A2A proxy holds the POST open for the agent's whole turn, so a 120s client-budget expiry on a long tool-calling turn means DELIVERED + still-working, not unreachable. Narrowly scoped: genuine unreachability (connection-refused / 4xx / 5xx) does NOT match the predicate → still takes the error branch + fails fast; a truly-dead agent is surfaced by the reactive-health path (maybeMarkContainerDead), not swallowed here. So it fixes the false "agent may be unreachable" alarm without masking real errors.
Required gate green (CI/all-required ✓); Local-Provision (ignore-list) + security-review (team-gate) reds = non-code. Author core-devops (≠me). Part of the #111/#112 wedge-fix family. APPROVE; needs CR-B qa 2nd lane → merge.
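The lifecycle the review describes (timeout keeps the thinking state, AGENT_MESSAGE releases it, a genuine failure releases it with an error) can be pictured as a small state transition. This is a sketch under assumptions: SendState, SendEvent, and nextSendState are hypothetical and do not reflect how the hook actually stores its state or guards.

```typescript
// Hypothetical state shape and event names, for illustration only.
interface SendState {
  sending: boolean;     // true while the thinking indicator should be shown
  error: string | null; // banner text, or null when no banner is shown
}

type SendEvent =
  | { type: "send-rejected"; error: unknown } // the send POST promise rejected
  | { type: "agent-message" };                // AGENT_MESSAGE arrived over the WS

function nextSendState(state: SendState, event: SendEvent): SendState {
  switch (event.type) {
    case "send-rejected": {
      const name = (event.error as { name?: unknown } | null)?.name;
      // Client timeout: message was delivered, agent is still working.
      // Leave the state untouched; the WS event will release it later.
      if (name === "TimeoutError") return state;
      // Genuine failure: surface the banner and release the guard.
      return { sending: false, error: "agent may be unreachable" };
    }
    case "agent-message":
      // The reply arrived: clear any banner and release the guard.
      return { sending: false, error: null };
  }
}
```

The two cases in the PR's test file map onto the two branches of "send-rejected": a TimeoutError leaves sending true with no error, while a connection-refused error sets the banner and releases the guard.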
qa 2nd-lane (full-SHA pinned). fix(chat): client timeout is not 'unreachable' — CTO-reported, screenshot-verified. DIFF VALIDATED: useChatSend.ts now distinguishes a CLIENT-side AbortSignal.timeout (DOMException name='TimeoutError' after the 120s client budget — which a tool-calling turn routinely outlives while the agent is still working) from genuine unreachability. A TimeoutError no longer sets 'Failed to send — agent may be unreachable' (that false signal was the bug); true unreachability is detected by maybeMarkContainerDead, not the client timeout. useChatSend.clientTimeout.test.tsx (+84) covers it. Sound UX-correctness fix.
⚠️ GATE-TRANSPARENT MERGE-HELD: red Local Provision is NOT diff-caused — this is a CANVAS/frontend change (no provisioning path), and current main itself has Local Provision red (main-level inherited). APPROVE certifies the diff + arms 2-genuine; merge HELD via verify-by-state until Local Provision greens (main-level fix/re-run). APPROVED (diff-validated; merge-on-LocalProvision-green).