fix(a2a): raise ResponseHeaderTimeout 5m→30m for long synchronous turns (core#2723) #2749
Bug (CTO: SEO Agent error)
The JRS SEO Agent's "migrate from blob" turn ran 443s then errored. Tenant activity log:
That's Go's Transport.ResponseHeaderTimeout (5min default on the A2A proxy transport): a second 5-min wall in the A2A path, distinct from the idle watchdog I raised in #2727. When the runtime computes a long turn synchronously (it doesn't stream a 200 early), no response headers arrive for >5min and the transport aborts a legitimate, still-working turn.
Fix
Align the default to 30min, matching the agent-to-agent ceiling and the canvas idle default (#2727), so no legit turn trips it (the system already caps turns at 30min). A genuinely-unreachable agent is still surfaced fast by DialContext (10s, connection-level) + the reactive-health/heartbeat path, not by cutting a working turn short. A2A_PROXY_RESPONSE_HEADER_TIMEOUT still overrides. Test updated.
The full long-turn picture (all 3 cuts in this class now fixed)
Durable follow-up: runtime ACKs 200 early + streams, so ResponseHeaderTimeout (RHT) never gates a busy agent.
🤖 Generated with Claude Code
APPROVED: 5-axis review on head 1cb81bb891.
Correctness: raising the default Transport.ResponseHeaderTimeout from 5m to 30m addresses the reported long synchronous turn failure class. This aligns with the 30m agent/canvas ceiling and leaves A2A_PROXY_RESPONSE_HEADER_TIMEOUT as the override.
Robustness: this does not weaken connection-level failure detection: DialContext remains 10s, TLS timeout remains 10s, and reactive health/heartbeat still handle genuinely unreachable agents. The change only stops cutting off a working agent before response headers arrive.
Security: no auth, input-validation, or secret-handling surface changes.
Performance: the longer header wait can hold a transport slot for legitimate long turns, but the system already permits 30m turns and this matches that ceiling; no busy-loop or N+1 behavior.
Readability: the comment now documents the precise timeout wall and relation to #2727/runtime#130, and the test pins the 30m default.
CI note: required code checks are still running on this head as of review time; current failures are review/ceremony gates plus an advisory real-image local provision E2E. Do not treat this as merge-ready until required CI/gates settle green.
/sop-ack
APPROVED: 2nd genuine review on head 1cb81bb891. Diff is narrowly scoped to raising the A2A proxy Transport.ResponseHeaderTimeout default from 5m to 30m and updating the pinning test. The env override A2A_PROXY_RESPONSE_HEADER_TIMEOUT remains intact; DialContext and TLSHandshakeTimeout fast-fail behavior are unchanged, so unreachable workspaces still fail quickly while legitimate long synchronous turns are no longer cut at 5m. I do not see a new security surface: this only changes outbound wait-for-response-headers on the existing A2A proxy path and aligns with the established 30m agent/canvas ceiling. Required code CI is green, including CI/all-required, Platform Go, Canvas, handlers, API smoke, and stub lifecycle. Remaining red contexts are governance/SOP/reserved-path ceremony plus the known advisory real-image MiniMax lane, not this diff.
/sop-ack
SOP checklist reviewer acks for #2749 (ResponseHeaderTimeout 5m -> 30m, no security surface, test pins default):
/sop-ack comprehensive-testing CI code checks include Platform Go/all-required green; test pins 30m default.
/sop-ack local-postgres-e2e N/A for this change: no Postgres/backend DB surface; one transport timeout default + unit pin.
/sop-ack staging-smoke N/A/pre-merge: no deploy/runtime data path beyond A2A proxy timeout default; advisory live-image red is independent MiniMax-account outage per dispatch.
/sop-ack root-cause Root cause is Go Transport.ResponseHeaderTimeout 5m aborting long synchronous no-early-header turns.
/sop-ack five-axis-review CR2 completed 5-axis review in APPROVED #11430.
/sop-ack no-backwards-compat No compatibility shim or dead code; env override A2A_PROXY_RESPONSE_HEADER_TIMEOUT remains intact.
/sop-ack memory-consulted Applied core#2723 timeout-class context: #2727 idle raise + runtime#130 heartbeat + this ResponseHeaderTimeout cut.