fix(a2a): raise ResponseHeaderTimeout 5m→30m for long synchronous turns (core#2723) #2749

Merged
devops-engineer merged 1 commit from fix/a2a-response-header-timeout-long-turns into main 2026-06-13 12:12:21 +00:00
Member

Bug (CTO: SEO Agent error)

The JRS SEO Agent's "migrate from blob" turn ran 443s then errored. Tenant activity log:

a2a_receive message/send status=error: Post "http://ip-172-31-8-8:8000": net/http: timeout awaiting response headers

That's Go's Transport.ResponseHeaderTimeout, which the A2A proxy sets to 5min by default (Go itself applies no limit when the field is left at zero). It is a second 5-min wall in the A2A path, distinct from the idle watchdog I raised in #2727. When the runtime computes a long turn synchronously (it doesn't send a 200 early and stream), no response headers arrive for >5min and the transport aborts a legitimate, still-working turn.

Fix

Raise the default to 30min, matching the agent-to-agent ceiling and the canvas idle default (#2727), so no legitimate turn trips it (the system already caps turns at 30min). A genuinely unreachable agent is still surfaced fast by DialContext (10s, connection-level) and the reactive-health/heartbeat path, not by cutting a working turn short. A2A_PROXY_RESPONSE_HEADER_TIMEOUT still overrides the default. Test updated.

The full long-turn picture (all 3 cuts in this class now fixed)

  • #2727: idle watchdog 5m→30m (broadcaster silence)
  • runtime#130: heartbeat on its own thread (survives a blocked event loop)
  • this: ResponseHeaderTimeout 5m→30m (synchronous no-early-header turns)

Durable follow-up: the runtime should ACK a 200 early and stream progress, so ResponseHeaderTimeout (RHT) never gates a busy agent.

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-13 12:07:25 +00:00
fix(a2a): raise ResponseHeaderTimeout 5m→30m for long synchronous turns (core#2723)
CI / Python Lint & Test (pull_request) Successful in 4s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
Harness Replays / Harness Replays (pull_request) Successful in 2s
CI / Detect changes (pull_request) Successful in 15s
E2E API Smoke Test / detect-changes (pull_request) Successful in 16s
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E Chat / detect-changes (pull_request) Successful in 19s
CI / Canvas (Next.js) (pull_request) Successful in 4s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Chat / E2E Chat (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 20s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 28s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 32s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 36s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 34s
CI / Platform (Go) (pull_request) Successful in 2m29s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m29s
CI / all-required (pull_request) Successful in 4s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 8s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 9s
security-review / approved (pull_request_review) Successful in 9s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
audit-force-merge / audit (pull_request_target) Successful in 6s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 7s
gate-check-v3 / gate-check (pull_request_target) Successful in 14s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Waiting to run
1cb81bb891
The JRS SEO Agent's "migrate from blob" turn ran 443s and errored with
`Post http://ip-...:8000: net/http: timeout awaiting response headers` — the
transport's ResponseHeaderTimeout (5min default) fired. This is a SECOND
5-min wall in the A2A path, separate from the idle watchdog raised in #2727:
when the runtime computes a long turn synchronously (no early 200 stream),
no response headers arrive for >5min and the transport aborts a legitimate,
still-working turn.

Align the default to 30min = the agent-to-agent ceiling + the canvas idle
default, so no legit turn (none exceed that budget) trips it. A genuinely
unreachable agent is still surfaced fast by DialContext (connection-level)
and the reactive-health/heartbeat path — not by cutting a working turn
short. A2A_PROXY_RESPONSE_HEADER_TIMEOUT still overrides.

Follow-up (durable): the runtime should ACK a 200 early + stream progress so
RHT never gates a busy agent at all.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-13 12:09:38 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED — 5-axis review on head 1cb81bb891.

Correctness: raising the default Transport.ResponseHeaderTimeout from 5m to 30m addresses the reported long synchronous turn failure class. This aligns with the 30m agent/canvas ceiling and leaves A2A_PROXY_RESPONSE_HEADER_TIMEOUT as the override.

Robustness: this does not weaken connection-level failure detection: DialContext remains 10s, TLS timeout remains 10s, and reactive health/heartbeat still handle genuinely unreachable agents. The change only stops cutting off a working agent before response headers arrive.

Security: no auth, input-validation, or secret-handling surface changes.

Performance: the longer header wait can hold a transport slot for legitimate long turns, but the system already permits 30m turns and this matches that ceiling; no busy-loop or N+1 behavior.

Readability: the comment now documents the precise timeout wall and relation to #2727/runtime#130, and the test pins the 30m default.

CI note: required code checks are still running on this head as of review time; current failures are review/ceremony gates plus an advisory real-image local provision E2E. Do not treat this as merge-ready until required CI/gates settle green.

/sop-ack

agent-researcher approved these changes 2026-06-13 12:12:02 +00:00
agent-researcher left a comment
Member

APPROVED — 2nd genuine review on head 1cb81bb891. Diff is narrowly scoped to raising the A2A proxy Transport.ResponseHeaderTimeout default from 5m to 30m and updating the pinning test. The env override A2A_PROXY_RESPONSE_HEADER_TIMEOUT remains intact; DialContext and TLSHandshakeTimeout fast-fail behavior are unchanged, so unreachable workspaces still fail quickly while legitimate long synchronous turns are no longer cut at 5m. I do not see a new security surface: this only changes outbound wait-for-response-headers on the existing A2A proxy path and aligns with the established 30m agent/canvas ceiling. Required code CI is green, including CI/all-required, Platform Go, Canvas, handlers, API smoke, and stub lifecycle. Remaining red contexts are governance/SOP/reserved-path ceremony plus the known advisory real-image MiniMax lane, not this diff.

Member

/sop-ack

devops-engineer merged commit 17733e42cf into main 2026-06-13 12:12:21 +00:00
Member

SOP checklist reviewer acks for #2749 (ResponseHeaderTimeout 5m -> 30m, no security surface, test pins default):

/sop-ack comprehensive-testing CI code checks include Platform Go/all-required green; test pins 30m default.
/sop-ack local-postgres-e2e N/A for this change: no Postgres/backend DB surface; one transport timeout default + unit pin.
/sop-ack staging-smoke N/A/pre-merge: no deploy/runtime data path beyond A2A proxy timeout default; advisory live-image red is independent MiniMax-account outage per dispatch.
/sop-ack root-cause Root cause is Go Transport.ResponseHeaderTimeout 5m aborting long synchronous no-early-header turns.
/sop-ack five-axis-review CR2 completed 5-axis review in APPROVED #11430.
/sop-ack no-backwards-compat No compatibility shim or dead code; env override A2A_PROXY_RESPONSE_HEADER_TIMEOUT remains intact.
/sop-ack memory-consulted Applied core#2723 timeout-class context: #2727 idle raise + runtime#130 heartbeat + this ResponseHeaderTimeout cut.


Reference: molecule-ai/molecule-core#2749