a2a-proxy ResponseHeaderTimeout is a hardcoded 60s — too short for Opus turns → ~300/hr timeout awaiting response headers on the leads #310

Closed
opened 2026-05-10 12:47:59 +00:00 by hongming-pc2 · 2 comments
Owner

Symptom

Since the dev team moved to Opus leads and the runtime/platform refresh (activity ~2× → ~2100 acts/30m), the platform's a2a-proxy logs ~300 `Post "http://ws-…:8000": net/http: timeout awaiting response headers` errors per hour, almost all on the leads (Dev Lead ~50/h, Core Platform Lead ~50/h, Infra Lead ~36/h, …). In other words, ~6% of all team activity is a timed-out delegation. On top of that there are another ~135/h `agent busy — adapter handles retry` messages and a handful of `unexpected response shape … {'message': 'workspace agent busy'}` errors (the runtime's SSOT response parser doesn't recognize that busy-shape envelope).

Root cause

internal/handlers/a2a_proxy.go — the proxy's http.Client has Transport.ResponseHeaderTimeout: 60 * time.Second (hardcoded). An Opus agent turn (big context + tool calls + the delegate_task round-trips it does internally) routinely exceeds 60s, so the proxy gives up waiting for the target workspace's response headers and surfaces a timeout. The work isn't lost (the busy workspace finishes its turn; the sender gets a queued/error envelope and retries), but the noise is high and some delegations get double-attempted.

Fix

  1. Bump ResponseHeaderTimeout to ~180s, and/or make it env-configurable (A2A_PROXY_RESPONSE_HEADER_TIMEOUT / honor an existing knob) — Opus turns fit comfortably in 180s. The X-Timeout caller header (a2a_proxy.go ~line 241) bounds the absolute request; this is just the header-arrival ceiling, which should be generous.
  2. Minor: teach the runtime's SSOT response parser (molecule_runtime/a2a_response.py) to recognize {'message': 'workspace agent busy'} as a transient/retry shape instead of unexpected response shape (small, separate).
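
A minimal sketch of item 2, assuming the parser receives the decoded JSON body as a Python object. The function name `classify_envelope`, the return labels, and the `result` key for a successful reply are hypothetical and are not taken from `molecule_runtime/a2a_response.py`; only the `{'message': 'workspace agent busy'}` shape comes from the logs quoted in this issue.

```python
from typing import Any

# The busy text as it appears in the logged envelope.
BUSY_MESSAGE = "workspace agent busy"


def classify_envelope(payload: Any) -> str:
    """Classify a decoded response body.

    Returns "retry" for the transient busy envelope, "ok" for a body that
    carries a result (hypothetical key), and "unexpected" for anything else.
    """
    if isinstance(payload, dict):
        message = payload.get("message")
        if isinstance(message, str) and message.strip().lower() == BUSY_MESSAGE:
            return "retry"
        if "result" in payload:  # assumed success shape, for illustration only
            return "ok"
    return "unexpected"


if __name__ == "__main__":
    print(classify_envelope({"message": "workspace agent busy"}))
    print(classify_envelope({"result": {"text": "done"}}))
    print(classify_envelope({"message": "something else"}))
```

Returning a distinct retry label would let the caller back off and retry without logging `unexpected response shape` for a condition that is already known to be transient.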

Low-urgency (team is functional, just noisy) but it inflates the error-log signal-to-noise and causes redundant delegation attempts under the new load. Related: the dev-team monitor flags this each cycle.

🤖 Generated with Claude Code

fullstack-engineer self-assigned this 2026-05-10 13:16:32 +00:00
Member

[triage-operator] I-1..I-6 triage

I-1 (Understand): a2a-proxy ResponseHeaderTimeout hardcoded at 60s is too short for Opus leads — ~300 timeout errors per hour. Fix: increase to 180s.

I-2 (PR?): Yes. PR #331 (core-be, +50/-6, main base) and PR #322 (fullstack-engineer, +475/-46, staging base). #331 is the cleaner fix.

I-3 (Severity): MEDIUM — production impact (~6% delegation failure rate on leads), but not data-loss or security.

I-4 (Owner): core-be (fix), Dev Lead (approval).

I-5 (Milestone): Close once PR #331 merges.

I-6 (Acceptance criteria): ResponseHeaderTimeout increased to ≥120s. No regressions on short-turnaround delegations.

Recommendation: Label tier:medium. Close this issue as resolved by PR #331.

Author
Owner

Field measurement post-deploy — #310 is doing its job, but there's a residual class

Rebuilt platform image from main@108b9a54 (this PR + #294 included), deployed on PC2 dev-team host (2026-05-11 04:10 UTC), measured timeout rate before/after:

| Window | Timeouts/min |
|---|---|
| Pre-#310 (60s default) | 8/min (240/30min) |
| Post-#310 (180s default, env-configurable) | 3.2/min (16/5min) |
So #310 cuts the rate by ~60% — confirms the bump works for the genuinely-recoverable bucket (turns that needed 60-180s).
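
The per-minute rates and the ~60% figure follow directly from the raw counts in the measurement (240 timeouts in 30 minutes before, 16 in 5 minutes after). A quick check, with illustrative names only:

```python
def per_minute(count: int, window_min: float) -> float:
    """Convert a raw timeout count over a window into a per-minute rate."""
    return count / window_min


pre_rate = per_minute(240, 30)   # pre-#310 window
post_rate = per_minute(16, 5)    # post-#310 window

# Fraction of the original rate that disappeared after the bump.
reduction = (pre_rate - post_rate) / pre_rate
residual = post_rate / pre_rate

if __name__ == "__main__":
    print(f"pre={pre_rate:.1f}/min post={post_rate:.1f}/min")
    print(f"reduction={reduction:.0%} residual={residual:.0%}")
```

Note that the two windows differ in length (30 minutes versus 5 minutes), so the post-deploy rate rests on a much smaller sample.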

Residual ~40% bucket — root cause is architectural, not a higher timeout:

The remaining timeouts cluster on the same handful of workspaces (ws-8dbf4813-8b7, ws-4b3b7a49-3fc, ws-924b65d7-32a, ws-716e77e3-131 — all model=opus leads doing real agent work). Repro pattern:

  • Lead is mid-Opus-turn, holding _run_lock
  • Platform proxies a child→lead delegate_task POST to the lead's :8000
  • Lead's execute() waits for _run_lock (held by the in-flight Opus turn)
  • Opus turn takes >180s (large prompt + many tool calls)
  • Platform's Transport.ResponseHeaderTimeout (180s) fires → "timeout awaiting response headers"
  • Workspace eventually finishes the in-flight turn, picks up the queued delegate, sends 200 — but the platform client has already given up; the response is dropped
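
The sequence above can be reproduced in miniature with an `asyncio.Lock` standing in for `_run_lock`. This is a scaled-down simulation under assumed names (`simulate`, `in_flight_turn`, `handle_delegate`), not the workspace's actual code; the durations are shrunk to fractions of a second so it runs quickly.

```python
import asyncio
from typing import Tuple


async def simulate(turn_s: float, header_timeout_s: float) -> Tuple[str, int]:
    """Return (what the proxy saw, status the workspace eventually produced)."""
    run_lock = asyncio.Lock()  # stand-in for the workspace's _run_lock

    async def in_flight_turn() -> None:
        # The lead's current turn holds the lock for its whole duration.
        async with run_lock:
            await asyncio.sleep(turn_s)

    async def handle_delegate() -> int:
        # The incoming delegate request cannot start until the lock is free.
        async with run_lock:
            return 200

    turn = asyncio.create_task(in_flight_turn())
    await asyncio.sleep(0)  # let the in-flight turn grab the lock first

    delegate = asyncio.create_task(handle_delegate())
    # The proxy waits only header_timeout_s for the response to start.
    done, _pending = await asyncio.wait({delegate}, timeout=header_timeout_s)
    proxy_view = "delivered" if done else "timeout awaiting response headers"

    # The workspace still finishes and answers, whether or not anyone listens.
    workspace_status = await delegate
    await turn
    return proxy_view, workspace_status


if __name__ == "__main__":
    print(asyncio.run(simulate(turn_s=0.2, header_timeout_s=0.05)))
    print(asyncio.run(simulate(turn_s=0.02, header_timeout_s=0.2)))
```

In the first call the turn outlasts the header timeout, so the proxy reports a timeout even though the workspace later produces a 200; in the second the turn finishes inside the window and the response is delivered.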

Bumping timeout further (300s? 600s?) trades faster-failure for longer-held connections. The structural fix is the durable async + poll path that a2a_tools_delegation._delegate_sync_via_polling already implements but isn't the default:

```python
# a2a_tools_delegation.py:60
_SYNC_POLL_BUDGET_S = float(os.environ.get("DELEGATION_TIMEOUT", "300.0"))
```

Setting DELEGATION_SYNC_VIA_INBOX=1 switches the workspace to that path. Suggestion: flip that env to default-on once it's validated. This sidesteps the 180s proxy ceiling entirely by POST /delegate (202 async) + poll /delegations. Probably worth a separate RFC (call it "RFC: Default DELEGATION_SYNC_VIA_INBOX=1 — remove proxy 180s ceiling from sync delegation latency budget").
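
A rough sketch of the submit-then-poll shape, assuming `POST /delegate` returns an id and polling `/delegations` yields nothing until the delegation has finished. The function `delegate_via_polling` and its injected `submit`/`fetch` callables are hypothetical stand-ins for the HTTP calls; this is not the body of `_delegate_sync_via_polling`, and the 2s poll interval is an arbitrary choice.

```python
import time
from typing import Callable, Optional


def delegate_via_polling(
    submit: Callable[[], str],
    fetch: Callable[[str], Optional[dict]],
    budget_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Submit a delegation, then poll until a result arrives or the budget runs out.

    submit: stands in for POST /delegate (202 async); returns a delegation id.
    fetch:  stands in for polling /delegations; returns None while pending.
    """
    delegation_id = submit()
    deadline = clock() + budget_s
    while True:
        result = fetch(delegation_id)
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"delegation {delegation_id} exceeded {budget_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Fake peer that is still busy for the first two polls.
    replies = iter([None, None, {"status": "completed"}])
    print(delegate_via_polling(lambda: "d-1", lambda _id: next(replies),
                               sleep=lambda _s: None))
```

Because each poll is a short request, no single HTTP call has to stay open for the length of an Opus turn; only the overall budget bounds the wait.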

This is also related to molecule-core#354 (claude-code adapter no auto-resume after async A2A delegations) — both share the "long-running peer + proxy gives up" failure shape.

#310 itself is done; this is a follow-up class. Filing as a comment here so the trail is visible. Will start a separate issue if needed.

— hongming-pc2 (field measurement post-rebuild + redeploy)

Reference: molecule-ai/molecule-core#310