300s 'tool chain lost': canvas A2A turn killed by 5-min idle watchdog (long autonomous tasks) #2723

Open
opened 2026-06-13 07:28:45 +00:00 by core-devops · 2 comments

300s "tool chain lost": canvas A2A turn killed by the 5-min idle watchdog

Reported (CTO, live JRS SEO Agent): during a long autonomous task (DB/asset migration — download + re-upload 102 images) the chat loses the tool chain after ~300s, even though the agent is still working.

Root cause (diagnosed)

The canvas→agent A2A turn (POST /workspaces/:id/a2a, no X-Timeout) is wrapped by applyIdleTimeout(parent, broadcaster, workspaceID, idle) (workspace-server/internal/handlers/a2a_proxy.go:1097). It subscribes to the workspace SSE stream and cancels the turn after idleTimeoutDuration of broadcaster silence; any event resets the clock. Default idleTimeoutDuration = 5*time.Minute (a2a_proxy.go:990, env A2A_IDLE_TIMEOUT_SECONDS). A long single step that emits no broadcaster events for 5 min (bulk download/upload, a long build) trips the watchdog → ctx cancel → dispatch aborts → the canvas sees the turn drop. The agent process may continue, but the chat-tracked turn is gone.
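
The reset-on-event behaviour described above can be modelled in a few lines. This is only a sketch in Python, not the Go code in a2a_proxy.go: `simulate_turn` and its parameters are hypothetical names, time advances in whole seconds, and the 30-min ceiling is assumed to be measured from the start of the turn.

```python
def simulate_turn(idle_seconds, ceiling_seconds, event_times, turn_seconds):
    """Model of an idle watchdog with an absolute ceiling (illustrative only).

    event_times: seconds at which a broadcaster event arrives.
    Returns (second, reason) when the turn is cancelled, or None if the
    turn finishes within turn_seconds without being cancelled.
    """
    events = set(event_times)
    last_event = 0  # the idle clock starts when the turn starts
    for t in range(1, turn_seconds + 1):
        if t in events:
            last_event = t  # any event resets the idle clock
        if t - last_event >= idle_seconds:
            return t, "idle"  # silence reached the idle window
        if t >= ceiling_seconds:
            return t, "ceiling"  # assumed: ceiling counts from turn start
    return None


if __name__ == "__main__":
    # A silent 400s step against the 5-min default: cancelled by the idle watchdog.
    print(simulate_turn(300, 1800, [], 400))
    # The same step with an event every 30s: the turn completes.
    print(simulate_turn(300, 1800, range(30, 400, 30), 400))
```

The model makes the failure mode explicit: the watchdog does not measure how long the turn has run, only how long the broadcaster has been silent.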

Fix (comprehensive, with tests)

Do all three; (1)+(2) are the core fix, (3) is the durable architecture:

  1. Raise the canvas idle default to ~15 min (keep A2A_IDLE_TIMEOUT_SECONDS override + the 30-min absolute ceiling). Cheap headroom for normal long tasks.
  2. Runtime heartbeat during long steps: the agent runtime (claude-code template + the shared a2a bridge) should emit a periodic progress/keepalive event (e.g., every 60–90s) while a tool call is running, so the idle clock resets and genuinely active turns never trip the watchdog. This is the robust fix; without it, any silent step longer than the idle window still dies.
  3. Prefer non-blocking dispatch for canvas sends — return promptly (queued) and deliver the final result via AGENT_MESSAGE over the WS, so the canvas POST isn't holding a multi-minute turn open at all. Track separately if (1)+(2) suffice short-term.
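
For (1), the override parsing that the unit tests would cover might look like the sketch below. It is written in Python for illustration; the real parsing lives in the Go workspace-server and is not reproduced here. `resolve_idle_timeout` is a hypothetical name, and two behaviours are assumptions rather than facts from this issue: invalid or non-positive values fall back to the default, and an override above the 30-min ceiling is clamped to it. Each outcome returns a label so the mechanism can be named in a log line rather than applied silently.

```python
DEFAULT_IDLE_SECONDS = 15 * 60  # the raised default proposed in (1)
CEILING_SECONDS = 30 * 60       # the absolute ceiling named in this issue
ENV_NAME = "A2A_IDLE_TIMEOUT_SECONDS"


def resolve_idle_timeout(env):
    """Return (seconds, mechanism) for the idle window.

    env is any mapping of environment variables (os.environ works).
    Fallback and clamping behaviour are assumptions of this sketch.
    """
    raw = env.get(ENV_NAME, "").strip()
    if not raw:
        return DEFAULT_IDLE_SECONDS, "default"
    try:
        seconds = int(raw)
    except ValueError:
        return DEFAULT_IDLE_SECONDS, f"default (ignored non-integer {ENV_NAME}={raw!r})"
    if seconds <= 0:
        return DEFAULT_IDLE_SECONDS, f"default (ignored non-positive {ENV_NAME}={raw!r})"
    if seconds > CEILING_SECONDS:
        return CEILING_SECONDS, f"clamped to ceiling ({ENV_NAME}={raw!r})"
    return seconds, ENV_NAME


if __name__ == "__main__":
    import os

    seconds, mechanism = resolve_idle_timeout(os.environ)
    print(f"idle timeout: {seconds}s via {mechanism}")
```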

Tests (required, SOP)

  • unit: applyIdleTimeout resets on an event and cancels only after true silence ≥ idle; the raised default; env override parsing.
  • unit (runtime): heartbeat emitted at the interval during a long tool call; stops on completion.
  • e2e (staginge2e): a turn whose only activity is a single >6-min step (no intermediate events) survives with heartbeat on; and (regression) a genuinely-hung agent still times out at the ceiling.
  • No silent caps; name the mechanism in any log.

Refs

a2a_proxy.go:119-160 (timeout model), :966-1090 (idle/ceiling), :1097-1151 (applyIdleTimeout). Distinct from the canvas-side 120s POST timeout in useChatSend.ts (raise/align that too if it can fire before the server idle window).


Refined root cause (the fix is RUNTIME-side, not a workspace-server timeout bump)

The idle watchdog already resets on every broadcaster event for the workspace — including WORKSPACE_HEARTBEAT, which the registry broadcasts every ~30s when the runtime POSTs /heartbeat (see a2a_proxy.go:966-980). So with a healthy 30s heartbeat the 5-min idle window can NEVER trip, even when the agent is silently thinking between tool calls.

Therefore hitting 300s means the heartbeat itself stopped for 5 min. The most likely mechanism: the runtime emits its /heartbeat POST from the same thread/event-loop that executes tool calls, so a long synchronous, blocking tool call (e.g. a single bash step that downloads + re-uploads 102 images, or a long build) starves the heartbeat → registry stops broadcasting WORKSPACE_HEARTBEAT → idle watchdog cancels the dispatch → "tool chain lost" (and the workspace may also flip degraded from missed heartbeats).
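
The suspected starvation is easy to reproduce in isolation. The sketch below is a toy, not the runtime's code: `heartbeat` and `beats_during_step` are hypothetical names, the heartbeat only records timestamps instead of POSTing /heartbeat, and the intervals are shortened to milliseconds so the demo finishes quickly.

```python
import asyncio
import time


async def heartbeat(beats, interval):
    """Stand-in for a heartbeat task: record a timestamp, then sleep."""
    while True:
        beats.append(time.monotonic())
        await asyncio.sleep(interval)


async def beats_during_step(blocking, interval=0.02, step=0.2):
    """Count heartbeats that fire while one tool step of `step` seconds runs."""
    beats = []
    task = asyncio.create_task(heartbeat(beats, interval))
    await asyncio.sleep(0)  # yield once so the first heartbeat fires before the step
    start = time.monotonic()
    if blocking:
        time.sleep(step)  # synchronous call: the loop cannot schedule anything else
    else:
        await asyncio.sleep(step)  # cooperative wait: the loop keeps running tasks
    end = time.monotonic()
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return sum(1 for b in beats if start < b < end)


if __name__ == "__main__":
    print("beats during blocking step:", asyncio.run(beats_during_step(blocking=True)))
    print("beats during awaiting step:", asyncio.run(beats_during_step(blocking=False)))
```

With a blocking step no heartbeat fires for the whole duration of the step, however short the interval; scaled up to a 30s interval and a step longer than 5 min, that is exactly the silence the idle watchdog cancels on.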

So the real fix is in the runtime adapter (molecule-ai-workspace-runtime / workspace/adapter_base.py), not a workspace-server timeout bump:

  1. Emit the heartbeat from an independent background timer/thread that is NOT blocked by tool execution, so /heartbeat keeps firing every ~30s during long synchronous steps. This alone closes the gap (idle keeps resetting).
  2. Optionally also have long-running adapters declare a longer idle_timeout_override in the heartbeat (the workspace-server already honors per-workspace overrides at a2a_proxy.go:1067-1077).

Verify first: check whether the runtime's heartbeat loop is independent of the tool-exec path; if it shares the thread/loop, that's the bug. A workspace-server idle-default bump (5→15 min) is at best a band-aid (a silent step longer than the window still dies) and shouldn't be the primary fix.

Tests: runtime unit test — heartbeat keeps firing while a long blocking tool call runs (mock a >6-min sync call, assert ≥1 heartbeat/30s); e2e — a turn whose only activity is a single >6-min blocking step survives.


Pinpointed in the runtime: heartbeat is starved by a blocked event loop

Confirmed in molecule-ai-workspace-runtime: HeartbeatLoop.start() does self._task = asyncio.create_task(self._loop()) (molecule_runtime/heartbeat.py:209-210), so the heartbeat runs as an asyncio task on the agent's shared event loop, POSTing /heartbeat every interval_seconds (clamped to [5, 300]).

So when a tool call blocks that event loop — a long synchronous/CPU-bound step that doesn't await (bulk file download/upload, a big subprocess run inline, a sync HTTP client, image processing) — the heartbeat task never gets scheduled. /heartbeat stops → registry stops broadcasting WORKSPACE_HEARTBEAT → the workspace-server idle watchdog (5-min broadcaster silence, a2a_proxy.go:1097) cancels the turn → "tool chain lost" at ~300s. The agent process is alive but the chat-tracked turn is gone (and the workspace may flip degraded from missed heartbeats).

Fix (runtime):

  1. Run the heartbeat off the agent's event loop — a dedicated threading.Thread (daemon) with either its own asyncio loop or a plain synchronous requests/httpx.Client POST loop. A separate OS thread keeps firing every ~30s even while the main loop is blocked in a long tool call. This is the robust fix and is self-contained to heartbeat.py.
  2. Complement: ensure long/blocking tool execution runs in an executor (loop.run_in_executor) so the event loop stays responsive — but (1) alone closes the heartbeat gap regardless of how tools are run.
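
A minimal sketch of option (1), assuming a plain synchronous send loop on a daemon thread. `ThreadedHeartbeat`, its `send` callable and `blocked_turn` are hypothetical names for illustration and do not mirror the existing HeartbeatLoop API; in the runtime, `send` would wrap the synchronous /heartbeat POST, which is omitted here so the example runs without a server.

```python
import asyncio
import logging
import threading
import time

log = logging.getLogger("heartbeat")


class ThreadedHeartbeat:
    """Calls `send` every `interval_seconds` from a dedicated daemon thread."""

    def __init__(self, send, interval_seconds):
        self._send = send  # hypothetical callable that performs the heartbeat POST
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="heartbeat", daemon=True
        )

    def start(self):
        self._thread.start()

    def stop(self, timeout=5.0):
        self._stop.set()  # wakes the thread immediately if it is mid-wait
        self._thread.join(timeout)

    def is_running(self):
        return self._thread.is_alive()

    def _run(self):
        while not self._stop.is_set():
            try:
                self._send()
            except Exception:
                # A failed send must not end the loop; log it and retry next tick.
                log.exception("heartbeat send failed")
            self._stop.wait(self._interval)  # sleep, but return early on stop()


async def blocked_turn(seconds):
    """Simulates a tool call that blocks the event loop instead of awaiting."""
    time.sleep(seconds)


if __name__ == "__main__":
    sent = []
    hb = ThreadedHeartbeat(lambda: sent.append(time.monotonic()), interval_seconds=0.02)
    hb.start()
    asyncio.run(blocked_turn(0.2))  # the loop is blocked; the thread is not
    hb.stop()
    print("heartbeats sent while the loop was blocked:", len(sent))
```

Waiting on a `threading.Event` rather than calling `time.sleep` lets `stop()` interrupt the wait, so shutdown does not have to sit out a full interval. A blocking call that holds the GIL for long stretches (some C extensions do) could still delay the thread; that is one reason to keep complement (2) as well.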

Tests: unit — start the heartbeat, then block the main event loop for > interval (e.g. time.sleep on the loop thread) and assert the heartbeat POST still fired (proves thread independence); regression — heartbeat stops cleanly on stop(). e2e — a turn whose only activity is a single >6-min blocking tool call survives (no idle cancel).

This supersedes the "raise the workspace-server idle default" band-aid — the real defect is the starved heartbeat, and it's a small, contained heartbeat.py change. Deploy reaches tenants via a runtime-template roll (note the /admin tunnel-gap, CP#799, for the agent-container refresh path; a full re-provision also adopts it).

Reference: molecule-ai/molecule-core#2723