fix(a2a): raise canvas idle watchdog 5m→30m for long blocking turns (core#2723) #2727
Deployable mitigation for the 300s 'tool chain lost' (#2723)

The canvas A2A turn is cancelled after `idleTimeoutDuration` of broadcaster silence (`applyIdleTimeout`). The 30s `WORKSPACE_HEARTBEAT` normally resets it well before 5min, so the window only bites when the heartbeat stalls, which happens when the runtime's asyncio heartbeat task is starved by a long blocking tool call (the CTO's bulk image migration). The turn got cancelled at ~300s mid-work.

This PR raises the default 5m→30m: the deployable safety margin so a multi-minute blocking step survives. 30m matches the agent-to-agent ceiling; `A2A_IDLE_TIMEOUT_SECONDS` still tunes per-deploy.

The complete fix is runtime-side (run the heartbeat on an independent daemon thread so it never starves), root-caused + specced in #2723. This change is workspace-server-only, so it deploys to tenants via the standard tenant redeploy (no runtime-template roll / tunnel-gap dependency), giving immediate relief while the runtime fix lands.

Test: `TestParseIdleTimeoutEnv` pins the 30m default + a longer override; existing `applyIdleTimeout` mechanism tests unchanged.

🤖 Generated with Claude Code
The canvas A2A turn is cancelled after `idleTimeoutDuration` of broadcaster silence. The 30s WORKSPACE_HEARTBEAT normally resets it long before 5min, so the window only bites when the heartbeat STALLS, which happens when the runtime's asyncio heartbeat task is starved by a long *blocking* tool call (e.g. a bulk asset migration). A real long autonomous turn was getting cancelled at ~300s mid-work ("tool chain lost").

The complete fix is runtime-side (heartbeat on an independent thread, #2723). This raises the deployable safety margin so a multi-minute blocking step survives; 30m matches the agent-to-agent absolute ceiling. The canvas path has no separate ceiling, so this is its only deadline; a genuinely dead agent is still surfaced by the reactive-health path, not this. A2A_IDLE_TIMEOUT_SECONDS still tunes per-deploy.

Test: TestParseIdleTimeoutEnv now pins the 30m default + a longer override.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

APPROVED on head `cd3666d75c`.

Reviewed with the 5-axis lens. The mitigation is deliberately narrow: it raises only the workspace-server canvas/A2A idle watchdog default from 5m to 30m while preserving `A2A_IDLE_TIMEOUT_SECONDS` override behavior. That matches the stated deployable mitigation: long blocking runtime work can survive heartbeat starvation, while operators can still tune shorter/longer values by env.

Correctness/robustness: the existing `applyIdleTimeout` mechanism is unchanged; this changes the default ceiling only. The parser test now pins the 30m default and confirms a 30m explicit override parses. Security impact is neutral: no auth, secret, request-body, or permission behavior changes. Performance/resource tradeoff is acceptable as a mitigation because genuinely dead agents are still handled by reactive health rather than this idle timer.

CI note: relevant static/build/test contexts I saw were green, but some staging/local-provision statuses were still pending/cancelled at review time. Merge should wait for required CI/all-required to settle green.
/sop-ack
/sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted
Code review is approved, CI/all-required is green, and explicit SOP acks are posted, but merge is still blocked by `sop-checklist / all-items-acked`: the PR body is missing the required filled SOP checklist sections (`body-unfilled`). Please add/fill the 7 body markers: `Comprehensive testing performed`, `Local-postgres E2E run`, `Staging-smoke verified or pending`, `Root-cause not symptom`, `Five-Axis review walked`, `No backwards-compat shim / dead code added`, and `Memory consulted`.