fix(a2a): raise canvas idle watchdog 5m→30m for long blocking turns (core#2723) #2727

Merged
devops-engineer merged 1 commit from fix/a2a-idle-timeout-raise into main 2026-06-13 07:56:12 +00:00
Member

Deployable mitigation for the 300s 'tool chain lost' (#2723)

The canvas A2A turn is cancelled after idleTimeoutDuration of broadcaster silence (applyIdleTimeout). The 30s WORKSPACE_HEARTBEAT normally resets it well before 5min — so the window only bites when the heartbeat stalls, which happens when the runtime's asyncio heartbeat task is starved by a long blocking tool call (the CTO's bulk image migration). The turn got cancelled at ~300s mid-work.
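To make the mechanism concrete, here is a minimal sketch of an idle watchdog of this kind. It is not the actual `applyIdleTimeout` implementation: `watchIdle`, `simulateTurn`, and the `activity` channel are hypothetical names, and the millisecond durations stand in for the real 30s heartbeat and multi-minute window. Each signal on `activity` restarts the idle window; the context is cancelled only once the signals stop.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// watchIdle returns a context that is cancelled once `idle` elapses with no
// signal on `activity`. Hypothetical stand-in for the real applyIdleTimeout.
func watchIdle(parent context.Context, activity <-chan struct{}, idle time.Duration) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		timer := time.NewTimer(idle)
		defer timer.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case _, ok := <-activity:
				if !ok {
					activity = nil // closed channel: stop selecting on it
					continue
				}
				// Any broadcast (e.g. a heartbeat) restarts the idle window.
				if !timer.Stop() {
					select {
					case <-timer.C: // drain a tick that already fired
					default:
					}
				}
				timer.Reset(idle)
			case <-timer.C:
				cancel() // silence exceeded the window: cancel the turn
				return
			}
		}
	}()
	return ctx, cancel
}

// simulateTurn sends `beats` heartbeats gapMs apart, then lets the heartbeat
// stall. It reports whether the turn was alive while heartbeats flowed and
// whether it was cancelled once they stopped.
func simulateTurn(beats, gapMs, idleMs int) (aliveWhileBeating, cancelledAfterStall bool) {
	idle := time.Duration(idleMs) * time.Millisecond
	activity := make(chan struct{})
	ctx, cancel := watchIdle(context.Background(), activity, idle)
	defer cancel()

	for i := 0; i < beats; i++ {
		time.Sleep(time.Duration(gapMs) * time.Millisecond)
		select {
		case activity <- struct{}{}:
		case <-ctx.Done():
		}
	}
	aliveWhileBeating = ctx.Err() == nil

	select {
	case <-ctx.Done():
		cancelledAfterStall = true
	case <-time.After(5 * idle): // generous upper bound for the demo
	}
	return
}

func main() {
	alive, cancelled := simulateTurn(5, 10, 200)
	fmt.Println("alive while heartbeats flow:", alive)
	fmt.Println("cancelled after heartbeat stall:", cancelled)
}
```

In this sketch a starved heartbeat looks exactly like a dead agent: the watchdog sees only silence. That is why widening the window helps a long blocking step survive without fixing the underlying starvation.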

This PR raises the default 5m→30m — the deployable safety margin so a multi-minute blocking step survives. 30m matches the agent-to-agent ceiling; A2A_IDLE_TIMEOUT_SECONDS still tunes per-deploy.
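For illustration, a sketch of how a seconds-valued override with a 30m default might be parsed. The function name `parseIdleTimeoutSeconds` is hypothetical, and falling back to the default on empty, non-numeric, or non-positive input is an assumption; the real parser may treat those cases differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// defaultIdleTimeout mirrors the 30m default described in this PR.
const defaultIdleTimeout = 30 * time.Minute

// parseIdleTimeoutSeconds converts a raw env value (whole seconds) into a
// duration. Hypothetical helper: falling back to the default on empty,
// non-numeric, or non-positive input is an assumption, not confirmed behavior.
func parseIdleTimeoutSeconds(raw string) time.Duration {
	secs, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || secs <= 0 {
		return defaultIdleTimeout
	}
	return time.Duration(secs) * time.Second
}

func main() {
	// Value from the environment (empty when the variable is unset).
	fmt.Println(parseIdleTimeoutSeconds(os.Getenv("A2A_IDLE_TIMEOUT_SECONDS")))

	fmt.Println(parseIdleTimeoutSeconds(""))     // unset: default applies
	fmt.Println(parseIdleTimeoutSeconds("3600")) // longer per-deploy override
	fmt.Println(parseIdleTimeoutSeconds("300"))  // restores the previous 5m window
}
```

Under these assumptions, a deploy that prefers the old behavior could set `A2A_IDLE_TIMEOUT_SECONDS=300`.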

The complete fix is runtime-side (run the heartbeat on an independent daemon thread so it never starves) — root-caused + specced in #2723. This is workspace-server-only so it deploys to tenants via the standard tenant redeploy (no runtime-template roll / tunnel-gap dependency), giving immediate relief while the runtime fix lands.

Test: TestParseIdleTimeoutEnv pins the 30m default + a longer override; existing applyIdleTimeout mechanism tests unchanged.

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-13 07:51:26 +00:00
fix(a2a): raise canvas idle watchdog 5m→30m for long blocking turns (core#2723)
CI / Python Lint & Test (pull_request) Successful in 7s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
Harness Replays / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 14s
Harness Replays / Harness Replays (pull_request) Successful in 2s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 14s
CI / Detect changes (pull_request) Successful in 16s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 19s
reserved-path-review / reserved-path-review (pull_request_target) Failing after 10s
CI / Canvas (Next.js) (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
gate-check-v3 / gate-check (pull_request_target) Failing after 16s
E2E Chat / E2E Chat (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 18s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 32s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 38s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been cancelled
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m14s
CI / Platform (Go) (pull_request) Successful in 3m23s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 8s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 10s
qa-review / approved (pull_request_review) Successful in 10s
CI / all-required (pull_request) Successful in 4s
sop-checklist / all-items-acked (pull_request) acked: 7/7 — body-unfilled: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
sop-checklist / na-declarations (pull_request) N/A: (none)
audit-force-merge / audit (pull_request_target) Successful in 7s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 4m40s
cd3666d75c
The canvas A2A turn is cancelled after `idleTimeoutDuration` of broadcaster
silence. The 30s WORKSPACE_HEARTBEAT normally resets it long before 5min —
so the window only bites when the heartbeat STALLS, which happens when the
runtime's asyncio heartbeat task is starved by a long *blocking* tool call
(e.g. a bulk asset migration). A real long autonomous turn was getting
cancelled at ~300s mid-work ("tool chain lost").

The complete fix is runtime-side (heartbeat on an independent thread —
#2723). This raises the deployable safety margin so a multi-minute blocking
step survives; 30m matches the agent-to-agent absolute ceiling. The canvas
path has no separate ceiling, so this is its only deadline; a genuinely
dead agent is still surfaced by the reactive-health path, not this.
A2A_IDLE_TIMEOUT_SECONDS still tunes per-deploy.

Test: TestParseIdleTimeoutEnv now pins the 30m default + a longer override.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-13 07:55:03 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED on head cd3666d75c669723597aeb85ffe81564058f139f.

Reviewed with the 5-axis lens. The mitigation is deliberately narrow: it raises only the workspace-server canvas/A2A idle watchdog default from 5m to 30m while preserving A2A_IDLE_TIMEOUT_SECONDS override behavior. That matches the stated deployable mitigation: long blocking runtime work can survive heartbeat starvation, while operators can still tune shorter/longer values by env.

Correctness/robustness: the existing applyIdleTimeout mechanism is unchanged; this changes the default ceiling only. The parser test now pins the 30m default and confirms a 30m explicit override parses. Security impact is neutral: no auth, secret, request-body, or permission behavior changes. Performance/resource tradeoff is acceptable as a mitigation because genuinely dead agents are still handled by reactive health rather than this idle timer.

CI note: relevant static/build/test contexts I saw were green, but some staging/local-provision statuses were still pending/cancelled at review time. Merge should wait for required CI/all-required to settle green.

Member

/sop-ack

Member

/sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted

Member

Code review is approved, CI/all-required is green, and explicit SOP acks are posted, but merge is still blocked by sop-checklist / all-items-acked: the PR body is missing the required filled SOP checklist sections (body-unfilled). Please add/fill the 7 body markers: Comprehensive testing performed, Local-postgres E2E run, Staging-smoke verified or pending, Root-cause not symptom, Five-Axis review walked, No backwards-compat shim / dead code added, and Memory consulted.

devops-engineer merged commit 451dd934d4 into main 2026-06-13 07:56:12 +00:00

Reference: molecule-ai/molecule-core#2727