fix(health): detect ALIVE-but-wedged agents (active>0 + stale outbound + null heartbeat) — "online" can lie #3057

Closed
opened 2026-06-19 03:42:38 +00:00 by devops-engineer · 0 comments
Member

RCA source

2026-06-19: Kimi (workspace 6cb8c061) wedged — active_tasks=1 (stuck), last_outbound_at ~48min stale, heartbeat null/fresh:false — but get_workspace reported status: online, configuration_status: ready. a2a delivery to it failed ("proxy a2a error", the symptom). The wedge was only caught by MANUAL inspection of (active>0 + no-outbound + null-heartbeat) and fixed by a MANUAL restart; the platform never flagged it.

Defect

The status: online flag is stale and does not detect the alive-but-wedged case (the agent process is up — so CF gets a TCP connect — but it is hung mid-turn / not heartbeating / producing no outbound). The existing reactive detection (isUpstreamDeadStatus → auto-restart) only fires on dead-origin HTTP statuses (502/521/522/524), not on this wedged-while-TCP-alive case (524-adjacent).

Fix

Add a health signal that flags (and ideally auto-recovers, gated) a workspace when: active_tasks > 0 AND last_outbound_at older than a threshold AND heartbeat null/stale — instead of trusting status=online. Surface it in check_remote_agent_freshness and the watchdog so it does not require a human eyeballing the tuple.

Impact

Wedged agents (and their failing a2a delivery) get detected + recovered automatically, not by manual restart. Filed from the 2026-06-19 a2a RCA (CEO-Assistant driver).

## RCA source 2026-06-19: Kimi (workspace 6cb8c061) wedged — `active_tasks=1` (stuck), `last_outbound_at` ~48min stale, heartbeat `null`/`fresh:false` — but `get_workspace` reported `status: online`, `configuration_status: ready`. a2a delivery to it failed ("proxy a2a error", the symptom). The wedge was only caught by MANUAL inspection of (active>0 + no-outbound + null-heartbeat) and fixed by a MANUAL restart; the platform never flagged it. ## Defect The `status: online` flag is stale and does not detect the **alive-but-wedged** case (the agent process is up — so CF gets a TCP connect — but it is hung mid-turn / not heartbeating / producing no outbound). The existing reactive detection (`isUpstreamDeadStatus` → auto-restart) only fires on dead-origin HTTP statuses (502/521/522/524), not on this wedged-while-TCP-alive case (524-adjacent). ## Fix Add a health signal that flags (and ideally auto-recovers, gated) a workspace when: `active_tasks > 0` AND `last_outbound_at` older than a threshold AND heartbeat null/stale — instead of trusting `status=online`. Surface it in `check_remote_agent_freshness` and the watchdog so it does not require a human eyeballing the tuple. ## Impact Wedged agents (and their failing a2a delivery) get detected + recovered automatically, not by manual restart. Filed from the 2026-06-19 a2a RCA (CEO-Assistant driver).
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#3057