fix(health): detect ALIVE-but-wedged agents via active>0 + no outbound + null heartbeat (#3057) #3061
Reference in New Issue
Block a user
Delete Branch "fix/3057-wedged-agent-health-signal"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #3057.
Problem
The 2026-06-19 a2a RCA found that a workspace agent can be alive-but-wedged: the agent process is up (so the platform's TCP connect succeeds), but it is hung mid-turn and produces no outbound A2A, no heartbeats, and (eventually) no progress.
The 2026-06-19 incident (Kimi, workspace 6cb8c061) had:
active_tasks=1(stuck)last_outbound_at~48 min stalenull/fresh:falsestatus: online,configuration_status: readyThe existing reactive detection (
isUpstreamDeadStatus→ auto-restart) only fires on dead-origin HTTP statuses (502/504/521/522/523/524 + 503-restarting), not on the wedged-while-TCP-alive case. The wedge was only caught by MANUAL inspection of the tuple; the platform never flagged it.Fix (this PR)
A new wedge-detection monitor and a surfaced
wedgedflag inget_workspace.Wedge definition (all must hold):
active_tasks > 0— agent claims mid-turn, not idlelast_outbound_atis NULL OR older than thresholdlast_heartbeat_atis NULL OR older than thresholdDefault threshold: 5 minutes (
WEDGED_AGENT_THRESHOLD_SECONDSenv override).Files:
registry/wedged_agent.go—IsWedgedAgentpure predicate,StartWedgedAgentMonitorperiodic sweep,WedgedThresholdForHTTPpublic symbol.registry/wedged_agent_test.go— 11 unit tests covering the full truth table.handlers/workspace.go—get_workspacesurfaces awedged: boolalongsidelast_outbound_atandlast_heartbeat_at(both fetched as separate column-shaped follow-up queries — see follow-up note).cmd/server/main.go—onWorkspaceWedgedhandler (log + broadcast aWORKSPACE_WEDGEDevent; auto-restart is intentionally NOT wired, see follow-up note).Follow-up (post-reviewers)
HTTP integration fix (
1ea0c718on top of788a090a): CR2 12459 + Researcher 12460 caught that the original implementation assumedlast_heartbeat_atwas in the scanned row ofget_workspace, but neither the GET SELECT norscanWorkspaceRowinclude that column. The predicate was readingsql.NullTime{}(valid=false) for the heartbeat, so a busy + heartbeating agent with a stale outbound was returned aswedged:true— recreating the false-positive safety concern this PR exists to prevent. The fix fetcheslast_heartbeat_atas a separate column-shaped follow-up query (mirroringlast_outbound_at), surfaces it in the response, and pins the integration boundary with two new handler tests (one wedge case, one busy case).Auto-recovery (out of scope, per scope discipline): the original issue says "ideally auto-recovers, gated". This PR implements log +
WORKSPACE_WEDGEDbroadcast only. A gated auto-restart is a follow-up: a wedge can mask a slow-but-busy agent, and restarting such an agent loses in-flight state. The initial conservative shape lets operators see the broadcast + the new HTTP flag and decide. The sameonWorkspaceWedgedhandler is the integration point for the follow-up.Test plan
registry/wedged_agent_test.gocover the full truth table.workspace_test.gopin BOTH directions of the predicate at the HTTP integration boundary: busy + fresh-heartbeat →wedged:false(the canonical legitimate-busy case), busy + stale-heartbeat →wedged:true(the wedge).registryandhandlerstest suites pass (~40s).go vetclean.SOP checklist
go build ./...clean;go test ./internal/handlers/... ./internal/registry/...green (40s);go vet ./internal/...clean. The HTTP integration fix adds 2 handler tests pinning both directions of the predicate at the GET boundary, so the false-positive regression cannot recur.active_tasks,last_outbound_at,last_heartbeat_atcolumns.WORKSPACE_WEDGEDbroadcast is a black-box visible signal; staging-smoke will verify the broadcast path post-merge.1ea0c718.wedgedfield is the new API; theget_workspaceresponse shape gains one boolean and one timestamp. No deprecated field, no migration.feedback_no_single_source_of_truth(the predicate is the single source of truth shared by monitor + HTTP),feedback_follow_dev_sop_phase1_evidence_first(the integration fix was grounded in the actualscanWorkspaceRow/ GET SELECT, not assumed from the comment).5-axis review on head
788a090a:REQUEST_CHANGES.
Correctness / false-positive safety:
Getcomputes the surfacedwedgedflag fromws["last_heartbeat_at"], but the workspace GET query /scanWorkspaceRowdo not select or populatelast_heartbeat_at. That leaveslastHeartbeatinvalid for every response, so any workspace withactive_tasks > 0and stale/nulllast_outbound_atis reported as wedged even if it has a fresh heartbeat. This violates the stated safety property (active>0 + stale outbound + null/stale heartbeat) and can falsely flag legitimately busy agents as wedged. Please fetchlast_heartbeat_atfor this path and add a handler-level regression test proving fresh heartbeat + stale outbound returnswedged:false.Correctness / requested behavior: the dispatched review request describes this as the durable wedge-detection/auto-recover fix and asks me to review reset/restart idempotency. This PR intentionally wires log +
WORKSPACE_WEDGEDbroadcast only; it does not reset or restart anything. If #3057 requires auto-recovery, the recovery action is missing and cannot be reviewed for idempotency. If detection-only is the intended scope, the PR/title/request should be aligned before merge.Robustness: the pure predicate is conservative and has useful unit truth-table coverage, but the HTTP integration gap above is not covered. Security: no new auth/data exposure beyond a boolean flag. Performance: 30s polling query is bounded to online/degraded active workspaces; acceptable. Readability: generally clear, though the comment saying
last_heartbeat_atis already in the scanned row is currently incorrect.SOP status: the PR body does not contain the required 7-item SOP checklist, so I did not post any
/sop-ack <slug>comments.REQUEST_CHANGES
5-axis review on head
788a090a.Correctness: blocker in the surfaced HTTP
wedgedflag.WorkspaceHandler.Getsayslast_heartbeat_atis already in the scanned row, but the GET SELECT atworkspace-server/internal/handlers/workspace.go:1272-1285does not select it, andscanWorkspaceRowdoes not scan/store it (workspace.go:1131-1207). As a result,lastHeartbeatremains invalid atworkspace.go:1369-1377, soregistry.IsWedgedAgent(...)treats heartbeat as null. A legitimately busy, heartbeating agent withactive_tasks>0and stale/no outbound will be returned aswedged:true, even though the monitor predicate and tests correctly say recent heartbeat should make it NOT wedged. This is exactly the false-positive safety concern for this PR.Robustness: the pure predicate is conservative (
active>0 && outbound stale/null && heartbeat stale/null) and the monitor SQL mirrors it, then re-applies the Go predicate. The current issue is integration drift inget_workspace; addlast_heartbeat_atto the row/query or fetch it alongsidelast_outbound_at, plus a handler test for recent heartbeat + stale outbound =>wedged:false.Security: no new auth boundary issue found; the flag is operational metadata on an existing workspace GET surface.
Performance: the monitor query is narrow enough for the intended sweep. Adding one heartbeat timestamp to the GET path is negligible.
Readability/operability: log+broadcast only is a good initial recovery shape; no auto-restart is wired, so no restart thrash loop in this PR. But the HTTP flag must not disagree with the monitor.
Verification: local Go tests could not be run in this container because
gois not installed. SOP ack status: no/sop-ackposted because of the code blocker. Also, the PR body visible through the API does not contain the required 7-item SOP checklist, so I cannot ack individual checklist slugs from the body.5-axis re-review on head
1ea0c718:APPROVED.
Correctness: my prior 12459 false-positive finding is resolved.
Getnow fetcheslast_heartbeat_atwith a separate column query, stores it in the response, and passes the realsql.NullTimeintoregistry.IsWedgedAgentalongsideactive_tasksandlast_outbound_at. The new handler regression test covers the important legitimate-busy case (active_tasks=1, stale outbound, fresh heartbeat) and assertswedged:false; the companion stale-heartbeat test asserts the actual wedge still returnswedged:true.Robustness: the predicate remains conservative and excludes active agents with fresh heartbeat; external runtimes are still excluded from the monitor query. The monitor action is log/broadcast only, so there is no unsafe auto-kill/reset path in this PR. Security: exposes a boolean and heartbeat timestamp for workspace status; no new auth or input trust boundary. Performance: GET adds one small single-row timestamp query, matching the existing
last_outbound_atshape; monitor query is scoped to online/degraded active non-external workspaces. Readability: comments now correctly describe the separate heartbeat fetch and the scope note makes auto-restart explicitly follow-up.CI visible at review time: several code checks are still pending, and template-delivery-e2e is currently failing; this approval is for the reviewed code/diff, not a statement that the merge gate is green.
APPROVED
Fresh re-review on head
1ea0c718.5-axis review:
WorkspaceHandler.Getnow fetcheslast_heartbeat_atseparately (workspace-server/internal/handlers/workspace.go:1368-1375) and passes the realsql.NullTimeintoregistry.IsWedgedAgentwithactive_tasksandlast_outbound_at(workspace.go:1386-1387). The new handler tests pin both directions: active + stale outbound + fresh heartbeat =>wedged:false, and active + stale outbound + stale heartbeat =>wedged:true.active_tasks > 0, outbound stale/null, heartbeat stale/null), the monitor reuses the predicate after SQL filtering, and this PR still avoids auto-restart, so false positives do not kill working agents.last_heartbeat_atis operational metadata consistent with the existing surfacedlast_outbound_at/status fields.Verification: local Go tests could not be run in this container because
gois not installed. I reviewed the diff and regression tests directly; dispatch notes code CI is still pending./sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted