Molecule AI Dev Engineer B (MiniMax) 788a090a2e fix(health): detect ALIVE-but-wedged agents via active>0 + no outbound + null heartbeat
Closes #3057.

The 2026-06-19 a2a RCA found that a workspace agent can be
'alive-but-wedged': the agent process is up (so the platform's TCP
connect succeeds), but it is hung mid-turn and produces no outbound
A2A, no heartbeats, and (eventually) no progress. In that incident
(Kimi, workspace 6cb8c061) the agent had active_tasks=1 (stuck),
last_outbound_at ~48min stale, and heartbeat null/fresh:false, yet
status was still reported as online. The wedge was caught only by
manual inspection of that tuple; the platform never flagged it.

The existing reactive detection (isUpstreamDeadStatus →
auto-restart) only fires on dead-origin HTTP statuses
(502/504/521/522/523/524 + 503-restarting), not on the
wedged-while-TCP-alive case. We need a separate health signal that
combines (active_tasks>0, last_outbound_at stale, last_heartbeat_at
stale).

Fix: a new wedge-detection monitor and a surfaced wedged flag in
get_workspace.

Files:
  - registry/wedged_agent.go: IsWedgedAgent pure predicate,
    StartWedgedAgentMonitor periodic sweep, DefaultWedgedThreshold
    (5m) with WEDGED_AGENT_THRESHOLD_SECONDS env override, and
    WedgedThresholdForHTTP public symbol so the HTTP flag and the
    monitor use the same threshold.
  - registry/wedged_agent_test.go: 11 unit tests covering the full
    truth table (wedge / busy / idle / heartbeat-only / non-positive
    threshold / active=2 / etc.) plus the threshold boundary.
  - handlers/workspace.go: get_workspace surfaces a 'wedged' boolean
    alongside last_outbound_at, computed via the shared
    IsWedgedAgent predicate so the flag and the monitor can never
    disagree.
  - cmd/server/main.go: onWorkspaceWedged handler that logs and
    broadcasts a WORKSPACE_WEDGED event. Auto-restart is
    intentionally NOT wired: it is a follow-up gated on ops review,
    because a slow-but-busy agent can look wedged and restarting it
    loses in-flight state. StartWedgedAgentMonitor is wired with
    supervised.RunWithRecover, the same contract as
    StartHealthSweep / StartHibernationMonitor.

Tests: 11/11 new tests pass; full registry + handlers test suites
pass (40s); go vet clean.

Follow-up intentionally out of scope: gated auto-restart of wedged
workspaces (operator review needed to confirm a wedge != a slow
turn).
2026-06-19 04:21:49 +00:00