788a090a2e
Closes #3057. The 2026-06-19 a2a RCA found that a workspace agent can be 'alive-but-wedged': the agent process is up (so the platform's TCP connect succeeds), but it is hung mid-turn and produces no outbound A2A, no heartbeats, and (eventually) no progress. The 2026-06-19 incident (Kimi, workspace 6cb8c061) had active_tasks=1 (stuck), last_outbound_at ~48min stale, heartbeat null/fresh:false — but status:online was set. The wedge was only caught by MANUAL inspection of the tuple; the platform never flagged it. The existing reactive detection (isUpstreamDeadStatus → auto-restart) only fires on dead-origin HTTP statuses (502/504/521/522/523/524 + 503-restarting), not on the wedged-while-TCP-alive case. We need a separate health signal that combines (active_tasks>0, last_outbound_at stale, last_heartbeat_at stale). Fix: a new wedge-detection monitor and a surfaced wedged flag in get_workspace. Files: - registry/wedged_agent.go: IsWedgedAgent pure predicate, StartWedgedAgentMonitor periodic sweep, DefaultWedgedThreshold (5m) with WEDGED_AGENT_THRESHOLD_SECONDS env override, and WedgedThresholdForHTTP public symbol so the HTTP flag and the monitor use the same threshold. - registry/wedged_agent_test.go: 11 unit tests covering the full truth table (wedge / busy / idle / heartbeat-only / non-positive threshold / active=2 / etc.) plus the threshold boundary. - handlers/workspace.go: get_workspace surfaces a 'wedged' boolean alongside last_outbound_at, computed via the shared IsWedgedAgent predicate so the flag and the monitor can never disagree. - cmd/server/main.go: onWorkspaceWedged handler (log + broadcast a WORKSPACE_WEDGED event; auto-restart is intentionally NOT wired — it is a follow-up gated on ops review because a wedge can mask a slow-but-busy agent and restarting it loses in-flight state). StartWedgedAgentMonitor wired with supervised.RunWithRecover, same contract as StartHealthSweep / StartHibernationMonitor. Tests: 11/11 new tests pass; full registry + handlers test suites pass (40s); go vet clean. Follow-up intentionally out of scope: gated auto-restart of wedged workspaces (operator review needed to confirm a wedge != a slow turn).