docs(rfc): Agent Liveness — no-hang guarantee #2529

Merged
agent-reviewer merged 1 commits from docs/rfc-agent-liveness into main 2026-06-10 12:17:28 +00:00
+111
View File
@@ -0,0 +1,111 @@
# RFC: Agent Liveness — the no-hang guarantee
**Status:** proposed (CTO design-approved; awaiting sign-off to build) · **Author:** CEO-assistant · **Date:** 2026-06-10
## 1. Motivation
An agent can hang and sit **dead for hours while still reporting `online`**, with no
self-recovery. Two real cases today:
- **JRS SEO agent:** `npx vercel` blocked on an interactive prompt at 08:24Z; the agent
produced **zero activity for ~2.5h**, held `active_tasks=1`, and every inbound message
300s-timed-out. Only a manual soft-restart recovered it.
- **agents-team reviewers:** the same blocking-A2A pressure (`300002ms` `timeout awaiting
response headers`) overloaded the tenant host (sustained 524s) and throttled review
throughput fleet-wide.
Three independent gaps allow this, and **fixing any one alone is insufficient**:
1. **Unbounded tool execution** — a subprocess (`npx vercel`) can block forever.
2. **Synchronous blocking A2A** — inbound `message/send` blocks the caller up to 300s; with
a slow/hung agent these pile up and starve the workspace-server.
3. **Liveness keyed on status, not activity** — `online` comes from the registry heartbeat,
not real responsiveness, so a hung agent stays green ("clean telemetry ≠ healthy").
**Goal:** no agent stays silently hung; bounded worst-case recovery (~1217 min, not hours);
no inbound-message loss; works for **every tenant**, not just agents-team.
## 2. Design — three layers
### L1 — Bounded tool execution (runtime)
Every tool / subprocess call gets a **hard timeout** (default **300 s**, per-tool override).
On expiry: kill the **process group** (not just the leader), and return a structured
`{error:"tool_timeout", detail:"… killed after Ns"}` so the agent loop continues instead of
blocking. Known-interactive CLIs (`vercel`, `gh`, `npm`, `git` credential prompts) run
**non-interactive** — inject `--yes`/`--no-input`, run with **stdin closed** — so they can't
block on a prompt at all.
- **Where:** the runtime's tool-execution wrapper (the Bash/subprocess path in
`molecule_runtime`).
- **Config:** `MOLECULE_TOOL_TIMEOUT_S` (default 300), optional per-tool map.
- **Effect:** kills the trigger class — a subprocess can't wedge an agent if it can't run
forever.
### L2 — Non-blocking A2A (runtime + platform)
Inbound `message/send` **enqueues and returns `202 accepted` immediately**; the agent drains
its queue on its next tick. `MOLECULE_A2A_NONBLOCKING=true` already exists (runtime#112,
shipped on image 0.3.13) — this layer is the **fleet rollout** (re-provision onto 0.3.13 with
the flag) plus making it the default.
- **Effect:** stops the 300s pileup + host overload; inbound is never lost, just deferred.
### L3 — Stall watchdog → probe → restart (platform, ALL tenants)
A per-tenant periodic sweep in the workspace-server (covers every org, unlike today's
operator-only agents-team script):
- **Detect:** workspace with `active_tasks > 0` **AND** `last_activity_at < now()
STALE_AFTER` (default **12 min**) **AND** not in cooldown → **suspected-stalled**.
- **Probe:** inject a liveness message — *"You've had no activity for N min while marked
busy. Reply (any activity) or you'll be restarted in M min."*
- **Escalate:** if still no activity after `PROBE_GRACE` (default **5 min**) →
**soft-restart** (existing-volume). The probe separates *slow-but-alive* (resumes activity,
left alone) from *deadlocked* (can't process the probe → restarted).
- **Audit + anti-flap:** every transition logged; no re-probe/restart within a cooldown
(default 30 min) to avoid restart loops; never act on `paused`/`hibernated`.
- **State machine:** `healthy → suspected_stalled →(activity)→ healthy` |
`→(silence > grace)→ restarting → healthy`.
This is the platform-enforced version of "verify by real activity." It **supersedes** the
operator `agent-health-watchdog` (which only fires on `status=failed` and explicitly treats
`active=0` as fine — it would never have caught the JRS `active=1 but silent` signature) and
extends coverage to all tenants.
## 3. Data model
- `workspaces.last_activity_at TIMESTAMPTZ` — stamped on every `activity_logs` insert (or a
cheap trigger / write-through in the activity writer). Drives L3.
- Optional `workspace_liveness_events` audit table (`workspace_id, event, at`) for
probe/restart history. (Can defer; structured logs suffice for v1.)
## 4. Config / defaults (override any)
| Knob | Default |
|---|---|
| `MOLECULE_TOOL_TIMEOUT_S` (L1) | 300 |
| non-interactive CLI enforcement (L1) | on |
| `MOLECULE_A2A_NONBLOCKING` (L2) | true (rollout) |
| stall-after (L3) | 12 min |
| probe-grace (L3) | 5 min |
| restart cooldown (L3) | 30 min |
| sweep cadence (L3) | 3 min |
Worst-case recovery = stall-after + probe-grace ≈ **17 min** (vs. today's *unbounded*).
## 5. Phasing (each = SOP PR → 2-genuine → merge)
- **A1** — L1 tool-call timeouts + non-interactive CLI (runtime). Highest leverage, smallest
blast radius; lands first.
- **A2** — L3 stall watchdog + probe→restart + `last_activity_at` (platform / workspace-server).
- **A3** — L2 non-blocking A2A fleet rollout (flag default + re-provision — the gated op,
JRS-first then fleet).
## 6. Interactions & non-goals
- **vs. the requests/inbox P4 idle-nudge:** different population — P4 nudges *idle* agents
with *pending inbox items*; L3 probes *busy-but-silent* agents. Complementary; share the
`last_activity_at` signal.
- **vs. operator watchdog:** L3 generalizes + replaces it (platform-side, all tenants,
activity-based not status-based). Keep the operator script until A2 ships, then retire.
- **Non-goals:** not a perf profiler; not killing legitimately-long *tasks* — L1 bounds a
single *tool call*, not the whole task; L3 gives the agent a chance to answer the probe
before any restart, so a long-but-alive task that still emits activity is never touched.
## 7. Open questions / risks
- **False positive on a genuinely long, silent tool call** (e.g. a 20-min build emitting no
interim output): mitigated by (a) per-tool L1 timeout tuning, (b) the L3 probe — the agent
can answer "still working" to defer restart. If a tool legitimately needs >300s with zero
output, raise its per-tool timeout. Worth confirming which tools that affects.
- **`last_activity_at` write overhead** on the hot activity path — use a lightweight
write-through, not a synchronous extra round-trip.
- **Probe delivery to a deadlocked agent** won't be processed — by design; that's what the
escalate-to-restart handles.