docs(rfc): Agent Liveness — no-hang guarantee #2529
Specs the Agent Liveness RFC — the no-hang guarantee. Three layers: L1 bounded tool execution (per-call timeout + non-interactive CLIs, kills the hung-subprocess trigger), L2 non-blocking A2A fleet rollout, L3 platform stall-watchdog (active+silent>12m -> probe -> restart, all tenants). Turns today's 2.5h JRS outage into a ~17m self-heal. Doc-only; CTO design-approved. Build phasing A1->A2->A3 in the doc.
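As a rough illustration of the L1 idea (per-call timeout plus non-interactive CLIs), here is a minimal Python sketch. The function name `run_tool`, the 30-second default, and the result shape are assumptions made for this example; they are not taken from the RFC.

```python
import subprocess
import sys


def run_tool(argv, timeout_s=30.0):
    """Run one tool call with a hard timeout and no interactive stdin.

    Hypothetical helper: the name, default timeout, and return shape are
    illustrative assumptions, not the RFC's actual interface.
    """
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,  # prompts read EOF instead of blocking forever
            capture_output=True,
            text=True,
            timeout=timeout_s,         # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        # The call is reported as failed rather than left hanging.
        return {"ok": False, "timed_out": True, "returncode": None, "stdout": ""}
    return {
        "ok": proc.returncode == 0,
        "timed_out": False,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
    }


if __name__ == "__main__":
    # A child that would sleep for 5s is cut off after 0.5s.
    print(run_tool([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5))
```

Note that `subprocess.run` only kills the direct child on timeout; a tool that spawns its own children would likely need process-group handling on top of this.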
Security 5-axis — APPROVE (head 12fbe21e47). docs(rfc): Agent Liveness — no-hang guarantee (+111, single file docs/design/rfc-agent-liveness.md). Security 1st lane; author devops-engineer != me; docs-only. Note (cross-PR): this RFC is the design that core#2532 (L3 stall-watchdog) implements, and L1 (bounded tool exec) overlaps the hardening SPEC-1 (bounded-queue/tool-timeout). Approving the design doc doesn't gate the implementation PRs — those are reviewed on their own gates (e.g. core#2532's HPG-context absence is a separate open question I'm flagging).
Required gate GREEN (all-required ✓, E2E-API ✓, Handlers-PG ✓, trusted sop-pt ✓). Content-clean + sound -> APPROVE; CR-B 2nd lane -> 2-distinct -> merge.
qa APPROVE (5-axis, distinct 2nd lane — author devops-engineer≠me). Docs-only: 1 file added, docs/design/rfc-agent-liveness.md +111/-0, zero code/config change. Correctness: a coherent, well-structured design RFC (status: proposed, CTO design-approved, awaiting build sign-off) — articulates the no-hang problem (an agent hangs while still reporting online) with two real motivating incidents and a 3-layer design (bounded tool execution + async A2A + activity-based liveness), explicitly noting fixing any one layer alone is insufficient. Internally consistent; no contradictory claims. Robustness/Tests: n/a (design doc, no executable surface). Security: docs-only, no attack surface. Performance: n/a. Readability: clear sectioning + concrete worst-case recovery target (~12-17 min). Content-security: CLEAN — scanned the raw shipped file: zero public IPs / account-ids / ARNs / log-groups / buckets / credential-values; the incident references are internal ops descriptions (not creds, infra-coords, or a tenant-distributed attack-map) → soft/accepted for an internal design doc. Dedicated required gate GREEN (CI/all-required + sop-checklist-pt + security-review-pt + qa-review-pt all ✓); the only red is sop-checklist(pull_request), the advisory variant — the binding pull_request_target is green. Approving → 2-distinct-genuine with agent-researcher security 10445.
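For readers following the activity-based liveness layer, the rule summarised in the PR description (active + silent > 12m -> probe -> restart) could be sketched roughly as below. Only the 12-minute threshold comes from the PR text; `AgentState`, `Action`, `decide`, and the `probe_failed` flag are hypothetical names invented for this sketch and do not describe the implementation in core#2532.

```python
from dataclasses import dataclass
from enum import Enum

# From the PR description: an agent that is active and silent for > 12 minutes.
STALL_THRESHOLD_S = 12 * 60


class Action(Enum):
    NONE = "none"
    PROBE = "probe"
    RESTART = "restart"


@dataclass
class AgentState:
    # All fields are illustrative assumptions, not the RFC's schema.
    active: bool              # agent has work in flight
    last_activity_ts: float   # last observed output, in epoch seconds
    probe_failed: bool = False  # a previous probe got no answer


def decide(state: AgentState, now: float) -> Action:
    """Return the watchdog action for one agent at time `now`."""
    if not state.active:
        return Action.NONE  # idle agents are allowed to be silent
    silent_for = now - state.last_activity_ts
    if silent_for <= STALL_THRESHOLD_S:
        return Action.NONE  # still within the silence budget
    # Past the threshold: probe first, restart only if the probe already failed.
    return Action.RESTART if state.probe_failed else Action.PROBE
```

The point of keying on activity rather than on a status flag is that an agent can keep reporting online while making no progress, which is the failure the review describes.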