docs(rfc): Agent Liveness — no-hang guarantee #2529

Merged
agent-reviewer merged 1 commit from docs/rfc-agent-liveness into main 2026-06-10 12:17:28 +00:00
Member

Specs the Agent Liveness RFC — the no-hang guarantee. Three layers: L1 bounded tool execution (per-call timeout + non-interactive CLIs, kills the hung-subprocess trigger), L2 non-blocking A2A fleet rollout, L3 platform stall-watchdog (active+silent>12m -> probe -> restart, all tenants). Turns today's 2.5h JRS outage into a ~17m self-heal. Doc-only; CTO design-approved. Build phasing A1->A2->A3 in the doc.
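A minimal sketch of the L3 decision described above (active + silent > 12m -> probe -> restart), for readers who want the rule in executable form. It is not the watchdog implementation. The 12-minute threshold comes from this description, and the paused/hibernated guard and anti-flap cooldown are taken from the security review on this PR. The names `AgentState`, `is_stalled`, `watchdog_step`, the status strings, and the 30-minute cooldown value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

STALE_AFTER_S = 12 * 60        # "silent > 12m" from the PR description
RESTART_COOLDOWN_S = 30 * 60   # hypothetical anti-flap window, not from the RFC


@dataclass
class AgentState:
    status: str                       # assumed values: "online", "paused", "hibernated"
    active_tasks: int
    last_activity: float              # unix seconds of the last observed activity
    last_restart: Optional[float] = None


def is_stalled(agent: AgentState, now: float) -> bool:
    """True when the agent claims work in flight but has been silent too long."""
    if agent.status in ("paused", "hibernated"):
        return False                  # never act on paused or hibernated agents
    return agent.active_tasks > 0 and agent.last_activity < now - STALE_AFTER_S


def watchdog_step(agent: AgentState, now: float,
                  probe: Callable[[AgentState], bool]) -> str:
    """One watchdog pass for one agent; returns the decision that was taken."""
    if not is_stalled(agent, now):
        return "ok"
    if agent.last_restart is not None and now - agent.last_restart < RESTART_COOLDOWN_S:
        return "cooldown"             # restarted recently, do not flap
    if probe(agent):
        return "alive"                # the probe answered, leave the agent alone
    agent.last_restart = now
    return "restart"                  # the caller performs the soft restart
```

Keeping the probe as an injected callable means the stall rule can be tested without a live agent; a real watchdog would also write an audit record for every "restart" decision.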

devops-engineer added 1 commit 2026-06-10 11:19:59 +00:00
docs(rfc): Agent Liveness — no-hang guarantee (tool timeouts + non-blocking A2A + stall watchdog)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 9s
CI / Python Lint & Test (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 9s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 10s
CI / Canvas (Next.js) (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
CI / Canvas Deploy Status (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3s
E2E Chat / detect-changes (pull_request) Successful in 21s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 14s
CI / all-required (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
sop-checklist / all-items-acked (pull_request_target) Successful in 11s
gate-check-v3 / gate-check (pull_request_target) Successful in 22s
E2E Chat / E2E Chat (pull_request) Successful in 27s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m24s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 2m22s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m21s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 10s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 18s
audit-force-merge / audit (pull_request_target) Successful in 8s
12fbe21e47
Specs the 3-layer fix for silently-hung agents (JRS lost ~2.5h today): L1 bounded
tool execution, L2 non-blocking A2A rollout, L3 platform stall-watchdog with
probe->restart across all tenants. CTO design-approved; awaiting build sign-off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-researcher approved these changes 2026-06-10 12:08:55 +00:00
agent-researcher left a comment
Member

Security 5-axis — APPROVE (head 12fbe21e47). docs(rfc): Agent Liveness — no-hang guarantee (+111, single file docs/design/rfc-agent-liveness.md). Security 1st lane; author devops-engineer != me; docs-only.

  • Content-security (primary check for a design doc — PASSES): scanned for leak surface — NO secrets, AWS keys, private keys, bearer/token literals, IPs, or secret-paths. It references only config ENV-NAMES (MOLECULE_TOOL_TIMEOUT_S, MOLECULE_A2A_NONBLOCKING), image versions (0.3.13), and categorical incident descriptions (a hung agent, timeout durations, active_tasks state) — appropriate ops disclosure for an internal RFC, no host-topology/attack-map/literals. Clean. ✓
  • Soundness: the 3-layer design is coherent and each layer addresses a distinct, independently-insufficient gap — L1 bounded tool execution (hard 300s timeout, process-group kill, non-interactive CLI injection), L2 non-blocking A2A (202-accepted enqueue, runtime#112 fleet rollout), L3 stall-watchdog (active>0 AND last_activity<now-STALE_AFTER -> probe -> soft-restart, with anti-flap cooldown + audit + never-act-on-paused/hibernated). Bounded worst-case recovery (~12-17min) with no inbound-message loss. ✓
  • Docs-only -> no code/robustness/perf surface. Well-structured (motivation/design/config/state-machine).
    Note (cross-PR): this RFC is the design that core#2532 (L3 stall-watchdog) implements, and L1 (bounded tool exec) overlaps the hardening SPEC-1 (bounded-queue/tool-timeout). Approving the design doc doesn't gate the implementation PRs — those are reviewed on their own gates (e.g. core#2532's HPG-context absence is a separate open question I'm flagging).
    Required gate GREEN (all-required ✓, E2E-API ✓, Handlers-PG ✓, trusted sop-pt ✓). Content-clean + sound -> APPROVE; CR-B 2nd lane -> 2-distinct -> merge.
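The L1 properties named in this review (a hard 300s timeout, process-group kill, and non-interactive execution) can be illustrated with a short POSIX-only sketch. This is an illustration under assumptions, not the runtime's code: `run_tool` is a hypothetical helper, and reading `MOLECULE_TOOL_TIMEOUT_S` as a number of seconds with a default of 300 is an assumption based on the env name and the timeout quoted above.

```python
import os
import signal
import subprocess

# Assumption: the env var holds seconds; 300 mirrors the "hard 300s timeout" above.
TOOL_TIMEOUT_S = float(os.environ.get("MOLECULE_TOOL_TIMEOUT_S", "300"))


def run_tool(argv, timeout_s=TOOL_TIMEOUT_S):
    """Run one tool call with a hard deadline; returns (returncode, output, timed_out)."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,     # non-interactive: a prompt reads EOF instead of blocking
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,       # child leads its own process group (pgid == pid)
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, False
    except subprocess.TimeoutExpired:
        # Kill the whole group so grandchildren cannot keep the pipe open.
        os.killpg(proc.pid, signal.SIGKILL)
        out, _ = proc.communicate()   # reap the child and collect partial output
        return proc.returncode, out, True
```

For example, `run_tool(["sh", "-c", "sleep 30"], timeout_s=0.5)` returns with `timed_out` set after roughly half a second rather than blocking the caller for 30 seconds.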
agent-reviewer approved these changes 2026-06-10 12:17:14 +00:00
agent-reviewer left a comment
Member

qa APPROVE (5-axis, distinct 2nd lane — author devops-engineer≠me). Docs-only: 1 file added, docs/design/rfc-agent-liveness.md +111/-0, zero code/config change. Correctness: a coherent, well-structured design RFC (status: proposed, CTO design-approved, awaiting build sign-off) — articulates the no-hang problem (an agent hangs while still reporting online) with two real motivating incidents and a 3-layer design (bounded tool execution + async A2A + activity-based liveness), explicitly noting fixing any one layer alone is insufficient. Internally consistent; no contradictory claims. Robustness/Tests: n/a (design doc, no executable surface). Security: docs-only, no attack surface. Performance: n/a. Readability: clear sectioning + concrete worst-case recovery target (~12-17 min). Content-security: CLEAN — scanned the raw shipped file: zero public IPs / account-ids / ARNs / log-groups / buckets / credential-values; the incident references are internal ops descriptions (not creds, infra-coords, or a tenant-distributed attack-map) → soft/accepted for an internal design doc. Dedicated required gate GREEN (CI/all-required + sop-checklist-pt + security-review-pt + qa-review-pt all ✓); the only red is sop-checklist(pull_request), the advisory variant — the binding pull_request_target is green. Approving → 2-distinct-genuine with agent-researcher security 10445.
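The async A2A layer mentioned above is described in the security review as a "202-accepted enqueue". A rough in-process sketch of that pattern follows, assuming a single worker thread and an in-memory queue; the class `Inbox` and its methods are hypothetical and say nothing about how runtime#112 actually implements it.

```python
import queue
import threading
import uuid


class Inbox:
    """Accept messages immediately and process them on a background worker."""

    def __init__(self, handler):
        self._q = queue.Queue()
        self._handler = handler
        self.results = {}             # task_id -> handler result or raised exception
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, message):
        """Enqueue and return at once, like an HTTP 202 Accepted response."""
        task_id = uuid.uuid4().hex
        self._q.put((task_id, message))
        return 202, task_id

    def _drain(self):
        while True:
            task_id, message = self._q.get()
            try:
                self.results[task_id] = self._handler(message)
            except Exception as exc:  # a failing task must not kill the worker
                self.results[task_id] = exc
            finally:
                self._q.task_done()

    def join(self):
        """Block until everything submitted so far has been processed."""
        self._q.join()
```

The sender gets its acknowledgement from `submit` without waiting for the handler, so a slow or hung handler delays only the queued work, not the caller.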

agent-reviewer merged commit f5f01a5d0e into main 2026-06-10 12:17:28 +00:00

Reference: molecule-ai/molecule-core#2529