devops-engineer 961885393c
feat(liveness): A2 — stall-watchdog (probe → restart) for silently-hung agents
Layer 3 of the approved Agent-Liveness RFC. Catches the
busy-but-silently-hung case that both the Redis TTL liveness monitor
(offline/dead only) and the operator status=failed watchdog miss: a
workspace that is status=online with active_tasks>0 but has produced no
activity for too long. This is the failure mode that let JRS sit dead for
~2.5h.

- Migration 20260610140000_workspaces_last_activity: adds
  workspaces.last_activity_at TIMESTAMPTZ + a partial index over
  (online, busy), and a workspace_stall_state bookkeeping table.
  Idempotent, so it is safe under the re-apply-all migration runner.

- Stamp last_activity_at write-through on every activity_logs write, folded
  into the existing INSERT as a single CTE statement in logActivityExec: no
  extra round-trip on the latency path, atomic in the Tx case, and the Exec
  text still contains INSERT INTO activity_logs so existing sqlmock
  expectations keep matching.

- stall_watchdog.go sweeper mirrors request_nudge_sweeper and
  delegation_sweeper (ticker, envDuration, panic-recovering Start,
  injectable enqueue seam, raw SQL). Two-stage: detect (online +
  active_tasks>0 + stale) -> PROBE via EnqueueA2A -> if still silent past
  PROBE_GRACE -> soft-restart via the injected WorkspaceHandler.RestartByID
  (existing volume, the same path POST /workspaces/:id/restart uses).
  Resumed activity clears the state; COOLDOWN prevents flapping; each
  sweep is bounded by a LIMIT; emits structured logs and audit rows.

- Wired into cmd/server/main.go beside the other sweepers, gated by
  STALL_WATCHDOG_DISABLED.

- stall_watchdog_test.go (sqlmock): probe / restart / clear / cooldown /
  empty-noop / nil-restart probe-only / env-override.

Thresholds (env knobs, in seconds): STALL_WATCHDOG_INTERVAL_S=180,
_STALE_AFTER_S=720, _PROBE_GRACE_S=300, _COOLDOWN_S=1800; disable with
STALL_WATCHDOG_DISABLED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 15:01:12 +00:00