hermes-agent

History

Dev Lead b14758f09a Some checks failed Tests / e2e (pull_request) Successful in 57s Details Nix / nix (macos-latest) (pull_request) Waiting to run Details Contributor Attribution Check / check-attribution (pull_request) Failing after 9s Details Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 10s Details Tests / test (pull_request) Failing after 7m7s Details Nix / nix (ubuntu-latest) (pull_request) Failing after 13m19s Details fix(tools/environments): SIGKILL-only on KeyboardInterrupt; gate cmd_update survivor sweep on real grace (partial close hermes-agent#9) Restores the Apr 2026 orphan-bug fix for the local terminal backend (``sleep 300`` survives ``hermes chat -q`` SIGTERM, originally reported by Physikal) and aligns the ``hermes update`` survivor sweep with the contract its tests have always pinned. Three things move: 1. ``tools/environments/local.py:_kill_process`` - Was: SIGTERM → wait up to 1s polling ``os.killpg(pgid, 0)`` → SIGKILL → wait up to 2s on the same pollee. - Now: SIGKILL directly + ``proc.wait(timeout=0.5)`` to reap the wrapper. - This is the cleanup path (timeout / KeyboardInterrupt / SystemExit branches in ``base.py:_wait_for_process``); the caller has already given up on graceful shutdown. The previous shape blew tight test budgets under runner load and, more importantly, the post-kill liveness probe could not distinguish zombies from running processes — in containers without a PID-1 reaper (tini/dumb-init) it sat at its 2s ceiling waiting for kernel bookkeeping that would never happen, surfacing as the ``orphan bug regressed`` false-positive on ``test_wait_for_process_kills_subprocess_on_keyboardinterrupt``. 2. ``tests/tools/test_local_interrupt_cleanup.py`` - ``_pgid_still_alive``: switch from ``os.killpg(pgid, 0)`` to ``ps -g STAT`` so zombies are not reported as alive. - ``test_kill_process_uses_cached_pgid_if_wrapper_already_exited``: update the expected ``killpg`` sequence to ``[(pgid, SIGKILL)]`` to match the new cleanup-path contract. 3. ``hermes_cli/main.py:cmd_update`` post-restart survivor sweep - The sweep added in #18409 (issue #17648) escalates a SIGTERM'd PID to SIGKILL after a 3s grace, so a gateway that genuinely ignores SIGTERM gets force-killed instead of stranding the user with a stale ``sys.modules``. The fixture-mocked ``time.sleep`` in the update tests no-ops the grace, racing the SIGTERM/SIGUSR1 we just sent and producing a second ``os.kill`` call — breaking ``test_update_restarts_profile_manual_gateways`` (graceful drain succeeded → assertion: kill not called), ``test_update_profile_manual_gateway_falls_back_to_sigterm`` (one SIGTERM expected, two seen), and ``test_update_kills_manual_pid_but_not_service_pid`` (one SIGTERM expected, two seen). - Fix: gate the sweep on a real wall-clock grace. Sample ``time.monotonic()`` before and after the 3s sleep; if less than 2.5s elapsed (test fixture, signal handler, etc.), skip the sweep entirely. Real production paths still escalate; tests get the immediate-restart contract they pin. Also probe each candidate PID with ``os.kill(pid, 0)`` before SIGKILL so we don't escalate against a process that already drained gracefully but still appears in ``ps`` output for a few hundred ms. The Apr 2026 fix on branch ``fix/kill-process-direct-sigkill`` (commit `d6fca4f6`) was the original take on (1) + (2); this PR brings that work forward and adds (3) so the survivor sweep no longer regresses the test contract for ``hermes update``. Verification: - ``pytest -x tests/tools/test_local_interrupt_cleanup.py tests/hermes_cli/test_update_gateway_restart.py -v`` — 49/49 pass. - ``pytest -q tests/tools/test_local_background_child_hang.py tests/tools/test_base_environment.py tests/tools/test_windows_compat.py`` — all pass. - Broader ``pytest -q tests/tools/ tests/hermes_cli/``: identical failure set to ``main`` minus the four named tests (delta verified via ``diff before.txt after.txt``). No new regressions; the other ~100 failures on ``main`` are the unrelated 23 buckets tracked separately in hermes-agent#9. Closes the four signal-handling buckets in #9; remaining 23 untouched.		2026-05-08 12:08:23 -07:00
..
__init__.py	feat(environments): add Daytona cloud sandbox backend	2026-03-05 10:02:21 -08:00
base.py	fix(cli): respect terminal.cwd config in local terminal backend	2026-04-28 22:16:08 -07:00
daytona.py	fix: harden sync_back — PID-suffix temp path, size cap, lifecycle guards	2026-04-16 19:39:21 -07:00
docker.py	feat(docker): run container as host user to avoid root-owned bind mounts	2026-04-29 16:16:43 +10:00
file_sync.py	fix(ci): stabilize current main test regressions	2026-04-30 06:36:50 -07:00
local.py	fix(tools/environments): SIGKILL-only on KeyboardInterrupt; gate cmd_update survivor sweep on real grace (partial close hermes-agent#9)	2026-05-08 12:08:23 -07:00
managed_modal.py	feat(environments): unified spawn-per-call execution layer	2026-04-08 17:23:15 -07:00
modal_utils.py	fix: follow-up for salvaged PR #10854	2026-04-16 06:42:45 -07:00
modal.py	feat(file-sync): sync remote changes back to host on teardown	2026-04-16 19:39:21 -07:00
singularity.py	feat(environments): unified spawn-per-call execution layer	2026-04-08 17:23:15 -07:00
ssh.py	fix(ssh): prevent tar from overwriting remote home dir permissions	2026-04-30 03:26:35 -07:00
vercel_sandbox.py	fix: address self-review findings for Vercel Sandbox salvage	2026-04-29 07:22:33 -07:00