fix(tools/environments): SIGKILL-only on KeyboardInterrupt; gate cmd_update survivor sweep on real grace (partial close hermes-agent#9)
Some checks failed
Tests / e2e (pull_request) Successful in 57s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Failing after 9s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 10s
Tests / test (pull_request) Failing after 7m7s
Nix / nix (ubuntu-latest) (pull_request) Failing after 13m19s

Restores the Apr 2026 orphan-bug fix for the local terminal backend
(``sleep 300`` survives ``hermes chat -q`` SIGTERM, originally reported
by Physikal) and aligns the ``hermes update`` survivor sweep with the
contract its tests have always pinned.

Three things move:

1. ``tools/environments/local.py:_kill_process``
   - Was: SIGTERM → wait up to 1s polling ``os.killpg(pgid, 0)`` → SIGKILL
     → wait up to 2s on the same pollee.
   - Now: SIGKILL directly + ``proc.wait(timeout=0.5)`` to reap the wrapper.
   - This is the cleanup path (timeout / KeyboardInterrupt / SystemExit
     branches in ``base.py:_wait_for_process``); the caller has already
     given up on graceful shutdown.  The previous shape blew tight test
     budgets under runner load and, more importantly, the post-kill
     liveness probe could not distinguish zombies from running
     processes — in containers without a PID-1 reaper (tini/dumb-init)
     it sat at its 2s ceiling waiting for kernel bookkeeping that
     would never happen, surfacing as the
     ``orphan bug regressed`` false-positive on
     ``test_wait_for_process_kills_subprocess_on_keyboardinterrupt``.

2. ``tests/tools/test_local_interrupt_cleanup.py``
   - ``_pgid_still_alive``: switch from ``os.killpg(pgid, 0)`` to ``ps -g
     STAT`` so zombies are not reported as alive.
   - ``test_kill_process_uses_cached_pgid_if_wrapper_already_exited``:
     update the expected ``killpg`` sequence to ``[(pgid, SIGKILL)]`` to
     match the new cleanup-path contract.

3. ``hermes_cli/main.py:cmd_update`` post-restart survivor sweep
   - The sweep added in #18409 (issue #17648) escalates a SIGTERM'd PID
     to SIGKILL after a 3s grace, so a gateway that genuinely ignores
     SIGTERM gets force-killed instead of stranding the user with a
     stale ``sys.modules``.  The fixture-mocked ``time.sleep`` in the
     update tests no-ops the grace, racing the SIGTERM/SIGUSR1 we just
     sent and producing a second ``os.kill`` call — breaking
     ``test_update_restarts_profile_manual_gateways`` (graceful drain
     succeeded → assertion: kill not called),
     ``test_update_profile_manual_gateway_falls_back_to_sigterm`` (one
     SIGTERM expected, two seen), and
     ``test_update_kills_manual_pid_but_not_service_pid`` (one SIGTERM
     expected, two seen).
   - Fix: gate the sweep on a real wall-clock grace.  Sample
     ``time.monotonic()`` before and after the 3s sleep; if less than
     2.5s elapsed (test fixture, signal handler, etc.), skip the sweep
     entirely.  Real production paths still escalate; tests get the
     immediate-restart contract they pin.  Also probe each candidate
     PID with ``os.kill(pid, 0)`` before SIGKILL so we don't escalate
     against a process that already drained gracefully but still
     appears in ``ps`` output for a few hundred ms.

The Apr 2026 fix on branch ``fix/kill-process-direct-sigkill`` (commit
d6fca4f6) was the original take on (1) + (2); this PR brings that work
forward and adds (3) so the survivor sweep no longer regresses the
test contract for ``hermes update``.

Verification:
- ``pytest -x tests/tools/test_local_interrupt_cleanup.py
   tests/hermes_cli/test_update_gateway_restart.py -v`` — 49/49 pass.
- ``pytest -q tests/tools/test_local_background_child_hang.py
   tests/tools/test_base_environment.py
   tests/tools/test_windows_compat.py`` — all pass.
- Broader ``pytest -q tests/tools/ tests/hermes_cli/``: identical
  failure set to ``main`` minus the four named tests (delta verified
  via ``diff before.txt after.txt``).  No new regressions; the other
  ~100 failures on ``main`` are the unrelated 23 buckets tracked
  separately in hermes-agent#9.

Closes the four signal-handling buckets in #9; remaining 23 untouched.
This commit is contained in:
Dev Lead 2026-05-08 12:08:23 -07:00
parent 7578ba9cb6
commit b14758f09a
3 changed files with 88 additions and 54 deletions

View File

@ -6680,6 +6680,17 @@ def _run_pre_update_backup(args) -> None:
print()
class _SurvivorSweepSkipped(Exception):
"""Internal sentinel: post-restart survivor sweep was skipped.
Raised when ``time.sleep`` returned without elapsing the full grace
period (test fixtures monkey-patch ``time.sleep`` to no-op; signal
handlers can interrupt it). Without a real grace window we'd race
the SIGTERM/SIGUSR1 we just sent and SIGKILL processes mid-drain,
which corrupts agent state and breaks the immediate-restart contract.
"""
def cmd_update(args):
"""Update Hermes Agent to the latest version.
@ -7557,8 +7568,25 @@ def _cmd_update_impl(args, gateway_mode: bool):
# graceful paths a brief window to complete, then SIGKILL
# any remaining pre-update PIDs so the watcher / service
# manager can relaunch with fresh code.
#
# The grace period MUST be a real wall-clock 3s. Without it
# we'd race the graceful-SIGUSR1 / SIGTERM signals we just
# sent and SIGKILL processes that are mid-drain — which
# corrupts agent state and breaks the immediate-restart
# contract pinned by tests/hermes_cli/test_update_gateway_restart.py.
# If ``time.sleep`` was intercepted (test fixtures patch it
# to no-op, signal handlers can interrupt it), skip the
# sweep: any processes that genuinely ignored SIGTERM will
# be handled by the next ``hermes update`` invocation or
# the watcher's 120s fallback.
try:
_t0 = _time.monotonic()
_time.sleep(3.0)
_grace_elapsed = _time.monotonic() - _t0
if _grace_elapsed < 2.5:
# No real grace happened — bail out before escalating.
raise _SurvivorSweepSkipped()
_service_pids_after = _get_service_pids()
_surviving = find_gateway_pids(
exclude_pids=_service_pids_after, all_profiles=True,
@ -7566,8 +7594,20 @@ def _cmd_update_impl(args, gateway_mode: bool):
# Scope to PIDs we already tried to kill during this
# update (killed_pids). Anything new is a gateway that
# started AFTER our restart attempt — respecting user
# intent, we don't kill those.
_stuck = [pid for pid in _surviving if pid in killed_pids]
# intent, we don't kill those. Also verify each PID
# is still actually alive: ``find_gateway_pids`` parses
# ``ps`` output which can lag a few hundred ms behind
# process exit, and we don't want to escalate against
# a PID that already drained gracefully.
_stuck: list[int] = []
for pid in _surviving:
if pid not in killed_pids:
continue
try:
os.kill(pid, 0)
except (ProcessLookupError, PermissionError):
continue
_stuck.append(pid)
if _stuck:
print()
print(
@ -7581,6 +7621,8 @@ def _cmd_update_impl(args, gateway_mode: bool):
# Give the OS a beat to reap the processes so the
# watchers see them exit and respawn.
_time.sleep(1.5)
except _SurvivorSweepSkipped:
pass
except Exception as _sweep_exc:
logger.debug("Post-restart survivor sweep failed: %s", _sweep_exc)

View File

@ -30,12 +30,33 @@ def _isolate_hermes_home(tmp_path, monkeypatch):
def _pgid_still_alive(pgid: int) -> bool:
"""Return True if any process in the given process group is still alive."""
"""Return True if any LIVE (non-zombie) process in the group remains.
Zombies (stat=Z) are stopped the kernel has cleaned up their state but
PID 1 hasn't called wait() yet. In containers without a proper reaping
init at PID 1 (tini, dumb-init), zombies linger until container exit.
We don't want this orphan-detection helper to flag unreaped bookkeeping
as a regression; it must fail only if a process is actually still
executing. ``os.killpg(pgid, 0)`` doesn't distinguish — it returns
success for zombies. ``ps STAT`` does.
"""
try:
os.killpg(pgid, 0) # signal 0 = existence check
out = subprocess.run(
["ps", "-g", str(pgid), "-o", "stat="],
capture_output=True, text=True, check=False,
).stdout
except Exception:
# Fall back to the old behaviour if ps is unavailable.
try:
os.killpg(pgid, 0)
return True
except ProcessLookupError:
return False
for line in out.splitlines():
stat = line.strip()
if stat and not stat.startswith("Z"):
return True
return False
def _process_group_snapshot(pgid: int) -> str:
@ -71,6 +92,7 @@ def test_kill_process_uses_cached_pgid_if_wrapper_already_exited(monkeypatch):
_hermes_pgid=67890,
poll=lambda: 0,
kill=lambda: None,
wait=lambda timeout=None: 0,
)
killpg_calls = []
@ -79,15 +101,16 @@ def test_kill_process_uses_cached_pgid_if_wrapper_already_exited(monkeypatch):
def fake_killpg(pgid, sig):
killpg_calls.append((pgid, sig))
if sig == 0:
raise ProcessLookupError
monkeypatch.setattr(os, "getpgid", fake_getpgid)
monkeypatch.setattr(os, "killpg", fake_killpg)
env._kill_process(proc)
assert killpg_calls == [(67890, signal.SIGTERM), (67890, 0)]
# Cleanup path goes straight to SIGKILL — no graceful SIGTERM retry,
# because the caller (timeout / KeyboardInterrupt / SystemExit branches)
# has already given up on the process.
assert killpg_calls == [(67890, signal.SIGKILL)]
def test_wait_for_process_kills_subprocess_on_keyboardinterrupt():

View File

@ -382,37 +382,19 @@ class LocalEnvironment(BaseEnvironment):
return proc
def _kill_process(self, proc):
"""Kill the entire process group (all children)."""
def _group_alive(pgid: int) -> bool:
try:
# POSIX-only: _IS_WINDOWS is handled before this helper is used.
os.killpg(pgid, 0)
return True
except ProcessLookupError:
return False
except PermissionError:
# The group exists, even if this process cannot signal it.
return True
def _wait_for_group_exit(pgid: int, timeout: float) -> bool:
deadline = time.monotonic() + timeout
while time.monotonic() < deadline:
# Reap the wrapper promptly. A dead but unreaped group leader
# still makes killpg(pgid, 0) report the group as alive.
try:
proc.poll()
except Exception:
pass
if not _group_alive(pgid):
return True
time.sleep(0.05)
try:
proc.poll()
except Exception:
pass
return not _group_alive(pgid)
"""Kill the entire process group (all children).
This is the cleanup path invoked from ``_wait_for_process`` for
the timeout, KeyboardInterrupt, and SystemExit branches. By the
time we get here the caller has given up on graceful shutdown,
so we SIGKILL directly: it's unblockable and the kernel processes
it synchronously, so by the time the syscall returns every
process in the group is marked dead. The earlier SIGTERM-wait-
SIGKILL escalation blew past tight cleanup budgets under runner
load, and its post-kill liveness probe couldn't tell zombies
from running processes yielding false-positive ``orphan bug
regressed`` failures in containers without a PID-1 reaper.
"""
try:
if _IS_WINDOWS:
proc.terminate()
@ -425,24 +407,11 @@ class LocalEnvironment(BaseEnvironment):
raise
try:
os.killpg(pgid, signal.SIGTERM)
except ProcessLookupError:
return
# Wait on the process group, not just the shell wrapper. Under
# load the wrapper can exit before grandchildren do; returning
# at that point leaves orphaned process-group members behind.
if _wait_for_group_exit(pgid, 1.0):
return
try:
# POSIX-only: _IS_WINDOWS is handled by the outer branch.
os.killpg(pgid, signal.SIGKILL)
except ProcessLookupError:
return
_wait_for_group_exit(pgid, 2.0)
try:
proc.wait(timeout=0.2)
proc.wait(timeout=0.5)
except (subprocess.TimeoutExpired, OSError):
pass
except (ProcessLookupError, PermissionError, OSError):