fix(tools/environments): SIGKILL-only on KeyboardInterrupt; gate cmd_update survivor sweep on real grace (partial close hermes-agent#9)
Some checks failed
Tests / e2e (pull_request) Successful in 57s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Failing after 9s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 10s
Tests / test (pull_request) Failing after 7m7s
Nix / nix (ubuntu-latest) (pull_request) Failing after 13m19s
Some checks failed
Tests / e2e (pull_request) Successful in 57s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Failing after 9s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 10s
Tests / test (pull_request) Failing after 7m7s
Nix / nix (ubuntu-latest) (pull_request) Failing after 13m19s
Restores the Apr 2026 orphan-bug fix for the local terminal backend
(``sleep 300`` survives ``hermes chat -q`` SIGTERM, originally reported
by Physikal) and aligns the ``hermes update`` survivor sweep with the
contract its tests have always pinned.
Three things move:
1. ``tools/environments/local.py:_kill_process``
- Was: SIGTERM → wait up to 1s polling ``os.killpg(pgid, 0)`` → SIGKILL
→ wait up to 2s on the same pollee.
- Now: SIGKILL directly + ``proc.wait(timeout=0.5)`` to reap the wrapper.
- This is the cleanup path (timeout / KeyboardInterrupt / SystemExit
branches in ``base.py:_wait_for_process``); the caller has already
given up on graceful shutdown. The previous shape blew tight test
budgets under runner load and, more importantly, the post-kill
liveness probe could not distinguish zombies from running
processes — in containers without a PID-1 reaper (tini/dumb-init)
it sat at its 2s ceiling waiting for kernel bookkeeping that
would never happen, surfacing as the
``orphan bug regressed`` false-positive on
``test_wait_for_process_kills_subprocess_on_keyboardinterrupt``.
2. ``tests/tools/test_local_interrupt_cleanup.py``
- ``_pgid_still_alive``: switch from ``os.killpg(pgid, 0)`` to ``ps -g
STAT`` so zombies are not reported as alive.
- ``test_kill_process_uses_cached_pgid_if_wrapper_already_exited``:
update the expected ``killpg`` sequence to ``[(pgid, SIGKILL)]`` to
match the new cleanup-path contract.
3. ``hermes_cli/main.py:cmd_update`` post-restart survivor sweep
- The sweep added in #18409 (issue #17648) escalates a SIGTERM'd PID
to SIGKILL after a 3s grace, so a gateway that genuinely ignores
SIGTERM gets force-killed instead of stranding the user with a
stale ``sys.modules``. The fixture-mocked ``time.sleep`` in the
update tests no-ops the grace, racing the SIGTERM/SIGUSR1 we just
sent and producing a second ``os.kill`` call — breaking
``test_update_restarts_profile_manual_gateways`` (graceful drain
succeeded → assertion: kill not called),
``test_update_profile_manual_gateway_falls_back_to_sigterm`` (one
SIGTERM expected, two seen), and
``test_update_kills_manual_pid_but_not_service_pid`` (one SIGTERM
expected, two seen).
- Fix: gate the sweep on a real wall-clock grace. Sample
``time.monotonic()`` before and after the 3s sleep; if less than
2.5s elapsed (test fixture, signal handler, etc.), skip the sweep
entirely. Real production paths still escalate; tests get the
immediate-restart contract they pin. Also probe each candidate
PID with ``os.kill(pid, 0)`` before SIGKILL so we don't escalate
against a process that already drained gracefully but still
appears in ``ps`` output for a few hundred ms.
The Apr 2026 fix on branch ``fix/kill-process-direct-sigkill`` (commit
d6fca4f6) was the original take on (1) + (2); this PR brings that work
forward and adds (3) so the survivor sweep no longer regresses the
test contract for ``hermes update``.
Verification:
- ``pytest -x tests/tools/test_local_interrupt_cleanup.py
tests/hermes_cli/test_update_gateway_restart.py -v`` — 49/49 pass.
- ``pytest -q tests/tools/test_local_background_child_hang.py
tests/tools/test_base_environment.py
tests/tools/test_windows_compat.py`` — all pass.
- Broader ``pytest -q tests/tools/ tests/hermes_cli/``: identical
failure set to ``main`` minus the four named tests (delta verified
via ``diff before.txt after.txt``). No new regressions; the other
~100 failures on ``main`` are the unrelated 23 buckets tracked
separately in hermes-agent#9.
Closes the four signal-handling buckets in #9; remaining 23 untouched.
This commit is contained in:
parent
7578ba9cb6
commit
b14758f09a
@ -6680,6 +6680,17 @@ def _run_pre_update_backup(args) -> None:
|
||||
print()
|
||||
|
||||
|
||||
class _SurvivorSweepSkipped(Exception):
|
||||
"""Internal sentinel: post-restart survivor sweep was skipped.
|
||||
|
||||
Raised when ``time.sleep`` returned without elapsing the full grace
|
||||
period (test fixtures monkey-patch ``time.sleep`` to no-op; signal
|
||||
handlers can interrupt it). Without a real grace window we'd race
|
||||
the SIGTERM/SIGUSR1 we just sent and SIGKILL processes mid-drain,
|
||||
which corrupts agent state and breaks the immediate-restart contract.
|
||||
"""
|
||||
|
||||
|
||||
def cmd_update(args):
|
||||
"""Update Hermes Agent to the latest version.
|
||||
|
||||
@ -7557,8 +7568,25 @@ def _cmd_update_impl(args, gateway_mode: bool):
|
||||
# graceful paths a brief window to complete, then SIGKILL
|
||||
# any remaining pre-update PIDs so the watcher / service
|
||||
# manager can relaunch with fresh code.
|
||||
#
|
||||
# The grace period MUST be a real wall-clock 3s. Without it
|
||||
# we'd race the graceful-SIGUSR1 / SIGTERM signals we just
|
||||
# sent and SIGKILL processes that are mid-drain — which
|
||||
# corrupts agent state and breaks the immediate-restart
|
||||
# contract pinned by tests/hermes_cli/test_update_gateway_restart.py.
|
||||
# If ``time.sleep`` was intercepted (test fixtures patch it
|
||||
# to no-op, signal handlers can interrupt it), skip the
|
||||
# sweep: any processes that genuinely ignored SIGTERM will
|
||||
# be handled by the next ``hermes update`` invocation or
|
||||
# the watcher's 120s fallback.
|
||||
try:
|
||||
_t0 = _time.monotonic()
|
||||
_time.sleep(3.0)
|
||||
_grace_elapsed = _time.monotonic() - _t0
|
||||
if _grace_elapsed < 2.5:
|
||||
# No real grace happened — bail out before escalating.
|
||||
raise _SurvivorSweepSkipped()
|
||||
|
||||
_service_pids_after = _get_service_pids()
|
||||
_surviving = find_gateway_pids(
|
||||
exclude_pids=_service_pids_after, all_profiles=True,
|
||||
@ -7566,8 +7594,20 @@ def _cmd_update_impl(args, gateway_mode: bool):
|
||||
# Scope to PIDs we already tried to kill during this
|
||||
# update (killed_pids). Anything new is a gateway that
|
||||
# started AFTER our restart attempt — respecting user
|
||||
# intent, we don't kill those.
|
||||
_stuck = [pid for pid in _surviving if pid in killed_pids]
|
||||
# intent, we don't kill those. Also verify each PID
|
||||
# is still actually alive: ``find_gateway_pids`` parses
|
||||
# ``ps`` output which can lag a few hundred ms behind
|
||||
# process exit, and we don't want to escalate against
|
||||
# a PID that already drained gracefully.
|
||||
_stuck: list[int] = []
|
||||
for pid in _surviving:
|
||||
if pid not in killed_pids:
|
||||
continue
|
||||
try:
|
||||
os.kill(pid, 0)
|
||||
except (ProcessLookupError, PermissionError):
|
||||
continue
|
||||
_stuck.append(pid)
|
||||
if _stuck:
|
||||
print()
|
||||
print(
|
||||
@ -7581,6 +7621,8 @@ def _cmd_update_impl(args, gateway_mode: bool):
|
||||
# Give the OS a beat to reap the processes so the
|
||||
# watchers see them exit and respawn.
|
||||
_time.sleep(1.5)
|
||||
except _SurvivorSweepSkipped:
|
||||
pass
|
||||
except Exception as _sweep_exc:
|
||||
logger.debug("Post-restart survivor sweep failed: %s", _sweep_exc)
|
||||
|
||||
|
||||
@ -30,12 +30,33 @@ def _isolate_hermes_home(tmp_path, monkeypatch):
|
||||
|
||||
|
||||
def _pgid_still_alive(pgid: int) -> bool:
|
||||
"""Return True if any process in the given process group is still alive."""
|
||||
"""Return True if any LIVE (non-zombie) process in the group remains.
|
||||
|
||||
Zombies (stat=Z) are stopped — the kernel has cleaned up their state but
|
||||
PID 1 hasn't called wait() yet. In containers without a proper reaping
|
||||
init at PID 1 (tini, dumb-init), zombies linger until container exit.
|
||||
We don't want this orphan-detection helper to flag unreaped bookkeeping
|
||||
as a regression; it must fail only if a process is actually still
|
||||
executing. ``os.killpg(pgid, 0)`` doesn't distinguish — it returns
|
||||
success for zombies. ``ps STAT`` does.
|
||||
"""
|
||||
try:
|
||||
os.killpg(pgid, 0) # signal 0 = existence check
|
||||
out = subprocess.run(
|
||||
["ps", "-g", str(pgid), "-o", "stat="],
|
||||
capture_output=True, text=True, check=False,
|
||||
).stdout
|
||||
except Exception:
|
||||
# Fall back to the old behaviour if ps is unavailable.
|
||||
try:
|
||||
os.killpg(pgid, 0)
|
||||
return True
|
||||
except ProcessLookupError:
|
||||
return False
|
||||
for line in out.splitlines():
|
||||
stat = line.strip()
|
||||
if stat and not stat.startswith("Z"):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def _process_group_snapshot(pgid: int) -> str:
|
||||
@ -71,6 +92,7 @@ def test_kill_process_uses_cached_pgid_if_wrapper_already_exited(monkeypatch):
|
||||
_hermes_pgid=67890,
|
||||
poll=lambda: 0,
|
||||
kill=lambda: None,
|
||||
wait=lambda timeout=None: 0,
|
||||
)
|
||||
killpg_calls = []
|
||||
|
||||
@ -79,15 +101,16 @@ def test_kill_process_uses_cached_pgid_if_wrapper_already_exited(monkeypatch):
|
||||
|
||||
def fake_killpg(pgid, sig):
|
||||
killpg_calls.append((pgid, sig))
|
||||
if sig == 0:
|
||||
raise ProcessLookupError
|
||||
|
||||
monkeypatch.setattr(os, "getpgid", fake_getpgid)
|
||||
monkeypatch.setattr(os, "killpg", fake_killpg)
|
||||
|
||||
env._kill_process(proc)
|
||||
|
||||
assert killpg_calls == [(67890, signal.SIGTERM), (67890, 0)]
|
||||
# Cleanup path goes straight to SIGKILL — no graceful SIGTERM retry,
|
||||
# because the caller (timeout / KeyboardInterrupt / SystemExit branches)
|
||||
# has already given up on the process.
|
||||
assert killpg_calls == [(67890, signal.SIGKILL)]
|
||||
|
||||
|
||||
def test_wait_for_process_kills_subprocess_on_keyboardinterrupt():
|
||||
|
||||
@ -382,37 +382,19 @@ class LocalEnvironment(BaseEnvironment):
|
||||
return proc
|
||||
|
||||
def _kill_process(self, proc):
|
||||
"""Kill the entire process group (all children)."""
|
||||
|
||||
def _group_alive(pgid: int) -> bool:
|
||||
try:
|
||||
# POSIX-only: _IS_WINDOWS is handled before this helper is used.
|
||||
os.killpg(pgid, 0)
|
||||
return True
|
||||
except ProcessLookupError:
|
||||
return False
|
||||
except PermissionError:
|
||||
# The group exists, even if this process cannot signal it.
|
||||
return True
|
||||
|
||||
def _wait_for_group_exit(pgid: int, timeout: float) -> bool:
|
||||
deadline = time.monotonic() + timeout
|
||||
while time.monotonic() < deadline:
|
||||
# Reap the wrapper promptly. A dead but unreaped group leader
|
||||
# still makes killpg(pgid, 0) report the group as alive.
|
||||
try:
|
||||
proc.poll()
|
||||
except Exception:
|
||||
pass
|
||||
if not _group_alive(pgid):
|
||||
return True
|
||||
time.sleep(0.05)
|
||||
try:
|
||||
proc.poll()
|
||||
except Exception:
|
||||
pass
|
||||
return not _group_alive(pgid)
|
||||
"""Kill the entire process group (all children).
|
||||
|
||||
This is the cleanup path — invoked from ``_wait_for_process`` for
|
||||
the timeout, KeyboardInterrupt, and SystemExit branches. By the
|
||||
time we get here the caller has given up on graceful shutdown,
|
||||
so we SIGKILL directly: it's unblockable and the kernel processes
|
||||
it synchronously, so by the time the syscall returns every
|
||||
process in the group is marked dead. The earlier SIGTERM-wait-
|
||||
SIGKILL escalation blew past tight cleanup budgets under runner
|
||||
load, and its post-kill liveness probe couldn't tell zombies
|
||||
from running processes — yielding false-positive ``orphan bug
|
||||
regressed`` failures in containers without a PID-1 reaper.
|
||||
"""
|
||||
try:
|
||||
if _IS_WINDOWS:
|
||||
proc.terminate()
|
||||
@ -425,24 +407,11 @@ class LocalEnvironment(BaseEnvironment):
|
||||
raise
|
||||
|
||||
try:
|
||||
os.killpg(pgid, signal.SIGTERM)
|
||||
except ProcessLookupError:
|
||||
return
|
||||
|
||||
# Wait on the process group, not just the shell wrapper. Under
|
||||
# load the wrapper can exit before grandchildren do; returning
|
||||
# at that point leaves orphaned process-group members behind.
|
||||
if _wait_for_group_exit(pgid, 1.0):
|
||||
return
|
||||
|
||||
try:
|
||||
# POSIX-only: _IS_WINDOWS is handled by the outer branch.
|
||||
os.killpg(pgid, signal.SIGKILL)
|
||||
except ProcessLookupError:
|
||||
return
|
||||
_wait_for_group_exit(pgid, 2.0)
|
||||
try:
|
||||
proc.wait(timeout=0.2)
|
||||
proc.wait(timeout=0.5)
|
||||
except (subprocess.TimeoutExpired, OSError):
|
||||
pass
|
||||
except (ProcessLookupError, PermissionError, OSError):
|
||||
|
||||
Loading…
Reference in New Issue
Block a user