Some checks failed
Tests / e2e (pull_request) Successful in 57s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Failing after 9s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 10s
Tests / test (pull_request) Failing after 7m7s
Nix / nix (ubuntu-latest) (pull_request) Failing after 13m19s
Restores the Apr 2026 orphan-bug fix for the local terminal backend
(``sleep 300`` survives ``hermes chat -q`` SIGTERM, originally reported
by Physikal) and aligns the ``hermes update`` survivor sweep with the
contract its tests have always pinned.
Three things move:
1. ``tools/environments/local.py:_kill_process``
- Was: SIGTERM → wait up to 1s polling ``os.killpg(pgid, 0)`` → SIGKILL
→ wait up to 2s on the same pollee.
- Now: SIGKILL directly + ``proc.wait(timeout=0.5)`` to reap the wrapper.
- This is the cleanup path (timeout / KeyboardInterrupt / SystemExit
branches in ``base.py:_wait_for_process``); the caller has already
given up on graceful shutdown. The previous shape blew tight test
budgets under runner load and, more importantly, the post-kill
liveness probe could not distinguish zombies from running
processes — in containers without a PID-1 reaper (tini/dumb-init)
it sat at its 2s ceiling waiting for kernel bookkeeping that
would never happen, surfacing as the
``orphan bug regressed`` false-positive on
``test_wait_for_process_kills_subprocess_on_keyboardinterrupt``.
2. ``tests/tools/test_local_interrupt_cleanup.py``
- ``_pgid_still_alive``: switch from ``os.killpg(pgid, 0)`` to ``ps -g
STAT`` so zombies are not reported as alive.
- ``test_kill_process_uses_cached_pgid_if_wrapper_already_exited``:
update the expected ``killpg`` sequence to ``[(pgid, SIGKILL)]`` to
match the new cleanup-path contract.
3. ``hermes_cli/main.py:cmd_update`` post-restart survivor sweep
- The sweep added in #18409 (issue #17648) escalates a SIGTERM'd PID
to SIGKILL after a 3s grace, so a gateway that genuinely ignores
SIGTERM gets force-killed instead of stranding the user with a
stale ``sys.modules``. The fixture-mocked ``time.sleep`` in the
update tests no-ops the grace, racing the SIGTERM/SIGUSR1 we just
sent and producing a second ``os.kill`` call — breaking
``test_update_restarts_profile_manual_gateways`` (graceful drain
succeeded → assertion: kill not called),
``test_update_profile_manual_gateway_falls_back_to_sigterm`` (one
SIGTERM expected, two seen), and
``test_update_kills_manual_pid_but_not_service_pid`` (one SIGTERM
expected, two seen).
- Fix: gate the sweep on a real wall-clock grace. Sample
``time.monotonic()`` before and after the 3s sleep; if less than
2.5s elapsed (test fixture, signal handler, etc.), skip the sweep
entirely. Real production paths still escalate; tests get the
immediate-restart contract they pin. Also probe each candidate
PID with ``os.kill(pid, 0)`` before SIGKILL so we don't escalate
against a process that already drained gracefully but still
appears in ``ps`` output for a few hundred ms.
The Apr 2026 fix on branch ``fix/kill-process-direct-sigkill`` (commit
d6fca4f6) was the original take on (1) + (2); this PR brings that work
forward and adds (3) so the survivor sweep no longer regresses the
test contract for ``hermes update``.
Verification:
- ``pytest -x tests/tools/test_local_interrupt_cleanup.py
tests/hermes_cli/test_update_gateway_restart.py -v`` — 49/49 pass.
- ``pytest -q tests/tools/test_local_background_child_hang.py
tests/tools/test_base_environment.py
tests/tools/test_windows_compat.py`` — all pass.
- Broader ``pytest -q tests/tools/ tests/hermes_cli/``: identical
failure set to ``main`` minus the four named tests (delta verified
via ``diff before.txt after.txt``). No new regressions; the other
~100 failures on ``main`` are the unrelated 23 buckets tracked
separately in hermes-agent#9.
Closes the four signal-handling buckets in #9; remaining 23 untouched.
218 lines
8.7 KiB
Python
218 lines
8.7 KiB
Python
"""Regression tests for _wait_for_process subprocess cleanup on exception exit.
|
|
|
|
When the poll loop exits via KeyboardInterrupt or SystemExit (SIGTERM via
|
|
cli.py signal handler, SIGINT on the main thread in non-interactive -q mode,
|
|
or explicit sys.exit from some caller), the child subprocess must be killed
|
|
before the exception propagates — otherwise the local backend's use of
|
|
os.setsid leaves an orphan with PPID=1.
|
|
|
|
The live repro that motivated this: hermes chat -q ... 'sleep 300', SIGTERM
|
|
to the python process, sleep 300 survived with PPID=1 for the full 300 s
|
|
because _wait_for_process never got to call _kill_process before python
|
|
died. See commit message for full context.
|
|
"""
|
|
import os
|
|
import signal
|
|
import subprocess
|
|
import threading
|
|
import time
|
|
from types import SimpleNamespace
|
|
|
|
import pytest
|
|
|
|
from tools.environments.local import LocalEnvironment
|
|
|
|
|
|
@pytest.fixture(autouse=True)
|
|
def _isolate_hermes_home(tmp_path, monkeypatch):
|
|
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
|
|
(tmp_path / "logs").mkdir(exist_ok=True)
|
|
|
|
|
|
def _pgid_still_alive(pgid: int) -> bool:
|
|
"""Return True if any LIVE (non-zombie) process in the group remains.
|
|
|
|
Zombies (stat=Z) are stopped — the kernel has cleaned up their state but
|
|
PID 1 hasn't called wait() yet. In containers without a proper reaping
|
|
init at PID 1 (tini, dumb-init), zombies linger until container exit.
|
|
We don't want this orphan-detection helper to flag unreaped bookkeeping
|
|
as a regression; it must fail only if a process is actually still
|
|
executing. ``os.killpg(pgid, 0)`` doesn't distinguish — it returns
|
|
success for zombies. ``ps STAT`` does.
|
|
"""
|
|
try:
|
|
out = subprocess.run(
|
|
["ps", "-g", str(pgid), "-o", "stat="],
|
|
capture_output=True, text=True, check=False,
|
|
).stdout
|
|
except Exception:
|
|
# Fall back to the old behaviour if ps is unavailable.
|
|
try:
|
|
os.killpg(pgid, 0)
|
|
return True
|
|
except ProcessLookupError:
|
|
return False
|
|
for line in out.splitlines():
|
|
stat = line.strip()
|
|
if stat and not stat.startswith("Z"):
|
|
return True
|
|
return False
|
|
|
|
|
|
def _process_group_snapshot(pgid: int) -> str:
|
|
"""Return a process-table snapshot for diagnostics."""
|
|
return subprocess.run(
|
|
["ps", "-o", "pid,ppid,pgid,stat,cmd", "-g", str(pgid)],
|
|
capture_output=True,
|
|
text=True,
|
|
check=False,
|
|
).stdout.strip()
|
|
|
|
|
|
def _wait_for_pgid_exit(pgid: int, timeout: float = 10.0) -> bool:
|
|
"""Wait for a process group to disappear under loaded xdist hosts."""
|
|
deadline = time.monotonic() + timeout
|
|
while time.monotonic() < deadline:
|
|
if not _pgid_still_alive(pgid):
|
|
return True
|
|
time.sleep(0.1)
|
|
return not _pgid_still_alive(pgid)
|
|
|
|
|
|
def test_kill_process_uses_cached_pgid_if_wrapper_already_exited(monkeypatch):
|
|
"""If the shell wrapper exits before cleanup, still kill its process group.
|
|
|
|
Without the cached pgid fallback, ``os.getpgid(proc.pid)`` raises for the
|
|
dead wrapper and cleanup falls back to ``proc.kill()``, which cannot reach
|
|
orphaned grandchildren still running in the original process group.
|
|
"""
|
|
env = object.__new__(LocalEnvironment)
|
|
proc = SimpleNamespace(
|
|
pid=12345,
|
|
_hermes_pgid=67890,
|
|
poll=lambda: 0,
|
|
kill=lambda: None,
|
|
wait=lambda timeout=None: 0,
|
|
)
|
|
killpg_calls = []
|
|
|
|
def fake_getpgid(_pid):
|
|
raise ProcessLookupError
|
|
|
|
def fake_killpg(pgid, sig):
|
|
killpg_calls.append((pgid, sig))
|
|
|
|
monkeypatch.setattr(os, "getpgid", fake_getpgid)
|
|
monkeypatch.setattr(os, "killpg", fake_killpg)
|
|
|
|
env._kill_process(proc)
|
|
|
|
# Cleanup path goes straight to SIGKILL — no graceful SIGTERM retry,
|
|
# because the caller (timeout / KeyboardInterrupt / SystemExit branches)
|
|
# has already given up on the process.
|
|
assert killpg_calls == [(67890, signal.SIGKILL)]
|
|
|
|
|
|
def test_wait_for_process_kills_subprocess_on_keyboardinterrupt():
|
|
"""When KeyboardInterrupt arrives mid-poll, the subprocess group must be
|
|
killed before the exception is re-raised."""
|
|
env = LocalEnvironment(cwd="/tmp")
|
|
try:
|
|
result_holder = {}
|
|
proc_holder = {}
|
|
started = threading.Event()
|
|
raise_at = [None] # set by the main thread to tell worker when
|
|
|
|
# Drive execute() on a separate thread so we can SIGNAL-interrupt it
|
|
# via a thread-targeted exception without killing our test process.
|
|
def worker():
|
|
# Spawn a subprocess that will definitely be alive long enough
|
|
# to observe the cleanup, via env.execute(...) — the normal path
|
|
# that goes through _wait_for_process.
|
|
try:
|
|
result_holder["result"] = env.execute("sleep 30", timeout=60)
|
|
except BaseException as e: # noqa: BLE001 — we want to observe it
|
|
result_holder["exception"] = type(e).__name__
|
|
|
|
t = threading.Thread(target=worker, daemon=True)
|
|
t.start()
|
|
# Wait until the subprocess actually exists. LocalEnvironment.execute
|
|
# does init_session() (one spawn) before the real command, so we need
|
|
# to wait until a sleep 30 is visible. Use pgrep-style lookup via
|
|
# /proc to find the bash process running our sleep.
|
|
deadline = time.monotonic() + 5.0
|
|
target_pid = None
|
|
while time.monotonic() < deadline:
|
|
# Walk our children and grand-children to find one running 'sleep 30'
|
|
try:
|
|
import psutil # optional — fall back if absent
|
|
for p in psutil.Process(os.getpid()).children(recursive=True):
|
|
try:
|
|
if "sleep 30" in " ".join(p.cmdline()):
|
|
target_pid = p.pid
|
|
break
|
|
except (psutil.NoSuchProcess, psutil.AccessDenied):
|
|
continue
|
|
except ImportError:
|
|
# Fall back to ps
|
|
ps = subprocess.run(
|
|
["ps", "-eo", "pid,ppid,pgid,cmd"], capture_output=True, text=True,
|
|
)
|
|
for line in ps.stdout.splitlines():
|
|
if "sleep 30" in line and "grep" not in line:
|
|
parts = line.split()
|
|
if parts and parts[0].isdigit():
|
|
target_pid = int(parts[0])
|
|
break
|
|
if target_pid:
|
|
break
|
|
time.sleep(0.1)
|
|
|
|
assert target_pid is not None, (
|
|
"test setup: couldn't find 'sleep 30' subprocess after 5 s"
|
|
)
|
|
pgid = os.getpgid(target_pid)
|
|
assert _pgid_still_alive(pgid), "sanity: subprocess should be alive"
|
|
|
|
# Now inject a KeyboardInterrupt into the worker thread the same
|
|
# way CPython's signal machinery would. We use ctypes.PyThreadState_SetAsyncExc
|
|
# which is how signal delivery to non-main threads is simulated.
|
|
import ctypes
|
|
import sys as _sys
|
|
# py-thread-state exception targets need the ident, not the Thread
|
|
tid = t.ident
|
|
assert tid is not None
|
|
# Fire KeyboardInterrupt into the worker thread
|
|
ret = ctypes.pythonapi.PyThreadState_SetAsyncExc(
|
|
ctypes.c_ulong(tid), ctypes.py_object(KeyboardInterrupt),
|
|
)
|
|
assert ret == 1, f"SetAsyncExc returned {ret}, expected 1"
|
|
|
|
# Give the worker a moment to: hit the exception at the next poll,
|
|
# run the except-block cleanup (_kill_process), and exit.
|
|
t.join(timeout=5.0)
|
|
assert not t.is_alive(), "worker didn't exit within 5 s of the interrupt"
|
|
|
|
# The critical assertion: the subprocess GROUP must be dead. Not
|
|
# just the bash wrapper — the 'sleep 30' child too. Under xdist load,
|
|
# process-group disappearance can lag briefly after the worker exits,
|
|
# especially if the process is already dying or waiting to be reaped.
|
|
assert _wait_for_pgid_exit(pgid), (
|
|
f"subprocess group {pgid} is STILL ALIVE after worker received "
|
|
f"KeyboardInterrupt — orphan bug regressed. This is the "
|
|
f"sleep-300-survives-SIGTERM scenario from Physikal's Apr 2026 "
|
|
f"report. See tools/environments/base.py _wait_for_process "
|
|
f"except-block.\n{_process_group_snapshot(pgid)}"
|
|
)
|
|
# And the worker should have observed the KeyboardInterrupt (i.e.
|
|
# it re-raised cleanly, not silently swallowed).
|
|
assert result_holder.get("exception") == "KeyboardInterrupt", (
|
|
f"worker result: {result_holder!r} — expected KeyboardInterrupt "
|
|
f"propagation after cleanup"
|
|
)
|
|
finally:
|
|
try:
|
|
env.cleanup()
|
|
except Exception:
|
|
pass
|