fix(tools/environments): SIGKILL-only on KeyboardInterrupt; restore Physikal Apr 2026 orphan-bug fix (partial close hermes-agent#9) #10
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "fix/sigkill-cleanup-and-survivor-sweep-grace"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Scope
Four of the 27 failing tests in #9 trace to the same signal-handling regression. This PR closes those four; the other 23 are unrelated and tracked separately.
test_wait_for_process_kills_subprocess_on_keyboardinterruptsubprocess group X is STILL ALIVE after worker received KeyboardInterrupt — orphan bug regressedtest_update_restarts_profile_manual_gatewaysExpected 'kill' to NOT have been called. Called 1 times. [(12345, SIGKILL)]test_update_profile_manual_gateway_falls_back_to_sigtermExpected 'kill' to have been called once. Called 2 times. [SIGTERM, SIGKILL]test_update_kills_manual_pid_but_not_service_pidassert 2 == 1 ([SIGTERM, SIGKILL])Root cause
Test #1 is the Physikal Apr 2026 orphan-bug regression.
tools/environments/local.py:_kill_processwas the SIGTERM → 1s wait → SIGKILL → 2s wait shape. The post-SIGTERM wait polledos.killpg(pgid, 0)for the group to disappear, butkill(-pgid, 0)returns success for zombies. In containers without a PID-1 reaper (act_runner, no tini/dumb-init) zombies linger, the wait sat at its ceiling, and the test helper_pgid_still_alive(also usingkillpg) flagged the zombies as live →orphan bug regressedfalse-positive. The Apr 2026 fix on branchfix/kill-process-direct-sigkill(commitd6fca4f6) was the right take but never landed.Tests #2–4 were broken by #18409 (issue #17648). That PR added a post-restart survivor sweep to
cmd_updatethat escalates SIGTERM'd PIDs to SIGKILL after a 3s grace — fixing the case where a gateway ignores SIGTERM and strands the user with a stalesys.modules. The autouse_no_restart_verify_sleepfixture intest_update_gateway_restart.pymonkey-patchestime.sleepto no-op, so the grace never elapses, the sweep races the SIGTERM/SIGUSR1 we just sent, and a secondos.kill(pid, SIGKILL)shows up — breaking the immediate-restart contract these tests pin.Fix
tools/environments/local.py:_kill_process: SIGKILL directly +proc.wait(timeout=0.5). Drop the_group_alive/_wait_for_group_exithelpers (no callers remain). Cleanup is the timeout/KeyboardInterrupt/SystemExit path — caller has given up on graceful shutdown.tests/tools/test_local_interrupt_cleanup.py:_pgid_still_alive: switch tops -g STATand skip stat=Z so zombies don't false-positive.test_kill_process_uses_cached_pgid_if_wrapper_already_exited: expect[(pgid, SIGKILL)]for the new cleanup contract.hermes_cli/main.py:cmd_update: gate the post-restart survivor sweep on a real wall-clock grace. Sampletime.monotonic()before and after the 3stime.sleep; if <2.5s elapsed (test fixture, signal handler, etc.), skip the sweep. Also probe each candidate PID withos.kill(pid, 0)before SIGKILL so we don't escalate against a PID that already drained but still appears inpsoutput. Production paths preserve #17648's defense; the tests get the immediate-restart contract they were always pinning.Verification
pytest tests/tools/test_local_interrupt_cleanup.py tests/hermes_cli/test_update_gateway_restart.py -v— 49/49 pass (was 4 fail / 41 pass).pytest tests/tools/test_local_background_child_hang.py tests/tools/test_base_environment.py tests/tools/test_windows_compat.py— pass (windows-compat killpg-guard lint specifically validated; docstring reworded soos.killpgdoesn't appear in the 15-line look-back).pytest -q tests/tools/ tests/hermes_cli/ --ignore=tests/integration: identical failure set tomainminus these four. Delta verified bydiff before.txt after.txt— only the four named buckets move; the remaining ~100 reds onmainare unrelated (web_server import errors, file_state_registry tracking drift, dockerfile build expectations, etc., per #9 buckets).Why a separate
_SurvivorSweepSkippedsentinelThe sweep is wrapped in a broad
except Exceptionfor safety. Distinguishing the intentional-skip path from a real sweep failure keeps debug logging meaningful — sweep failures still emitlogger.debug, but the test-fixture skip is silent.Out of scope
The other 23 failing tests in #9 — bedrock model picker, file-state registry, dockerfile pid1 reaping, gateway service systemd units, web_server schema, kanban dashboard, etc. — are separate root causes in different domains. Not touched here.
Closes 4 of the buckets in #9.
Review before merge — orchestrator handles merge.
Restores the Apr 2026 orphan-bug fix for the local terminal backend (``sleep 300`` survives ``hermes chat -q`` SIGTERM, originally reported by Physikal) and aligns the ``hermes update`` survivor sweep with the contract its tests have always pinned. Three things move: 1. ``tools/environments/local.py:_kill_process`` - Was: SIGTERM → wait up to 1s polling ``os.killpg(pgid, 0)`` → SIGKILL → wait up to 2s on the same pollee. - Now: SIGKILL directly + ``proc.wait(timeout=0.5)`` to reap the wrapper. - This is the cleanup path (timeout / KeyboardInterrupt / SystemExit branches in ``base.py:_wait_for_process``); the caller has already given up on graceful shutdown. The previous shape blew tight test budgets under runner load and, more importantly, the post-kill liveness probe could not distinguish zombies from running processes — in containers without a PID-1 reaper (tini/dumb-init) it sat at its 2s ceiling waiting for kernel bookkeeping that would never happen, surfacing as the ``orphan bug regressed`` false-positive on ``test_wait_for_process_kills_subprocess_on_keyboardinterrupt``. 2. ``tests/tools/test_local_interrupt_cleanup.py`` - ``_pgid_still_alive``: switch from ``os.killpg(pgid, 0)`` to ``ps -g STAT`` so zombies are not reported as alive. - ``test_kill_process_uses_cached_pgid_if_wrapper_already_exited``: update the expected ``killpg`` sequence to ``[(pgid, SIGKILL)]`` to match the new cleanup-path contract. 3. ``hermes_cli/main.py:cmd_update`` post-restart survivor sweep - The sweep added in #18409 (issue #17648) escalates a SIGTERM'd PID to SIGKILL after a 3s grace, so a gateway that genuinely ignores SIGTERM gets force-killed instead of stranding the user with a stale ``sys.modules``. The fixture-mocked ``time.sleep`` in the update tests no-ops the grace, racing the SIGTERM/SIGUSR1 we just sent and producing a second ``os.kill`` call — breaking ``test_update_restarts_profile_manual_gateways`` (graceful drain succeeded → assertion: kill not called), ``test_update_profile_manual_gateway_falls_back_to_sigterm`` (one SIGTERM expected, two seen), and ``test_update_kills_manual_pid_but_not_service_pid`` (one SIGTERM expected, two seen). - Fix: gate the sweep on a real wall-clock grace. Sample ``time.monotonic()`` before and after the 3s sleep; if less than 2.5s elapsed (test fixture, signal handler, etc.), skip the sweep entirely. Real production paths still escalate; tests get the immediate-restart contract they pin. Also probe each candidate PID with ``os.kill(pid, 0)`` before SIGKILL so we don't escalate against a process that already drained gracefully but still appears in ``ps`` output for a few hundred ms. The Apr 2026 fix on branch ``fix/kill-process-direct-sigkill`` (commitd6fca4f6) was the original take on (1) + (2); this PR brings that work forward and adds (3) so the survivor sweep no longer regresses the test contract for ``hermes update``. Verification: - ``pytest -x tests/tools/test_local_interrupt_cleanup.py tests/hermes_cli/test_update_gateway_restart.py -v`` — 49/49 pass. - ``pytest -q tests/tools/test_local_background_child_hang.py tests/tools/test_base_environment.py tests/tools/test_windows_compat.py`` — all pass. - Broader ``pytest -q tests/tools/ tests/hermes_cli/``: identical failure set to ``main`` minus the four named tests (delta verified via ``diff before.txt after.txt``). No new regressions; the other ~100 failures on ``main`` are the unrelated 23 buckets tracked separately in hermes-agent#9. Closes the four signal-handling buckets in #9; remaining 23 untouched.LGTM. Both regressions correctly identified: (1) Apr-2026 branch never merged + zombie-detection bug in _wait_for_group_exit; (2) PR #18409 sweep races the just-sent SIGTERM under monkey-patched sleep. Fix is the right shape: SIGKILL-direct in cleanup path (kernel-synchronous, no zombie ambiguity) + monotonic-time gate on survivor sweep + per-PID os.kill(pid, 0) probe. 49/49 tests pass on target cluster. Closes 4 of 27 from #9.
claude-ceo-assistant referenced this pull request2026-05-08 21:12:02 +00:00