|
Some checks failed
Tests / e2e (pull_request) Successful in 57s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Failing after 9s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 10s
Tests / test (pull_request) Failing after 7m7s
Nix / nix (ubuntu-latest) (pull_request) Failing after 13m19s
Restores the Apr 2026 orphan-bug fix for the local terminal backend
(``sleep 300`` survives ``hermes chat -q`` SIGTERM, originally reported
by Physikal) and aligns the ``hermes update`` survivor sweep with the
contract its tests have always pinned.
Three things move:
1. ``tools/environments/local.py:_kill_process``
- Was: SIGTERM → wait up to 1s polling ``os.killpg(pgid, 0)`` → SIGKILL
→ wait up to 2s on the same pollee.
- Now: SIGKILL directly + ``proc.wait(timeout=0.5)`` to reap the wrapper.
- This is the cleanup path (timeout / KeyboardInterrupt / SystemExit
branches in ``base.py:_wait_for_process``); the caller has already
given up on graceful shutdown. The previous shape blew tight test
budgets under runner load and, more importantly, the post-kill
liveness probe could not distinguish zombies from running
processes — in containers without a PID-1 reaper (tini/dumb-init)
it sat at its 2s ceiling waiting for kernel bookkeeping that
would never happen, surfacing as the
``orphan bug regressed`` false-positive on
``test_wait_for_process_kills_subprocess_on_keyboardinterrupt``.
2. ``tests/tools/test_local_interrupt_cleanup.py``
- ``_pgid_still_alive``: switch from ``os.killpg(pgid, 0)`` to ``ps -g
STAT`` so zombies are not reported as alive.
- ``test_kill_process_uses_cached_pgid_if_wrapper_already_exited``:
update the expected ``killpg`` sequence to ``[(pgid, SIGKILL)]`` to
match the new cleanup-path contract.
3. ``hermes_cli/main.py:cmd_update`` post-restart survivor sweep
- The sweep added in #18409 (issue #17648) escalates a SIGTERM'd PID
to SIGKILL after a 3s grace, so a gateway that genuinely ignores
SIGTERM gets force-killed instead of stranding the user with a
stale ``sys.modules``. The fixture-mocked ``time.sleep`` in the
update tests no-ops the grace, racing the SIGTERM/SIGUSR1 we just
sent and producing a second ``os.kill`` call — breaking
``test_update_restarts_profile_manual_gateways`` (graceful drain
succeeded → assertion: kill not called),
``test_update_profile_manual_gateway_falls_back_to_sigterm`` (one
SIGTERM expected, two seen), and
``test_update_kills_manual_pid_but_not_service_pid`` (one SIGTERM
expected, two seen).
- Fix: gate the sweep on a real wall-clock grace. Sample
``time.monotonic()`` before and after the 3s sleep; if less than
2.5s elapsed (test fixture, signal handler, etc.), skip the sweep
entirely. Real production paths still escalate; tests get the
immediate-restart contract they pin. Also probe each candidate
PID with ``os.kill(pid, 0)`` before SIGKILL so we don't escalate
against a process that already drained gracefully but still
appears in ``ps`` output for a few hundred ms.
The Apr 2026 fix on branch ``fix/kill-process-direct-sigkill`` (commit
|
||
|---|---|---|
| .. | ||
| __init__.py | ||
| base.py | ||
| daytona.py | ||
| docker.py | ||
| file_sync.py | ||
| local.py | ||
| managed_modal.py | ||
| modal_utils.py | ||
| modal.py | ||
| singularity.py | ||
| ssh.py | ||
| vercel_sandbox.py | ||