fix(local-env): SIGKILL directly on cleanup; don't wait for zombie reap #3

Open
claude-ceo-assistant wants to merge 2 commits from fix/kill-process-direct-sigkill into main

What this fixes

Two deterministic test failures in tools/environments/local.py:_kill_process, surfaced once #1 (fix/setup-uv-pin-version) lets the test suite actually run past Install uv.

Root causes (verified by reproduction in the act_runner image)

1. SIGTERM-then-SIGKILL escalation is the wrong shape for the cleanup path.

_kill_process is invoked from base.py:_wait_for_process's timeout and KeyboardInterrupt/SystemExit branches — by the time we're here, graceful shutdown is no longer wanted. The 1s SIGTERM-wait provides no benefit and, on oversubscribed runners, blows past tight test budgets:

  • test_timeout_path_still_works asserts elapsed < 4.0s, observed 5.01s
  • math: 2.0s (timeout) + 0.2s (poll granularity) + 1.0s (SIGTERM-wait ceiling) + 2.0s (SIGKILL-wait ceiling) = 5.2s worst case

Send SIGKILL directly: it's unblockable, kernel processes it synchronously, and by the time the syscall returns every process in the group is marked dead.

2. _wait_for_group_exit polled os.killpg(pgid, 0) waiting for the group to disappear — but kill(-pgid, 0) returns success for zombies.

In containers without a proper PID-1 reaper (tini/dumb-init), orphaned zombies linger until container exit. The function sat in its 2.0s ceiling waiting for kernel bookkeeping that would never happen. The processes had stopped (SIGKILL got them); their zombie entries are not our problem.

Reproduced as a Z (defunct) [sleep] with PPID=1 in the test's _process_group_snapshot output; this triggered the test's "orphan bug regressed" assertion despite the bug NOT being regressed.

Fix

  • tools/environments/local.py:_kill_process: send os.killpg(pgid, SIGKILL) directly, then proc.wait(timeout=0.5) to reap our direct child. Drop the now-unused _group_alive / _wait_for_group_exit nested helpers.
  • tests/tools/test_local_interrupt_cleanup.py::_pgid_still_alive: use ps -g $pgid -o stat= and skip zombie (Z) entries. Zombies aren't "alive" in any meaningful sense and reporting them as such causes false-positive failures on container runners.
  • test_kill_process_uses_cached_pgid_if_wrapper_already_exited: update expected call sequence to [(SIGKILL)] (was SIGTERM-then-existence-check).

Verification

Reproduced on the same act_runner image used in CI (runner-base:full-latest-cloudflared-goproxy-pipe) on the operator host (load avg 16-37 on 8 CPUs — same conditions as the failing CI runs).

  • test_local_interrupt_cleanup.py + test_local_background_child_hang.py (10 tests total): pass 3/3 stress runs after fix. Failing 0/3 before.
  • Spot-checked the timeout test's elapsed: now ~2.5-2.7s (was 5.01s), well under the 4.0s budget.

Out of scope

Other tests that were red on main after #1:

  • test_concurrent_inserts_settle_at_cap — passes isolated, fails when run in parallel with other heavy tests because it creates 160 real AIAgent instances. Real fix is to refactor the test to use a stub AIAgent. Tracking separately.
  • test_blocking_approval_* — passes alone, fails when xdist schedules it alongside test_concurrent_inserts_*. Suspected order-pollution + parallel-load. Tracking separately.
  • ~24 snapshot-drift tests (commands list, TimeoutStopSec value, Dockerfile, kill signal mocks, _tool_guardrails attr). All deterministic, all unrelated to this change.
## What this fixes Two deterministic test failures in `tools/environments/local.py:_kill_process`, surfaced once #1 (`fix/setup-uv-pin-version`) lets the test suite actually run past `Install uv`. ### Root causes (verified by reproduction in the act_runner image) **1. SIGTERM-then-SIGKILL escalation is the wrong shape for the cleanup path.** `_kill_process` is invoked from `base.py:_wait_for_process`'s timeout and KeyboardInterrupt/SystemExit branches — by the time we're here, graceful shutdown is no longer wanted. The 1s SIGTERM-wait provides no benefit and, on oversubscribed runners, blows past tight test budgets: - `test_timeout_path_still_works` asserts `elapsed < 4.0s`, observed 5.01s - math: 2.0s (timeout) + 0.2s (poll granularity) + 1.0s (SIGTERM-wait ceiling) + 2.0s (SIGKILL-wait ceiling) = 5.2s worst case Send SIGKILL directly: it's unblockable, kernel processes it synchronously, and by the time the syscall returns every process in the group is marked dead. **2. `_wait_for_group_exit` polled `os.killpg(pgid, 0)` waiting for the group to disappear — but `kill(-pgid, 0)` returns success for zombies.** In containers without a proper PID-1 reaper (tini/dumb-init), orphaned zombies linger until container exit. The function sat in its 2.0s ceiling waiting for kernel bookkeeping that would never happen. The processes had stopped (SIGKILL got them); their zombie entries are not our problem. Reproduced as a Z (defunct) `[sleep]` with PPID=1 in the test's `_process_group_snapshot` output; this triggered the test's `"orphan bug regressed"` assertion despite the bug NOT being regressed. ## Fix - `tools/environments/local.py:_kill_process`: send `os.killpg(pgid, SIGKILL)` directly, then `proc.wait(timeout=0.5)` to reap our direct child. Drop the now-unused `_group_alive` / `_wait_for_group_exit` nested helpers. - `tests/tools/test_local_interrupt_cleanup.py::_pgid_still_alive`: use `ps -g $pgid -o stat=` and skip zombie (`Z`) entries. Zombies aren't "alive" in any meaningful sense and reporting them as such causes false-positive failures on container runners. - `test_kill_process_uses_cached_pgid_if_wrapper_already_exited`: update expected call sequence to `[(SIGKILL)]` (was SIGTERM-then-existence-check). ## Verification Reproduced on the same act_runner image used in CI (`runner-base:full-latest-cloudflared-goproxy-pipe`) on the operator host (load avg 16-37 on 8 CPUs — same conditions as the failing CI runs). - `test_local_interrupt_cleanup.py` + `test_local_background_child_hang.py` (10 tests total): pass 3/3 stress runs after fix. Failing 0/3 before. - Spot-checked the timeout test's elapsed: now ~2.5-2.7s (was 5.01s), well under the 4.0s budget. ## Out of scope Other tests that were red on `main` after #1: - `test_concurrent_inserts_settle_at_cap` — passes isolated, fails when run in parallel with other heavy tests because it creates 160 real `AIAgent` instances. Real fix is to refactor the test to use a stub AIAgent. Tracking separately. - `test_blocking_approval_*` — passes alone, fails when xdist schedules it alongside `test_concurrent_inserts_*`. Suspected order-pollution + parallel-load. Tracking separately. - ~24 snapshot-drift tests (commands list, TimeoutStopSec value, Dockerfile, kill signal mocks, `_tool_guardrails` attr). All deterministic, all unrelated to this change.
claude-ceo-assistant added 1 commit 2026-05-08 02:34:24 +00:00
fix(local-env): use SIGKILL directly on cleanup, don't wait for zombie group reap
Some checks failed
Contributor Attribution Check / check-attribution (pull_request) Failing after 15s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 20s
Nix / nix (ubuntu-latest) (pull_request) Failing after 9m23s
Tests / test (pull_request) Failing after 11m50s
Tests / e2e (pull_request) Successful in 1m4s
Nix / nix (macos-latest) (pull_request) Has been cancelled
d6fca4f623
Two real bugs in tools/environments/local.py:_kill_process surfaced under
runner load (operator host load avg 16-37 on 8 CPUs):

1. SIGTERM-then-1s-wait-then-SIGKILL escalation is the wrong shape for the
   cleanup path. _kill_process is invoked from base.py:_wait_for_process's
   timeout and KeyboardInterrupt/SystemExit branches — by the time we're
   here, the caller has given up on graceful shutdown. The 1s SIGTERM-wait
   provides no benefit and on oversubscribed hosts blows past tight test
   budgets (test_timeout_path_still_works: assert elapsed < 4.0s, was 5.05s).
   Send SIGKILL directly: it's unblockable and the kernel processes it
   synchronously.

2. _wait_for_group_exit polled os.killpg(pgid, 0) for the group to disappear,
   but kill(-pgid, 0) returns success for zombies. In containers without a
   proper PID-1 reaper (tini/dumb-init), orphaned zombies linger until
   container exit. The function sat in its 2.0s ceiling waiting for kernel
   bookkeeping that would never happen. SIGKILL stopped the processes; their
   zombie entries are not our problem.

Replace both with a single os.killpg(SIGKILL) + proc.wait(timeout=0.5) to
reap our direct child. The nested _group_alive / _wait_for_group_exit
helpers are removed (no callers remain).

Test fixes:
- test_kill_process_uses_cached_pgid_if_wrapper_already_exited: update
  expected call sequence to [(SIGKILL)] (was [SIGTERM, KILL-0-check]).
- _pgid_still_alive helper: use ps STAT field to skip zombies, instead of
  os.killpg(pgid, 0). Zombies aren't "alive" in any meaningful sense and
  reporting them as such causes false-positive 'orphan bug regressed' alerts
  on container runners.

Verified: tests/tools/test_local_interrupt_cleanup.py +
tests/tools/test_local_background_child_hang.py — all 10 tests pass 3/3
runs on the act_runner image (load avg 16+).

Remaining red tests on this branch (test_concurrent_inserts_settle_at_cap,
test_blocking_approval_*, snapshot-drift tests) are unrelated to this
change — they have separate root causes (test creates 160 real AIAgents;
test order-pollution; code drift). Tracking separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-ceo-assistant added 1 commit 2026-05-08 04:07:14 +00:00
chore(release): map claude-ceo-assistant email for AUTHOR_MAP
Some checks failed
Tests / test (pull_request) Failing after 1m27s
Tests / e2e (pull_request) Failing after 1m36s
Nix / nix (ubuntu-latest) (pull_request) Failing after 1m39s
Nix / nix (macos-latest) (pull_request) Waiting to run
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 33s
Contributor Attribution Check / check-attribution (pull_request) Successful in 28s
23954f89d5
The contributor-check.yml workflow requires every commit author email
to have an entry in scripts/release.py:AUTHOR_MAP. claude-ceo-assistant
is the Gitea-only bot identity used by Claude-Code-driven PRs in the
molecule-ai fork (introduced post-2026-05-06 GitHub suspension; no
upstream/GitHub equivalent). Register it so PRs from that identity pass
the attribution check.

Pattern matches recent same-shape commits: 73bcd83, 50f9f38, 9c626ef.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some checks failed
Tests / test (pull_request) Failing after 1m27s
Tests / e2e (pull_request) Failing after 1m36s
Nix / nix (ubuntu-latest) (pull_request) Failing after 1m39s
Nix / nix (macos-latest) (pull_request) Waiting to run
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 33s
Contributor Attribution Check / check-attribution (pull_request) Successful in 28s
This pull request has changes conflicting with the target branch.
  • tests/tools/test_local_interrupt_cleanup.py
  • tools/environments/local.py

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin fix/kill-process-direct-sigkill:fix/kill-process-direct-sigkill
git checkout fix/kill-process-direct-sigkill
Sign in to join this conversation.
No reviewers
No Label
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: molecule-ai/hermes-agent#3
No description provided.