Two real bugs in tools/environments/local.py:_kill_process surfaced under
runner load (operator host load avg 16-37 on 8 CPUs):
1. SIGTERM-then-1s-wait-then-SIGKILL escalation is the wrong shape for the
cleanup path. _kill_process is invoked from base.py:_wait_for_process's
timeout and KeyboardInterrupt/SystemExit branches — by the time we're
here, the caller has given up on graceful shutdown. The 1s SIGTERM-wait
provides no benefit and on oversubscribed hosts blows past tight test
budgets (test_timeout_path_still_works: assert elapsed < 4.0s, was 5.05s).
Send SIGKILL directly: it cannot be caught, blocked, or ignored, so a
grace period buys nothing once the caller has given up.
2. _wait_for_group_exit polled os.killpg(pgid, 0) for the group to disappear,
but kill(-pgid, 0) returns success for zombies. In containers without a
proper PID-1 reaper (tini/dumb-init), orphaned zombies linger until
container exit. The function ran to its 2.0s ceiling waiting for kernel
bookkeeping that would not happen while the container lived. SIGKILL
terminated the processes; their zombie entries are not our problem.
Replace both with a single os.killpg(pgid, signal.SIGKILL) followed by
proc.wait(timeout=0.5) to reap our direct child. The nested _group_alive /
_wait_for_group_exit helpers are removed (no callers remain).
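A rough sketch of the resulting shape, for readers without the diff at
hand. This is not the actual _kill_process body: the function name, the
cached_pgid parameter, and the own-group guard are assumptions made for
illustration, and it presumes the child was started in its own process
group.

```python
import os
import signal
import subprocess


def kill_process_group(proc, cached_pgid=None, reap_timeout=0.5):
    """SIGKILL the child's process group, then briefly wait to reap the child."""
    pgid = cached_pgid
    if pgid is None:
        try:
            pgid = os.getpgid(proc.pid)
        except ProcessLookupError:
            pgid = None  # child already gone and reaped; nothing to signal

    if pgid is not None and pgid != os.getpgid(0):
        try:
            os.killpg(pgid, signal.SIGKILL)
        except (ProcessLookupError, PermissionError):
            pass  # group vanished, or contains processes we may not signal
    elif proc.poll() is None:
        # Hypothetical guard: never killpg our own group; kill the child only.
        proc.kill()

    try:
        # Reap only our direct child; leftover zombies belong to PID 1.
        proc.wait(timeout=reap_timeout)
    except subprocess.TimeoutExpired:
        pass  # bounded wait: do not stall the caller's timeout path
```

The bounded wait keeps the cleanup path at roughly half a second in the
worst case, instead of the previous 1s SIGTERM wait plus the 2.0s group
poll.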
Test fixes:
- test_kill_process_uses_cached_pgid_if_wrapper_already_exited: update
expected call sequence to [SIGKILL] (was [SIGTERM, signal-0 check]).
- _pgid_still_alive helper: use ps STAT field to skip zombies, instead of
os.killpg(pgid, 0). Zombies aren't "alive" in any meaningful sense and
reporting them as such causes false-positive 'orphan bug regressed' alerts
on container runners.
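A zombie-aware liveness check along these lines could look like the
sketch below. It is an assumption-laden illustration, not the test
helper itself: the function names are hypothetical, and the ps
invocation assumes a ps that accepts -e and repeated -o options.

```python
import subprocess


def group_has_live_member(ps_output: str, pgid: int) -> bool:
    """Return True if any non-zombie line in 'pgid stat' output matches pgid."""
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        line_pgid, stat = int(fields[0]), fields[1]
        # The first character of STAT is the process state; Z means zombie.
        if line_pgid == pgid and not stat.startswith("Z"):
            return True
    return False


def pgid_still_alive(pgid: int) -> bool:
    """Ask ps for every process's pgid and STAT, ignoring zombies."""
    result = subprocess.run(
        # Separate -o options: a trailing '=' suppresses each column header.
        ["ps", "-e", "-o", "pgid=", "-o", "stat="],
        capture_output=True,
        text=True,
        check=False,
    )
    return group_has_live_member(result.stdout, pgid)
```

Splitting the parsing from the ps call keeps the zombie filter testable
on images where ps is not installed.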
Verified: tests/tools/test_local_interrupt_cleanup.py +
tests/tools/test_local_background_child_hang.py — all 10 tests pass 3/3
runs on the act_runner image (load avg 16+).
Remaining red tests on this branch (test_concurrent_inserts_settle_at_cap,
test_blocking_approval_*, snapshot-drift tests) are unrelated to this
change and have separate root causes (a test that creates 160 real
AIAgents; test-order pollution; code drift). Tracking separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>