fix(local-env): SIGKILL directly on cleanup; don't wait for zombie reap #3
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "fix/kill-process-direct-sigkill"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What this fixes
Two deterministic test failures in
tools/environments/local.py:_kill_process, surfaced once #1 (fix/setup-uv-pin-version) lets the test suite actually run pastInstall uv.Root causes (verified by reproduction in the act_runner image)
1. SIGTERM-then-SIGKILL escalation is the wrong shape for the cleanup path.
_kill_processis invoked frombase.py:_wait_for_process's timeout and KeyboardInterrupt/SystemExit branches — by the time we're here, graceful shutdown is no longer wanted. The 1s SIGTERM-wait provides no benefit and, on oversubscribed runners, blows past tight test budgets:test_timeout_path_still_worksassertselapsed < 4.0s, observed 5.01sSend SIGKILL directly: it's unblockable, kernel processes it synchronously, and by the time the syscall returns every process in the group is marked dead.
2.
_wait_for_group_exitpolledos.killpg(pgid, 0)waiting for the group to disappear — butkill(-pgid, 0)returns success for zombies.In containers without a proper PID-1 reaper (tini/dumb-init), orphaned zombies linger until container exit. The function sat in its 2.0s ceiling waiting for kernel bookkeeping that would never happen. The processes had stopped (SIGKILL got them); their zombie entries are not our problem.
Reproduced as a Z (defunct)
[sleep]with PPID=1 in the test's_process_group_snapshotoutput; this triggered the test's"orphan bug regressed"assertion despite the bug NOT being regressed.Fix
tools/environments/local.py:_kill_process: sendos.killpg(pgid, SIGKILL)directly, thenproc.wait(timeout=0.5)to reap our direct child. Drop the now-unused_group_alive/_wait_for_group_exitnested helpers.tests/tools/test_local_interrupt_cleanup.py::_pgid_still_alive: useps -g $pgid -o stat=and skip zombie (Z) entries. Zombies aren't "alive" in any meaningful sense and reporting them as such causes false-positive failures on container runners.test_kill_process_uses_cached_pgid_if_wrapper_already_exited: update expected call sequence to[(SIGKILL)](was SIGTERM-then-existence-check).Verification
Reproduced on the same act_runner image used in CI (
runner-base:full-latest-cloudflared-goproxy-pipe) on the operator host (load avg 16-37 on 8 CPUs — same conditions as the failing CI runs).test_local_interrupt_cleanup.py+test_local_background_child_hang.py(10 tests total): pass 3/3 stress runs after fix. Failing 0/3 before.Out of scope
Other tests that were red on
mainafter #1:test_concurrent_inserts_settle_at_cap— passes isolated, fails when run in parallel with other heavy tests because it creates 160 realAIAgentinstances. Real fix is to refactor the test to use a stub AIAgent. Tracking separately.test_blocking_approval_*— passes alone, fails when xdist schedules it alongsidetest_concurrent_inserts_*. Suspected order-pollution + parallel-load. Tracking separately._tool_guardrailsattr). All deterministic, all unrelated to this change.Checkout
From your project repository, check out a new branch and test the changes.