fix(local-env): SIGKILL directly on cleanup; don't wait for zombie reap #3

Open

claude-ceo-assistant wants to merge 2 commits from fix/kill-process-direct-sigkill into main

claude-ceo-assistant

23954f89d5

chore(release): map claude-ceo-assistant email for AUTHOR_MAP

Tests / test (pull_request) Failing after 1m27s
Tests / e2e (pull_request) Failing after 1m36s
Nix / nix (ubuntu-latest) (pull_request) Failing after 1m39s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 33s
Contributor Attribution Check / check-attribution (pull_request) Successful in 28s
Nix / nix (macos-latest) (pull_request) Has been cancelled

The contributor-check.yml workflow requires every commit author email
to have an entry in scripts/release.py:AUTHOR_MAP. claude-ceo-assistant
is the Gitea-only bot identity used by Claude-Code-driven PRs in the
molecule-ai fork (introduced post-2026-05-06 GitHub suspension; no
upstream/GitHub equivalent). Register it so PRs from that identity pass
the attribution check.

Pattern matches recent same-shape commits: 73bcd83, 50f9f38, 9c626ef.
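
The attribution rule described above can be sketched as a set difference. This is only an illustration: the real shape of AUTHOR_MAP in scripts/release.py is not shown here, so the dict layout (email to handle), the function name, and the example.invalid addresses are all assumptions.

```python
def unmapped_author_emails(commit_emails, author_map):
    """Return commit author emails that have no entry in author_map, sorted.

    author_map is assumed (hypothetically) to be a dict keyed by email.
    """
    known = set(author_map)
    return sorted(set(commit_emails) - known)


if __name__ == "__main__":
    # Hypothetical data; these are not the project's real entries.
    author_map = {"dev@example.invalid": "dev"}
    emails = ["dev@example.invalid", "bot@example.invalid"]
    print(unmapped_author_emails(emails, author_map))
```

Under these assumptions, a check would fail whenever the returned list is non-empty, and adding the missing email to the map clears it.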

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 04:07:10 +00:00

claude-ceo-assistant

d6fca4f623

fix(local-env): use SIGKILL directly on cleanup, don't wait for zombie group reap

Contributor Attribution Check / check-attribution (pull_request) Failing after 15s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 20s
Nix / nix (ubuntu-latest) (pull_request) Failing after 9m23s
Tests / test (pull_request) Failing after 11m50s
Tests / e2e (pull_request) Successful in 1m4s
Nix / nix (macos-latest) (pull_request) Has been cancelled

Two real bugs in tools/environments/local.py:_kill_process surfaced under
runner load (operator host load avg 16-37 on 8 CPUs):

1. SIGTERM-then-1s-wait-then-SIGKILL escalation is the wrong shape for the
   cleanup path. _kill_process is invoked from base.py:_wait_for_process's
   timeout and KeyboardInterrupt/SystemExit branches, so by the time we're
   here the caller has already given up on graceful shutdown. The 1s
   SIGTERM-wait provides no benefit and on oversubscribed hosts blows past
   tight test budgets (test_timeout_path_still_works: assert elapsed < 4.0s,
   was 5.05s). Send SIGKILL directly: it cannot be caught, blocked, or
   ignored, so there is no grace period to wait out.

2. _wait_for_group_exit polled os.killpg(pgid, 0) for the group to disappear,
   but kill(-pgid, 0) returns success for zombies, because an unreaped
   zombie still counts as a member of its process group. In containers
   whose PID 1 is not a proper reaper (such as tini or dumb-init), orphaned
   zombies linger until container exit. The function ran to its 2.0s
   ceiling waiting for kernel bookkeeping that would never happen. SIGKILL
   has already terminated the processes; their zombie entries are not our
   problem.
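
The zombie behaviour in point 2 can be reproduced with a small probe. This is a sketch, not code from this PR: the function name is made up, and it assumes a platform where os.waitid and os.WNOWAIT are available (Linux, for example).

```python
import os
import subprocess
import sys


def zombie_group_probe() -> tuple[bool, bool]:
    """Return (group_visible_while_zombie, group_visible_after_reap)."""

    def group_visible(pgid: int) -> bool:
        try:
            os.killpg(pgid, 0)  # signal 0: existence/permission check only
        except ProcessLookupError:
            return False
        return True

    # start_new_session=True makes the child a session and group leader,
    # so its process group ID equals its PID.
    child = subprocess.Popen([sys.executable, "-c", "pass"], start_new_session=True)

    # Block until the child has exited, but leave it unreaped (a zombie).
    os.waitid(os.P_PID, child.pid, os.WEXITED | os.WNOWAIT)
    while_zombie = group_visible(child.pid)

    child.wait()  # reap the zombie; the group now has no members
    after_reap = group_visible(child.pid)
    return while_zombie, after_reap


if __name__ == "__main__":
    print(zombie_group_probe())
```

On Linux this should print (True, False): the group stays visible to killpg(pgid, 0) until the zombie is reaped, which is why polling for group disappearance can spin when nothing reaps the orphans.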

Replace both with a single os.killpg(SIGKILL) + proc.wait(timeout=0.5) to
reap our direct child. The nested _group_alive / _wait_for_group_exit
helpers are removed (no callers remain).
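
A minimal sketch of the replacement shape described above, assuming the caller already holds the process group ID. The function name, signature, and return value are illustrative and are not the actual _kill_process code.

```python
import os
import signal
import subprocess


def kill_group_now(proc: subprocess.Popen, pgid: int, reap_timeout: float = 0.5) -> bool:
    """SIGKILL the whole process group, then reap only the direct child.

    Returns True if the direct child was reaped within reap_timeout.
    """
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # group already gone; nothing left to signal

    try:
        # Reap our own child only; grandchildren are left to their reaper.
        proc.wait(timeout=reap_timeout)
    except subprocess.TimeoutExpired:
        return False  # do not block cleanup any longer than the budget
    return True
```

The bounded wait keeps the cleanup path within a fixed time budget: the worst case is reap_timeout, with no polling loop on group state.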

Test fixes:
- test_kill_process_uses_cached_pgid_if_wrapper_already_exited: update
  expected call sequence to [(SIGKILL)] (was [SIGTERM, KILL-0-check]).
- _pgid_still_alive helper: use ps STAT field to skip zombies, instead of
  os.killpg(pgid, 0). Zombies aren't "alive" in any meaningful sense and
  reporting them as such causes false-positive 'orphan bug regressed' alerts
  on container runners.
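
The zombie-skipping check in the second test fix could look roughly like the sketch below. The exact ps invocation used by the real _pgid_still_alive helper is not shown in this PR text, so the flags, function names, and parsing are assumptions; separate -o options are used because combined "name=,name=" forms behave differently across ps implementations.

```python
import subprocess


def group_has_live_member(ps_output: str, pgid: int) -> bool:
    """True if any non-zombie process listed in ps_output belongs to pgid.

    ps_output is expected to hold one "PGID STAT" pair per line.
    """
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        if int(fields[0]) == pgid and not fields[1].startswith("Z"):
            return True
    return False


def pgid_still_alive(pgid: int) -> bool:
    # "-A" selects all processes; "name=" suppresses that column's header.
    result = subprocess.run(
        ["ps", "-A", "-o", "pgid=", "-o", "stat="],
        capture_output=True,
        text=True,
        check=False,
    )
    return group_has_live_member(result.stdout, pgid)
```

Keeping the parsing in a pure function makes the zombie rule testable without spawning processes: a group whose only members have a STAT starting with "Z" is reported as not alive.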

Verified: tests/tools/test_local_interrupt_cleanup.py +
tests/tools/test_local_background_child_hang.py — all 10 tests pass 3/3
runs on the act_runner image (load avg 16+).

Remaining red tests on this branch (test_concurrent_inserts_settle_at_cap,
test_blocking_approval_*, snapshot-drift tests) are unrelated to this
change — they have separate root causes (test creates 160 real AIAgents;
test order-pollution; code drift). Tracking separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 02:33:43 +00:00
