test+fix: snapshot-drift batch + pending_title regression #4

Open
claude-ceo-assistant wants to merge 10 commits from fix/snapshot-drift-batch into main

What this fixes

8 deterministic test failures, plus one production regression that those tests should have been catching.

Each commit is one concern (per SOP), each verified against the act_runner image on the operator host.

Pure snapshot-drift (test follows production)

Commit Test(s) Drift
447801c test_send_available_commands_update acp _ADVERTISED_COMMANDS added steer + queue
7672bd9 test_os_environ_still_wins_over_dotenv #18254 inverted the precedence to .env wins; obsolete test renamed + inverted
ed09572 test_*_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout (×2) hard-coded TimeoutStopSec=90 was tied to old restart_drain_timeout=60; default bumped to 180 → unit emits 210. Compute expected from SSOT.
74d5e5a test_dockerfile_installs_tui_dependencies, test_dockerfile_materializes_local_tui_ink_package a49f4c6 retired the --prefix node_modules/@hermes/ink materialization dance for npm_config_install_links=false. New invariants asserted.
5f179d6 test_concurrent_interrupt_* (×2) _tool_guardrails + 3 helper methods added to AIAgent; stub class drift. _invoke_tool gained messages=, pre_tool_block_checked= kwargs; test fakes rejected them.
d70d6b1 test_update_restarts_profile_manual_gateways, test_update_profile_manual_gateway_falls_back_to_sigterm, test_update_kills_manual_pid_but_not_service_pid post-restart survivor sweep (#17648) sends an extra SIGKILL when find_gateway_pids continues to report the PID alive. Mocks were static; updated to model successful primary kill.
f88e022 test_send_typing (teams) adapter imports TypingActivityInput at module load; binding stays None when microsoft_teams isn't installed (CI runner). Test fixture's sys.modules mock can't fix a binding already captured. Patch the local binding directly.

Production regression caught + fixed (was being silently dropped by tests)

Commit Symptom Root cause
c04e05f test_session_create_drops_pending_title_on_valueerror c5b4c48 (#18370 "lazy session creation") moved pending_title application from eager to post-message-complete, but collapsed the original except ValueError: drop + except Exception: keep into a single except Exception: pass. Result: a duplicate-title pending sticks around forever, hitting set_session_title with the same losing argument every message. Auto-title can't kick in because pending_title still shadows it.

The c04e05f fix extracts a documented _apply_pending_session_title helper in tui_gateway/server.py with the three-branch semantics (success → clear, ValueError → drop, other Exception → retain), and replaces the previous test with four focused tests on the helper.

Verification

Reproduced on the same act_runner image used in CI on the operator host. Each fix verified individually; targeted test files all pass 1–3 stress runs (depending on test cost). test_tui_gateway_server.py full file: 163/163 passing after the helper extraction.

Full-suite verification under heavy host load (avg 16-37 on 8 CPUs) is in progress — will summarise in a comment when it completes.

Out of scope (separate PRs / issues)

  • D-Bus-dependent systemd tests (5×) + voice-mode env-detection tests (4×) — these need either a runner-image change (install dbus-user-session, route a session bus into the container) or skip-with-justification. Tracking separately so this PR stays scope-clean.
  • test_concurrent_inserts_settle_at_cap — passes isolated, fails under parallel xdist load because it creates 160 real AIAgent instances. Refactor to a stub agent is a bigger change.
  • test_blocking_approval_* — passes alone, fails when scheduled by xdist alongside concurrent_inserts. Order/load contamination, follow-up.

Security/versioning notes (per SOP)

No security-relevant changes: all touched code paths are pre-existing internal flows; no new untrusted input handling, no new auth/permissions, no new logging of sensitive data.

No backwards-compat impact: _apply_pending_session_title is a new private helper (leading underscore) called from one site in the same module; no public API surface changes.

## What this fixes 8 deterministic test failures, plus one production regression that those tests should have been catching. Each commit is one concern (per SOP), each verified against the act_runner image on the operator host. ### Pure snapshot-drift (test follows production) | Commit | Test(s) | Drift | |---|---|---| | 447801c | `test_send_available_commands_update` | acp `_ADVERTISED_COMMANDS` added `steer` + `queue` | | 7672bd9 | `test_os_environ_still_wins_over_dotenv` | #18254 inverted the precedence to .env wins; obsolete test renamed + inverted | | ed09572 | `test_*_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout` (×2) | hard-coded `TimeoutStopSec=90` was tied to old `restart_drain_timeout=60`; default bumped to 180 → unit emits 210. Compute expected from SSOT. | | 74d5e5a | `test_dockerfile_installs_tui_dependencies`, `test_dockerfile_materializes_local_tui_ink_package` | a49f4c6 retired the `--prefix node_modules/@hermes/ink` materialization dance for `npm_config_install_links=false`. New invariants asserted. | | 5f179d6 | `test_concurrent_interrupt_*` (×2) | `_tool_guardrails` + 3 helper methods added to AIAgent; stub class drift. `_invoke_tool` gained `messages=`, `pre_tool_block_checked=` kwargs; test fakes rejected them. | | d70d6b1 | `test_update_restarts_profile_manual_gateways`, `test_update_profile_manual_gateway_falls_back_to_sigterm`, `test_update_kills_manual_pid_but_not_service_pid` | post-restart survivor sweep (#17648) sends an extra SIGKILL when find_gateway_pids continues to report the PID alive. Mocks were static; updated to model successful primary kill. | | f88e022 | `test_send_typing` (teams) | adapter imports `TypingActivityInput` at module load; binding stays `None` when microsoft_teams isn't installed (CI runner). Test fixture's sys.modules mock can't fix a binding already captured. Patch the local binding directly. | ### Production regression caught + fixed (was being silently dropped by tests) | Commit | Symptom | Root cause | |---|---|---| | c04e05f | `test_session_create_drops_pending_title_on_valueerror` | c5b4c48 (#18370 "lazy session creation") moved pending_title application from eager to post-message-complete, but collapsed the original `except ValueError: drop` + `except Exception: keep` into a single `except Exception: pass`. Result: a duplicate-title pending sticks around forever, hitting `set_session_title` with the same losing argument every message. Auto-title can't kick in because pending_title still shadows it. | The c04e05f fix extracts a documented `_apply_pending_session_title` helper in `tui_gateway/server.py` with the three-branch semantics (success → clear, ValueError → drop, other Exception → retain), and replaces the previous test with four focused tests on the helper. ## Verification Reproduced on the same act_runner image used in CI on the operator host. Each fix verified individually; targeted test files all pass 1–3 stress runs (depending on test cost). `test_tui_gateway_server.py` full file: 163/163 passing after the helper extraction. Full-suite verification under heavy host load (avg 16-37 on 8 CPUs) is in progress — will summarise in a comment when it completes. ## Out of scope (separate PRs / issues) - D-Bus-dependent systemd tests (5×) + voice-mode env-detection tests (4×) — these need either a runner-image change (install `dbus-user-session`, route a session bus into the container) or skip-with-justification. Tracking separately so this PR stays scope-clean. - `test_concurrent_inserts_settle_at_cap` — passes isolated, fails under parallel xdist load because it creates 160 real `AIAgent` instances. Refactor to a stub agent is a bigger change. - `test_blocking_approval_*` — passes alone, fails when scheduled by xdist alongside concurrent_inserts. Order/load contamination, follow-up. ## Security/versioning notes (per SOP) No security-relevant changes: all touched code paths are pre-existing internal flows; no new untrusted input handling, no new auth/permissions, no new logging of sensitive data. No backwards-compat impact: `_apply_pending_session_title` is a new private helper (leading underscore) called from one site in the same module; no public API surface changes.
claude-ceo-assistant added 8 commits 2026-05-08 03:54:14 +00:00
acp_adapter/server.py:_ADVERTISED_COMMANDS now includes "steer" (inject
guidance into active turn) and "queue" (run prompt after current turn
finishes) between "compact" and "version". Production code is the source
of truth; this test was the last reader still on the pre-feature snapshot.

The substantive features were added in PRs that introduced steer/queue
themselves; this is purely test-snapshot follow-through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stale invariant "os.environ wins over .env" was deliberately inverted
in 2ef1ad2 ("fix: prefer ~/.hermes/.env over os.environ when seeding
credential pool"). The fix targets the case where a parent shell (Codex
CLI, harness scripts) exports a stale OPENROUTER_API_KEY, the user updates
~/.hermes/.env with a fresh value, and Hermes silently 401s because
auth.json cached the stale env-var.

Rename + invert this test to assert the new invariant (.env wins). The
positive load_pool coverage already exists in
tests/agent/test_credential_pool.py::test_load_pool_prefers_dotenv_over_stale_os_environ
(added in 0a6865b alongside the fix); this case still serves a purpose
because it exercises _seed_from_env directly, which is a separate code
path from load_pool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The literal "TimeoutStopSec=90" assertion was correct when default
restart_drain_timeout was 60s (max(60,60)+30=90). The default was bumped
to 180s in hermes_cli/config.py, making the unit emit TimeoutStopSec=210
— still satisfying issue #8202 ("timeout must exceed drain so systemd
doesn't SIGKILL pre-cleanup") but breaking the literal assertion.

Refactor: compute the expected minimum from
gateway.restart.DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT and assert >=. The
test now expresses the invariant directly and survives future drain
bumps. Mutation-tested mentally: deleting the +30 headroom in
hermes_cli/gateway.py:1636 still fails this test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two contract tests guarded properties of the now-retired @hermes/ink
materialization dance (--prefix node_modules/@hermes/ink + omit=dev +
nested react cleanup + await-import smoke). The Dockerfile was
restructured in a49f4c6 ("fix: prevent tui rebuilding assets") to
instead:

1. Copy the full ui-tui/packages/hermes-ink/ tree (not just manifests)
2. Set ENV npm_config_install_links=false to force symlink resolution
   despite Debian's bundled npm 9.x defaulting to install-links=true

The new mechanism solves the same workspace-resolution problem with
fewer build steps (no rebuild on container start). Update the tests to
guard the new invariants, both designed to catch a regression that
would break @hermes/ink at runtime:

- test_dockerfile_installs_tui_dependencies now asserts the full
  packages/hermes-ink/ COPY (catches manifest-only revert)
- new test_dockerfile_forces_npm_install_links_false_for_workspace_resolution
  asserts the env var (catches accidental removal that re-introduces
  the npm 9 install-as-copy bug)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stub in tests/run_agent/test_concurrent_interrupt.py mirrors a subset
of AIAgent attributes/methods deliberately ("Avoid full AIAgent init —
just import the class and build a stub"). Two recent additions to the
real AIAgent broke it:

1. Tool-call guardrails: AIAgent gained _tool_guardrails (line 1160) +
   _append_guardrail_observation / _guardrail_block_result /
   _set_tool_guardrail_halt that the concurrent-execution path now
   invokes during result collection. Add a MagicMock guardrail
   controller whose decision objects mirror the
   ToolGuardrailDecision shape (action="allow", allows_execution=True,
   should_halt=False) — the bound methods read sensible defaults
   instead of truthy MagicMock children. Bind the three methods the
   same way as other AIAgent methods.

2. _invoke_tool gained kwargs: messages= and pre_tool_block_checked=
   are now forwarded by the concurrent path. The two test fakes
   (slow_tool, polling_tool) didn't accept them and raised TypeError
   on every call. Add **kwargs so future kwargs additions don't break
   these fakes.

These are pure stub-drift fixes — production behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmd_update gained a post-restart survivor sweep in hermes_cli/main.py
(around line 7568, for issue #17648): after the primary kill attempts,
sleep 3s, re-poll find_gateway_pids, and SIGKILL any PIDs from
killed_pids that are still alive. This catches gateways that ignore
SIGTERM (stuck drain, blocked I/O, zombies).

Three tests patched find_gateway_pids with a static return_value
(or a fake that ignored call count), so the sweep always saw the
killed PIDs as still alive and fired an extra SIGKILL — contaminating
the kill-call assertions:

- test_update_restarts_profile_manual_gateways saw kill called 1×
  (SIGKILL from sweep) when expecting 0
- test_update_profile_manual_gateway_falls_back_to_sigterm saw kill
  called 2× (SIGTERM + SIGKILL from sweep) when expecting 1
- test_update_kills_manual_pid_but_not_service_pid saw 2 kills on the
  manual PID instead of 1

Real behaviour: after SIGTERM lands, the process exits and the next
find_gateway_pids() returns []. The static mocks didn't model that.
Make each find_gateway_pids mock call-count-aware: first call returns
the live PID(s), subsequent calls return empty / service-only —
modelling a successful primary kill. The survivor sweep then sees
nothing to escalate, so the original assertions hold.

Tests that want to exercise the sweep path can opt in by keeping the
PID alive across calls; none of the existing tests need that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The teams adapter imports TypingActivityInput at module load time:

    try:
        from microsoft_teams.api.activities.typing import TypingActivityInput
    except ImportError:
        TypingActivityInput = None

When the real microsoft_teams package isn't installed (CI runner image
doesn't bundle Microsoft Teams SDK), the import fails and the local
binding stays None — even though the test file's _ensure_teams_mock
fixture registers a MockTypingActivityInput in sys.modules. The
test-time mock-in-sys.modules trick only fixes future imports; a binding
captured before the mock was registered remains stale.

send_typing() calls TypingActivityInput() and the resulting TypeError
('NoneType' object is not callable) is swallowed by `except Exception: pass`,
so self._app.send is never reached and the test's assert_awaited_once
fails with "Awaited 0 times" — invisibly, because the swallowed error
hid the real cause.

Fix: monkey-patch the adapter module's local TypingActivityInput binding
in test_send_typing only — narrowest possible patch since no other test
exercises send_typing. Document the import-time-vs-mock-time gap inline
so a future reader doesn't fall into the same trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(tui_gateway): drop pending_title on ValueError, retain on transient errors
Some checks failed
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Failing after 9s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 9s
Tests / e2e (pull_request) Successful in 1m10s
Nix / nix (ubuntu-latest) (pull_request) Failing after 5m24s
Tests / test (pull_request) Failing after 7m44s
c04e05f05c
Production bug + missing test coverage. c5b4c48 ("fix: lazy session
creation — defer DB row until first message (#18370)") moved
pending_title application from the eager _start_agent_build path to a
post-message-complete handler. The original block had:

    except ValueError as e:
        current["pending_title"] = None
        logger.info("Dropping pending title for session %s: %s", sid, e)
    except Exception:
        logger.warning("Failed to apply pending title ...", exc_info=True)

…differentiating "title is invalid / duplicate, retrying won't help"
(ValueError, drop) from "transient DB failure, retry on next message"
(other Exception, keep + log).

The replacement block collapsed both into:

    except Exception:
        pass  # Best effort — auto-title will handle it below

…so a duplicate-title session keeps the same dud pending_title forever,
hitting set_session_title with the same losing argument on every
message-complete. Auto-title can't kick in because pending_title still
shadows it.

Fix: extract a documented _apply_pending_session_title helper that
restores the three-branch semantics (success → clear, ValueError →
drop, other Exception → retain). Call it from the
message-complete handler instead of the inline try/except.

Test rewrite: the previous test_session_create_drops_pending_title_on_valueerror
exercised an obsolete code path (eager apply during session.create) that
no longer existed after c5b4c48. Replace with four focused tests against
the helper:

  - drops_on_valueerror — invariant from the original test name
  - clears_on_success — happy path
  - retains_on_transient_exception — guards the new "don't lose title
    on a flaky DB" behaviour
  - no_op_without_pending — most calls hit this path

Mutation-tested mentally: deleting the `session["pending_title"] = None`
in the ValueError branch fails drops_on_valueerror; deleting the same in
the success branch fails clears_on_success; widening except ValueError
to except Exception fails retains_on_transient_exception.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-ceo-assistant added 1 commit 2026-05-08 04:03:05 +00:00
test(env-isolation): mock is_container + _preflight_user_systemd in tests asserting non-container behaviour
Some checks failed
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 11s
Tests / e2e (pull_request) Successful in 33s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Failing after 11s
Tests / test (pull_request) Failing after 7m59s
Nix / nix (ubuntu-latest) (pull_request) Failing after 8m14s
c7ec523a21
Production code grew two environment-detection chains that test mocks
didn't anticipate, causing 10 tests to fail on container CI runners
without testing what their assertions claimed:

1. supports_systemd_services() routes container hosts through
   _container_systemd_operational(), which returns False inside the
   act_runner Docker. Two tests
   (test_native_linux, test_supports_systemd_services_returns_true_when_systemctl_present)
   mocked is_linux/is_termux/is_wsl but not is_container, so they hit
   the container branch and got False instead of True. Add the missing
   monkeypatch.

2. systemd_start / systemd_restart now call _preflight_user_systemd()
   which probes the user D-Bus socket and raises
   UserSystemdUnavailableError on its absence (common in containers and
   fresh SSH sessions without linger). Four tests
   (test_systemd_start_refreshes_outdated_unit,
    test_systemd_restart_refreshes_outdated_unit,
    test_systemd_restart_self_requests_graceful_restart_and_waits,
    test_systemd_restart_recovers_failed_planned_restart)
   exercise the unit-refresh / self-restart routing logic and shouldn't
   care about D-Bus availability. Mock _preflight_user_systemd as a
   no-op.

3. detect_audio_environment() routes container hosts through
   is_container() and emits a hard-fail "no audio devices" warning.
   Four tests in TestDetectAudioEnvironment assert per-environment
   detection (clean Linux, WSL with PulseAudio, Termux) and expect
   `available is True`; the container check overrode them. Add a
   class-level autouse fixture that mocks is_container=False so the
   per-environment logic runs unobstructed.

These are deliberate "isolate the unit under test from environmental
concerns" patches — production code is not changed. Tests that want
to exercise the container/D-Bus branches can opt in by overriding the
mocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-ceo-assistant added 1 commit 2026-05-08 04:06:55 +00:00
chore(release): map claude-ceo-assistant email for AUTHOR_MAP
Some checks failed
Tests / e2e (pull_request) Failing after 1m56s
Tests / test (pull_request) Failing after 1m55s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Successful in 25s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 28s
Nix / nix (ubuntu-latest) (pull_request) Failing after 6m41s
4d71bcb47d
The contributor-check.yml workflow requires every commit author email
to have an entry in scripts/release.py:AUTHOR_MAP. claude-ceo-assistant
is the Gitea-only bot identity used by Claude-Code-driven PRs in the
molecule-ai fork (introduced post-2026-05-06 GitHub suspension; no
upstream/GitHub equivalent). Register it so PRs from that identity pass
the attribution check.

Pattern matches recent same-shape commits: 73bcd83, 50f9f38, 9c626ef.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some checks failed
Tests / e2e (pull_request) Failing after 1m56s
Tests / test (pull_request) Failing after 1m55s
Nix / nix (macos-latest) (pull_request) Waiting to run
Contributor Attribution Check / check-attribution (pull_request) Successful in 25s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 28s
Nix / nix (ubuntu-latest) (pull_request) Failing after 6m41s
This pull request has changes conflicting with the target branch.
  • tests/hermes_cli/test_gateway_service.py
  • tests/hermes_cli/test_gateway_wsl.py
  • tests/run_agent/test_concurrent_interrupt.py
  • tests/tools/test_dockerfile_pid1_reaping.py
  • tests/tools/test_voice_mode.py

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin fix/snapshot-drift-batch:fix/snapshot-drift-batch
git checkout fix/snapshot-drift-batch
Sign in to join this conversation.
No reviewers
No Label
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: molecule-ai/hermes-agent#4
No description provided.