RCA: hermes-agent PR-only Tests/test failures under xdist runner load #36

Open
opened 2026-05-25 23:35:36 +00:00 by agent-researcher · 2 comments
Member

MECHANISM: The `pull_request` failures are xdist/load flakes in the full unit suite. They are not a hermes#35 regression, and they are not caused by the `enable-cache: false` workflow setting. `.gitea/workflows/tests.yml:97-101` runs `python -m pytest ... -n auto`; `pyproject.toml:151` also sets `-n auto`; and `tests/conftest.py:483-489` arms a hard 30s per-test timeout. As a result, slow SQLite commits, slow FTS schema creation, and thread-notification races become hard failures under PR runner load, even when the main push run and reruns succeed.

EVIDENCE: hermes#35 head `3b2f9861d476226b7b84151cbb2c2bb966ad1909` currently has `Tests / test (pull_request)` success in run 520 after 10m4s, with log summary `19112 passed, 63 skipped`. The stale failing PR #3 run 515 on head `374d388a753c3d58c583d7becf3b7340b427138c` failed after 12m5s with `3 failed, 19107 passed`. The failures include `TimeoutError: Test exceeded 30 second timeout` from `hermes_state.py:230` / `tests/conftest.py:489`, `messages_fts_trigram` initialization at `hermes_state.py:507-509`, the approval race at `tests/gateway/test_approve_deny_commands.py:405-410`, and the stale-cache timing assertion at `tests/plugins/test_achievements_plugin.py:247-252`. Main HEAD `b52ae4dad0832f3905095ec201e054d467a98111` has `Tests / test (push)` success in run 470 after 43s, showing the repo/workflow can pass outside the loaded PR execution shape.

RECOMMENDED FIX SHAPE: Treat this as hermes-agent test isolation/CI-shape debt, not as a code regression in hermes#35. Responsible files are `tests/conftest.py`, `pyproject.toml`, `.gitea/workflows/tests.yml`, and the specific flaky tests named above. Remove the duplicate/implicit xdist amplification, make the 30s timeout opt-in or substantially higher for SQLite-heavy tests, isolate gateway approval globals from parallel workers, and freeze/mock time in the achievements stale-cache test. Re-run stale PR heads after those guardrails are in place; hermes#35 itself is green and should be considered a flake victim rather than the source.
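To make the "opt-in or substantially higher" timeout idea concrete, here is a minimal sketch of a per-test budget combined with a SIGALRM-style guard. It is not the code in `tests/conftest.py`: the marker names `slow_io` and `no_timeout`, the 120s value, and the helpers `budget_for` and `time_limit` are all assumptions chosen for illustration. The guard relies on `signal.setitimer`, so it only works on Unix and only in the main thread.

```python
import signal
from contextlib import contextmanager

DEFAULT_TIMEOUT_S = 30.0   # the per-test budget named in this RCA
SLOW_IO_TIMEOUT_S = 120.0  # assumed longer budget for SQLite/FTS-heavy tests


def budget_for(marker_names, default=DEFAULT_TIMEOUT_S, slow=SLOW_IO_TIMEOUT_S):
    """Pick a timeout from a test's marker names (marker names are hypothetical).

    Returns None when the test opts out of the timeout entirely.
    """
    if "no_timeout" in marker_names:
        return None
    if "slow_io" in marker_names:
        return slow
    return default


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the body runs longer than `seconds` (Unix, main thread)."""
    if seconds is None:
        yield
        return

    def _on_alarm(signum, frame):
        raise TimeoutError(f"Test exceeded {seconds:g} second timeout")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always disarm the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

In a pytest setup, an autouse fixture could plausibly compute the marker set with `{m.name for m in request.node.iter_markers()}` and wrap the test body in `time_limit(budget_for(...))`, so that only tests without a marker keep the 30s limit.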
Author
Member

MECHANISM: Follow-up on the event-context hypothesis: the current `.gitea/workflows/tests.yml` does not define separate push vs pull_request job bodies, env, permissions, runner labels, or setup-uv options. Both events enter the same `test` job at `.gitea/workflows/tests.yml:18-101` with `runs-on: ubuntu-latest`, `permissions: contents: read`, `enable-cache: false`, and the same pytest command, so I do not see workflow-file divergence or token-scope-specific behavior as the primary cause.

EVIDENCE: The PR logs show actions/checkout is checking out the PR head ref, not a synthetic `refs/pull/N/merge`: PR #3 fetched `374d388a...:refs/remotes/pull/3/head` and checked out `refs/remotes/pull/3/head`, while PR #35 fetched `3b2f9861...:refs/remotes/pull/35/head` and checked out `refs/remotes/pull/35/head`. Current statuses are also asymmetric by run freshness: #35 `Tests / test` is successful in 10m4s and #34 is successful in 16s, while #3/#1 still carry stale failing statuses from older heads/runs.

RECOMMENDED FIX SHAPE: Keep the RCA direction on test/runner-load hardening rather than PR event permissions. If engineers want to eliminate the remaining ambiguity, add a temporary diagnostic step to `molecule-ai/hermes-agent:.gitea/workflows/tests.yml` that prints `github.event_name`, `github.ref`, `github.sha`, `github.event.pull_request.head.sha`, runner name, CPU count, and pytest worker count before the test step; the responsible stabilization files remain `tests/conftest.py`, `pyproject.toml`, `.gitea/workflows/tests.yml`, and the named flaky tests.
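The temporary diagnostic step could be a small script along these lines. This is a sketch under assumptions: it presumes the runner exports the GitHub-compatible variables `GITHUB_EVENT_NAME`, `GITHUB_REF`, `GITHUB_SHA`, `GITHUB_EVENT_PATH`, and `RUNNER_NAME` (worth confirming on the act_runner in use), and the function name `collect_ci_context` is made up for illustration. It reports the CPU count only; the worker count that `-n auto` actually picks is best read from the pytest-xdist output itself.

```python
import json
import os


def collect_ci_context(env=None):
    """Gather event/runner facts for a one-off CI diagnostic printout."""
    env = os.environ if env is None else env

    # The PR head SHA lives in the event payload, if the runner provides one.
    pr_head_sha = None
    event_path = env.get("GITHUB_EVENT_PATH")
    if event_path and os.path.isfile(event_path):
        with open(event_path, encoding="utf-8") as fh:
            payload = json.load(fh)
        pull_request = payload.get("pull_request") or {}
        pr_head_sha = (pull_request.get("head") or {}).get("sha")

    return {
        "event_name": env.get("GITHUB_EVENT_NAME"),
        "ref": env.get("GITHUB_REF"),
        "sha": env.get("GITHUB_SHA"),
        "pr_head_sha": pr_head_sha,
        "runner_name": env.get("RUNNER_NAME"),
        "cpu_count": os.cpu_count(),
    }


if __name__ == "__main__":
    for key, value in collect_ci_context().items():
        print(f"{key}={value}")
```

Running it as a step just before the test step on both a push and a pull_request run would give directly comparable output for the two events.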
Author
Member

MECHANISM: PR #3 is failing because the `pull_request` path runs the full xdist unit suite on the PR head, while the cited main `push` success did not run pytest at all. In run 470 the workflow took `.gitea/workflows/tests.yml:58-60` and printed `No Python/package/test changes`, so it is not evidence that the same workload passes on push. The failing workload is `.gitea/workflows/tests.yml:97-101` (`python -m pytest ... -n auto`) plus the global 30s SIGALRM in `tests/conftest.py:524-535`, which turns loaded xdist stalls/races into hard failures.

EVIDENCE: PR #3 run 515 checked out `refs/remotes/pull/3/head` at `374d388a`, then failed after 9m52s with `3 failed, 19107 passed, 63 skipped, 2 errors`. Specific failures:

- `tests/cli/test_cli_new_session.py:122-137` timed out inside `SessionDB.create_session` / `hermes_state.py:208-230`.
- `tests/agent/test_insights.py:43-50` timed out appending SQLite messages.
- `tests/gateway/test_session.py:532-540` timed out creating `messages_fts_trigram` via `hermes_state.py:499-511`.
- `tests/gateway/test_approve_deny_commands.py:405-410` missed its 2.5s notify window after `check_all_command_guards` performs tirith resolution/scanning in `tools/approval.py:962-1073`.
- `tests/plugins/test_achievements_plugin.py:231-252` races a stale timestamp against a background refresh path in `plugins/hermes-achievements/dashboard/plugin_api.py:920-975`.

PR #35 is the better comparator: it did run the full pytest command and passed `19112 passed` in run 520.

RECOMMENDED FIX SHAPE: Do not classify this as pull_request token/permission drift. Responsible surfaces are `molecule-ai/hermes-agent:.gitea/workflows/tests.yml`, `pyproject.toml`, `tests/conftest.py`, `hermes_state.py`, `tests/gateway/test_approve_deny_commands.py`, and `tests/plugins/test_achievements_plugin.py`. Make the push/PR comparison honest by reporting whether pytest was skipped; cap/declare xdist worker count for the 4GB act_runner shape; replace the global 30s SIGALRM with per-test marks or longer budgets for SQLite/FTS-heavy tests; pre-disable/mock tirith in approval tests; and freeze/mock time or block background refresh in the achievements stale-cache test.
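For the "freeze/mock time" recommendation, one common pattern is to inject the clock so the test controls staleness instead of racing the wall clock. The sketch below is generic and does not mirror the achievements plugin's real API: `SnapshotCache`, `FakeClock`, and the 60s TTL are hypothetical names and values used only to show the shape.

```python
import time


class SnapshotCache:
    """Tiny cache whose staleness check reads time from an injectable clock."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._stored_at = None

    def put(self, value):
        self._value = value
        self._stored_at = self._clock()

    def is_stale(self):
        # An empty cache counts as stale; otherwise compare age with the TTL.
        if self._stored_at is None:
            return True
        return self._clock() - self._stored_at > self._ttl


class FakeClock:
    """Deterministic clock for tests: time moves only when advance() is called."""

    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
cache = SnapshotCache(ttl_seconds=60, clock=clock)
cache.put({"unlocked": 3})
fresh_result = cache.is_stale()   # evaluated at age 0s
clock.advance(61)
stale_result = cache.is_stale()   # evaluated at age 61s
```

Because the fake clock never moves on its own, the assertion no longer depends on runner load; blocking the background refresh path during the test would still be needed separately.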
Reference: molecule-ai/hermes-agent#36